The Tux and the Pachyderm

In the previous blog post we saw how plenty of RAM doesn’t necessarily result in a faster instance.
This time we’ll quickly see how PostgreSQL and Linux relate to each other, in particular when we want to run our RDBMS at scale.
Then we’ll have a look at the Linux virtual memory and how to make it more friendly to PostgreSQL.
Let’s dig in.
PostgreSQL and Linux are not friends
This is indeed a bold statement but it’s also true that PostgreSQL and Linux have the bad habit of stepping on each other’s toes.
An example is 2012, when kernel 3.6 nearly broke PostgreSQL.
A couple of years later, at the 2014 Linux Filesystem, Storage, and Memory Management Summit, the kernel developers and the PostgreSQL developers had an excellent conversation to find common ground on the pain points afflicting PostgreSQL on Linux.
That constructive meeting produced several improvements on both systems that eased the friction points between PostgreSQL and Linux.
So what are the things that need attention?
Shared buffer and NUMA
PostgreSQL manages its IO through a single large shared memory segment, the shared buffer, and expects the segment to be stored in RAM as contiguous pages. However, Linux places the shared buffer pages in RAM depending on the free space available, and NUMA determines the cost of memory access depending on the memory’s position relative to the CPU.
As PostgreSQL is based on multiprocessing, with each client having an associated backend process forked from the main postgres process, NUMA may move a backend from one CPU to another based on the distance between the CPU and the portion of RAM the backend requires.
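A quick way to see whether NUMA is in play, and how the memory is spread across the nodes, is to inspect the topology from the shell. This is a minimal sketch; it assumes the numactl package is installed and that the server binary is named postgres.
# Show the NUMA nodes, their CPUs and the free memory per node
numactl --hardware
# Show how the postmaster's memory is placed across the NUMA nodes
cat /proc/$(pgrep -o -x postgres)/numa_maps | head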
Filesystem block size
Another issue, strictly related to the x86 architecture, is the Linux filesystem block size limit.
The data page of 8192 bytes (8 kB) is the de facto standard in a PostgreSQL installation. However, Linux on x86 limits the filesystem block size to 4096 bytes (4 kB). Therefore the PostgreSQL pages are fragmented when stored on the corresponding filesystem layer.
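Both sizes can be checked from the shell. A minimal sketch, assuming the data directory is /var/lib/pgsql/data (adjust the path to your distribution) and that psql can connect locally:
# PostgreSQL page size (compile time default, 8192 bytes)
sudo -u postgres psql -Atc "SHOW block_size;"
# Memory page size and filesystem block size under the data directory
getconf PAGE_SIZE
stat -f -c 'fundamental block size: %S bytes' /var/lib/pgsql/data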
The Linux virtual memory
In this blog post, however, we’ll look at an aspect of the Linux memory management that is not well known and may be the cause of unpredictable behaviour or misunderstanding.
While PostgreSQL is designed to manage its own data pages, the IO operations for accessing the data files containing those pages rely on the operating system.
In a standard Linux distribution the free memory is dynamically used by the kernel for caching the files in order to improve the performance.
In short, and definitely not exhaustively: when PostgreSQL needs a data page from disk, it reads the data file containing the page via a system call. Linux then loads the data file into the kernel cache, from which PostgreSQL retrieves the data page and copies it into the shared buffer.
If the page is modified, the change is flushed to the WAL. If the GUC (Grand Unified Configuration) fsync is set to on, the change is flushed immediately and synchronously to the corresponding WAL file on the persistent storage.
The dirty buffer left behind is written to the corresponding data file later (e.g. during a checkpoint). When this happens the dirty buffer is written into the cached copy of the data file, which becomes a dirty page in the kernel cache. The page is then persisted to disk by Linux when certain conditions are met.
graph TD
  A[PostgreSQL backend modifies data] --> B{"Dirty Page(s) in the shared buffer"};
  B -- "WAL Flush fsync, Synchronous" --> D[Write to Disk];
  B -- "Dirty buffer flushed to kernel cache" --> E{Delayed Write to Disk};
  E -- Eventually --> D;
  D --> F[Data Persisted on Disk];
  style A fill:#fff,stroke:#333,stroke-width:2px
  style B fill:#ffc,stroke:#333,stroke-width:2px
  style D fill:#cff,stroke:#333,stroke-width:2px
  style E fill:#fcc,stroke:#333,stroke-width:2px
  style F fill:#9ef,stroke:#333,stroke-width:2px
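We can watch the dirty pages accumulate and drain directly from /proc/meminfo. A minimal sketch, portable across distributions:
# Dirty kernel cache waiting to be written and pages currently
# in writeback, sampled every second
watch -n 1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"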
The kernel parameters that control the behavior of the delayed flush are vm.dirty_ratio and vm.dirty_background_ratio.
For example, on Rocky Linux 9.5 the default values are 30 and 10 respectively.
root@farnese:~# sysctl vm.dirty_ratio
vm.dirty_ratio = 30
root@farnese:~# sysctl vm.dirty_background_ratio
vm.dirty_background_ratio = 10
The numbers express the percentage of system memory that needs to be filled with dirty pages before the associated flush to disk is triggered.
In this particular case, when 10% of the system memory becomes dirty an asynchronous background flush is initiated. This kind of flush has little impact on the IO operations and can safely work in the background continuously.
However, if the amount of dirty kernel cache reaches 30% of the system memory, a synchronous flush is initiated. This kind of flush halts any IO operation until it is finished. The impact on a running database is that all IO operations stop until the flush completes.
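To get a feeling for the absolute numbers involved, we can compute the two thresholds from the installed memory. A minimal sketch using the default ratios of 10 and 30:
# Compute the dirty cache thresholds in MB from MemTotal
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "background flush starts at $(( mem_kb * 10 / 100 / 1024 )) MB of dirty cache"
echo "synchronous flush starts at $(( mem_kb * 30 / 100 / 1024 )) MB of dirty cache"
On a 64 GB host that is roughly 6.4 GB and 19 GB of dirty data, a considerable amount of IO to drain in a single synchronous flush.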
So, how does this relate to the shared buffer settings? We can take as an example the configuration analysed in the previous blog post.
Assuming we set the shared buffer conservatively to 25% of the RAM, this value is safe for avoiding any synchronous flush caused by dirty pages generated by the shared buffer. However, if the shared buffer exceeds the percentage set in vm.dirty_ratio, the impact on the OS depends greatly on the workload.
If the workload is predominantly read, the shared buffer will use the kernel cache to maximise the performance of data access.
However, if the workload is mostly write, a large amount of dirty buffers may flood the kernel cache during a checkpoint and, when the synchronous flush to disk is triggered, all the IO operations of the instance, including the flush to the WAL, will be halted until the flush ends.
Configuring vm.dirty_ratio to be larger than the percentage of RAM used by the shared buffer is therefore a good idea, as it reduces the risk of a synchronous flush.
However, finding the sweet spot for vm.dirty_ratio and vm.dirty_background_ratio depends on the database workload and the storage subsystem.
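As a starting point, a possible configuration for a host with the shared buffer at 25% of the RAM could look like the following. The values are only an illustration, to be refined against the actual workload and storage; they go in sysctl.conf and are loaded with sysctl -p.
vm.dirty_background_ratio = 5 # start the background flush early, keep it small and continuous
vm.dirty_ratio = 40 # keep the synchronous flush threshold above the shared buffer's 25%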
The Memory Overcommit
Let’s get straight to the point: from the PostgreSQL point of view, the memory overcommit is a bug.
It’s true that the shared buffer is a fixed-size memory segment, but the backends have their own memory that is allocated when the backend becomes active.
Each process in Linux has a special score called oom_score that, from kernel 5 onward, is calculated on the total memory usage. The proc filesystem holds the score for each process. For example, to find the oom_score for PID 15294 it is sufficient to run the following command.
cat /proc/15294/oom_score
666
It’s clear then that the larger the shared buffer, the higher the score, and consequently the higher the risk of having the postgres main process terminated.
The init scripts shipped with packaged PostgreSQL usually set the OOM score adjustment for the main postgres process to -1000, which should make the process immune to termination by the Out Of Memory (OOM) killer. However, the backends forked from the main postgres process normally get a less aggressive adjustment, with the score set to 0. With this score the probability of having a backend terminated is low, but if the pressure on the memory is too high the OOM killer will terminate the processes with the highest score and eventually move on to lower scored processes, until it may reach a PostgreSQL backend.
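We can verify how the postmaster and the backends are scored on a running instance. A minimal sketch, assuming the server binary is named postgres:
# Print pid, oom_score_adj and oom_score for every postgres process
for pid in $(pgrep -x postgres); do
    printf "%s adj=%s score=%s\n" "$pid" \
        "$(cat /proc/$pid/oom_score_adj)" \
        "$(cat /proc/$pid/oom_score)"
done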
When the termination happens, normally we’ll find traces in the system log.
Out of Memory: Killed process 12345 (postgres).
When a postgres backend crashes or is killed there is a general reset to prevent shared buffer corruption. This is the PostgreSQL log when a backend is terminated with signal 9.
[17195] LOG: starting PostgreSQL 17.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.2.0, 64-bit
[17195] LOG: listening on IPv6 address "::1", port 5432
[17195] LOG: listening on IPv4 address "127.0.0.1", port 5432
[17195] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
[17199] LOG: database system was shut down at 2025-04-29 07:45:48 CEST
[17195] LOG: database system is ready to accept connections
[17195] LOG: server process (PID 17210) was terminated by signal 9: Killed
[17195] DETAIL: Failed process was running: select pg_backend_pid();
[17195] LOG: terminating any other active server processes
[17195] LOG: all server processes terminated; reinitializing
[17218] LOG: database system was interrupted; last known up at 2025-04-29 07:45:48 CEST
[17218] LOG: database system was not properly shut down; automatic recovery in progress
[17218] LOG: redo starts at 2/F40000A0
[17218] LOG: invalid record length at 2/F40000D8: expected at least 24, got 0
[17218] LOG: redo done at 2/F40000A0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
[17219] LOG: checkpoint starting: end-of-recovery immediate wait
[17219] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.008 s, sync=0.005 s, total=0.035 s; sync files=2, longest=0.003 s, average=0.003 s; distance=0 kB, estimate=0 kB; lsn=2/F40000D8, redo lsn=2/F40000D8
[17195] LOG: database system is ready to accept connections
It’s true that all the sessions are disconnected to prevent the shared buffer’s corruption, but what happens next is a crash recovery.
This may seem a minor problem, but a crash recovery will roll back any open transaction and, more importantly, it will wipe all the data from any unlogged table present in the instance.
It is therefore good practice to disable the memory overcommit on Linux hosts running PostgreSQL by adding the following lines to the file sysctl.conf and then loading the new settings with sysctl -p.
vm.overcommit_memory = 2 # disable the memory overcommit
vm.overcommit_ratio = 100 # percentage of RAM available to the processes
With the memory overcommit disabled, any memory request that would exceed the RAM plus the eventual swap will be denied by Linux.
This will cause an error on the affected backend but it will not cause any instance crash recovery.
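The new limit can be verified in /proc/meminfo, where CommitLimit is the maximum amount of memory the kernel will promise and Committed_AS is the amount already promised to the running processes:
grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo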
Wrap up
Linux and PostgreSQL have different architectures that may clash when we try to run the system at scale. With some tuning it’s possible to bring them together and avoid unexpected behaviour, with a general improvement in resilience and performance.
Of course there are still more ways to improve the friendship between Tux and the Pachyderm, and we’ll have a look at them in the future.