Size matters

Nowadays it’s not uncommon to deal with machines with hundreds of GB of RAM.

Abundant memory can give PostgreSQL a massive performance boost. However, things work slightly differently than you may expect.

Let’s find out!

The problem with the shared buffer

The shared buffer is a segment allocated by PostgreSQL in the server’s memory for managing the flow of data between the data area and the backends.

Whether we are reading or changing data, the database’s physical pages are stored in the shared buffer.

During normal database activity it is quite common for stored buffers to be evicted to make space for data pages not present in the shared buffer.

In the ancient PostgreSQL 7.4 there was a simple least recently used (LRU) algorithm in place. This simplistic approach had many limitations, in particular because it didn’t consider whether the candidate buffer was used frequently.

With PostgreSQL 8.0.0 the memory manager was replaced by the adaptive replacement cache (ARC). However, because ARC was covered by a patent, it was replaced by the less efficient two-queue (2Q) algorithm in PostgreSQL 8.0.2.

PostgreSQL 8.1 then replaced the inefficient 2Q algorithm with a new implementation called the clock sweep, which is still in use in modern PostgreSQL.

The clock sweep

In the shared buffer there is a “free list” of buffers that are prime candidates for replacement. Buffers that contain no valid page are always in this list.
It would also be possible to add buffers to this list, for example when their pages are unlikely to be needed soon, but the current algorithm never does that.

The normal buffer replacement strategy is therefore managed by the clock sweep via nextVictimBuffer, a counter that moves circularly through the shared buffer, protected by a system-wide spinlock called buffer_strategy_lock.

This spinlock provides mutual exclusion for operations that access the buffer free list or select buffers for replacement. While buffer_strategy_lock is held, no other locks of any sort should be acquired; this allows buffer replacement to happen in multiple backends with reasonable concurrency.

As explained in src/backend/storage/buffer/README, each buffer in memory has a usage counter that is incremented, up to a small limit, every time the buffer is pinned by a backend.
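
To make the mechanics concrete, here is a minimal C sketch of a buffer header. The struct, the field names and the helper are invented for this post; in the real code (see BufferDesc in src/include/storage/buf_internals.h) the whole state is packed into a single atomic word, but the pin and usage-count behavior follows the README.

    /*
     * Illustrative only: the names and the cap below are simplifications,
     * not the real PostgreSQL definitions.
     */
    #define SKETCH_MAX_USAGE_COUNT 5    /* the real cap is BM_MAX_USAGE_COUNT */

    typedef struct SketchBufferDesc
    {
        int pin_count;      /* backends currently using this buffer */
        int usage_count;    /* popularity counter, capped at a small limit */
        int is_dirty;       /* page changed since it was read from disk */
    } SketchBufferDesc;

    /* Pinning a buffer bumps the pin count and, up to the cap, the usage count. */
    static void
    pin_buffer(SketchBufferDesc *buf)
    {
        buf->pin_count++;
        if (buf->usage_count < SKETCH_MAX_USAGE_COUNT)
            buf->usage_count++;
    }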

When searching for a buffer to evict, the clock sweep algorithm does the following:

  1. Obtain buffer_strategy_lock.
  2. If there are buffers in the free list, remove the head buffer and release buffer_strategy_lock. If the selected buffer is pinned or has a nonzero usage count, it cannot be used: ignore it and go back to step 1. Otherwise, pin the buffer and return it.
  3. When the buffer free list is empty, select the buffer pointed to by nextVictimBuffer and circularly advance nextVictimBuffer, ready for the next time. Then release buffer_strategy_lock.
  4. If the selected buffer is pinned or has a nonzero usage count, it cannot be used: decrement its usage count (if nonzero), reacquire buffer_strategy_lock, and return to step 3 to examine the next buffer.
  5. Pin the selected buffer and return it.

Note that if the selected buffer is dirty, it must be written out before it can be recycled. If the buffer is pinned in the meantime, the whole attempt is given up and another buffer is tried.
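
Below is a minimal, single-threaded sketch of that loop, reusing the SketchBufferDesc from above. The free list, the spinlock and the write-out of dirty pages are reduced to comments, so this is an illustration of the algorithm, not the real implementation (which lives in StrategyGetBuffer() in src/backend/storage/buffer/freelist.c).

    /* The "clock hand": in PostgreSQL this is nextVictimBuffer, advanced
     * under the buffer_strategy_lock spinlock. */
    static int next_victim = 0;

    static int
    clock_sweep(SketchBufferDesc *buffers, int nbuffers)
    {
        for (;;)
        {
            SketchBufferDesc *buf = &buffers[next_victim];

            /* circularly advance the hand, ready for the next time */
            next_victim = (next_victim + 1) % nbuffers;

            if (buf->pin_count > 0)
                continue;           /* in use by a backend: not a candidate */

            if (buf->usage_count > 0)
            {
                buf->usage_count--; /* recently used: give it another chance */
                continue;
            }

            if (buf->is_dirty)
            {
                /* the victim must be written out before being recycled;
                 * if another backend pins it meanwhile, the real code
                 * gives up on this buffer and tries another one */
            }

            pin_buffer(buf);        /* found a victim: pin it and return it */
            return (int) (buf - buffers);
        }
    }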

The first thing that appears clear is that the free list is used only for a limited period after the instance starts. When all the buffers hold a valid page, the free list becomes empty and the algorithm never populates it again. With the free list empty, nextVictimBuffer keeps moving circularly through the shared buffer searching for candidates for eviction.

    flowchart TD
        A(Start Clock Sweep) --> B{Is Current Buffer a Candidate?}
        B -- No --> D[Advance Clock Hand]
        D --> B
        B -- Yes --> C{Is Buffer in Use?}
        C -- Yes --> D
        C -- No --> F{Is Buffer Dirty?}
        F -- Yes --> G[Schedule Write to Disk]
        F -- No --> H[Evict Buffer]
        G --> I[Wait for Write Completion]
        I --> H
        H --> J[Replace with New Buffer]
        J --> K(End Clock Sweep)
        style A fill:#fff,stroke:#333,stroke-width:2px
        style B fill:#ccf,stroke:#333,stroke-width:2px
        style C fill:#9cf,stroke:#333,stroke-width:2px
        style D fill:#fcc,stroke:#333,stroke-width:2px
        style F fill:#f9c,stroke:#333,stroke-width:2px
        style G fill:#f6c,stroke:#333,stroke-width:2px
        style H fill:#c9f,stroke:#333,stroke-width:2px
        style I fill:#9ff,stroke:#333,stroke-width:2px
        style J fill:#9fc,stroke:#333,stroke-width:2px
        style K fill:#eee,stroke:#333,stroke-width:2px

As the usage counter is incremented on every buffer pin, it’s also clear that by the time the free list becomes empty all the buffers in the shared buffer have a usage count of at least 1. Therefore at least one entire scan of the shared buffer is necessary before a candidate buffer can be found.
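
A toy run of the sketches above illustrates this: if every buffer in a small pool is unpinned but has usage count 1, the clock hand must decrement all of them on the first pass before the second pass can return a victim (the snippet assumes the two previous sketches are in the same file).

    #include <stdio.h>

    int
    main(void)
    {
        enum { NBUFFERS = 8 };
        SketchBufferDesc pool[NBUFFERS] = {{0}};

        /* simulate a fully packed shared buffer: every page was pinned
         * once when it was loaded, so every usage count is at least 1 */
        for (int i = 0; i < NBUFFERS; i++)
            pool[i].usage_count = 1;

        /* the first full pass only decrements the counters; only when
         * the hand comes back to buffer 0 can a victim be returned */
        printf("victim: buffer %d\n", clock_sweep(pool, NBUFFERS));
        return 0;
    }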

The ring buffer strategy

The normal buffer replacement strategy is overridden under certain conditions that would otherwise cause a complete rewrite of the shared buffer.

Bulk operations such as VACUUM or a large sequential scan use a small ring buffer that does not affect the rest of the shared buffer.

In particular, for large sequential scans a 256 kB ring is used.

For VACUUM the size of the ring is controlled by the vacuum_buffer_usage_limit GUC.

For bulk writes (currently only COPY IN and CREATE TABLE AS SELECT) the ring size is 16 MB, but no more than 1/8th of shared_buffers.
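
Here is a sketch of the idea, with simplified, invented names (the real structure is BufferAccessStrategyData in src/backend/storage/buffer/freelist.c): the bulk operation keeps a small, fixed array of buffer ids and cycles through it, so once the ring fills up the scan recycles its own buffers instead of evicting the rest of the cache.

    #define SKETCH_BLCKSZ    8192    /* default PostgreSQL page size */
    #define SKETCH_RING_SIZE (256 * 1024 / SKETCH_BLCKSZ)   /* 256 kB = 32 buffers */

    typedef struct SketchRing
    {
        int current;                    /* next slot of the ring to reuse */
        int buffers[SKETCH_RING_SIZE];  /* buffer ids owned by this ring */
    } SketchRing;

    /* Return the buffer id of the next ring slot, wrapping around at the
     * end: a sequential scan larger than the ring keeps reusing the same
     * 32 buffers rather than claiming new ones from the shared pool. */
    static int
    ring_next_buffer(SketchRing *ring)
    {
        int buf = ring->buffers[ring->current];

        ring->current = (ring->current + 1) % SKETCH_RING_SIZE;
        return buf;
    }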

In medio stat virtus

The PostgreSQL documentation suggests setting shared_buffers to 25% of the system’s RAM as a starting point, and discourages going above 40%.

The shared buffer set at 25% of RAM works fine when the server’s memory sits between 4 GB and 100 GB, which yields a shared buffer between 1 GB and 25 GB.

However, with less than 4 GB of memory, sizing the shared buffer correctly is not an easy task and depends on many factors, including the virtual memory settings on Linux.

The interesting aspect of shared_buffers sizing appears when there is plenty of RAM.

If for example we consider a system with 400 GB of RAM, a 25% shared buffer would be 100 GB.

The performance of a large shared buffer depends on the underlying data area.

The most favorable scenario is when the data area is smaller than the shared buffer. All the buffers can be cached in the shared buffer using only the free list, so no extra work is required from the clock sweep to manage the memory. The usage counters reach their maximum value and never go down.

However, if the data area doesn’t fit in the shared buffer, the free list eventually becomes empty and the normal buffer replacement strategy kicks in. When a buffer is loaded into memory it is pinned, so its usage counter starts at one. That means that with a fully packed shared buffer the nextVictimBuffer must scan all the buffers at least once before finding a buffer with the usage counter set to zero.

Modern DDR4 memory has a theoretical throughput of 25-30 GB/s, which more realistically ranges between 5 and 10 GB/s. With a 100 GB fully packed shared buffer, a single full scan therefore takes between roughly 3 and 20 seconds.
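
Making the arithmetic behind that range explicit:

    100 GB / 30 GB/s ≈ 3.3 seconds   (theoretical best case)
    100 GB /  5 GB/s = 20 seconds    (realistic worst case)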

Obviously the timing may vary greatly depending on the conditions and the workload.

If, for example, nextVictimBuffer sets a buffer’s usage counter to zero but another backend pins the buffer before it is scanned again, then that buffer can’t be evicted at the next pass, increasing the wait time for a new buffer.

As a rule of thumb, considering 64 GB as the upper limit for shared_buffers before a regression appears seems reasonable.

Wrap up

Sizing PostgreSQL’s shared buffer is not an easy task. It’s very important to understand the way the buffer manager deals with block eviction in order to find the correct setting for the underlying data area.

In any case, unless you are doing local tests, it’s very important to change the default value of shared_buffers set by initdb.

The parameter’s value is conservatively set to a small amount of memory (currently 128 MB), and the fact that PostgreSQL just works even with the default settings is not a good reason to forget about it.