DBA in a nutshell

I’ve already started writing the second volume of the database administration series.
The first volume was quite basic, with enough information to be useful but not so much as to scare people away.

The second volume is strongly focused on database administration, and I decided to start it with a chapter explaining what it means to be a DBA.

This will probably dissuade many people from starting this exciting career.

So, here we go again.

A database administrator is a strange combination of theory and practice, a mix of strict and loose rules. It’s quite difficult to explain what exactly a DBA does. Working with databases requires passion, knowledge and a strange combination of empathy and pragmatism in order to understand what the DBMS is thinking. With commercial products the knowledge is limited to the vendor’s documentation and is improved mostly through personal experience. With a free DBMS like PostgreSQL, the availability of the source code makes acquiring that knowledge far easier. Reading the superb poetry that is the source code, written in C, also creates an intimate relationship with the cold binary programs and the geniuses who created them.

A day in the life of a DBA has no fixed boundaries. It can last one day or several months, for example when long procedures have to run inside well controlled maintenance windows. Every day is a different combination of duties, including routine tasks, monitoring and proactive thinking. The latter is probably the distinctive mark between a good DBA and a cheap professional. The ability to build a mental map of every possible issue, projected into the near future, can make the difference between spending the night sleeping or working frantically before the next day begins. Of course, when an emergency strikes there is a fourth and most important duty: emergency handling.

Routine tasks

Under the group of routine tasks fall the procedures which are well consolidated in the documentation and in the DBA’s mind. For example, configuring PostgreSQL’s memory parameters should be almost second nature. Any task in this category is successful if it remains completely unnoticed. On live systems it is not unusual for routine tasks to be performed at antisocial hours, such as Sunday night or the early morning of a working day.
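Just as a sketch, the memory settings touched most often live in postgresql.conf and look something like the following; the values are illustrative placeholders rather than recommendations and must be sized against the server’s RAM and workload.

    # postgresql.conf - illustrative values only, size them against the available RAM
    shared_buffers = 1GB           # shared buffer cache, often around 25% of RAM
    work_mem = 16MB                # per sort/hash operation, multiplied by concurrent sessions
    maintenance_work_mem = 256MB   # used by VACUUM, CREATE INDEX and other maintenance jobs
    effective_cache_size = 3GB     # planner hint about the OS cache, not an allocation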

Monitoring

A system without monitoring is a one-way ticket to disaster. Whatever solution is used, it should be simple to configure and have a decently low rate of false positives. A nagging monitor is exactly the same as not having one at all: the important alerts will pass completely unnoticed. Alongside general purpose solutions like Nagios there is a new and interesting PostgreSQL dedicated project, OPM (http://opm.io/). Configuring a passive error detector like tail_n_mail is also a very good idea, to trap any possible misbehaviour of the server.
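As an illustration of the kind of check these tools automate, here is a hand-rolled query (a sketch of my own, not taken from any of the tools above) that flags sessions sitting idle in transaction for more than five minutes, a classic source of bloat and locking trouble:

    -- sessions idle in transaction for more than five minutes (PostgreSQL 9.2+ catalog layout)
    SELECT pid,
           datname,
           usename,
           now() - xact_start AS transaction_age
      FROM pg_stat_activity
     WHERE state = 'idle in transaction'
       AND now() - xact_start > interval '5 minutes'
     ORDER BY transaction_age DESC;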

Proactive thinking

Reacting to issues is fine. Preventing them is much better. This kind of task is strictly related to the monitoring. For example, let’s consider a master and a slave configured in continuous recovery. Because the slave is built by copying the data files, we should expect both to have roughly the same size. A (not so) small size difference between the master and the standby server can tell us that the filesystem has some problems or is configured incorrectly on one of the boxes. In this example, assuming the filesystem is XFS, a bloating box can be caused by the dynamic EOF preallocation introduced in the newer releases.
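A rough way to investigate this kind of drift, assuming the standby accepts read only connections (hot standby), is to compare the sizes PostgreSQL itself reports on both boxes with the same query; if these match while df or du disagree, the discrepancy lives in the filesystem rather than in the cluster.

    -- run on both the master and the standby, then compare the output
    SELECT datname,
           pg_size_pretty(pg_database_size(datname)) AS db_size
      FROM pg_database
     ORDER BY pg_database_size(datname) DESC;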

Emergency handling

Shit happens, deal with it. Sooner or later a disaster will strike, requiring all the experience and competence of the database experts to put the system back into a production state. It doesn’t matter whether the issue is caused by a user screwing up the database, a power outage, a hardware failure or a fleet of Vogon spaceships ready to demolish the Earth. Rule number zero in an emergency is: never guess. Guessing is a one-way ticket to disaster, in particular if the action taken can destroy the database.
As an example, let’s consider one of the most common causes of outage in the Linux world: the distribution upgrade. When a Linux distribution is upgraded it is very likely that the PostgreSQL binaries will be upgraded to a different major version, and the data area is not compatible across major releases. The immediate effect when the system is upgraded is that PostgreSQL will refuse to start because of the version mismatch. The fix is quite simple: a dump and reload of the databases in the data area, or alternatively an in-place upgrade with pg_upgrade, will get the cluster back into a production state. A few years ago a desperate user posted a message asking for help on a message board. One “expert” guessed that pg_resetxlog would be the solution. Before reading further, spend two minutes reading the pg_resetxlog documentation (http://www.postgresql.org/docs/9.3/static/app-pgresetxlog.html). In that emergency the “expert’s” guess would have turned a simple version mismatch into a corrupted data area.
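For the record, the supported route with pg_upgrade looks roughly like the sketch below; the version numbers and paths are illustrative, both sets of binaries must be installed, the new data area must already be initialised with initdb, and Debian-style packaging ships its own wrapper (pg_upgradecluster) that handles the split configuration directories.

    # run as the postgres user with both clusters stopped - paths and versions are illustrative
    /usr/lib/postgresql/9.4/bin/pg_upgrade \
        --old-bindir  /usr/lib/postgresql/9.3/bin \
        --new-bindir  /usr/lib/postgresql/9.4/bin \
        --old-datadir /var/lib/postgresql/9.3/main \
        --new-datadir /var/lib/postgresql/9.4/main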