import site.body

Postgres and the tale of the changing defaults

It is not often i have an issue with Postgres, Sure i don't like some of the naming or how it goes about things certain thinks. But the one thing i do love about it is release to release you are generally not going to have problems. Imagine my surprise when a "tested as working" setup in staging failed when deployed to production.

We recently installed Postgres 9.6 side by side Postgres 9.2 on a testing machine to allow us to ensure the codebase worked on newer versions. This is relatively easy to set up with Postgres and Debian and required little more than configuring Postgres to use a diffrent port. THis worked out well as if there ended up being an issue, we where no worse off than a simple uninstall.

Things worked out nicely, The data copied, The code tested and confirmed as working and no real reported issues. With everyone happy we proceeded to upgrade the production stack to the newer Postgres. The only real difference was that we would upgrade production to 9.6 rather than have it run side by side.

This was a straight forward migration. A dump was performed, The database package upgraded and backups restored. People were asked to check and indicated that all seemed fine. After a nice nights sleep and a casual thought to the great uneventful work completed the previous day, We returned to work to discover messages in our inbox with big scare inducing words such as "emergency" and "database" in them. While a suitable tuned spidy sense could have picked up that there could be an problem, The regulation issued sysadmin one sufficed.

As one might imagine, Production was not checked as thoroughly as one would like. In fact all interactive checking had been against pages that would have never hit the database (eg the front login page). While i always wish people would have a comprehensive plan to check the site is working, In practice i have only ever encountered such a thing a handful of times.

After checking the error logs (It mystifies me how even these days, developers are not aware that their error logging goes somewhere that can be retrieved) I discovered numerous authentication issues of the nature:

$FOO could not log into $BAR incorrect peer credentials

one intresting thing about this setup is that the usernames to log in don't match the system usernames. What i was expecting was web$FOO could not log into $BAR. Postgres, Like mysql can be accessed via a unix socket. When using this mode Postgres can "see" the username of the person connecting via socket calls to the kernel. This is great as they are unforgable but don't work as well when your system and db usernames don't line up. In pg_hba.conf this is normally referenced as "peer" authentication (hopefully some of the parts are falling into place now).

Resetting passwords on the off chance they where not migrated correctly and attempting to replicate the problem via the psql cmdline did not help in any way. everything checked out with psql when manually setting up the credentials.

At this point after reviewing pg_hba.conf it became apparent that the Unix socket was being used, Not the MD5 passwords over localhost like i would expect. As the client code should not have changed (an assumption on my part but one i could easily verify as i went) i could only assume that something was not being explicitly set and a default had changed. A very quick search of the Postgres 9.5 and 9.6 change logs did not seem to indicate anything relavant.

In an attempt to sort the issue out a quick grep of the src code was done to look for a "localhost" string to locate the DB settings. This returned zero results and started to confirm my suspicions that this was a default behaviour change. After further refining of the search i managed to locate some oddly named settings that configured how the database was set up. It appears that everything was specified including the port. The only omission was that the host= field to set the path or address of the remote server was not present.

Much to my dismay it appears that the code had no config variable to set the hostname of the remote server in most of the code bases. Luckily someone had hacked that functionality into one of the affected sites and i was able to set the host and confirm this fixed the issue. A simple cut and paste, Then a search replace of Port to Host and the site was back in business and no longer throwing authentication issues.

After the site was up, People rung and emails sent (this part taking significantly longer than diagnosing and fixing the issues), I had a chance to poke the Postgres 9.6 changelog with greater insight as to which entry i may have been affected by. One obvious choice stuck out:

 * Improve libpq's PQhost() function to return useful data for default
   Unix-socket connections (Tom Lane)

 Previously it would return NULL if no explicit host specification had
 been given; now it returns the default socket directory path.

TL;DR: Postgres's client library changed how it handles the default connection string, Causing our code to switch from connecting via IPv4 to connecting via a Unix socket and peer auth. As the user/database are different to the system user, Connection attempts by the code base received an authentication error. The fix was to specify "localhost" explicitly.

After reviewing the change notes it became obvious what had happened, However looking at the wording of the changelog entry it is no surprise that this was missed at my end. I don't have any advice to offer here other than this is an intresting tale from the trenches that others may find entertaining. Note that i was not responsible for the migration, Only the fixing.

If you want to see more of these then let me know via twitter @Blitz_Works

The above is a roughly remembered summary of the situation, Fixing everything when it went bad took approximately 15 minutes form start to finish. Some details omitted to protect the innocent. If pain persists for more than 4 hours, Please see a senior sysadmin.