Monday, August 13, 2012

Streaming Replication, syncing mirrors and PANIC: WAL contains references to invalid pages

I've recently come across a problem with Streaming Replication where on failover the system generates a PANIC: WAL contains references to invalid pages error message.
Aug  9 14:06:22 db01 postgres[11005]: [8-1] user=,db=,host= LOG:  trigger file found: /db/9.1/data/failover
Aug  9 14:06:22 db01 postgres[11012]: [3-1] user=,db=,host= FATAL:  terminating walreceiver process due to administrator command
Aug  9 14:06:23 db01 postgres[11005]: [9-1] user=,db=,host= LOG:  invalid record length at 398/1D0002F0
Aug  9 14:06:23 db01 postgres[11005]: [10-1] user=,db=,host= LOG:  redo done at 398/1D000298
Aug  9 14:06:23 db01 postgres[11005]: [11-1] user=,db=,host= LOG:  last completed transaction was at log time 2012-08-09 09:16:08.497124+00
Aug  9 14:06:23 db01 postgres[11005]: [12-1] user=,db=,host= LOG:  selected new timeline ID: 10
Aug  9 14:06:23 db01 postgres[11005]: [13-1] user=,db=,host= LOG:  archive recovery complete
Aug  9 14:06:23 db01 postgres[11005]: [14-1] user=,db=,host= WARNING:  page 1563020 of relation base/16436/47773 was uninitialized
Aug  9 14:06:23 db01 postgres[11005]: [15-1] user=,db=,host= WARNING:  page 1563023 of relation base/16436/47773 was uninitialized
Aug  9 14:06:23 db01 postgres[11005]: [16-1] user=,db=,host= WARNING:  page 1563021 of relation base/16436/47773 was uninitialized
Aug  9 14:06:23 db01 postgres[11005]: [17-1] user=,db=,host= WARNING:  page 1563022 of relation base/16436/47773 was uninitialized
Aug  9 14:06:23 db01 postgres[11005]: [18-1] user=,db=,host= WARNING:  page 792619 of relation base/16436/47776 was uninitialized
Aug  9 14:06:23 db01 postgres[11005]: [19-1] user=,db=,host= PANIC:  WAL contains references to invalid pages
Aug  9 14:06:49 db01 postgres[11003]: [2-1] user=,db=,host= LOG:  startup process (PID 11005) was terminated by signal 6: Aborted
Aug  9 14:06:49 db01 postgres[11003]: [3-1] user=,db=,host= LOG:  terminating any other active server processes
I believe that this has something to do with the methodology that I use to resync my mirrors after a failover. The environments that I've seen this on have gone through a lot of failover testing. The scenario looks something like:
A[Master] => B[Standby]
...Failover...
B[Master]
A[Down]
rsync A from B
B[Master] => A[Standby]
...Failback...
A[Master]
B[Down]
resync B from A
etc.
The process I use to sync, and resync doesn't change.
#!/bin/ksh

SOURCE=$1

if [ -z "$SOURCE" ]; then
 echo "Usage: $0 <master postgres host>"
 exit 1
fi

. /etc/sysconfig/pgsql/postgresql-9.1

export PGDATA
export PGPORT

rm -f $PGDATA/failover

pg_ctl stop -D $PGDATA -m immediate
psql -h $SOURCE -p $PGPORT <<EOD
checkpoint;
select pg_start_backup('mirror');
EOD
rsync -avv --delete-delay \
  --exclude postgresql.conf  \
  --exclude pg_hba.conf \
  --exclude server.conf \
  --exclude archive.conf \
  --exclude recovery.conf \
  --exclude recovery.done \
  --exclude pg_ident.conf \
  --exclude failover \
         --exclude pg_xlog \
         --exclude postmaster.pid \
   $SOURCE:$PGDATA/ $PGDATA
psql -h $SOURCE -p $PGPORT -c "select pg_stop_backup()"
cp $PGDATA/recovery.done $PGDATA/recovery.conf
pg_ctl start -D $PGDATA 
I don't think anything inherently wrong with this method, so I could be coming across a linux bug or a Postgres bug.

Some investigation with the fine fellows at 2nd Quadrant has shown that the failed pages are all in index relations.

Our next step in debugging the process will be to start the server with DEBUG3 and see if we can get more info. Unfortunately I've blown away that mirror, so I have to wait for it to happen again.

Stay tuned.

Update 10/15/2012
With the release of Postgres 9.1.6 I believe that what we were running into was the critical bug that's fixed in that release regarding WAL replay.
So Mystery Solved!

No comments:

Post a Comment

Web Statistics