Thursday, February 4, 2016

Amazon Maintenance and your RDS Instances

I recently had the pleasure of Amazon telling me that they had to reboot all of my Postgres RDS instances to apply some security patches.

When using RDS you generally expect Amazon to do something like this, and I was at least happy that they told me about it. They also gave me the option to trigger it during a specific maintenance window or on my own time, up to a drop-dead date after which they'd just do it for me.

One thing that you can't really know is what the impact of the operation is going to be. You know it's a downtime, but for how long?

My production instances are, of course, Multi-AZ, but none of my non-production instances are.

Fortunately, both my non-production and my production instances needed to be rebooted, so I could use the non-production ones to test the timing up front.

What I found was that the process takes about 10 to 15 minutes and, in this particular case, was not affected by database size.
It is affected, however, by the number of instances you're rebooting at the same time. Amazon seems to queue the instances up, so some take longer than others.

The pre-reboot security patches took about 5 minutes to load. During this time the database was up.
This was followed by a shutdown / reboot, during which the database was unavailable.
After the reboot, which took less than a minute, the database was immediately available while the system did its post-processing.
After that a backup was performed, which didn't impact the system.

So total downtime was about a minute, but I scheduled 10 minutes just to be safe.
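If you'd rather measure the outage than eyeball it, here is a minimal sketch of one way to do so. It is not what I ran during the maintenance. It assumes that an open TCP port is a good enough stand-in for "the database is up" (strictly, it only shows the listener is accepting connections), and the function names are made up for illustration.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(host, port, duration_s, interval_s=1.0):
    """Poll host:port for duration_s seconds, collecting (timestamp, is_up)."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), port_is_open(host, port)))
        time.sleep(interval_s)
    return samples


def longest_outage(samples):
    """Return the longest gap, in seconds, between the first failed probe of
    an outage and the next successful probe. Samples must be in time order.
    An outage still in progress at the last sample is not counted."""
    longest = 0.0
    down_since = None
    for ts, is_up in samples:
        if not is_up and down_since is None:
            down_since = ts  # outage begins
        elif is_up and down_since is not None:
            longest = max(longest, ts - down_since)  # outage ends
            down_since = None
    return longest
```

Start `probe` against your instance's endpoint and port (5432 by default for Postgres) just before kicking off the reboot, then pass the result to `longest_outage`. The resolution is only as good as the polling interval.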

For the Multi-AZ instances the same process is followed, but the shutdown / reboot is accompanied by an AZ failover, which takes place nearly instantly. This is pretty cool as long as your applications are robust enough to reconnect. (Mine were not, so they required a restart.) I timed the reboot to coincide with a deploy, so no additional downtime was required.
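For what "robust enough to reconnect" might look like, here is a rough sketch of a retry wrapper with exponential backoff. It is an illustration, not the fix I applied to my own applications. The `connect_with_retry` helper and its parameters are hypothetical, and `connect` is whatever zero-argument function opens a connection with your driver.

```python
import time


def connect_with_retry(connect, retry_on=(OSError,), attempts=5,
                       base_delay_s=0.5, sleep=time.sleep):
    """Call connect() until it succeeds or attempts run out.

    Waits base_delay_s, then 2x, 4x, ... between tries. Only exceptions
    listed in retry_on are retried; the last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the real error
            sleep(base_delay_s * (2 ** attempt))
```

With psycopg2, for example, you could pass `lambda: psycopg2.connect(dsn)` as `connect` and `(psycopg2.OperationalError,)` as `retry_on`. Note that this only covers opening a new connection. An application that holds long-lived connections or a pool also has to notice that an existing connection died during the failover and throw it away.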

In the end it was fairly painless. If you don't trust your applications' ability to reconnect, it's good to babysit them. Otherwise, kicking it off during a maintenance window and not worrying about it is certainly doable.
