Thursday, February 4, 2016

Amazon Maintenance and your RDS Instances

I recently had the pleasure of Amazon telling me that they had to reboot all of my Postgres RDS instances to apply some security patches.

When using RDS you generally expect that Amazon will do something like this from time to time, and I was at least happy that they told me about it and gave me the option to trigger it during a specific maintenance window or on my own schedule (up to a drop-dead date when they'd just do it for me).
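
If you'd rather handle that from the command line than the console, the AWS CLI can list and opt into pending maintenance - roughly like this (the ARN below is a placeholder for your own instance's):

aws rds describe-pending-maintenance-actions

# opt in and apply now rather than waiting for the maintenance window
aws rds apply-pending-maintenance-action \
    --resource-identifier arn:aws:rds:us-east-1:123456789012:db:mydb \
    --apply-action system-update \
    --opt-in-type immediate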

One thing you can't really know in advance is what the impact of the operation will be. You know there will be downtime, but for how long?

My production instances are, of course, Multi-AZ, but all of my non-production instances are not.

Fortunately, both my non-production and production instances needed the reboot, so I could do some up-front testing of the timing.

What I found was that the process takes about 10 to 15 minutes and, in this particular case, was not affected by database size.
It is, however, affected by the number of instances you reboot at the same time: Amazon appears to queue the instances up, so some take longer than others.

The pre-reboot security patches took about 5 minutes to apply, and the database stayed up during that time.
This was followed by a shutdown/reboot, during which the database was unavailable.
After the reboot, which took less than a minute, the database was immediately available again while the system did its post-processing.
Finally a backup is performed, which doesn't impact the system.

So total downtime was about a minute, but I scheduled 10 minutes just to be safe.

For the Multi-AZ instances the same process is followed, but the shutdown/reboot is accompanied by an AZ failover, which takes place nearly instantly. This is pretty cool as long as your applications are robust enough to reconnect. (Mine were not, so they required a restart.) I timed the reboot to go with a deploy, so no additional downtime was required.

In the end it was fairly painless. If you don't trust your applications' ability to reconnect, it's good to babysit them; otherwise, kicking it off during a maintenance window and not worrying about it is certainly doable.
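
A low-tech way to babysit an instance through the reboot is to poll it with pg_isready (shipped with Postgres 9.3+); the endpoint below is a placeholder:

while true; do
  pg_isready -h mydb.xxxxxxxxxx.us-east-1.rds.amazonaws.com -p 5432
  sleep 5
done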

Thursday, January 15, 2015

Introducing MrTuner and RDSTune

Two simple gems to help tune your Postgres databases.

I'm a big fan of PgTune. In many cases you can run PgTune and set-it-and-forget-it for your Postgres parameters. I like it so much that I often wish I had access to it in my code - especially when working with Puppet to provision new database servers.

When I started looking into RDS Postgres a while back I realized that the default configuration for those instances was lacking and I really wished I could run PgTune on the RDS instances.

These two projects were born to solve those problems.

  • RDSTune will create a MrTuner-ized RDS Parameter Group 
  • MrTuner is a Ruby gem that follows in the spirit of PgTune, if not directly in its footsteps.

Both will run from the command line but, more importantly, they can be `require`d by your Ruby projects so you can access these values programmatically.

Both gems are available on RubyGems; source, examples, configuration, and docs are available on their respective Bitbucket pages.

RDSTune -
MrTuner -

Feedback and Pull requests very welcome!

Friday, August 29, 2014

Dockerfile Golf (or optimizing the Docker build process)

I was working with a friend of mine on his startup Down For Whatever and he wanted to use Docker.

I created a Docker farm for him, and we're using Centurion for deployments; we signed up for a Docker Hub account to store images and started pushing code.

A few days later he emailed me saying that he wanted to switch to Capistrano for code deployments instead of building a Docker image each time, because pushing the image to Docker Hub took too damn long (upwards of 10 minutes at its worst).

That felt wrong, and kind of dirty. To me, Docker is about creating a bundle of code that you can deploy anywhere, not about creating a bundle of infrastructure that you then deploy into.

It was also surprising because I had started off with a Dockerfile based on Brian Morearty's blog post about skipping the bundle install each time you build a Docker image, so I didn't think I had a lot of optimization left available to me.

But once we got into golfing around with the Dockerfile and the .dockerignore file, we found massive improvements to be had.

The .dockerignore file

One of the issues we found right away was that he had 500M of log files in his app's log directory, so we added log/* to the .dockerignore. We were also picking up the .git directory, the db/*.sqlite3 databases, etc.

Once we removed those we dropped around 600M off the size of our Docker image, and that helped quite a bit.

Here's what we ended up with in the .dockerignore
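
At a minimum it covered the entries called out above (the real file may have had a few more):

log/*
.git
db/*.sqlite3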

The Dockerfile

Next we looked into the Dockerfile itself. Here's what we started with:
FROM dfw1/base:latest

USER root
RUN mkdir -p /opt/app/dfw
WORKDIR /opt/app/dfw
ADD Gemfile /opt/app/dfw/Gemfile
ADD Gemfile.lock /opt/app/dfw/Gemfile.lock
RUN chown -R app:app /opt/app/dfw
USER app
RUN jruby -S bundle install --without development test --no-color --path /opt/app/dfw
ADD . /opt/app/dfw
USER root
RUN chown -R app:app /opt/app/dfw
ADD ./container/dfw-supervisor.conf /etc/supervisor.d/dfw.conf
ADD ./container/dfw-nginx.conf /etc/nginx/sites-enabled/dfw.conf
USER app
RUN EXECJS_RUNTIME='Node' JRUBY_OPTS="-X-C" RAILS_ENV=production jruby -S bundle exec rake assets:precompile
USER root 

Yeah, that kind of sucks. So we started some Dockerfile golf.

(The base image I'm building FROM is basically "install a bunch of stuff and then run supervisord as my CMD".)

Optimization Goals

I kept a couple of goals in mind while going through the Dockerfile.
  • Create as few layers as possible.
    • I had this drilled into me from the early days of Docker, back when AUFS had a 42-layer limit, so I was already grouping my 'RUN' commands and trying to be as sparse as possible elsewhere.
  • There should be only one big layer that isn't cached, and that's the layer with the code in it.
    • We found that with the above Dockerfile we were pushing a 70 MB layer, then a 100 MB layer, and then a 15 MB layer. That was too many layers, and too large.

What We Learned

Avoid USER switching (if you can)
The first thing I didn't like was all of the USER switching. Each switch is a layer that needs to be pushed, and even though each one is small, it still takes a few seconds for the transfer and the layer verification.

In my new Dockerfile all of the RUN commands run as root; the application itself runs as the app user thanks to supervisord.

You just need to know where to fix the permissions after the fact.
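
For reference, handing the privilege drop off to supervisord is just a matter of the user directive in the program block - a minimal sketch, where the program name and command are placeholders rather than the real dfw config:

; minimal sketch - program name and command are placeholders
[program:dfw]
directory=/opt/app/dfw
; however the app is actually launched goes here
command=/opt/app/dfw/bin/start
; supervisord drops to the app user, so the image's RUN steps can all stay root
user=app
autostart=true
autorestart=true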

Sequence Matters!
Look at all the work I'm doing AFTER the line: ADD . /opt/app/dfw

Nothing after that line will ever be cached - we're only rebuilding the image because the code changed, so that ADD invalidates the cache every time - even though a lot of what follows it will rarely change.

Move the static ADDs, like the supervisord configs, to the top of the Dockerfile, before the un-cacheable ADD.

After those two changes we're looking better:
FROM dfw1/base:latest

RUN mkdir -p /opt/app/dfw
WORKDIR /opt/app/dfw
ADD ./container/dfw-supervisor.conf /etc/supervisor.d/dfw.conf
ADD ./container/dfw-nginx.conf /etc/nginx/sites-enabled/dfw.conf
ADD Gemfile /opt/app/dfw/Gemfile
ADD Gemfile.lock /opt/app/dfw/Gemfile.lock
RUN chown -R app:app /opt/app/dfw
RUN jruby -S bundle install --without development test --no-color --path /opt/app/dfw
ADD . /opt/app/dfw
RUN chown -R app:app /opt/app/dfw
RUN EXECJS_RUNTIME='Node' JRUBY_OPTS="-X-C" RAILS_ENV=production jruby -S bundle exec rake assets:precompile

But there are a few things we can still do.

Try to DRY up the Dockerfile

My "dfw1/base" image has a massive 'RUN' in it that's joined by a bunch of &&'s 
I have some things in this file that could be moved up into that RUN so that it's just 1 layer.

Specifically my mkdir -p /opt/app/dfw.

Sure, not all of the apps that use that base image will use /opt/app/dfw, but it doesn't hurt those apps for it to be there as an empty directory.

Also, since all of my apps will live in /opt/app/ and run as 'app', I can take out the first chown -R app:app /opt/app/dfw and have it in the 'base' image as just chown -R app:app /opt/app (that also helps squash the Dockerfiles of the other apps I build off that image).
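
So the tail of that big RUN in dfw1/base ends up looking something like this ("true" stands in for the existing provisioning steps, which also create the app user):

# "true" stands in for the existing provisioning in dfw1/base (jruby, nginx, supervisord, the app user, ...)
RUN true \
 && mkdir -p /opt/app \
 && chown -R app:app /opt/app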

Beware the chown!

This one was a huge /facepalm for me.

Look at what I do:
ADD . /opt/app/dfw
RUN chown -R app:app /opt/app/dfw

I HAVE to chown the dfw directory to app:app because Docker ADDs create files owned by UID 0,
so app can't write to its logs or create any cache files, etc.

But think about what we've done:
Layer 1: ADD ~72M of source.
Layer 2: touch every file in that 72M of source to update its ownership.

Basically I now have two ~72M layers that need to be pushed!

This one is a puzzler to fix. 

I solved it by changing my CMD strategy. 
The only CMD in my images was CMD ["supervisord","-c","/etc/supervisord.conf"]

I switched that to call a shell script:
chown -R app:app /opt/app
supervisord -c /etc/supervisord.conf

Now I could take all the chowns out of my images, and we're no longer duplicating layers unnecessarily.
It also opened up quite a few optimization paths in my general workflow, since I can add other things to that shell script instead of doing them in layers.
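
Sketched out, the start script and the CMD in the base image look roughly like this (the script path is my own placeholder, and it assumes supervisord is configured with nodaemon=true so the container stays in the foreground):

#!/bin/sh
# runs as root at container start: fix ownership once, at runtime, instead of in an image layer
chown -R app:app /opt/app
# hand off to supervisord, which starts nginx and the app as their own users
exec supervisord -c /etc/supervisord.conf

And in the base image's Dockerfile:

ADD ./container/start.sh /opt/start.sh
CMD ["/opt/start.sh"]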

Not everything needs to be done in the Dockerfile

At this point we're really well optimized. However, there's one thing that still took quite a long time (as it usually does): precompiling the assets for Rails.

My friend said, "Can't we just do that outside of docker?" I'm not a rails expert by any means, but he is, and if he thought we could do it I'd give it a try.

We were already using a build script to create and push our Docker images. It's effectively:
docker build -t dfw1/dfw:latest . && docker push dfw1/dfw:latest

We're developing on Macs using boot2docker, which means all of the 'docker build' work is actually being done inside a VirtualBox VM, which isn't the fastest environment.

Also, in the container we only have JRuby, which takes a bit of time to bootstrap itself.

Running the rake ahead of time cuts down the build time pretty significantly.
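
So the build script grew a precompile step that runs on the host, using the host's Ruby, before the image is built - roughly:

# build.sh (sketch)
RAILS_ENV=production bundle exec rake assets:precompile
docker build -t dfw1/dfw:latest . && docker push dfw1/dfw:latest

The precompiled public/assets directory then simply rides along in the ADD . /opt/app/dfw layer.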


In the end we knocked 9 minutes or more off the build/push of our Dockerfiles, and my friend was no longer calling for a move to Capistrano. Our Dockerfile ended up being really nice and concise:

FROM dfw1/base:latest
ADD Gemfile /opt/app/dfw/Gemfile
ADD Gemfile.lock /opt/app/dfw/Gemfile.lock
RUN RAILS_ENV=production /opt/jruby/bin/jruby -S bundle install --without development test --no-color --path /opt/app/dfw
ADD . /opt/app/dfw

It's still not perfect. The push phase still takes way too long, and the 'already pushed' images take a couple of seconds each just to determine that they've already been pushed.

We've speculated that if the push phase could run in parallel, or just faster, we could shave another 20 seconds or more off the build. But as it is, we're content with what we have.

Wednesday, June 25, 2014

PostgreSQL Performance on Docker

While preparing my presentation on Postgres and Docker to the PDX Postgres User Group for the June meeting, I thought it would be cool to run Postgres on Docker through some rough benchmarks. What I ended up finding was fairly surprising.

The Setup

For this test I used Digital Ocean's smallest droplet: 512M RAM / 1 CPU / 20G SSD disk.
They have an awesome feature where you can choose the type of application you want to run on the droplet and they'll pre-configure it for you. They even have one for Docker, so that's the one I chose.

Here's a list of all of the versions of the software I used.

Host OS: Ubuntu 14.04 LTS
Kernel: 3.11.0-12-generic
Docker version 1.0.1, build 990021a
PostgreSQL Version: 9.3.4

I used pgtune to generate my configuration, and I used these same values across all tests:
maintenance_work_mem = 30MB 
checkpoint_completion_target = 0.7
effective_cache_size = 352MB 
work_mem = 2304kB 
wal_buffers = 4MB 
checkpoint_segments = 8 
shared_buffers = 120MB 
max_connections = 200 

Dockerfiles can be found here

The Tests

I used Greg Smith's pgbench-tools to run my tests with the following configuration:
SCALES="1 10 100"
SETCLIENTS="1 2 4 8 16"
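
Those two settings go into pgbench-tools' config file (alongside its defaults for test type, run length, and so on), and the whole matrix is kicked off with its runset script:

# from the pgbench-tools checkout, after editing ./config
./runset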

My plan was to test how much overhead things like Docker's virtual networking and virtual I/O incur, and generally how much slower Postgres would be on Docker. The tests that I thought would be representative were:

  • Baseline Postgres on the system w/o Docker
  • Postgres inside Docker with virtual networking and virtual I/O
  • Postgres inside Docker with virtual networking but no virtual I/O
  • Postgres inside Docker with no virtual networking or I/O
After the results of those tests came in, I ran another group of tests:
  • Baseline Postgres with fsync=on
  • Baseline Postgres with fsync=off
  • Docker Postgres with fsync=on
  • Docker Postgres with fsync=off
After all of those tests were run, it occurred to me that the base image I was using for Docker was CentOS while my "baseline Postgres" was Ubuntu. So I added one final test:
  • Docker Postgres w/Ubuntu base image
I dropped caches and restarted Docker and/or Postgres between each run to ensure that I wasn't comparing runs with different cache states.
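
The between-runs cleanup was nothing fancy - something along these lines (the Docker service name depends on how it was packaged):

sudo service postgresql restart      # bare-metal runs
sudo service docker restart          # Docker runs
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches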

EDIT: 2014-07-09

This original post was targeted at DBAs and folks who hadn't heard much about Docker, so I glossed over some of the specifics above. I'd like to attempt to rectify that.

When I say "With no virtual networking" what I mean is Host networking --net==host
When I say "With no Virtual I/O above" what I mean is that I used a volume 

I didn't bake the volume into my Docker image, so that I could use the exact same image for all tests.
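
Concretely, the Docker variants were started along these lines (the image name and host data directory are placeholders):

docker run -d -p 5432:5432 pgbench/postgres                                      # virtual networking + virtual I/O
docker run -d -p 5432:5432 -v /pgdata:/var/lib/pgsql/9.3/data pgbench/postgres   # virtual networking, volume-backed I/O
docker run -d --net=host -v /pgdata:/var/lib/pgsql/9.3/data pgbench/postgres     # host networking, volume-backed I/O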

With regard to the validity of the test on a VM:

I ran the full suite of tests 20 to 30 times (because I didn't believe the results).
It's certainly possible that I managed to hit Digital Ocean at Just.The.Right.Time(tm) on every run, but I doubt it.

Regardless, I don't claim that these results are perfect. I've posted my methodology and given access to the source, so please reproduce and dispute. Honestly, if you find that I did something significantly wrong in these benchmarks that would make them "fit" more sensible expectations, I'd be very grateful.

The Results

The results for the first round of tests were pretty surprising.

What wasn't surprising is that with virtual I/O and virtual networking, Docker performed predictably slower than the baseline.

However, Docker with host networking and no virtual I/O was by far the fastest. This was very surprising, and I still don't trust these results, despite repeated runs confirming them.

I presented these results to the PDXPUG and received some good feedback on things to try to help validate the results. 

A prevailing theory was that Docker may be messing with/caching/breaking fsync, so I ran the test with fsync on and fsync off in Postgres - here are the results:

So Dockerized Postgres with fsync ON is faster than vanilla Postgres with fsync off...
That seems really wrong. However, Digital Ocean gives its droplets SSDs, which may mean that I/O really isn't much of a factor here; that would mean the difference is elsewhere.

Another theory was that Postgres on CentOS is simply faster than Postgres on Ubuntu: despite the fact that they share the same kernel, it's possible that the CentOS C library has some optimizations that the Ubuntu one does not. So I ran a test of Dockerized Postgres on Ubuntu vs. vanilla Postgres on the host machine (also Ubuntu). Here's the output from that run:

Dockerized Ubuntu is pretty much on par with Dockerized CentOS. So that's good at least.

Further thoughts and analysis

First, let me state that I wouldn't be surprised at all if I missed some major factor in these tests or if I was somehow tricked by the results. The results don't feel right at all. I've run the tests over and over again and have gotten consistent results, but that doesn't mean there isn't some sort of environmental explanation. Perhaps the Digital Ocean droplet itself is biased towards Docker due to something under the covers.

But let's suppose for a minute that these results are legit and reproducible elsewhere - how could we explain them? My theory is that since Docker is a wrapper around LXC, it's possible that LXC has a more efficient execution path within the kernel and is allowing the Dockerized Postgres to use more resources with fewer interruptions.

I have some support for that: during my tests I was also gathering low-level system statistics and found that Dockerized Postgres shows significantly fewer context switches than normal Postgres.

Dockerized Postgres

Normal Postgres
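
For anyone reproducing this, vmstat's cs column is a simple way to watch context switches per second while a pgbench run is in flight:

# sample once a second; the "cs" column is context switches per second
vmstat 1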

The Conclusion

It's definitely a surprising result, and one that goes against my expectations. Obviously further research is needed to confirm it.

The thought that LXC somehow has an optimized path within the kernel is very satisfying, but at the same time there are too many variables to know for sure.

I look forward to suggestions for improving my tests and any conflicting evidence from your own tests. 

Wednesday, May 7, 2014

Stackdriver Acquired by Google

Grats to Stackdriver on their acquisition by Google.

Now, if they'd like to bring me - and my Stackdriver Pluggable Custom Metrics - along, that'd be just peachy.

Tuesday, November 19, 2013

Tune your Postgres RDS instance via parameter groups

It seems like the folks at Amazon set some strange defaults for their RDS Postgres instances, and they make it pretty darn difficult to allow for dynamically sized instances.

You tune your Postgres RDS instance via Parameter Groups. In the parameter group configuration are all of the normal PG tuning parameters from your postgresql.conf.

They provide you with a variable, {DBInstanceClassMemory}, which returns the memory in bytes available to the instance, and you can use it in some limited ways to dynamically set parameters based on the instance type you chose for your RDS database. There may be more of these variables, but I wasn't able to find them.

One commenter pointed out that DBInstanceClassMemory is possibly not the entire memory of the machine. For example, DBInstanceClassMemory on an m1.xlarge would not be 16106127360 (16GB) but something lower, to account for the memory allocated to the OS. I hope this will be changed in the future, since most Postgres tuning guides are based on total system memory, not memory minus some arbitrary OS usage.

It'd be easy to just hop in and hardcode your values, but that's not really very fun, now is it? Besides, we want to be able to grow our instances or spin up smaller/larger read replicas without defining a Parameter Group for each size we might want, so I'm endeavoring to make my parameter group as dynamic as possible.

I spun up an m1.xlarge RDS instance and did some playing around.

Before we get started, one thing to note is that you don't want any spaces inside the {}'s when doing calculations, otherwise AWS will respond with a format error. It also seems to have problems with multiple divisions or multiplications in one expression: sometimes they're accepted, often they're not.

Below is a list of configuration parameters that I usually set in all my databases (based on PgTune). Naturally you'd want to set these based on your needs, but I find that PgTune gets you 90% of the way there most of the time.

Edit: As some commenters have pointed out, I got a few of the calculations wrong by not taking into account that RDS uses kB for some parameters, or for other reasons (I actually suspect that Amazon changed how they calculate some of these parameters). I've updated the values below with the correct/current information.


max_connections

We all know that max_connections is best expressed as a function of the number of CPUs, however Amazon went with this formula: {DBInstanceClassMemory/12582880} - which on the m1.large would leave you with 604 connections... Unless Amazon added a connection pooler to Postgres I don't see how that could possibly work.

Folks migrating from MySQL or Oracle could easily get tripped up by not paying proper attention to that parameter.

I usually keep max_connections to around 200, just because I usually deal with Ruby/Java where I need to oversubscribe a bit.


shared_buffers

The default here is {DBInstanceClassMemory/32768}, which is correct. Originally it didn't seem to calculate correctly, so I had recommended a change.


checkpoint_segments

The default number of checkpoint segments is 16, which is a pretty good base; however, if you're running a web application (which you probably are if you're using RDS) PgTune recommends that you set it to 8.


work_mem

work_mem is a little tricky. PgTune and I usually base it on max_connections. However, I haven't managed to find a way to reference another database parameter from within the RDS parameter section - for example, {max_connections} doesn't work.

So you'll need to hard code something like your connections in there.

So work_mem: {DBInstanceClassMemory/204800} # (200 (max_connections) * 1024 = 204800)

On an m1.xlarge that gives you ~74MB, which is what pgtune would give you on the same instance.


maintenance_work_mem

For this parameter RDS just uses the Postgres default, which is 16MB, so creating indexes on large tables would end up taking a few years.

I try to keep this one around 1GB if the machine has enough memory. A nice rule of thumb would be system memory (in bytes) / 16, which is what pgtune generates.

We need to take into account the fact that RDS deals in kB for this parameter, so: {DBInstanceClassMemory/16384} (16 * 1024)

This gives us 924MB on an m1.xlarge


effective_cache_size

Effective cache size is generally tricky in AWS, since even with provisioned IOPS you're getting only slightly better than thumb-drive speed out of your disks, so you may benefit from a smaller effective_cache_size.

As commenter Alexandre Russel pointed out, RDS's unit of measure for effective_cache_size is 8kB blocks.

The effective_cache_size default on RDS is {DBInstanceClassMemory/16384}.
Since the unit is 8kB blocks, that works out to DBInstanceClassMemory / 16384 * 8192 = DBInstanceClassMemory / 2 - in other words, 50% of memory.

To get it closer to the 75% we would normally use, we can multiply by 75 and divide by (100 * 8192):
{DBInstanceClassMemory*75/819200} - which brings us up to a little over 10GB for an m1.xlarge.


wal_buffers

Uses the Postgres default of 2048 bytes. I usually do 512 * checkpoint_segments for this one, so you may need to just hardcode it in at 4096.


checkpoint_completion_target

Again, there's no good way to set this one dynamically; fortunately it's pretty fixed for a web workload at 0.7.
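
If you'd rather script this than click through the console, the same expressions can be pushed into a parameter group with the AWS CLI - a sketch (the group name is a placeholder, and the brace expressions are passed through literally for RDS to evaluate):

aws rds create-db-parameter-group \
    --db-parameter-group-name pg93-tuned \
    --db-parameter-group-family postgres9.3 \
    --description "PgTune-style defaults"

aws rds modify-db-parameter-group \
    --db-parameter-group-name pg93-tuned \
    --parameters "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot" \
                 "ParameterName=checkpoint_segments,ParameterValue=8,ApplyMethod=pending-reboot" \
                 "ParameterName=work_mem,ParameterValue={DBInstanceClassMemory/204800},ApplyMethod=pending-reboot" \
                 "ParameterName=maintenance_work_mem,ParameterValue={DBInstanceClassMemory/16384},ApplyMethod=pending-reboot" \
                 "ParameterName=effective_cache_size,ParameterValue={DBInstanceClassMemory*75/819200},ApplyMethod=pending-reboot"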

Wednesday, March 6, 2013

Move to BitBucket

I've moved my public projects over to BitBucket since I've got some private repos over there and I like to have everything together.

New URLs for my open-source projects are here:
