Lately I’ve become more and more of a fan of the concept of Immutable Servers while automating our infrastructure at Zapier. The concept is simple: never do server upgrades or changes on live servers. Instead, build out new servers with the updates applied and throw away the old ones. You basically get all the benefits of immutability in programming at the infrastructure level, plus you never have to worry about configuration drift. And even better, I no longer have to fear that, despite extensive tests, someone might push a puppet manifest change that out of the blue breaks our front web servers (sure, we can roll back the changes and recover, but there is still a small potential outage to worry about).
Obviously you need some good tooling to make this happen. Some recent fooling around with packer has allowed me to put together a setup that I’ve been a little pleased with so far.
In our infrastructure project we have a nodes.yaml that defines node names and the AWS security groups they belong to. This is pretty straightforward and used for a variety of other tools (for example, vagrant).
We use this nodes.yaml file with rake to produce packer templates to build out new AMIs. This keeps me from having to manage a ton of packer templates as they mostly have the same features.
This is used in conjunction with a simple erb template that injects the node name.
This will generate a packer template for each node that will:
- Create an AMI in us-east-1
- Use an Ubuntu Server 13.04 AMI as the starting point
- Set the security group to packer in EC2. We create this group and allow it access to puppetmaster’s security group. Otherwise packer will create a random temporary security group that won’t have access to any other groups (if you follow best practices at least)!
- Install puppet
- Run puppet once to configure the system
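To make the steps above concrete, here is a rough sketch of the generation step, written in Python rather than the rake and erb setup described here. The `NODES` dict is a hypothetical stand-in for the parsed nodes.yaml, the AMI and security group ids are placeholders, and the instance type, script path and `NODE_NAME` variable are assumptions for illustration only.

```python
import json
import os

# Hypothetical stand-in for the parsed nodes.yaml: node name -> security groups.
NODES = {
    "redis": {"security_groups": ["redis"]},
    "web": {"security_groups": ["web"]},
}


def packer_template(node_name):
    """Build a JSON-style packer template (as a dict) for one node."""
    return {
        "builders": [{
            "type": "amazon-ebs",
            "region": "us-east-1",
            # Placeholder: fill in the id of an Ubuntu Server 13.04 AMI.
            "source_ami": "ami-REPLACE_ME",
            "instance_type": "m1.small",  # assumption, not from the post
            "ssh_username": "ubuntu",
            # Placeholder: the id of the pre-created "packer" security group.
            "security_group_id": "sg-REPLACE_ME",
            # packer expands {{timestamp}} to the unix time of the build.
            "ami_name": node_name + "-{{timestamp}}",
        }],
        "provisioners": [
            # Hypothetical bootstrap script that installs ruby and puppet.
            {
                "type": "shell",
                "script": "scripts/install_puppet.sh",
                "environment_vars": ["NODE_NAME=" + node_name],
            },
            # Run puppet a single time; the agent is never left running.
            {
                "type": "shell",
                "inline": ["sudo puppet agent --onetime --no-daemonize --verbose"],
            },
        ],
    }


def write_templates(nodes, out_dir="packs"):
    """Write one <node>.json template per node and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for name in sorted(nodes):
        path = os.path.join(out_dir, name + ".json")
        with open(path, "w") as handle:
            json.dump(packer_template(name), handle, indent=2)
        paths.append(path)
    return paths
```

Calling `write_templates(NODES)` would leave a `packs/redis.json` and a `packs/web.json` on disk, one template per node, so the per-node differences stay in a single place.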
We also never enable puppet agent (it defaults to not starting) so that it never polls for updates. We could also remove puppet from the server after it completes so the AMI doesn’t have it baked in.
Packer has a nice feature that lets you specify shell commands and shell files to run. This is fine for bootstrapping, but not so fine for the level of configuration management that puppet is better suited for. So our packer templates call a shell script that installs puppet and makes sure we don’t use the age-old version of ruby that linux distros love to default to. As part of the installation it also specifies the puppet master server name (if you’re using VPC instead of EC2 classic, you don’t need this, as you can just assign the internal dns name “puppet” to puppetmaster).
Now all we need to do to build out a new AMI for redis is run
packer build packs/redis.json
and boom! A server is created, configured, imaged and terminated. Now just set up a few jobs in jenkins to generate these based on certain triggers and you’re one step closer to automating your immutable infrastructure.
Of course, each AMI you generate is going to cost you a penny a day or some such. This might seem small, but once you have 100 revisions of each AMI it’s going to cost you! So as a final step I whipped up a simple fabfile script to clean up the old images. This proved to be a simple task because we include a unix timestamp in the AMI name.
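The selection logic behind such a cleanup can be sketched roughly as follows. This is plain Python rather than the actual fabfile, it assumes AMI names of the form `<node>-<unix timestamp>`, and the function name and input shape are made up for illustration. It only decides which images are stale; listing and deregistering them through the AWS API is left out.

```python
from collections import defaultdict


def stale_amis(images, keep=1):
    """Return the image ids to remove, keeping the newest `keep` per node.

    `images` is a list of (image_id, ami_name) pairs, where ami_name is
    assumed to look like "<node>-<unix timestamp>".
    """
    if keep < 0:
        raise ValueError("keep must be zero or greater")

    by_node = defaultdict(list)
    for image_id, name in images:
        node, _, stamp = name.rpartition("-")
        if not node or not stamp.isdigit():
            continue  # name does not match our scheme, so leave it alone
        by_node[node].append((int(stamp), image_id))

    stale = []
    for node in sorted(by_node):
        # Sort newest first, then mark everything past the first `keep` as stale.
        newest_first = sorted(by_node[node], reverse=True)
        stale.extend(image_id for _, image_id in newest_first[keep:])
    return stale
```

With the default `keep=1` only the latest AMI per node survives; passing a larger `keep` retains that many recent revisions instead.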
Set this up as a post-build step on the jenkins job that generates the AMI and you ensure you always have only the latest one. You could probably also tweak this to keep the last 5 AMIs around for archiving purposes.
I admit I’m still a little fresh with this concept. Ideally I’d be happy as hell to get our infrastructure to the point where each month (or week!) servers get recycled with fresh copies. For more transient servers like web servers or queue workers, this is easy. With data stores this can be a little trickier, as you need an effective strategy to boot up replicas of primary instances, promote replicas to primaries and retire the old primaries.
A final challenge is deciding what level of mutability is allowed. Deployments are obviously fine as they don’t tweak the server configuration but what about adding / removing users? Do we take an all or nothing approach or allow tiny details like SSH public keys to be updated without complete server rebuilds?