Wednesday, April 8, 2009

Running Puppet on big scale

This is a rehash of my comment in slashdot discussion and my comment on Alexey Kovrygin's blog post.

We run Puppet on hundreds of servers in two datacenters and it was pain to get it working right. There are many issues which show up here and there: memory leaks in both client (puppetd) and server (puppetmaster), periodic lock ups and even file corruption. Besides it is quite slow. These problems are being slowly fixed with each new release but right now using Puppet for big installations is a source of constant problems. Unfortunately you do not notice these problems until you get many servers to manage; on smaller installations it seems to work without problems or at least they happen much less often to be noticable. In our case number of servers we managed increased slowly so we felt into the trap and now rely on Puppet too much and now it is too late to change. At the end we have managed to work around of most of issues in the Puppet we have met so combined with monitoring to catch problems it works good enough for us. On the other hand if I were to start from scratch I would evaluate something different for the project. Perhaps I would use Cfengine. It is not as flexible and nice as puppet but probably is more stable simply because it is much more old. I talked to people who used Cfengine on much bigger scale (thousands of servers) and they did not recall stability problems with it. In the long run Puppet will be probably ok too as it is being developed actively but right now I'd consider it to be in “beta” state. Or maybe even in "alpha".

For anyone interested in how to get Puppet work for real work load this is what we do:

  • We run Puppet under Apache+Mongrel. By default it runs using WEBrick what breaks easily under any moderate load. So we use Apache+Mongrel instead of it. Another benefit of using Apache is that you can run multiple backends. This helps if you have multi-core server for puppetmaster as by itself it can use only one core. Alternatively you can use Nginx+Mongrel or another other web server with proxying capabilities + Mongrel.

  • Because Puppet is slow we load balance it across two boxes in each datacenter.

  • We restart backends from time to time because they leak memory. We have a cron job to do this every 15 minutes (yes, it is that bad).

  • Puppetmaster has a cache which we saw to get corrupted sometimes. Our "fix" is to delete it before each restart. This might be fixed in later version: I've seen some closed bug reports which loooked relevant but we still have this cache clean up just in case.

  • We do not run puppet client as daemon. We run it as a cron job. Puppet client when run as daemon leaks memory and gets stuck from time to time. In our cron job we add random sleep before starting client to make sure requests do not hit server at the same time and overload it.

  • We never serve big files over puppet using its fileserver. Puppet does a number of stupid things with big files like say reading them into memory first before serving it to puppet client. If you need to distribute big files use other means (HTTP, FTP, NFS, etc).