Kristian Lyngstøl's Blog

High-end Varnish-tuning

Posted on 2009-10-19

Most of the time when I tune Varnish servers, the main problem is hit rate. That's mostly a matter of whack-a-mole, and fairly straightforward. However, once you go beyond that, things get fun. I'll take you through a few common tuning tricks. All of this assumes no disk I/O, so either sort that out first or expect different results.

The big ones

The first thing you want to do is sort your threads out. Use one thread pool for each CPU core. Never run with fewer than, say, 800 threads in total. If you think that's a lot, then you don't need these tips. For the maximum, I don't advise going over 6000; I'll explain that shortly. So if you have 8 CPU cores, you will want to set the following (thread_pool_min is per pool, so 8 pools of 100 give the 800-thread minimum):

thread_pools 8
thread_pool_min 100
thread_pool_max 5000
thread_pool_add_delay 2

Note that I also set the thread_pool_add_delay to 2ms. That should drastically reduce the startup time for your threads, and is fairly safe. The reason we don't create everything instantly is to avoid bombing the kernel.
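As a rough illustration of the arithmetic above, here is a minimal Python sketch that derives the thread settings from a core count. The function name thread_params and its defaults are my own illustration, not part of Varnish; it assumes thread_pool_min is applied per pool and simply passes the maximum through unchanged.

```python
import os


def thread_params(cores, min_total=800, max_threads=5000, add_delay_ms=2):
    """Return thread-related varnishd parameters for a given core count.

    Hypothetical helper: one pool per core, and the total minimum is
    split across the pools (assumes thread_pool_min is per pool).
    """
    # Ceiling division so the total never drops below min_total.
    per_pool_min = -(-min_total // cores)
    return {
        "thread_pools": cores,
        "thread_pool_min": per_pool_min,
        "thread_pool_max": max_threads,
        "thread_pool_add_delay": add_delay_ms,
    }


if __name__ == "__main__":
    for name, value in thread_params(os.cpu_count() or 1).items():
        print(name, value)
```

With 8 cores this reproduces the values listed above; with other core counts the per-pool minimum is rounded up.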

The main danger with threads - if we rule out I/O - is file descriptors. Currently the log format we use has a 16-bit field reserved for file descriptors (which I believe is fixed in trunk), and that limits us to 64k file descriptors. Your kernel only cleans them up periodically, so running out is a very real risk. Please keep in mind that synthetic tests are horrible at testing this. You can probably use 40 000 threads in a synthetic test without running into file descriptor issues, but do not use that in production. 6000 might be high, and unless you really, really need it, I wouldn't go beyond 2000 or 3000. I've done quite a bit of testing and tried out different options on production sites, and have found that 800 is a sane minimum, and I've rarely seen max threads be an issue until you hit the fd limit. You can watch /proc/<PID of varnish child>/fd/ to see how many fds Varnish has allocated at any given time.
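One way to watch that directory is a small script. This is only a sketch, assuming a Linux /proc layout and that you run it as a user allowed to read the child's fd directory; count_fds and fd_headroom are names I made up, and the 65536 limit is just the 16-bit figure mentioned above.

```python
import os


def count_fds(pid, proc_root="/proc"):
    """Count open file descriptors by listing <proc_root>/<pid>/fd."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_headroom(pid, limit=65536, proc_root="/proc"):
    """How many descriptors are left before the assumed 16-bit limit."""
    return limit - count_fds(pid, proc_root)


if __name__ == "__main__":
    # Inspect this script's own process as a harmless demonstration;
    # in practice you would pass the PID of the Varnish child.
    me = os.getpid()
    print("open fds:", count_fds(me))
    print("headroom:", fd_headroom(me))
```

Polling this every few seconds during a real traffic peak tells you far more than a synthetic test will.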

The next issue you are likely to run into is cli_timeout. If your Varnish is heavily loaded, it might not answer the management process in a timely fashion, and the management process will then kill it off. To avoid that, set cli_timeout to 20 seconds or more. Yes, 20. That's the extreme, but I have gradually increased this over months of routine tests. I'm currently running these tests with a cli_timeout of 25 seconds, which so far has worked. 23 worked until today. For most sites and most real workloads, I doubt this is necessary, but if it is and you actually hit this in production, your Varnish will restart when it's at its busiest, which is probably the worst possible scenario. Set it to at least 10-15 seconds. (We increased the default to 10 seconds a while ago. It's a sane compromise, but a tad low for an overloaded Varnish.)

Last but not least of the common tricks is a well-kept secret: session_linger. When you have a bunch of threads and Varnish becomes CPU-bound, you are likely to get killed by context switching and whatnot. To reduce this, setting session_linger can help. You may have to experiment a bit, as it depends on your content. I recently had to set it to 120ms to get it to really do the trick. The site load would climb to 60k req/s, then crumble to a measly 2-5k req/s during tests. Session linger fixed that. However, don't set it too high, or you will leave your threads idling.

Session_linger has been improved in trunk, and will be enabled by default in 2.0.5, but it's still useful in 2.0.4.

[Update] Session linger causes your threads to wait around for more data from the client they are currently working with. Without it, you risk switching threads between piped requests, which requires moving a lot of data around and allocating/freeing threads. It's better to have spare threads than to constantly switch the ones you have around.

Misc

Another value you may want to change is lru_interval. This mainly governs updates to the LRU list, and the default is 2 seconds. There are several pages that mention an lru_interval of 3600, but we've seen such values cause problems in the past. I would consider something like 20 seconds. It's not going to have a huge impact on your performance.

People also increase the listen depth. This might be necessary, but I've not seen any solid evidence that it is, so I generally avoid it.

Another thing to consider is using critbit instead of classic hashing. That is more relevant for huge data sets, and I've not seen any significant performance gain in my synthetic tests yet, but I know some people have, so it's something you might want to look into.

Session timeout is generally fine at the default (4s), but you should not increase it, or you might run into file descriptor issues.

Then there's your load balancer. We've had several cases where Varnish has run into issues because of an enormous number of connections. You do NOT want to make a connection for every single request.

Summary

thread_pools 8
thread_pool_min 100
thread_pool_max 5000
thread_pool_add_delay 2
cli_timeout 25
session_linger 50/100/150
lru_interval 20
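If you start varnishd from a script, the summary can be turned into -p name=value arguments. The sketch below only builds the argument list; it does not start anything. The dictionary mirrors the summary above, except that I picked 100 for session_linger purely as an example of the 50/100/150 range, and param_args is a made-up helper name.

```python
import shlex

# Values from the summary; session_linger=100 is an assumed middle
# value, since the right setting depends on your content.
SUMMARY = {
    "thread_pools": 8,
    "thread_pool_min": 100,
    "thread_pool_max": 5000,
    "thread_pool_add_delay": 2,
    "cli_timeout": 25,
    "session_linger": 100,
    "lru_interval": 20,
}


def param_args(params):
    """Turn a parameter dict into varnishd '-p name=value' arguments."""
    args = []
    for name, value in params.items():
        args += ["-p", f"{name}={value}"]
    return args


if __name__ == "__main__":
    # Print the fragment so it can be pasted after your usual options.
    print(" ".join(shlex.quote(arg) for arg in param_args(SUMMARY)))
```

The printed fragment goes after whatever listen address, backend and storage options you already pass to varnishd.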

Testing

Testing all of this is a different story, but I will point out a few common pitfalls:

- Testing your stress testing tool. You need a number of machines to test Varnish, otherwise Varnish isn't going to be the bottleneck; your stress testing system is. I use a cluster of 6 servers to test Varnish: one is the Varnish server and the other 5 hammer it. That's barely enough, even though the Varnish server is not specced for high performance compared to the other nodes.
- Using too few connections or too many. Real life seems to suggest that 10 requests per connection is fairly realistic.
- Testing only cache hits. This is great for getting huge numbers, but obviously not all that realistic. For a proper test, you may want to generate URLs from log files and balance them accordingly.
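To make the last two points concrete, here is a minimal sketch of building a test plan from an access log: it counts how often each URL was requested and then draws URLs with matching weights, 10 per connection. It assumes log lines in Common Log Format (the request is the first quoted field); url_counts and connection_plan are hypothetical names, and feeding the plan to an actual load generator is left out.

```python
import random
from collections import Counter


def url_counts(log_lines):
    """Count requested URLs, assuming Common Log Format lines."""
    counts = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request field on this line
        request = parts[1].split()  # e.g. ['GET', '/index.html', 'HTTP/1.1']
        if len(request) >= 2:
            counts[request[1]] += 1
    return counts


def connection_plan(counts, connections, requests_per_connection=10, seed=None):
    """Return one URL list per connection, weighted by log frequency."""
    rng = random.Random(seed)
    urls = list(counts)
    weights = [counts[url] for url in urls]
    return [
        rng.choices(urls, weights=weights, k=requests_per_connection)
        for _ in range(connections)
    ]


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [19/Oct/2009:10:00:00 +0200] "GET / HTTP/1.1" 200 512',
        '10.0.0.2 - - [19/Oct/2009:10:00:01 +0200] "GET /favicon.ico HTTP/1.1" 200 318',
        '10.0.0.1 - - [19/Oct/2009:10:00:02 +0200] "GET / HTTP/1.1" 200 512',
    ]
    for urls in connection_plan(url_counts(sample), connections=2, seed=1):
        print(urls)
```

Because the weights come from real traffic, rarely requested URLs still show up now and then, which is what gives you a believable mix of hits and misses.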

Results?

Our single-core Opteron at 2.2GHz handles 27k requests/s consistently. Sure, the load can hit 400-600 but hey, it works. This scales fairly well too, so if that was a dual quad core I wouldn't be surprised if we could reach 180 k req/s (but I have no idea where we'd get the firepower from to test that - or the bandwidth. I assume there'd be some completely different issues at that point). This is with 1-byte pages, mind you. I've seen varnish deliver favicon.ico at 60k req/s on a dual quad, but that was an underachiever ;)
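For what it's worth, the guess above can be written out as arithmetic. This tiny sketch assumes a dual quad core means 8 cores, each comparable to the single-core Opteron, which is a simplification; scaling_efficiency is just an illustrative name.

```python
def scaling_efficiency(single_core_rps, cores, expected_rps):
    """Fraction of perfect linear scaling that expected_rps represents."""
    return expected_rps / (single_core_rps * cores)


if __name__ == "__main__":
    # 27k req/s on one core, guessed 180k req/s on 8 cores.
    print(f"linear ceiling: {27000 * 8} req/s")
    print(f"implied efficiency: {scaling_efficiency(27000, 8, 180000):.0%}")
```

In other words, the 180k estimate already allows for losing a bit under a fifth of perfect linear scaling.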