Pushing Varnish even further

Posted on 2010-01-13

A while ago I did a little writeup on high-end Varnish tuning (http://kly.no/posts/2009_10_19__High_end_Varnish_tuning__.html) where I noted that I made our single core 2.2GHz Opteron reach 27k requests/second. This begged the questions as to how well Varnish scale with hardware. So I went ahead and tried to overload our quad-core Xeon at 2.4GHz. It would obviously take some extra fire power. At the very least, four times as much as the last batch of tests.

Hardware involved

Our main set of test servers for Varnish are called varnish1, varnish2, varnish3, varnish4, varnish6 and varnish7. These have mostly different software and hardware - which is done intentionally so we can perform tests under different circuimstances. We routinely run tests against Varnish2 and Varnish4, which run CentOS and FreeBSD, respectively. For my last test, I used Varnish2 as the server and the remaining servers as test nodes. By any normal math, I would need about 4 times more fire power to overload a 2.4GHz Quad core, compared to a single core Opteron at 2.2GHz.

To sum it up as far as this round of tests go: - Varnish1 - Single core Opteron - Varnish2 - Single core Opteron at 2.2GHz (used in the last round of tests) - Varnish3 - Single core Xeon (if I'm not much mistaken). It's also the nginx server used as backend, but that just means 1 request every X minutes. - Varnish4 - Single core Opteron (FreeBSD) - Varnish6 - Dual core Xeon of some kind - Varnish7 - Quad-core Xeon at 2.4GHz

So I needed more power. As it happens, we do alot of training and we have three classrooms full of computers for students, and I borrowed two of these class rooms, adding the following to the mix: - 10 x single core Pentium Celerons at 2.9x GHz - 10 x Core 2 Duos at 2.4ish GHz

As you might notice - a large part of the challenge when you want to test Varnish is getting your test systems to keep up.

Basic test procedures

Same as last time, more or less: 1 byte pages and httperf. I've tried ab, siege and curl... And they simply do not offer the raw power of httperf combined with the control - if anyone cares to enlighten me on how to get the most out of them, then I'm more than willing to listen.

Ideally I wanted to test with 10 requests for each connection, and with mixed data set size. As it turns out, I ended up using 100 requests / second and bursting all of the requests, which is far from realistic. More on this later.

I have an intricate script system for the nightly tests, but that's a story for an other time. For these tests I simply used clusterssh to replicate my input on 37ish shells. This has allowed me to instantly test identical setups on all the nodes, and to quickly review what their status is. I probably ran a thousand or more different variants of the same test this time around.

I've used varnishstat to monitor the request rate and other relevant stats, and top to monitor general load.

The backend I use is hosted on varnish3, which runs nginx and a simple rewrite to 'current.txt', which for this occasion was linked to a 1byte file.

Results

Varnish uses alot of threads, and as such, when it does finally saturate the CPU, the load average will skyrocket. On the last test, Varnish2 had a load of 600-700. During this load, Varnish2 would use 10-15 seconds to start 'top'.

During this round of tests I had roughly 87GHz worth of clients, spread over 25 physical computers. All of the tests systems were running at full load. Varnish7 had a load average around 45. Logging in and starting top was close to instant. And Varnish was serving 143k requests per second.

Based on the load and general snappiness, I think it is safe to conclude that while Varnish was close to the breaking point, it hadn't actually reached it. To put it simply: My clients were not fast enough. Before I told httperf to burst 100 requests for each connection, Varnish was serving 110-120k requests per second with a load less than 1.0, and the clients were still using all their fire power. I ended up stress testing my clients. Dammit.

However, as I came fairly close to the breaking point, I still believe there are a few interesting things to look at.

The scaling nature of Varnish

It's very rare that you can see an application scale so well just by throwing cpu power and cpu cores at it. Varnish essentially didn't get affected at all by the extra work needed to synchronize work on 4 cpu cores. In fact, if you look at the math, the raw performance on 4 cpu cores was actually BETTER than on one cpu core, when you look at it on a cycle-by-cycle.

I think it's reasonably safe to say that when it comes to raw performance, we've nailed it with Varnish.

In fact, scaling Varnish is far more difficult when you increase your active data set beyond physical memory. Or when you introduce latency. Or when when you have a low cache hit rate. Or any other number of corner cases. There will always be bottlenecks.

What you can learn from this is actually simple: Do not focus on the CPU when you want to scale your Varnish setup. I know it's tempting to buy the biggest baddest server around for a high-traffic site, but if your active data set can fit within physical memory and you have a 64-bit CPU, Varnish will thrive. And for the record: All CPU-usage graphs I've seen from Varnish installations confirm this. Most of the time, those sexy CPUs are just sitting idle regardless of traffic.

Myths and further research

Since I didn't reach the breaking point, there's not much I can say conclusively. However, I can repeat a few points.

Adjusting the lru_interval had little impact regardless of data set and access patterns. If I repeat this often enough, perhaps I'll stop seeing new installations with an lru_interval of 3600: DO NOT SET lru_interval TO 3600. There. I didn't even add the usual "unless you know what you are doing" part. I might've explained it before, but the problem is that it leaves you with a really badly sorted lru-listed that will cause Bad Things once you need to lru-nuke something. Possibly really really bad things. Like throwing out the 200 most popular objects on your site at the same time.

And the size of your VCL has little impact on the performance. I have not tested this extensively, but I've never registered a difference, and since your cpu will be idle most of the time anyway, you should NOT worry about CPU cycles in VCL.

An otherimportant detail is that your shmlog shouldn't trigger disk activity. On my setup, it didn't sync to disk to begin with, but you may want to stick it on a tmpfs just to be sure. I suspect this has improved throughout the 2.0-series of Varnish, but it's an easy insurance. Typically the shmlog is found in /usr/var/varnish, /usr/local/var/varnish or similar (ls /proc/*/fd | grep _.vsl is the lazy way to find it).

I tried several different settings of thread pools, acceptors, listen depth, shmlog parameters, rush exponent and such, but none of it revealed much - most likely because I never pressured Varnish enough. This will be what I want to investigate further. But it should tell you something about how far you have to go before these obscure settings start to matter.

Feedback wanted

I figure this must be some sort of record, but I'm interested in what sort of numbers others have seen or are seeing. Have anyone even come close to the numbers above - synthetic or otherwise - from a single server? Regardless of software or hardware? This is not meant as a challenge or boast, but I'm genuinely curious on what sort of traffic people are able to push. I'm interested in more "normal" requests rates too - I'm a sucker for numbers. What are you seeing on your site? Have you had scaling issues?

Kristian Lyngstøl's Blog