Kristian Lyngstøl's Blog

Testing Varnish

Posted on 2012-04-19

  • How do I benchmark Varnish?
  • How do I make sure Varnish is ready for production?
  • How do I test Varnish?

These are questions I see people ask frequently, and there is no simple answer. In this blog post, I'll go through some of what you could do to test a Varnish-site.

What this blog post is not

This is not about benchmarking. I don't do benchmarking, never have. Why not? Because it's exceedingly hard and very few people succeed at proper benchmarking.

Neither is this a blog post about testing functionality on your site. You should be doing that already. I'll only say that you should test functionality, and it's often best done by browsing the site.

Also, don't expect it to be all that complete. Ask questions in comments and I might expand upon it!

What to test

Despite what most people ask about, the tools you choose are not nearly as important as what you want to test.

If you are hosting videos, I doubt that testing requests per second is a sensible metric. There are a few things you need to ask yourself:

  • What is going to be the bounding factor of my site? Bandwidth? Disk I/O? CPU? Memory? Something else?
  • How much of my site is it possible to cache?
  • Is there any way to tell ahead of time?

These questions are important, and they are related.

If your site is already in production under some different architecture, you are in luck. Your access logs can tell you a lot about the traffic pattern you can expect. This is a great start.
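
For instance, something as simple as this will show you the most requested URLs. It's only a sketch: the file name access.log and the assumption that the request URI is the seventh field (common/combined log format) are mine; adjust for your setup.

# access.log and the 7th field are assumptions; adjust for your log format
awk '{ print $7 }' access.log | sort | uniq -c | sort -rn | head -20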

If this is a new site, though, it can be harder to estimate the answers. I recommend starting with "How much of my site is it possible to cache?". If your site is mostly static content, then Varnish will be able to help you a lot, assuming you set things up accordingly. If it's a site for logged-in users, you have a much harder task. It's still possible to cache content with Varnish, but it's much harder. The details of how to do that are beyond the scope of this post.

As long as you can cache the majority of the content, chances are you will not be CPU bound as far as Varnish is concerned.

Getting a baseline

Testing a Varnish-site can be really fast, or you can spend the next six months doing it. Let's start by getting a baseline.

I usually start out with something truly simple: look at varnishstat and varnishlog while you use a browser to browse the site. It's important that this is done with a real browser and not a script, because your users are using browsers too, and you want to run into the same things they do, like cookies.

To set this up, the best way is to modify /etc/hosts (or the Windows equivalent (there is one; all the viruses use it)). The reason you don't want to just add a test-domain is that your site will go on-line under a real domain, not a test-domain. A typical /etc/hosts file could look like this for me:

127.0.0.1       localhost
127.0.1.1       freud.kly.no freud
127.0.0.1       www.example.com example.com media.example.com
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Even better is if the test machine is an external server. In that case, make sure you block any access to other port-80 services for that site. This will ensure that you don't miss any sub-domains.
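
A hedged sketch of what I mean: reject outgoing port-80 traffic from the test machine towards the real production server, so any sub-domain you forgot to point at the test setup fails loudly instead of silently hitting production. The address 192.0.2.10 is just a placeholder.

# 192.0.2.10 is a placeholder for the production web server's address
iptables -A OUTPUT -p tcp --dport 80 -d 192.0.2.10 -j REJECT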

What you are looking for is cache hits, misses and hitpasses. This should reveal if the site handles cookies properly or not. You may also want to fire up a different browser.

You also want to keep an eye out for the following (a couple of example commands follow the list):

  • Vary headers.
  • Strange or unexpected Cache-Control headers
  • Set-Cookie headers.
  • And of course: 404s or other errors.
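
To watch for this from the console, here are a couple of example commands (Varnish 2/3-era varnishlog syntax; adjust for your version). The first shows all client-side traffic so you can eyeball the headers as you click around; the second narrows it down to responses where Varnish sends a Set-Cookie to the client.

# watch all client-side transactions (headers included) as you browse
varnishlog -c
# or only show transactions where Varnish sends Set-Cookie to the client
varnishlog -c -m 'TxHeader:Set-Cookie'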

Once you've got this nailed down, and if you doubt the speed of Varnish, we can always throw in wget too:

wget --delete-after -p -r www.example.com

This will give you a recursive request for www.example.com with all prerequisites (CSS, images, etc). It's not very useful in itself, but it will give you a feel for how fast the site is without Varnish, and then again after you've put Varnish in front of it. You can easily run multiple wget-commands in parallel to gauge the bandwidth usage:

while sleep 1; do wget --delete-after -p -r www.example.com; done &
while sleep 1; do wget --delete-after -p -r www.example.com; done &
while sleep 1; do wget --delete-after -p -r www.example.com; done &

Ideally this should be network-bound, but realistically speaking, wget is not /that/ fast when it comes to tiny requests.
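
If you want to put a rough number on how much bandwidth the loops actually push, one crude, Linux-specific way is to print the interface's transmit byte counter once a second and eyeball the per-second difference. The interface name eth0 is an assumption; substitute your own.

# eth0 is an assumption; the per-second difference is bytes/s transmitted
while sleep 1; do cat /sys/class/net/eth0/statistics/tx_bytes; done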

Warning

Keep in mind that you are likely going to be hitting a DNS server frequently, especially if you don't use /etc/hosts. I've had DNS servers running at 50-70% CPU when doing stress testing in the past, which means the DNS server is affecting the test more than you want it to.

So far none of these tricks have been very fancy.

Bringing out the big guns

So you won't reach 275 kreq/s using wget. I'm not sure that should be a goal either, but it's worthwhile taking a look at.

If you are moving on to testing just Varnish, not the site itself, then it's time to move away from browsers and wget. There are several tools available for this, and I tend to prefer httperf. It's not a good tool by any sensible measure, but it's a fast one. The best way to learn httperf is to stick all the arguments into a text file and set up a shell script that randomly picks them until you find something that works. The manual pages are unhelpful at best.

An alternative to httperf is siege. I'm sure siege is great, if you don't mind that it'll run into a wall and kill itself long before your web server. If you want further proof, take a look at this part of siegerc, documenting Keep-Alive:

# Connection directive. Options "close" and "keep-alive"
# Starting with release 2.57b3, siege implements persistent.
# connections in accordance to RFC 2068 using both chunked
# encoding and content-length directives to determine the.
# page size. To run siege with persistent connections set
# the connection directive to keep-alive. (Default close)
# CAUTION: use the keep-alive directive with care.
# DOUBLE CAUTION: this directive does not work well on HPUX
# TRIPLE CAUTION: don't use keep-alives until further notice
# ex: connection = close
#     connection = keep-alive
#
connection = close

A stress testing tool that doesn't support keep-alive properly isn't very helpful. Whenever I use siege, it tends to max out at about 5000-10000 requests/second.

There's also Apache Bench, commonly known as just ab. I've rarely used it, but what little use I've seen of it has not been impressive. It supports KeepAlive, but my brief look at it showed no way to control the KeepAlive-ness. From basic tests, it also seemed slightly slower than httperf. It does seem better today than it was the first time I looked at it, though. For this blog post, I'll use httperf, simply because it's the tool I'm most familiar with and the one that has given me the right combination of control and performance.

However, httperf has several flaws:

  • It is single-threaded. It can do multiple concurrent requests, but only on a single thread. This can be worked around by running multiple instances (see the sketch after this list).
  • The documentation, while mostly complete, does not really answer enough questions.
  • It tends to max out at 1022-ish concurrent connections due to an internal limit. This might be possible to avoid if you compile it yourself. I've never bothered.
  • Busy-loops! Beware that httperf using 100% CPU does NOT mean that it is running at full capacity.
  • No graceful slowing down if you try to hit a --rate that's too fast. It'll simply give connection errors instead.
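
Since it's single-threaded, the usual workaround is to run several httperf processes side by side and add up the numbers afterwards. A rough sketch, splitting roughly the same load as the example further down across four processes (the httperf-$i.log file names are just an example):

# four httperf processes in parallel, each taking a quarter of the load
for i in 1 2 3 4; do
    httperf --rate 500 --num-conns=2500 --num-calls 20 \
        --burst-length 20 --server localhost --port 8080 \
        --uri /misc/dummy.png > httperf-$i.log &
done
wait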

The trick to httperf is to use --rate when you can. A typical httperf command might look like this (run on my laptop):

$ httperf --rate 2000 --num-conns=10000 \
        --num-calls 20 --burst-length 20 --server localhost \
        --port 8080 --uri /misc/dummy.png

httperf --client=0/1 --server=localhost --port=8080
        --uri=/misc/dummy.png --rate=2000 --send-buffer=4096
        --recv-buffer=16384 --num-conns=10000 --num-calls=20
        --burst-length=20

httperf: warning: open file limit > FD_SETSIZE; limiting max. # of
         open files to FD_SETSIZE

Maximum connect burst length: 20

Total: connections 10000 requests 200000 replies 200000 test-duration 7.076 s

Connection rate: 1413.2 conn/s (0.7 ms/conn, <=266 concurrent connections)
Connection time [ms]: min 1.6 avg 70.4 max 3049.8 median 33.5 stddev 286.4
Connection time [ms]: connect 27.9
Connection length [replies/conn]: 20.000

Request rate: 28264.9 req/s (0.0 ms/req)
Request size [B]: 76.0

Reply rate [replies/s]: min 39514.8 avg 39514.8 max 39514.8 stddev 0.0 (1 samples)
Reply time [ms]: response 35.9 transfer 0.0
Reply size [B]: header 317.0 content 178.0 footer 0.0 (total 495.0)
Reply status: 1xx=0 2xx=200000 3xx=0 4xx=0 5xx=0

CPU time [s]: user 1.33 system 5.61 (user 18.8% system 79.3% total 98.0%)
Net I/O: 15761.0 KB/s (129.1*10^6 bps)

Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

Note that httperf will echo the command you ran it with back to you, with all options expanded. I took the liberty of formatting the output a bit more to make it easier to read. The options I use here are:

  • --rate 2000 - Tries to open 2000 connections per second.
  • --num-conns=10000 - Open a total of 10 000 connections for this test.
  • --num-calls=20 - Perform 20 requests per connection (for a total of 10 000 * 20 = 200 000 requests).
  • --burst-length 20 - Pipeline the requests. This is mainly to speed up httperf itself since it's much faster to send all 20 requests in one go than send them individually. Varnish handles it correctly anyway.
  • The rest should be self-explanatory.

The first thing you should look at in the output is Errors:. If you get errors, there's a very good chance you were too optimistic with your --rate setting. Also note that the URI matters greatly. /misc/dummy.png is just that: a dummy PNG I keep around for testing (see for yourself at http://kly.no/misc/dummy.png). Let's try the same with the front page:

$ httperf --rate 2000 --num-conns=10000 --num-calls 20 \
        --burst-length 20 --server localhost --port 8080 --uri /

httperf --client=0/1 --server=localhost --port=8080 --uri=/
        --rate=2000 --send-buffer=4096 --recv-buffer=16384
        --num-conns=10000 --num-calls=20 --burst-length=20

httperf: warning: open file limit > FD_SETSIZE; limiting max. # of
         open files to FD_SETSIZE

Maximum connect burst length: 42

Total: connections 1738 requests 34760 replies 34760 test-duration 10.589 s

Connection rate: 164.1 conn/s (6.1 ms/conn, <=1018 concurrent connections)
Connection time [ms]: min 477.3 avg 4592.4 max 8549.5 median 5276.5 stddev 2233.3
Connection time [ms]: connect 8.7
Connection length [replies/conn]: 20.000

Request rate: 3282.8 req/s (0.3 ms/req)
Request size [B]: 62.0

Reply rate [replies/s]: min 3077.3 avg 3311.2 max 3545.1 stddev 330.8 (2 samples)
Reply time [ms]: response 3772.9 transfer 49.8
Reply size [B]: header 326.0 content 38915.0 footer 2.0 (total 39243.0)
Reply status: 1xx=0 2xx=34760 3xx=0 4xx=0 5xx=0

CPU time [s]: user 0.61 system 9.54 (user 5.8% system 90.1% total 95.9%)
Net I/O: 125998.3 KB/s (1032.2*10^6 bps)

Errors: total 8262 client-timo 0 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 8262 addrunavail 0 ftab-full 0 other 0

Now see how the errors piled up. This is because we exceeded the performance httperf could offer. Yeah, httperf is far from perfect. Also note the bandwidth usage and CPU usage. I'm not sure if it's a coincidence that we're so close to gigabit, since this is a Varnish server running on localhost.

What you also may want to look at is reply status, to check the status codes. You also want to pay attention to connection times. Let's take a look at the first example again:

Connection rate: 1413.2 conn/s (0.7 ms/conn, <=266 concurrent connections)
Connection time [ms]: min 1.6 avg 70.4 max 3049.8 median 33.5 stddev 286.4
Connection time [ms]: connect 27.9
Connection length [replies/conn]: 20.000

This tells me the average connection time was 70.4ms, with a maximum at 3049.8ms. 3 seconds is quite a long time. You may want to look at that. What I do when I debug stuff like this is make sure that I rule out the tool itself as the source of worry. There is no 100% accurate method of doing this, but given the CPU load of httperf at the time, it's reasonable to assume httperf is part of the problem here. You can experiment by slightly adjusting the --rate option to see if you're close to the breaking point of httperf.
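
A hedged way of doing that experiment is to simply sweep --rate and watch the error summary from each run, using the same dummy URI as before:

# sweep --rate and print only the error summary line from each run
for rate in 1000 1500 2000 2500 3000; do
    echo "--rate $rate"
    httperf --rate $rate --num-conns=10000 --num-calls 20 \
        --burst-length 20 --server localhost --port 8080 \
        --uri /misc/dummy.png | grep '^Errors: total'
done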

You also want to watch varnishstat during these tests.
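
If the full varnishstat screen is too much while the test runs, a minimal sketch is to dump just the hit, miss and hitpass counters once a second:

# counter names as found in Varnish 2/3; adjust for your version
while sleep 1; do varnishstat -1 | egrep 'cache_(hit|miss|hitpass)'; done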

So what did you just learn?

Frankly, very little.

Sure, this means I can run Varnish at around 30k req/s on my laptop, testing FROM my laptop too. But this is not that helpful.

What settings should I use?

Well, first of all, running 20 requests over a single connection is pointless. There's almost no browser or site out there that will cause this to happen. Depending on the site, numbers between 4 and 10 requests per connection are more realistic.

If all you want is a big number, then tons of requests over a single connection is fine. But it has nothing to do with reality.
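
If you want the test to look a bit more like a browser, something along these lines is more honest. It's the same test as before, just with 5 requests per connection instead of 20:

# same options as the earlier example, only --num-calls/--burst-length changed
httperf --rate 2000 --num-conns=10000 --num-calls 5 \
        --burst-length 5 --server localhost --port 8080 \
        --uri /misc/dummy.png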

You can get httperf to do some pretty cool things if you invest time in setting up thorough tests. It can generate URLs, for instance, if that's your thing. Or simulate sessions where it asks for one page, then three other pages over the same connection X amount of time later, etc. This is where the six-month testing period comes into play.

I consider it a much better practice to look at the access logs you have and use something simpler to iterate over the list. wget can do it, and I know several newspapers that use curl for just this purpose. It was actually curl that first showed me what happens when Varnish becomes CPU bound without session_linger set (this is set by default now, but for the curious: the request rate dropped to a twentieth of what it was a moment before, due to context switching).
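
A minimal sketch of that approach: replay URLs from an access log with curl. The file name access.log, the seventh-field assumption and www.example.com are all mine; adjust for your log format and site.

# access.log, the 7th field and www.example.com are assumptions
awk '{ print $7 }' access.log | while read uri; do
    curl -s -o /dev/null "http://www.example.com$uri"
done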

Conclusion

Test your site and by all means test Varnish, but do not assume that just because httperf or some other tool gives you 80 000 requests/second, this will match real-life traffic.

Proper testing is an art, and this is just a small look at some techniques I hope people find interesting.