Posted on 2012-04-19
How do you test a Varnish site, and what tools should you use? These are questions I see asked frequently, and there is no simple answer. In this blog post, I'll go through some of what you can do to test a Varnish site.
This is not about benchmarking. I don't do benchmarking, never have. Why not? Because it's exceedingly hard and very few people succeed at proper benchmarking.
Neither is this a blog post about testing functionality on your site. You should be doing that already. I'll only say that you should test functionality, and it's often best done by browsing the site.
Also, don't expect it to be all that complete. Ask questions in comments and I might expand upon it!
Despite what most people ask about, the tools you choose are not nearly as important as knowing what you want to test.
If you are hosting videos, I doubt requests per second is a sensible metric. There are a few things you need to ask yourself: How much of the site can be cached? What sort of traffic pattern do you expect? How much bandwidth will you need?
These questions are important, and they relate to each other.
If your site is already in production under some different architecture, you are in luck. Your access logs can tell you a lot about the traffic pattern you can expect. This is a great start.
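As a rough sketch of what that looks like, the following counts requests per second in an access log to find the peak rate you actually have to handle. The log file and its format (common log format) are assumptions here; a tiny embedded sample stands in for your real log.

```shell
# Tiny sample access log; in real life, point the pipeline at your actual log.
cat > access.log <<'EOF'
1.2.3.4 - - [19/Apr/2012:10:00:00 +0200] "GET / HTTP/1.1" 200 1234
1.2.3.4 - - [19/Apr/2012:10:00:00 +0200] "GET /misc/dummy.png HTTP/1.1" 200 178
5.6.7.8 - - [19/Apr/2012:10:00:01 +0200] "GET / HTTP/1.1" 200 1234
EOF
# Extract the timestamp (to the second) and count occurrences; the busiest
# seconds give you a peak request rate to aim for in your tests.
awk -F'[][]' '{ print $2 }' access.log | cut -d' ' -f1 | sort | uniq -c | sort -rn | head -5
```

The busiest second comes out on top, so you immediately see whether you are dealing with dozens or thousands of requests per second.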
If this is a new site, though, it can be harder to estimate the answers. I recommend starting with the question of how much of your site it is possible to cache. If your site is mostly static content, Varnish will be able to help you a lot, assuming you set things up accordingly. If it's a site for logged-in users, you have a much harder task. It's still possible to cache content with Varnish, but it's much harder. The details of how to do that are beyond the scope of this post.
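A quick sanity check on cacheability is to look at the response headers a URL returns; with a default Varnish configuration, a Set-Cookie header means the response will not be cached. A minimal sketch, with canned headers (made-up values) standing in for the real request, which would be something like "curl -sI http://www.example.com/":

```shell
# Canned response headers; in real life: curl -sI http://www.example.com/ > headers.txt
cat > headers.txt <<'EOF'
HTTP/1.1 200 OK
Cache-Control: max-age=120
Set-Cookie: session=abc123; path=/
Content-Type: text/html
EOF
# The headers that decide cacheability. The Set-Cookie here means a
# default Varnish setup will refuse to cache this response.
grep -i -E '^(cache-control|set-cookie|vary|age):' headers.txt
```

Run this against a handful of representative URLs from your backend and you get a fair first estimate of how much Varnish can do for you.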
As long as you can cache the majority of the content, chances are you will not be CPU bound as far as Varnish is concerned.
Testing a Varnish-site can be really fast or you can use the next six months doing it. Let's start by getting a baseline.
I usually start out with something truly simple: look at varnishstat and varnishlog while you use a browser to browse the site. It's important that this is done with a browser and not a script, because your users are likely using browsers too, and you want to catch all the things they catch, like cookies.
To set this up, the best way is to modify /etc/hosts (or the Windows equivalent; there is one, all the viruses use it). The reason you don't want to just add a test domain is that your site will go on-line using the real domain, not a test domain. A typical /etc/hosts file could look like this for me:
127.0.0.1   localhost
127.0.1.1   freud.kly.no freud
127.0.0.1   www.example.com example.com media.example.com

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Even better is if the test client is an external machine. Then make sure you block any access to other port-80 services for that site from it. This ensures that you don't miss any sub-domains: a request to a sub-domain you forgot to point at Varnish will fail visibly instead of silently hitting the old site.
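On a Linux test client, that blocking can be a couple of firewall rules. A sketch only, assuming 192.0.2.10 is the address of the Varnish test server; run as root and adjust to your setup:

```shell
# Allow port-80 traffic to the Varnish test server...
iptables -A OUTPUT -p tcp --dport 80 -d 192.0.2.10 -j ACCEPT
# ...and reject all other outgoing port-80 traffic, so any request to a
# sub-domain you forgot to point at Varnish fails loudly.
iptables -A OUTPUT -p tcp --dport 80 -j REJECT
```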
What you are looking for is cache hits, misses and hitpasses. This should reveal if the site handles cookies properly or not. You may also want to fire up a different browser.
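The counters to watch in varnishstat while you browse are cache_hit, cache_miss and cache_hitpass. A small sketch, with canned varnishstat -1 output (made-up numbers) standing in for the real thing:

```shell
# Canned counters; in real life: varnishstat -1 | egrep 'cache_(hit|miss|hitpass)'
cat > stat.out <<'EOF'
client_conn             1200         2.40 Client connections accepted
cache_hit               1024         2.00 Cache hits
cache_miss               128         0.25 Cache misses
cache_hitpass             16         0.03 Cache hits for pass
EOF
# A high hitpass count relative to hits usually means cookies or other
# headers are forcing requests past the cache.
egrep 'cache_(hit|miss|hitpass)' stat.out
```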
You also want to keep an eye on the headers going back and forth, for things like cookies being set where you did not expect them.
Once you've got this nailed down, and if you still doubt the speed of Varnish, we can always throw in wget too:
wget --delete-after -p -r www.example.com
This will give you a recursive request for www.example.com with all prerequisites (CSS, images, etc). It's not very useful in itself but it will give you a feel for how fast the site is without Varnish and then after you've cached it with Varnish. You can easily run multiple wget-commands in parallel to gauge the bandwidth usage:
while sleep 1; do wget --delete-after -p -r www.example.com; done &
while sleep 1; do wget --delete-after -p -r www.example.com; done &
while sleep 1; do wget --delete-after -p -r www.example.com; done &
Ideally this should be network-bound, but realistically speaking, wget is not /that/ fast when it comes to tiny requests.
Keep in mind that you are likely going to be hitting a DNS server frequently, especially if you don't use /etc/hosts. I've had DNS servers running at 50-70% CPU when I've done stress testing in the past, which means the DNS server is affecting the test more than you want it to.
So far none of these tricks have been very fancy.
So you won't reach 275 kreq/s using wget. I'm not sure that should be a goal either, but it's worthwhile taking a look at.
If you are moving on to testing just Varnish, not the site itself, then it's time to move away from browsers and wget. There are several tools available for this, and I tend to prefer httperf. It's not a good tool by any sensible measure, but it's a fast one. The best way to learn httperf is to stick all the arguments into a text file and set up a shell script that randomly picks them until you find something that works. The manual pages are unhelpful at best.
An alternative to httperf is siege. I'm sure siege is great, if you don't mind that it'll run into a wall and kill itself long before your web server does. If you want further proof, take a look at this part of siegerc, documenting Keep-Alive:
# Connection directive. Options "close" and "keep-alive"
# Starting with release 2.57b3, siege implements persistent
# connections in accordance to RFC 2068 using both chunked
# encoding and content-length directives to determine the
# page size. To run siege with persistent connections set
# the connection directive to keep-alive. (Default close)
# CAUTION: use the keep-alive directive with care.
# DOUBLE CAUTION: this directive does not work well on HPUX
# TRIPLE CAUTION: don't use keep-alives until further notice
# ex: connection = close
#     connection = keep-alive
#
connection = close
A stress testing tool that doesn't support keep-alive properly isn't very helpful. Whenever I use siege, it tends to max out at about 5000-10000 requests/second.
There's also Apache Bench, commonly known as just ab. I've rarely used it, but what little use I've seen of it has not been impressive. It supports keep-alive, but my brief look at it showed no way to control the keep-alive behaviour. From basic tests, it also seemed slightly slower than httperf. It does seem better today than it was the first time I looked at it, though. For this blog post, I'll use httperf, simply because it's the tool I'm most familiar with and the one which has given me the right combination of control and performance.
However, httperf has several flaws of its own, the worst being that it's easy to push it past its own breaking point without noticing.
The trick to httperf is to use --rate when you can. A typical httperf command might look like this (run on my laptop):
$ httperf --rate 2000 --num-conns=10000 --num-calls 20 --burst-length 20 --server localhost --port 8080 --uri /misc/dummy.png
httperf --client=0/1 --server=localhost --port=8080 --uri=/misc/dummy.png --rate=2000 --send-buffer=4096 --recv-buffer=16384 --num-conns=10000 --num-calls=20 --burst-length=20
httperf: warning: open file limit > FD_SETSIZE; limiting max. # of open files to FD_SETSIZE
Maximum connect burst length: 20

Total: connections 10000 requests 200000 replies 200000 test-duration 7.076 s

Connection rate: 1413.2 conn/s (0.7 ms/conn, <=266 concurrent connections)
Connection time [ms]: min 1.6 avg 70.4 max 3049.8 median 33.5 stddev 286.4
Connection time [ms]: connect 27.9
Connection length [replies/conn]: 20.000

Request rate: 28264.9 req/s (0.0 ms/req)
Request size [B]: 76.0

Reply rate [replies/s]: min 39514.8 avg 39514.8 max 39514.8 stddev 0.0 (1 samples)
Reply time [ms]: response 35.9 transfer 0.0
Reply size [B]: header 317.0 content 178.0 footer 0.0 (total 495.0)
Reply status: 1xx=0 2xx=200000 3xx=0 4xx=0 5xx=0

CPU time [s]: user 1.33 system 5.61 (user 18.8% system 79.3% total 98.0%)
Net I/O: 15761.0 KB/s (129.1*10^6 bps)

Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0
Note that httperf will echo the command you ran back to you, with all options expanded. I took the liberty of formatting the output a bit to make it easier to read. The options I use here are: --rate 2000 to open 2000 new connections per second, --num-conns=10000 for 10000 connections in total, --num-calls 20 for 20 requests per connection, --burst-length 20 to send those requests in a single burst, and --server, --port and --uri to tell it what to fetch.
The first thing you should look at in the output is Errors:. If you get errors, there's a very good chance you were too optimistic with your --rate setting. Also note that the URI matters greatly. /misc/dummy.png is just that: a dummy PNG I keep around for testing (see for yourself at http://kly.no/misc/dummy.png). Let's try the same with the front page:
$ httperf --rate 2000 --num-conns=10000 --num-calls 20 --burst-length 20 --server localhost --port 8080 --uri /
httperf --client=0/1 --server=localhost --port=8080 --uri=/ --rate=2000 --send-buffer=4096 --recv-buffer=16384 --num-conns=10000 --num-calls=20 --burst-length=20
httperf: warning: open file limit > FD_SETSIZE; limiting max. # of open files to FD_SETSIZE
Maximum connect burst length: 42

Total: connections 1738 requests 34760 replies 34760 test-duration 10.589 s

Connection rate: 164.1 conn/s (6.1 ms/conn, <=1018 concurrent connections)
Connection time [ms]: min 477.3 avg 4592.4 max 8549.5 median 5276.5 stddev 2233.3
Connection time [ms]: connect 8.7
Connection length [replies/conn]: 20.000

Request rate: 3282.8 req/s (0.3 ms/req)
Request size [B]: 62.0

Reply rate [replies/s]: min 3077.3 avg 3311.2 max 3545.1 stddev 330.8 (2 samples)
Reply time [ms]: response 3772.9 transfer 49.8
Reply size [B]: header 326.0 content 38915.0 footer 2.0 (total 39243.0)
Reply status: 1xx=0 2xx=34760 3xx=0 4xx=0 5xx=0

CPU time [s]: user 0.61 system 9.54 (user 5.8% system 90.1% total 95.9%)
Net I/O: 125998.3 KB/s (1032.2*10^6 bps)

Errors: total 8262 client-timo 0 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 8262 addrunavail 0 ftab-full 0 other 0
Now see how the errors piled up. This is because we exceeded the performance httperf could offer. Yeah, httperf is far from perfect. Also note the bandwidth usage and CPU usage. I'm not sure if it's a coincidence that we're so close to gigabit, since this is a Varnish server running on localhost.
What you also may want to look at is reply status, to check the status codes. You also want to pay attention to connection times. Let's take a look at the first example again:
Connection rate: 1413.2 conn/s (0.7 ms/conn, <=266 concurrent connections)
Connection time [ms]: min 1.6 avg 70.4 max 3049.8 median 33.5 stddev 286.4
Connection time [ms]: connect 27.9
Connection length [replies/conn]: 20.000
This tells me the average connection time was 70.4 ms, with a maximum of 3049.8 ms. Three seconds is quite a long time, and something you may want to look into. What I do when I debug things like this is first make sure I rule out the tool itself as the source of the problem. There is no 100% accurate way of doing this, but given the CPU load of httperf at the time, it's reasonable to assume httperf is part of the problem here. You can experiment by slightly adjusting the --rate option to see if you're close to the breaking point of httperf.
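When comparing runs at different --rate settings, it helps to pull out just the numbers you care about. A small sketch, with a few lines from the run above standing in for httperf output you would normally have saved to a file:

```shell
# Stand-in for saved httperf output (lines taken from the run above).
cat > httperf.out <<'EOF'
Request rate: 28264.9 req/s (0.0 ms/req)
Connection time [ms]: min 1.6 avg 70.4 max 3049.8 median 33.5 stddev 286.4
Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
EOF
# Pull out request rate, maximum connection time and total errors.
awk '/^Request rate:/ { print "req/s: " $3 }
     /^Connection time \[ms\]: min/ { print "max conn time: " $9 " ms" }
     /^Errors: total/ { print "errors: " $3 }' httperf.out
```

Three numbers per run make it much easier to spot the rate at which errors start piling up or connection times blow out.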
You also want to watch varnishstat during these tests.
Frankly, these numbers tell us very little.
Sure, they mean I can run Varnish at around 30k req/s on my laptop, testing FROM my laptop too. But that is not very helpful.
Well, first of all, running 20 requests over a single connection is pointless. There's almost no browser or site out there that will cause this to happen. Depending on the site, numbers between 4 and 10 requests per connection are more realistic.
If all you want is a big number, then tons of requests over a single connection is fine. But it has nothing to do with reality.
You can get httperf to do some pretty cool things if you invest time in setting up thorough tests. It can generate URLs, for instance, if that's your thing. Or simulate sessions where it asks for one page, then three other pages over the same connection some amount of time later, and so on. This is where the six-month testing period comes into play.
I consider it a much better practice to look at access logs you have and use something simpler to iterate the list. wget can do it, and I know several newspapers that use curl for just this purpose. It was actually curl that first showed me what happens when Varnish becomes CPU bound without having a session_linger set (this is set by default now, but for the curious, what happened was that the request rate dropped to a 20th of what it was a moment before, due to context switching).
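A minimal sketch of that approach, with a tiny embedded sample standing in for your real access log (the domain and paths are placeholders; the replay assumes /etc/hosts points www.example.com at your Varnish server):

```shell
# Tiny sample access log; use your real one in practice.
cat > access.log <<'EOF'
1.2.3.4 - - [19/Apr/2012:10:00:00 +0200] "GET / HTTP/1.1" 200 1234
1.2.3.4 - - [19/Apr/2012:10:00:01 +0200] "GET /misc/dummy.png HTTP/1.1" 200 178
5.6.7.8 - - [19/Apr/2012:10:00:02 +0200] "GET / HTTP/1.1" 200 1234
EOF
# Most-requested paths first, turned into full URLs for replay.
awk '{ print $7 }' access.log | sort | uniq -c | sort -rn \
    | awk '{ print "http://www.example.com" $2 }' > urls.txt
cat urls.txt
# Replay with curl, e.g.:
# while read -r url; do
#     curl -s -o /dev/null -w '%{http_code} %{time_total}\n' "$url"
# done < urls.txt
```

The point is that the URL mix comes from real traffic, so the hit rate you see is far closer to what production will look like than any single-URI httperf run.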
Test your site, and by all means test Varnish, but do not assume that real-life traffic will behave the same just because httperf or some other tool gives you 80 000 requests/second.
Proper testing is an art and this is just a small look at some techniques I hope people find interesting.