Kristian Lyngstøl's Blog

The many pitfalls of benchmarking

Posted on 2011-03-16

I was made aware of a synthetic benchmark that concerned Varnish today, and it looked rather suspicious. The services tested were Varnish, nginx, Apache and G-Wan, and G-Wan came out an order of magnitude faster than Varnish. This made me question the result. The first thing I noticed was AB, a tool I've long since given up trying to make behave properly. As there was no detailed data, I decided to give it a spin myself.

You will not find graphs. You will not find "this is best!"-quotes. I'm not even backing up my statements with httperf-output.

Disclaimer

This is not a comparison of G-Wan versus Varnish. It is not complete, nor even a vague attempt at making either G-Wan or Varnish perform better or worse. It is not realistic, and it is in no way a reflection on the overall functionality, usability or performance of G-Wan.

Why not? Because I would be stupid to publicize such things without directly consulting the developers of G-Wan so that the comparison would be fair. I am a Varnish-developer.

This is a text about stress testing. Not the result of stress testing. Nothing more.

The basic idea

So G-Wan was supposedly much faster than Varnish. Its feature-set is also very narrow, as it goes about things differently. The test showed Varnish, Apache and nginx performing at roughly comparable levels, whereas G-Wan was ridiculously faster. The test was also conducted on a local machine (so no networking) and using AB. As I know it's hard to get nginx, Apache and Varnish to perform within the same level, this indicated to me that G-Wan was doing something differently that skewed the test.

I installed G-Wan and Varnish on a virtual machine and started playing with httperf.
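For reference, in case you want to play along: a basic httperf run against a local server looks roughly like the sketch below. The server address, port, URI and numbers are placeholders, not the parameters from my tests.

    # Open 10000 connections at 500 connections/second, with 10
    # requests per connection; all values here are illustrative.
    httperf --server 192.168.122.10 --port 8080 --uri /index.html \
            --rate 500 --num-conns 10000 --num-calls 10 --timeout 5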

What to test

The easiest number to demonstrate in a test is the maximum request rate. It tells you what the server can do under maximum load. However, it is also the hardest test to do precisely and fairly across daemons of vastly different nature.

Another thing I have rarely written about is the response time of Varnish for average requests. This is often much more interesting to the end user, as your server isn't going to be running at full capacity anyway. Fairness and concurrency are also highly relevant: a user doing a large download shouldn't adversely affect other users.

I wasn't going to bother with all that.

First test

The first test I did was of the "max req/s" kind. It quickly showed that G-Wan was very fast, and in fact faster than Varnish. At first glance. The actual request-rate was higher. The CPU-usage was lower. However, Varnish is massively multi-threaded, which skews the CPU measurements greatly, so I wasn't about to trust them.

Looking closer I realized that the real bottleneck was in fact httperf. With Varnish, it was able to keep more connections open and busy at the same time, and thus hit the upper limit of concurrency. This in turn gave subtle and easily ignored errors on the client which Varnish can do little about. It seemed G-Wan was dealing with fewer sessions at the same time, but faster, which gave httperf an easier time. This does not benefit G-Wan in the real world (nor does it necessarily detract from its performance), but it does create an unbalanced synthetic test.

I experimented with this quite a bit, and quickly concluded that the level of concurrency was much higher with Varnish. But it was difficult to measure. Really difficult. Because I did not want to test httperf.
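If you try something like this yourself, one general httperf caveat (not specific to my setup): the client side is usually capped by its file-descriptor limit, and when it hits that cap httperf just counts the failures in its "Errors:" lines instead of aborting. Raising the limit before a high-concurrency run looks roughly like this; depending on how your httperf was built, you may also need a larger FD_SETSIZE at compile time.

    # Raise the per-process fd limit first; otherwise a non-zero
    # fd-unavail count in httperf's "Errors:" output silently caps
    # the effective concurrency. Values are illustrative.
    ulimit -n 65535
    httperf --hog --server 192.168.122.10 --port 8080 --uri /index.html \
            --rate 2000 --num-conns 20000 --num-calls 1 --timeout 5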

The hardware I used was my home-computer, which is ridiculously overpowered. The VM (KVM) was running with two CPU cores and I executed the clients from the host-OS instead of booting up physical test-servers. (... That 275k req/s that's so much quoted? Spotify didn't skip a beat while it was running (on the same machine). ;))

Conclusion

The more I tested, the clearer it became that I could produce any result I wanted by tweaking the level of concurrency, the degree of load, the amount of bandwidth required and so forth.
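To give an idea of what that tweaking amounts to in practice: a sweep is just a loop over httperf invocations with different parameters, along these lines (the values are arbitrary examples, and the egrep merely pulls out the interesting lines of httperf's report):

    # Sketch of a load sweep; rates, connection counts and
    # requests-per-connection are arbitrary examples.
    for rate in 100 500 1000 2000; do
        for calls in 1 10 100; do
            httperf --server 192.168.122.10 --port 8080 --uri /index.html \
                    --rate $rate --num-conns 5000 --num-calls $calls \
                    --timeout 5 | egrep 'Request rate|Reply time|Errors'
        done
    done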

The response time of G-Wan seemed to deteriorate with load, but that might just as well be the test environment. As the load went up, it took a long time to get a response. This is just not the case with Varnish at all. I ended up doing a little hoodwinking at the end to see how far this went, and the results varied wildly with tiny variations of the test-parameters. The concurrency is a major factor, and the speed of Varnish on each individual connection played a huge part. At large numbers of parallel requests, Varnish was fast enough on every connection that httperf never ran into problems, while G-Wan was more uneven and thus triggered client-side failures (and looked slower)...

My only conclusion is that it will take me several days to properly map out the performance patterns of Varnish compared to G-Wan. They treat concurrent connections vastly differently and perform very differently depending on the load-pattern you throw at them. Relating this to real traffic is very hard.

But this confirms my suspicion about the bogusness of the blog post that led me to perform these tests. It's not that I mind Varnish losing performance tests if we are actually slower, but it's very hard to stomach when the nature of the test is so dubious. The art of measuring realistic performance with synthetic testing is not one that can be mastered in an afternoon.

Lessons learned

(I think conclusions are supposed to be last, but never mind)

First: Be skeptical of unbalanced results. And of even results.

Second: Measure more than one factor. I've mainly focused on request-rate in my posts because I do not compare Varnish to anything but itself. Without a comparison it doesn't make that much sense to provide reply latency (though I suppose I should start supplying a measure of concurrency, since that's one of the huge strong-points of Varnish).

Third: Conclude carefully. This is an extension of the first lesson.

A funny detail: while I read the license for the non-free G-Wan, which I always do for proprietary software, I was happy to see that it didn't have a benchmark-clause (Oracle, anyone?). But it does forbid removing or modifying the Server:-header. It also forces me to give the G-Wan-guys permission to use my use of G-Wan in their marketing... Hmm, maybe I should... err, never mind.