Kristian Lyngstøl's Blog

The importance of hit-rate and why you should care

Posted on 2009-09-14

Tonight is election night in Norway, which means most media sites take a proper beating, which truly puts them to the test.

So what happens when your site has enormous amounts of traffic?

The cache hit-rate, and what a cache can do for you, is often significantly underestimated, and this is where I tell you why and what you can do about it.

Random data-example

If you're a news site, almost all of your actual content is identical for every single user, so in theory, you should be bound by your cache-speed. You shouldn't notice any real change on your backends. Sure, you may have some comment threads, but even with 100 comments a minute, that's what? 100 requests over 60 seconds going to a backend? About 1.7 backend requests per second? Then let's add a miss for every one of those, since you're updating something (let's assume you have something smart enough to handle that - you should): about 3.3 req/s to your backends.

Now let's add a site with, say, 100 000 objects, each with 15 minute TTLs. And for the heck of it, let's assume that every single object is constantly hit, meaning it's fetched from the backend once every 15 minutes. That's 100 000 requests every 900 seconds. Or 111.11... requests per second. And to be honest: it's ridiculous to assume you manage to fill/rotate your cache that fast, but we're not doing this math to be REALISTIC anyway.

So 111.1 + 3.3 would give you 114.4 requests/s, let's round that up to 120 req/s. That would be your theoretical maximum REGARDLESS of traffic on your frontends, and as you can see with this example, the actual 'dynamic' content only counts for a tiny fraction of this traffic. Again, this is all theoretical, but I'm sure you get the point.
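If you want to play with those numbers yourself, here is a minimal Python sketch of the same back-of-the-envelope estimate. The function name and parameters are made up for this post, and it assumes exactly what the text assumes: one write plus one miss per comment, and every object re-fetched once per TTL. Rounding aside, it lands at roughly 114 req/s, comfortably under the 120 req/s used above.

```python
# Back-of-the-envelope ceiling on backend load.
# backend_ceiling() is a hypothetical helper for illustration, not part of Varnish.

def backend_ceiling(comments_per_minute, objects, ttl_seconds):
    """Worst-case backend requests per second, independent of frontend traffic."""
    # Each comment costs one write plus one cache miss for the updated page.
    comment_rate = 2 * comments_per_minute / 60
    # Every cached object is fetched from the backend once per TTL.
    refresh_rate = objects / ttl_seconds
    return comment_rate + refresh_rate


ceiling = backend_ceiling(comments_per_minute=100, objects=100_000, ttl_seconds=15 * 60)
print(f"{ceiling:.1f} req/s")  # prints "114.4 req/s"
```

Note that frontend traffic isn't a parameter at all, which is the whole point.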

In reality, 120 req/s to a backend would be a lot if it's going through Varnish. It's not common. But stay with me.

Now, let's say your ACTUAL traffic is, say, 18 000 req/s on your front ends. Not unrealistic. Ok, so here's the kicker: you have a single element, say a JavaScript or "clever PHP thingamajig", that's included in your articles for some "cool effect" and that isn't cached. Let's assume that out of the 18k requests/s, you have 180 articles/s. Whoops, you went from 120 req/s (remember: this was an extreme estimate) to ... 120 + 180 = 300 req/s to your backends. And the worst part? This scales proportionally with your load. So your backends now perform based on your front traffic. Ouch.

So yeah, that was my entirely theoretical example...
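To see the "scales with your load" part in numbers, here is a small Python sketch. The function and its defaults are hypothetical and simply encode the example above: a fixed ceiling of 120 req/s, plus one uncached backend request for every article view, with articles making up 180 out of every 18 000 frontend requests.

```python
# How one uncached element ties backend load to frontend traffic.
# backend_rate() and its defaults are assumptions taken from the example above.

def backend_rate(frontend_rate, ceiling=120, articles=180, per_requests=18_000):
    """Backend req/s when every article view triggers one uncached request."""
    uncached = frontend_rate * articles / per_requests
    return ceiling + uncached


for frontend in (18_000, 36_000, 72_000):
    print(f"{frontend} req/s on the front -> {backend_rate(frontend):.0f} req/s on the backends")
```

Double the frontend traffic and the uncached part doubles with it, while the cached part stays where it was.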

Or put another way

Your cache hit-rate is 97%, pretty good, huh? You see a few backend requests that could probably be tuned, but there's not really a point, is there? I mean, 97% is damn high, and there might be some tricky stuff to figure out to increase this hit-rate.

The problem is easier to understand if you turn this upside down. So let's start over: You have a hit-rate of 99.9%, and a miss-rate of 0.1%. Now you make a few changes and the hit-rate drops to 97%, or a miss-rate of 3%. Your backend traffic is suddenly 30 times what it was. Yes, that's THREE THOUSAND percent of the original. If you scale this up, I'm fairly sure you'll understand how much a well-tuned cache can help your backends during peaks. Even if you re-do this math and say you drop from 99% to 97%, you're still talking three times the backend traffic.

Or put yet another way: Your relative backend traffic is cut in half from 0% to 50% hit-rate. And it's cut in half from 98% to 99%. And it's cut in half from 99.8% to 99.9%, and so on... And on most busy sites, this actually scales with the load.
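The trick in all of these is to compare miss-rates, not hit-rates. A minimal Python sketch, with a made-up function name, assuming frontend traffic stays the same before and after the change:

```python
# Relative backend traffic after a hit-rate change, at constant frontend traffic.
# backend_multiplier() is a hypothetical helper for illustration.

def backend_multiplier(old_hit_rate, new_hit_rate):
    """Ratio of new backend traffic to old, i.e. new miss-rate / old miss-rate."""
    return (1 - new_hit_rate) / (1 - old_hit_rate)


# Dropping from 99.9% to 97% gives roughly 30 times the backend traffic.
print(f"{backend_multiplier(0.999, 0.97):.1f}x")
# Dropping from 99% to 97% still gives roughly 3 times.
print(f"{backend_multiplier(0.99, 0.97):.1f}x")
# Each of these steps roughly halves the backend traffic.
for old, new in ((0.0, 0.5), (0.98, 0.99), (0.998, 0.999)):
    print(f"{old:.1%} -> {new:.1%}: {backend_multiplier(old, new):.2f}x")
```

A hit-rate of 100% is left out on purpose: the old miss-rate would be zero and there is nothing to divide by.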

The tricks of the trade

I'll go into more depth further down with code-examples for the relevant steps.

Step 0: Call us. We are the experts on tuning Varnish. Yes, this will cost real money. If you've read this far, there's a good chance it's worth it to you. If you want to keep the knowledge in-house and not depend on consultants, we offer training in Varnish too, which could save you time.

Step 1: Ask yourself how much your site actually changes, and compare this with your hit-rate, as seen in varnishstat.

Step 2: Determine what's causing backend requests. The most common sinner is cookies. Keep in mind that cookies exist in browsers (for a while) even if you stop sending Set-Cookie headers, and that cookies are set for a path. Remember that you need to deal with both Set-Cookie headers from backends and Cookie headers from clients. The best approach here is to put as much of the logic as possible on your backends; don't let them send cookies unless you need them, and/or limit them to a path that you can easily test against.

Step 3: Normalize your Host header and the headers used in Vary. This will reduce both your miss-rate and your memory usage. I.e.: www.example.com and example.com don't need to be cached separately.

Step 4: Review your hit-rate once again.

Step 5: Check "varnishtop -i TxURL". This is one of the niftiest tricks we have for figuring out what's causing backend traffic. It will list all requests going to a backend, grouped by URL and sorted by a decaying average of frequency. Basically the number on the left should be single-digit and preferably 1 or less (a higher number means the backend request is taking place frequently). You can then make special adjustments in VCL to reduce the miss-rate if it's possible.

Step 6: Cache everything! Even that frequently updated comment-counter can be cached; if your front page is taking 100 hits/s, a single object on pass will result in 100 backend requests/s, while that same object with a 10s cache period will only result in 0.1 backend requests/s. So even tiny TTLs matter. In several cases, we've seen a TTL of only a few seconds being the difference between a sluggish site and a snappy site.
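The arithmetic behind step 6 can be sketched in a few lines of Python. The function name is made up, and the sketch assumes the object is requested steadily and fetched from the backend at most once per TTL (no purges, no cache evictions):

```python
# Backend load for a single object: pass versus a short TTL.
# backend_rate_for_object() is a hypothetical helper for illustration.

def backend_rate_for_object(hits_per_second, ttl_seconds):
    """Backend req/s for one object, given its frontend hit rate and TTL."""
    if ttl_seconds <= 0:
        # No caching (pass): every frontend hit goes to the backend.
        return hits_per_second
    # Cached: at most one backend fetch per TTL, never more than the hits themselves.
    return min(hits_per_second, 1 / ttl_seconds)


print(backend_rate_for_object(100, 0))   # on pass: 100 req/s
print(backend_rate_for_object(100, 10))  # 10s TTL: 0.1 req/s
print(backend_rate_for_object(100, 1))   # even a 1s TTL: 1.0 req/s
```

Once an object is cached, its backend load depends on the TTL and no longer on how hard the front page is being hit.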

Examples

I don't advise you to copy/paste this verbatim, but use it to learn.

sub vcl_recv {
        /* Remove cookies from objects that don't need them. */
        if (req.url ~ "\.(css|jpg|gif|jpeg|png|html)(\?.*)?$") {
                remove req.http.cookie;
        }
        /* Or better yet: Remove cookies for anything that doesn't live under the /dynamic/ tree: */
        if (!req.url ~ "^/dynamic/") {
                remove req.http.cookie;
        }
        /* This site has multiple ways to reach it, all deliver the same content, let's normalize it. */
        set req.http.host = regsub(req.http.host,"^(www|www2|img|media|hostname|php)\.example\.com$","example.com");

        /* Let's assume we Vary on Accept-Encoding, but we only distinguish between nothing < gzip < deflate. */
        if (req.http.Accept-Encoding ~ "deflate") {
                set req.http.Accept-Encoding = "deflate";
        } elsif (req.http.Accept-Encoding ~ "gzip") {
                set req.http.Accept-Encoding = "gzip";
        } else {
                remove req.http.Accept-Encoding;
        }
}

sub vcl_fetch {
        /* Only dynamic content is allowed to set cookies. */
        if (!req.url ~ "^/dynamic/") {
                remove obj.http.set-cookie;
        }
        /* Default ttl is based on: Cache-Control: s-maxage if present, otherwise
         * Cache-Control: max-age. If neither s-maxage nor max-age exists, use Expires,
         * and if that's not present either, use default_ttl as per parameters.
         * All other headers are IGNORED (i.e. Pragma, Cache-Control: no-cache, Cache-Control: private, etc.)
         * For details, see bin/varnishd/rfc2616.c.
         * So let's ensure that non-dynamic content has a TTL. This shouldn't be necessary if
         * your backend behaves itself with regard to headers.
         */
        if (req.url !~ "^/dynamic/" && obj.ttl < 10s) {
                set obj.ttl = 30m;
                /* If the default TTL rules set a ttl of 0 or less, obj.cacheable is set to false as well.
                 * You can override this, and you should if you are changing from not caching to caching.
                 */
                set obj.cacheable = true;
        }
        /* The commentcounter.php is data used to fill out the number of comments on articles on your front page.
         * It's safe to cache it for a while.
         */
        if (req.url ~ "^/dynamic/commentcounter\.php") {
                set obj.ttl = 10s;
                set obj.cacheable = true;
        }
}

Addendum

While Varnish isn't the whole picture, your cache can be the difference between easy scaling and impossible scaling. Don't underestimate the value of a well-tuned cache layer.

[edit] Seems I misread/misremembered rfc2616.c: Expires is only taken into account in the absence of Cache-Control s-maxage and max-age.