Posted on 2010-01-08
Yesterday presented an interesting case of "what the hey is going on!?" for a customer of ours. The problem presented itself as a front page that sporadically needed 40 to 50 seconds to purge. After a brief hunt I discovered a somewhat interesting set of coincidences.
Before I go into detail, I need to tell you what a hitpass object is and why it's needed.
Varnish can pass objects - that is to say, deliver an object without storing it in the cache - at two different stages in the request handling. The first and simplest stage you can pass at is before you have contacted the web server, either because you know that every page under a specific url should never be cached, or because a cookie is present that the backend needs to see. This is done in vcl_recv. The slightly more complex stage to pass at is after Varnish has gotten a reply from the backend, typically because it sent a "Set-Cookie" header or otherwise had headers indicating that the page isn't cacheable. This is done in vcl_fetch.
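To make the two stages concrete, here is a minimal sketch in the Varnish 2-style VCL of the time. The "/admin/" prefix and the Set-Cookie test are examples of my own choosing, not rules from any particular site:

    sub vcl_recv {
        # We know up front that nothing under /admin/ should be cached,
        # so we can pass before the backend is ever contacted.
        if (req.url ~ "^/admin/") {
            pass;
        }
    }

    sub vcl_fetch {
        # Only after the backend has answered do we see that the response
        # sets a cookie, so the pass has to happen here instead.
        if (obj.http.Set-Cookie) {
            pass;
        }
    }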
Normally Varnish tries to request a specific object only once: since it will look the same to all users, there is no reason to make the web server generate it multiple times, and it's definitely not faster either. When you are passing an object, however, every client request has to go to the web server, so serializing the requests for the same url is not wanted.
If you pass in vcl_recv, it's easy for Varnish to avoid this serialization: it already knows that the next request for the same url will have to go to the backend, so it makes an "anonymous" object that is never entered into the cache. A very simple and effective design.
But in vcl_fetch, things are not so easy. Varnish can't predict a pass in vcl_fetch ahead of time, so it would have to serialize these requests as if the object were being cached. This is obviously not a desirable situation. Enter the hitpass object.
To avoid serialized requests after a pass in vcl_fetch, Varnish makes a hitpass object and enters it into the cache for the duration of obj.ttl. It simply says that this url will be passed for that period of time. It looks almost like a normal object in the cache, but it has no content, just a flag telling Varnish to fetch from the backend. Thus Varnish can perform parallel requests again, because it knows the response will not be cached before it contacts the web server. The moment Varnish sees a hitpass object, it dereferences (and thus unlocks) it and makes an anonymous one, entering the same code path as a pass in vcl_recv.
Now, back to my original story and purging. What's it got to do with hitpass objects? Well, the purges on this particular site are done in vcl_hit/vcl_miss by resetting the ttl to 0, effectively expiring the object. The problem was that at some point, a header on the relevant front page had told Varnish not to cache it. So a hitpass object was made, with a ttl of roughly 4 days. As a result the VCL never hit vcl_hit or vcl_miss - where the purging was being done - but instead went to vcl_pass and straight to the backend, which kindly generated the page even though it had just received a PURGE request. A bit strange, but not all that strange. What's worse, of course, is that the front page wasn't cached at all during this period. As it turns out, Varnish also has a minor issue, which I discovered yesterday as well: it doesn't actually purge hitpass objects correctly, at least not in this particular version of Varnish.
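For reference, this style of purging usually looks roughly like the classic Varnish 2 pattern below. This is a paraphrase, not the customer's actual VCL, and a real setup would also restrict which clients are allowed to send PURGE:

    sub vcl_hit {
        if (req.request == "PURGE") {
            # Expire the cached object on the spot.
            set obj.ttl = 0s;
            error 200 "Purged.";
        }
    }

    sub vcl_miss {
        if (req.request == "PURGE") {
            # Nothing cached under this hash to expire.
            error 404 "Not in cache.";
        }
    }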
And if you are thinking "why not set the ttl to 0 in vcl_pass if you see a PURGE?": even if we allowed you to modify obj.ttl in vcl_pass, it wouldn't be the ttl of the hitpass object, but of the anonymous object. This is also why you can't simply decide to start caching it again - when you reach vcl_fetch after hitting a hitpass object, the object you are working on is anonymous and never entered into the cache.
My solution? Add an 'x' to the cache hash for the particular url/host combination as an immediate "fix". It was either that or restarting Varnish at peak hour. Oh, and add some sanity to the VCL to reduce the effect of this in the future.
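The workaround looked something along these lines. The first part is the stock Varnish 2 vcl_hash, and the host name is a stand-in for the real site:

    sub vcl_hash {
        set req.hash += req.url;
        if (req.http.host) {
            set req.hash += req.http.host;
        } else {
            set req.hash += server.ip;
        }
        # Emergency workaround: salt the hash for the affected front page
        # so lookups no longer find the long-lived hitpass object.
        if (req.http.host == "www.example.com" && req.url == "/") {
            set req.hash += "x";
        }
        hash;
    }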
Understand hitpass objects. If you are doing 'pass' in vcl_fetch, remember that the TTL matters. You may want to replace a bare 'pass' in vcl_fetch with your own pass subroutine that first sets the ttl to, say, 1 minute, and use that subroutine consistently. That way you avoid long-lived hitpass objects, but still get the benefit of them when you want them.
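Something along these lines, where the Set-Cookie test stands in for whatever conditions make you pass today, and 'my_pass' is just a name I picked:

    sub my_pass {
        # Limit the lifetime of the resulting hitpass object to one minute,
        # instead of whatever ttl the backend response happened to get.
        set obj.ttl = 60s;
        pass;
    }

    sub vcl_fetch {
        if (obj.http.Set-Cookie) {
            call my_pass;
        }
    }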
Also: if you are using the artificial HTTP "PURGE" method to set the ttl to zero in vcl_hit, don't forget to check for it in vcl_pass. You probably want to throw a 503 error in vcl_pass if you see a PURGE request.
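A minimal sketch of that check; the error text is arbitrary:

    sub vcl_pass {
        if (req.request == "PURGE") {
            error 503 "PURGE on a passed object";
        }
    }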