Saving a request

This chapter is for the system administration course only

  • Grace and grace mode
  • Health checks
  • Saint mode
  • return (restart);
  • Directors
  • Using ACLs

Varnish has several mechanisms for recovering from problematic situations. It can retry a request to a different server, it can perform health checks, use an otherwise expired object and more.

This chapter discusses how these features interact with each other and how you can combine them to make your Varnish setup far more robust.

Core grace mechanisms

  • A graced object is an object that has expired, but is still kept in cache
  • Grace mode is when Varnish uses a graced object
  • There is more than one way Varnish can end up using a graced object.
  • req.grace defines how long overdue an object can be for Varnish to still consider it for grace mode.
  • beresp.grace defines how long past the beresp.ttl-time Varnish will keep an object
  • req.grace is often modified in vcl_recv based on the state of the backend.
When Varnish is in grace mode, it uses an object that has already expired as far as the TTL is concerned. There are several reasons this might happen, one of them being if a backend is marked as bad by a health probe.
For Varnish to be able to use a graced object, two things need to happen:

  • The object needs to still be kept around. This is affected by beresp.grace in vcl_fetch.
  • The VCL has to allow Varnish to use an object as overdue as the one kept around. This is affected by req.grace in vcl_recv.

When setting up grace, you will need to modify both vcl_recv and vcl_fetch to use grace effectively. The typical way to use grace is to store an object for several hours past its TTL, but only use it a few seconds after the TTL, except if the backend is sick. We will look more at health checks in a moment, but for now, the following VCL can illustrate a normal setup:

sub vcl_recv {
        if (req.backend.healthy) {
                set req.grace = 30s;
        } else {
                set req.grace = 24h;
        }
}

sub vcl_fetch {
        set beresp.grace = 24h;
}

req.grace and beresp.grace

set beresp.ttl = 1m;
set req.grace = 30s;
set beresp.grace = 1h;
  • 50s: Normal delivery
  • 62s: Normal cache miss, but grace mode possible
  • 80s: Normal cache miss, but grace mode possible
  • 92s: Normal cache miss, grace mode possible but not allowed
  • 3660s: (1h+1m) Object is removed from cache

In this time-line example, everything except the first normal delivery assumes the object is never refreshed. If a cache miss happens at 62s and the object is refreshed, then a request for the same resource 18 seconds later (at 80s) would of course just hit the new, 18-second-old object.

The flip-side of this time line is setting req.grace to 1h but beresp.grace to 30s. Even though grace is allowed for up to an hour, it is not possible, since the object will be removed from cache long before that.

The lesson to learn from this is simple: There is no point in setting req.grace to a value higher than beresp.grace, but there could be a point in setting beresp.grace higher than req.grace.

Tip

You can use set req.grace = 0s; to ensure that editorial staff doesn’t get older objects (assuming they also don’t hit the cache). The obvious downside of this is that you disable all grace functionality for these users, regardless of the reason.
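A sketch of that tip, assuming editorial staff can be recognized by a request header (the X-Editor header name is purely illustrative; use whatever identifies your staff):

```vcl
sub vcl_recv {
        # Hypothetical marker for editorial staff.
        if (req.http.X-Editor) {
                # Never serve these users an expired object.
                set req.grace = 0s;
        }
}
```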

When can grace happen

  • A request is already pending for some specific content (deliver old content as long as fetching new content is in progress).
  • No healthy backend is available
  • You need health probes or saint mode for Varnish to consider the backend as unhealthy.

The original purpose of grace mode was to avoid piling up clients whenever a popular object expired from cache. So as long as a client is waiting for the new content, Varnish will prefer delivering graced objects over queuing up more clients to wait for this new content.

This is why setting req.grace to a low value is a good performance gain. It ensures that no client gets content that is too old, while the old content is still sent as long as Varnish has a copy and is in the process of updating it. You can disable this behaviour entirely by setting req.grace = 0s, and still use graced objects for unhealthy backends.
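That last combination can be expressed directly in VCL - a minimal sketch that disables grace while the backend is healthy, but keeps a long grace window for when it is sick:

```vcl
sub vcl_recv {
        if (req.backend.healthy) {
                # Never deliver stale content while the backend is up.
                set req.grace = 0s;
        } else {
                # Fall back to day-old content rather than an error page.
                set req.grace = 24h;
        }
}

sub vcl_fetch {
        # Keep objects around long enough for the fallback to work.
        set beresp.grace = 24h;
}
```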

Exercise: Grace

  1. Reuse the CGI script in /usr/lib/cgi-bin/test.cgi, but increase the sleep time and allow it to cache:

    #! /bin/sh
    sleep 15
    echo "Content-type: text/plain"
    echo "Cache-control: max-age=20"
    echo
    echo "Hello world"
    date
    
  2. Make it executable

  3. Test that it works outside of Varnish

  4. Set up beresp.grace and req.grace to 10s in VCL

  5. Fire up a single request to warm the cache, it will take 15 seconds.

  6. Fire up two requests roughly in parallel

  7. Repeat until you can see how grace affects multiple clients

With this exercise, you should see that as long as the content is within the regular TTL, there is no difference. Once the TTL expires, the first client that asks for the content should be stuck for 15 seconds, while the second client should get the graced copy.

Also try setting req.grace to 0s and 10s while leaving beresp.grace intact, then do the opposite.

Bonus: What happens to the Age-header when it takes 15 seconds to generate a page?
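For step 4 of the exercise, the VCL might look like this minimal sketch:

```vcl
sub vcl_recv {
        set req.grace = 10s;
}

sub vcl_fetch {
        set beresp.grace = 10s;
}
```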

Health checks

  • Poke your web server every N seconds
  • Affects backend selection
  • req.backend.healthy
  • Varnish needs at least threshold good probes within the last window probes, where threshold and window are parameters.
  • Set using .probe
  • varnishlog: Backend_health
  • varnishadm: debug.health
backend one {
        .host = "example.com";
        .probe = {
                .url = "/healthtest";
                .interval = 3s;
                .window = 5;
                .threshold = 2;
        }
}

You can define a health check for each backend, which will cause Varnish to probe a URL every few seconds. Normally, it will take more than one failed request before Varnish stops using a specific backend server.

The above example will cause Varnish to send a request to http://example.com/healthtest every 3 seconds. When deciding whether to use a server or not, it will look at the last 5 probes it has sent and require that at least 2 of them were good.

You also have an important variable called .initial, which defaults to the same value as .threshold. It defines how many probes Varnish should pretend are good when it first starts up. Before .initial was added, Varnish needed enough time to probe the Web server and gather good probes before it was able to start functioning after boot.
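For example, a probe that considers the backend healthy immediately after startup could look like this (the values are illustrative):

```vcl
backend one {
        .host = "example.com";
        .probe = {
                .url = "/healthtest";
                .interval = 3s;
                .window = 5;
                .threshold = 2;
                # Pretend all probes in the window were good at startup,
                # so the backend is usable before any real probe returns.
                .initial = 5;
        }
}
```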

debug.health
200 545
Backend foo is Healthy
Current states  good:  8 threshold:  5 window:  8
Average responsetime of good probes: 0.355237
Oldest                                                    Newest
================================================================
------------------------------------------------------4444444444 Good IPv4
------------------------------------------------------XXXXXXXXXX Good Xmit
------------------------------------------------------RRRRRRRRRR Good Recv
-------------------------------------------------HHHHHHHHHHHHHHH Happy

The above shows the output of debug.health - the same data is also available in the more concise Backend_health tag of varnishlog.

Good IPv4 indicates that the IP was available for routing and that Varnish was able to connect over IPv4. Good Xmit indicates that Varnish was able to transmit data. Good Recv indicates that Varnish got a valid reply. Happy indicates that the reply was a 200 OK.

Note

Varnish does NOT send a Host header with health checks. If you need that, you can define the entire request using .request instead of .url.

backend one {
        .host = "example.com";
        .probe = {
                .request =
                        "GET / HTTP/1.1"
                        "Host: www.foo.bar"
                        "Connection: close";
        }
}

Health checks and grace

  • If a backend is marked as sick, grace mode is attempted
  • You can use req.backend.healthy to alter req.grace when a backend is sick to allow Varnish to use even older content, if available.

When Varnish has no healthy backend available, it will attempt to use a graced copy of the object it is looking for. But all the rules you specify in VCL still apply.

Since you have req.backend.healthy available to you, you can use this to optionally increase req.grace just for requests to unhealthy backends.

Directors

  • Contains 1 or more backends
  • All backends must be known
  • Multiple selection methods
  • random, round-robin, hash, client and dns
backend one {
   .host = "localhost";
   .port = "80";
}

backend two {
   .host = "127.0.0.1";
   .port = "81";
}

director localhosts round-robin {
        { .backend = one; }
        { .backend = two; }
        { .backend = { .host = "localhost"; .port = "82"; } }
}

sub vcl_recv {
        set req.backend = localhosts;
}

Backend directors, usually just called directors, provide logical groupings of similar web servers. There are several different directors available, but they all share the same basic properties.

First of all, anywhere in VCL where you can refer to a backend, you can also refer to a director.

All directors also allow you to re-use previously defined backends, or define “anonymous” backends within the director definition. If a backend is defined explicitly and referred to both directly and from a director, Varnish will correctly record data such as number of connections (i.e.: max connections limiting) and saintmode thresholds. Defining an anonymous backend within a director will still give you all the normal properties of a backend.

And a director must have a name.

The simplest directors available are the round-robin director and the random director. The round-robin director takes no additional arguments - only the backends. It will pick the first backend for the first request, then the second backend for the second request, and so on, and start again from the top. If a health probe has marked a backend as sick, the round-robin director will skip it.

The random director picks a backend randomly. It has one per-backend parameter called weight, which provides a mechanism for balancing the traffic to the backends. It also provides a director-wide parameter called retries - it will try this many times to find a healthy backend.

In the above round-robin example, each of the three backends receives a third of the requests. Since two of them are defined with the host name localhost and one with 127.0.0.1, the name localhost ends up with twice as much traffic as 127.0.0.1 - even though, on this machine, they are of course the same thing.
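A random director using the weight and retries parameters described above might look like this sketch (the director name and values are illustrative, and assume the backends one and two from the earlier example):

```vcl
director loadbalancer random {
        # Try up to 3 times to find a healthy backend.
        .retries = 3;
        { .backend = one; .weight = 2; }  # roughly two thirds of the traffic
        { .backend = two; .weight = 1; }  # roughly one third
}

sub vcl_recv {
        set req.backend = loadbalancer;
}
```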

Client and hash directors

The client and hash directors are both special variants of the random director. Instead of a random number, the client director uses the client.identity. The client.identity variable defaults to the client IP, but can be changed in VCL. The same client will be directed to the same backend, assuming that the client.identity is the same for all requests.

Similarly, the hash director uses the hash data, which means that the same URL will go to the same web server every time. This is most relevant for multi-tiered caches.

For both the client and the hash director, the director will pick the next backend available if the preferred one is unhealthy.
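For instance, to pin sessions to a backend based on a session cookie instead of the client IP, you could set client.identity in vcl_recv. This is a sketch: the director name and the sessionid cookie name are illustrative, and the backends one and two are assumed from the earlier example.

```vcl
director session_balancer client {
        { .backend = one; .weight = 1; }
        { .backend = two; .weight = 1; }
}

sub vcl_recv {
        if (req.http.Cookie ~ "sessionid=") {
                # Same session cookie -> same backend, even if the
                # client's IP address changes mid-session.
                set client.identity = regsub(req.http.Cookie,
                        ".*sessionid=([^;]+).*", "\1");
        }
        set req.backend = session_balancer;
}
```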

The DNS director

The DNS director uses the Host header sent by a client to find a backend among a list of possibles. This allows dynamic scaling and changing of web server pools without modifying Varnish's configuration - instead you just wait for Varnish to pick up on the DNS changes.

As the DNS director is perhaps the most complex, some extra explanation might be useful. Consider the following example VCL.

director mydirector dns {
        .list = {
                .port = "81";
                "192.168.0.0"/24;
        }
        .ttl = 5m;
        .suffix = "internal.example.net";
}

sub vcl_recv {
        set req.backend = mydirector;
}

It defines 256 backends, all in the 192.168.0.0/24 range. The DNS director can also use the traditional (non-list) format of defining backends, and most options are supported in .list, as long as they are specified before the relevant backends.

The TTL specified is for the DNS cache. In our example, the mydirector director will cache the DNS lookups for 5 minutes. When a client asks for www.example.org, Varnish will look up www.example.org.internal.example.net, and if it resolves to something, the DNS director will check if one of the backends in the 192.168.0.0/24 range matches, then use that.

Demo: Health probes and grace

Saint mode

  • Saint mode marks an object as sick for a specific backend for a period of time
  • The rest of Varnish just sees a sick backend, be it for grace or backend selection
  • Other content from the same backend can still be accessed
  • ... unless more than a set number of objects are added to the saint mode black list for a specific backend, in which case the entire backend is considered sick.
  • Normal to restart after setting beresp.saintmode = 20s; in vcl_fetch

Saint mode is meant to complement your regular health checks. Sometimes you just can't spot a problem in a simple health probe, but it might be obvious in vcl_fetch.

An example could be a thumbnail generator. When it fails it might return “200 OK”, but no data. You can spot that the Content-Length header is 0 in vcl_fetch, but the health probes might not be able to pick up on this.

In this situation you can set beresp.saintmode = 20s;, and Varnish will not attempt to access that object (aka: URL) from that specific backend for the next 20 seconds. If you restart and attempt the same request again, Varnish will either pick a different backend if one is available, or try to use a graced object, or finally deliver an error message.
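For the thumbnail example, the vcl_fetch logic might be sketched as follows (checking Content-Length this way assumes the backend actually sends that header):

```vcl
sub vcl_fetch {
        if (beresp.http.Content-Length == "0") {
                # Empty body despite 200 OK: blacklist this URL on this
                # backend for 20 seconds and retry elsewhere.
                set beresp.saintmode = 20s;
                return (restart);
        }
}
```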

If you have more than 10 (the default) objects blacklisted for a specific backend, the entire backend is considered sick. The rationale is that if 10 URLs have already failed, there's probably no reason to try an 11th.

There is no need to worry about recovering. The object will only be on the saint list for as long as you specify, regardless of whether the threshold is reached or not.

Use saint mode to complement your health checks. It is meant either to help Varnish “fail fast” for a backend that has failed, until the health probes can take over, or to catch errors that are not possible to spot with the health checks.

As such, it’s advised to keep the saint period short. Typical suggestions are 20 seconds, 30 seconds, etc.

Restart in VCL

  • Start the VCL processing again from the top of vcl_recv.
  • Any changes made are kept.
  • Parameter max_restarts safeguards against infinite loops
  • req.restarts counts the number of restarts
sub vcl_fetch {
        if (req.restarts == 0 &&
                 req.request == "GET" &&
                 beresp.status == 301) {
                set beresp.http.location = regsub(beresp.http.location,"^http://","");
                set req.http.host = regsub(beresp.http.location,"/.*$","");
                set req.url = regsub(beresp.http.location,"[^/]*","");
                return (restart);
        }
}

Restarts in VCL can be used everywhere.

They allow you to re-run the VCL state engine with different variables. The above example simply executes a redirect without going through the client. Another example is using it in combination with PURGE and rewriting, so the script that issues PURGE will also refresh the content.

Yet another example is to combine it with saint mode.

Note

Varnish version 2.1.5 is the first version where return(restart); is valid in vcl_deliver, making it available everywhere.

Backend properties

  • Most properties are optional
backend default {
   .host = "localhost";
   .port = "80";
   .connect_timeout = 0.5s;
   .between_bytes_timeout = 5s;
   .saintmode_threshold = 20;
   .first_byte_timeout = 20s;
   .max_connections = 50;
}

All the backend-specific timers that are available as parameters can also be overridden in the VCL on a backend-specific level.

While the timeouts have already been discussed, there are some other notable parameters.

The saintmode threshold defines how many items can be blacklisted by saint mode before the entire backend is considered sick. Saint mode will be discussed in more detail.

If your backend is struggling, it might be advantageous to set max_connections so only a set number of simultaneous connections will be issued to a specific backend.

Tip

Varnish only accepts hostnames for backend servers that resolve to a maximum of one IPv4 address and one IPv6 address. The parameter prefer_ipv6 defines which one Varnish will prefer.

Example: Evil backend hack

You can not use saint mode in vcl_error, and health probes can be slow to pick up on trouble. So, in order to act on a failing backend right away, you can use the supplied hack to force delivery of a graced object immediately.

You can use a fake backend that’s always sick to force a grace copy. This is considered a rather dirty hack that works.

backend normal {
        .host = "localhost";
        .probe = { .url = "/"; }
}

backend fail {
        .host = "localhost";
        .port = "21121";
        .probe = { .url = "/asfasfasf"; .initial = 0; .interval = 1d; }
}

sub vcl_recv {
        if (req.restarts == 0) {
                set req.backend = normal;
        } else {
                set req.backend = fail;
        }

        if (req.backend.healthy) {
                set req.grace = 30s;
        } else {
                set req.grace = 24h;
        }
}

sub vcl_fetch {
        set beresp.grace = 24h;
}

sub vcl_error {
        if (req.restarts == 0) {
                return (restart);
        }
}

Access Control Lists

  • An ACL is a list of IPs or IP ranges.
  • Compare with client.ip or server.ip
acl management {
        "172.16.0.0"/16;
}

acl sysadmins {
        "192.168.0.0"/16;
        ! "192.168.0.1";
}

sub vcl_recv {
        if (client.ip ~ management) {
                set req.url = regsub(req.url, "^","/proper-stuff");
        } elsif (client.ip ~ sysadmins) {
                set req.url = regsub(req.url, "^","/cool-stuff");
        }
}

ACLs are fairly simple. A single IP is listed as "192.168.1.2", and to turn it into an IP-range, add the /24 outside of the quotation marks ("192.168.1.0"/24). To exclude an IP or range from an ACL, precede it with an exclamation mark - that way you can include all the IPs in a range except the gateway, for example.

ACLs can be used for anything. Some people have even used ACLs to differentiate how their Varnish servers behave (e.g.: a single VCL for different Varnish servers - which evaluates server.ip to see where it really is).

Typical use cases are for PURGE requests, bans or avoiding the cache entirely.
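One of these use cases, bypassing the cache entirely for a trusted network, is just a few lines of VCL (the ACL name and range are illustrative):

```vcl
acl trusted {
        "192.168.0.0"/24;
}

sub vcl_recv {
        if (client.ip ~ trusted) {
                # Always go to the backend for these clients.
                return (pass);
        }
}
```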

Exercise: Combine PURGE and restart

  • Re-write the PURGE example to also issue a restart
  • The result should be that a PURGE both removes the content and fetches a new copy from the backend.

Solution: Combine PURGE and restart

acl purgers {
        "127.0.0.1";
        "192.168.0.0"/24;
}

sub vcl_recv {
        if (req.restarts == 0) {
                unset req.http.X-purger;
        }
        if (req.request == "PURGE") {
                if (!client.ip ~ purgers) {
                        error 405 "Method not allowed";
                }
                return (lookup);
        }
}

sub vcl_hit {
        if (req.request == "PURGE") {
                purge;
                set req.request = "GET";
                set req.http.X-purger = "Purged";
                error 800 "restart";
        }
}

sub vcl_miss {
        if (req.request == "PURGE") {
                purge;
                set req.request = "GET";
                set req.http.X-purger = "Purged-possibly";
                error 800 "restart";    # can't restart in vcl_miss yet; go via vcl_error
        }
}

sub vcl_error {
        if (obj.status == 800 ) {
                return(restart);
        }
}

sub vcl_pass {
        if (req.request == "PURGE") {
                error 502 "PURGE on a passed object";
        }
}

sub vcl_deliver {
        if (req.http.X-purger) {
                set resp.http.X-purger = req.http.X-purger;
        }
}

Note

Whenever you are using req.http to store an internal variable, you should get used to unsetting it in vcl_recv on the first run. Otherwise a client could supply it directly. In this situation, the outcome wouldn’t be harmful, but it’s a good habit to establish.