Background
Caching between a browser and an origin server is controlled by request and response headers. Users and business owners often misunderstand these headers, so it is important that we, as systems administrators and developers, understand them well. Caching also has a direct impact on web site performance and bandwidth usage.
Basics
When a browser requests a piece of content from a web server, the first thing sent back is the set of response headers ((http://crunchtools.com/web-server-headers-101/)). Some of these headers control caching.
Remember that any caching a browser does is essentially voluntary and cannot be enforced by the server. The web server sends headers that tell the browser whether caching is acceptable, but the browser makes the final decision.
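To make the idea of "headers as advice" concrete, here is a minimal Python sketch of how a client might read a Cache-Control value and decide whether it is allowed to reuse a stored copy. The function names are made up for illustration, and the parsing is deliberately simplified (it assumes no commas inside quoted arguments); real browsers apply many more rules.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directive names are case-insensitive; directives without an
        # argument (such as "public") are stored with the value None.
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def may_reuse_without_asking(directives):
    """Simplified check: does the header allow reuse of a stored copy?"""
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    return max_age is not None and max_age.isdigit() and int(max_age) > 0


parsed = parse_cache_control("max-age=7200, public")
print(parsed)
print(may_reuse_without_asking(parsed))
```

Even when this check says a copy may be reused, nothing forces the client to do so; it is still free to ask the server again.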
We are going to inspect the cache control headers with a tool called ngrep (network grep). It is a command line tool that inspects network traffic coming in and out of your computer, so it can be used with any browser. Ngrep lets you specify a search pattern and a traffic filter, which in this case limit inspection to web traffic only.
Command:
ngrep -q -d wlan0 -W byline -i http "tcp port 80"
Output:
interface: wlan0 (192.168.1.0/255.255.255.0)
filter: (ip or ip6) and ( tcp port 80 )
match: http
Explanation:
- -q: Do not output “#” characters for every packet captured
- -d wlan0: Tells ngrep to listen on the wlan0 interface, which is the wireless interface I am using here; substitute the name of your own interface
- -W byline: Tells ngrep to respect newline characters found in the captured packets. This makes HTTP headers much easier to read than having them all grouped on one line
- -i http: The -i option makes matching case-insensitive and http is the pattern to match, so ngrep displays any packet which has http or HTTP in the payload
- "tcp port 80": Tells ngrep to only capture TCP traffic coming from or going to port 80. This captures both outbound and inbound traffic, allowing us to view request and response headers
- interface: wlan0 (192.168.1.0/255.255.255.0): Indicates that ngrep is watching the wireless interface, shown with the address and netmask of the local network it is attached to((http://en.wikipedia.org/wiki/Subnetwork))
- filter: (ip or ip6) and ( tcp port 80 ): Indicates that only TCP/IP traffic coming from or going to port 80 will be displayed
- match: http: Indicates that only traffic which has the word “http” somewhere in its payload will be displayed. Since the -i option was specified, both “HTTP” and “http” will be matched
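The combined effect of the port filter and the match pattern can be modelled in a few lines of Python. This is only a rough sketch of the selection logic described above, not how ngrep is implemented; the function name and its parameters are invented for the example.

```python
import re


def would_display(src_port, dst_port, payload, pattern=b"http", ignore_case=True):
    """Model of: ngrep -i http "tcp port 80" applied to one TCP packet."""
    # Filter "tcp port 80": either end of the connection must use port 80.
    if 80 not in (src_port, dst_port):
        return False
    # Match expression: the pattern must occur somewhere in the payload.
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, payload, flags) is not None


request = b"GET /wp-content/images/crunchtools-logo.png HTTP/1.1\r\n\r\n"
print(would_display(51234, 80, request))          # request going to port 80
print(would_display(51234, 443, request))         # wrong port
print(would_display(80, 51234, b"\x89PNG\r\n"))   # no "http" in the payload
```

The last call hints at why a packet carrying only image data would not be shown: it passes the port filter but contains nothing that matches the pattern.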
Actions
In another window we are now going to use a browser to pull the CrunchTools logo image.
Paste the following URL in a browser:
http://crunchtools.com/wp-content/images/crunchtools-logo.png
The output returned by ngrep will look similar to the following:
Request Headers: The last two headers (Pragma: no-cache and Cache-Control: no-cache) tell any cache along the route, such as a proxy, not to answer from a stored copy without first checking with the origin web server.
GET /wp-content/images/crunchtools-logo.png HTTP/1.1.
Host: crunchtools.com.
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100330 Fedora/3.5.9-2.fc12 Firefox/3.5.9.
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8.
Accept-Language: en-us,en;q=0.5.
Accept-Encoding: gzip,deflate.
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7.
Keep-Alive: 300.
Connection: keep-alive.
Pragma: no-cache.
Cache-Control: no-cache.
.
Response Headers: The server, in return, responds with a standard 200 response, which means that the file was found on the server and the content was delivered. Notice the binary data at the end of the response; some of the data was removed to make the output easier to read.
HTTP/1.1 200 OK.
Date: Sat, 08 May 2010 22:38:31 GMT.
Server: Apache/2.2.3 (CentOS).
Last-Modified: Sat, 08 May 2010 11:25:27 GMT.
ETag: "2a28c-73a1-6eb997c0".
Accept-Ranges: bytes.
Content-Length: 29601.
Cache-Control: max-age=7200, public.
Expires: Sun, 09 May 2010 00:38:31 GMT.
Connection: close.
Content-Type: image/png.
.
.PNG.
.
....IHDR.............K%7U....sRGB...
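The Cache-Control: max-age=7200 and Expires headers in this response say the same thing in two ways: the copy may be considered fresh for 7200 seconds (two hours) after the Date header. The short Python sketch below checks that arithmetic using the values captured above. The helper names are illustrative, and the calculation is simplified; it ignores details such as the Age header that real caches take into account.

```python
from datetime import timedelta
from email.utils import format_datetime, parsedate_to_datetime


def fresh_until(date_header, max_age_seconds):
    """Return the moment a response stops being fresh: Date + max-age."""
    served_at = parsedate_to_datetime(date_header)
    return served_at + timedelta(seconds=max_age_seconds)


def is_fresh(date_header, max_age_seconds, now_header):
    """True if 'now' falls before the end of the freshness lifetime."""
    return parsedate_to_datetime(now_header) < fresh_until(date_header, max_age_seconds)


# Date and max-age copied from the response above.
expiry = fresh_until("Sat, 08 May 2010 22:38:31 GMT", 7200)
print(format_datetime(expiry, usegmt=True))  # Sun, 09 May 2010 00:38:31 GMT
```

The computed time matches the Expires header the server sent, which is what we would expect when both headers are generated from the same two-hour policy.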
Now hit the refresh button on the browser. The output returned by ngrep will look similar to the following:
Request Headers: This time the content is requested again, but the If-Modified-Since header specifies a date. This is called a validation request: the server delivers the content again only if the browser's copy is stale, meaning the file has changed since that date.
GET /wp-content/images/crunchtools-logo.png HTTP/1.1.
Host: crunchtools.com.
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100330 Fedora/3.5.9-2.fc12 Firefox/3.5.9.
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8.
Accept-Language: en-us,en;q=0.5.
Accept-Encoding: gzip,deflate.
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7.
Keep-Alive: 300.
Connection: keep-alive.
If-Modified-Since: Sat, 08 May 2010 11:25:27 GMT.
If-None-Match: "2a28c-73a1-6eb997c0".
Cache-Control: max-age=0.
Response Headers: The server responds with a 304 (Not Modified), which means the cached copy is still good and the browser can display it to the user immediately from its cache. No image data is sent.
HTTP/1.1 304 Not Modified.
Date: Sat, 08 May 2010 22:43:11 GMT.
Server: Apache/2.2.3 (CentOS).
Connection: close.
ETag: "2a28c-73a1-6eb997c0".
Expires: Sun, 09 May 2010 00:43:11 GMT.
Cache-Control: max-age=7200, public.
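From the browser's side, the exchange above can be sketched as a toy cache in Python. This is an assumption-laden illustration, not how any particular browser is written: the cache is a plain dictionary, and the function names are invented for the example.

```python
def conditional_headers(entry):
    """Build validation headers from the validators stored with a cached copy."""
    stored = entry["headers"]
    headers = {}
    if "ETag" in stored:
        headers["If-None-Match"] = stored["ETag"]
    if "Last-Modified" in stored:
        headers["If-Modified-Since"] = stored["Last-Modified"]
    return headers


def apply_response(cache, url, status, headers, body=b""):
    """Update the toy cache from a response and return the body to display."""
    if status == 304:
        # Not Modified: keep the stored body, refresh the stored headers.
        entry = cache[url]
        entry["headers"].update(headers)
        return entry["body"]
    if status == 200:
        # Full response: replace whatever was stored before.
        cache[url] = {"headers": dict(headers), "body": body}
        return body
    raise ValueError("status %d is not handled in this sketch" % status)
```

A first 200 response fills the cache; on refresh, `conditional_headers` produces the If-None-Match and If-Modified-Since headers seen in the capture, and a 304 reply returns the stored image bytes without transferring them again.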
Finally, we are going to see what happens when we modify the file by adding a single byte to the end of the image file. On the server, I run the following command:
echo -n "t" >> crunchtools-logo.png
Request Headers (after refreshing the browser again): Notice the If-None-Match header. It specifies a unique label for the file being requested. This label, which is called an ETag, was given to the browser in an earlier response and was saved. The browser is telling the server to send a fresh copy only if the file has changed.
GET /wp-content/images/crunchtools-logo.png HTTP/1.1.
Host: crunchtools.com.
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100330 Fedora/3.5.9-2.fc12 Firefox/3.5.9.
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8.
Accept-Language: en-us,en;q=0.5.
Accept-Encoding: gzip,deflate.
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7.
Keep-Alive: 300.
Connection: keep-alive.
If-Modified-Since: Sat, 08 May 2010 11:25:27 GMT.
If-None-Match: "2a28c-73a1-6eb997c0".
Cache-Control: max-age=0.
Response Headers: The server responds with a 200 and delivers a fresh copy because the ETag no longer matches. The web server generated a new ETag because the file changed. Also, notice that the content length is 1 byte larger.
HTTP/1.1 200 OK.
Date: Sat, 08 May 2010 22:49:58 GMT.
Server: Apache/2.2.3 (CentOS).
Last-Modified: Sat, 08 May 2010 22:49:01 GMT.
ETag: "2a28c-73a4-fb599140".
Accept-Ranges: bytes.
Content-Length: 29602.
Cache-Control: max-age=7200, public.
Expires: Sun, 09 May 2010 00:49:58 GMT.
Connection: close.
Content-Type: image/png.
.
.PNG.
.
....IHDR.............K%7U....sRGB.....
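The server's side of this decision can be sketched in Python as a comparison between the ETag the browser sent and the ETag of the file as it is now. This is a simplified model, not Apache's actual logic: it assumes a single ETag in If-None-Match, uses exact string comparison, and ignores If-Modified-Since. The function name is invented for the example.

```python
def respond(request_headers, current_etag, body):
    """Return (status, headers, body) for a possibly conditional GET."""
    if request_headers.get("If-None-Match") == current_etag:
        # The browser's copy is still current: send headers only.
        return 304, {"ETag": current_etag}, b""
    # No validator, or the file changed: send the full content.
    headers = {"ETag": current_etag, "Content-Length": str(len(body))}
    return 200, headers, body


old_etag = '"2a28c-73a1-6eb997c0"'   # ETag the browser has stored
new_etag = '"2a28c-73a4-fb599140"'   # ETag after the file was modified
request = {"If-None-Match": old_etag}

print(respond(request, old_etag, b"image bytes")[0])   # file unchanged
print(respond(request, new_etag, b"image bytest")[0])  # file changed
```

With the unchanged file the sketch answers 304, and once the ETag differs it answers 200 with the new content, mirroring the two captures above.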
Experiment
Try pulling cache control headers from web servers for several big or small websites, then analyze what the cache headers mean. You will notice that some sites cache HTML, while others cache only images. Each company has a different strategy for caching content. The -I option asks curl to fetch only the headers.
curl -I www.facebook.com
curl -I www.msn.com
curl -I www.cnn.com
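If you would rather script the experiment than run the curl commands above by hand, the following Python sketch uses only the standard library to request headers and keep the ones related to caching. The list of header names and the function names are my own choices for illustration, and results will vary from site to site.

```python
import urllib.request

# Header names treated as cache-related in this sketch (an assumption).
CACHE_HEADERS = ("Cache-Control", "Expires", "ETag", "Last-Modified", "Pragma", "Age", "Vary")


def pick_cache_headers(headers):
    """Keep only cache-related headers from a list of (name, value) pairs."""
    wanted = {name.lower() for name in CACHE_HEADERS}
    return {name: value for name, value in headers if name.lower() in wanted}


def fetch_headers(url):
    """Send a HEAD request and return the response headers as (name, value) pairs."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return list(response.headers.items())


def report(urls):
    """Print the cache-related headers for each URL (needs network access)."""
    for url in urls:
        print(url)
        for name, value in pick_cache_headers(fetch_headers(url)).items():
            print("  %s: %s" % (name, value))
```

Calling `report(["http://www.cnn.com"])` from an interactive session prints whatever cache headers that site returns at the moment; note that URLs passed to urllib need an explicit scheme such as http://.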
Further Reading
- http://palisade.plynt.com/issues/2008Jul/cache-control-attributes/
- http://www.mnot.net/cache_docs/
- http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.2.6