Elements or Lower

Mon, 05 Dec 2005

The PageCache

I’ve previously noted that the CMS I’ve put together uses a fried rather than baked model for its presentation layer. For a few months now, this has only partly been true.

The presentation layer now implements a cache for the final HTML of pages, and serves from that when there’s a copy of the requested page there. This speeds up the delivery of cached pages and reduces the amount of redundant work the CMS has to do, especially for popular pages.

The cache is a regular database table, containing a resource ID, the “framework” (viewing context), and the actual generated HTML. When a request is made for a page, once the CMS has analysed the URL to establish the resource ID and framework in question, it checks to see if there’s matching content in the cache. If there is, it serves it; if not, it proceeds to generate the content as normal.
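
As a rough illustration of that lookup, here is a minimal sketch in Python using SQLite. The real CMS runs under mod_perl and its schema isn’t shown here, so the table name page_cache, the column names and the functions open_cache, fetch_cached and serve are all hypothetical.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open the cache database and create the (hypothetical) page_cache table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_cache ("
        " resource_id INTEGER NOT NULL,"
        " framework TEXT NOT NULL,"      # the viewing context
        " html TEXT NOT NULL,"           # the final generated HTML
        " PRIMARY KEY (resource_id, framework))"
    )
    return conn


def fetch_cached(conn, resource_id, framework):
    """Return the cached HTML for this resource and framework, or None."""
    row = conn.execute(
        "SELECT html FROM page_cache WHERE resource_id = ? AND framework = ?",
        (resource_id, framework),
    ).fetchone()
    return row[0] if row else None


def serve(conn, resource_id, framework, generate):
    """Serve from the cache if possible; otherwise generate as normal.

    Returns (html, was_cached). `generate` stands in for the CMS's
    normal page generation and is an assumption of this sketch.
    """
    html = fetch_cached(conn, resource_id, framework)
    if html is not None:
        return html, True
    return generate(resource_id, framework), False
```

Keying the table on the pair (resource ID, framework) means the same resource can hold one cached copy per viewing context.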

This differs from a normal baked CMS only in that the pre-generated content is effectively served from a database rather than as static files on the server, and that the CMS continues to fry up content if there isn’t anything already baked. By doing this, and by binding the cache to the CMS at a fairly fundamental level, we can achieve quite a lot of flexibility.

For a typical page, if there isn’t any pre-generated content, the CMS will generate the page as normal, and then try to store the final HTML in the database for the next request. The CMS won’t actually permit the storage of pages, however, in the following circumstances:

  1. The page is a PDF file.

  2. The page is being viewed in the test environment (the CMS can have separate test and live content for any page).

  3. The request contained a query string or POST content.

  4. The content is marked as do-not-cache.

All that happens here is that the generated content is never stored, and so the CMS will be forced to re-generate it for each identical request. The initial check for pre-generated content, therefore, trusts the database completely. If there’s pre-generated content available, it’s always served. This keeps the processing overhead of serving that content to an absolute minimum.
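
The storage rules and the trusting lookup can be sketched together as follows. This is an illustration rather than the CMS’s actual code: the Request fields, may_store and handle are hypothetical names, and a plain dictionary stands in for the database table. Taking the description literally, the four checks are applied only when storing, never when reading.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # All field names here are assumptions made for this sketch.
    resource_id: int
    framework: str
    is_pdf: bool = False
    test_environment: bool = False
    query_string: str = ""
    has_post_body: bool = False
    do_not_cache: bool = False


def may_store(req):
    """Apply the four circumstances in which storage is not permitted."""
    return not (
        req.is_pdf
        or req.test_environment
        or req.query_string
        or req.has_post_body
        or req.do_not_cache
    )


def handle(cache, req, generate):
    """Serve a request, trusting whatever the cache already holds."""
    key = (req.resource_id, req.framework)
    if key in cache:
        # No further checks: if content is there, it is served.
        return cache[key]
    html = generate(req)
    if may_store(req):
        cache[key] = html
    return html
```

Because an uncacheable response is simply never written, the read path needs no knowledge of the rules at all.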

The cached content, however, is deliberately very fragile. The CMS wipes the entire cache overnight, to avoid any pages becoming stale, and the administration layer wipes selected portions of the cache when amendments are made. Generally, changing the textual content of a page means that the CMS need only wipe the cache for that page, whereas changing the title or metadata of a page prompts wiping the cache for the page, its descendants in the site hierarchy and its siblings in the site hierarchy. Moving a page within the site prompts the CMS to wipe the entire cache, and so on. The CMS tries to be as cautious as it can be here: it’s better to wipe too much of the cache than not enough.
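
One way to picture those wipe scopes is as a function from the kind of change to the set of pages whose cache entries must go. The sketch below is hypothetical throughout: the change names, the two dictionaries describing the site hierarchy and the WIPE_ALL sentinel are all assumptions. Treating any unrecognised change as a full wipe is also an assumption, chosen to match the wipe-too-much principle.

```python
WIPE_ALL = object()  # sentinel meaning "wipe the entire cache"


def descendants(children, page):
    """Collect every page below `page` in the site hierarchy."""
    found = set()
    stack = list(children.get(page, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found


def siblings(children, parent_of, page):
    """Return the other pages sharing this page's parent."""
    parent = parent_of.get(page)
    if parent is None:
        return set()
    return set(children.get(parent, [])) - {page}


def pages_to_wipe(change, page, children, parent_of):
    """Map a change to the pages whose cached HTML should be wiped."""
    if change == "text":
        return {page}
    if change in ("title", "metadata"):
        return (
            {page}
            | descendants(children, page)
            | siblings(children, parent_of, page)
        )
    # Moves, and anything this sketch does not recognise: be cautious.
    return WIPE_ALL
```

A title change reaches siblings and descendants presumably because their pages can display it, for instance in navigation, though the exact reason depends on the site’s templates.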

Even then, given the range of sources from which the CMS can acquire content, the cache can still occasionally contain data that’s not perfectly fresh, and so there are options within the “Administration Shell” to wipe the entire cache, or to wipe the cache only for a specific resource.

Because the preparation of a finished page of HTML can involve a lot of work for the CMS, including a number of XML transformations, introducing the cache has helped response times — particularly for popular pages — by a surprising amount. It’s far, far easier for the CMS to shunt data straight from the database than to go through the normal page generation process, and because the system runs under mod_perl, the CMS code is itself compiled into the Apache process, and the database connection is also cached and reused.

The CMS request log records, using a high-resolution timer, the processing time taken from parsing the incoming HTTP headers to initiating the log record. Taking the homepage as an example, a sample request without the PageCache took 0.906 seconds to complete. With the output of that request cached, the second request took only 0.007 seconds, roughly 130 times faster. Most pages don’t take quite that long to generate (typically around 0.3 to 0.6 seconds), but even here, the difference between half a second and less than a tenth of a second is palpable.