cache-size implementation downsides
konstantin at linuxfoundation.org
Wed Jun 13 21:02:42 CEST 2018
TL;DR: The way cache-size is implemented is pretty clever and efficient,
but has important downsides due to filename collisions.
The way caching is implemented in cgit, it uses the path of the URL
request (plus any query strings) as the cache key. That key is then
passed through a fast one-way hashing function to arrive at a
deterministic integer between 0 and the value of cache-size. That
integer is then used to generate the name of the file that will be
stored in the cache-root directory.
This is clever and fast, because this means cgit doesn't need to list
the contents of the cache-root directory to make sure there are no more
than cache-size number of files in it (directory listing is expensive).
However, this also means that if the value of cache-size is set
sufficiently low, there will inevitably be collisions. To make sure that
we don't serve the wrong cache, cgit stores the full key (path+query
string) inside the cache file at offset 0. When there is a cache file in
place matching the request, cgit makes sure that it is for the correct
key before serving it. If the keys differ, the old cache value is
thrown out and the new response is generated and cached.
Like I said, this is clever and fast because we get enforcement of the
maximum number of files basically for free. However, collisions have
important downsides on a busy site like git.kernel.org:
1. If two popular but expensive URLs have the same cache-file object,
they can negate a lot of cache savings. E.g. if a very expensive diff
has the same cache-file as another very expensive diff, and both are
alternately hit with sufficient frequency, it results in a lot of cache
misses and high load.
2. I have witnessed cache corruption due to collisions (which is
a bug in itself). One of our frontends was hit by a lot of aggressive
crawling of snapshots that raised the load to 60+ (many, many gzip
processes). After we blackholed the bot, some of the cache objects for
non-snapshot URLs had trailing gzip junk in them, meaning that either
two instances were writing to the same file, or something else resulted
in cache corruption. This is probably a race condition somewhere in the
3. If we raise the value of cache-size higher in hopes to avoid
collisions, we risk running out of disk space due to large snapshots
accumulating in cache-root, since nothing cleans them up.
Raising cache-size (#3) is our interim solution, and we use a cronjob
that runs every 5 minutes to clean up old snapshot/static URLs,
attempting to respect the cache-ttl values. However, I wonder if a more
robust cache backend implementation could help resolve some of these
problems, especially the high chance of collisions when cache-size is
set low.
..  https://gist.github.com/mricon/28dd36fa093f9cb0748f63756f0f0a8d