We generate an awful lot of logs at work. The application I work on outputs a large JSON log line each time we make a decision, and we make a lot of decisions. We end up with many tens of gigabytes of JSON log file per machine per day.
All this is then gzipped, archived and placed into longer-term storage, accessible to me over NFS. If we need to do any diagnostics or answer historical questions about the decisions our software made, the main tools we use are zgrep and jq. These are both irreplaceably awesome, but zgrep in particular understandably takes a little while to run. The gzip file format does not in general[1] support random access, and so zgrep is forced to decompress the entire file serially and then search each line.
I thought there must be a better way, and when I discovered the zran.c example in the zlib sources (a way to accelerate seeks inside gzipped files) I set to work on zindex.
zindex creates an index on your gzipped files and then zq lets you quickly search that index. Not only does zindex create an index based on a configurable key per line, it also creates an acceleration structure to allow random access within an existing gzipped file. That makes it super fast to query and return the matching lines without the index itself being too large.
Random access is afforded by storing a “checkpoint” of the decompressor’s state every few megabytes of uncompressed data. Later on, a checkpoint can be used to re-initialize the decompressor and pick up decompression from the middle of the file. To read from a particular offset, the nearest checkpoint before the required data is found, the decompressor is initialized with that checkpoint’s state, and data is decompressed and discarded until the required offset is reached. This bounds the time needed to read from an arbitrary offset: on average only half a checkpoint interval’s worth of data has to be decompressed, rather than half the entire file.
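For the curious, here is a minimal in-memory sketch of the idea; it is not zindex’s actual code. Checkpoints are captured with zlib’s inflateCopy() during one sequential pass, and a later read resumes from the nearest one. zindex itself persists the decompressor’s window and bit offset to disk (the zran.c technique) so its checkpoints survive between runs; the struct names, buffer sizes and error handling below are purely illustrative.

```cpp
#include <zlib.h>

#include <cstddef>
#include <cstdint>
#include <deque>
#include <stdexcept>
#include <vector>

// A decompressor snapshot: enough to resume inflating from the middle of the
// stream. The z_stream copy holds pointers into the compressed buffer, so the
// buffer must outlive the checkpoints.
struct Checkpoint {
    uint64_t uncompressedOffset;  // output position this snapshot resumes at
    z_stream state;               // zlib inflate state, captured by inflateCopy()
};

// One sequential pass over a (single-member) gzip buffer, snapshotting the
// decompressor roughly every `spacing` bytes of uncompressed output.
std::deque<Checkpoint> buildCheckpoints(std::vector<unsigned char>& gz,
                                        uint64_t spacing) {
    z_stream strm{};
    if (inflateInit2(&strm, 15 + 32) != Z_OK)  // 15+32: accept gzip or zlib
        throw std::runtime_error("inflateInit2 failed");
    strm.next_in = gz.data();
    strm.avail_in = static_cast<uInt>(gz.size());

    std::deque<Checkpoint> checkpoints;  // deque: element addresses stay stable
    unsigned char sink[16384];
    uint64_t produced = 0, nextCheckpoint = 0;
    int ret = Z_OK;
    do {
        if (produced >= nextCheckpoint) {
            checkpoints.emplace_back();
            Checkpoint& cp = checkpoints.back();
            cp.uncompressedOffset = produced;
            if (inflateCopy(&cp.state, &strm) != Z_OK)
                throw std::runtime_error("inflateCopy failed");
            nextCheckpoint += spacing;
        }
        strm.next_out = sink;
        strm.avail_out = sizeof(sink);
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END)
            throw std::runtime_error("inflate failed");
        produced += sizeof(sink) - strm.avail_out;
    } while (ret != Z_STREAM_END);
    inflateEnd(&strm);
    return checkpoints;  // a real implementation would inflateEnd() these later
}

// Read `len` uncompressed bytes starting at `offset`: resume from the nearest
// checkpoint at or before `offset`, then decompress and discard up to it.
std::vector<unsigned char> readAt(std::deque<Checkpoint>& checkpoints,
                                  uint64_t offset, size_t len) {
    Checkpoint* best = &checkpoints.front();
    for (auto& cp : checkpoints)
        if (cp.uncompressedOffset <= offset) best = &cp;

    z_stream strm{};
    if (inflateCopy(&strm, &best->state) != Z_OK)
        throw std::runtime_error("inflateCopy failed");

    std::vector<unsigned char> result;
    unsigned char buf[16384];
    uint64_t pos = best->uncompressedOffset;
    int ret = Z_OK;
    while (pos < offset + len && ret != Z_STREAM_END) {
        strm.next_out = buf;
        strm.avail_out = sizeof(buf);
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END)
            throw std::runtime_error("inflate failed");
        size_t got = sizeof(buf) - strm.avail_out;
        for (size_t i = 0; i < got; ++i, ++pos)  // keep only [offset, offset+len)
            if (pos >= offset && pos < offset + len) result.push_back(buf[i]);
    }
    inflateEnd(&strm);
    return result;
}
```

With checkpoints every, say, 4MB, readAt never has to decompress and discard more than about 4MB before reaching the bytes it was asked for, whereas starting from the top of a multi-gigabyte file could mean decompressing nearly all of it.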
The index itself can be generated in a number of ways:
- On a field of each line, for example the nth field of a CSV log line.
- On a regular expression capture group: --regex 'id:([0-9]+)' would search for matches like id:1234 and then the index would be built on the 1234 (a rough sketch of what this means appears a little further below).
- By piping each line through an external program such as jq and generating matches for each orderId.id in the actions array: "jq --raw-output --unbuffered '[.actions[].orderId.id] | join(\" \")'"
By default the index is created as <FILENAME>.gz.zindex, but this can be changed if (for example) your gzip file is on a read-only NFS share.
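To make the regular-expression option above a little more concrete, here is a rough sketch of what “build the index on the capture group” means. It is illustrative only: zindex stores enough to locate each line in the compressed file rather than an in-memory map, and the log lines here are made up.

```cpp
#include <cstddef>
#include <iostream>
#include <map>
#include <regex>
#include <string>
#include <vector>

int main() {
    // Hypothetical log lines standing in for the real JSON output.
    std::vector<std::string> lines = {
        R"({"msg":"accepted","detail":"id:1234"})",
        R"({"msg":"rejected","detail":"id:5678"})",
    };
    // Same pattern as the --regex example: capture group 1 is the key.
    std::regex pattern("id:([0-9]+)");
    std::multimap<std::string, size_t> index;  // key -> line number
    for (size_t lineNo = 0; lineNo < lines.size(); ++lineNo) {
        auto begin = std::sregex_iterator(lines[lineNo].begin(),
                                          lines[lineNo].end(), pattern);
        for (auto it = begin; it != std::sregex_iterator(); ++it)
            index.emplace((*it)[1].str(), lineNo);
    }
    for (const auto& [key, lineNo] : index)
        std::cout << key << " -> line " << lineNo << "\n";
}
```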
Once the index is built, querying it is simple: run zq and pass it the path to the gzipped file and then whatever queries you want to make. zq will use the default index on a file unless told otherwise; it can also query by line number. I plan on adding support for multiple indices soon.
Typical numeric indexes I have built are around 20% of the compressed file size, which is to say around 4% of the uncompressed size. Indexing runs at around 50MB of compressed data per second (250MB/sec uncompressed). I suspect the limiting factor in my tests is NFS.
Querying the index is pretty fast, taking on average 0.01s per query in my tests.
Still to do are supporting multiple indices, and optimizing some cases where I can reuse the decompression state without rewinding and starting again.
As you’d expect, the source is available on GitHub and I welcome patches. Ideas and bug reports are also welcome. I hope to get a few things fixed up and then do a v1.0 release, complete with binaries: at the moment you’ll need to build from source.
[1] It is possible to create points in the gzip stream that one can later seek() to, by resetting the state of the compressor and flushing it periodically. That wasn’t an option in our case, as the file archival process is not something I had the freedom to change. ↩
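For completeness, a sketch of that alternative, again purely illustrative with made-up file names and flush interval: a writer that calls deflate() with Z_FULL_FLUSH every few megabytes emits byte-aligned points with the dictionary reset, and a reader that has recorded those output offsets can later restart a raw inflate (windowBits of -15) at any of them.

```cpp
#include <zlib.h>
#include <cstdio>
#include <vector>

int main() {
    z_stream strm{};
    // windowBits of 15 + 16 asks zlib to write a gzip (not zlib) header.
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    FILE* in = fopen("app.log", "rb");      // hypothetical input
    FILE* out = fopen("app.log.gz", "wb");  // hypothetical output
    if (!in || !out) return 1;

    const size_t kFlushEvery = 4 * 1024 * 1024;  // a seek point every ~4MiB
    std::vector<unsigned char> inBuf(1 << 16), outBuf(1 << 16);
    size_t sinceFlush = 0;
    size_t got;
    while ((got = fread(inBuf.data(), 1, inBuf.size(), in)) > 0) {
        sinceFlush += got;
        strm.next_in = inBuf.data();
        strm.avail_in = static_cast<uInt>(got);
        // Z_FULL_FLUSH both byte-aligns the output and resets the dictionary,
        // which is what makes the following offset independently decodable.
        int flush = (sinceFlush >= kFlushEvery) ? Z_FULL_FLUSH : Z_NO_FLUSH;
        if (flush == Z_FULL_FLUSH) sinceFlush = 0;
        do {
            strm.next_out = outBuf.data();
            strm.avail_out = static_cast<uInt>(outBuf.size());
            deflate(&strm, flush);
            fwrite(outBuf.data(), 1, outBuf.size() - strm.avail_out, out);
        } while (strm.avail_out == 0);
    }
    // Finish the stream and write the gzip trailer.
    do {
        strm.next_out = outBuf.data();
        strm.avail_out = static_cast<uInt>(outBuf.size());
        deflate(&strm, Z_FINISH);
        fwrite(outBuf.data(), 1, outBuf.size() - strm.avail_out, out);
    } while (strm.avail_out == 0);
    deflateEnd(&strm);
    fclose(in);
    fclose(out);
    return 0;
}
```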
Matt Godbolt is a C++ developer living in Chicago. Follow him on Mastodon or Bluesky.