Lossless compression is the act of making a dataset smaller than its original form while still being able to transform the compressed version back into the original source material. This contrasts with lossy compression, which produces a derivative dataset that, while still being something humans can appreciate, cannot recreate the original source material.
In a world where storage is cheap, why do we still compress data? Because datasets are growing at a faster rate than storage throughput or network bandwidth. 100 MB of data can be compressed to 10 MB in less than a second and then transferred in under a second over a 100 Mbps connection. Most decompression systems run at 100 to 500 MB/s, so the decompression overhead adds little to the wall clock time needed to deliver the original 100 MB of content. Not only does this make the best use of the network bottleneck, it also frees up capacity for other payloads.
Also, consider that if you can only read at 100 MB/s off a mechanical drive but your CPU can decompress data at ~500 MB/s then, given a suitably compressible dataset, the mechanical drive can effectively provide 5x the throughput you'd otherwise expect thanks to compression.
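As a rough back-of-the-envelope model, assuming the compressed stream is read from storage and decompressed on the fly (with decompression speed measured on the decompressed output):

effective throughput ≈ min(storage read speed × compression ratio, decompression speed)

With a 5:1 ratio that works out to min(100 MB/s × 5, 500 MB/s) = 500 MB/s of logical data per second, versus 100 MB/s without compression.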
The world of compression is a rabbit hole of intellectual curiosity. In this post, I'll describe the highlights I've come across while trying to improve compression performance in database systems I deploy.
From Entropy to Lempel-Ziv
Entropy, an information theory term coined by Claude Shannon in 1948, describes the minimum number of bits, on average, needed to encode each symbol of a dataset. Think of entropy as the number of yes/no questions that would need to be answered in order to describe any one piece of information within a dataset.
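For a source emitting symbols x with probabilities p(x), Shannon's measure is:

H = - Σ p(x) · log2(p(x))

A fair coin flip carries 1 bit of entropy (one yes/no question), a uniformly random byte carries 8 bits per byte, and a file made up of a single repeated byte value carries close to 0 bits per byte, which is part of why it compresses so well.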
Compression aims to get as close to that lower bound as possible by transforming data in such a way that patterns emerge, exposing duplication within a given dataset.
Replacing long, repeated sequences with references to earlier occurrences was the key concept behind Abraham Lempel and Yaakov Ziv's work on the LZ77 and LZ78 algorithms, published in 1977 and 1978 respectively. Their work has had a huge impact on the field of lossless compression. Many algorithms used in popular software today are variations and tweaks on those two algorithms developed in the late 1970s.
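As a simplified illustration (not the exact LZ77 bit layout), a repeated phrase can be replaced by a back-reference to its earlier occurrence:

to be or not to be  ->  to be or not (distance=13, length=5)

The decoder rebuilds the second "to be" by copying five characters starting 13 characters back.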
Below is a family tree of related dictionary-based compression algorithms which trace their lineages back to LZ77 and LZ78. I've also included the year of the earliest publication I could find for each of them and a few noteworthy software and file format implementations.
├── LZ77 (1977)
│   ├── LZR (1981)
│   ├── LZSS (1982, WinRAR)
│   │   ├── LZB (1987)
│   │   └── LZH (1987)
│   ├── DEFLATE (1993, ZIP, gzip, PNG, zlib)
│   │   └── DEFLATE64 (1999)
│   ├── ROLZ (1991)
│   │   ├── LZP (1995)
│   │   └── LZRW1 (1991)
│   │       ├── LZJB (1998, ZFS)
│   │       └── LZRW1-A, 2, 3, 3-A, 4, 5 (<1996 and onward)
│   ├── LZS (1994, Cisco IOS)
│   ├── LZX (1995, CHM files)
│   ├── LZO (1996, Nintendo 64, PlayStation)
│   ├── LZMA (1998, 7ZIP, xz)
│   │   ├── LZMA2 (2009)
│   │   └── LZHAM (2009)
│   ├── Statistical Lempel-Ziv (2001)
│   ├── SLZ (<2009)
│   ├── LZ4 (2011, Hadoop, ZFS, bcolz)
│   │   └── Zstandard (2015, Redshift, ORC, Parquet, RocksDB, bcolz)
│   ├── Snappy (2011, ORC, Parquet, MariaDB, Cassandra, MongoDB, Lucene, bcolz)
│   └── Brotli (2013, Google Chrome, Firefox, nginx)
│       └── LZFSE (2015, iOS, Mac OSX)
└── LZ78 (1978)
    ├── LZW (1984, PKZIP, GIF, TIFF, PDF)
    │   ├── LZMW (1985)
    │   │   └── LZAP (1988)
    │   ├── LZC (1985)
    │   │   └── LZT (1987)
    │   └── LZWL (2006)
    └── LZJ (1985)
In contrast, these are a few non-dictionary-based compression systems.
├── PPM (1984)
│   └── PAQ (2002)
│       └── 20+ Variants (2003-2009)
└── bzip2 (1996)
As you can see, the compression world may not be fast-moving but rarely does a year pass without some sort of iteration and improvement.
Lossless Compression Benchmarks
In 2015, Przemyslaw Skibinski started a project on GitHub called lzbench. This project aims to compile the source code of a wide range of lossless compression algorithms into a single binary and then benchmark them against various workloads. The project's README contains benchmark results for a 212 MB file extracted from the Silesia compression corpus. Each of the compression methods used is run with a variety of settings; this gives an idea of what sort of results could be expected from a performance tuning exercise.
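If you want to produce comparable numbers on your own hardware, the following is a rough sketch of how lzbench is built and run. The repository location and the -e compressor/level syntax below reflect the project at the time of writing and may differ between versions; silesia.tar is assumed to be a locally prepared tarball of the Silesia corpus.

$ git clone https://github.com/inikep/lzbench.git
$ cd lzbench
$ make -j

$ # silesia.tar: a tarball of the Silesia corpus files, prepared separately
$ ./lzbench -ezstd,1,19/lz4/brotli silesia.tar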
In the README's benchmark results, blosclz, density, lz4, lz4fast, lzo1x and shrinker were all tuned to compress at more than 400 MB/s while maintaining at least a 5:3 compression ratio.
The benchmark shows memcpy, which simply copied the data from one memory location to another, running at ~8.5 GB/s. Given the computational overheads of each of the compressors, this transfer rate can be seen as a ceiling on the hardware used (C-Blosc claims it can get past memory-bound performance issues but I've yet to study their claim in detail). The density and lz4fast benchmarks stand out: their compression speeds of 800 MB/s and 785 MB/s, respectively, were the fastest of any compressor able to achieve a 3:2 compression ratio.
Of the compressors able to achieve a 3:1 compression ratio, only Zstandard managed it while compressing at 88 MB/s. More than 75% of the other compressors that reached that ratio couldn't do so at more than 15 MB/s, some 6x slower than Zstandard.
The vast majority of decompressors could break the 250 MB/s barrier, 25% of them broke the 1 GB/s barrier and a few, including LZ4, could be tuned to decompress in excess of 2 GB/s. Decompression at that rate would demand RAM drives, RAID 0 or NVMe storage in order to keep the decompressor fed. The SATA III interface found on most systems tops out at roughly 600 MB/s, so it would be a bottleneck in need of addressing if it were part of the storage pipeline.
Lastly, the compression ratio of more than 4:1 that xz achieved is interesting. This compressor is popular for packaging software. A software package could be downloaded hundreds of millions of times, so it's worth the effort to find the best compression ratio possible given the bandwidth and the diversity of network connections involved in software distribution. That goal can excuse the 2 MB/s rate xz managed during compression.
Do bear in mind that some decompression tools can require an excessive amount of memory and compute power. The high-ratio compression systems can suffer from this greatly so this trade-off should be considered as well.
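A quick way to see these trade-offs first hand is to compare xz presets on the same input. A rough sketch, with silesia.tar again standing in for whatever file you care about and the output filenames being my own placeholders (timings and sizes will vary with hardware and xz version):

$ time xz -1 -c silesia.tar > silesia.fast.xz     # low preset: quick, modest ratio
$ time xz -9 -e -c silesia.tar > silesia.best.xz  # "extreme" preset: slow, best ratio
$ ls -lh silesia.fast.xz silesia.best.xz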
Sort, then Compress
Many lossless compression systems can be aided by being fed sorted data. The sliding windows these compressors use rarely cover the entire dataset, and the more clustered the values, the easier it is to detect repeating patterns. Most SQL-based systems don't guarantee the order of rows returned without an ORDER BY clause, so re-sorting the data on disk should be of little concern to the queries run against it (with Redshift's sort key a notable exception).
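You can see the effect with nothing more than a word list and gzip. A quick sketch, assuming a dictionary file at /usr/share/dict/words and a gzip recent enough to support -k; the sorted copy should come out noticeably smaller because neighbouring words share long prefixes:

$ shuf /usr/share/dict/words > unsorted.txt   # shuffle the word list
$ sort unsorted.txt > sorted.txt              # same content, sorted
$ gzip -9 -k unsorted.txt sorted.txt
$ ls -l unsorted.txt.gz sorted.txt.gz         # compare compressed sizes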
Below I've set up a standalone Hadoop system running HDFS and Presto using the instructions from my Hadoop 3 Install Guide. I've taken the first six ORC files, representing 120 million of the 1.1 billion taxi rides that have taken place in New York City over six years. This post describes how I produced the dataset in CSV format, and I've since run a large number of benchmarks in which that CSV data was converted into ORC format before examining query performance.
Below are a few modifications I've made to the Presto configuration from my stand-alone guide. Sorting 120M rows of data in Presto requires raising the memory limits to at least 8 GB.
$ sudo vi /opt/presto/etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=8GB
query.max-memory-per-node=8GB
query.max-total-memory-per-node=8GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
$ sudo /opt/presto/bin/launcher restart
Next, I'll create a warehouse folder. The tables created below will be managed via the Hive connector, which by default stores the tables it helps create in /user/hive/warehouse on HDFS.
$ hdfs dfs -mkdir -p /user/hive/warehouse
I'll then copy the first six ORC files I have saved in my home folder onto HDFS.
$ hdfs dfs -mkdir /trips_orc
$ hdfs dfs -copyFromLocal \
~/orc/00000[0-5]_0 \
/trips_orc/
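It's worth checking that the files landed and noting how much space the GZIP-compressed ORC input occupies; both of the commands below are standard HDFS CLI calls:

$ hdfs dfs -ls -h /trips_orc/
$ hdfs dfs -du -s -h /trips_orc/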
I'll create a schema for the trips_orc table in Hive. This lets Presto do schema-on-read and understand the column layout of the ORC files.
$ hive
CREATE EXTERNAL TABLE trips_orc (
trip_id INT,
vendor_id STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag STRING,
rate_code_id SMALLINT,
pickup_longitude DOUBLE,
pickup_latitude DOUBLE,
dropoff_longitude DOUBLE,
dropoff_latitude DOUBLE,
passenger_count SMALLINT,
trip_distance DOUBLE,
fare_amount DOUBLE,
extra DOUBLE,
mta_tax DOUBLE,
tip_amount DOUBLE,
tolls_amount DOUBLE,
ehail_fee DOUBLE,
improvement_surcharge DOUBLE,
total_amount DOUBLE,
payment_type STRING,
trip_type SMALLINT,
pickup STRING,
dropoff STRING,
cab_type STRING,
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel STRING,
pickup_borocode SMALLINT,
pickup_boroname STRING,
pickup_ct2010 STRING,
pickup_boroct2010 STRING,
pickup_cdeligibil STRING,
pickup_ntacode STRING,
pickup_ntaname STRING,
pickup_puma STRING,
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel STRING,
dropoff_borocode SMALLINT,
dropoff_boroname STRING,
dropoff_ct2010 STRING,
dropoff_boroct2010 STRING,
dropoff_cdeligibil STRING,
dropoff_ntacode STRING,
dropoff_ntaname STRING,
dropoff_puma STRING
) STORED AS orc
LOCATION '/trips_orc/';
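Before kicking off the larger jobs below, a quick row count is a cheap way to confirm Presto can read the ORC files via the Hive Metastore. The --execute flag runs a single statement with the same connection settings used elsewhere in this post; the count should come back at roughly 120 million.

$ presto \
    --server localhost:8080 \
    --catalog hive \
    --schema default \
    --execute "SELECT COUNT(*) FROM trips_orc;"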
The following SQL will create four new tables, each stored in GZIP-compressed ORC format and each sorted on a different field. Note that I'm only keeping four columns from the original table for the sake of both time and memory consumption during this operation.
$ vi sort.sql
CREATE TABLE sorted_by_vendor_id
WITH (format='ORC') AS
SELECT trip_id,
vendor_id,
pickup_datetime,
pickup_longitude
FROM trips_orc
ORDER BY vendor_id;
CREATE TABLE sorted_by_trip_id
WITH (format='ORC') AS
SELECT trip_id,
vendor_id,
pickup_datetime,
pickup_longitude
FROM trips_orc
ORDER BY trip_id;
CREATE TABLE sorted_by_pickup_datetime
WITH (format='ORC') AS
SELECT trip_id,
vendor_id,
pickup_datetime,
pickup_longitude
FROM trips_orc
ORDER BY pickup_datetime;
CREATE TABLE sorted_by_pickup_longitude
WITH (format='ORC') AS
SELECT trip_id,
vendor_id,
pickup_datetime,
pickup_longitude
FROM trips_orc
ORDER BY pickup_longitude;
I'll execute the above with Presto.
$ presto \
--server localhost:8080 \
--catalog hive \
--schema default \
--file sort.sql
The resulting table sizes were as follows:
GB | sorted by
----------------------
0.91 | pickup_longitude
1.06 | trip_id
1.12 | pickup_datetime
1.33 | vendor_id
The largest table is 1.46x bigger than the smallest.
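One way to read those sizes off HDFS, assuming the tables were written to the default warehouse location created earlier:

$ hdfs dfs -du -h /user/hive/warehouse/ | grep sorted_by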
There is an argument that one should test the first 50K-odd records of any table against every plausible sort key before compressing a given dataset. There is no way to speed up a solid-state drive by 1.46x, but the sort above happily reduced the throughput requirements by that amount.
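A rough sketch of that sampling idea, reusing the Presto CLI from above; the sort_test_* table names are placeholders of my own, and a 50,000-row sample won't perfectly predict full-table ratios, but it's cheap to run:

$ for col in trip_id vendor_id pickup_datetime pickup_longitude; do
      presto \
        --server localhost:8080 \
        --catalog hive \
        --schema default \
        --execute "CREATE TABLE sort_test_${col} WITH (format='ORC') AS
                   SELECT trip_id, vendor_id, pickup_datetime, pickup_longitude
                   FROM (SELECT * FROM trips_orc LIMIT 50000)
                   ORDER BY ${col};"
  done

Comparing the resulting sort_test_* directory sizes with hdfs dfs -du -h, as above, gives a quick ranking of candidate sort keys before committing to a full sort.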
If sorting your entire dataset exceeds your cluster's memory capacity, you could look to sort each table partition one at a time instead. Don't let perfect be the enemy of good: reclaiming storage and throughput capacity will help make the most of your cluster's hardware.
An Attack Vector
Compression has also been an attack vector for decades. There have been "Compression Fuzzing" efforts to help uncover vulnerabilities, but given that decompression utilities are some of the most widely deployed software in the world, any exploit can have a global impact.
A few examples: the "ZIP of death" is a 42 KB ZIP file that extracts to 4.5 PB of data, and unsuspecting web applications that decompress user-submitted content need to be hardened against this sort of attack. BREACH exploited the compression of HTTP responses served over HTTPS, and CRIME, disclosed around the same time, attacked compression in HTTPS and SPDY connections.
Incidents like these have compression designers contemplating what attack vectors could result from forthcoming dictionary encoding methods.