Lossless compression is the act of making a dataset smaller than its original form while still being able to transform the compressed version back into the original source material. This contrasts with lossy compression, which produces a derivative dataset that, while still being something humans can appreciate, cannot recreate the original source material.
In a world where storage is cheap, why do we still compress data? Because datasets are growing at a faster rate than storage throughput or network bandwidth. 100 MB of data can be compressed to 10 MB in less than a second and then transferred in under a second over a 100 Mbps connection. Most decompression systems run at 100 to 500 MB/s, so the decompression overhead adds little to the wall clock time needed to deliver the original 100 MB of content. Not only does this make the best use of the network bottleneck, it also frees up capacity for other payloads.
Also, consider that if you can only read at 100 MB/s off a mechanical drive but your CPU can decompress data at ~500 MB/s then, given a suitably compressible dataset, the mechanical drive can effectively provide 5x the throughput you'd otherwise expect thanks to compression.
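As a rough back-of-the-envelope model, assuming the compressed stream is read from storage and decompressed on the fly (with decompression speed measured on the decompressed output):

effective throughput ≈ min(storage read speed × compression ratio, decompression speed)

With a 5:1 ratio that works out to min(100 MB/s × 5, 500 MB/s) = 500 MB/s of logical data per second, versus 100 MB/s without compression.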
The world of compression is a rabbit hole of intellectual curiosity. In this post, I'll describe the highlights I've come across while trying to improve compression performance in database systems I deploy.
From Entropy to Lempel-Ziv
Entropy, an information theory term coined by Claude Shannon in 1948, describes the minimum number of bits, on average, needed to encode each symbol of a dataset. Think of entropy as the number of yes/no questions that would need to be answered in order to describe any one piece of information within a dataset.
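For a source emitting symbols x with probabilities p(x), Shannon's measure is:

H = - Σ p(x) · log2(p(x))

A fair coin flip carries 1 bit of entropy (one yes/no question), a uniformly random byte carries 8 bits per byte, and a file made up of a single repeated byte value carries close to 0 bits per byte, which is part of why it compresses so well.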
Compression aims to get as close to that lower bound as possible by transforming data in such a way that patterns emerge, exposing duplication within a given dataset.
Replacing long, repeated sequences with references to earlier occurrences was the key concept behind Abraham Lempel and Yaakov Ziv's work on the LZ77 and LZ78 algorithms, published in 1977 and 1978 respectively. Their work has had a huge impact on the field of lossless compression. Many algorithms used in popular software today are variations and tweaks on those two algorithms developed in the late 1970s.
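As a simplified illustration (not the exact LZ77 bit layout), a repeated phrase can be replaced by a back-reference to its earlier occurrence:

to be or not to be  ->  to be or not (distance=13, length=5)

The decoder rebuilds the second "to be" by copying five characters starting 13 characters back.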
Below is a family tree of related dictionary-based compression algorithms which trace their lineages back to LZ77 and LZ78. I've also included the year of the earliest publication I could find for each of them and a few noteworthy software and file format implementations.
├── LZ77 (1977)
│   ├── LZR (1981)
│   ├── LZSS (1982, WinRAR)
│   │   ├── LZB (1987)
│   │   └── LZH (1987)
│   ├── DEFLATE (1993, ZIP, gzip, PNG, zlib)
│   │   └── DEFLATE64 (1999)
│   ├── ROLZ (1991)
│   │   ├── LZP (1995)
│   │   └── LZRW1 (1991)
│   │       ├── LZJB (1998, ZFS)
│   │       └── LZRW1-A, 2, 3, 3-A, 4, 5 (<1996 and onward)
│   ├── LZS (1994, Cisco IOS)
│   ├── LZX (1995, CHM files)
│   ├── LZO (1996, Nintendo 64, PlayStation)
│   ├── LZMA (1998, 7ZIP, xz)
│   │   ├── LZMA2 (2009)
│   │   └── LZHAM (2009)
│   ├── Statistical Lempel-Ziv (2001)
│   ├── SLZ (<2009)
│   ├── LZ4 (2011, Hadoop, ZFS, bcolz)
│   │   └── Zstandard (2015, Redshift, ORC, Parquet, RocksDB, bcolz)
│   ├── Snappy (2011, ORC, Parquet, MariaDB, Cassandra, MongoDB, Lucene, bcolz)
│   └── Brotli (2013, Google Chrome, Firefox, nginx)
│       └── LZFSE (2015, iOS, Mac OSX)
└── LZ78 (1978)
    ├── LZW (1984, PKZIP, GIF, TIFF, PDF)
    │   ├── LZMW (1985)
    │   │   └── LZAP (1988)
    │   ├── LZC (1985)
    │   │   └── LZT (1987)
    │   └── LZWL (2006)
    └── LZJ (1985)
In contrast, these are a few non-dictionary-based compression systems.
├── PPM (1984)
│   └── PAQ (2002)
│       └── 20+ Variants (2003-2009)
└── bzip2 (1996)
As you can see, the compression world may not be fast-moving but rarely does a year pass without some sort of iteration and improvement.
Lossless Compression Benchmarks
In 2015, Przemyslaw Skibinski started a project on GitHub called lzbench. This project aims to compile the source code of a wide range of lossless compression algorithms into a single binary and then benchmark them against various workloads. The project's README contains benchmark results for a 212 MB file extracted from the Silesia compression corpus. Each of the compression methods used is run with a variety of settings; this gives an idea of what sort of results could be expected from a performance tuning exercise.
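If you want to produce comparable numbers on your own hardware, the following is a rough sketch of how lzbench is built and run. The repository location and the -e compressor/level syntax below reflect the project at the time of writing and may differ between versions; silesia.tar is assumed to be a locally prepared tarball of the Silesia corpus.

$ git clone https://github.com/inikep/lzbench.git
$ cd lzbench
$ make -j

$ # silesia.tar: a tarball of the Silesia corpus files, prepared separately
$ ./lzbench -ezstd,1,19/lz4/brotli silesia.tar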
In the README's benchmark results, blosclz, density, lz4, lz4fast, lzo1x and shrinker were all tuned to compress at more than 400 MB/s while maintaining at least a 5:3 compression ratio.
The benchmark shows memcpy, which simply copied the data from one memory location to another, running at ~8.5 GB/s. Given the computational overheads of each of the compressors, this transfer rate can be seen as a ceiling on the hardware used (C-Blosc claims it can get past memory-bound performance issues but I've yet to study their claim in detail). The density and lz4fast benchmarks stand out: their compression speeds of 800 MB/s and 785 MB/s, respectively, were the fastest of any compressor able to achieve a 3:2 compression ratio.
Of the compressors able to achieve a 3:1 compression ratio, only Zstandard managed it while compressing at 88 MB/s. More than 75% of the other compressors that reached that ratio couldn't do so at more than 15 MB/s, some 6x slower than Zstandard.
The vast majority of decompressors could break the 250 MB/s barrier, 25% of them broke the 1 GB/s barrier and a few, including LZ4, could be tuned to decompress in excess of 2 GB/s. Decompression at that rate would demand RAM drives, RAID 0 or NVMe storage in order to keep the decompressor fed. The SATA III interface found on most systems tops out at roughly 600 MB/s, so it would be a bottleneck in need of addressing if it were part of the storage pipeline.
Lastly, the compression ratio of more than 4:1 that xz achieved is interesting. This compressor is popular for packaging software. A software package could be downloaded hundreds of millions of times, so it's worth the effort to find the best compression ratio possible given the bandwidth and the diversity of network connections involved in software distribution. That goal can excuse the 2 MB/s rate xz managed during compression.
Do bear in mind that some decompression tools can require an excessive amount of memory and compute power. The high-ratio compression systems can suffer from this greatly so this trade-off should be considered as well.
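A quick way to see these trade-offs first hand is to compare xz presets on the same input. A rough sketch, with silesia.tar again standing in for whatever file you care about and the output filenames being my own placeholders (timings and sizes will vary with hardware and xz version):

$ time xz -1 -c silesia.tar > silesia.fast.xz     # low preset: quick, modest ratio
$ time xz -9 -e -c silesia.tar > silesia.best.xz  # "extreme" preset: slow, best ratio
$ ls -lh silesia.fast.xz silesia.best.xz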
Sort, then Compress
Many lossless compression systems can be aided by being fed sorted data. The sliding windows these compressors use rarely cover the entire dataset, and the more clustered the values, the easier it is to detect repeating patterns. Most SQL-based systems don't guarantee the order of rows returned without an ORDER BY clause, so re-sorting the data on disk should be of little concern to the queries run against it (with Redshift's sort key a notable exception).
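You can see the effect with nothing more than a word list and gzip. A quick sketch, assuming a dictionary file at /usr/share/dict/words and a gzip recent enough to support -k; the sorted copy should come out noticeably smaller because neighbouring words share long prefixes:

$ shuf /usr/share/dict/words > unsorted.txt   # shuffle the word list
$ sort unsorted.txt > sorted.txt              # same content, sorted
$ gzip -9 -k unsorted.txt sorted.txt
$ ls -l unsorted.txt.gz sorted.txt.gz         # compare compressed sizes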
Below I've set up a standalone Hadoop system running HDFS and Presto using the instructions from my Hadoop 3 Install Guide. I've taken the first six ORC files, representing 120 million of the 1.1 billion taxi rides that have taken place in New York City over six years. This post describes how I produced the dataset in CSV format, and I've since run a large number of benchmarks in which that CSV data was converted into ORC format before examining query performance.
Below are a few modifications I've made to the Presto configuration from my stand-alone guide. Sorting 120M rows of data in Presto requires raising the memory limits to at least 8 GB.
$ sudo vi /opt/presto/etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=8GB
query.max-memory-per-node=8GB
query.max-total-memory-per-node=8GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
$ sudo /opt/presto/bin/launcher restart
Next, I'll create a warehouse folder. The tables created below will be managed via the Hive connector, which by default stores the tables it helps create in /user/hive/warehouse on HDFS.
$ hdfs dfs -mkdir -p /user/hive/warehouse
I'll then copy the first six ORC files I have saved in my home folder onto HDFS.
$ hdfs dfs -mkdir /trips_orc
$ hdfs dfs -copyFromLocal \
~/orc/00000[0-5]_0 \
/trips_orc/
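It's worth checking that the files landed and noting how much space the GZIP-compressed ORC input occupies; both of the commands below are standard HDFS CLI calls:

$ hdfs dfs -ls -h /trips_orc/
$ hdfs dfs -du -s -h /trips_orc/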
I'll create a schema for the trips_orc table in Hive. This lets Presto do schema-on-read and understand the column layout of the ORC files.
$ hive
CREATE EXTERNAL TABLE trips_orc (
trip_id INT,
vendor_id STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag STRING,
rate_code_id SMALLINT,
pickup_longitude DOUBLE,
pickup_latitude DOUBLE,
dropoff_longitude DOUBLE,
dropoff_latitude DOUBLE,
passenger_count SMALLINT,
trip_distance DOUBLE,
fare_amount DOUBLE,
extra DOUBLE,
mta_tax DOUBLE,
tip_amount DOUBLE,
tolls_amount DOUBLE,
ehail_fee DOUBLE,
improvement_surcharge DOUBLE,
total_amount DOUBLE,
payment_type STRING,
trip_type SMALLINT,
pickup STRING,
dropoff STRING,
cab_type STRING,
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel STRING,
pickup_borocode SMALLINT,
pickup_boroname STRING,
pickup_ct2010 STRING,
pickup_boroct2010 STRING,
pickup_cdeligibil STRING,
pickup_ntacode STRING,
pickup_ntaname STRING,
pickup_puma STRING,
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel STRING,
dropoff_borocode SMALLINT,
dropoff_boroname STRING,
dropoff_ct2010 STRING,
dropoff_boroct2010 STRING,
dropoff_cdeligibil STRING,
dropoff_ntacode STRING,
dropoff_ntaname STRING,
dropoff_puma STRING
) STORED AS orc
LOCATION '/trips_orc/';
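Before kicking off the larger jobs below, a quick row count is a cheap way to confirm Presto can read the ORC files via the Hive Metastore. The --execute flag runs a single statement with the same connection settings used elsewhere in this post; the count should come back at roughly 120 million.

$ presto \
    --server localhost:8080 \
    --catalog hive \
    --schema default \
    --execute "SELECT COUNT(*) FROM trips_orc;"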
The following SQL will create four new tables, each stored in GZIP-compressed ORC format and each sorted on a different field. Note that I'm only keeping four columns from the original table for the sake of both time and memory consumption during this operation.
$ vi sort.sql
CREATE TABLE sorted_by_vendor_id
WITH (format='ORC') AS
SELECT trip_id,
vendor_id,
pickup_datetime,
pickup_longitude
FROM trips_orc
ORDER BY vendor_id;
CREATE TABLE sorted_by_trip_id
WITH (format='ORC') AS
SELECT trip_id,
vendor_id,
pickup_datetime,
pickup_longitude
FROM trips_orc
ORDER BY trip_id;
CREATE TABLE sorted_by_pickup_datetime
WITH (format='ORC') AS
SELECT trip_id,
vendor_id,
pickup_datetime,
pickup_longitude
FROM trips_orc
ORDER BY pickup_datetime;
CREATE TABLE sorted_by_pickup_longitude
WITH (format='ORC') AS
SELECT trip_id,
vendor_id,
pickup_datetime,
pickup_longitude
FROM trips_orc
ORDER BY pickup_longitude;
I'll execute the above with Presto.
$ presto \
--server localhost:8080 \
--catalog hive \
--schema default \
--file sort.sql
The resulting table sizes were as follows:
GB | sorted by
----------------------
0.91 | pickup_longitude
1.06 | trip_id
1.12 | pickup_datetime
1.33 | vendor_id
The largest table is 1.46x bigger than the smallest.
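One way to read those sizes off HDFS, assuming the tables were written to the default warehouse location created earlier:

$ hdfs dfs -du -h /user/hive/warehouse/ | grep sorted_by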
There is an argument that one should test the first 50K-odd records of any table against every plausible sort key before compressing a given dataset. There is no way to speed up a solid-state drive by 1.46x, but the sort above happily reduced the throughput requirements by that amount.
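A rough sketch of that sampling idea, reusing the Presto CLI from above; the sort_test_* table names are placeholders of my own, and a 50,000-row sample won't perfectly predict full-table ratios, but it's cheap to run:

$ for col in trip_id vendor_id pickup_datetime pickup_longitude; do
      presto \
        --server localhost:8080 \
        --catalog hive \
        --schema default \
        --execute "CREATE TABLE sort_test_${col} WITH (format='ORC') AS
                   SELECT trip_id, vendor_id, pickup_datetime, pickup_longitude
                   FROM (SELECT * FROM trips_orc LIMIT 50000)
                   ORDER BY ${col};"
  done

Comparing the resulting sort_test_* directory sizes with hdfs dfs -du -h, as above, gives a quick ranking of candidate sort keys before committing to a full sort.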
If sorting your entire dataset exceeds your cluster's memory capacity, you could look to sort each table partition one at a time instead. Don't let perfect be the enemy of good: reclaiming storage and throughput capacity will help make the most of your cluster's hardware.
An Attack Vector
Compression has also been an attack vector for decades. There have been "Compression Fuzzing" efforts to help uncover vulnerabilities, but given that decompression utilities are some of the most widely deployed software in the world, any exploit can have a global impact.
A few examples: the "ZIP of death" is a 42 KB ZIP file that extracts to 4.5 PB of data, and unsuspecting web applications that decompress user-submitted content need to be hardened against this sort of attack. BREACH exploited the compression of HTTP responses served over HTTPS, and CRIME, disclosed around the same time, attacked compression in HTTPS and SPDY connections.
Incidents like these have compression designers contemplating what attack vectors could result from forthcoming dictionary encoding methods.