In February I published a blog post on normalising 1.1 billion taxi journeys made in New York City over six years into CSV files and then loading them into Redshift. That exercise ran on a free-tier instance, which only had enough space to import around 200 million records. In this blog post I'll launch a ds2.xlarge instance, import all 1.1 billion records and run benchmark queries similar to the ones I've run on BigQuery, Elasticsearch, PostgreSQL and Presto on Dataproc.
104 GB of Data on S3
I'll start out by uploading the 104 GB of compressed CSV files I generated in my original Redshift blog post to S3. The files took 13 hours and 18 minutes to upload, averaging out at 17.38 Mbit/s.
I've benchmarked my internet connection at 49 Mbit/s up against a server in a neighbouring country and routinely hit 33 Mbit/s+ with servers in Ireland. It would be nice to see S3 able to saturate my upload capacity at some point.
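The uploads can be done with s3cmd, the same tool used for the manifest further down. A minimal sketch, assuming the gzip files sit in a local csv/ folder and s3cmd has already been configured with your credentials:
$ s3cmd put csv/trips_x*.csv.gz s3://<s3_bucket>/csv/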
Redshift Up & Running
Redshift needs storage capacity of around 2.5x the size of the uncompressed dataset in order to import and sort the data in one go. The uncompressed data comes to around 500 GB, so at 2.5x that's roughly 1.25 TB, and the 2 TB of mechanical storage offered by the ds2.xlarge cluster type should fit the data comfortably.
The cluster comes with 14 EC2 compute units, 31 GB of RAM and "moderate" I/O performance of 0.4 GB/s from what I can gather in Amazon's documentation. I launched the cluster in eu-west-1c.
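For anyone scripting this, a cluster along these lines can also be created from the AWS CLI. The following is a sketch with placeholder credentials, not the exact way I launched mine:
$ aws redshift create-cluster \
    --cluster-identifier trips \
    --node-type ds2.xlarge \
    --cluster-type single-node \
    --db-name trips \
    --availability-zone eu-west-1c \
    --master-username <master_username> \
    --master-user-password <master_password>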
I've been told an SSD-based cluster could be 2-3x faster. The cheapest cluster type that could store up to 1.5 TB of data on SSDs is the dc1.8xlarge which is over 5.5x more expensive.
Importing into Redshift
I've created a manifest file listing all the files I've uploaded to S3. Redshift will use this list to import all of the data into a single table. The files are listed in a random order so that, if there are any pockets of problematic data, they stand a better chance of being spotted sooner rather than later (a sketch for generating a shuffled manifest follows the listing below).
$ vi trips.manifest
{
"entries": [
{"url": "s3://<s3_bucket>/csv/trips_xbw.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xae.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xcd.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xau.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbu.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbe.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xap.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaz.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xao.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbi.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xba.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbt.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xcb.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xah.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbm.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xby.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbc.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbp.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xai.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbh.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xab.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xat.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xay.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbx.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbv.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xag.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xac.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xan.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbz.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbl.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xas.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbg.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xax.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbr.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaw.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xam.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbb.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbn.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbo.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xal.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xca.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbd.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xar.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaa.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbk.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xak.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xav.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbs.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xcc.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xad.csv.gz", "mandatory": true}
]
}
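The entries above don't need to be typed out by hand. The following is a sketch of one way to generate them, assuming the gzip files sit in a local csv/ folder; the trailing comma on the last entry needs removing by hand:
$ ls csv/trips_x*.csv.gz | \
    shuf | \
    xargs -n1 basename | \
    awk '{printf "    {\"url\": \"s3://<s3_bucket>/csv/%s\", \"mandatory\": true},\n", $1}'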
The following will upload the manifest file to S3.
$ s3cmd put trips.manifest s3://<s3_bucket>/csv/
I'll set environment variables containing the username and password for the cluster that I defined when I launched it.
$ read MASTER_USERNAME
$ read MASTER_PASSWORD
$ export MASTER_USERNAME
$ export MASTER_PASSWORD
I'll then use PostgreSQL's CLI tool to connect to the cluster.
$ PGPASSWORD=$MASTER_PASSWORD \
psql -h trips.c6rssls4aefs.eu-west-1.redshift.amazonaws.com \
-p 5439 \
-U $MASTER_USERNAME trips
The following is the schema for the trips table. The encodings used were suggested by Redshift's ANALYZE COMPRESSION command. The data types on each column were picked to be as granular as the underlying data would allow. The original dataset has some data quality issues, so many of the DECIMAL fields allow for a much wider range of values than a well-groomed dataset would need. The sort key is left unencoded, which is in line with Amazon's guidance.
CREATE TABLE trips (
trip_id INTEGER NOT NULL DISTKEY ENCODE LZO,
vendor_id VARCHAR(3) ENCODE LZO,
-- Sort keys shouldn't be encoded (compressed)
pickup_datetime TIMESTAMP NOT NULL,
dropoff_datetime TIMESTAMP NOT NULL ENCODE LZO,
store_and_fwd_flag VARCHAR(1) ENCODE RUNLENGTH,
rate_code_id SMALLINT ENCODE LZO,
pickup_longitude DECIMAL(18,14) ENCODE MOSTLY8,
pickup_latitude DECIMAL(18,14) ENCODE MOSTLY8,
dropoff_longitude DECIMAL(18,14) ENCODE MOSTLY8,
dropoff_latitude DECIMAL(18,14) ENCODE MOSTLY8,
passenger_count SMALLINT NOT NULL DEFAULT '0' ENCODE LZO,
trip_distance DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
fare_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
extra DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
mta_tax DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
tip_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
tolls_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
ehail_fee DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
improvement_surcharge DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
total_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
payment_type VARCHAR(3) ENCODE RUNLENGTH,
trip_type SMALLINT ENCODE LZO,
pickup VARCHAR(50) ENCODE LZO,
dropoff VARCHAR(50) ENCODE LZO,
cab_type VARCHAR(6) NOT NULL ENCODE LZO,
precipitation SMALLINT DEFAULT '0' ENCODE LZO,
snow_depth SMALLINT DEFAULT '0' ENCODE LZO,
snowfall SMALLINT DEFAULT '0' ENCODE LZO,
max_temperature SMALLINT DEFAULT '0' ENCODE LZO,
min_temperature SMALLINT DEFAULT '0' ENCODE LZO,
average_wind_speed SMALLINT DEFAULT '0' ENCODE LZO,
pickup_nyct2010_gid SMALLINT ENCODE LZO,
pickup_ctlabel VARCHAR(10) ENCODE LZO,
pickup_borocode SMALLINT ENCODE LZO,
pickup_boroname VARCHAR(13) ENCODE LZO,
pickup_ct2010 VARCHAR(6) ENCODE LZO,
pickup_boroct2010 VARCHAR(7) ENCODE LZO,
pickup_cdeligibil VARCHAR(1) ENCODE RUNLENGTH,
pickup_ntacode VARCHAR(4) ENCODE LZO,
pickup_ntaname VARCHAR(56) ENCODE LZO,
pickup_puma VARCHAR(4) ENCODE LZO,
dropoff_nyct2010_gid SMALLINT ENCODE LZO,
dropoff_ctlabel VARCHAR(10) ENCODE LZO,
dropoff_borocode SMALLINT ENCODE LZO,
dropoff_boroname VARCHAR(13) ENCODE LZO,
dropoff_ct2010 VARCHAR(6) ENCODE LZO,
dropoff_boroct2010 VARCHAR(7) ENCODE LZO,
dropoff_cdeligibil VARCHAR(1) ENCODE RUNLENGTH,
dropoff_ntacode VARCHAR(4) ENCODE LZO,
dropoff_ntaname VARCHAR(56) ENCODE LZO,
dropoff_puma VARCHAR(4) ENCODE LZO,
primary key(trip_id)
) sortkey(pickup_datetime);
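For reference, the per-column encoding suggestions mentioned above come from running ANALYZE COMPRESSION against a table that already holds a sample of the data; it reports a recommended encoding for each column. A minimal sketch; be aware that, as far as I'm aware, it takes an exclusive lock on the table while it runs:
ANALYZE COMPRESSION trips;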
The following will start the importing process.
COPY trips
FROM 's3://<s3_bucket>/csv/trips.manifest'
CREDENTIALS
'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER ','
EMPTYASNULL
ESCAPE
GZIP
MANIFEST
MAXERROR 100000
REMOVEQUOTES
TRIMBLANKS
TRUNCATECOLUMNS;
The above took 4 hours and 33 minutes to import.
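While a COPY of this size is running, its progress can be checked from a second psql session via the stv_load_state system table. A sketch of the sort of query I'd use:
SELECT slice,
       num_files,
       num_files_complete,
       pct_complete
FROM stv_load_state;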
I ran a VACUUM on the data after the import. The following completed in 1 minute and 42 seconds.
VACUUM;
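To sanity-check the result of the load, the table's row count, size and how much of it remains unsorted can be pulled from the svv_table_info system view. A sketch:
SELECT "table",
       tbl_rows,
       size,     -- size in 1 MB blocks
       unsorted  -- percentage of rows not in sort key order
FROM svv_table_info
WHERE "table" = 'trips';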
Erroneous Values
The original PostgreSQL schema this data came from declared its DECIMAL columns as plain NUMERIC, without any precision or scale, meaning they'd accept numeric values of any magnitude. In other words, the schema wasn't enforcing any sort of value range control. When I first started this import many of the fields, such as trip_distance, were defined as DECIMAL(6,3), but errors from values such as 1236007.30 would cause Redshift to abort the import after 45 - 50 minutes.
The dataset combines data from different types of taxis operating in New York City, some of which report less data than others. I had to remove some NOT NULL constraints in order for those records to import as well. Again, I didn't find out about these problems until 45 - 50 minutes into the import.
I first started importing the data at 10:30 UTC+3 and it wasn't until 16:05 that I had a schema together that would accommodate these data quality issues. To top it off, I was paying for the Redshift cluster during every one of the failed imports.
The following is the query I ran to summarise the errors logged during a failed import (up to the 100,000 allowed by MAXERROR), along with example output from one of the failed loads.
SELECT colname, err_reason, COUNT(*)
FROM stl_load_errors
GROUP BY 1, 2
ORDER BY 1, 2 DESC;
colname | rate_code_id
err_reason | Missing data for not-null field
count | 100000
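The same stl_load_errors table also keeps the raw field values for each rejected record, which is handy when working out which DECIMAL ranges and NOT NULL constraints need relaxing. A sketch:
SELECT starttime,
       filename,
       line_number,
       colname,
       raw_field_value,
       err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;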
It's important to keep an eye on the Redshift console during a load job. psql will just sit there without reporting anything if the load fails; without seeing the query terminate in the console I wouldn't have known something was wrong.
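An alternative to watching the console is to check on running queries from a second psql session via the stv_recents system table. A sketch:
SELECT pid,
       user_name,
       status,
       duration,
       query
FROM stv_recents
WHERE status = 'Running';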
It's even more important to make sure your data is in very good shape before you present it to Redshift. Datasets like the one I'm working with are going to be loaded with data quality issues that need to be taken care of prior to running any LOAD jobs on Redshift.
Benchmarking Redshift
I ran each of the queries below multiple times in a row. The times reported were the fastest achieved.
\timing on
The following completed in 48 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY 1;
The following completed in 59 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY 1;
The following completed in 1 minute and 28 seconds.
SELECT passenger_count,
extract(year from pickup_datetime),
count(*)
FROM trips
GROUP BY 1,
2;
The following completed in 1 minute and 57 seconds.
SELECT passenger_count,
extract(year from pickup_datetime),
round(trip_distance),
count(*)
FROM trips
GROUP BY 1,
2,
3
ORDER BY 2,
4 desc;
Costs
I had the Redshift cluster up for a total of 10 hours, which set me back $9.50. Storing the 104 GB of compressed CSV data on S3 works out at $3.12 if I were to leave it there for a month. On top of those two costs is the 20% VAT that needs to be collected as I'm using a personal account in the UK.