In February I published a blog post on normalising 1.1 billion taxi journeys made in New York City over six years into CSV files and then loading them into Redshift. That exercise ran on a free-tier instance, which only had enough space to import around 200 million records. In this blog post I'll launch a ds2.xlarge instance, import all 1.1 billion records and run benchmark queries similar to the ones I've run on BigQuery, Elasticsearch, PostgreSQL and Presto on Dataproc.
104 GB of Data on S3
I'll start out by uploading the 104 GB of compressed CSV files I generated in my original Redshift blog post to S3. The files took 13 hours and 18 minutes to upload, averaging out at 17.38 Mbit/s.
I've benchmarked my internet connection at 49 Mbit/s up against a server in a neighbouring country and routinely hit 33 Mbit/s+ with servers in Ireland. It would be nice to see S3 able to saturate my upload capacity at some point.
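The uploads can be done with s3cmd, the same tool used for the manifest further down. A minimal sketch, assuming the gzip files sit in a local csv/ folder and s3cmd has already been configured with your credentials:
$ s3cmd put csv/trips_x*.csv.gz s3://<s3_bucket>/csv/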
Redshift Up & Running
Redshift needs storage capacity of around 2.5x the size of the uncompressed dataset in order to import and sort the data in one go. The uncompressed data comes to around 500 GB, so at 2.5x that's roughly 1.25 TB, and the 2 TB of mechanical storage offered by the ds2.xlarge cluster type should fit the data comfortably.
The cluster comes with 14 EC2 compute units, 31 GB of RAM and "moderate" I/O performance of 0.4 GB/s from what I can gather in Amazon's documentation. I launched the cluster in eu-west-1c.
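For anyone scripting this, a cluster along these lines can also be created from the AWS CLI. The following is a sketch with placeholder credentials, not the exact way I launched mine:
$ aws redshift create-cluster \
    --cluster-identifier trips \
    --node-type ds2.xlarge \
    --cluster-type single-node \
    --db-name trips \
    --availability-zone eu-west-1c \
    --master-username <master_username> \
    --master-user-password <master_password>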
I've been told an SSD-based cluster could be 2-3x faster. The cheapest cluster type that could store up to 1.5 TB of data on SSDs is the dc1.8xlarge which is over 5.5x more expensive.
Importing into Redshift
I've created a manifest file listing all the files I've uploaded to S3. Redshift will use this list to import all of the data into a single table. The files are listed in a random order so that, if there are any pockets of problematic data, they stand a better chance of being spotted sooner rather than later (a sketch for generating a shuffled manifest follows the listing below).
$ vi trips.manifest
{
"entries": [
{"url": "s3://<s3_bucket>/csv/trips_xbw.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xae.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xcd.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xau.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbu.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbe.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xap.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaz.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xao.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbi.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xba.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbt.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xcb.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xah.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbm.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xby.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbc.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbp.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xai.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbh.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xab.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xat.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xay.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbx.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbv.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xag.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xac.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xan.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbz.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbl.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xas.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbg.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xax.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbr.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaw.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xam.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbb.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbn.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbo.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xal.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xca.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbd.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xar.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xaa.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbk.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xak.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xav.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbs.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xbq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xcc.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_xad.csv.gz", "mandatory": true}
]
}
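The entries above don't need to be typed out by hand. The following is a sketch of one way to generate them, assuming the gzip files sit in a local csv/ folder; the trailing comma on the last entry needs removing by hand:
$ ls csv/trips_x*.csv.gz | \
    shuf | \
    xargs -n1 basename | \
    awk '{printf "    {\"url\": \"s3://<s3_bucket>/csv/%s\", \"mandatory\": true},\n", $1}'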
The following will upload the manifest file to S3.
$ s3cmd put trips.manifest s3://<s3_bucket>/csv/
I'll set environment variables containing the username and password for the cluster that I defined when I launched it.
$ read MASTER_USERNAME
$ read MASTER_PASSWORD
$ export MASTER_USERNAME
$ export MASTER_PASSWORD
I'll then use PostgreSQL's CLI tool to connect to the cluster.
$ PGPASSWORD=$MASTER_PASSWORD \
psql -h trips.c6rssls4aefs.eu-west-1.redshift.amazonaws.com \
-p 5439 \
-U $MASTER_USERNAME trips
The following is the schema for the trips table. The encodings used were suggested by Redshift's ANALYZE COMPRESSION command. The data types on each column were picked to be as granular as the underlying data would allow. The original dataset has some data quality issues, so many of the DECIMAL fields allow for a much wider range of values than a well-groomed dataset would need. The sort key is left unencoded, which is in line with Amazon's guidance.
CREATE TABLE trips (
trip_id INTEGER NOT NULL DISTKEY ENCODE LZO,
vendor_id VARCHAR(3) ENCODE LZO,
-- Sort keys shouldn't be encoded (compressed)
pickup_datetime TIMESTAMP NOT NULL,
dropoff_datetime TIMESTAMP NOT NULL ENCODE LZO,
store_and_fwd_flag VARCHAR(1) ENCODE RUNLENGTH,
rate_code_id SMALLINT ENCODE LZO,
pickup_longitude DECIMAL(18,14) ENCODE MOSTLY8,
pickup_latitude DECIMAL(18,14) ENCODE MOSTLY8,
dropoff_longitude DECIMAL(18,14) ENCODE MOSTLY8,
dropoff_latitude DECIMAL(18,14) ENCODE MOSTLY8,
passenger_count SMALLINT NOT NULL DEFAULT '0' ENCODE LZO,
trip_distance DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
fare_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
extra DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
mta_tax DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
tip_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
tolls_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
ehail_fee DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
improvement_surcharge DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
total_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
payment_type VARCHAR(3) ENCODE RUNLENGTH,
trip_type SMALLINT ENCODE LZO,
pickup VARCHAR(50) ENCODE LZO,
dropoff VARCHAR(50) ENCODE LZO,
cab_type VARCHAR(6) NOT NULL ENCODE LZO,
precipitation SMALLINT DEFAULT '0' ENCODE LZO,
snow_depth SMALLINT DEFAULT '0' ENCODE LZO,
snowfall SMALLINT DEFAULT '0' ENCODE LZO,
max_temperature SMALLINT DEFAULT '0' ENCODE LZO,
min_temperature SMALLINT DEFAULT '0' ENCODE LZO,
average_wind_speed SMALLINT DEFAULT '0' ENCODE LZO,
pickup_nyct2010_gid SMALLINT ENCODE LZO,
pickup_ctlabel VARCHAR(10) ENCODE LZO,
pickup_borocode SMALLINT ENCODE LZO,
pickup_boroname VARCHAR(13) ENCODE LZO,
pickup_ct2010 VARCHAR(6) ENCODE LZO,
pickup_boroct2010 VARCHAR(7) ENCODE LZO,
pickup_cdeligibil VARCHAR(1) ENCODE RUNLENGTH,
pickup_ntacode VARCHAR(4) ENCODE LZO,
pickup_ntaname VARCHAR(56) ENCODE LZO,
pickup_puma VARCHAR(4) ENCODE LZO,
dropoff_nyct2010_gid SMALLINT ENCODE LZO,
dropoff_ctlabel VARCHAR(10) ENCODE LZO,
dropoff_borocode SMALLINT ENCODE LZO,
dropoff_boroname VARCHAR(13) ENCODE LZO,
dropoff_ct2010 VARCHAR(6) ENCODE LZO,
dropoff_boroct2010 VARCHAR(7) ENCODE LZO,
dropoff_cdeligibil VARCHAR(1) ENCODE RUNLENGTH,
dropoff_ntacode VARCHAR(4) ENCODE LZO,
dropoff_ntaname VARCHAR(56) ENCODE LZO,
dropoff_puma VARCHAR(4) ENCODE LZO,
primary key(trip_id)
) sortkey(pickup_datetime);
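For reference, the per-column encoding suggestions mentioned above come from running ANALYZE COMPRESSION against a table that already holds a sample of the data; it reports a recommended encoding for each column. A minimal sketch; be aware that, as far as I'm aware, it takes an exclusive lock on the table while it runs:
ANALYZE COMPRESSION trips;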
The following will start the importing process.
COPY trips
FROM 's3://<s3_bucket>/csv/trips.manifest'
CREDENTIALS
'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER ','
EMPTYASNULL
ESCAPE
GZIP
MANIFEST
MAXERROR 100000
REMOVEQUOTES
TRIMBLANKS
TRUNCATECOLUMNS;
The above took 4 hours and 33 minutes to import.
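While a COPY of this size is running, its progress can be checked from a second psql session via the stv_load_state system table. A sketch of the sort of query I'd use:
SELECT slice,
       num_files,
       num_files_complete,
       pct_complete
FROM stv_load_state;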
I ran a VACUUM on the data after the import. The following completed in 1 minute and 42 seconds.
VACUUM;
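To sanity-check the result of the load, the table's row count, size and how much of it remains unsorted can be pulled from the svv_table_info system view. A sketch:
SELECT "table",
       tbl_rows,
       size,     -- size in 1 MB blocks
       unsorted  -- percentage of rows not in sort key order
FROM svv_table_info
WHERE "table" = 'trips';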
Erroneous Values
The original PostgreSQL schema this data came from declared its DECIMAL columns as plain NUMERIC, without any precision or scale, meaning they'd accept numeric values of any magnitude. In other words, the schema wasn't enforcing any sort of value range control. When I first started this import many of the fields, such as trip_distance, were defined as DECIMAL(6,3), but errors from values such as 1236007.30 would cause Redshift to abort the import after 45 - 50 minutes.
The dataset combines data from different types of taxis operating in New York City, some of which report less data than others. I had to remove some NOT NULL constraints in order for those records to import as well. Again, I didn't find out about these problems until 45 - 50 minutes into the import.
I first started importing the data at 10:30 UTC+3 and it wasn't until 16:05 that I had a schema together that would accommodate these data quality issues. To top it off, I was paying for the Redshift cluster during every one of the failed imports.
The following is the query I ran to summarise the errors logged during a failed import (up to the 100,000 allowed by MAXERROR), along with example output from one of the failed loads.
SELECT colname, err_reason, COUNT(*)
FROM stl_load_errors
GROUP BY 1, 2
ORDER BY 1, 2 DESC;
colname | rate_code_id
err_reason | Missing data for not-null field
count | 100000
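The same stl_load_errors table also keeps the raw field values for each rejected record, which is handy when working out which DECIMAL ranges and NOT NULL constraints need relaxing. A sketch:
SELECT starttime,
       filename,
       line_number,
       colname,
       raw_field_value,
       err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;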
It's important to keep an eye on the Redshift console during a load job. psql will just sit there without reporting anything if the load fails; without seeing the query terminate in the console I wouldn't have known something was wrong.
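An alternative to watching the console is to check on running queries from a second psql session via the stv_recents system table. A sketch:
SELECT pid,
       user_name,
       status,
       duration,
       query
FROM stv_recents
WHERE status = 'Running';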
It's even more important to make sure your data is in very good shape before you present it to Redshift. Datasets like the one I'm working with are going to be loaded with data quality issues that need to be taken care of prior to running any LOAD jobs on Redshift.
Benchmarking Redshift
I ran each of the queries below multiple times in a row. The times reported were the fastest achieved.
\timing on
The following completed in 48 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY 1;
The following completed in 59 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY 1;
The following completed in 1 minute and 28 seconds.
SELECT passenger_count,
extract(year from pickup_datetime),
count(*)
FROM trips
GROUP BY 1,
2;
The following completed in 1 minute and 57 seconds.
SELECT passenger_count,
extract(year from pickup_datetime),
round(trip_distance),
count(*)
FROM trips
GROUP BY 1,
2,
3
ORDER BY 2,
4 desc;
Costs
I had the Redshift cluster up for a total of 10 hours, which set me back $9.50. Storing the 104 GB of compressed CSV data on S3 works out at $3.12 if I were to leave it there for a month. On top of those two costs is the 20% VAT that needs to be collected as I'm using a personal account in the UK.