BrytlytDB is an in-GPU-memory database built on top of PostgreSQL. It's operated using many of PostgreSQL's command line utilities, it's wire-protocol compatible so third-party PostgreSQL clients can connect to it, and queries are even parsed, planned and optimised by PostgreSQL's regular codebase before the execution plan is handed off to BrytlytDB's GPU-optimised code.
Clustering works right out of the box, GPU-powered JOINs are supported, stored procedures are fully functional, deep learning and machine learning workloads are supported via Torch, and BI visualisation software in the form of SpotLyt is included with BrytlytDB as well.
The feature list is too long for one blog to cover so in this post I'll just be taking a look at how performant BrytlytDB's OLAP functionality is. In this benchmark I'll see how well 32 Tesla K80 GPUs spread across two EC2 instances perform when querying 1.1 billion taxi trips. I'll be using the same dataset I've used to benchmark Amazon Athena, BigQuery, ClickHouse, Elasticsearch, EMR, kdb+/q, MapD, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks for comparison.
Note: Since writing this blog post I've benchmarked the newer, version 2.0 of BrytlytDB. I suggest you read that benchmark as it's more representative of the current version of the software.
A GPU-Powered AWS EC2 Cluster
For this benchmark I'll be using two p2.16xlarge EC2 instances running Ubuntu 16.04.2 LTS in Amazon Web Services' eu-west-1a region in Ireland. Each machine has 8 Nvidia K80 cards which have 2 GPUs each, 64 virtual CPUs and 732 GB of memory. There's also 20 Gbit/s of networking capacity available to each instance.
Below are the specifications of the compute capabilities available on each of the EC2 instances.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 1209.207
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.10
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt ida
This is the layout of the GPUs available on one of the two EC2 instances.
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:0F.0 Off | 0 |
| N/A 61C P0 60W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:00:10.0 Off | 0 |
| N/A 47C P0 69W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:00:11.0 Off | 0 |
| N/A 66C P0 59W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:00:12.0 Off | 0 |
| N/A 54C P0 72W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:00:13.0 Off | 0 |
| N/A 61C P0 59W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:00:14.0 Off | 0 |
| N/A 49C P0 70W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:00:15.0 Off | 0 |
| N/A 66C P0 58W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:00:16.0 Off | 0 |
| N/A 51C P0 70W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 8 Tesla K80 Off | 0000:00:17.0 Off | 0 |
| N/A 62C P0 64W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 9 Tesla K80 Off | 0000:00:18.0 Off | 0 |
| N/A 49C P0 72W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 10 Tesla K80 Off | 0000:00:19.0 Off | 0 |
| N/A 63C P0 59W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 11 Tesla K80 Off | 0000:00:1A.0 Off | 0 |
| N/A 52C P0 70W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 12 Tesla K80 Off | 0000:00:1B.0 Off | 0 |
| N/A 65C P0 60W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 13 Tesla K80 Off | 0000:00:1C.0 Off | 0 |
| N/A 52C P0 73W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 14 Tesla K80 Off | 0000:00:1D.0 Off | 0 |
| N/A 64C P0 60W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 15 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 52C P0 71W / 149W | 76MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
The first machine has an IP address of 52.214.237.134 and has three roles: Global Transaction Manager, Coordinator and host to the first 16 data nodes.
The second machine has an IP address of 34.250.232.38 and has two roles: it is also a coordinator and hosts the second set of 16 data nodes.
TCP port 5432 is open between the two machines for communicating via PostgreSQL's wire protocol, TCP port 7777 is open for global transaction manager communication and ports 20,000 through to 20,031 are open for data node communication.
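I won't go into the VPC setup in detail but, assuming both instances share a single security group, those rules could be added with something along the lines of the following (the security group ID here is a placeholder, not the one I actually used):
$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 5432 \
    --source-group sg-0123456789abcdef0
$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 7777 \
    --source-group sg-0123456789abcdef0
$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 20000-20031 \
    --source-group sg-0123456789abcdef0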
In addition to the default 20 GB EBS volume on each EC2 instance there are six 500 GB General Purpose SSDs attached, which offer a baseline of 1,500 IOPS each (3 IOPS per GB for gp2) and can burst to 3,000 IOPS if need be. These drives are set up in a RAID 0 configuration on each instance.
Before setting up the RAID array I'll install a few dependencies. This was run on both EC2 instances.
$ sudo apt update
$ sudo apt install \
mdadm \
wget \
xfsprogs
Below are the commands used to setup the RAID array on each instance.
$ sudo mdadm \
    --create \
    --verbose \
    --force /dev/md0 \
    --level=0 \
    --raid-devices=6 \
    /dev/xvdb \
    /dev/xvdc \
    /dev/xvdd \
    /dev/xvde \
    /dev/xvdf \
    /dev/xvdg
$ sudo mkfs.ext4 /dev/md0
$ sudo mkdir /raidArray
$ sudo mount /dev/md0 /raidArray
$ sudo chown ubuntu /raidArray
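The array won't reassemble itself automatically after a reboot unless it's recorded. I didn't need this for a one-off benchmark but, if you do, the usual approach on Ubuntu looks something like this:
$ sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
$ sudo update-initramfs -u
$ echo '/dev/md0 /raidArray ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab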
Here's what the RAID layout looked like after it was set up.
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda 202:0 0 20G 0 disk
└─xvda1 202:1 0 20G 0 part /
xvdb 202:16 0 500G 0 disk
└─md0 9:0 0 3T 0 raid0 /raidArray
xvdc 202:32 0 500G 0 disk
└─md0 9:0 0 3T 0 raid0 /raidArray
xvdd 202:48 0 500G 0 disk
└─md0 9:0 0 3T 0 raid0 /raidArray
xvde 202:64 0 500G 0 disk
└─md0 9:0 0 3T 0 raid0 /raidArray
xvdf 202:80 0 500G 0 disk
└─md0 9:0 0 3T 0 raid0 /raidArray
xvdg 202:96 0 500G 0 disk
└─md0 9:0 0 3T 0 raid0 /raidArray
This RAID 0 setup offers a partition with a capacity of 3.2 TB on each instance:
$ df -H
Filesystem Size Used Avail Use% Mounted on
udev 387G 0 387G 0% /dev
tmpfs 78G 9.2M 78G 1% /run
/dev/xvda1 21G 7.2G 14G 35% /
tmpfs 387G 0 387G 0% /dev/shm
tmpfs 5.3M 0 5.3M 0% /run/lock
tmpfs 387G 0 387G 0% /sys/fs/cgroup
tmpfs 78G 0 78G 0% /run/user/1000
/dev/md0 3.2T 57G 3.0T 2% /raidArray
BrytlytDB Up & Running
I've run the following to download BrytlytDB's install script. BrytlytDB is commercial software so I cannot divulge the URL I pulled this from at this time. The following was run on both EC2 instances.
$ cd ~
$ wget https://<server>/install.sh
I've then edited the install script on the first instance with the following instance-specific values:
$ vi install.sh
GPUS=16
X2DATA=/raidArray/data
COORDHOST=52.214.237.134
COORDSTART=0
GTMHOST=52.214.237.134
And the install script on the second instance was edited with the following instance-specific values:
$ vi install.sh
GPUS=16
X2DATA=/raidArray/data
COORDHOST=34.250.232.38
COORDSTART=1
GTMHOST=52.214.237.134
GTMCOUNT=0
DATASTART=16
With those changes in place I then ran the install script on both EC2 instances.
$ chmod +x install.sh
$ ./install.sh
The install script conducts the following:
- Download the BrytlytDB binaries.
- Install Nvidia's GPU drivers.
- Generate database initialisation bash scripts.
- Configure PostgreSQL with the correct IP addresses and port numbers used throughout the cluster.
- Deploy BrytlytDB-specific cluster configuration to each Data Node and to the individual GpuManagers.
- Generate start-up scripts for the Global Transaction Manager, the GpuManagers, the Data Nodes and the Coordinators (a rough sketch of these is shown below).
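The install script itself is BrytlytDB's so I won't reproduce it here but, given the Postgres-XC heritage the psql banner shows later on, I'd expect the generated start-up scripts to boil down to per-node commands roughly like the following. These are my guesses at the shape of those commands, not lines lifted from the actual scripts.
$ /usr/local/x2/bin/gtm_ctl start -Z gtm -D <gtm data directory>
$ /usr/local/x2/bin/pg_ctl start -Z datanode -D <data node directory>      # repeated for each data node
$ /usr/local/x2/bin/pg_ctl start -Z coordinator -D <coordinator directory>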
Once that's done I can form the cluster with the following registration script, which was run on the first EC2 instance.
$ vi register.sh
#!/bin/bash
for i in {0..15}; do
/usr/local/x2/bin/psql \
-p 5432 \
-h localhost \
postgres \
-c "CREATE NODE d$i WITH (HOST='52.214.237.134',
TYPE='datanode',
PORT=$((20000+$i)));"
done
for i in {16..31}; do
/usr/local/x2/bin/psql \
-p 5432 \
-h localhost \
postgres \
-c "CREATE NODE d$i WITH (HOST='34.250.232.38',
TYPE='datanode',
PORT=$((20000+$i)));"
done
/usr/local/x2/bin/psql \
-p 5432 \
-h localhost \
postgres \
-c "CREATE NODE c1 WITH (HOST='34.250.232.38',
TYPE='coordinator',
PORT=5432);"
$ ./register.sh
The second registration script, run on the other EC2 instance, is identical to the first except that the coordinator line points to 52.214.237.134 instead.
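Assuming BrytlytDB keeps Postgres-XC's system catalogues intact (the psql banner further down suggests it does), the node registration can be double-checked from either coordinator:
$ /usr/local/x2/bin/psql \
    -d postgres \
    -c "SELECT node_name, node_type, node_host, node_port
        FROM pgxc_node
        ORDER BY node_name;"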
Loading 1.1 Billion Trips Into BrytlytDB
For this benchmark I've downloaded one half of the 500 GB of GZIP-compressed CSV data I created in my Billion Taxi Rides in Redshift blog post onto each EC2 instance. The data sits in 56 files spread across the two machines but for the quickest load time I want to create 32 reasonably equally-sized CSV files and load 16 files on each EC2 instance simultaneously. The 32 files will pair up with the 32 GPUs available across the cluster and should allow for the best utilisation of the GPU resources when running queries.
Here is half of the original 500 GB data set on the first EC2 instance:
$ ls -l /raidArray/s3
... 2024092233 ... trips_xaa.csv.gz
... 2023686578 ... trips_xab.csv.gz
... 2022717460 ... trips_xac.csv.gz
... 2023507930 ... trips_xad.csv.gz
... 2024224441 ... trips_xae.csv.gz
... 2030784022 ... trips_xaf.csv.gz
... 2036427328 ... trips_xag.csv.gz
... 2038915704 ... trips_xah.csv.gz
... 2039556774 ... trips_xai.csv.gz
... 2039272131 ... trips_xaj.csv.gz
... 2019988039 ... trips_xak.csv.gz
... 2001054379 ... trips_xal.csv.gz
... 1999737958 ... trips_xam.csv.gz
... 1997898854 ... trips_xan.csv.gz
... 1997471865 ... trips_xao.csv.gz
... 2000533767 ... trips_xap.csv.gz
... 1998974328 ... trips_xaq.csv.gz
... 2009717364 ... trips_xar.csv.gz
... 2016934402 ... trips_xas.csv.gz
... 2022353263 ... trips_xat.csv.gz
... 2013756232 ... trips_xau.csv.gz
... 2013475097 ... trips_xav.csv.gz
... 2020404819 ... trips_xaw.csv.gz
... 2015585302 ... trips_xax.csv.gz
... 2008026925 ... trips_xay.csv.gz
... 2007821692 ... trips_xaz.csv.gz
... 2005814365 ... trips_xba.csv.gz
... 2013779043 ... trips_xbb.csv.gz
And this is the other half on the second instance:
$ ls -l /raidArray/s3
... 2018006900 ... trips_xbc.csv.gz
... 2019260716 ... trips_xbd.csv.gz
... 1893270062 ... trips_xbe.csv.gz
... 1854614596 ... trips_xbf.csv.gz
... 1854948290 ... trips_xbg.csv.gz
... 1855481821 ... trips_xbh.csv.gz
... 2008073682 ... trips_xbi.csv.gz
... 2023698241 ... trips_xbj.csv.gz
... 2028094175 ... trips_xbk.csv.gz
... 2030037816 ... trips_xbl.csv.gz
... 2048044463 ... trips_xbm.csv.gz
... 2031794840 ... trips_xbn.csv.gz
... 2034562660 ... trips_xbo.csv.gz
... 2034332904 ... trips_xbp.csv.gz
... 2036182649 ... trips_xbq.csv.gz
... 2035637794 ... trips_xbr.csv.gz
... 2026865353 ... trips_xbs.csv.gz
... 2028888024 ... trips_xbt.csv.gz
... 2027725347 ... trips_xbu.csv.gz
... 2027090129 ... trips_xbv.csv.gz
... 2024075447 ... trips_xbw.csv.gz
... 2025027115 ... trips_xbx.csv.gz
... 2020889873 ... trips_xby.csv.gz
... 2025885378 ... trips_xbz.csv.gz
... 2023688771 ... trips_xca.csv.gz
... 2024722253 ... trips_xcb.csv.gz
... 2030741659 ... trips_xcc.csv.gz
... 1383149265 ... trips_xcd.csv.gz
On each instance I'll decompress the GZIP files. The following took about four minutes to complete on each instance.
$ cd /raidArray/s3/
$ for filename in trips_x*.csv.gz; do
gzip -d $filename &
done
Then I'll concatenate the ~560 million lines of CSV data on each instance and break it up into files of 35 million lines each.
$ cat trips_x*.csv \
| split --lines=35000000 \
--additional-suffix=.csv
The above completed in 8 minutes on each EC2 instance.
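As a quick sanity check I can confirm each instance ended up with 16 files and count the total number of lines that will be loaded from that machine:
$ ls /raidArray/s3/x*.csv | wc -l
$ wc -l /raidArray/s3/x*.csv | tail -n1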
I'm now left with 16 CSV files on the first EC2 instance:
$ ls -l /raidArray/s3/x*.csv
... 16269790081 ... xaa.csv
... 16256520062 ... xab.csv
... 16260850939 ... xac.csv
... 16300101241 ... xad.csv
... 16328293941 ... xae.csv
... 16308059268 ... xaf.csv
... 15710765508 ... xag.csv
... 15694289596 ... xah.csv
... 15724665907 ... xai.csv
... 15701359090 ... xaj.csv
... 15805197311 ... xak.csv
... 15825215893 ... xal.csv
... 15798558696 ... xam.csv
... 15810367473 ... xan.csv
... 15740445794 ... xao.csv
... 15739487044 ... xap.csv
And 16 files on the other EC2 instance:
$ ls -l /raidArray/s3/x*.csv
... 15832236746 ... xaa.csv
... 15500745803 ... xab.csv
... 15377897035 ... xac.csv
... 15219115615 ... xad.csv
... 15910087929 ... xae.csv
... 16174771007 ... xaf.csv
... 16281792208 ... xag.csv
... 16282698764 ... xah.csv
... 16290969841 ... xai.csv
... 16268493212 ... xaj.csv
... 16268866466 ... xak.csv
... 16265349110 ... xal.csv
... 16252570464 ... xam.csv
... 16281353543 ... xan.csv
... 16241569911 ... xao.csv
... 13355974598 ... xap.csv
I'll then connect to the coordinator node on each EC2 instance and set up BrytlytDB's gpu_manager_fdw extension along with a foreign server that uses it.
$ /usr/local/x2/bin/psql -d postgres
psql (PGXC 1.3devel, based on PG 9.4beta1)
Type "help" for help.
CREATE EXTENSION gpu_manager_fdw;
CREATE SERVER gm_fdw_server
FOREIGN DATA WRAPPER gpu_manager_fdw;
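Before creating any tables it's worth confirming the wrapper and server registered properly. PostgreSQL's standard catalogues are enough for this; nothing below is BrytlytDB-specific.
SELECT fdwname FROM pg_foreign_data_wrapper;
SELECT srvname, srvoptions FROM pg_foreign_server;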
I can then create the trips table.
CREATE FOREIGN TABLE trips (
trip_id INT,
vendor_id VARCHAR(3),
pickup_datetime DATE,
dropoff_datetime DATE,
store_and_fwd_flag VARCHAR(1),
rate_code_id INT,
pickup_longitude DOUBLE PRECISION,
pickup_latitude DOUBLE PRECISION,
dropoff_longitude DOUBLE PRECISION,
dropoff_latitude DOUBLE PRECISION,
passenger_count INT,
trip_distance DOUBLE PRECISION,
fare_amount DOUBLE PRECISION,
extra DOUBLE PRECISION,
mta_tax DOUBLE PRECISION,
tip_amount DOUBLE PRECISION,
tolls_amount DOUBLE PRECISION,
ehail_fee DOUBLE PRECISION,
improvement_surcharge DOUBLE PRECISION,
total_amount DOUBLE PRECISION,
payment_type VARCHAR(10),
trip_type INT,
pickup VARCHAR(50),
dropoff VARCHAR(50),
cab_type VARCHAR(6),
precipitation INT,
snow_depth INT,
snowfall INT,
max_temperature INT,
min_temperature INT,
average_wind_speed INT,
pickup_nyct2010_gid INT,
pickup_ctlabel VARCHAR(10),
pickup_borocode INT,
pickup_boroname VARCHAR(13),
pickup_ct2010 VARCHAR(6),
pickup_boroct2010 VARCHAR(7),
pickup_cdeligibil VARCHAR(1),
pickup_ntacode VARCHAR(4),
pickup_ntaname VARCHAR(56),
pickup_puma VARCHAR(4),
dropoff_nyct2010_gid INT,
dropoff_ctlabel VARCHAR(10),
dropoff_borocode INT,
dropoff_boroname VARCHAR(13),
dropoff_ct2010 VARCHAR(6),
dropoff_boroct2010 VARCHAR(7),
dropoff_cdeligibil VARCHAR(1),
dropoff_ntacode VARCHAR(4),
dropoff_ntaname VARCHAR(56),
dropoff_puma VARCHAR(4)
) SERVER gm_fdw_server OPTIONS (max_size '38000000', index '24')
DISTRIBUTE BY HASH (trip_id);
The above sets the maximum row count per GPU device at 38 million and creates an index on the cab_type column (the index value of 24 appears to be the zero-based offset of that column in the table definition). The cab_type column has very low cardinality so the index will be less beneficial than it would be on a column with a higher cardinality of values.
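If that inference about the index option is right, indexing a higher-cardinality column such as pickup_datetime would presumably just mean pointing the zero-based offset at it instead, along these lines (my assumption, unverified against BrytlytDB's documentation):
...
) SERVER gm_fdw_server OPTIONS (max_size '38000000', index '2')
DISTRIBUTE BY HASH (trip_id);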
Some of the data types used in the above table aren't the types I normally use in my benchmarks. The reason for this is that data type support is still limited in this early version of BrytlytDB. I've had to substitute TIMESTAMP with DATE, which truncates the timestamps in the dataset to just the date rather than the full date and time. There is no SMALLINT support yet so I've had to use the larger INT type as a replacement for those fields, and DECIMAL types aren't yet supported so I'll be using DOUBLE PRECISION in their place.
The above replacements could speed up or slow down the queries I benchmark with so I'm hoping to do another benchmark when data type support is widened in the future.
With the table created I'll launch 32 simultaneous load jobs across the two EC2 instances to load the data into BrytlytDB's trips table.
This is the import script I ran on the first EC2 instance:
$ vi import.sh
count=0
for filename in /raidArray/s3/xa*.csv; do
/usr/local/x2/bin/psql \
-d postgres \
-p 5432 \
-h localhost \
-c "EXECUTE DIRECT ON ( d$count )
'SELECT load_data(''trips'',
''$filename'',
'',
'',
100000,
''0'',
'''',
''append'')'" 2>&1 &
(( count++ ))
done
And this is the load script for the second instance.
$ vi import.sh
count=16
for filename in /raidArray/s3/xa*.csv; do
/usr/local/x2/bin/psql \
-d postgres \
-p 5432 \
-h localhost \
-c "EXECUTE DIRECT ON ( d$count )
'SELECT load_data(''trips'',
''$filename'',
'',
'',
100000,
''0'',
'''',
''append'')'" 2>&1 &
(( count++ ))
done
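While the 32 load jobs run, progress can be kept an eye on from a separate shell using nothing more than standard tools: the data node directories grow as rows land and nvidia-smi reports the GPU memory filling up.
$ watch -n 60 'du -sh /raidArray/data/d*'
$ nvidia-smi --query-gpu=index,memory.used --format=csv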
The first EC2 instance loaded its half of the dataset in 1 hour, 12 minutes and 21 seconds. The second instance loaded its half in 1 hour, 9 minutes and 57 seconds.
After the data was loaded in I could see the PostgreSQL data directory filled up with reasonably evenly-sized data folders. Here's what they look like on the first EC2 instance:
$ du -hs /raidArray/data/*
36M c0
5.4G d0
5.4G d1
5.7G d10
5.7G d11
5.7G d12
5.6G d13
5.7G d14
5.6G d15
5.5G d2
5.7G d3
5.7G d4
5.7G d5
5.7G d6
5.7G d7
5.7G d8
5.7G d9
4.0M g0
I was then able to use PostgreSQL's CLI tool to make sure I could see the table and all 1.1 billion records.
$ /usr/local/x2/bin/psql -d postgres
\d
List of relations
Schema | Name | Type | Owner
--------+-------+---------------+--------
public | trips | foreign table | ubuntu
(1 row)
select count(*) from trips;
count
------------
1113653018
(1 row)
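Since the table is hash-distributed on trip_id the rows should be spread fairly evenly across the data nodes. EXECUTE DIRECT, which the load scripts already use, can ask each data node for its local count; a quick loop over the first instance's 16 nodes could look like this:
$ for i in {0..15}; do
      /usr/local/x2/bin/psql \
          -d postgres \
          -t \
          -c "EXECUTE DIRECT ON ( d$i ) 'SELECT count(*) FROM trips'"
  done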
Benchmarking BrytlytDB
The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".
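The timings come from psql's \timing readout. For anyone who wants to script a series of runs rather than re-typing queries interactively, a loop along these lines does the job (just a sketch; the run count of five is arbitrary):
$ for run in {1..5}; do
      echo '\timing on
SELECT cab_type, count(*) FROM trips GROUP BY cab_type;' \
          | /usr/local/x2/bin/psql -d postgres \
          | grep 'Time:'
  done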
$ /usr/local/x2/bin/psql -d postgres
\timing on
The following completed in 0.762 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
The following completed in 2.472 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY passenger_count;
The following completed in 4.131 seconds.
SELECT passenger_count,
extract(year from pickup_datetime) AS pickup_year,
count(*)
FROM trips
GROUP BY passenger_count,
pickup_year;
The following completed in 6.041 seconds.
SELECT passenger_count,
extract(year from pickup_datetime) AS pickup_year,
cast(trip_distance as int) AS distance,
count(*) AS the_count
FROM trips
GROUP BY passenger_count,
pickup_year,
distance
ORDER BY pickup_year,
the_count desc;
Given that all the features of PostgreSQL are still available, I'm blown away by how fast BrytlytDB is able to aggregate data. The cluster I used costs around $30 / hour to run, yet it outperformed other cloud-based and PostgreSQL-based data warehousing solutions I've benchmarked both in terms of wall clock time and in terms of cost per hour of running the cluster.
That being said, I know Nvidia's K80 chips use extremely fast memory and have thousands of compute cores, so I expect further optimisations from BrytlytDB will drive down these already fast query times.