Update: MapD rebranded as OmniSci in 2018.
In my last two blog posts I've looked at querying 1.1 billion taxi trips made in New York City over the course of six years using MapD on 8 Nvidia Tesla K80 GPU cards and 4 Nvidia Titan X cards respectively. In both of those scenarios I was working with bare metal. In this blog post I'll look at running the same benchmark on 4 g2.8xlarge EC2 instances on AWS.
To make all the GPUs on each of the 4 instances look like they're on the same machine I'll be using the Bitfusion Boost wrapper around MapD.
An EC2 GPU Cluster Up & Running
Bitfusion has an AWS Marketplace product called Bitfusion Boost Ubuntu 14 Cuda 7.5 that will launch a cluster using CloudFormation and bootstrap each node in the cluster with their software and dependencies, so that in the end it all looks like a giant instance with 16 GPUs.
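If you'd rather drive the launch from the command line than click through the Marketplace console, a CloudFormation launch looks roughly like the following. This is only a sketch: the template URL and the KeyName parameter below are placeholders, and the real values come from the Marketplace listing.
$ aws cloudformation create-stack \
    --stack-name bitfusion-boost-cluster \
    --template-url https://s3.amazonaws.com/.../bitfusion-boost.template \
    --parameters ParameterKey=KeyName,ParameterValue=mykey \
    --capabilities CAPABILITY_IAM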
The g2.8xlarge instances each have 2 Nvidia GRID K520 cards; each card has 2 GPUs and 8 GB of GDDR5 memory (4 GB per GPU). The last price I saw for each of the cards on Amazon.com was $1,590, so $2.808 / hour for an instance with two of them seems like a steal to me.
On top of the 4 g2.8xlarge instances, a storage-optimised i2.4xlarge instance will be launched which will act as a head node of the cluster and store all the data when it's at rest.
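For reference, a rough hourly cost for the whole cluster: four g2.8xlarge instances at $2.808 / hour each plus the i2.4xlarge head node, which I believe was priced at around $3.75 / hour on-demand at the time (that last figure is from memory rather than the bill).
$ echo '4 * 2.808 + 3.75' | bc
14.982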
Once CloudFormation has done its thing we can SSH into the head node.
$ ssh ubuntu@52.38.222.214
Once connected we're greeted with the following message:
########################################################################################################################
########################################################################################################################
[ASCII-art "Bitfusion.io" banner]
Welcome to Bitfusion Boost Ubuntu 14 Cuda 7.5 - Ubuntu 14.04 LTS (GNU/Linux 3.13.0-88-generic x86_64)
This AMI is brought to you by Bitfusion.io
http://www.bitfusion.io
Please email all feedback and support requests to:
amisupport@bitfusion.io
We would love to hear from you! Contact us with any feedback or a feature request at the email above.
########################################################################################################################
########################################################################################################################
Instance Type: i2.4xlarge
Number of GPUs: 16
Downloading 1.1 Billion Taxi Journeys
I'll download the 104 GB of CSV data I created in my Billion Taxi Rides in Redshift blog post. This data sits in 56 GZIP files and decompresses into around 500 GB of raw CSV data.
But before I can download that data I'll create a RAID 0 striped array for the data to live on. The i2.4xlarge instance type is supposed to come with 4 ephemeral drives but lsblk was only showing three drives in total when I launched this instance. I need about 500 GB of space for the uncompressed CSV files and 350 GB for MapD's internal representation of this data, so the 1.5 TB of space I'll get from the two spare drives should be more than enough.
The first drive is already in use so I'll fdisk the two remaining drives, use mdadm to set up the RAID array and then format it.
$ sudo apt update
$ sudo apt install mdadm
$ sudo fdisk /dev/xvdb
$ sudo fdisk /dev/xvdc
$ sudo mdadm \
-C /dev/md0 \
-l raid0 \
-n 2 \
/dev/xvd[b-c]1
$ sudo mkfs.ext3 /dev/md0
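The fdisk sessions above are interactive; on each disk I'm just creating a single primary partition that spans the whole drive. If you'd rather script that step, parted can do the same thing non-interactively. This is a sketch of the equivalent rather than what I actually ran:
$ sudo parted -s /dev/xvdb mklabel msdos mkpart primary 0% 100%
$ sudo parted -s /dev/xvdc mklabel msdos mkpart primary 0% 100%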
Now that the array is ready I'll mount it on the /data directory.
$ sudo mkdir /data
$ sudo mount /dev/md0 /data
$ sudo chown -R ubuntu:ubuntu /data
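For a short-lived benchmark machine that's all I need, but if you wanted the array and mount to come back after a reboot you'd also record them in mdadm.conf and fstab. A sketch of that step, which I skipped here:
$ sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
$ sudo update-initramfs -u
$ echo '/dev/md0  /data  ext3  defaults,nofail  0  2' | sudo tee -a /etc/fstab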
Once that's done we can see the RAID array is set up by running lsblk.
$ lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
xvda      202:0    0   500G  0 disk
└─xvda1   202:1    0   500G  0 part  /
xvdb      202:16   0 745.2G  0 disk
└─xvdb1   202:17   0 745.2G  0 part
  └─md0     9:0    0   1.5T  0 raid0 /data
xvdc      202:32   0 745.2G  0 disk
└─xvdc1   202:33   0 745.2G  0 part
  └─md0     9:0    0   1.5T  0 raid0 /data
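Another quick way to confirm the array is healthy is /proc/mdstat, which reports the RAID level, the member partitions and the array's state:
$ cat /proc/mdstat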
I'll now use the AWS CLI tool to download the 104 GB of compressed CSV data and store it on the RAID array.
$ mkdir /data/benchdata
$ cd /data/benchdata
$ aws configure set \
default.s3.max_concurrent_requests \
100
$ aws s3 sync s3://<s3_bucket>/csv/ ./
It took 11 minutes and 42 seconds to download the data from the Irish S3 bucket I keep the taxi data in to the cluster in Oregon, an average rate of 1.185 Gbit/s.
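That rate is easy to sanity-check: 104 GB in 702 seconds works out to roughly 1.185 Gbit/s.
$ echo 'scale=3; 104 * 8 / 702' | bc
1.185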
MapD doesn't support loading CSV data from GZIP files at this time so I'll decompress the CSV files before loading them. The i2.4xlarge instance type has 16 vCPUs so I'll use xargs to run 16 simultaneous instances of gunzip.
$ find *.gz | xargs -n 1 -P 16 gunzip
Updating Bitfusion
Normally Bitfusion's AMI ships with the latest version of Boost, but prior to running the benchmark I was told there had been a last-minute release with a number of improvements, so I updated each g2.8xlarge instance to take advantage of them.
The following shows the private IP addresses of each node in the GPU cluster.
$ cat /etc/bitfusionio/adaptor.conf
10.0.0.207
10.0.0.208
10.0.0.209
10.0.0.210
I then SSH'ed into each instance and updated the software with the following steps:
$ ssh -i ~/.ssh/mykey.pem 10.0.0.207
$ sudo service bfboost-opencl stop
$ sudo service bfboost-cuda-server stop
$ sudo apt update
$ sudo apt install bfboost
$ sudo sed -i "s/respawn/respawn\nlimit nofile 1024000 1024000/" \
/etc/init/bfboost-cuda-server.conf
$ sudo sed -i "s/respawn/respawn\nlimit nofile 1024000 1024000/" \
/etc/init/bfboost-opencl.conf
$ sudo service bfboost-cuda-server start
$ sudo service bfboost-opencl start
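Rather than logging into the four nodes one by one, the same steps can be pushed out in a loop over the addresses in adaptor.conf. This is a sketch, assuming the same key and passwordless sudo work on every node; the nofile tweaks from above would slot into the same command string.
$ while read ip; do
      ssh -n -i ~/.ssh/mykey.pem "$ip" '
          sudo service bfboost-opencl stop;
          sudo service bfboost-cuda-server stop;
          sudo apt update;
          sudo apt install -y bfboost;
          sudo service bfboost-cuda-server start;
          sudo service bfboost-opencl start'
  done < /etc/bitfusionio/adaptor.conf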
I then checked to make sure I was running the specific point release I was looking to upgrade to.
$ sudo dpkg --list | grep bfboost
0.1.0+1532
The following is the output from Nvidia's system management interface on a g2.8xlarge instance. Note this tool isn't running via the Bitfusion Boost wrapper and is only showing the hardware available to this one specific instance.
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 352.93 Driver Version: 352.93 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 On | 0000:00:03.0 Off | N/A |
| N/A 31C P8 17W / 125W | 2098MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GRID K520 On | 0000:00:04.0 Off | N/A |
| N/A 30C P8 17W / 125W | 2098MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GRID K520 On | 0000:00:05.0 Off | N/A |
| N/A 32C P8 18W / 125W | 2098MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GRID K520 On | 0000:00:06.0 Off | N/A |
| N/A 31C P8 17W / 125W | 2098MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6784 C /usr/bin/cuda-server 2085MiB |
| 1 6784 C /usr/bin/cuda-server 2085MiB |
| 2 6784 C /usr/bin/cuda-server 2085MiB |
| 3 6784 C /usr/bin/cuda-server 2085MiB |
+-----------------------------------------------------------------------------+
Back on the head node I can now run Bitfusion Boost and see all 16 GPUs available.
$ bfboost client \
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 16 CUDA Capable device(s)
Device 0: "GRID K520"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4096 MBytes (4294770688 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 797 MHz (0.80 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
...
MapD Up & Running
I will now install MapD. MapD is commercial software, but if you are interested in doing this yourself or in running a proof of concept with your own datasets, you can email them at info@mapd.com. For this benchmark I was kindly provided with a link to the latest version of their software.
$ mkdir -p /data/prod/
$ cd /data/prod/
$ wget http://.../mapd2-latest-Linux-x86_64.sh
$ chmod +x mapd2-latest-Linux-x86_64.sh
$ ./mapd2-latest-Linux-x86_64.sh
I'll then create a symlink so I have a simpler folder name to work with.
$ ln -s mapd2-1.2.0-20160711-90d6743-Linux-x86_64/ mapd
I'll then create a data directory for MapD to store its internal files in and initialise the database.
$ mkdir -p /data/prod/mapd-storage/data
$ mapd/bin/initdb --data /data/prod/mapd-storage/data
MapD has a Java dependency, and the LD_LIBRARY_PATH environment variable that comes out of the box on the i2.4xlarge instance doesn't point at the correct location, so the following will fix that.
$ export LD_LIBRARY_PATH=/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/
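If your image has the JVM laid out differently, the directory to point at is the one containing libjvm.so, which is easy to locate first:
$ find /usr/lib/jvm -name libjvm.so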
With all those dependencies in place we can launch MapD's server within the Bitfusion Boost wrapper.
$ nohup bfboost client \
"mapd/bin/mapd_server --data /data/prod/mapd-storage/data" &
Importing 1.1 Billion Trips Into MapD
I'll create a schema for the trips table and use a fragment size of 75 million. The fragment size sets how many rows go into each fragment, the unit of data MapD spreads across the available GPUs; with around 1.1 billion rows that works out to roughly one fragment per GPU on this 16-GPU cluster.
$ vi create_trips_table.sql
CREATE TABLE trips (
trip_id INTEGER,
vendor_id VARCHAR(3) ENCODING DICT,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag VARCHAR(1) ENCODING DICT,
rate_code_id SMALLINT,
pickup_longitude DECIMAL(14,2),
pickup_latitude DECIMAL(14,2),
dropoff_longitude DECIMAL(14,2),
dropoff_latitude DECIMAL(14,2),
passenger_count SMALLINT,
trip_distance DECIMAL(14,2),
fare_amount DECIMAL(14,2),
extra DECIMAL(14,2),
mta_tax DECIMAL(14,2),
tip_amount DECIMAL(14,2),
tolls_amount DECIMAL(14,2),
ehail_fee DECIMAL(14,2),
improvement_surcharge DECIMAL(14,2),
total_amount DECIMAL(14,2),
payment_type VARCHAR(3) ENCODING DICT,
trip_type SMALLINT,
pickup VARCHAR(50) ENCODING DICT,
dropoff VARCHAR(50) ENCODING DICT,
cab_type VARCHAR(6) ENCODING DICT,
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel VARCHAR(10) ENCODING DICT,
pickup_borocode SMALLINT,
pickup_boroname VARCHAR(13) ENCODING DICT,
pickup_ct2010 VARCHAR(6) ENCODING DICT,
pickup_boroct2010 VARCHAR(7) ENCODING DICT,
pickup_cdeligibil VARCHAR(1) ENCODING DICT,
pickup_ntacode VARCHAR(4) ENCODING DICT,
pickup_ntaname VARCHAR(56) ENCODING DICT,
pickup_puma VARCHAR(4) ENCODING DICT,
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel VARCHAR(10) ENCODING DICT,
dropoff_borocode SMALLINT,
dropoff_boroname VARCHAR(13) ENCODING DICT,
dropoff_ct2010 VARCHAR(6) ENCODING DICT,
dropoff_boroct2010 VARCHAR(7) ENCODING DICT,
dropoff_cdeligibil VARCHAR(1) ENCODING DICT,
dropoff_ntacode VARCHAR(4) ENCODING DICT,
dropoff_ntaname VARCHAR(56) ENCODING DICT,
dropoff_puma VARCHAR(4) ENCODING DICT
) WITH (FRAGMENT_SIZE=75000000);
I'll create two environment variables with my credentials for MapD.
$ read MAPD_USERNAME
$ read MAPD_PASSWORD
$ export MAPD_USERNAME
$ export MAPD_PASSWORD
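If you'd rather the password wasn't echoed to the terminal while you type it, read -s reads it silently:
$ read -s MAPD_PASSWORD
$ export MAPD_PASSWORD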
The following will create the table schema using the mapdql CLI tool.
$ mapd/bin/mapdql mapd \
-u $MAPD_USERNAME \
-p $MAPD_PASSWORD \
< create_trips_table.sql
With the table and files in place I'll load the 500 GB of CSV data into MapD. The following completed in 60 minutes and 42 seconds.
$ for filename in /data/benchdata/*.csv; do
echo "COPY trips
FROM '$filename'
WITH (header='false');" | \
mapd/bin/mapdql \
mapd \
-u $MAPD_USERNAME \
-p $MAPD_PASSWORD
done
Here's a snapshot of top during the import.
top - 07:06:57 up 2:13, 2 users, load average: 10.75, 10.51, 10.25
Tasks: 479 total, 1 running, 478 sleeping, 0 stopped, 0 zombie
%Cpu0 : 74.3 us, 5.3 sy, 0.0 ni, 20.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 75.1 us, 1.7 sy, 0.0 ni, 23.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 74.0 us, 2.3 sy, 0.0 ni, 23.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 72.2 us, 3.6 sy, 0.0 ni, 24.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 75.3 us, 2.0 sy, 0.0 ni, 22.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 74.8 us, 2.3 sy, 0.0 ni, 22.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 72.1 us, 3.3 sy, 0.0 ni, 24.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 73.1 us, 9.3 sy, 0.0 ni, 17.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 74.2 us, 2.0 sy, 0.0 ni, 23.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 73.7 us, 2.7 sy, 0.0 ni, 23.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 74.9 us, 2.3 sy, 0.0 ni, 22.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 74.1 us, 3.0 sy, 0.0 ni, 22.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 : 76.5 us, 1.3 sy, 0.0 ni, 22.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 75.2 us, 2.3 sy, 0.0 ni, 22.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 : 72.8 us, 3.3 sy, 0.0 ni, 23.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 74.1 us, 4.3 sy, 0.0 ni, 21.3 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 12590446+total, 12531040+used, 594064 free, 897984 buffers
KiB Swap: 0 total, 0 used, 0 free. 11976936+cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19915 ubuntu 20 0 4611148 1.278g 20028 S 1231 1.1 538:04.36 mapd/bin/mapd_server --data /data/prod/mapd-storage/data
19910 ubuntu 20 0 63008 4520 3340 S 0.0 0.0 0:00.00 bfboost client mapd/bin/mapd_server --data /data/prod/mapd-storage/+
...
Once the import is complete, 894 GB of capacity is in use on the RAID array.
$ df -H
Filesystem      Size  Used  Avail Use% Mounted on
udev             65G   13k    65G   1% /dev
tmpfs            13G  844k    13G   1% /run
/dev/xvda1      529G  6.4G   501G   2% /
none            4.1k     0   4.1k   0% /sys/fs/cgroup
none            5.3M     0   5.3M   0% /run/lock
none             65G   13k    65G   1% /run/shm
none            105M     0   105M   0% /run/user
/dev/md0        1.6T  894G   602G  60% /data
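Before benchmarking, a quick sanity check that everything arrived: a row count on the trips table should come back at a little over 1.1 billion.
$ echo "SELECT COUNT(*) FROM trips;" | \
    mapd/bin/mapdql \
      mapd \
      -u $MAPD_USERNAME \
      -p $MAPD_PASSWORD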
Benchmarking MapD
The times quoted below are the lowest query times seen during a series of runs.
$ mapd/bin/mapdql \
mapd \
-u $MAPD_USERNAME \
-p $MAPD_PASSWORD
\timing on
The following completed in 0.028 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
The following completed in 0.2 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY passenger_count;
The following completed in 0.237 seconds.
SELECT passenger_count,
extract(year from pickup_datetime) AS pickup_year,
count(*)
FROM trips
GROUP BY passenger_count,
pickup_year;
The following completed in 0.578 seconds.
SELECT passenger_count,
extract(year from pickup_datetime) AS pickup_year,
cast(trip_distance as int) AS distance,
count(*) AS the_count
FROM trips
GROUP BY passenger_count,
pickup_year,
distance
ORDER BY pickup_year,
the_count desc;
It's amazing to see that $14.98 / hour's worth of EC2 instances can query 1.1 billion records 56x faster than any CPU-based solution I've run so far. Going forward, the ability of MapD and Bitfusion to combine GPUs into clusters of on-demand supercomputers will have a profound impact on the size of the problems that can now be tackled. The future of BI is going to be on GPUs.