Rapid7 publish a Reverse DNS result set every few weeks as part of their Project Sonar study. It's made up of the DNS pointer (PTR) record responses they've seen across the entire publicly routable IPv4 space. These results won't return every hostname pointing at a given IPv4 address, but they're often the hostnames created by the network provider. This can be very useful for distinguishing between IPs used for residential, enterprise and cloud connectivity.
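For anyone unfamiliar with pointer records, a single reverse lookup can be reproduced with Python's standard library. The following is a minimal sketch; 8.8.8.8 is just an arbitrary example address and the hostname returned depends entirely on what the address' owner has published.
from socket import gethostbyaddr

# Fetch the PTR record for a single IPv4 address.
hostname, aliases, addresses = gethostbyaddr('8.8.8.8')
print hostname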
In a recent survey, 37% of Python developers reported using the language to build web scraping bots. Many of these bots run on cloud platforms and other hosting providers. Knowing the hostname of a given IPv4 address helps when excluding synthetic traffic from weblog analysis. This isn't a complete answer to identifying bot traffic and it can certainly throw up false positives but, nonetheless, it's a reasonably clear signal of the infrastructure being used.
Executing DNS lookups on an ad-hoc basis in order to enrich network traffic logs can result in an unpredictable amount of network latency, and running the same query in quick succession is wasteful. Batch lookups wouldn't lend themselves well to this problem either, as returning answers in real time is more valuable given the freshness of the data and its value to network operations. It's reasonable to expect that a lookup agent would hold open a connection, or a pool of connections, to a database, so for this exercise I'll sequentially look up IPv4 addresses while maintaining the same database connection.
My interest here is in seeing the performance differences between using PostgreSQL with a B-Tree index versus ClickHouse and its MergeTree engine for this use case.
Installing Prerequisites
The machine I'm using has an Intel Core i5 4670K clocked at 3.4 GHz, 8 GB of DDR3 RAM and a 250 GB Samsung NVMe SSD 960 EVO connected via a PCIe expansion card.
Below I'll install PostgreSQL 12.1 and ClickHouse 19.17.6.36 as well as a variety of other tools and libraries. The following was run on a fresh installation of Ubuntu 16.04.2 LTS.
$ sudo apt-key adv \
--keyserver hkp://keyserver.ubuntu.com:80 \
--recv E0C56BD4
$ wget -qO- \
https://www.postgresql.org/media/keys/ACCC4CF8.asc \
| sudo apt-key add -
$ echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" \
| sudo tee /etc/apt/sources.list.d/clickhouse.list
$ echo "deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main" \
| sudo tee /etc/apt/sources.list.d/pgdg.list
$ sudo apt update
$ sudo apt install \
clickhouse-client \
clickhouse-server \
jq \
libpq-dev \
pigz \
postgresql-12 \
postgresql-client-12 \
postgresql-contrib \
python-dev \
python-pip \
sqlite3 \
virtualenv
I'll create a Python virtual environment with packages for PostgreSQL and ClickHouse connectivity as well as a domain name parser.
$ virtualenv ~/.lookup
$ source ~/.lookup/bin/activate
$ pip install \
'clickhouse-driver[lz4]' \
psycopg2 \
tldextract
I'll launch ClickHouse Server. PostgreSQL already launched its server automatically when it was installed.
$ sudo service clickhouse-server start
I'll create a default ClickHouse client configuration file for my UNIX account. I set the password for the default ClickHouse user to root when I installed the server software.
$ mkdir -p ~/.clickhouse-client
$ vi ~/.clickhouse-client/config.xml
<config>
    <password>root</password>
</config>
The Reverse DNS Dataset
The following is an ANSI diagram of the stages the dataset will go through in this tutorial.
Original GZIP-compressed JSON from Rapid7 (11 GB)
└── JSON split into 4 files (11 GB)
    └── Four GZIP-compressed CSV files of full hostnames and IPv4 addresses in dot-decimal format (6 GB)
        └── Four uncompressed CSV files of domain name stubs and IPv4 addresses in 32-bit unsigned integer format (23 GB)
            └── SQLite3 Database (28 GB excluding the index)
                └── Shrunken uncompressed CSV (0.5 GB)
                    ├── PostgreSQL Database (1.4 GB)
                    └── ClickHouse Log Engine Database (0.1 GB)
                        └── ClickHouse MergeTree Engine Database (0.3 GB)
I'll download the Reverse DNS dataset from Rapid7. I've noticed they remove older versions of their database when newer versions are published so in the future you will probably have to modify the URL below with a newer one found on their downloads page.
$ wget -c -O rdns.json.gz \
https://opendata.rapid7.com/sonar.rdns_v2/2019-11-27-1574813015-rdns.json.gz
The dataset is delivered as a ~1.27 billion-line, 11 GB GZIP-compressed file with one JSON-serialised record per line. The following is an example of one record.
$ pigz -dc rdns.json.gz \
| head -n1 \
| python -mjson.tool
{
    "name": "1.10.230.137",
    "timestamp": "1574835472",
    "type": "ptr",
    "value": "node-k95.pool-1-10.dynamic.totinternet.net"
}
I want to process this data using four processes, each running on a separate CPU core. The dataset has around 1.27 billion lines so I'll break them into files of, at most, 320 million lines each.
$ pigz -dc rdns.json.gz \
| split --lines=320000000 \
--filter="pigz > rdns_\$FILE.json.gz"
These are the resulting files produced by the above. Each file is ~2.8 GB.
rdns_xaa.json.gz
rdns_xab.json.gz
rdns_xac.json.gz
rdns_xad.json.gz
I will then extract the full domain name and IPv4 address from each file in parallel and place them in GZIP-compressed, comma-delimited CSV files.
$ ls rdns_*.json.gz \
| xargs -P4 -n1 -I {} sh -c \
"pigz -dc {} \
| jq '.name + \",\" + .value' \
| sed 's/\"//g' \
| pigz > {}.csv.gz"
The above produced 6 GB of GZIP-compressed CSV files. Here is an example of the output of the above operation.
$ gunzip -c rdns_xaa.json.gz.csv.gz | tail -n3
104.197.172.33,33.172.197.104.bc.googleusercontent.com
104.197.172.34,34.172.197.104.bc.googleusercontent.com
104.197.172.35,35.172.197.104.bc.googleusercontent.com
Extracting Domain Name Stubs
I want to remove any CNAMEs and top-level domains from the domain names. In the above example, 33.172.197.104.bc.googleusercontent.com would simply become googleusercontent. This makes it easier to group domains from entities that use a wide variety of top-level domains and removes the noise of the CNAMEs which can be unique to each IPv4 address. Below I'll use tldextract for this task.
The code below will also convert the IPv4 address from its dot-decimal format into an unsigned 32-bit integer.
$ vi domain_stub.py
import gzip
from glob import glob
from multiprocessing import Pool
from socket import inet_aton
from struct import unpack
from uuid import uuid4

from tldextract import extract as tld_ex


ip2int = lambda x: unpack("!I", inet_aton(x))[0]


def extract(manifest):
    input_file, output_file = manifest

    with open(output_file, 'a') as out:
        for line in gzip.open(input_file, 'r'):
            try:
                ip, domain = line.split(',')
                out.write('%d,%s\n' % (ip2int(ip),
                                       tld_ex(domain)[1]))
            except:
                pass


filenames = [(csv_gz,
              'out.%s.csv' % str(uuid4())[:6])
             for csv_gz in glob('rdns_*.json.gz.csv.gz')]

pool = Pool(4)
pool.map(extract, filenames)
$ python domain_stub.py
The above took over a day to run on my machine. I believe the tldextract library is well-written and reasonable steps have been taken to optimise it. Nonetheless, I did raise a ticket to see if there is anything that could be done to speed up the performance even further. I suspect a string-focused library like this could be ported to C, C++, Rust or GoLang and yield greater performance.
UPDATE: I've since ported the above to Rust. See my Faster Top Level Domain Name Extraction with Rust blog post for further details.
The above produced 23 GB of uncompressed CSV files. Here is a sample from one of the resulting output files.
$ head -n3 out.7dcbe9.csv
17491593,totinternet
17491594,totinternet
17491595,totinternet
Reducing the Search Space
In the above example, totinternet is listed as the hostname stub for at least three IPv4 addresses that run in sequential order. I want to see if I can build ranges where a hostname stub is paired with the lowest and highest IPv4 addresses that run sequentially and uninterrupted. This way there will be fewer records describing the same dataset, which ultimately shrinks the search space for the lookup tasks later on in this post.
I'll import the dataset into a single table in SQLite3, apply an index and then go through each record ordered by the value of the IPv4 address and print out any uninterrupted sequences for any given hostname stub.
$ sqlite3 lookup.db
CREATE TABLE ip_host (
    "ipv4" INTEGER,
    "domain_stub" TEXT
);
The following produced a 28 GB database prior to the index being applied.
$ cat out.*.csv \
| sqlite3 -csv \
-separator ',' \
lookup.db \
'.import /dev/stdin ip_host'
$ sqlite3 lookup.db
CREATE INDEX ip_host_ipv4 ON ip_host (ipv4);
$ vi shrink.py
import sqlite3


lookup_conn = sqlite3.connect('lookup.db')
lookup_cur = lookup_conn.cursor()

sql = '''SELECT ipv4, domain_stub
         FROM ip_host
         ORDER BY ipv4'''
lookup_cur.execute(sql)

last_ip, last_domain, consecutive = 0, None, 0

for ipv4, domain_stub in lookup_cur:
    if ipv4 != last_ip + (consecutive + 1):
        if last_domain:
            print '%s,%d,%d' % (last_domain,
                                last_ip,
                                (last_ip + consecutive))
        last_ip = ipv4
        last_domain = domain_stub
        consecutive = 0
    else:
        consecutive = consecutive + 1

if consecutive:
    print '%s,%d,%d' % (last_domain,
                        last_ip,
                        (last_ip + consecutive))
$ python shrink.py > shrunken.csv
The resulting CSV file is 508 MB uncompressed and is made up of 17,784,359 lines, 71x fewer than the source dataset. Here is a sample of the file produced.
$ head -n3 shrunken.csv
one,16777217,16777217
bigredgroup,16778244,16778244
gtelecom,16778501,16778501
I'll produce a random set of records that will be used to do lookups in the benchmarks below. This will ensure that every lookup is a hit and every query is unique.
$ sort -R shrunken.csv | head -n1000 > unique_ips
Populating PostgreSQL
I'll set up a PostgreSQL account and create a database that will be populated by the "shrunken" dataset.
$ sudo -u postgres \
bash -c "psql -c \"CREATE USER mark
WITH PASSWORD 'test'
SUPERUSER;\""
$ createdb ip_ranges
$ psql ip_ranges
CREATE TABLE ip_ranges (
    domain_stub VARCHAR(255),
    "start" BIGSERIAL,
    "end" BIGSERIAL
);
\copy ip_ranges FROM 'shrunken.csv' DELIMITER ',' CSV
PostgreSQL can scan its indices both forwards and backwards at nearly the same speed, but when running a scan it cannot change direction without starting a new scan. I'll create an index on the dataset that sorts the data by the first IPv4 address in each range and then by the last IPv4 address in the range in reverse. That way, when there is a hit on the first IPv4 address in a range, the second column's hit will always come afterwards, avoiding PostgreSQL having to start a second scan.
CREATE INDEX ip_ranges_inverse
ON ip_ranges ("start", "end" DESC);
I'll then re-order the table based on the ordering in the above index.
ALTER TABLE ip_ranges
CLUSTER ON ip_ranges_inverse;
I'll also update the statistics used by PostgreSQL's query planner.
VACUUM ANALYZE ip_ranges;
The resulting database is 1,479 MB in PostgreSQL's internal format.
\l+ ip_ranges
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
-----------+-------+----------+-------------+-------------+-------------------+---------+------------+-------------
ip_ranges | mark | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | 1479 MB | pg_default |
Below I'll be using a prepared query for the lookup process. The PREPARE statement will parse the SQL, analyse it and perform any rewriting/macro expansion of the query ahead of time and only once during this exercise. Later on, when the EXECUTE statement is called the query will only need to be planned and executed saving a good amount of overhead.
This post originally didn't state that sequential scanning should be turned off; without doing so, PostgreSQL would pick a sequential scan over scanning its index. Thanks to Justin Azoff for pointing this out.
EXPLAIN EXECUTE find_stub(2884839823);
                                       QUERY PLAN
---------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.10 rows=1 width=8)
   ->  Seq Scan on ip_ranges  (cost=0.00..386967.84 rows=3903622 width=8)
         Filter: ((start <= '2884839823'::bigint) AND ("end" >= '2884839823'::bigint))

SET enable_seqscan = off;

EXPLAIN EXECUTE find_stub(2884839823);
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..0.58 rows=1 width=8)
   ->  Index Scan using ip_ranges_inverse on ip_ranges  (cost=0.44..568625.21 rows=3903622 width=8)
         Index Cond: ((start <= '2884839823'::bigint) AND ("end" >= '2884839823'::bigint))
I've added this setting to the code below.
$ vi run_pg.py
from psycopg2 import connect as PG
unique_ips = \
    [int(x.split(',')[1])
     for x in open('unique_ips', 'r+b')\
                  .read()\
                  .strip()\
                  .splitlines()]

conn = PG(database='ip_ranges')
cursor = conn.cursor()

cursor.execute('SET enable_seqscan = off;')

cursor.execute('''
    PREPARE find_stub AS
    SELECT domain_stub
    FROM ip_ranges
    WHERE "start" <= $1
    AND "end" >= $1
    LIMIT 1;''')

sql = 'EXECUTE find_stub(%(ip)s);'

for ip in unique_ips:
    cursor.execute(sql, {'ip': ip})
    resp = cursor.fetchone()
    assert len(resp[0].strip()) > 0

cursor.execute('''DEALLOCATE find_stub;''')
$ time python run_pg.py
The above completed in 3 minutes and 41 seconds giving a lookup rate of 16,290/hour.
Populating ClickHouse
I'll first load the CSV data into a Log Engine table in ClickHouse.
$ clickhouse-client
CREATE TABLE ip_ranges_log (
    domain_stub String,
    start UInt32,
    end UInt32
) ENGINE = Log;
$ cat shrunken.csv \
| clickhouse-client \
--query="INSERT INTO ip_ranges_log FORMAT CSV"
I'll then produce a MergeTree Engine table which will convert the row-oriented data from the previous table into a form that will be faster to search against. The MergeTree Engine demands a date to partition the data against so I've picked a placeholder date of January 1st, 1970.
$ clickhouse-client
CREATE TABLE ip_ranges
ENGINE = MergeTree(const_date, (start, end), 8192)
AS SELECT
    toDate('1970-01-01') AS const_date,
    domain_stub,
    start,
    end
FROM ip_ranges_log;
The Log Engine table is 166 MB in ClickHouse's internal format and the MergeTree Engine table is 283 MB.
I couldn't find an equivalent of PostgreSQL's prepared statements in ClickHouse so the following will simply execute a SELECT statement in full for each and every record.
$ vi run_ch.py
from clickhouse_driver import Client as CH
unique_ips = \
    [int(x.split(',')[1])
     for x in open('unique_ips', 'r+b')\
                  .read()\
                  .strip()\
                  .splitlines()]

client = CH('localhost',
            port=9000,
            password='root',
            secure=False,
            verify=False,
            compression=False)

sql = '''SELECT domain_stub
         FROM ip_ranges
         WHERE start <= %d
         AND end >= %d
         LIMIT 1'''

for ip in unique_ips:
    resp = client.execute(sql % (ip, ip))[0]
    assert len(resp[0].strip()) > 0
$ time python run_ch.py
The above completed in 9.19 seconds giving a lookup rate of just under 392K/hour. This is 24x faster than PostgreSQL.
Closing Thoughts
The gap between the hourly lookup rates is more than an order of magnitude. Even if I ran four parallel connections querying PostgreSQL it would still be several times slower, and that's not considering any performance increases that could be seen running the same number of lookup workers against ClickHouse.
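For what it's worth, the following is a minimal sketch of spreading the PostgreSQL lookups across four worker processes with multiprocessing. It assumes the same ip_ranges database and unique_ips file used above; I haven't benchmarked it.
from multiprocessing import Pool

from psycopg2 import connect as PG


def lookup(ips):
    # Each worker holds its own connection and prepared statement.
    conn = PG(database='ip_ranges')
    cursor = conn.cursor()
    cursor.execute('SET enable_seqscan = off;')
    cursor.execute('''
        PREPARE find_stub AS
        SELECT domain_stub
        FROM ip_ranges
        WHERE "start" <= $1
        AND "end" >= $1
        LIMIT 1;''')

    for ip in ips:
        cursor.execute('EXECUTE find_stub(%(ip)s);', {'ip': ip})
        assert cursor.fetchone() is not None


unique_ips = [int(x.split(',')[1])
              for x in open('unique_ips', 'r+b').read().strip().splitlines()]

# One chunk of the workload per worker process.
chunks = [unique_ips[i::4] for i in range(4)]

pool = Pool(4)
pool.map(lookup, chunks)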
PostgreSQL has Zedstore in development. This is a columnar storage engine that could go some way to speeding up the above queries when it's released. PostgreSQL is a fantastic tool when a dataset can't enforce append-only operations but it would be nice to see if use cases like the above could be optimised further than they already have been.
If this code above were to ever be used in production a Least Recently Used (LRU) cache on the results from either PostgreSQL or ClickHouse could go a long way to returning subsequent matching records quicker.
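As a rough illustration, here is a minimal LRU cache sketch built on OrderedDict that could sit in front of the ClickHouse lookup. The client and sql objects referenced are the ones defined in run_ch.py above and the 4,096-entry capacity is an arbitrary assumption.
from collections import OrderedDict


class LRUCache(object):
    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return None

        # Move the key to the end to mark it as recently used.
        value = self.cache.pop(key)
        self.cache[key] = value
        return value

    def put(self, key, value):
        if key in self.cache:
            self.cache.pop(key)
        elif len(self.cache) >= self.capacity:
            # Evict the least recently used entry.
            self.cache.popitem(last=False)

        self.cache[key] = value


cache = LRUCache()


def cached_lookup(ip):
    domain_stub = cache.get(ip)

    if domain_stub is None:
        domain_stub = client.execute(sql % (ip, ip))[0][0]
        cache.put(ip, domain_stub)

    return domain_stub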
The processing of the source data into a usable form lends itself well to systems with many CPU cores. Using an EC2 spot instance, like the 72-vCPU c5.18xlarge, would be a good choice to process the data quickly while only spending a few dollars to do so. The above instructions would need the four files and process counts expanded to the number of vCPUs on the given machine. I/O might become constrained at some point and should also be taken into consideration.
The shrinking operation would be a lot quicker if partitioned by the class A of each IPv4 address and run in parallel. Running everything sequentially means it's bottlenecked by the speed of a single CPU core. Also, I suspect ClickHouse could import and sort this data much quicker than SQLite3.
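A minimal sketch of that partitioning step is below. It splits the stub CSV files by the leading octet of each IPv4 address so each partition could then be sorted and shrunk by a separate process; the class_a_*.csv naming is my own.
from glob import glob


# One output file per class A (leading octet) of the IPv4 space.
outputs = {}

for filename in glob('out.*.csv'):
    for line in open(filename, 'r+b'):
        ipv4 = int(line.split(',')[0])
        class_a = ipv4 >> 24

        if class_a not in outputs:
            outputs[class_a] = open('class_a_%03d.csv' % class_a, 'a')

        outputs[class_a].write(line)

for handle in outputs.values():
    handle.close()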
ClickHouse has standalone client support for feeding JSON in via UNIX Pipes, transforming it using SQL and outputting the results to CSV, Parquet or even native ClickHouse Log Engine format. I suspect a few stages of the data processing above could be removed with a single, albeit more elaborate, ClickHouse stage.
The Rapid7 dataset does have gaps from unsuccessful lookups. If your objective is to identify cloud providers then using the IPv4 address lists that have either been crowd-sourced or officially published for AWS, Google Cloud and Azure to enrich this dataset will go a long way towards achieving a more complete picture of global IPv4 usage.
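As an example of the officially published lists, AWS serve their address ranges as JSON. The following is a minimal sketch that checks a single IPv4 address against those prefixes; the use of urllib2 and the example address are my own assumptions.
import json
from socket import inet_aton
from struct import unpack
from urllib2 import urlopen


ip2int = lambda x: unpack("!I", inet_aton(x))[0]

# AWS publish their current IPv4 prefixes at a well-known URL.
prefixes = json.loads(
    urlopen('https://ip-ranges.amazonaws.com/ip-ranges.json')
        .read())['prefixes']


def find_aws_prefix(ip):
    ip_num = ip2int(ip)

    for prefix in prefixes:
        network, size = prefix['ip_prefix'].split('/')
        mask = (0xFFFFFFFF << (32 - int(size))) & 0xFFFFFFFF

        if ip_num & mask == ip2int(network) & mask:
            return prefix

    return None


print find_aws_prefix('52.95.110.1')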
Another enrichment could include looking for class C ranges with a lot of gaps and doing a WHOIS on the first IPv4 address. It might announce that a single entity owns the entire class C and therefore all IPs in that range could be marked as such.
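A rough sketch of that step, shelling out to the standard whois utility (which would need to be installed) for the first IPv4 address of a class C network, could look like this; the example network is an arbitrary one.
from socket import inet_ntoa
from struct import pack
from subprocess import check_output


int2ip = lambda x: inet_ntoa(pack('!I', x))


def whois_first_of_class_c(class_c_base):
    # Run a WHOIS query against the first address of the /24.
    return check_output(['whois', int2ip(class_c_base)])


# 16793600 is 1.0.64.0 in dot-decimal form.
print whois_first_of_class_c(16793600)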
I also noted that there were a lot of small gaps where it may be safe to assume that missing IPv4 addresses between two ranges might be owned by the same entity and could be merged. The following is an example of this:
$ grep megaegg shrunken.csv | head
megaegg,16793601,16793855
megaegg,16793857,16794111
megaegg,16794113,16794367
megaegg,16794369,16794623
megaegg,16794625,16794879
megaegg,16794881,16795135
megaegg,16795137,16795391
megaegg,16795393,16795647
megaegg,16795649,16795903
megaegg,16795905,16796159
I took a look at what sort of reduction could be found merging ranges that shared the same domain name stub and were within 12 IPv4 addresses of one another.
$ vi merge_ip_ranges.py
lines = [(x.split(',')[0],
          int(x.split(',')[1]),
          int(x.split(',')[2]))
         for x in open('shrunken.csv', 'r+b')\
                      .read()\
                      .strip()\
                      .lower()\
                      .splitlines()]

banked_first = None

for num, (host, first, last) in enumerate(lines):
    if num + 1 >= len(lines):
        continue

    if host == lines[num + 1][0] and lines[num + 1][1] - last < 12:
        if banked_first is None:
            banked_first = first
    else:
        if banked_first:
            print '%s,%d,%d' % (host, banked_first, last)
            banked_first = None
        else:
            print '%s,%d,%d' % (host, first, last)
The result was a reduction of 17,784,359 records down to 4,969,340. This would also produce a database with much wider coverage of the global IPv4 space than the dataset initially offered by Rapid7, although its accuracy would be questionable.
The above could be simplified even further: if no other domain name stubs appear between two records with matching stubs, you could assume that the entire space between those two records is owned by the same entity. Here's an example of such a scenario:
tepco,16781382,16781392
tepco,16781419,16781422
tepco,16781453,16781471
The first two records are 27 IPv4 addresses apart and the second and third are 31 apart. When I removed the 12-address gap predicate from the above Python script I was presented with 3,693,710 records.
Finally, JSON has wide tooling support and many developers understand what it is and how to work with it but for delivery of a billion rows or more, it would be nice to see Parquet as one of the delivery formats. Parquet's tooling support has grown immensely in the last few years and manipulating data stored in Parquet is far more efficient than JSON. There's an argument that a Parquet distributable could come with instructions on converting it into JSON or other formats using open source tooling.
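As a rough illustration of how approachable that conversion is with open source tooling, the following sketch uses PyArrow to turn a Parquet file into line-delimited JSON; the file names are placeholders and pyarrow would need to be installed via pip.
import json

import pyarrow.parquet as pq


# Read the Parquet file into an Arrow table in memory.
table = pq.read_table('rdns.parquet')
columns = table.to_pydict()

with open('rdns.json', 'w') as out:
    for num in range(table.num_rows):
        # Re-assemble one JSON record per row from the columnar data.
        record = {name: columns[name][num]
                  for name in table.column_names}
        out.write(json.dumps(record) + '\n')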