
Posted on Mon 29 November 2021 under DevOps and Networking

Faster Top Level Domain Name Extraction with Go

Last year, I built a database of second-level domains from the reverse DNS names for 1.27 billion IPv4 addresses. I covered the steps I took to create the dataset in my Fast IPv4 to Host Lookups blog post. The source data came from Rapid7's Reverse DNS (RDNS) Study and is formatted in line-delimited JSON. I used a Python library called tldextract to extract the registered domain, the label sitting directly below the public suffix, from each record's full domain name. For example, "company-name" would be extracted from "test.system.company-name.co.uk".
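
To make the extraction concrete, here is a simplified Go sketch of the idea (my illustration, not part of the original pipeline): once a name's public suffix is known, the registered domain is whatever label sits immediately to its left.

package main

import (
    "fmt"
    "strings"
)

// registeredLabel returns the label directly below a known public suffix.
// Looking the suffix up in the public suffixes list is the hard part, and is
// what tldextract and the Go code later in this post take care of.
func registeredLabel(name, suffix string) string {
    rest := strings.TrimSuffix(name, "."+suffix)
    labels := strings.Split(rest, ".")
    return labels[len(labels)-1]
}

func main() {
    fmt.Println(registeredLabel("test.system.company-name.co.uk", "co.uk")) // company-name
}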

The Python-based extraction process took a day to complete. In August, I ported the script to Rust and brought that down to 33 minutes. Fellow Canadian Vincent Foley saw my post and discovered that only 20 records contained Unicode, yet the code took no advantage of an ASCII-only fast path. After re-writing the underlying TLD extraction code, Vincent was able to bring the processing time down to around 3.5 minutes.

In this post, I'll take a first attempt at re-writing the Rust version of my code in Go and analyse its performance.

Go Up & Running

The system used in this blog post is a step up from the one I used in 2019. It has been upgraded to Ubuntu 20.04 LTS with 16 GB of RAM and 1 TB of SSD capacity. The CPU is still the same: a 4-core Intel Core i5-4670K clocked at 3.4 GHz.

I'll install Go via a PPA so I get a more recent version than the one Ubuntu ships out of the box.

$ sudo apt update
$ sudo apt install \
    jq \
    pigz \
    software-properties-common

$ sudo add-apt-repository ppa:longsleep/golang-backports
$ sudo apt update
$ sudo apt install golang-go

The above installed version 1.17.3. You may end up with a newer version depending on when you run these commands.

$ go version
go version go1.17.3 linux/amd64

Rapid7's Reverse DNS Dataset

I'm still using the July 28th RDNS dataset from Rapid7. It has 1,242,695,760 lines of JSON and sits at just over 125 GB of data when uncompressed. This is what the first record in the archive looks like.

$ pigz -dc 2021-07-28-1627430820-rdns.json.gz \
    | head -n1 \
    | jq
{
  "timestamp": "1627467007",
  "name": "1.120.175.74",
  "value": "cpe-1-120-175-74.4cbp-r-037.cha.qld.bigpond.net.au",
  "type": "ptr"
}

I've broken the file up into four parts of 310,673,940 lines each, a quarter of the total, so that they can be processed independently of one another and I can max out the four cores on my CPU.

$ pigz -dc 2021-07-28-1627430820-rdns.json.gz \
    | split --lines=310673940 \
            --filter="pigz > rdns_\$FILE.json.gz"

If you're running the above on macOS, replace split with gsplit.

A Data Transformer Built using Go

Below is the Go version of the RDNS transformer.

$ mkdir -p ~/rdns
$ cd ~/rdns
$ vi go.mod
module marklit/rdns

go 1.17
$ vi main.go
package main

import (
    "bufio"
    "bytes"
    "compress/gzip"
    "encoding/binary"
    "encoding/json"
    "fmt"
    "github.com/globalsign/publicsuffix"
    "log"
    "net"
    "os"
    "strconv"
    "strings"
)

type RDNS struct {
    Timestamp string `json:"timestamp"`
    Name      net.IP `json:"name"`
    Value     string `json:"value"`
    Type      string `json:"type"`
}

func main() {
    // Fetch the latest public suffixes database. Note this hits the network
    // on every start-up; more on that later in this post.
    if err := publicsuffix.Update(); err != nil {
        log.Fatal(err)
    }

    file, err := os.Open(os.Args[1])

    if err != nil {
        log.Fatal(err)
    }

    reader, err := gzip.NewReader(file)

    if err != nil {
        log.Fatal(err)
    }

    scanner := bufio.NewScanner(reader)

    // Buffer stdout so that each record doesn't cost its own write syscall.
    // Without the final Flush, the last partial buffer would be lost.
    writer := bufio.NewWriterSize(os.Stdout, 4096)
    defer writer.Flush()

    var ipv4_int uint32
    var record RDNS

    for scanner.Scan() {
        // scanner.Bytes avoids the extra string copy that
        // []byte(scanner.Text()) would make.
        if err := json.Unmarshal(scanner.Bytes(), &record); err != nil {
            log.Fatalf("Unable to parse: %v", err)
        }

        // The dataset is all IPv4 PTR records, so To4 returns a 4-byte,
        // big-endian representation of the address.
        binary.Read(bytes.NewBuffer(record.Name.To4()), binary.BigEndian, &ipv4_int)

        suffix, _ := publicsuffix.PublicSuffix(record.Value)

        // TrimSuffix removes the suffix whole. TrimRight would treat it as a
        // set of characters and strip too many trailing bytes.
        no_tld := strings.TrimSuffix(record.Value, "."+suffix)
        dots := strings.Split(no_tld, ".")
        fmt.Fprintln(writer, strconv.FormatUint(uint64(ipv4_int), 10)+","+dots[len(dots)-1])
    }
}
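
One detail worth calling out in the loop above: binary.Read goes through reflection and allocates a buffer on every record. A possible micro-optimisation, which isn't reflected in any of the timings in this post, is to hand the 4-byte slice straight to binary.BigEndian.Uint32.

package main

import (
    "encoding/binary"
    "fmt"
    "net"
)

// ipv4ToUint32 converts an IPv4 address to its integer form without the
// reflection and buffer allocation that binary.Read incurs.
func ipv4ToUint32(ip net.IP) uint32 {
    if ip4 := ip.To4(); ip4 != nil {
        return binary.BigEndian.Uint32(ip4)
    }

    return 0 // not an IPv4 address
}

func main() {
    fmt.Println(ipv4ToUint32(net.ParseIP("1.120.175.74"))) // 24686410
}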

I'll collect the third-party libraries and then build a binary of the above with the debug symbols stripped out.

$ go mod tidy
$ go build -ldflags "-s -w"

The binary produced is 5,017,600 bytes in size. This is mostly due to the Go runtime being embedded.

Running the Go Build

The following originally finished in 71 minutes and 18 seconds. After upgrading the code to use a buffered writer, I was able to reduce this to 49 minutes and 11 seconds. That is still about 1.5x slower than the Rust version I built in August.

$ ls ../rdns_*.json.gz \
    | xargs \
        -P4 \
        -n1 \
        -I {} \
        sh -c "./main {} > {}.csv"

During the above operation, htop showed every CPU core maxed out.

The Performance Gap

I built a one-million-line extract of the dataset to use for analysis.

$ gunzip -c ~/2021-07-28-1627430820-rdns.json.gz \
    | head -n1000000 \
    | gzip \
    > ~/million.json.gz

I then ran the above Go code through perf and strace.

$ sudo perf stat -dd \
    ~/rdns/main \
    ~/million.json.gz \
    > /dev/null
$ sudo strace -wc \
    ~/rdns/main \
    ~/million.json.gz \
    > /dev/null

I then re-built the August version of my Rust code as well as Vincent's version and ran them through the above as well. This was done with Rust's newer stable channel version 1.56.1. Both codebases were built using the following:

$ RUSTFLAGS='-Ctarget-cpu=native' \
    cargo build --release

My Go code originally took 14.336 seconds to process all one million records, my Rust version took 11.062 seconds and Vincent's took 1.409 seconds. None of these pieces of code share the same public suffixes database, and the Go version is further hampered by the fact that it checks for a new version of the database each time it starts. That makes a difference on short runs but doesn't account for the performance gap when the 1.24B-record dataset is being processed.

The Go code originally printed each record straight to stdout with fmt.Println. Writes to os.Stdout aren't buffered in Go, so I switched to a bufio writer with a 4K buffer. This brought a 2.52x speed-up, with a run time of 5.678 seconds. Unfortunately, that didn't translate as well to the 1.24B-record run, where only 22 minutes were shaved off the 71-minute run time.
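
To reproduce the buffered-versus-unbuffered gap in isolation, a benchmark along these lines (my sketch, writing a representative CSV line to /dev/null rather than running the real pipeline) can be saved as main_test.go and run with go test -bench=.:

package main

import (
    "bufio"
    "fmt"
    "os"
    "testing"
)

func openDevNull(b *testing.B) *os.File {
    f, err := os.OpenFile(os.DevNull, os.O_WRONLY, 0)
    if err != nil {
        b.Fatal(err)
    }
    return f
}

// One write syscall per line, mirroring the original fmt.Println approach.
func BenchmarkUnbuffered(b *testing.B) {
    out := openDevNull(b)
    defer out.Close()
    for i := 0; i < b.N; i++ {
        fmt.Fprintln(out, "24686410,bigpond")
    }
}

// A write syscall only when the 4K buffer fills, mirroring the bufio upgrade.
func BenchmarkBuffered(b *testing.B) {
    out := openDevNull(b)
    defer out.Close()
    writer := bufio.NewWriterSize(out, 4096)
    defer writer.Flush()
    for i := 0; i < b.N; i++ {
        fmt.Fprintln(writer, "24686410,bigpond")
    }
}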

I had a look to see if all three versions were built with AVX or AVX2 instructions. My Go and Rust binaries both have them but Vincent's doesn't for some reason.

$ vi avx_patterns.tmp
vaddpd\|vaddps\|vaddsubpd\|vaddsubps\|vandnpd\|vandnps\|vandpd\|
vandps\|vblendpd\|vblendps\|vblendvpd\|vblendvps\|vbroadcastf128\|
vbroadcasti128\|vbroadcastsd\|vbroadcastss\|vcmppd\|vcmpps\|vcmpsd\|
vcmpss\|vcvtdq2pd\|vcvtdq2ps\|vcvtpd2dq\|vcvtpd2ps\|vcvtps2dq\|
vcvtps2pd\|vcvttpd2dq\|vcvttps2dq\|vdivpd\|vdivps\|vdpps\|
vextractf128\|vextracti128\|vgatherdpd\|vgatherdps\|vgatherqpd\|
vgatherqps\|vhaddpd\|vhaddps\|vhsubpd\|vhsubps\|vinsertf128\|
vinserti128\|vlddqu\|vmaskmovpd\|vmaskmovps\|vmaxpd\|vmaxps\|vminpd\|
vminps\|vmovapd\|vmovaps\|vmovddup\|vmovddup\|vmovdqa\|vmovdqu\|
vmovmskpd\|vmovmskps\|vmovntdq\|vmovntdqa\|vmovntpd\|vmovntps\|
vmovshdup\|vmovsldup\|vmovupd\|vmovups\|vmpsadbw\|vmulpd\|vmulps\|
vorpd\|vorps\|vpabsb\|vpabsd\|vpabsw\|vpackssdw\|vpacksswb\|
vpackusdw\|vpackuswb\|vpaddb\|vpaddd\|vpaddq\|vpaddsb\|vpaddsw\|
vpaddusb\|vpaddusw\|vpaddw\|vpalignr\|vpand\|vpandn\|vpavgb\|
vpavgw\|vpblendd\|vpblendvb\|vpblendw\|vpbroadcastb\|vpbroadcastd\|
vpbroadcastq\|vpbroadcastw\|vpcmpeqb\|vpcmpeqd\|vpcmpeqq\|vpcmpeqw\|
vpcmpgtb\|vpcmpgtd\|vpcmpgtq\|vpcmpgtw\|vperm2f128\|vperm2i128\|
vpermd\|vpermilpd\|vpermilps\|vpermpd\|vpermps\|vpermq\|vpgatherdd\|
vpgatherdq\|vpgatherqd\|vpgatherqq\|vphaddd\|vphaddsw\|vphaddw\|
vphsubd\|vphsubsw\|vphsubw\|vpmaddubsw\|vpmaddwd\|vpmaskmovd\|
vpmaskmovq\|vpmaxsb\|vpmaxsd\|vpmaxsw\|vpmaxub\|vpmaxud\|vpmaxuw\|
vpminsb\|vpminsd\|vpminsw\|vpminub\|vpminud\|vpminuw\|vpmovmskb\|
vpmovsxbd\|vpmovsxbq\|vpmovsxbw\|vpmovsxdq\|vpmovsxwd\|vpmovsxwq\|
vpmovzxbd\|vpmovzxbq\|vpmovzxbw\|vpmovzxdq\|vpmovzxwd\|vpmovzxwq\|
vpmuldq\|vpmulhrsw\|vpmulhuw\|vpmulhw\|vpmulld\|vpmullw\|vpmuludq\|
vpor\|vpsadbw\|vpshufb\|vpshufd\|vpshufhw\|vpshuflw\|vpsignb\|vpsignd\|
vpsignw\|vpslld\|vpslldq\|vpsllq\|vpsllvd\|vpsllvq\|vpsllw\|vpsrad\|
vpsravd\|vpsraw\|vpsrld\|vpsrldq\|vpsrlq\|vpsrlvd\|vpsrlvq\|vpsrlw\|
vpsubb\|vpsubd\|vpsubq\|vpsubsb\|vpsubsw\|vpsubusb\|vpsubusw\|vpsubw\|
vptest\|vpunpckhbw\|vpunpckhdq\|vpunpckhqdq\|vpunpckhwd\|vpunpcklbw\|
vpunpckldq\|vpunpcklqdq\|vpunpcklwd\|vpxor\|vpxor\|vrcpps\|vroundpd\|
vroundps\|vrsqrtps\|vshufpd\|vshufps\|vsqrtpd\|vsqrtps\|vsubpd\|
vsubps\|vtestpd\|vtestps\|vunpckhpd\|vunpckhps\|vunpcklpd\|vunpcklps\|
vxorpd\|vxorps\|vzeroall\|vzeroupper
$ cat avx_patterns.tmp | tr -d '\n' > avx_patterns
$ objdump -d ~/rdns/main \
    | grep -cf avx_patterns # 3,539 (GoLang version)
$ objdump -d ~/rust-rdns/target/release/rdns \
    | grep -cf avx_patterns # 462 (My Rust version)
$ objdump -d ~/vfb-tldextract/target/release/vfb-tldextract \
    | grep -cf avx_patterns # 0 (Vincent's)

My Go code originally made 16,763 context switches during its run, my Rust code made 2,359 and Vincent's only made 205. After the bufio upgrade, the Go code halved its context switch count.

System calls to read and mmap were all within the same order of magnitude of one another, but Go originally made 230,383 calls to the write system call. That is roughly one call for every four records. My Rust code made just 3,392, only 67 more than Vincent's.

I suspected that flushing output to stdout less often would bring a dramatic increase in performance, which turned out to be true. The bufio upgrade brought the calls to write down to 505, roughly 7x fewer than either of the Rust versions of this application. Go's total system call count, though still high, was brought down to 33,235; calls to rt_sigreturn, futex and epoll_pwait now make up the vast majority of them. Again, the major improvements seen with the million-record run didn't translate to the 1.24B-record run.

Further Improvements

Going forward, the domain name parsing library used by the Go version needs to be optimised for ASCII-only domain names, and the public suffixes database needs to be downloaded ahead of time rather than on every start-up.
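
Until the library can load a pre-downloaded copy of the database, the check could at least be made opt-in. Below is a minimal sketch of that idea; it uses only the two publicsuffix calls the transformer already relies on, and the RDNS_UPDATE_PSL environment variable is my own invention. I haven't verified how fresh a list the package falls back on when Update is never called.

package main

import (
    "fmt"
    "log"
    "os"

    "github.com/globalsign/publicsuffix"
)

func main() {
    // Only hit the network when explicitly asked to, so the four parallel
    // workers don't each perform their own update check at start-up.
    if os.Getenv("RDNS_UPDATE_PSL") != "" {
        if err := publicsuffix.Update(); err != nil {
            log.Fatal(err)
        }
    }

    suffix, _ := publicsuffix.PublicSuffix("test.system.company-name.co.uk")
    fmt.Println(suffix) // co.uk
}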

There has also been discussion on lobste.rs around using ByteDance's performance-focused JSON deserialisation library Sonic.
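
If Sonic lives up to its drop-in billing, the swap could look something like the sketch below. This is untested on this dataset; I've kept the name field as a plain string because I haven't confirmed that Sonic handles net.IP's text unmarshalling the way encoding/json does.

package main

import (
    "fmt"
    "log"

    "github.com/bytedance/sonic"
)

// The same shape as the RDNS struct above, with Name left as a string.
type RDNS struct {
    Timestamp string `json:"timestamp"`
    Name      string `json:"name"`
    Value     string `json:"value"`
    Type      string `json:"type"`
}

func main() {
    line := `{"timestamp": "1627467007", "name": "1.120.175.74", "value": "cpe-1-120-175-74.4cbp-r-037.cha.qld.bigpond.net.au", "type": "ptr"}`

    var record RDNS

    // sonic.UnmarshalString mirrors encoding/json's Unmarshal but skips the
    // []byte conversion.
    if err := sonic.UnmarshalString(line, &record); err != nil {
        log.Fatal(err)
    }

    fmt.Println(record.Name + " -> " + record.Value)
}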

Finally, moving to pipe-only output, similar to what was used in my Fastest FizzBuzz post, would reduce the amount of data being copied between memory locations and could have a substantial impact on run time. This optimisation would benefit every version of this application, not just the Go one.

As I come across optimisations for the Go version of this software I'll update this blog post.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.
