
Posted on Sun 26 February 2017 under Databases

Analysing Petabytes of Websites

Common Crawl is a California-based, non-profit organisation that aims to crawl the internet once every month or so and make the collected data public via Amazon S3. Key people involved with the project include Peter Norvig, Director of Research at Google. Their crawls typically yield two to three billion web pages and in 2015 alone they produced almost 1.4 PB of web page contents and derived metadata (when decompressed).

To give some context, Google stated in 2013 that they index 30 trillion pages though it's estimated they only serve around 50 million of those pages in their search results. In 2012, Yandex said they index "10s of billions" of pages. DuckDuckGo indexed 1.2 billion pages in 2012 according to one estimate.

In January of 2017, Common Crawl put together a 3.14 billion-page, 250 TB crawl. In this blog post, I'll analyse a portion of that dataset to find out which web servers are the most popular in terms of linkable URLs crawled.

Common Crawl's Data Lake

When Amazon Web Services began hosting Common Crawl's data back in 2012 it was held in the aws-publicdatasets S3 bucket in the US East region. That bucket still exists today but it doesn't hold much data from the past 12 months. Common Crawl now uses its own commoncrawl S3 bucket, which looks to hold a current and complete set of their published crawls. Below you can see their crawls, segmented by year and week number at the time of publishing.

$ aws s3 ls s3://commoncrawl/crawl-data/
PRE CC-MAIN-2013-20/
PRE CC-MAIN-2013-48/
PRE CC-MAIN-2014-10/
PRE CC-MAIN-2014-15/
PRE CC-MAIN-2014-23/
PRE CC-MAIN-2014-35/
PRE CC-MAIN-2014-41/
PRE CC-MAIN-2014-42/
PRE CC-MAIN-2014-49/
PRE CC-MAIN-2014-52/
PRE CC-MAIN-2015-06/
PRE CC-MAIN-2015-11/
PRE CC-MAIN-2015-14/
PRE CC-MAIN-2015-18/
PRE CC-MAIN-2015-22/
PRE CC-MAIN-2015-27/
PRE CC-MAIN-2015-32/
PRE CC-MAIN-2015-35/
PRE CC-MAIN-2015-40/
PRE CC-MAIN-2015-48/
PRE CC-MAIN-2016-07/
PRE CC-MAIN-2016-15/
PRE CC-MAIN-2016-18/
PRE CC-MAIN-2016-22/
PRE CC-MAIN-2016-26/
PRE CC-MAIN-2016-30/
PRE CC-MAIN-2016-36/
PRE CC-MAIN-2016-40/
PRE CC-MAIN-2016-44/
PRE CC-MAIN-2016-50/
PRE CC-MAIN-2017-04/
...

I'm not able to prove where the commoncrawl bucket is based but I suspect it's still in the US East Region. It's important that any EMR cluster launched is in the same region for good data locality.

$ aws s3api get-bucket-location --bucket commoncrawl
An error occurred (AccessDenied) when calling the GetBucketLocation operation: Access Denied

When a page is crawled, the request, response and contents (not limited to just HTML) are stored in WARC format in a warc.gz file. These warc.gz files each hold many pages and grow to around 1 GB in size. For each crawl (there is usually one every month) there can be upwards of 60,000 warc.gz files, all of which are listed in a manifest for that month's crawl.

The following is the manifest file for the January 2017 crawl. The manifest is around 7 MB when decompressed and contains 57,800 lines, each of which is the URI of a single warc.gz file. There is no protocol or hostname prefix as both S3 and HTTPS access are supported.

$ curl -O https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/warc.paths.gz
$ gunzip -c warc.paths.gz | head
crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00000-ip-10-171-10-70.ec2.internal.warc.gz
crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00001-ip-10-171-10-70.ec2.internal.warc.gz
crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00002-ip-10-171-10-70.ec2.internal.warc.gz
crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00003-ip-10-171-10-70.ec2.internal.warc.gz
crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00004-ip-10-171-10-70.ec2.internal.warc.gz
crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00005-ip-10-171-10-70.ec2.internal.warc.gz
crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00006-ip-10-171-10-70.ec2.internal.warc.gz
crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00007-ip-10-171-10-70.ec2.internal.warc.gz
crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00008-ip-10-171-10-70.ec2.internal.warc.gz
crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/CC-MAIN-20170116095119-00009-ip-10-171-10-70.ec2.internal.warc.gz
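
To make the addressing concrete, here is a minimal Python sketch that turns the first manifest path above into the two forms used throughout this post, an S3 URI and an HTTPS URL:

path = ('crawl-data/CC-MAIN-2017-04/segments/1484560279169.4/warc/'
        'CC-MAIN-20170116095119-00000-ip-10-171-10-70.ec2.internal.warc.gz')

# The manifest only stores the object key, so prefix it with either the
# bucket's S3 URI or its HTTPS endpoint depending on how you want to fetch it.
s3_uri = 's3://commoncrawl/%s' % path
https_url = 'https://commoncrawl.s3.amazonaws.com/%s' % path

print(s3_uri)
print(https_url)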

When a page is written into a warc.gz file, two derived datasets are created from it. The first is a JSON extract of the request, response and HTML metadata, stored in a warc.wat.gz file; these are usually around 350 MB in size. The second is a plain-text extract of the HTML contents, stored in warc.wet.gz files; these are usually around 150 MB in size.

To demonstrate what these files look like I'll fetch a random HackerNews page. First, I'll search for all https://news.ycombinator.com/* pages using Common Crawl's index for their January 2017 crawl.

$ curl --silent 'http://index.commoncrawl.org/CC-MAIN-2017-04-index?url=https%3A%2F%2Fnews.ycombinator.com%2F*&output=json' \
    > hn.paths

That request returned 13,591 pages from the January crawl.

$ wc -l hn.paths
13591 hn.paths

Note that any one page may have been crawled and stored more than once and there is no guarantee a page crawled in one month's crawl will be crawled in another.

Each line in the hn.paths results file is a JSON string representing the metadata of the page crawled. It contains the warc.gz file URI that the page contents can be found in as well as the byte offset in that GZIP file and the length of the contents when GZIP-compressed. Here is one page picked at random:

$ sort -R hn.paths \
    | head -n1 \
    | python -mjson.tool
{
    "digest": "D6UPIKJTS6XLRLUWTW3HL2S44IE2GUZS",
    "filename": "crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz",
    "length": "2049",
    "mime": "text/html",
    "offset": "822555329",
    "status": "200",
    "timestamp": "20170117120519",
    "url": "https://news.ycombinator.com/item?id=4781011",
    "urlkey": "com,ycombinator,news)/item?id=4781011"
}

I'll download the warc.gz file and extract the page. I'll run the head command to take the first 822,555,329 + 2,049 bytes of raw GZIP data, pipe that into tail and take the last 2,049 bytes of GZIP data, isolating the compressed content for just this one page. I'll then decompress the contents using gunzip.

$ curl -O https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz
$ head -c 822557378 CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz \
    | tail -c 2049 \
    | gunzip -c

If you want to save some bandwidth you can pass the byte range to curl so it only fetches those 2,049 bytes from Amazon in the first place.

$ curl -H "range: bytes=822555329-822557378" \
       -O \
       https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz
$ gunzip -c CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz \
    2>/dev/null
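
The same fetch can also be done from Python using nothing but the standard library. The sketch below mirrors the head/tail arithmetic above: it requests bytes offset through offset + length - 1 and decompresses the resulting gzip member on its own.

import gzip
import io

try:                                      # Python 3
    from urllib.request import Request, urlopen
except ImportError:                       # Python 2
    from urllib2 import Request, urlopen

url = ('https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/'
       'segments/1484560279657.18/warc/'
       'CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz')

# Offset and length as reported by the Common Crawl index for this page.
offset, length = 822555329, 2049

# Ask S3 for just the bytes making up this record's gzip member.
request = Request(url)
request.add_header('Range', 'bytes=%d-%d' % (offset, offset + length - 1))
compressed = io.BytesIO(urlopen(request).read())

# Each record is its own gzip member so it decompresses independently.
print(gzip.GzipFile(fileobj=compressed).read().decode('utf-8', 'replace'))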

Below are the headers in full followed by the HTML. I've truncated the HTML for this blog post but I can assure you the entire HTML of the page is present.

WARC/1.0
WARC-Type: response
WARC-Date: 2017-01-17T12:05:19Z
WARC-Record-ID: <urn:uuid:9cd7b193-4ce0-44e5-924f-478d69798b52>
Content-Length: 3955
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:c915b66c-b823-44f6-94f4-452aa439a12f>
WARC-Concurrent-To: <urn:uuid:20114dd5-747e-4321-80da-0c449bb37894>
WARC-IP-Address: 104.20.44.44
WARC-Target-URI: https://news.ycombinator.com/item?id=4781011
WARC-Payload-Digest: sha1:D6UPIKJTS6XLRLUWTW3HL2S44IE2GUZS
WARC-Block-Digest: sha1:PF6MXD5SQPBVXVKSRW6WUIW36MPMS2RS
WARC-Truncated: length

HTTP/1.1 200 OK
Set-Cookie: __cfduid=daf9300df4da2584d3a99b5afc474f47f1484654719; expires=Wed, 17-Jan-18 12:05:19 GMT; path=/; domain=.ycombinator.com; HttpOnly
Connection: close
Server: cloudflare-nginx
Cache-Control: max-age=0
X-Frame-Options: DENY
Strict-Transport-Security: max-age=31556900; includeSubDomains
Vary: Accept-Encoding
Date: Tue, 17 Jan 2017 12:05:19 GMT
CF-RAY: 3229acbe785d23d8-IAD
Content-Type: text/html; charset=utf-8

<html op="item"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?vg9HEiw8gAskbHjOLY38">
        <link rel="shortcut icon" href="favicon.ico">
        <title>What are business hours when you are a developer platform used by developers glo... | Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
...

If I want to fetch the JSON metadata extract for this page I can find its content in the .warc.gz file's warc.wat.gz sibling. To get the right URL, change the URL's 5th sub-folder from warc to wat and change the file extension from warc.gz to warc.wat.gz.

$ curl -O https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/wat/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.wat.gz
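
The transformation itself is easy to do in code. Here is a small sketch that derives the wat path from the warc path in the index record above, using nothing but two string substitutions:

warc_path = ('crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/'
             'CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz')

# Swap the warc sub-folder for wat and adjust the file extension.
wat_path = warc_path.replace('/warc/', '/wat/') \
                    .replace('.warc.gz', '.warc.wat.gz')

print('https://commoncrawl.s3.amazonaws.com/' + wat_path)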

I don't have any offset information for the warc.wat.gz file so I'll run zgrep to find the content instead. The JSON payloads below have been truncated for readability purposes.

$ zgrep -B3 -A7 'id\=4781011' \
    CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.wat.gz
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://news.ycombinator.com/item?id=4781011
WARC-Date: 2017-01-17T12:05:19Z
WARC-Record-ID: <urn:uuid:cbdb944d-15b1-4e6a-aa26-262791508a94>
WARC-Refers-To: <urn:uuid:20114dd5-747e-4321-80da-0c449bb37894>
Content-Type: application/json
Content-Length: 1358

{"Envelope":{"Format":"WARC","WARC-Header-Length":"361","Block-Digest":"sha1:QBYGW7UDVVNPHNJ2XECOCJD4L7YI6UPF",...

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://news.ycombinator.com/item?id=4781011
WARC-Date: 2017-01-17T12:05:19Z
WARC-Record-ID: <urn:uuid:c87f2618-7d4a-41d2-95c8-271721681a7d>
WARC-Refers-To: <urn:uuid:9cd7b193-4ce0-44e5-924f-478d69798b52>
Content-Type: application/json
Content-Length: 4080

{"Envelope":{"Format":"WARC","WARC-Header-Length":"575","Block-Digest":"sha1:PF6MXD5SQPBVXVKSRW6WUIW36MPMS2RS",...

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://news.ycombinator.com/item?id=4781011
WARC-Date: 2017-01-17T12:05:19Z
WARC-Record-ID: <urn:uuid:e8242cc7-9787-4c35-b9d7-6b6531b99e90>
WARC-Refers-To: <urn:uuid:ed64f4c4-494a-4ca3-9118-228bd9cce3a0>
Content-Type: application/json
Content-Length: 1109

{"Envelope":{"Format":"WARC","WARC-Header-Length":"389","Block-Digest":"sha1:I55H52HFRALSA2RHZ2TCEKYNIIZVDUUT",...

There were three metadata records stored for this page in the file. Here is the JSON for the longest of the three. I've truncated the HTML meta links.

{
    "Envelope": {
        "Format": "WARC",
        "WARC-Header-Length": "575",
        "Block-Digest": "sha1:PF6MXD5SQPBVXVKSRW6WUIW36MPMS2RS",
        "Actual-Content-Length": "3955",
        "WARC-Header-Metadata": {
            "WARC-Type": "response",
            "WARC-Truncated": "length",
            "WARC-Date": "2017-01-17T12:05:19Z",
            "WARC-Warcinfo-ID": "<urn:uuid:c915b66c-b823-44f6-94f4-452aa439a12f>",
            "Content-Length": "3955",
            "WARC-Record-ID": "<urn:uuid:9cd7b193-4ce0-44e5-924f-478d69798b52>",
            "WARC-Block-Digest": "sha1:PF6MXD5SQPBVXVKSRW6WUIW36MPMS2RS",
            "WARC-Payload-Digest": "sha1:D6UPIKJTS6XLRLUWTW3HL2S44IE2GUZS",
            "WARC-Target-URI": "https://news.ycombinator.com/item?id=4781011",
            "WARC-IP-Address": "104.20.44.44",
            "WARC-Concurrent-To": "<urn:uuid:20114dd5-747e-4321-80da-0c449bb37894>",
            "Content-Type": "application/http; msgtype=response"
        },
        "Payload-Metadata": {
            "Trailing-Slop-Length": "4",
            "Actual-Content-Type": "application/http; msgtype=response",
            "HTTP-Response-Metadata": {
                "Headers": {
                    "X-Frame-Options": "DENY",
                    "Strict-Transport-Security": "max-age=31556900; includeSubDomains",
                    "Date": "Tue, 17 Jan 2017 12:05:19 GMT",
                    "Vary": "Accept-Encoding",
                    "CF-RAY": "3229acbe785d23d8-IAD",
                    "Set-Cookie": "__cfduid=daf9300df4da2584d3a99b5afc474f47f1484654719; expires=Wed, 17-Jan-18 12:05:19 GMT; path=/; domain=.ycombinator.com; HttpOnly",
                    "Content-Type": "text/html; charset=utf-8",
                    "Connection": "close",
                    "Server": "cloudflare-nginx",
                    "Cache-Control": "max-age=0"
                },
                "Headers-Length": "453",
                "Entity-Length": "3502",
                "Entity-Trailing-Slop-Bytes": "0",
                "Response-Message": {
                    "Status": "200",
                    "Version": "HTTP/1.1",
                    "Reason": "OK"
                },
                "HTML-Metadata": {
                    "Links": [{
                        "path": "IMG@/src",
                        "url": "y18.gif"
                    }, {
                        "path": "A@/href",
                        "url": "http://www.ycombinator.com"
                    }, {
                        "text": "Hacker News",
                        "path": "A@/href",
                        "url": "news"
                    }, {
                        "text": "new",
                        "path": "A@/href",
                        "url": "newest"
                    }, {
                        "path": "FORM@/action",
                        "method": "get",
                        "url": "//hn.algolia.com/"
                    }],
                    "Head": {
                        "Link": [{
                            "path": "LINK@/href",
                            "rel": "stylesheet",
                            "type": "text/css",
                            "url": "news.css?vg9HEiw8gAskbHjOLY38"
                        }, {
                            "path": "LINK@/href",
                            "rel": "shortcut icon",
                            "url": "favicon.ico"
                        }],
                        "Scripts": [{
                            "path": "SCRIPT@/src",
                            "type": "text/javascript",
                            "url": "hn.js?vg9HEiw8gAskbHjOLY38"
                        }],
                        "Metas": [{
                            "content": "origin",
                            "name": "referrer"
                        }, {
                            "content": "width=device-width, initial-scale=1.0",
                            "name": "viewport"
                        }],
                        "Title": "What are business hours when you are a developer platform used by developers glo... | Hacker News"
                    }
                },
                "Entity-Digest": "sha1:D6UPIKJTS6XLRLUWTW3HL2S44IE2GUZS"
            }
        }
    },
    "Container": {
        "Compressed": true,
        "Gzip-Metadata": {
            "Footer-Length": "8",
            "Deflate-Length": "2049",
            "Header-Length": "10",
            "Inflated-CRC": "-1732134004",
            "Inflated-Length": "4534"
        },
        "Offset": "822555329",
        "Filename": "CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz"
    }
}

As you can see there is a rich set of metadata all nicely structured in a way that is easy to work with. The warc.wat.gz files are around a third the size of the warc.gz files so they save a substantial amount of bandwidth if you can design your job around not needing the entire page contents.

AWS EMR Up & Running

All the following commands were run on a fresh install of Ubuntu 14.04.3. To start, I'll install the AWS CLI tool and a few dependencies it needs to run.

$ sudo apt update
$ sudo apt install \
    python-pip \
    python-virtualenv
$ virtualenv amazon
$ source amazon/bin/activate
$ pip install awscli

I'll then enter my AWS credentials.

$ read AWS_ACCESS_KEY_ID
$ read AWS_SECRET_ACCESS_KEY
$ export AWS_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY

I'll run configure to make sure us-east-1 is my default region.

$ aws configure
AWS Access Key ID [********************]:
AWS Secret Access Key [********************]:
Default region name [us-east-1]: us-east-1
Default output format [None]:

I'll be launching a 5-node Hadoop cluster of m3.xlarge instances using the 5.3.1 release of AWS EMR. This comes with Hadoop 2.7.3, Hive 2.1.1, Spark 2.1.0 and Presto 0.157.1. I don't recommend using spot instances for master or core nodes but in the interest of keeping my costs for this blog post down, all five nodes are spot instances where I've bid to pay at most $0.07 / hour for each node.

$ aws emr create-cluster \
    --applications \
        Name=Hadoop \
        Name=Hive \
        Name=Spark \
        Name=Presto \
    --auto-scaling-role EMR_AutoScaling_DefaultRole \
    --ec2-attributes '{
        "KeyName": "emr",
        "InstanceProfile": "EMR_EC2_DefaultRole",
        "SubnetId": "subnet-0489ed5c",
        "EmrManagedSlaveSecurityGroup": "sg-2d321350",
        "EmrManagedMasterSecurityGroup": "sg-3332134e"
    }' \
    --enable-debugging \
    --instance-groups '[{
        "InstanceCount": 2,
        "BidPrice": "0.07",
        "InstanceGroupType": "TASK",
        "InstanceType": "m3.xlarge",
        "Name": "Task - 3"
    }, {
        "InstanceCount": 1,
        "BidPrice": "0.07",
        "InstanceGroupType": "MASTER",
        "InstanceType": "m3.xlarge",
        "Name": "Master - 1"
    }, {
        "InstanceCount": 2,
        "BidPrice": "0.07",
        "InstanceGroupType": "CORE",
        "InstanceType": "m3.xlarge",
        "Name": "Core - 2"
    }]' \
    --log-uri 's3n://aws-logs-591231097547-us-east-1/elasticmapreduce/' \
    --name 'My cluster' \
    --region us-east-1 \
    --release-label emr-5.3.1 \
    --scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \
    --service-role EMR_DefaultRole \
    --termination-protected

After 12 minutes the machines were up and running and I was able to SSH into the master node.

$ ssh -o ServerAliveInterval=50 \
      -i ~/.ssh/emr.pem \
      hadoop@184.73.27.176
       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/
11 package(s) needed for security, out of 17 available
Run "sudo yum update" to apply all updates.

EEEEEEEEEEEEEEEEEEEE MMMMMMMM           MMMMMMMM RRRRRRRRRRRRRRR
E::::::::::::::::::E M:::::::M         M:::::::M R::::::::::::::R
EE:::::EEEEEEEEE:::E M::::::::M       M::::::::M R:::::RRRRRR:::::R
  E::::E       EEEEE M:::::::::M     M:::::::::M RR::::R      R::::R
  E::::E             M::::::M:::M   M:::M::::::M   R:::R      R::::R
  E:::::EEEEEEEEEE   M:::::M M:::M M:::M M:::::M   R:::RRRRRR:::::R
  E::::::::::::::E   M:::::M  M:::M:::M  M:::::M   R:::::::::::RR
  E:::::EEEEEEEEEE   M:::::M   M:::::M   M:::::M   R:::RRRRRR::::R
  E::::E             M:::::M    M:::M    M:::::M   R:::R      R::::R
  E::::E       EEEEE M:::::M     MMM     M:::::M   R:::R      R::::R
EE:::::EEEEEEEE::::E M:::::M             M:::::M   R:::R      R::::R
E::::::::::::::::::E M:::::M             M:::::M RR::::R      R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM             MMMMMMM RRRRRRR      RRRRRR

I need three Python-based dependencies installed on the master and task nodes. I ran the following command by hand but I'd recommend wrapping this up in a bootstrap action when launching EMR.

$ sudo pip install \
    boto \
    warc \
    https://github.com/commoncrawl/gzipstream/archive/master.zip

Pointing Spark at Common Crawl

On the master node, I'll download the list of warc.wat.gz URIs for the January 2017 crawl. I'll then pick 100 URIs at random and save them onto HDFS. Note that if you want to work with data over several crawls it's possible to download the paths.gz files for multiple months and concatenate them into a single manifest.

$ curl -O https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/wat.paths.gz
$ gunzip -c wat.paths.gz \
    | sort -R \
    | head -n100 \
    | gzip > wat.paths.100.gz
$ hdfs dfs -copyFromLocal \
    wat.paths.100.gz \
    /user/hadoop/
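
If you did want to sample across several crawls, a sketch along these lines would do it. The first manifest below is the one just downloaded; the second file name is a hypothetical example of an earlier crawl's manifest.

import gzip
import random

manifests = ['wat.paths.gz', 'wat-2016-50.paths.gz']  # second name is hypothetical

# Pool the paths from every manifest into one list.
paths = []
for manifest in manifests:
    with gzip.open(manifest) as f:
        paths.extend(line.strip() for line in f if line.strip())

# Draw a 100-file sample spanning all of the crawls listed above.
sample = random.sample(paths, 100)

with gzip.open('wat.paths.100.gz', 'wb') as f:
    f.write(b'\n'.join(sample) + b'\n')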

I will then launch pyspark and start the data extraction job on those 100 warc.wat.gz files.

$ pyspark
import json

import boto
from boto.s3.key import Key
from gzipstream import GzipStreamFile
from pyspark.sql.types import *
import warc


def get_servers(id_, iterator):
    conn = boto.connect_s3(anon=True, host='s3.amazonaws.com')
    bucket = conn.get_bucket('commoncrawl')

    for uri in iterator:
        key_ = Key(bucket, uri)
        file_ = warc.WARCFile(fileobj=GzipStreamFile(key_))

        for record in file_:
            if record['Content-Type'] == 'application/json':
                record = json.loads(record.payload.read())

                try:
                    yield record['Envelope']\
                                ['Payload-Metadata']\
                                ['HTTP-Response-Metadata']\
                                ['Headers']\
                                ['Server'].strip().lower()
                except KeyError:
                    yield None

files = sc.textFile('/user/hadoop/wat.paths.100.gz')
servers = files.mapPartitionsWithSplit(get_servers) \
               .map(lambda x: (x, 1)) \
               .reduceByKey(lambda x, y: x + y)

schema = StructType([
    StructField("server_name", StringType(), True),
    StructField("page_count", LongType(), True)
])

sqlContext.createDataFrame(servers, schema=schema) \
          .write \
          .format("parquet") \
          .saveAsTable('servers')

In the above script I read in the 100 WAT URIs off the file stored on HDFS:

files = sc.textFile('/user/hadoop/wat.paths.100.gz')

I then iterated over each line in the file, each of which represents a single warc.wat.gz URI. These all ran through the get_servers method. Afterwards, I ran a map and reduce over the results to count how many pages each server served. To be clear, when I refer to a server I mean the software reported to have handled the request and served the resulting contents (such as Apache, Nginx or IIS).

servers = files.mapPartitionsWithSplit(get_servers) \
               .map(lambda x: (x, 1)) \
               .reduceByKey(lambda x, y: x + y)

In the get_servers method, an anonymous connection to AWS S3 is made and a handle to the commoncrawl S3 bucket is acquired.

def get_servers(id_, iterator):
    conn = boto.connect_s3(anon=True, host='s3.amazonaws.com')
    bucket = conn.get_bucket('commoncrawl')

I used the warc library to parse each warc.wat.gz file so that each record, which normally spans multiple lines in the file, is fetched as a single item via an iterator. GzipStreamFile is used so the warc.wat.gz files can be read as they stream in rather than having to wait for each entire file to download first.

for uri in iterator:
    key_ = Key(bucket, uri)
    file_ = warc.WARCFile(fileobj=GzipStreamFile(key_))

    for record in file_:
        if record['Content-Type'] == 'application/json':
            record = json.loads(record.payload.read())

Not every record will contain a server name. If one cannot be found then None is yielded; if one is found, its surrounding white space is stripped and it's converted to lower case.

try:
    yield record['Envelope']\
                ['Payload-Metadata']\
                ['HTTP-Response-Metadata']\
                ['Headers']\
                ['Server'].strip().lower()
except KeyError:
    yield None

Once all the server names have been mapped and reduced they're stored in a "servers" table in Hive. That way the dataset can be queried by SparkSQL, Presto, Hive and anything else that integrates with Hive.

schema = StructType([
    StructField("server_name", StringType(), True),
    StructField("page_count", LongType(), True)
])
sqlContext.createDataFrame(servers, schema=schema) \
          .write \
          .format("parquet") \
          .saveAsTable('servers')

The above job completed in 2 hours and 10 minutes. Given that two task nodes managed to process 100 files in that time it'd be safe to assume 4 task nodes should be able to finish the job in just over an hour.
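
That estimate is just linear scaling: two task nodes worked through 100 files in 130 minutes, roughly 50 files per node, so four task nodes would each have around 25 files and, ignoring scheduling overhead and any S3 throughput limits, should finish in roughly 65 minutes.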

During the job run, I could see on one of the task nodes that the memory wasn't being exhausted. It could be better to run a smaller instance type with only 8 GB of RAM for the task nodes and use twice as many of them in order to speed up the job while spending the same amount of money.

top - 13:37:51 up  1:10,  1 user,  load average: 1.94, 1.37, 0.88
Tasks: 134 total,   2 running, 132 sleeping,   0 stopped,   0 zombie
Cpu0  : 11.9%us,  0.8%sy,  0.0%ni, 86.8%id,  0.3%wa,  0.0%hi,  0.0%si,  0.1%st
Cpu1  : 13.1%us,  0.8%sy,  0.0%ni, 85.6%id,  0.3%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu2  : 15.3%us,  0.8%sy,  0.0%ni, 83.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.2%st
Cpu3  : 10.2%us,  0.8%sy,  0.0%ni, 87.9%id,  0.7%wa,  0.0%hi,  0.2%si,  0.3%st
Mem:  15407120k total,  5714164k used,  9692956k free,    83524k buffers
Swap:        0k total,        0k used,        0k free,  2459096k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20431 yarn      20   0  298m  46m 6864 R 100.0  0.3   7:17.90 python -m pyspark.daemon
 4930 presto    20   0 15.4g 974m  49m S 44.0  6.5  23:13.41 java -cp /usr/lib/presto/lib/* -verbose:class -server -Xmx12618613916 -Xmn512M -XX:+UseConcMarkSweepGC -XX
 2908 hadoop    20   0 4591m 243m  23m S  8.0  1.6   0:25.42 /etc/alternatives/jre/bin/java -Xmx1024m -XX:OnOutOfMemoryError=kill -9 %p -XX:MinHeapFreeRatio=10 -server
...

Analysing the Derived Data

With the dataset in Hive, I can now query the results. Below is an extract from SparkSQL.

$ spark-sql
SELECT page_count, server_name
FROM servers
ORDER BY page_count DESC
LIMIT 10;
11311889        NULL
1013899 apache
857613  nginx
467948  cloudflare-nginx
238186  microsoft-iis/7.5
168815  microsoft-iis/8.5
126959  gse
107301  nginx/1.10.2
88690   apache/2.2.15 (centos)
80912   nginx/1.10.1

The same query can also be run from Presto.

$ presto-cli \
    --schema default \
    --catalog hive
SELECT page_count, server_name
FROM servers
ORDER BY page_count DESC
LIMIT 10;
 page_count |      server_name
------------+------------------------
   11311889 | NULL
    1013899 | apache
     857613 | nginx
     467948 | cloudflare-nginx
     238186 | microsoft-iis/7.5
     168815 | microsoft-iis/8.5
     126959 | gse
     107301 | nginx/1.10.2
      88690 | apache/2.2.15 (centos)
      80912 | nginx/1.10.1

Looking at the sum of page counts in this sample and subtracting the NULLs, I can see cloudflare-nginx was reported for a little over 9% of the pages that returned a server name. There was a report recently of Cloudflare's reverse proxies leaking uninitialised memory. I'd hate to think this sample is free of bias and that potentially 9%+ of websites were somehow affected.

SELECT SUM(page_count)
FROM servers
WHERE server_name IS NOT NULL;
  _col0
---------
 5030633
SELECT page_count, server_name
FROM servers
WHERE server_name LIKE '%cloudflare%';
 page_count |   server_name
------------+------------------
     467948 | cloudflare-nginx
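
Putting those two results together: 467,948 / 5,030,633 ≈ 9.3%, which is where the figure above comes from.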
