Real Time DNS Analytics at Cloudflare

Tom Arnfeld


[Diagram: Attackers, Visitors, Crawlers & bots → Your website]

[Diagram: Attackers, Visitors, Crawlers & bots → Cloudflare → Your website (Cloudflare Protected)]

- 100+ data centers globally

- 2.5B monthly unique visitors

- >10% of Internet requests every day

- <=3M DNS queries/second

- 6M+ websites, apps & APIs in 150 countries

- 5M+ HTTP requests/second

Anatomy of a DNS query

$ dig www.cloudflare.com

; <<>> DiG 9.8.3-P1 <<>> www.cloudflare.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36582
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.cloudflare.com.        IN  A

;; ANSWER SECTION:
www.cloudflare.com.  5  IN  A  198.41.215.162
www.cloudflare.com.  5  IN  A  198.41.214.162

;; Query time: 34 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sat Sep 2 10:48:30 2017
;; MSG SIZE rcvd: 68
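The dig output is a decoded view of a DNS message on the wire. As a rough illustration only (this is not Cloudflare's code), the sketch below encodes the same question and decodes the 12-byte header using Python's standard library and the RFC 1035 message layout. The helper names `build_query` and `parse_header` are hypothetical.

```python
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 36582) -> bytes:
    """Encode a minimal DNS query in RFC 1035 wire format."""
    # Header: id, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by the zero-length root label.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE (1 = A, 28 = AAAA) followed by QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_header(message: bytes) -> dict:
    """Decode the fixed 12-byte DNS header into the fields dig prints."""
    ident, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", message[:12])
    return {
        "id": ident,
        "qr": flags >> 15,            # 0 = query, 1 = response
        "opcode": (flags >> 11) & 0xF,
        "rd": (flags >> 8) & 1,       # recursion desired
        "ra": (flags >> 7) & 1,       # recursion available
        "rcode": flags & 0xF,         # 0 = NOERROR
        "qdcount": qd,
        "ancount": an,
        "nscount": ns,
        "arcount": ar,
    }


if __name__ == "__main__":
    query = build_query("www.cloudflare.com")
    print(len(query), parse_header(query))
```

Each of these header and question fields is a candidate column in a per-query log, which is where the field count of a DNS log record starts to add up.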

Fields: 30+

Powerful customer analytics

FILTER BY DNS QUERY NAME

What did we want?

- Multidimensional query analytics

- Complex ad-hoc queries

- Capable of current and expected future scale

- Gracefully handle late-arriving log data

- Roll-ups/aggregations for long-term storage

- Highly available and replicated architecture

- Queries per second: <=3M

- Edge points of presence: 100+

- Query dimensions: 20+

- Years of stored aggregation: 5+

We tried a few things...

- Kafka + CitusDB + Go

- Kafka + Spark Streaming

- Kafka + Flink

- Kafka + Druid

- Kafka + ClickHouse

ClickHouse

- Tabular, column-oriented data store

- Clustered architecture

- Familiar SQL query interface: lots of very useful built-in aggregation functions

- Raw log data stored for 3 months: ~7 trillion rows

- Aggregated data stored for ∞: 1m, 5m, 1h aggregations across 3 dimensions
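To make the 1m, 5m and 1h roll-ups concrete, here is a minimal sketch of time-bucketed aggregation in plain Python. It only illustrates the idea: in the system described here the aggregation happens inside ClickHouse, the slides do not name the three dimensions, and the single `query_type` dimension plus the `bucket_start` and `roll_up` helpers are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def bucket_start(ts: datetime, width_s: int) -> datetime:
    """Round a timezone-aware timestamp down to the start of its bucket."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % width_s, tz=timezone.utc)


def roll_up(rows, width_s: int) -> Counter:
    """Count rows per (bucket, query_type); rows are (timestamp, query_type)."""
    counts = Counter()
    for ts, query_type in rows:
        counts[(bucket_start(ts, width_s), query_type)] += 1
    return counts


if __name__ == "__main__":
    utc = timezone.utc
    raw = [
        (datetime(2017, 8, 1, 21, 0, 5, tzinfo=utc), 1),   # A
        (datetime(2017, 8, 1, 21, 0, 59, tzinfo=utc), 1),  # A
        (datetime(2017, 8, 1, 21, 4, 0, tzinfo=utc), 28),  # AAAA
        (datetime(2017, 8, 1, 21, 5, 0, tzinfo=utc), 1),   # A
    ]
    for width in (60, 300, 3600):  # 1m, 5m, 1h
        print(width, dict(roll_up(raw, width)))
```

Because each coarser bucket is a sum of finer ones, the raw rows can be dropped after their retention window while the aggregates stay small enough to keep indefinitely.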

[Diagram: Attackers, Visitors, Crawlers & bots, Your website]

[Pipeline: DNS Query → Cloudflare DNS Server → Log Forwarder → Kafka Topic → Go ClickHouse Inserter → ClickHouse Cluster]
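The inserter stage sits between Kafka and ClickHouse. The slides do not show its code, and the real component is written in Go, so the following is only a Python sketch of one common approach for this kind of stage: buffer rows and flush them in batches, either when the batch is full or when it has been open too long. The class name `BatchInserter`, its parameters, and the `flush` callback standing in for a ClickHouse insert are all hypothetical.

```python
import time
from typing import Any, Callable, List


class BatchInserter:
    """Buffer rows and hand them to `flush` in batches (size- or age-triggered)."""

    def __init__(
        self,
        flush: Callable[[List[Any]], None],
        max_rows: int = 1000,
        max_age_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_rows = max_rows
        self._max_age_s = max_age_s
        self._clock = clock
        self._rows: List[Any] = []
        self._first_at = 0.0

    def add(self, row: Any) -> None:
        if not self._rows:
            # Remember when the current batch was opened.
            self._first_at = self._clock()
        self._rows.append(row)
        too_big = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_at >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Send any buffered rows downstream and start a fresh batch."""
        if self._rows:
            self._flush(self._rows)
            self._rows = []


if __name__ == "__main__":
    inserter = BatchInserter(lambda batch: print("insert", batch), max_rows=2)
    for message in ["q1", "q2", "q3"]:  # stand-ins for messages read from Kafka
        inserter.add(message)
    inserter.flush()  # drain the remainder on shutdown
```

Batching matters here because column stores generally favour fewer, larger inserts over many single-row writes.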

- October 2016: Began evaluating technologies and architecture

- November 2016: Prototype ClickHouse cluster with 3 nodes, inserting a sample of data

- Finalized schema, deployed a production ClickHouse cluster of 6 nodes

- August 2017: Migrated to a new cluster with multi-tenancy

- Growing interest among other Cloudflare engineering teams, worked on standard tooling

Multi-tenant ClickHouse cluster

- Row insertion/s: 8M+

- RAID-0 spinning disks: 1PB+

- Insertion throughput/s: 4GB+

- Nodes: 33
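A quick back-of-the-envelope reading of these figures, as a sketch. It treats the "+" values as lower bounds, assumes decimal units (1 GB = 10^9 bytes), and assumes load is spread evenly across nodes, none of which the slides state. The function name `per_node_estimates` is made up for the example.

```python
def per_node_estimates(rows_per_s: float, bytes_per_s: float, nodes: int) -> dict:
    """Derive rough per-node and per-row figures from cluster-wide totals."""
    return {
        "rows_per_node_per_s": rows_per_s / nodes,
        "mb_per_node_per_s": bytes_per_s / nodes / 10**6,
        "avg_bytes_per_row": bytes_per_s / rows_per_s,
    }


if __name__ == "__main__":
    # 8M+ rows/s, 4GB+/s and 33 nodes, taken from the slide as lower bounds.
    estimates = per_node_estimates(8_000_000, 4 * 10**9, 33)
    for name, value in estimates.items():
        print(f"{name}: {value:,.1f}")
```

Under those assumptions each node absorbs roughly 240 thousand rows (about 120 MB) per second, and the average inserted row is about 500 bytes.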

Example

SELECT
    toStartOfMinute(datetime) AS t,
    count(*) / 60 AS qps
FROM open.dnslogs
WHERE date = '2017-08-01'
  AND toHour(datetime) = 21
  AND ...
GROUP BY t
ORDER BY t

Example

SELECT
    toStartOfMinute(datetime) AS t,
    count(*) / 60 AS qps,
    uniq(srcIPv4) AS ip4,
    uniq(srcIPv6) AS ip6,
    uniq(queryName) AS qn,
    countIf(queryType = 1) AS aCount,
    countIf(queryType = 28) AS aaaaCount
FROM open.dnslogs
WHERE date = '2017-08-01'
  AND ...
GROUP BY t
ORDER BY t
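For readers less familiar with ClickHouse's aggregate functions, the sketch below computes the same per-minute figures over a handful of in-memory rows in plain Python. It is an illustration of what the query returns, not how ClickHouse executes it: the row layout and the `per_minute_stats` helper are assumptions, a single source-address field stands in for the separate `srcIPv4` and `srcIPv6` columns, and Python sets give exact distinct counts whereas ClickHouse's `uniq` is an approximate count.

```python
import ipaddress
from collections import defaultdict
from datetime import datetime


def per_minute_stats(rows):
    """rows: iterable of (timestamp, source_ip, query_name, query_type)."""
    buckets = defaultdict(
        lambda: {"count": 0, "ip4": set(), "ip6": set(), "qn": set(), "a": 0, "aaaa": 0}
    )
    for ts, src_ip, query_name, query_type in rows:
        # Equivalent of toStartOfMinute(datetime).
        b = buckets[ts.replace(second=0, microsecond=0)]
        b["count"] += 1
        version = ipaddress.ip_address(src_ip).version
        b["ip4" if version == 4 else "ip6"].add(src_ip)
        b["qn"].add(query_name)
        b["a"] += query_type == 1      # countIf(queryType = 1)
        b["aaaa"] += query_type == 28  # countIf(queryType = 28)
    return [
        {
            "t": t,
            "qps": b["count"] / 60,
            "ip4": len(b["ip4"]),
            "ip6": len(b["ip6"]),
            "qn": len(b["qn"]),
            "aCount": b["a"],
            "aaaaCount": b["aaaa"],
        }
        for t, b in sorted(buckets.items())  # ORDER BY t
    ]


if __name__ == "__main__":
    sample = [
        (datetime(2017, 8, 1, 21, 0, 1), "192.0.2.1", "www.example.com.", 1),
        (datetime(2017, 8, 1, 21, 0, 30), "192.0.2.1", "api.example.com.", 28),
        (datetime(2017, 8, 1, 21, 0, 45), "2001:db8::1", "www.example.com.", 1),
        (datetime(2017, 8, 1, 21, 1, 2), "192.0.2.2", "www.example.com.", 1),
    ]
    for row in per_minute_stats(sample):
        print(row)
```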

Thanks!

How Cloudflare analyzes >1m DNS queries per second
Wednesday @ 4.55PM – Swift Suite 2