Transcript of How Cloudflare analyzes >1m DNS queries per second
![Page 1: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/1.jpg)
How Cloudflare analyzes >1m DNS queries per second
Tom Arnfeld (and Marek Vavrusa)
![Page 2: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/2.jpg)
100+ data centers globally
2.5B monthly unique visitors
>10% Internet requests every day
≦3M DNS queries/second
6M+ websites, apps & APIs in 150 countries
5M+ HTTP requests/second
![Page 3: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/3.jpg)
Anatomy of a DNS query
30+ fields

$ dig www.cloudflare.com

; <<>> DiG 9.8.3-P1 <<>> www.cloudflare.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36582
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.cloudflare.com. IN A

;; ANSWER SECTION:
www.cloudflare.com. 5 IN A 198.41.215.162
www.cloudflare.com. 5 IN A 198.41.214.162

;; Query time: 34 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sat Sep 2 10:48:30 2017
;; MSG SIZE rcvd: 68
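Several of the fields in the dig output (id, flags, status, section counts) come from the fixed 12-byte DNS message header defined in RFC 1035. The sketch below decodes that header in Python. It is only an illustration of where those fields live on the wire; the function name and the returned dictionary layout are assumptions, not the log schema used in the talk.

```python
import struct


def decode_dns_header(message: bytes) -> dict:
    """Decode the fixed 12-byte DNS header (RFC 1035, section 4.1.1)."""
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", message[:12]
    )
    return {
        "id": ident,
        "qr": bool(flags & 0x8000),      # 1 = response
        "opcode": (flags >> 11) & 0xF,   # 0 = QUERY
        "aa": bool(flags & 0x0400),
        "tc": bool(flags & 0x0200),
        "rd": bool(flags & 0x0100),
        "ra": bool(flags & 0x0080),
        "rcode": flags & 0xF,            # 0 = NOERROR
        "qdcount": qdcount,
        "ancount": ancount,
        "nscount": nscount,
        "arcount": arcount,
    }


# Header equivalent to the dig output: id 36582, flags qr rd ra,
# QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0.
example = struct.pack("!6H", 36582, 0x8180, 1, 2, 0, 0)
header = decode_dns_header(example)
```

The remaining fields of a log record (query name, type, timings, server, message size) would come from the question section and from the server itself rather than from this header.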
![Page 4: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/4.jpg)
![Page 5: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/5.jpg)
Cloudflare DNS Server
Log Forwarder
HTTP & Other Edge Services
Anycast DNS
Log messages are serialized with Cap’n Proto
Logs from all edge services and all PoPs are shipped over TLS to be processed
Logs are received and de-multiplexed
Logs are written into various Kafka topics
![Page 6: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/6.jpg)
What did we want?
- Multidimensional query analytics
- Complex ad-hoc queries
- Capable of current and expected future scale
- Gracefully handle late arriving log data
- Roll-ups/aggregations for long term storage
- Highly available and replicated architecture
Queries per second: ≦3M
Edge Points of Presence: 100+
Query dimensions: 20+
Years of stored aggregation: 5+
![Page 7: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/7.jpg)
Kafka, Apache Spark and Parquet
Logs are received and de-multiplexed
Logs are written into various Kafka topics
Download and filter data from Kafka using Apache Spark
Converted into Parquet and written to HDFS
- Scanning firehose is slow and adding filters is time consuming
- Offline analysis is difficult with large amounts of data
- Not a fast or friendly user experience
- Doesn’t work for customers
![Page 8: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/8.jpg)
Let’s aggregate everything... with streams
| Timestamp | QName | QType | RCODE |
|---|---|---|---|
| 2017/01/01 01:00:00 | www.cloudflare.com | A | NODATA |
| 2017/01/01 01:00:01 | api.cloudflare.com | AAAA | NOERROR |

| Time Bucket | QName | QType | RCODE | Count | p50 Response Time |
|---|---|---|---|---|---|
| 2017/01/01 01:00 | www.cloudflare.com | A | NODATA | 5 | 0.4876ms |
| 2017/01/01 01:00 | api.cloudflare.com | AAAA | NOERROR | 10 | 0.5231ms |
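The roll-up from raw rows to per-minute buckets can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the streaming job from the talk: the `rollup` function, the tuple layout, and the response times in the sample rows are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def rollup(rows):
    """Group raw rows by (minute, qname, qtype, rcode).

    Returns a dict mapping each group key to its row count and
    median (p50) response time in milliseconds.
    """
    groups = defaultdict(list)
    for ts, qname, qtype, rcode, response_ms in rows:
        parsed = datetime.strptime(ts, "%Y/%m/%d %H:%M:%S")
        bucket = parsed.strftime("%Y/%m/%d %H:%M")  # truncate to the minute
        groups[(bucket, qname, qtype, rcode)].append(response_ms)
    return {
        key: {"count": len(times), "p50_ms": median(times)}
        for key, times in groups.items()
    }


# Hypothetical raw rows; the response times are made up for the example.
raw = [
    ("2017/01/01 01:00:00", "www.cloudflare.com", "A", "NODATA", 0.25),
    ("2017/01/01 01:00:01", "api.cloudflare.com", "AAAA", "NOERROR", 0.5),
    ("2017/01/01 01:00:30", "www.cloudflare.com", "A", "NODATA", 0.75),
]
aggregates = rollup(raw)
```

Each extra dimension in the group key multiplies the number of output rows, which is why aggregating on high-cardinality dimensions such as QName gets expensive.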
![Page 9: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/9.jpg)
Let’s aggregate everything... with streams
- Counters
- Total number of queries
- Query types
- Response codes
- Top-n query names
- Top-n query sources
- Response time/size quantiles
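As a rough illustration of the counter and top-n aggregates listed above, the following Python sketch counts query types and picks the most frequent query names. It uses exact counting with `collections.Counter`, which is an assumption made for brevity; at the query rates in this talk an approximate algorithm would more likely be needed. The sample data is hypothetical.

```python
from collections import Counter


def top_n(values, n):
    """Return the n most frequent values with their exact counts."""
    return Counter(values).most_common(n)


# Hypothetical stream of (qname, qtype) pairs.
queries = [
    ("www.cloudflare.com", "A"),
    ("api.cloudflare.com", "AAAA"),
    ("www.cloudflare.com", "A"),
    ("blog.cloudflare.com", "A"),
    ("www.cloudflare.com", "AAAA"),
    ("api.cloudflare.com", "A"),
]

total_queries = len(queries)
query_types = Counter(qtype for _, qtype in queries)
top_names = top_n((qname for qname, _ in queries), 2)
```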
![Page 10: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/10.jpg)
![Page 11: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/11.jpg)
![Page 12: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/12.jpg)
![Page 13: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/13.jpg)
![Page 14: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/14.jpg)
![Page 15: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/15.jpg)
![Page 16: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/16.jpg)
![Page 17: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/17.jpg)
Aggregating with Spark Streaming
Logs are received and de-multiplexed
Logs are written into various Kafka topics
Produce low cardinality aggregates with Spark Streaming
- Spark experience in-house, though Java/Scala
- Batch-oriented and need a DB to serve online queries
- Difficult to support ad-hoc analysis
- Low resolution aggregates
- Scanning raw data is slow
- Late arriving data
![Page 18: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/18.jpg)
![Page 19: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/19.jpg)
Spark Streaming + CitusDB
Logs are received and de-multiplexed
Logs are written into various Kafka topics
Produce low cardinality aggregates with Spark Streaming
Insert aggregate rows into CitusDB cluster for reads
- Distributed time-series DB
- Existing deployments of CitusDB
- High cardinality aggregations are tricky due to insert performance
- Late arriving data
- SQL API
![Page 20: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/20.jpg)
Apache Flink + (CitusDB?)
Logs are received and de-multiplexed
Logs are written into various Kafka topics
Produce low cardinality aggregates with Flink
Insert aggregate rows into CitusDB cluster for reads
- Dataflow API and support for stream watermarks
- Checkpoint performance issues
- High cardinality aggregations are tricky due to insert performance
- SQL API
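Stream watermarks are what let a streaming job decide when a time window is complete despite late-arriving log data. The sketch below shows the idea in plain Python; it does not use the Flink API. The class name, the one-minute window size, and the fixed allowed lateness are assumptions chosen for the example.

```python
from collections import defaultdict


class MinuteWindows:
    """Count events per one-minute event-time window using a watermark.

    The watermark trails the largest event time seen so far by
    `allowed_lateness_s`. A window is finalized once the watermark
    passes its end; events for finalized windows are counted as late.
    """

    def __init__(self, allowed_lateness_s=60):
        self.allowed_lateness_s = allowed_lateness_s
        self.max_event_time = 0
        self.open_windows = defaultdict(int)
        self.closed = {}
        self.dropped_late = 0

    def watermark(self):
        return self.max_event_time - self.allowed_lateness_s

    def add(self, event_time_s):
        window = event_time_s - event_time_s % 60
        if window + 60 <= self.watermark():
            self.dropped_late += 1  # window already finalized
            return
        self.open_windows[window] += 1
        self.max_event_time = max(self.max_event_time, event_time_s)
        # Finalize every open window whose end is behind the watermark.
        for start in sorted(self.open_windows):
            if start + 60 <= self.watermark():
                self.closed[start] = self.open_windows.pop(start)


windows = MinuteWindows(allowed_lateness_s=60)
for t in [10, 70, 50, 130, 20]:  # event times in seconds, out of order
    windows.add(t)
```

Here the event at t=50 arrives out of order but is still counted, while the event at t=20 arrives after its window was finalized and is dropped. A larger allowed lateness loses fewer events but delays results.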
![Page 21: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/21.jpg)
Druid
Logs are received and de-multiplexed
Logs are written into various Kafka topics
Insert into a cluster of Druid nodes
- Insertion rate couldn’t keep up in our initial tests
- Estimated costs of a suitable cluster were too high
- Seemed performant for random reads, but not the best we’d seen
- Operational complexity seemed high
![Page 22: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/22.jpg)
Let’s aggregate everything... with streams
| Timestamp | QName | QType | RCODE |
|---|---|---|---|
| 2017/01/01 01:00:00 | www.cloudflare.com | A | NODATA |
| 2017/01/01 01:00:01 | api.cloudflare.com | AAAA | NOERROR |

| Time Bucket | QName | QType | RCODE | Count | p50 Response Time |
|---|---|---|---|---|---|
| 2017/01/01 01:00 | www.cloudflare.com | A | NODATA | 5 | 0.4876ms |
| 2017/01/01 01:00 | api.cloudflare.com | AAAA | NOERROR | 10 | 0.5231ms |
- Raw data isn’t easily queried ad-hoc
- Backfilling new aggregates is impossible or can be very difficult without custom tools
- A stream can’t serve actual queries
- Can be costly for high cardinality dimensions
*https://clickhouse.yandex/docs/en/introduction/what_is_clickhouse.html
![Page 23: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/23.jpg)
ClickHouse
- Tabular, column-oriented data store
- Single binary, clustered architecture
- Familiar SQL query interface
- Lots of very useful built-in aggregation functions
- Raw log data stored for 3 months (~7 trillion rows)
- Aggregated data stored for ∞ (1m, 1h aggregations across 3 dimensions)
![Page 24: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/24.jpg)
Cloudflare DNS Server
Log Forwarder
HTTP & Other Edge Services
Anycast DNS
Log messages are serialized with Cap’n Proto
Logs from all edge services and all PoPs are shipped over TLS to be processed
Logs are received and de-multiplexed
Logs are written into various Kafka topics
Go Inserters write the data in parallel
Multi-tenant ClickHouse cluster stores data
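The inserters in the talk are written in Go; the Python sketch below only illustrates the general shape of writing in parallel. It assumes each worker takes one partition of rows and sends it in fixed-size batches. The batching, the function names, and the `insert_batch` callback are assumptions for the example, not details given in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def insert_partition(rows, insert_batch, batch_size):
    """Send one partition's rows in fixed-size batches; return rows sent."""
    sent = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        insert_batch(batch)
        sent += len(batch)
    return sent


def insert_parallel(partitions, insert_batch, batch_size=1000, workers=4):
    """Insert every partition concurrently; return the total rows sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            lambda rows: insert_partition(rows, insert_batch, batch_size),
            partitions,
        )
        return sum(counts)


# Stand-in for a real database write: collect the batches in a list.
batches = []
partitions = [
    [{"qname": "www.cloudflare.com", "seq": i} for i in range(5)],
    [{"qname": "api.cloudflare.com", "seq": i} for i in range(3)],
]
total = insert_parallel(partitions, batches.append, batch_size=2)
```

In a real inserter the callback would issue an INSERT against the cluster, and it would need to be safe to call from several threads at once.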
![Page 25: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/25.jpg)
Initial table design
ClickHouse Cluster
TinyLog: dnslogs_2016_01_01_14_30_pN
ReplicatedMergeTree: dnslogs_2016_01_01
ReplicatedMergeTree: dnslogs_2016_01
ReplicatedMergeTree: dnslogs_2016
- Raw logs are inserted into sharded tables
- Sidecar processes aggregate data into day/month/year tables
![Page 26: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/26.jpg)
First attempt in prod.
ClickHouse Cluster
ReplicatedMergeTree: r{0,2}.dnslogs
- Raw logs are inserted into one replicated, sharded table
- Multiple r{0,2} databases to better pack the cluster with shards and replicas
![Page 27: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/27.jpg)
Speeding up typical queries
- SUM() and COUNT() over a few low-cardinality dimensions
- Global overview (trends and monitoring)
- Storing intermediate state for non-additive functions
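The last bullet matters because non-additive results cannot be combined after the fact: averaging two per-minute averages does not give the average over both minutes. Storing a mergeable intermediate state avoids this. The sketch below shows the idea for a mean, using a (sum, count) pair as the state. The function names are illustrative assumptions, and ClickHouse's own aggregate states are more general than this.

```python
def avg_state(values):
    """Intermediate state for a mean: (sum, count), not the mean itself."""
    return (sum(values), len(values))


def merge_states(a, b):
    """Combine two partial states; exact because sums and counts add up."""
    return (a[0] + b[0], a[1] + b[1])


def finalize(state):
    total, count = state
    return total / count


# Hypothetical response times (ms) from two separate time buckets.
minute_1 = [2.0, 2.0, 2.0, 2.0]
minute_2 = [12.0]

# Merging states gives the true mean over all five values.
merged = finalize(merge_states(avg_state(minute_1), avg_state(minute_2)))

# Averaging the finished per-bucket means gives a different, wrong answer.
naive = (finalize(avg_state(minute_1)) + finalize(avg_state(minute_2))) / 2
```

Quantiles and distinct counts need larger states than a pair of numbers, but the same pattern applies: store the state in the roll-up and finalize only at query time.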
![Page 28: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/28.jpg)
Today...
ClickHouse Cluster
ReplicatedMergeTree: r{0,2}.dnslogs
ReplicatedAggregatingMergeTree: dnslogs_rollup_X
- Raw logs are inserted into one replicated, sharded table
- Multiple r{0,2} databases to better pack the cluster with shards and replicas
- Aggregate tables for long-term storage
![Page 29: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/29.jpg)
October 2016: Began evaluating technologies and architecture, 1 instance in Docker
Finalized schema, deployed a production ClickHouse cluster of 6 nodes
November 2016: Prototype ClickHouse cluster with 3 nodes, inserting a sample of data
August 2017: Migrated to a new cluster with multi-tenancy
Growing interest among other Cloudflare engineering teams, worked on standard tooling
December 2016: ClickHouse visualisations with Superset and Grafana
Spring 2017: TopN, IP prefix matching, Go native driver, Analytics library, pkey in monotonic functions
![Page 30: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/30.jpg)
Multi-tenant ClickHouse cluster
Row Insertion/s: 8M+
Raid-0 Spinning Disks: 2PB+
Insertion Throughput/s: 4GB+
Nodes: 33
![Page 31: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/31.jpg)
ClickHouse Today… 12 Trillion Rows

SELECT table, sum(rows) AS total
FROM system.cluster_parts
WHERE database = 'r0'
GROUP BY table
ORDER BY total DESC

┌─table──────────────────────────────┬─────────────total─┐
│ ███████████████ │ 9,051,633,001,267 │
│ ████████████████████ │ 2,088,851,716,078 │
│ ███████████████████ │ 847,768,860,981 │
│ ██████████████████████ │ 259,486,159,236 │
│ … │ … │
![Page 32: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/32.jpg)
Contributions to ClickHouse
- TopK(n) Aggregates: https://github.com/yandex/ClickHouse/pull/754
- TrieDictionaries (IP Prefix): https://github.com/yandex/ClickHouse/pull/785
- SpaceSaving: internal storage for StringRef{}: https://github.com/yandex/ClickHouse/pull/925
- Bug fixes to the Go native driver: https://github.com/kshvakov/clickhouse
- sumMap(key, value): https://github.com/yandex/ClickHouse/pull/1250
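To give a feel for what sumMap(key, value) computes, here is a small Python emulation. It assumes the commonly documented behaviour: each row carries an array of keys and a parallel array of values, values are summed per key across rows, and the result is a pair of arrays with keys in sorted order. The function name and sample data are illustrative, and this is not the ClickHouse implementation.

```python
from collections import defaultdict


def sum_map(rows):
    """Sum values per key across rows of (keys, values) array pairs.

    Returns (sorted_keys, summed_values), two parallel lists.
    """
    totals = defaultdict(int)
    for keys, values in rows:
        for key, value in zip(keys, values):
            totals[key] += value
    ordered = sorted(totals)
    return ordered, [totals[key] for key in ordered]


# Hypothetical per-bucket query-type counts.
rows = [
    (["A", "AAAA"], [10, 5]),
    (["A", "MX"], [3, 1]),
]
keys, values = sum_map(rows)
```

This shape is handy for roll-ups: a per-minute map of query type to count can be merged into hourly or daily maps without a separate column per key.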
![Page 33: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/33.jpg)
Other Contributions
- Grafana Plugin: https://github.com/vavrusa/grafana-sqldb-datasource (see also https://github.com/Vertamedia/clickhouse-grafana)
- SQLAlchemy (Superset): https://github.com/cloudflare/sqlalchemy-clickhouse
![Page 34: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/34.jpg)
Python w/ Jupyter Notebooks

# Assumption: timer is timeit's default_timer; the slide omits this import.
from timeit import default_timer as timer

import pandas as pd
import requests


def ch(q, host='127.0.0.1', port=9001):
    # Run query q over ClickHouse's HTTP interface and load the result.
    start = timer()
    r = requests.get(
        'https://%s:%d/' % (host, port),
        params={'user': 'xxx',
                'query': q + '\nFORMAT TabSeparatedWithNames'},
        stream=True)
    end = timer()

    if not r.ok:
        raise RuntimeError(r.text)

    print('Query finished in %.02fs' % (end - start))
    return pd.read_csv(r.raw, sep="\t")
![Page 35: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/35.jpg)
![Page 36: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/36.jpg)
Python w/ Jupyter Notebooks
![Page 37: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/37.jpg)
Python w/ Jupyter Notebooks
![Page 38: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/38.jpg)
blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second
Check it
![Page 39: per second How Cloudflare analyzes >1m DNS queries · 100+ Data centers globally 2.5B Monthly unique visitors >10% Internet requests everyday ≦3M DNS queries/second websites, apps](https://reader034.fdocuments.net/reader034/viewer/2022042308/5ed4a0a9b072525a841adcef/html5/thumbnails/39.jpg)
Thanks!
@tarnfeld @vavrusam
https://cloudflare.com/careers/departments/engineering