Counting is Hard: Probabilistically Counting Views at Reddit
Krishnan Chandra, Data Engineer
Overview
● What is probabilistic counting?
● How did probabilistic counting help us scale?
● What issues did we face along the way?
What is Reddit?
Reddit is the front page of the internet
A social network where there are tens of thousands of communities around whatever passions or interests you might have
It’s where people converse about the things that are most important to them
Reddit by the numbers
Alexa Rank (US/World): 4th/7th
MAU: 330M+
Active Communities: 138K+
Posts per month: 10.7M
Screenviews per month: 14B
Counting Views
Why Count Views?
● Includes logged-out users
● Better measure of reach than votes
● Currently exposed to moderators and content creators
[Images: "Cat Walking a Human", "Cat Fist Bumping"]
Why is Counting Hard?
Product Requirements
● Counts are over the life of a post
● The same user should not count multiple times within a short time frame
● Should build in some protections against spamming/cheating (similar to votes)
● Should provide (near) real-time feedback
Exact vs. Approximate Counting
● Exact counting:
    ○ Requires storing state per user per post
● Approximate counting:
    ○ Requires much less state and storage
    ○ Provides an estimate of reach within a few percentage points of the exact number
HyperLogLog (And Friends)
● HyperLogLog (HLL)
    ○ Hash-based probabilistic algorithm published in 2007
    ○ Approximates set cardinality
    ○ Works well for large cardinalities, but not for small ones
● HyperLogLog++
    ○ Introduced by Google in 2013
    ○ Uses sparse and dense HLL representations
    ○ Starts with the sparse representation and switches over to dense HLL once needed
How does HLL work?
● Hash table consisting of m registers or buckets, each of width k bits
● Hash the input value, and split the hash value into 2 portions
● First portion (log2(m) bits) used to index to a register
● Second portion used to count the number of leading zeros; the register is set to that count plus one if it is larger than the value already stored
Assume: m=8 registers, k=3 bits
Input hash: 1 1 1 0 0 0 1 1
First 3 bits (1 1 1) select Register# 7
Remaining bits (0 0 0 1 1) have 3 leading zeroes
Record 3+1=4 into Register# 7
r0 to r6: (empty)
r7: 1 0 0
Adapted from HyperLogLog - A Layman’s Overview
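The register update described on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the Redis implementation: the function names `update_register` and `add_value` are made up for this example, and SHA-256 is an arbitrary stand-in for whatever hash function a real HLL uses. The register width k is not enforced here.

```python
import hashlib


def update_register(registers, hash_value, hash_bits):
    """Apply one already-hashed input to the register list, in place."""
    m = len(registers)
    index_bits = m.bit_length() - 1      # log2(m); m is assumed to be a power of two
    rest_bits = hash_bits - index_bits

    index = hash_value >> rest_bits                 # first portion: register number
    rest = hash_value & ((1 << rest_bits) - 1)      # second portion: remaining bits

    leading_zeros = rest_bits - rest.bit_length()
    rank = leading_zeros + 1
    # Keep the largest rank ever seen for this register.
    registers[index] = max(registers[index], rank)
    return index, rank


def add_value(registers, value, hash_bits=64):
    """Hash a string (SHA-256 is an assumption of this sketch) and record it."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[: hash_bits // 8], "big")
    return update_register(registers, hash_value, hash_bits)


# The slide's example: m=8 registers, 8-bit hash 1 1 1 0 0 0 1 1.
registers = [0] * 8
print(update_register(registers, 0b11100011, hash_bits=8))  # (7, 4)
print(registers)  # [0, 0, 0, 0, 0, 0, 0, 4]
```

Because each register only ever keeps a maximum, feeding the same input twice leaves the registers unchanged.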
Computing Cardinality
● Estimate of cardinality is computed by taking the harmonic mean of 2^(register value) over all registers, then scaling by the number of registers and a bias-correction constant
● Intuition: HLL is like flipping a coin!
● Largest run of heads gives an estimate of total number of flips
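As a rough sketch of the estimate step, the function below follows the raw estimator from the original HyperLogLog paper, E = alpha_m * m^2 / sum(2^-register), plus the paper's small-range (linear counting) correction. The names `alpha` and `estimate_cardinality` are chosen for this example only, and the large-range correction is left out for brevity. The paper tabulates the constant only for m = 16, 32, 64 and gives a formula for m >= 128, so the m=8 toy example from the earlier slide is not covered.

```python
import math


def alpha(m):
    """Bias-correction constant from the HyperLogLog paper."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    if m >= 128:
        return 0.7213 / (1 + 1.079 / m)
    raise ValueError("this sketch only supports m = 16, 32, 64 or m >= 128")


def estimate_cardinality(registers):
    """Estimate the number of distinct inputs from a list of register values."""
    m = len(registers)
    # m / harmonic_sum is the harmonic mean of 2**register.
    harmonic_sum = sum(2.0 ** -r for r in registers)
    raw = alpha(m) * m * m / harmonic_sum

    # Small-range correction: fall back to linear counting on empty registers.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)
    return raw


print(estimate_cardinality([0] * 16))  # 0.0
```

The correction branch is one reason plain HLL is described as weak for small cardinalities: without it, an empty 16-register sketch would report roughly 10.8 items.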
Counting Error
● HLL standard error
    ○ Number of registers/hash buckets m
    ○ Standard error = 1.04/sqrt(m)
    ○ Using Redis’s HLL implementation, standard error is 0.81%!
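The 0.81% figure follows directly from the formula, given that Redis uses 16384 registers per HLL. A small sketch of the arithmetic, with helper names invented for this example:

```python
import math


def hll_standard_error(m):
    """Relative standard error of an HLL with m registers."""
    return 1.04 / math.sqrt(m)


def registers_for_error(target_error):
    """Smallest power-of-two register count whose error is at or below the target."""
    needed = (1.04 / target_error) ** 2
    return 1 << math.ceil(math.log2(needed))


print(f"{hll_standard_error(16384):.4%}")  # 0.8125%
print(registers_for_error(0.01))           # 16384
```

Halving the error requires four times as many registers, since the error shrinks with the square root of m.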
Using HLL to Count Views
● 1 HLL per post
● HLL inserts are idempotent!
    ○ Allows reprocessing data if needed
● How to manage de-duping over short time window?
    ○ Store user + truncated timestamp as the value
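One way to read "user + truncated timestamp" is to round the view time down to a fixed window and insert the pair into the post's HLL. The sketch below assumes that reading; the function name `view_element`, the 10-minute window, and the `user:bucket` string format are all illustrative choices, not details from the talk. A Python set stands in for the HLL, since both ignore repeated inserts.

```python
def view_element(user_id, timestamp, window_seconds=600):
    """Build the value inserted into a post's HLL for one view.

    window_seconds is a hypothetical de-duplication window.
    """
    bucket = int(timestamp) // window_seconds   # truncate the timestamp
    return f"{user_id}:{bucket}"


post_views = set()   # stand-in for the per-post HLL
for ts in (1000, 1100, 1300):
    post_views.add(view_element("u1", ts))

print(len(post_views))  # 2
```

The views at 1000 and 1100 fall in the same window and count once; the view at 1300 lands in the next window and counts again. With fixed windows, two views a few seconds apart can still straddle a boundary and count twice.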
Space Usage
● Exact counting:
    ○ User ID = 8-byte long
    ○ ~1.5M users * 8 bytes = 12 MB
● HLL (Redis implementation)
    ○ Max size = 12 KB
    ○ 0.1% of the exact counting storage
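The slide's numbers can be reproduced with a few lines of arithmetic. The 12 KB figure corresponds to Redis's dense representation of 16384 registers at 6 bits each; the variable names below are only for this example.

```python
USERS = 1_500_000
USER_ID_BYTES = 8
exact_bytes = USERS * USER_ID_BYTES      # one stored ID per viewer

REGISTERS = 16384
REGISTER_BITS = 6
hll_bytes = REGISTERS * REGISTER_BITS // 8   # dense HLL payload

ratio = hll_bytes / exact_bytes

print(exact_bytes)     # 12000000
print(hll_bytes)       # 12288
print(f"{ratio:.4%}")  # 0.1024%
```

The HLL size stays fixed no matter how many viewers a post has, while the exact set keeps growing with each new user.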
Counting Architecture
Architecture Goals
1. Consume a stream of view events and filter out spam/bad events
2. For good events, insert into an HLL in real time
3. Allow clients to consume views values in real time
[Architecture diagram. Labels: Client Side Events, Server Side Events, App Servers, Anti-Spam, Counting]
Stream Processing Infrastructure
● Kafka
    ○ Main message bus for view events
● Redis
    ○ Used for storing state + HLLs
    ○ Intended as short term storage
    ○ Functions as a cache for Cassandra
● Cassandra
    ○ Used to store the final counts and HLLs in separate column families
    ○ Intended as long term storage
Counting Application (Part 1)
● Anti-Spam Consumer
    ○ Consumes the stream of views from Kafka
    ○ Basic rules engine backed by Redis
    ○ Consumer outputs a decision to a Kafka topic
Counting Application (Part 2)
● Counting Consumer
    ○ Consumes the decisions topic output by the anti-spam consumer
    ○ Creates/updates the HLL for the post in Redis
    ○ Writes both the count and the HLL filter out to Cassandra
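The two-stage flow from Parts 1 and 2 can be sketched with in-memory stand-ins. Everything concrete here is an assumption made for illustration: plain lists replace the Kafka topics, a `defaultdict` replaces the Redis-backed rules state, a Python set replaces the per-post HLL, a dict replaces Cassandra, and the single rule (at most 3 views per user per post) is invented. The talk does not describe the actual rules.

```python
from collections import defaultdict

MAX_VIEWS_PER_USER_PER_POST = 3   # hypothetical anti-spam rule


def anti_spam_consumer(view_events, rule_state):
    """Stage 1: read raw view events and emit one decision per event."""
    decisions = []
    for event in view_events:
        key = (event["user"], event["post"])
        rule_state[key] += 1
        accepted = rule_state[key] <= MAX_VIEWS_PER_USER_PER_POST
        decisions.append({**event, "accepted": accepted})
    return decisions


def counting_consumer(decisions, sketches, stored_counts):
    """Stage 2: update the per-post sketch and persist the current count."""
    for decision in decisions:
        if not decision["accepted"]:
            continue
        sketch = sketches[decision["post"]]      # set stands in for a Redis HLL
        sketch.add(decision["user"])
        stored_counts[decision["post"]] = len(sketch)   # dict stands in for Cassandra


events = [{"user": "a", "post": "p1"}] * 5 + [{"user": "b", "post": "p1"}]
rule_state = defaultdict(int)
sketches = defaultdict(set)
stored_counts = {}

decisions = anti_spam_consumer(events, rule_state)
counting_consumer(decisions, sketches, stored_counts)
print(stored_counts)  # {'p1': 2}
```

Keeping the decision stream separate from the counting step means the counting consumer never needs to know why an event was rejected.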
Scaling Challenges
Redis
● Problems
    ○ Rules engine is very memory heavy
    ○ HLL counting is very CPU-heavy
    ○ Rules engine data is generally time-bound with expiry
    ○ HLL data should be kept in Redis as long as possible to avoid reading from Cassandra
● Solutions
    ○ Separate Redis instances for the 2 parts of the application
    ○ Different instance types to reflect the different workloads
    ○ allkeys-lru eviction policy on HLLs, volatile-ttl eviction policy on the rules engine
Cassandra
● Problems
    ○ 1 row per post - overwritten frequently
    ○ Read rate on page loads overwhelming the cluster
    ○ Issues with load when “catching up”
    ○ Storage grows forever with the number of posts!
● Solutions
    ○ Updates to the same row in Cassandra throttled to every 10 seconds
    ○ Read caching
    ○ Slow the update rate when catching up
    ○ More disk!
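A per-row write throttle like the one in the first bullet could look roughly like the sketch below. The class name `WriteThrottle` and its interface are hypothetical. The clock is injectable so the behaviour can be checked without waiting 10 real seconds.

```python
import time


class WriteThrottle:
    """Allow at most one write per key every min_interval seconds."""

    def __init__(self, min_interval=10.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last_write = {}   # key -> time of the last allowed write

    def should_write(self, key):
        now = self.clock()
        last = self.last_write.get(key)
        if last is not None and now - last < self.min_interval:
            return False       # too soon: skip this update
        self.last_write[key] = now
        return True


# Usage with a fake clock.
fake_now = [0.0]
throttle = WriteThrottle(min_interval=10.0, clock=lambda: fake_now[0])

print(throttle.should_write("post:1"))  # True
fake_now[0] = 5.0
print(throttle.should_write("post:1"))  # False
fake_now[0] = 10.0
print(throttle.should_write("post:1"))  # True
```

A real system would also need to flush the latest skipped value eventually, so that a post's final count is not lost when its traffic stops inside a throttle window.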
Observations
● Views on Reddit skew towards newer posts
    ○ Allows most views to be served by Redis
    ○ Keeps read rate on Cassandra very low
Takeaways
● Thanks to HLLs, counting views became much more efficient
    ○ Current storage usage is ~1TB for a full year of posts!
● Delivery was possible in a quarter with an engineering team of 3 (not always full time)
Thanks to our team!
● /u/gooeyblob - Cassandra + Backend
● /u/d3fect - Backend + API
● /u/powerlanguage - Product Management
Thanks!
Krishnan Chandra
[email protected]/shrink_and_an_arch
PS: We’re hiring! http://reddit.com/jobs
References
● View Counting at Reddit (Blog Post from 2017)
● Original HyperLogLog paper
● Redis blog announcing HLL support
● Google paper announcing HLL++ algorithm
● HyperLogLog - A Layman’s Overview