Page 1: Harvard University1, Peking University , Yale University ...minlanyu.seas.harvard.edu/talk/sigmod18.pdf · Making Data Sketches Accurate and Fast by Filtering the Cold and Aggregating

Making Data Sketches Accurate and Fast by Filtering the Cold and Aggregating Items

Yang Zhou1,2, Tong Yang2, Jie Jiang2, Bin Cui2, Omid Alipourfard3, Minlan Yu1, Xiaoming Li2, Steve Uhlig4

Harvard University1, Peking University2, Yale University3, Queen Mary University of London4

Data Streams are Pervasive

Network traffic, video streaming, sensor data, web click data, etc.

In many applications, some statistical information is needed!

Applications: network measurement, DBMS optimization, search engine design, security, etc.

Information required: flow size, heavy hitters, heavy changes, quantiles, etc.

Accurate and Fast Data Stream Analysis is Challenging

Challenges:

1. Memory constraint
● Fit into cache to boost speed
● Hardware on-chip memory is limited

2. Single-pass requirement
● Data arrives in huge volume and at high speed: dumping it to disk is hard
● Some applications need online analysis

Exact statistics (e.g., by using hash tables) are difficult to obtain (and often unnecessary)!

Data Sketches can Help

Tasks: Data Sketch Algorithms

Frequency estimation: Count-Min, CM-CU, Count, ASketch
Top-k hot items: Count-Min, CM-CU, Space-Saving, ASketch, FlowRadar, UnivMon
Heavy changes: RevSketch, FlowRadar, UnivMon, Space-Saving
Superspreader / DDoS detection: TwoLevel
Frequency distribution: MRAC, FlowRadar
Cardinality: FM, LC, UnivMon
Entropy: FlowRadar, UnivMon

Count-Min Sketch — Estimating Frequencies

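Insertion: the item is hashed to one counter per array, and each of those counters gets +1 (four arrays in the example).

Query: the estimated frequency is the minimum of the item's hashed counters, e.g., 18 = Min{19, 24, 26, 18}.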
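The insertion and query steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name, the `width` and `depth` defaults, and the use of salted BLAKE2b as the per-array hash are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """depth counter arrays of width counters each, one hash per array."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Assumed hashing scheme: salt the item with the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def insert(self, item, count=1):
        # Insertion: add to the one hashed counter in every array.
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def query(self, item):
        # Query: minimum of the hashed counters. Collisions can only
        # inflate a counter, so this never underestimates.
        return min(self.rows[row][self._index(item, row)]
                   for row in range(self.depth))


cms = CountMinSketch()
for _ in range(18):
    cms.insert("flow-A")
for i in range(200):
    cms.insert(f"other-{i}")
estimate = cms.query("flow-A")  # at least 18; larger only if every array collides
```

Every counter here has the same size regardless of whether the item is cold or hot, which is the memory concern raised later in the talk.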

Space-Saving — Finding Top-k Hot Items

● Maintains a heap-like data structure of item:count cells.
● If Space-Saving is full, the smallest item is replaced by the new item, whose frequency is initialized to fmin+1.
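The replacement rule can be sketched as follows. This is a minimal illustration under stated assumptions: a plain dictionary with a linear minimum search stands in for the heap-like structure on the slide, and the class and method names are made up for the example.

```python
class SpaceSaving:
    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # item -> estimated frequency

    def insert(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Full: replace the smallest item and start the newcomer
            # at fmin + 1, as described on the slide.
            victim = min(self.counts, key=self.counts.get)
            fmin = self.counts.pop(victim)
            self.counts[item] = fmin + 1

    def top_k(self, k):
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


ss = SpaceSaving(capacity=3)
for item in ["a", "a", "a", "b", "b", "c", "d"]:
    ss.insert(item)
# "d" arrived once but replaced "c" (count 1), so it is recorded as 1 + 1 = 2.
```

The last insertion shows how a single cold item both evicts another entry and receives an inflated count.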

Limitations of Conventional Data Sketches

Diagram: Cold & hot items -> Sketch

Real data streams are highly skewed: the majority of items are cold, and only a minority are hot.

Count-Min: all items use large counters -> a waste of memory

Space-Saving: a great many replacements caused by cold items are unnecessary -> poor accuracy

Methodology of Cold Filter*

Diagram: Cold & hot items -> CF (Cold Filter); cold items stay in CF, hot items -> Sketch

Count-Min:
● Use small counters in CF -> record cold items
● Use large counters in sketch -> record hot items

Space-Saving: CF filters many cold items -> reduces the number of unnecessary replacements

*Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing. Yang Zhou, Tong Yang, Jie Jiang, Bin Cui, Minlan Yu, Xiaoming Li and Steve Uhlig. SIGMOD. Jun. 2018
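The filtering idea can be sketched with a single layer of small counters placed in front of a sketch. This is a simplified illustration only, and the design in the cited paper may differ: the one-layer layout, the threshold of 15 (the cap of a hypothetical 4-bit counter), the hashing scheme, and the use of a `Counter` as a stand-in for the downstream sketch are all assumptions.

```python
import hashlib
from collections import Counter


class ColdFilterSketch:
    def __init__(self, width=4096, hashes=3, threshold=15):
        self.width = width
        self.hashes = hashes
        self.threshold = threshold   # assumed cap of a small counter
        self.small = [0] * width     # small counters that absorb cold items
        self.hot = Counter()         # stand-in for the downstream sketch

    def _cells(self, item):
        for i in range(self.hashes):
            d = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(d, "big") % self.width

    def insert(self, item):
        cells = list(self._cells(item))
        low = min(self.small[c] for c in cells)
        if low < self.threshold:
            # Still cold: raise only the smallest of its counters.
            for c in cells:
                if self.small[c] == low:
                    self.small[c] += 1
        else:
            # All of its counters are saturated: forward it as hot.
            self.hot[item] += 1

    def query(self, item):
        low = min(self.small[c] for c in self._cells(item))
        return low if low < self.threshold else low + self.hot[item]


cf = ColdFilterSketch()
for _ in range(100):
    cf.insert("hot")
for i in range(500):
    cf.insert(f"cold-{i}")
# The first 15 occurrences of "hot" are absorbed by the filter;
# the remaining 85 reach the downstream structure.
```

In this toy run only the hot item is expected to pass the filter, so the downstream structure sees far fewer insertions than the stream contains.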

Agg-Evict: Optimizing Speed

Without aggregation: 8 insertions into Count-Min

With an Aggregator in front: 3 insertions into Count-Min

Ideally, 8/3 = 2.67x speed-up

-> How to design an efficient Aggregator?
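The ideal speed-up on this slide is just the ratio of arriving items to distinct keys after aggregation. A small worked example, with a made-up batch of 8 items over 3 keys:

```python
from collections import Counter

# Hypothetical batch: 8 arriving items, 3 distinct keys.
batch = ["a", "b", "a", "c", "a", "b", "a", "a"]

# Aggregating first means one sketch update per distinct key,
# each carrying the accumulated count.
aggregated = Counter(batch)

speedup = len(batch) / len(aggregated)  # 8 / 3
```

The ratio is an upper bound: it ignores the cost of the aggregation step itself, which is why the design of the Aggregator matters.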

Design of Agg-Evict*

1. Using SIMD to query contiguous cells in a K-V pair array
2. Using random eviction for simplicity and speed

*Accelerating Network Measurement in Software. Yang Zhou, Omid Alipourfard, Minlan Yu and Tong Yang. ACM SIGCOMM Computer Communication Review. 2018
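The two design points can be sketched in plain Python. This is an illustration under stated assumptions, not the published code: a linear scan over a small bucket stands in for the SIMD comparison, and the class name, bucket sizes, and the `sketch_insert(key, count)` callback are hypothetical.

```python
import random
from collections import Counter


class Aggregator:
    def __init__(self, sketch_insert, buckets=64, cells_per_bucket=4, seed=0):
        self.sketch_insert = sketch_insert           # called as (key, count)
        self.buckets = [[] for _ in range(buckets)]  # each cell is [key, count]
        self.cells_per_bucket = cells_per_bucket
        self.rng = random.Random(seed)

    def insert(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for cell in bucket:  # linear scan stands in for the SIMD compare
            if cell[0] == key:
                cell[1] += 1
                return
        if len(bucket) == self.cells_per_bucket:
            # Bucket full: evict a randomly chosen cell into the sketch.
            victim = bucket.pop(self.rng.randrange(len(bucket)))
            self.sketch_insert(victim[0], victim[1])
        bucket.append([key, 1])

    def flush(self):
        # Push every remaining aggregated count into the sketch.
        for bucket in self.buckets:
            for key, count in bucket:
                self.sketch_insert(key, count)
            bucket.clear()


totals = Counter()
calls = []


def sketch_insert(key, count):
    calls.append((key, count))
    totals[key] += count


agg = Aggregator(sketch_insert)
stream = ["a", "b", "a", "c", "a", "b", "a", "a"]
for key in stream:
    agg.insert(key)
agg.flush()
```

Random eviction needs no per-cell bookkeeping such as timestamps or frequency ordering, which keeps the hot path short; the trade-off is that a frequently seen key may occasionally be evicted early and re-inserted.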

Accuracy Improvement

All algorithms use the same memory size.

Figure: Frequency estimation, varying the CF size (y-axis: Average Absolute Error)

Figure: Finding top-k hot items, varying k

Speed Improvement

Conclusion

Cold Filter: Improving accuracy by filtering the cold

Agg-Evict: Improving speed by aggregating items

Generic: Applicable to many different data sketches

Thanks!

Source Code:

https://github.com/zhouyangpkuer/ColdFilter

https://github.com/zhouyangpkuer/Agg-Evict