Monoids monoids everywhere
Tetra Data Blitz, 10/1/2015
Monoids Monoids Everywhere
in ~5 minutes
Kevin Faro
http://s2.quickmeme.com/img/44/44b0bd758f8ee5c81362923f0d5c8e017c9ddf623925e60c29a4c015b89fbb45.jpg
Oh, that wasn’t clear enough?
A set with a binary operation ● is a monoid if the operation:
1. is associative
   a. (a●b)●c = a●(b●c)
2. has an identity element e
   a. e●a = a●e = a
Examples
● Addition
  ○ associative: (1+2)+3 = 1+(2+3) = 6
  ○ identity: 0+1 = 1+0 = 1
● Multiplication
  ○ associative: (1*2)*3 = 1*(2*3) = 6
  ○ identity: 1*2 = 2*1 = 2
● Min
  ○ you get the idea ...
● Max
● Set Union
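To make the two laws concrete, here is a minimal hand-rolled sketch in plain Scala. It is not Algebird's API: the `Monoid` trait, the instance names, and the helpers `combineAll`, `isAssociative`, and `hasIdentity` are all defined locally for illustration. Min and Max are given identities by assuming bounded `Int` values.

```scala
// A minimal, hand-rolled monoid sketch (illustrative; not Algebird's API).
object MonoidSketch {
  trait Monoid[A] {
    def zero: A               // identity element e
    def plus(a: A, b: A): A   // the associative operation
  }

  val intAddition: Monoid[Int] = new Monoid[Int] {
    val zero: Int = 0
    def plus(a: Int, b: Int): Int = a + b
  }

  val intMultiplication: Monoid[Int] = new Monoid[Int] {
    val zero: Int = 1
    def plus(a: Int, b: Int): Int = a * b
  }

  // Assumption: values are bounded Ints, so Int.MaxValue acts as the identity for min.
  val intMin: Monoid[Int] = new Monoid[Int] {
    val zero: Int = Int.MaxValue
    def plus(a: Int, b: Int): Int = math.min(a, b)
  }

  // Assumption: values are bounded Ints, so Int.MinValue acts as the identity for max.
  val intMax: Monoid[Int] = new Monoid[Int] {
    val zero: Int = Int.MinValue
    def plus(a: Int, b: Int): Int = math.max(a, b)
  }

  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    val zero: Set[A] = Set.empty[A]
    def plus(a: Set[A], b: Set[A]): Set[A] = a union b
  }

  // Fold a whole sequence using the monoid; an empty sequence yields the identity.
  def combineAll[A](xs: Seq[A])(m: Monoid[A]): A = xs.foldLeft(m.zero)(m.plus)

  // Spot checks of the two laws for specific values (not a proof).
  def isAssociative[A](a: A, b: A, c: A)(m: Monoid[A]): Boolean =
    m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))

  def hasIdentity[A](a: A)(m: Monoid[A]): Boolean =
    m.plus(m.zero, a) == a && m.plus(a, m.zero) == a
}
```

For example, `MonoidSketch.combineAll(Seq(1, 2, 3))(MonoidSketch.intAddition)` folds to 6, matching the addition slide.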
Let’s take a look at algebird
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
https://izbicki.me/img/uploads/2013/05/fry-300x225.jpg
Why is this so awesome?!?!
● Divide and Conquer
● Parallelization
● Incrementalism
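A rough sketch of why associativity plus an identity buys these three properties: the input can be split into chunks, each chunk folded on its own (potentially on a different worker), and the partial results combined afterwards, with the same answer as one sequential fold. The names `chunkedCombine` and `update` are hypothetical helpers written for this illustration, and the chunks here are folded sequentially rather than on real workers.

```scala
// Illustrative only: chunks are folded one after another here, but because the
// operation is associative each chunk could be handled by a separate worker.
object ChunkedCombine {
  // Divide and conquer: fold every chunk, then fold the partial results.
  // Assumes chunkSize > 0.
  def chunkedCombine[A](xs: Seq[A], chunkSize: Int)(zero: A)(plus: (A, A) => A): A =
    xs.grouped(chunkSize)
      .map(chunk => chunk.foldLeft(zero)(plus)) // per-chunk partial result
      .foldLeft(zero)(plus)                     // combine the partials

  // Incrementalism: fold one new value into a previously computed result
  // without revisiting the old data.
  def update[A](previous: A, next: A)(plus: (A, A) => A): A = plus(previous, next)
}
```

With addition, `ChunkedCombine.chunkedCombine(1 to 100, 7)(0)(_ + _)` gives the same total as `(1 to 100).sum`, whatever chunk size is chosen.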
Sound Familiar?
● map/REDUCE
  ○ perfect for the reduce phase
  ○ see Scalding: expenses.groupBy('shoppingLocation) { _.sum[Double]('cost -> 'totalCost) }
● Streaming
  ○ perfect for maintaining running calculations on streams of data (Storm, …)
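The two use cases above can be sketched with plain Scala collections. This is not Scalding or Storm code: the `Expense` record, `mergeTotals`, `totalCostByLocation`, and `runningTotals` are hypothetical names chosen to mirror the fields in the Scalding line, and everything runs in a single process.

```scala
// Plain-collections sketch of a monoid-style reduce phase and a running stream total.
object ReduceAndStream {
  // Hypothetical record standing in for one row of the `expenses` data.
  final case class Expense(shoppingLocation: String, cost: Double)

  // Reduce phase: merge two partial per-location totals by adding costs key by key.
  def mergeTotals(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
    b.foldLeft(a) { case (acc, (location, cost)) =>
      acc.updated(location, acc.getOrElse(location, 0.0) + cost)
    }

  // Map each expense to a one-entry total, then reduce with mergeTotals.
  def totalCostByLocation(expenses: Seq[Expense]): Map[String, Double] =
    expenses
      .map(e => Map(e.shoppingLocation -> e.cost))
      .foldLeft(Map.empty[String, Double])(mergeTotals)

  // Streaming: emit the running total after each incoming value.
  def runningTotals(costs: Iterator[Double]): Iterator[Double] =
    costs.scanLeft(0.0)(_ + _).drop(1) // drop the initial 0.0 seed
}
```

Because `mergeTotals` is associative with the empty map as its identity, partial maps computed on different machines could be merged in any grouping.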
Approximate Data Structures
● HyperLogLog
  ○ an algorithm for the count-distinct problem, approximating the number of distinct elements in a set
● Count-min Sketch
  ○ a probabilistic data structure that provides an approximate frequency table
● MinHash
  ○ estimates how similar two sets are (approximate Jaccard similarity)
● Bloom filter
  ○ a probabilistic data structure that is used to test whether an element is a member of a set
  ○ can answer definitely No or maybe Yes
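As one illustration of how such a structure fits the monoid pattern, here is a toy Bloom filter in plain Scala. It is a sketch under stated assumptions, not Algebird's implementation: `NumBits`, `NumHashes`, and the use of seeded `MurmurHash3.stringHash` calls as the hash family are arbitrary choices for the example. Union of two filters is a bitwise OR, and the empty filter is its identity.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

// Toy Bloom filter (illustrative; sizes are not tuned for any false-positive rate).
object BloomSketch {
  val NumBits = 1024   // assumed bit-array size
  val NumHashes = 3    // assumed number of hash functions

  final case class Bloom(bits: BitSet) {
    // Set the bit for every hash position of the item.
    def add(item: String): Bloom = Bloom(positions(item).foldLeft(bits)(_ + _))

    // false means "definitely No"; true means "maybe Yes".
    def mightContain(item: String): Boolean = positions(item).forall(bits.contains)

    // Monoid operation: bitwise OR of the two bit sets.
    def union(other: Bloom): Bloom = Bloom(bits | other.bits)
  }

  // Monoid identity: a filter with no bits set.
  val empty: Bloom = Bloom(BitSet.empty)

  // One bit position per seed, mapped into [0, NumBits).
  private def positions(item: String): Seq[Int] =
    (0 until NumHashes).map { seed =>
      java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed), NumBits)
    }

  def fromItems(items: Iterable[String]): Bloom =
    items.foldLeft(empty)((filter, item) => filter.add(item))
}
```

Filters built separately over two batches can be merged with `union`, and the result equals the filter built over both batches at once.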
Examples
● HyperLogLog
  ○ How many unique Twitter handles tweeted @justinbieber in the past month?
● Count-min Sketch
  ○ What are the frequencies of the hashtags in those tweets?
● MinHash
  ○ How similar are the followers of @justinbieber (~70M) to the followers of @katyperry (~76M)?
● Bloom filter
  ○ Did Kevin tweet to @justinbieber in the past month? maybe yes. Must be a false positive, can you really trust a Bloom filter?!?!?
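The hashtag-frequency question can be sketched with a toy count-min sketch. As before, this is a plain-Scala illustration under assumptions, not Algebird's implementation: `Depth`, `Width`, and the seeded `MurmurHash3.stringHash` hash family are arbitrary choices. Merging two sketches is element-wise addition of their counter tables, so sketches built per day or per machine can be combined afterwards.

```scala
import scala.util.hashing.MurmurHash3

// Toy count-min sketch (illustrative; dimensions are not tuned for any error bound).
object CountMinSketch {
  val Depth = 4     // assumed number of hash rows
  val Width = 256   // assumed counters per row

  final case class CMS(table: Vector[Vector[Long]]) {
    // Increment one counter per row.
    def add(item: String): CMS =
      CMS(table.zipWithIndex.map { case (row, i) =>
        val col = column(item, i)
        row.updated(col, row(col) + 1L)
      })

    // The estimate never undercounts; hash collisions can only inflate it.
    def estimate(item: String): Long =
      table.zipWithIndex.map { case (row, i) => row(column(item, i)) }.min

    // Monoid operation: element-wise addition of the counter tables.
    def merge(other: CMS): CMS =
      CMS(table.zip(other.table).map { case (r1, r2) =>
        r1.zip(r2).map { case (x, y) => x + y }
      })
  }

  // Monoid identity: all counters at zero.
  val empty: CMS = CMS(Vector.fill(Depth, Width)(0L))

  // Counter index for an item in a given row, mapped into [0, Width).
  private def column(item: String, row: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(item, row), Width)

  def fromItems(items: Iterable[String]): CMS =
    items.foldLeft(empty)((sketch, item) => sketch.add(item))
}
```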
How did that get in there?
https://highlyscalable.files.wordpress.com/2012/04/probabilistic-sizes.png
This is better than Spanks™!
Thanks Twitter
https://github.com/twitter/algebird*
* Sorry, Algebird doesn’t have a cool logo. Don’t blame me, blame Twitter!
Need more?
● http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
● https://github.com/twitter/algebird/wiki/Learning-Algebird-Monoids-with-REPL
● https://github.com/twitter/algebird
● https://github.com/twitter/scalding
● https://github.com/twitter/summingbird
● https://github.com/twitter/algebird/wiki/Abstract-algebra-definitions