Mining of massive datasets

Mining of Massive Datasets Ashic Mahtab @ashic www.heartysoft.com

Description

Slides from my talk at DDD Dundee 2014 on some approaches that are used in mining of massive datasets.

Transcript of Mining of massive datasets

Page 1: Mining of massive datasets

Mining of Massive Datasets

Ashic Mahtab

@ashic

www.heartysoft.com

Page 2: Mining of massive datasets

Stream Processing

Page 3: Mining of massive datasets

Stream Processing

Have I already processed this?

How many distinct queries were made?

How many hits did I get?

Page 4: Mining of massive datasets

Stream Processing – Bloom Filters

Guaranteed detection of negatives.

Possible false positives.

Page 5: Mining of massive datasets

Stream Processing – Bloom Filters

Have a collection of hash functions (h1, h2, h3…).

For an input, run each hash function and map the results to positions in a bit array.

If all of those bits are already lit in the working store, the item might have been processed (possibility of false positives).

If any of the bits lit by the hashes is not lit in the working store, the item definitely needs processing (guaranteed: no false negatives).
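A minimal sketch of this in Python (the bit-array size, the number of hash functions, and the add / might_contain names are illustrative choices, not from the slides):

import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit working store."""

    def __init__(self, m=1024, k=3):
        self.m = m                  # bits in the working store
        self.k = k                  # number of hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions from one digest; a stand-in for h1, h2, h3...
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.k):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        # Light the bits for this item in the working store.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True  -> possibly processed already (false positives possible).
        # False -> definitely not processed (no false negatives).
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("Foo")
print(bf.might_contain("Foo"))   # True
print(bf.might_contain("Bar"))   # almost certainly False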

Page 6: Mining of massive datasets

Stream Processing – Bloom Filters

[Worked example on the slide: bit-array rows showing a working store alongside the bits lit by hashing Input 1 “Foo” and Input 2 “Bar”.]

Page 7: Mining of massive datasets

Stream Processing – Bloom Filters

Not just for streams (everything is a stream, right?)

Cassandra uses Bloom filters to detect whether some data is in a low-level storage file.

Page 8: Mining of massive datasets

Map Reduce

A little smarts goes a l-o-o-o-n-g way.

Page 9: Mining of massive datasets

Map Reduce – Multiway Joins

R join S join T

size(R) = r, size(S) = s, size(T) = t

Probability of match for R and S = p

Probability of match for S and T = p

Which do we join first?

Page 10: Mining of massive datasets

Map Reduce – Multiway Joins

R (A, B) join S(B, C) join T(C, D)

size(R) = r, size(S) = s, size(T) = t

Probability of match for R and S = p

Probability of match for S and T = p

Communication cost:

* If we join R and S first: O(r + s + t + prs)

* If we join S and T first: O(r + s + t + pst)
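To see where the extra term comes from (a quick check, not spelled out on the slide): joining R and S first, the first job reads r + s tuples and emits an intermediate result of expected size p*r*s, which the second job must read along with t, so the total is roughly r + s + prs + t. Joining S and T first gives the symmetric pst term.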

Page 11: Mining of massive datasets

Map Reduce – Multiway Joins

Can we do better?

Page 12: Mining of massive datasets

Map Reduce – Multiway Joins

Hash B to b buckets, C to c buckets.

bc = k

Cost ~ r + 2s + t + 2 * sqrt(krt)

Usually, we can neglect r + t compared to the sqrt(krt) term. So,

2s + 2*sqrt(krt)

[Single MR job]

Page 13: Mining of massive datasets

Map Reduce – Multiway Joins

Hash B to b buckets, C to c buckets.

bc = k

Cost ~ r + 2s + t + 2 * sqrt(krt)

Usually, we can neglect r + t compared to the sqrt(krt) term. So,

2s + 2*sqrt(krt)

[Single MR job]

vs (r + s + t + prs)

[Two MR jobs]
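A sketch of the map side of that single job in Python (the tuple shapes, bucket counts, and crc32-based hash functions are assumptions for illustration):

import zlib

b, c = 4, 8                      # bucket counts for B and C; k = b * c reducers
def hb(v): return zlib.crc32(str(v).encode()) % b
def hc(v): return zlib.crc32(str(v).encode()) % c

def map_r(a, b_val):
    # Each R(A, B) tuple is replicated to all c buckets of C.
    for j in range(c):
        yield (hb(b_val), j), ("R", a, b_val)

def map_s(b_val, c_val):
    # Each S(B, C) tuple goes to exactly one reducer.
    yield (hb(b_val), hc(c_val)), ("S", b_val, c_val)

def map_t(c_val, d):
    # Each T(C, D) tuple is replicated to all b buckets of B.
    for i in range(b):
        yield (i, hc(c_val)), ("T", c_val, d)

# Communication: s copies of S, c*r copies of R, b*t copies of T,
# minimised at roughly 2*sqrt(k*r*t) when c*r = b*t.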

Page 14: Mining of massive datasets

Map Reduce – Multiway Joins

So…is this always better?

Page 15: Mining of massive datasets

Map Reduce – Complexity

Replication Rate (r):

Number of outputs by all Map tasks / number of inputs

Reducer Size (q):

Max number of items per key at reducers

p = number of inputs

For one-pass multiplication of two n x n matrices (p = 2n^2):

qr >= 2n^2

i.e. r >= p / q
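Reading this together with the one-pass cost quoted on a later slide: qr >= 2n^2 gives r >= 2n^2 / q, and the total communication is r * p = (2n^2 / q) * 2n^2 = 4n^4 / q.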

Page 16: Mining of massive datasets

Map Reduce – Matrix Multiplication

Approach 1

Matrix M, N

M(i, j), N(j, k)

Map1: Map the matrix elements to (j, (M, i, mij)) and (j, (N, k, njk))

Reduce1: for each key j, output ((i, k), mij*njk) for every pair of M and N values

Map2: Identity

Reduce2: For each key, (i, k) get the sum of values.
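A minimal in-memory sketch of Approach 1 (the dicts stand in for the MapReduce shuffle; the small matrices are made up for illustration):

from collections import defaultdict

# M is I x J, N is J x K, both given as {(row, col): value} with zeros omitted.
M = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
N = {(0, 0): 4.0, (1, 0): 5.0, (1, 1): 6.0}

# --- Job 1 ---
# Map1: key by the shared index j.
shuffle1 = defaultdict(list)
for (i, j), mij in M.items():
    shuffle1[j].append(("M", i, mij))
for (j, k), njk in N.items():
    shuffle1[j].append(("N", k, njk))

# Reduce1: for each j, emit ((i, k), mij * njk) for every M/N pair.
intermediate = []
for j, values in shuffle1.items():
    ms = [(i, v) for tag, i, v in values if tag == "M"]
    ns = [(k, v) for tag, k, v in values if tag == "N"]
    for i, mij in ms:
        for k, njk in ns:
            intermediate.append(((i, k), mij * njk))

# --- Job 2 ---
# Map2 is the identity; Reduce2 sums the partial products per (i, k).
result = defaultdict(float)
for (i, k), product in intermediate:
    result[(i, k)] += product

print(dict(result))  # {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}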

Page 17: Mining of massive datasets

Map Reduce – Matrix Multiplication

Approach 2

One step:

Map:

For each element mij of M, produce ((i, k), (M, j, mij)) for k = 1…(number of columns in N)

For each element njk of N, produce ((i, k), (N, j, njk)) for i = 1…(number of rows in M)

Reduce:

For each key (i, k), pair up the M and N values that share the same j, multiply, and sum.
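A sketch of the one-step map and reduce functions (the matrix representation and helper names are assumptions):

def map_one_pass(M, N, rows_in_M, cols_in_N):
    # M element mij is needed by every output (i, k) in its row;
    # N element njk is needed by every output (i, k) in its column.
    for (i, j), mij in M.items():
        for k in range(cols_in_N):
            yield (i, k), ("M", j, mij)
    for (j, k), njk in N.items():
        for i in range(rows_in_M):
            yield (i, k), ("N", j, njk)

def reduce_one_pass(key, values):
    # Match M and N values on j, multiply, and sum.
    m = {j: v for tag, j, v in values if tag == "M"}
    n = {j: v for tag, j, v in values if tag == "N"}
    return key, sum(m[j] * n[j] for j in m if j in n)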

Page 18: Mining of massive datasets

Map Reduce – Matrix Multiplication

Approach 3

Two steps again.

Page 19: Mining of massive datasets

Map Reduce – Matrix Multiplication

Communication cost, one pass:

(4n^4) / q

Two pass:

(4n^3) / sqrt(q)
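As a rough comparison (the numbers are my own illustration): the ratio of the two costs is (4n^4 / q) / (4n^3 / sqrt(q)) = n / sqrt(q), so the two-pass approach communicates less whenever n > sqrt(q). With n = 10,000 and q = 1,000,000, one pass costs about 4 * 10^10 versus 4 * 10^9 for two passes.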

Page 20: Mining of massive datasets

Similarity - Shingling

“abcdef” -> [“abc”, “bcd”, “cde”…]

Jaccard similarity -> N(intersection) / N(union)

Page 21: Mining of massive datasets

Similarity - Shingling

“abcdef” -> [“abc”, “bcd”, “cde”…]

Jaccard similarity -> N(intersection) / N(union)

Problem?

Size
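A minimal sketch of 3-shingling and Jaccard similarity, matching the example above (the helper names are mine):

def shingles(text, k=3):
    # All length-k substrings, as a set.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    # |intersection| / |union|
    return len(a & b) / len(a | b)

s1 = shingles("abcdef")            # {"abc", "bcd", "cde", "def"}
s2 = shingles("abcxef")
print(jaccard(s1, s2))             # 0.142857... (1 shared shingle out of 7)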

Page 22: Mining of massive datasets

Similarity - Minhashing

Page 23: Mining of massive datasets

Similarity - Minhashing

h(S1) = a, h(S2) = c, h(S3) = b, h(S4) = a

Page 24: Mining of massive datasets

Similarity - Minhashing

h(S1) = a, h(S2) = c, h(S3) = b, h(S4) = a

Problem?

Page 25: Mining of massive datasets

Similarity – Minhash Signatures

Page 26: Mining of massive datasets

Similarity – Minhash Signatures

Problem? Still can’t find pairs with greatest similarity efficiently
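A hedged sketch of minhash signatures based on the standard construction in the book (the (a*x + b) mod p hash family, the prime, and the toy sets are illustrative assumptions):

import random

P = 4294967311  # a prime larger than 2**32; hash functions are h(x) = (a*x + b) mod P

def make_hash_fns(n_hashes, seed=42):
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(n_hashes)]

def minhash_signature(shingle_ids, hash_fns):
    # One signature entry per hash function: the minimum hash over the set.
    return [min((a * x + b) % P for x in shingle_ids) for a, b in hash_fns]

hash_fns = make_hash_fns(200)
sig1 = minhash_signature({1, 5, 9, 13}, hash_fns)
sig2 = minhash_signature({1, 5, 9, 14}, hash_fns)
agreement = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
print(agreement)  # estimates the Jaccard similarity, which is 3/5 here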

Page 27: Mining of massive datasets

Similarity – LSH for Minhash Signatures

Page 28: Mining of massive datasets

Clustering – Hierarchical

Page 29: Mining of massive datasets

Clustering – K Means

1. Pick k points (centroids)

2. Assign points to clusters

3. Shift centroids to “centre”.

4. Repeat steps 2 and 3 until the centroids stop moving (a minimal sketch follows).
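A toy sketch of those four steps in Python (random initial centroids, Euclidean distance, fixed iteration count; not how you would do it at scale):

import random

def kmeans(points, k, iterations=20, seed=1):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # 1. pick k points as centroids
    for _ in range(iterations):                       # 4. repeat
        clusters = [[] for _ in range(k)]
        for p in points:                              # 2. assign points to the nearest centroid
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        for i, cluster in enumerate(clusters):        # 3. shift centroids to the centre
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 8), (9, 9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)  # one centroid near (1.17, 1.5), the other near (8.5, 8.33)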

Page 30: Mining of massive datasets

Clustering – K Means

Page 31: Mining of massive datasets

Clustering – BFR • 3 sets – Discard, Compressed and Retained

• First two have summaries: N, sum per dimension, sum of squares per dimension

• High-dimensional Euclidean space

Mahalanobis Distance
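A sketch of the Mahalanobis distance computed from those summaries (per-dimension centroid and variance derived from N, SUM and SUMSQ; the example numbers are made up):

import math

def mahalanobis(point, n, sums, sumsqs):
    # Cluster summary: n points, per-dimension sum and sum of squares.
    d2 = 0.0
    for p, s, sq in zip(point, sums, sumsqs):
        centroid = s / n
        variance = sq / n - centroid ** 2
        d2 += ((p - centroid) ** 2) / variance
    return math.sqrt(d2)

# Example: a cluster summarised by N=4, SUM and SUMSQ per dimension.
n = 4
sums = [8.0, 12.0]          # centroid = (2, 3)
sumsqs = [20.0, 40.0]       # variance = (1, 1)
print(mahalanobis((3.0, 4.0), n, sums, sumsqs))  # sqrt(1 + 1) ~ 1.41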

Page 32: Mining of massive datasets

Clustering – CURE

Page 33: Mining of massive datasets

Clustering – CURE • Sample. Run clustering on sample.

• Pick “representatives” from each cluster.

• Move each representative about 20% of the way towards the cluster centre (a sketch of this step follows the list).

• Merge clusters whose representatives are close.
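A sketch of just the representative-picking and 20% shrink step, using numpy (greedily picking well-scattered points is one common choice; the cluster data is made up):

import numpy as np

def representatives(cluster_points, n_reps=4, shrink=0.2):
    pts = np.asarray(cluster_points, dtype=float)
    centroid = pts.mean(axis=0)
    # Greedily pick scattered points: start with the one farthest from the centroid,
    # then repeatedly take the point farthest from those already chosen.
    chosen = [pts[np.argmax(np.linalg.norm(pts - centroid, axis=1))]]
    while len(chosen) < min(n_reps, len(pts)):
        dists = np.min([np.linalg.norm(pts - c, axis=1) for c in chosen], axis=0)
        chosen.append(pts[np.argmax(dists)])
    # Move each representative ~20% of the way towards the centroid.
    return [c + shrink * (centroid - c) for c in chosen]

cluster = [(0, 0), (1, 0), (0, 1), (10, 10)]
for rep in representatives(cluster, n_reps=2):
    print(rep)   # [8.55 8.55] and [0.55 0.55]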

Page 34: Mining of massive datasets

Dimensionality Reduction

Page 35: Mining of massive datasets

Dimensionality Reduction

Page 36: Mining of massive datasets

Dimensionality Reduction - SVD

Page 37: Mining of massive datasets

Dimensionality Reduction - SVD

Page 38: Mining of massive datasets

Dimensionality Reduction - CUR

SVD results in U and V being dense, even when M is sparse.

Computing the full SVD is O(n^3).

Page 39: Mining of massive datasets

Dimensionality Reduction - CUR

Choose r.

Choose r rows and r columns of M.

Intersection is W.

Run SVD on W (much smaller than M). W = XΣY’

Compute Σ+, the Moore-Penrose pseudoinverse of Σ.

Then, U = Y * (Σ+)^2 * X’
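A minimal numpy sketch of building the middle matrix U from W (the chosen rows and columns are hard-coded here for illustration):

import numpy as np

def middle_matrix(M, row_idx, col_idx):
    # W = intersection of the chosen rows and columns of M.
    W = M[np.ix_(row_idx, col_idx)]
    # SVD of the (small) W: W = X * diag(sigma) * Y'
    X, sigma, Yt = np.linalg.svd(W, full_matrices=False)
    # Sigma+ : Moore-Penrose pseudoinverse of Sigma (invert the non-zero entries).
    sigma_plus = np.array([1.0 / s if s > 1e-12 else 0.0 for s in sigma])
    # U = Y * (Sigma+)^2 * X'
    return Yt.T @ np.diag(sigma_plus ** 2) @ X.T

M = np.arange(1.0, 26.0).reshape(5, 5)
U = middle_matrix(M, row_idx=[0, 4], col_idx=[1, 3])
print(U.shape)   # (2, 2): one row per chosen column, one column per chosen row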

Page 40: Mining of massive datasets

Dimensionality Reduction – CUR

Choosing Rows and Columns

Random, but with bias for importance.

(Frobenius Norm)^2

Probability of picking a row or column:

Sum of squares for row or column / Sum of squares of all elements
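A short sketch of those selection probabilities (the example matrix is made up):

import numpy as np

M = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 4.0]])

total = np.sum(M ** 2)                        # squared Frobenius norm of M
col_probs = np.sum(M ** 2, axis=0) / total    # one probability per column
row_probs = np.sum(M ** 2, axis=1) / total    # one probability per row

print(col_probs)  # [0.0333... 0.4333... 0.5333...]
print(row_probs)  # [0.1666... 0.8333...]

# Pick columns with replacement, biased towards the "important" ones.
cols = np.random.choice(M.shape[1], size=2, replace=True, p=col_probs)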

Page 41: Mining of massive datasets

Dimensionality Reduction – CUR

Choosing Rows and Columns

Same row / column may get picked (selection with replacement).

Reduces rank.

Page 42: Mining of massive datasets

Dimensionality Reduction – CUR

Choosing Rows and Columns

Same row / column may get picked (selection with replacement).

Reduces rank.

Can be combined: multiply vector by sqrt(k) if it appears k times.

Page 43: Mining of massive datasets

Dimensionality Reduction – CUR

Choosing Rows and Columns

Same row / column may get picked (selection with replacement).

Reduces rank.

Can be combined: multiply vector by sqrt(k) if it appears k times.

Compute pseudo-inverse as before, but transpose the result.

Page 44: Mining of massive datasets

Dimensionality Reduction – CUR

Choosing Rows and Columns

Same row / column may get picked (selection with replacement).

Reduces rank.

Can be combined: multiply vector by sqrt(k) if it appears k times.

Compute pseudo-inverse as before, but transpose the result.

Page 45: Mining of massive datasets

Thanks

Mining of Massive Datasets

Leskovec, Rajaraman, Ullman

Coursera / Stanford Course

Book: http://www.mmds.org/ [free]