Mining of Massive Datasets


Slides from my talk at DDD Dundee 2014 on some approaches used in mining massive datasets.


Mining of Massive Datasets

Ashic Mahtab

@ashic

www.heartysoft.com

Stream Processing

Have I already processed this?

How many distinct queries were made?

How many hits did I get?

Stream Processing – Bloom Filters

No false negatives: a negative answer is guaranteed correct.

Possible false positives.

Stream Processing – Bloom Filters

Have a collection of hash functions (h1, h2, h3…).

For an input, run the hash functions and map the results to positions in a bit array.

If all of the input's bits are already lit in the working store, the input might have been processed before (possibility of a false positive).

If any of the input's bits are not lit in the working store, it definitely needs processing (guaranteed: no false negatives).
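A minimal sketch of this in Python (the collection of hash functions is simulated by salting a single hash; the array size and hash count are illustrative, not tuned):

```python
# Minimal Bloom filter sketch: k salted hashes map each item to k bit
# positions in a fixed-size bit array (the "working store").
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is guaranteed correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("Foo")
assert bf.might_contain("Foo")   # definitely lit all its bits
print(bf.might_contain("Bar"))   # almost certainly False; never a false negative
```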

Stream Processing – Bloom Filters

[Slide: worked example with 10-bit arrays. Input 1: “Foo” hashes to 1 0 1 1 1 0 0 0 0 0; Input 2: “Bar” hashes to another pattern; each is checked against the bits already lit in the working store.]

Stream Processing – Bloom Filters

Not just for streams (everything is a stream, right?)

Cassandra uses Bloom filters to check whether some data might be in a low-level storage file before reading it from disk.

Map Reduce

A little smarts goes a l-o-o-o-n-g way.

Map Reduce – Multiway Joins

R join S join T

size(R) = r, size(S) = s, size(T) = t

Probability of match for R and S = p

Probability of match for S and T = p

Which do we join first?

Map Reduce – Multiway Joins

R(A, B) join S(B, C) join T(C, D)

size(R) = r, size(S) = s, size(T) = t

Probability of match for R and S = p

Probability of match for S and T = p

Communication cost:

* If we join R and S first: O(r + s + t + prs), since the intermediate result R⋈S has expected size prs

* If we join S and T first: O(r + s + t + pst)

Map Reduce – Multiway Joins

Can we do better?

Map Reduce – Multiway Joins

Hash B to b buckets, C to c buckets.

bc = k

Cost ~ r + 2s + t + 2 * sqrt(krt)

Usually, r + t can be neglected compared to the sqrt(krt) term. So,

2s + 2*sqrt(krt)

[Single MR job]


vs (r + s + t + prs)

[Two MR jobs]
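A sketch of the reducer routing behind these costs (b, c and the hash choice are illustrative): tuples of S go to exactly one reducer in the b x c grid, while tuples of R and T are replicated across a whole row or column of it.

```python
# Sketch of how map tasks route tuples for the single-job three-way join
# R(A,B) join S(B,C) join T(C,D). Reducers form a b x c grid keyed by
# (hash(B) % b, hash(C) % c). b and c here are illustrative, not optimised.
b, c = 4, 4  # b * c = k = 16 reducers

def route_r(a_val, b_val):
    # R knows B but not C: replicate to every column of row hash(B) % b.
    return [((hash(b_val) % b, j), ("R", a_val, b_val)) for j in range(c)]

def route_s(b_val, c_val):
    # S knows both join attributes: exactly one reducer.
    return [((hash(b_val) % b, hash(c_val) % c), ("S", b_val, c_val))]

def route_t(c_val, d_val):
    # T knows C but not B: replicate to every row of column hash(C) % c.
    return [((i, hash(c_val) % c), ("T", c_val, d_val)) for i in range(b)]

print(len(route_s("b1", "c1")))  # 1: each S tuple is sent once (the s term)
print(len(route_r("a1", "b1")))  # c copies of each R tuple (the cr term)
print(len(route_t("c1", "d1")))  # b copies of each T tuple (the bt term)
```

Minimising cr + bt subject to bc = k is what yields the 2 * sqrt(krt) term above.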

Map Reduce – Multiway Joins

So…is this always better?

Map Reduce – Complexity

Replication Rate (r):

Number of outputs by all Map tasks / number of inputs

Reducer Size (q):

Maximum number of inputs a reducer receives for a single key

p = number of inputs

For one-pass multiplication of two n×n matrices: qr >= 2n^2

Since p = 2n^2 inputs here, that is r >= p / q
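A counting-argument sketch of where the n×n bound comes from (my paraphrase of the standard argument, not from the slides): each output needs a full row of M and a full column of N, i.e. 2n inputs, at a single reducer, so a reducer holding q inputs covers at most (q/2n)^2 outputs.

```latex
% Counting argument for one-pass n x n matrix multiplication (sketch).
% Each reducer covers at most (q/2n)^2 of the n^2 outputs:
\[
\#\text{reducers} \ge \frac{n^2}{(q/2n)^2} = \frac{4n^4}{q^2},
\qquad
\text{communication} \ge q \cdot \frac{4n^4}{q^2} = \frac{4n^4}{q}.
\]
% With p = 2n^2 inputs, the replication rate satisfies
\[
r = \frac{4n^4/q}{2n^2} = \frac{2n^2}{q}
\quad\Longrightarrow\quad
qr \ge 2n^2 .
\]
```

The 4n^4 / q figure reappears below as the one-pass communication cost.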

Map Reduce – Matrix Multiplication

Approach 1

Matrix M, N

M(i, j), N(j, k)

Map1: map the matrices to (j, (M, i, m_ij)) and (j, (N, k, n_jk))

Reduce1: for each key j, output ((i, k), m_ij * n_jk) for every pair of M and N entries

Map2: identity

Reduce2: for each key (i, k), sum the values
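A sketch of the two passes with the shuffle simulated by dictionaries (the small sparse matrices are illustrative):

```python
# Two-pass MapReduce matrix multiplication, shuffle simulated with dicts.
from collections import defaultdict

M = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0}   # sparse M(i, j)
N = {(0, 0): 4.0, (1, 0): 5.0}                # sparse N(j, k)

# Pass 1: group on j, emit one product per (M, N) pair sharing that j.
by_j = defaultdict(lambda: ([], []))
for (i, j), m_ij in M.items():
    by_j[j][0].append((i, m_ij))
for (j, k), n_jk in N.items():
    by_j[j][1].append((k, n_jk))

products = []  # ((i, k), m_ij * n_jk)
for j, (ms, ns) in by_j.items():
    for i, m_ij in ms:
        for k, n_jk in ns:
            products.append(((i, k), m_ij * n_jk))

# Pass 2: identity map, then sum per key (i, k).
result = defaultdict(float)
for (i, k), v in products:
    result[(i, k)] += v

print(dict(result))  # {(0, 0): 14.0, (1, 0): 12.0}
```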

Map Reduce – Matrix Multiplication

Approach 2

One step:

Map:

For each element m_ij of M, produce ((i, k), (M, j, m_ij)) for k = 1…number_of_columns_in_N

For each element n_jk of N, produce ((i, k), (N, j, n_jk)) for i = 1…number_of_rows_in_M

Reduce:

For each key (i, k), multiply matching values (pairing on j) and sum.
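A sketch of the one-step version (dimensions are illustrative); note how every element of M is replicated once per column of N, and vice versa, which is what drives the replication rate:

```python
# One-pass MapReduce matrix multiplication: each element is replicated to
# every (i, k) key that needs it, so a reducer sees a full row of M and
# a full column of N.
from collections import defaultdict

M = [[1.0, 2.0], [3.0, 0.0]]   # 2 x 2
N = [[4.0], [5.0]]             # 2 x 1
rows_M, cols_N, inner = 2, 1, 2

shuffle = defaultdict(list)    # key (i, k) -> [(tag, j, value)]
for i in range(rows_M):
    for j in range(inner):
        for k in range(cols_N):        # replicate m_ij to every column of N
            shuffle[(i, k)].append(("M", j, M[i][j]))
for j in range(inner):
    for k in range(cols_N):
        for i in range(rows_M):        # replicate n_jk to every row of M
            shuffle[(i, k)].append(("N", j, N[j][k]))

result = {}
for (i, k), values in shuffle.items():
    m_vals = {j: v for tag, j, v in values if tag == "M"}
    n_vals = {j: v for tag, j, v in values if tag == "N"}
    result[(i, k)] = sum(m_vals[j] * n_vals[j] for j in m_vals)

print(result)  # {(0, 0): 14.0, (1, 0): 12.0}
```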

Map Reduce – Matrix Multiplication

Approach 3

Two steps again.

Map Reduce – Matrix Multiplication

Communication cost:

One pass: (4n^4) / q

Two pass: (4n^3) / sqrt(q)

The two-pass approach communicates less whenever n > sqrt(q).

Similarity - Shingling

“abcdef” -> [“abc”, “bcd”, “cde”…]

Jaccard similarity = N(intersection) / N(union)


Problem?

Size
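The two definitions in a few lines of Python (shingle length k = 3, matching the slide's example; the strings are illustrative):

```python
# k-shingling and Jaccard similarity over shingle sets.
def shingles(text, k=3):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

s1, s2 = shingles("abcdef"), shingles("abcdeg")
print(sorted(s1))        # ['abc', 'bcd', 'cde', 'def']
print(jaccard(s1, s2))   # 3 shared of 5 total -> 0.6
```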

Similarity - Minhashing

The minhash of a set, under a random permutation of the matrix rows, is the first row (in permuted order) in which the set has a 1. The probability that two sets get the same minhash equals their Jaccard similarity.

h(S1) = a, h(S2) = c, h(S3) = b, h(S4) = a

Problem? Explicitly permuting a massive matrix is infeasible.

Similarity – Minhash Signatures


Simulate many permutations with hash functions; each contributes one row of a signature. Problem? We still can’t find the pairs with greatest similarity efficiently.
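A sketch of computing signatures, simulating each hash function with a salted MD5 (hash choice and sizes are illustrative):

```python
# Minhash signatures: each of num_hashes salted hash functions contributes
# one row; similar sets agree on a similar fraction of rows.
import hashlib

def shingles(text, k=3):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100):
    sig = []
    for salt in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{salt}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # Fraction of agreeing rows estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig1 = minhash_signature(shingles("abcdefghij"))
sig2 = minhash_signature(shingles("abcdefghiz"))
print(estimated_jaccard(sig1, sig2))  # close to the true Jaccard similarity
```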

Similarity – LSH for Minhash Signatures
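A minimal sketch of the standard banding technique: split each signature into bands of a few rows, bucket each band, and treat columns that share a bucket in any band as candidate pairs (band width and the example signatures are illustrative).

```python
# LSH over minhash signatures: candidates are pairs that collide in at
# least one band. Similar signatures collide with high probability.
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, band_width=5):
    # signatures: dict of doc_id -> minhash signature (list of ints)
    candidates = set()
    sig_len = len(next(iter(signatures.values())))
    for start in range(0, sig_len, band_width):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band = tuple(sig[start:start + band_width])
            buckets[band].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

sigs = {"d1": [1, 2, 3, 4, 5, 6], "d2": [1, 2, 3, 9, 9, 9], "d3": [7, 8, 7, 8, 7, 8]}
print(lsh_candidate_pairs(sigs, band_width=3))  # {('d1', 'd2')}
```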

Clustering – Hierarchical

Clustering – K Means

1. Pick k points as initial centroids.

2. Assign each point to the cluster of its nearest centroid.

3. Shift each centroid to the “centre” (mean) of its cluster.

4. Repeat (sketch below).
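A bare-bones sketch of that loop over 2-D points (data, k and the iteration count are illustrative; a real implementation would also test for convergence):

```python
# k-means: pick k centroids, assign points, move centroids to the mean,
# repeat for a fixed number of iterations.
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    centroids = random.sample(points, k)          # 1. pick k points
    clusters = []
    for _ in range(iterations):                   # 4. repeat
        clusters = [[] for _ in range(k)]
        for p in points:                          # 2. assign to nearest centroid
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):    # 3. shift centroid to the mean
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```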


Clustering – BFR

• 3 sets: Discard, Compressed, and Retained.

• The first two are kept only as summaries: N, the sum per dimension, and the sum of squares per dimension.

• Assumes high-dimensional Euclidean space.

• Mahalanobis distance decides whether a point is close enough to an existing cluster (sketch below).
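A sketch of the summary representation and a simplified (axis-aligned, per-dimension variance) Mahalanobis distance computed from it; the structure follows the bullets above, the names are mine:

```python
# BFR-style cluster summary: N, SUM and SUMSQ per dimension are enough to
# recover the mean and variance, and hence a normalised distance.
import math

class ClusterSummary:
    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims      # SUM per dimension
        self.sumsq = [0.0] * dims    # SUMSQ per dimension

    def add(self, point):
        self.n += 1
        for d, x in enumerate(point):
            self.sum[d] += x
            self.sumsq[d] += x * x

    def mahalanobis(self, point):
        # Per-dimension deviation normalised by the standard deviation.
        total = 0.0
        for d, x in enumerate(point):
            mean = self.sum[d] / self.n
            var = self.sumsq[d] / self.n - mean ** 2
            total += ((x - mean) ** 2) / var if var > 0 else 0.0
        return math.sqrt(total)

c = ClusterSummary(dims=2)
for p in [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2)]:
    c.add(p)
print(c.mahalanobis((1.1, 2.0)))  # small -> point plausibly belongs here
```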

Clustering – CURE

• Take a sample; run clustering on the sample.

• Pick “representatives” from each cluster.

• Move the representatives about 20% or so towards the centre.

• Merge clusters whose representatives are close (sketch below).
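A sketch of the representative handling (clustering of the sample is assumed already done; the representative count, shrink factor and merge threshold are illustrative):

```python
# CURE-style representatives: pick scattered points per cluster, shrink
# them ~20% toward the centroid, merge clusters whose reps come close.
def centroid(points):
    return tuple(sum(x) / len(points) for x in zip(*points))

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def representatives(cluster, count=4, shrink=0.2):
    c = centroid(cluster)
    reps = [max(cluster, key=lambda p: dist2(p, c))]
    while len(reps) < min(count, len(cluster)):
        # next representative: point farthest from those picked so far
        reps.append(max(cluster, key=lambda p: min(dist2(p, r) for r in reps)))
    # move each representative ~20% toward the centre
    return [tuple(x + shrink * (cx - x) for x, cx in zip(p, c)) for p in reps]

def should_merge(cluster_a, cluster_b, threshold=1.0):
    return min(dist2(ra, rb)
               for ra in representatives(cluster_a)
               for rb in representatives(cluster_b)) < threshold ** 2

a = [(0, 0), (0, 1), (1, 0), (1, 1)]
b = [(1.5, 1.5), (2, 2)]
print(should_merge(a, b))  # True: closest shrunk representatives are near
```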

Dimensionality Reduction

Dimensionality Reduction – SVD

M = U Σ V’, with U and V column-orthonormal and Σ the diagonal matrix of singular values.

Dimensionality Reduction – CUR

SVD results in U and V being dense, even when M is sparse, and computing it is O(n^3).

Dimensionality Reduction - CUR

Choose r.

Choose r rows and r columns of M. Their intersection is W.

Run SVD on W (much smaller than M): W = XΣY’.

Compute Σ+, the Moore-Penrose pseudoinverse of Σ.

Then U = Y * (Σ+)^2 * X’, and M ≈ CUR, where C holds the chosen columns and R the chosen rows.
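The mechanics of the U construction in numpy. Caveat: the full method also scales the sampled rows and columns when building C and R (and picks them randomly, as the next slide describes); this sketch omits that, so it shows the recipe rather than a tuned approximation. The matrix and chosen indices are illustrative.

```python
# CUR building blocks: chosen columns C, chosen rows R, intersection W,
# and U = Y (Σ+)^2 X' from the SVD of W.
import numpy as np

M = np.array([[1.0, 1.0, 1.0, 0.0],
              [3.0, 3.0, 3.0, 0.0],
              [4.0, 4.0, 4.0, 0.0],
              [0.0, 0.0, 0.0, 5.0]])

rows, cols = [2, 3], [0, 3]               # r = 2 chosen rows and columns
C = M[:, cols]                            # chosen columns of M
R = M[rows, :]                            # chosen rows of M
W = M[np.ix_(rows, cols)]                 # intersection: [[4, 0], [0, 5]]

X, sigma, Yt = np.linalg.svd(W)           # W = X Σ Y'
sigma_plus = np.where(sigma > 1e-12, 1.0 / sigma, 0.0)  # invert non-zeros
U = Yt.T @ np.diag(sigma_plus ** 2) @ X.T # U = Y (Σ+)^2 X'

print(U.shape, C.shape, R.shape)          # (2, 2) (4, 2) (2, 4)
print(np.round(U, 4))                     # C @ U @ R is the CUR reconstruction
```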

Dimensionality Reduction – CUR

Choosing Rows and Columns

Random, but with bias for importance. Importance is measured by contribution to the (Frobenius norm)^2.

Probability of picking a row or column:

(sum of squares of that row or column) / (sum of squares of all elements of M)

Dimensionality Reduction – CUR

Choosing Rows and Columns

The same row or column may get picked more than once (selection is with replacement). Duplicates reduce the rank.

Duplicates can be combined: keep one copy and multiply the vector by sqrt(k) if it appears k times.

W may then no longer be square: compute the pseudo-inverse as before, but transpose the result.

Thanks

Mining of Massive Datasets

Leskovec, Rajaraman, Ullman

Coursera / Stanford Course

Book: http://www.mmds.org/ [free]