Mining hot spot and related hot spots over web links


Transcript of Mining hot spot and related hot spots over web links

Page 1: Mining hot spot and related hot spots over web links

Mining Hot Spot and Related Hot Spots over Web Links

Chengeng Ma, Stony Brook University

2016/04/17

Page 2: Mining hot spot and related hot spots over web links

Online hot spots are of commercial interest
• Online hot spots, where many people visit every day, are attractive locations for both Internet companies and advertisers.
• Finding these hot spots and their relationships is of significant commercial interest.

Page 3: Mining hot spot and related hot spots over web links

What’s my plan?
• Firstly, we try to find these popular web pages using their PageRank values.
• Then we try to find related hot spots through the frequent-itemset technique (known as “trawling”), via a higher-order A-Priori.
• Both tasks are implemented on pseudo-distributed Hadoop, as training for MapReduce programming.

• We provide 2 summarizing slide decks, about PageRank and Frequent Itemsets separately, at:
• http://www.slideshare.net/ChenGengMa/a-hadoop-implementation-of-pagerank
• http://www.slideshare.net/ChenGengMa/hadoop-implementation-for-algorithms-apriori-pcy-son
• We highly recommend the book Mining of Massive Datasets, written by Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman.

Page 4: Mining hot spot and related hot spots over web links

Datasets

• Part of the 2002 Google Programming Contest web graph data.
• The data is provided by Jure Leskovec and can be found at http://snap.stanford.edu/data/web-Google.html
• It contains 875,713 nodes and 5,105,039 links in a 72 MB text file.
• The data is not large, but as a training exercise I will use Hadoop to finish the two tasks.

Page 5: Mining hot spot and related hot spots over web links

What’s PageRank?
• PageRank is an importance measure for web pages based on the link structure of the web, introduced by Google to defeat spammers.
• People usually add a tag or a link to a page they think is correct, useful or reliable.
• And they usually don’t put links to spam sites that are controlled by spammers.

• PageRank simulates a random walker’s probability of being at each web node.
• At every step the walker either follows one of the outgoing links (a Markov process) or teleports to a random node (the “taxation” part).

For example, a Chinese web user who sees the left picture online will probably add the tag “Milk Tea Beauty” (a young Chinese celebrity whose reputation is disputed).
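The random-walk description above can be sketched in a few lines of Python. This is an illustrative single-machine version on a made-up 3-node graph, not the Hadoop implementation from these slides; the damping value 0.85 matches the one used later.

```python
# Hedged sketch: power iteration for PageRank with teleportation ("taxation"),
# on a tiny illustrative 3-node graph. Names and the toy graph are assumptions.

def pagerank(links, beta=0.85, iters=75):
    """links: dict mapping node -> list of out-neighbours."""
    nodes = sorted(links)
    n = len(nodes)
    v = {u: 1.0 / n for u in nodes}           # start from the uniform vector
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:                        # follow outgoing links (Markov step)
            out = links[u]
            for w in out:
                nxt[w] += beta * v[u] / len(out)
        s = sum(nxt.values())                  # mass kept by the link-following step
        for u in nodes:                        # teleport: spread the rest uniformly
            nxt[u] += (1.0 - s) / n
        v = nxt
    return v

ranks = pagerank({'a': ['b', 'c'], 'b': ['c'], 'c': ['a']})
```

Because the leaked probability is redistributed every step, the ranks always sum to 1; node c, which collects links from both a and b, ends up ranked above b.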

Page 6: Mining hot spot and related hot spots over web links

• By using a partitioned matrix and vector, the calculation can be parallelized onto a computing cluster with thousands of nodes.
• Computation of such a large magnitude is usually managed by a MapReduce system like Hadoop.

[Figure: the matrix partitioned into 5 × 5 blocks, indexed by alpha (block row) and beta (block column) running from 0 to 4, with the vector split into 5 matching stripes.]

The idea is simple. However, the real-life web has billions of pages, so the matrix-vector multiplication is a serious overhead.
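The partitioned multiplication can be sketched in plain Python (a toy 4 × 4 matrix with k = 2 blocks per side; names are illustrative). Each (alpha, beta) block touches only one stripe of the vector, so the products for different blocks could run on different workers.

```python
# Hedged sketch of a block-partitioned matrix-vector product: M is cut into
# k x k blocks (indexed alpha, beta) and v into k stripes, mirroring the
# 5 x 5 partition in the figure. This serial loop only shows the data layout.

def block_multiply(M, v, k):
    n = len(v)
    step = n // k                       # assume k divides n for simplicity
    result = [0.0] * n
    for alpha in range(k):              # block-row index
        for beta in range(k):           # block-column index (independent tasks)
            for i in range(alpha * step, (alpha + 1) * step):
                for j in range(beta * step, (beta + 1) * step):
                    result[i] += M[i][j] * v[j]
    return result

M = [[1, 0, 0, 2], [0, 1, 0, 0], [3, 0, 1, 0], [0, 0, 0, 1]]
v = [1.0, 2.0, 3.0, 4.0]
out = block_multiply(M, v, 2)           # equals the plain product M @ v
```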

Page 7: Mining hot spot and related hot spots over web links

MapReduce for the program at left

• The Hadoop program iterates 75 times (“For the web itself, 50-75 iterations are sufficient to converge to within the error limits of double precision”).
• It uses 0.85 as the probability of following the web links and 0.15 as the probability of teleporting.

• The program has a for-loop structure, each iteration of which runs 4 MapReduce jobs.
• The first 2 MR jobs multiply the matrix by the vector.
• The 3rd MR job calculates the sum of the product vector.
• And the final MR job does the shifting.
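One iteration of this loop can be sketched as follows, a single-process Python stand-in for the 4 MapReduce jobs (the edge list and the starting vector are made up for illustration):

```python
# Hedged sketch of one loop iteration: jobs 1-2 multiply the sparse matrix by
# the vector with damping 0.85, job 3 sums the product vector, and job 4 does
# the "shift", adding the missing probability mass uniformly.

def one_iteration(edges, v, n, beta=0.85):
    """edges: list of (src, dst) links; v: current PageRank vector as a list."""
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    # jobs 1-2: matrix times vector, scaled by the follow-the-link probability
    prod = [0.0] * n
    for src, dst in edges:
        prod[dst] += beta * v[src] / out_degree[src]
    # job 3: sum of the product vector (< 1 because of taxation / dead ends)
    s = sum(prod)
    # job 4: the shift, spreading the leaked probability uniformly
    return [p + (1.0 - s) / n for p in prod]

v = one_iteration([(0, 1), (1, 0), (1, 2), (2, 0)], [1/3, 1/3, 1/3], 3)
```

After the shift the vector sums back to 1, which is what the 3rd and 4th jobs guarantee on Hadoop.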

For loop iterations:

Page 8: Mining hot spot and related hot spots over web links

Matrix multiply Vector

• 1st mapper: emits every matrix entry and every vector component keyed by its partition (group) number, so that each matrix block and the vector stripe it multiplies arrive at the same reducer.

• 1st reducer: gets one partition as input, i.e. a matrix block together with its matching vector stripe, and outputs the partial products keyed by row.

• 2nd mapper: pass-through.

• 2nd reducer: gets all the partial products of one row as input and outputs their sum, one component of the product vector.
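The essential dataflow of the two jobs can be simulated in plain Python (a hedged sketch: function names and the toy data are assumptions, and the block partitioning is omitted for brevity). Job 1 turns each sparse-matrix entry into a partial product keyed by row; job 2 sums per row.

```python
# Hedged single-process simulation of the two-job MapReduce matrix-vector
# multiply: map emits (row, m_ij * v_j); the shuffle groups by row; reduce sums.

from collections import defaultdict

def job1_map(entries, v):
    """entries: (i, j, m_ij) triples of the sparse matrix."""
    for i, j, m_ij in entries:
        yield i, m_ij * v[j]            # partial product for row i

def job2_reduce(pairs):
    sums = defaultdict(float)
    for i, x in pairs:                  # grouping by key i, then summing
        sums[i] += x
    return dict(sums)

entries = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]
v = [1.0, 1.0, 2.0]
result = job2_reduce(job1_map(entries, v))   # the nonzero rows of M @ v
```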

Page 9: Mining hot spot and related hot spots over web links

The PageRank result is compared with Python
• A Python program was written to verify the Hadoop result:

Page 10: Mining hot spot and related hot spots over web links

Pagerank results

• The largest 1/9 of web pages contain 60% of the PageRank importance over the whole dataset.

• The X axis is the page index sorted by PageRank value.

Page 11: Mining hot spot and related hot spots over web links

K hottest spots
A Top-K Hadoop program was written to find the K hottest spots:
• 1st column is the index used in computation;
• 2nd column is the web node ID within the original data;
• 3rd column is the PageRank value.
The table at right shows the top 15 PageRank values.

Array Index | Web Node ID | PageRank
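The Top-K selection itself can be sketched in plain Python with a heap. This is an illustrative stand-in for the Hadoop program; the row values below are made up.

```python
# Hedged sketch of Top-K selection: keep the K rows with the largest PageRank.
# In the MapReduce version each mapper would emit its local top K and a single
# reducer would merge them; here heapq does the same selection in one process.

import heapq

def top_k(rows, k):
    """rows: iterable of (array_index, web_node_id, pagerank) tuples."""
    return heapq.nlargest(k, rows, key=lambda r: r[2])

rows = [(0, 'n17', 0.002), (1, 'n42', 0.009), (2, 'n99', 0.001), (3, 'n7', 0.005)]
best = top_k(rows, 2)        # the two hottest hypothetical nodes
```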

Page 12: Mining hot spot and related hot spots over web links

Some online hot spots are related
• These pages can be frequently linked together by other pages.

Page 13: Mining hot spot and related hot spots over web links

How do Frequent Itemsets come in?
• If they’re frequently linked together by other web pages, then the frequent-itemset technique can be used to discover them.
• This idea is known as “trawling”, introduced as a method to find complete bipartite graphs.

• However, the key to using classic frequent-itemset methods is that the support threshold must be large enough that the number of verified frequent itemsets is not too large.
• Otherwise A-Priori, PCY and the other methods cannot benefit much from monotonicity.
• That’s why we try to find related hot spots, rather than mining a lot of small cliques.
• When you want to find a lot of small cliques, where the threshold can be small, other frequent-itemset techniques should be used.

Page 14: Mining hot spot and related hot spots over web links

A-Priori Mapper
Assume the length-k frequent itemsets have already been read through the Distributed Cache by all mappers and stored in L(k); now you want to find L(k+1). (If k > 1, also read in L(1).)
• Mapper input: {i, [ … ]}, a basket of the nodes that are linked to by node i.
• If k > 1, remove items from the basket that are not in L(1).
• If length(basket) < k+1: exit.
• Form the set of itemsets from L(k) that are contained in the updated basket, and call it myLocalLK.
• Join pairs from myLocalLK into candidates:

  L = length(myLocalLK); Ck1 = set()
  for each pair i < j:
      x = myLocalLK[i]; y = myLocalLK[j]
      union = x | y
      if length(union) == k+1: Ck1.add(union)

• For aSet in Ck1:
      if the basket contains aSet: Output(sort(list(aSet)), 1)

How to form myLocalLK will be discussed later.
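The mapper above can be rendered as runnable Python (a sketch, not the actual Hadoop code; L_k and L_1 are passed in directly instead of via the Distributed Cache, and a `seen` set skips duplicate unions):

```python
# Hedged, runnable rendering of the A-Priori mapper. L_k holds the length-k
# frequent itemsets as frozensets; the mapper emits (sorted candidate, 1) pairs.

from itertools import combinations

def apriori_map(basket, L_k, L_1, k):
    items = set(basket) & L_1 if k > 1 else set(basket)  # prune infrequent items
    if len(items) < k + 1:
        return []                                # basket too small to help
    my_local_lk = [s for s in L_k if s <= items]  # frequent k-sets in the basket
    out, seen = [], set()
    for x, y in combinations(my_local_lk, 2):     # join pairs of k-sets
        union = x | y
        if len(union) == k + 1 and union not in seen:
            seen.add(union)
            # the basket automatically contains union, since x, y are subsets
            out.append((tuple(sorted(union)), 1))
    return out

L_2 = [frozenset('ab'), frozenset('bc'), frozenset('ac')]
L_1 = set('abc')
pairs = apriori_map(['a', 'b', 'c', 'x'], L_2, L_1, 2)
```

Here 'x' is pruned away by L(1), and the three length-2 sets join into the single length-3 candidate (a, b, c).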

Page 15: Mining hot spot and related hot spots over web links

A-Priori Reducer

• For the initial point k == 0, another mapper that simply counts each item needs to be written separately.

• Reducer input: {key, [a, b, c, … ]};
• Take the sum of the list as t;
• If t >= s (the support threshold): Output {key, t}.
• The combiner is the same as the reducer, except it only takes the sum and does not filter out infrequent itemsets.
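A minimal Python sketch of the reducer logic (names are illustrative; s is the support threshold from the slide):

```python
# Hedged sketch of the A-Priori reducer: sum the counts for one candidate
# itemset and keep it only if it meets the support threshold s. The combiner
# would be the same function without the threshold filter.

def apriori_reduce(key, counts, s):
    t = sum(counts)                      # total occurrences of this itemset
    return (key, t) if t >= s else None  # None: filtered out as infrequent

out = apriori_reduce(('a', 'b'), [3, 4, 5], 10)   # 12 >= 10, kept
dropped = apriori_reduce(('a', 'c'), [2, 3], 10)  # 5 < 10, filtered out
```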

Page 16: Mining hot spot and related hot spots over web links

Form the set of itemsets from L(k) that are contained in the updated basket, and call it myLocalLK.

• When the threshold is small, the size of L(k) can be very large, and the job above becomes difficult.
• We provide 2 ways:
• (1) Go through every itemset in L(k) and add the ones that are subsets of our basket;
• (2) Generate all size-k combinations from the basket and add the ones that live in L(k).

• m = length(basket); myLocalLK = set()
• If L(k) is the smaller side:
      for aSet in L(k): if basket.containsAll(aSet): myLocalLK.add(aSet)
• Else:
      myCom = getAllCombines(basket, k)
      for aSet in myCom: if aSet in L(k): myLocalLK.add(aSet)

Things get bad when both the size of L(k) and the basket are large!
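Both strategies, with a size comparison to choose between them, can be sketched in Python (a hedged sketch; using C(m, k) as the switch is my reading of the trade-off described above):

```python
# Hedged sketch of forming myLocalLK. Strategy 1 scans L(k); its cost grows
# with |L(k)|. Strategy 2 enumerates the C(m, k) subsets of the basket; its
# cost grows with the basket size m. Pick whichever side is smaller.

from itertools import combinations
from math import comb

def local_lk(basket, L_k, k):
    basket = set(basket)
    if len(L_k) <= comb(len(basket), k):
        # (1) go through L(k), keep itemsets that are subsets of the basket
        return {s for s in L_k if s <= basket}
    # (2) generate all size-k combinations of the basket, keep those in L(k)
    return {frozenset(c) for c in combinations(basket, k) if frozenset(c) in L_k}

L_2 = {frozenset('ab'), frozenset('cd')}
found = local_lk('abc', L_2, 2)     # only {a, b} lies inside the basket
```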

Page 17: Mining hot spot and related hot spots over web links

We also tried some small-threshold cases
• Working on pseudo-distributed Hadoop:
• when the threshold is 20, we cannot generate itemsets longer than L3;
• when the threshold is 100, L5 is the largest we can get within one or two hours.
• Things are relieved when you have a large computing cluster.

• However, if your goal is to find small cliques, itemsets no longer than 6 may be enough for you.
• You don’t really need to run the A-Priori to the very end.

Page 18: Mining hot spot and related hot spots over web links

For the results of A-Priori

• For finding related hot spots, the threshold should be large.
• Here we use a threshold of 500; only 314 frequent singletons are found.
• We believe the frequently linked web pages should also have large PageRank values.

Page 19: Mining hot spot and related hot spots over web links

Average PageRank of frequent itemsets
• We show the number of frequent itemsets found by our A-Priori, together with their average PageRank.
• For example, L6 has 80 itemsets, so the output is the average of the 480 (6 × 80) PageRank values.
• Let’s say L2 contains {(a,b), (b,c), (a,c), (a,d)}, L3 contains only {(a,b,c)}, and L2 has a larger average PageRank than L3. What does that represent?

• It means some frequently linked nodes (d), even though not as widely related with other nodes as (b & c), have larger PageRank than those widely related nodes.
• Maybe it is because a page linking to (a, b, c) shares only 1/3 of its PageRank with each of them, while a page linking to (a, d) shares 1/2 of its PageRank with each of a and d.

Page 20: Mining hot spot and related hot spots over web links

Reference
• 1. Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman.
• 2. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. 1999. Trawling the Web for emerging cyber-communities. In Proceedings of the Eighth International Conference on World Wide Web (WWW ’99), Philip H. Enslow, Jr. (Ed.). Elsevier North-Holland, Inc., New York, NY, USA, 1481-1493.
• 3. J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6(1), 29-123, 2009.
• 4. Google Programming Contest, 2002.