Distributed Frequent Pattern Mining · ECE750-T11 Presentation G8

Md. Akhter Hosen Babu (20399173), Farhan Mohammad Reza (20397774), Sabbeer Ahmed Abeer (20387997)


Contents
• Terminologies
• Motivation
• Problem definition
• What we propose
• Components and architecture
• Client components
• Implementation approach
• Results
• Usage
• References


Terminologies

• Data mining
  Data mining is the process of extracting hidden patterns from data. It identifies trends within data that go beyond simple data analysis.

• Scope of data mining
  Data streams
  Advanced database systems (e.g. relational databases)
  Transactional databases
  Flat files, etc.

• Frequent pattern
  Patterns (such as itemsets, subsequences, or structures) that appear frequently in a data set. For example, a set of items, such as computer and printer, that appear together frequently in a transaction data set is a frequent itemset.


Motivation

• The motivation for working with frequent pattern mining is to make huge amounts of transaction data more useful.

• This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.

• The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.


Problem definition

• For mining frequent patterns we chose a well-known algorithm, the “Apriori algorithm”.
• For example, suppose we have the following information:
  Computer => antivirus software [support = 2%]
• This means that computers and antivirus software are sold together in only 2% of all transactions.
• Apriori calls an itemset frequent if it has at least the minimum support, or min-sup (e.g. 60%).
• Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
• For a low min-sup and a large number of items, there is a huge number of item combinations, which can take hours or even days to mine by running the algorithm on a single machine.
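The definitions above (support, min-sup, and the Apriori property) can be illustrated with a minimal single-machine sketch in Python. This is not the presenters' implementation: the function names `support` and `apriori` and the sample baskets are hypothetical, chosen only to show how the level-wise search prunes candidates.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def apriori(transactions, min_sup):
    """Return {itemset: support} for every itemset with support >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    # Level 1: frequent single items.
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c, transactions) >= min_sup}
    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = support(c, transactions)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c, transactions) >= min_sup}
    return frequent


# Hypothetical sample data, for illustration only.
baskets = [
    {"computer", "printer", "antivirus"},
    {"computer", "printer"},
    {"computer", "antivirus"},
    {"printer"},
    {"computer", "printer"},
]
result = apriori(baskets, min_sup=0.6)
print(sorted(sorted(s) for s in result))
# [['computer'], ['computer', 'printer'], ['printer']]
```

With these baskets, antivirus appears in only 2 of 5 transactions (40%), so it is dropped at level 1 and no itemset containing it is ever counted. That pruning is what keeps the candidate space manageable, and it becomes less effective as min-sup is lowered.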


What we propose

• We intend to run the Apriori algorithm in a distributed environment to find frequent patterns in data.
• This can be done by setting up a grid and running the algorithm on it.
• But we have two objectives:
  Use the resources available on a local network.
  Make this implementation a genuinely useful and easy-to-use application for any business manager, to assist them in making crucial business decisions.
• So we propose an application that provides users with frequent itemsets by mining all their transactions, and that uses the available local resources efficiently to finish this task faster.


Components and Architecture

[Architecture diagram: components <<Subnet Pool Builder>>, <<Sink>> {EventClass=True}, <<Splitter>>, <<Client Distributor>>, <<Merger>> and <<Output Receiver>>, joined by <<delegate>> connectors, with clients Client1, Client2, Client3, ... Client n]


Client components

• Info Provider
• Input Receiver
• Apriori Process


Implementation approach

• Resource allocation
  Scan to find all available subnets.
  Gather resource information from the available subnets.
  Award each client a weight according to its performance.
  Sort the client list by awarded weight.
  Create the subnet pool based on a threshold weight.
  Publish this information to the components that have subscribed to it.
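The weighting, sorting, thresholding, and publishing steps of resource allocation might look roughly like the Python sketch below. The slides do not say how weights are computed or how subscribers are notified, so `ClientInfo`, the scoring formula in `weight`, and `PoolPublisher` are all hypothetical. The network scan is omitted, and the resource information is assumed to have been gathered already.

```python
from dataclasses import dataclass


@dataclass
class ClientInfo:
    """Resource information reported by one client (hypothetical fields)."""
    address: str
    cpu_ghz: float
    free_ram_gb: float


def weight(info):
    # Hypothetical scoring formula; the slides do not define the real one.
    return info.cpu_ghz * 10 + info.free_ram_gb


def build_pool(clients, threshold):
    """Sort clients by weight, best first, and keep those at or above threshold."""
    ranked = sorted(clients, key=weight, reverse=True)
    return [(c.address, weight(c)) for c in ranked if weight(c) >= threshold]


class PoolPublisher:
    """Minimal publish/subscribe: each subscriber is a callable taking the pool."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, pool):
        for callback in self._subscribers:
            callback(pool)


# Hypothetical clients, for illustration only.
clients = [
    ClientInfo("10.0.0.2", cpu_ghz=2.0, free_ram_gb=4.0),
    ClientInfo("10.0.0.3", cpu_ghz=3.0, free_ram_gb=8.0),
    ClientInfo("10.0.0.4", cpu_ghz=1.0, free_ram_gb=1.0),
]
pool = build_pool(clients, threshold=20)

publisher = PoolPublisher()
publisher.subscribe(print)  # e.g. a work distributor would subscribe here
publisher.publish(pool)
```

Keeping the weight next to each address in the pool lets a later stage size each client's share of the work in proportion to its capacity.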


Implementation approach (cont.)

• Work distribution
  Generate specific inputs for each client.
  Send these inputs to the clients for processing.
• Run Apriori
  The Apriori algorithm residing on each client is executed on the inputs it received.
• Return output
  Return the generated output from the clients to the server.
• Merge final result
  The server waits for output files from all clients.
  It collects all the output.
  It merges the output into a single output file.
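The slides do not specify how the per-client inputs are generated or how outputs are merged. One common scheme, assumed here purely for illustration, is count distribution: the server splits the transactions among clients, each client counts the candidate itemsets in its own slice, and the server sums the counts. The Python sketch below simulates one such round in a single process; `split_by_weight`, `client_count`, and `merge_counts` are hypothetical names, and no networking or file transfer is shown.

```python
from collections import Counter


def split_by_weight(transactions, weights):
    """Give each client a contiguous slice sized in proportion to its weight."""
    total = sum(weights)
    slices, start, acc = [], 0, 0
    for w in weights:
        acc += w
        end = round(len(transactions) * acc / total)
        slices.append(transactions[start:end])
        start = end
    return slices


def client_count(candidates, local_transactions):
    """Runs on a client: count local transactions containing each candidate."""
    return Counter(
        {c: sum(1 for t in local_transactions if c <= t) for c in candidates}
    )


def merge_counts(outputs, n_transactions, min_sup):
    """Runs on the server: sum client counts and keep the frequent candidates."""
    total = Counter()
    for out in outputs:
        total.update(out)  # Counter.update adds counts together
    return {
        c: n / n_transactions
        for c, n in total.items()
        if n / n_transactions >= min_sup
    }


# Hypothetical data: five transactions, two clients weighted 3:2.
transactions = [
    frozenset({"computer", "printer", "antivirus"}),
    frozenset({"computer", "printer"}),
    frozenset({"computer", "antivirus"}),
    frozenset({"printer"}),
    frozenset({"computer", "printer"}),
]
candidates = [
    frozenset({"computer"}),
    frozenset({"computer", "printer"}),
    frozenset({"antivirus"}),
]
slices = split_by_weight(transactions, weights=[3, 2])
outputs = [client_count(candidates, s) for s in slices]
merged = merge_counts(outputs, len(transactions), min_sup=0.6)
```

Because support counts are additive across disjoint slices, the merged supports equal those a single machine would compute over the full data set.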


Results

• The graph compares the running time of the Apriori implementation on a single machine and in the distributed environment for different input sets.

[Figure: time (vertical axis, 0 to 70) against input sets 1 to 5 (horizontal axis), for the single-machine and distributed runs]


Usage

• In the current implementation, the user can view all the itemsets that meet their minimum support requirement.

• In the next phase of implementation, we intend to enable users to use their own itemsets.


References

• Jiawei Han, Micheline Kamber, Harcourt, “Data Mining: Concepts and Techniques”, 2nd Edition.

• R. Sumithra, Dr. Sujni Paul, “Using distributed apriori association rule and classical apriori mining algorithms for grid based knowledge discovery”, Second International Conference on Computing, Communication and Networking Technologies, 2010.

• M. A. Mottalib, Kazi Shamsul Arefin, Mohammad Majharul Islam, Md. Arif Rahman, and Sabbeer Ahmed Abeer, “Performance Analysis of Distributed Association Rule Mining with Apriori Algorithm”, International Journal of Computer Theory and Engineering, Vol. 3, No. 4, August 2011.


Questions?
