Distributed Frequent Pattern Mining
ECE750-T11 Presentation G8
Md. Akhter Hosen Babu (20399173)
Farhan Mohammad Reza (20397774)
Sabbeer Ahmed Abeer (20387997)
Contents
• Terminologies
• Motivation
• Problem definition
• What we propose
• Components and architecture
• Client components
• Implementation approach
• Results
• Usage
• References
Terminologies
• Data mining
Data mining is the process of extracting hidden patterns from data. It identifies trends within data that go beyond simple data analysis.
• Scope of data mining
Data streams
Advanced database systems (e.g. relational databases)
Transactional databases
Flat files, etc.
• Frequent pattern
Patterns (such as itemsets, subsequences, or structures) that appear frequently in a data set. For example, a set of items, such as computer and printer, that appear together frequently in a transaction data set is a frequent itemset.
Motivation
• The motivation for working with frequent pattern mining is to make huge amounts of transaction data more useful.
• This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.
• The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.
Problem definition
• For mining frequent patterns we chose a well-known algorithm, the “Apriori algorithm”.
• For example, suppose we have the following information:
Computer => antivirus software [support = 2%]
• This means that 2% of all transactions contain both a computer and antivirus software.
• Apriori calls an itemset frequent if its support is at least the minimum support, or min-sup (e.g. 60%).
• Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
• For a low min-sup and a large number of items, there will be a huge number of item combinations, which can take hours or even days to mine by running the algorithm on a single machine.
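The definitions above (support, min-sup, and the Apriori property) can be illustrated with a minimal single-machine sketch. This is not the presenters' implementation; the function name `apriori` and the toy transactions are assumptions made up for illustration.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support} for every itemset with support >= min_sup.

    transactions: list of sets of items; min_sup: fraction in (0, 1].
    """
    n = len(transactions)
    # Level 1: every single item is a candidate.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): drop any candidate that has an
        # infrequent (k-1)-item subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Toy data, invented for this example.
transactions = [
    {"computer", "printer", "antivirus"},
    {"computer", "antivirus"},
    {"computer", "printer"},
    {"printer", "paper"},
    {"computer", "printer", "antivirus"},
]
result = apriori(transactions, min_sup=0.6)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

With min-sup at 60%, the pair {printer, antivirus} falls below the threshold, so the three-item candidate is pruned without ever being counted. That pruning is what keeps the candidate space manageable, and it is also why a low min-sup makes the search so expensive.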
What we propose
• We intend to run the Apriori algorithm in a distributed environment to find frequent patterns in data.
• This can be done by setting up a grid and running the algorithm on it.
• But we have two objectives:
Use available resources on a local network.
Make this implementation a genuinely useful and easy-to-use application for any business manager, to assist them in making crucial business decisions.
• So we propose an application that provides users with frequent itemsets by mining all their transactions, and that uses the available local resources efficiently to finish this task faster.
Components and Architecture
[Architecture diagram. Components: <<Subnet Pool Builder>>, <<Sink>> {EventClass=True}, <<Splitter>>, <<Client Distributor>>, Client1, Client2, Client3, ..., Client n, <<Merger>>, <<Output Receiver>>. Connectors are labelled <<delegate>>.]
Client components
• Info Provider
• Input Receiver
• Apriori Process
Implementation approach
• Resource allocation
Scan to find all available subnets.
Gather resource information for the available subnets.
Award a weight to each client according to its performance.
Sort the client list by awarded weight.
Create the subnet pool based on a threshold weight.
Publish this information to the components that have subscribed to it.
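The resource allocation steps can be sketched roughly as follows. The slides do not say which resource figures are collected or how weights are computed, so the `ClientInfo` fields, the scoring rule in `weight`, and the callback-based `publish` are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClientInfo:
    host: str           # hypothetical identifier for a machine on the subnet
    cpu_ghz: float      # assumed resource figure reported by the client
    free_ram_gb: float  # assumed resource figure reported by the client


def weight(info):
    # Assumed scoring rule: a simple weighted sum of CPU speed and free RAM.
    return 0.7 * info.cpu_ghz + 0.3 * info.free_ram_gb


def build_pool(clients, threshold):
    """Weight clients, sort them best-first, keep those at or above threshold."""
    scored = sorted(((weight(c), c.host) for c in clients), reverse=True)
    return [(host, w) for w, host in scored if w >= threshold]


def publish(pool, subscribers):
    # Hand the finished pool to every subscribed component (plain callbacks here).
    for callback in subscribers:
        callback(pool)


# Invented figures for three clients.
clients = [
    ClientInfo("B", cpu_ghz=2.0, free_ram_gb=2.0),
    ClientInfo("C", cpu_ghz=1.0, free_ram_gb=1.0),
    ClientInfo("A", cpu_ghz=3.0, free_ram_gb=4.0),
]
received = []
pool = build_pool(clients, threshold=1.5)
publish(pool, [received.append])
print([host for host, _ in pool])
```

Here client C scores below the threshold and is left out of the pool, while A is ranked ahead of B.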
Implementation approach (cont.)
• Work distribution
Generate specific inputs for each client.
Send these inputs to the clients for processing.
• Run Apriori
The Apriori algorithm residing on each client is executed on the inputs it receives.
• Return output
Return the generated output from the clients to the server.
• Merge final result
The server waits for output files from all clients.
It collects all output.
It merges the output into a single output file.
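The distribute, run, return, and merge steps can be sketched in-process as below. The slides do not specify how the per-client inputs are generated, so this sketch assumes one common scheme (count distribution): the server splits the transactions among clients in proportion to their weights, each client counts the candidate itemsets in its own chunk, and the server sums the partial counts. All function names are hypothetical, and network transfer and output files are replaced by plain function calls.

```python
from collections import Counter


def split_by_weight(transactions, weights):
    """Cut the transaction list into one contiguous chunk per client,
    sized in proportion to that client's weight (assumed policy)."""
    total = sum(weights)
    chunks, start, acc = [], 0, 0
    for w in weights:
        acc += w
        end = round(len(transactions) * acc / total)
        chunks.append(transactions[start:end])
        start = end
    return chunks


def client_count(chunk, candidates):
    """Work done on one client: count candidate occurrences in its chunk."""
    return Counter({c: sum(1 for t in chunk if c <= t) for c in candidates})


def merge_counts(partials):
    """Work done on the server: add up the per-client counts."""
    merged = Counter()
    for p in partials:
        merged.update(p)
    return merged


# Toy data, invented for this example: six transactions, two clients.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a", "b"}, {"c"}]
candidates = [frozenset({"a"}), frozenset({"a", "b"})]
chunks = split_by_weight(transactions, weights=[2, 1])
partials = [client_count(chunk, candidates) for chunk in chunks]
merged = merge_counts(partials)
print([len(c) for c in chunks], dict(merged))
```

Because each transaction lands in exactly one chunk, the summed counts equal the counts a single machine would obtain over the whole data set.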
Results
• The graph compares the running time of the Apriori implementation on a single machine with its running time in the distributed environment, for different input sets.
[Chart: time (axis range 0 to 70) for input sets 1 to 5]
Usage
• Currently, in our implementation, the user can view all the itemsets that meet their minimum support requirement.
• In the next phase of implementation we intend to let users supply their own itemsets.
References
• Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”, 2nd Edition, Harcourt.
• R. Sumithra, Dr. Sujni Paul, “Using distributed apriori association rule and classical apriori mining algorithms for grid based knowledge discovery”, Second International Conference on Computing, Communication and Networking Technologies, 2010.
• M. A. Mottalib, Kazi Shamsul Arefin, Mohammad Majharul Islam, Md. Arif Rahman, and Sabbeer Ahmed Abeer, “Performance Analysis of Distributed Association Rule Mining with Apriori Algorithm”, International Journal of Computer Theory and Engineering, Vol. 3, No. 4, August 2011.
Questions?