Distributed Frequent Pattern Mining

ECE750-T11 Presentation G8

Md. Akhter Hosen Babu (20399173)
Farhan Mohammad Reza (20397774)
Sabbeer Ahmed Abeer (20387997)

Contents

• Terminologies
• Motivation
• Problem definition
• What we propose
• Components and architecture
• Client components
• Implementation approach
• Results
• Usage
• References


Terminologies

• Data mining
  Data mining is the process of extracting hidden patterns from data. It identifies trends within data that go beyond simple data analysis.

• Scope of data mining
  • Data streams
  • Advanced database systems (e.g. relational databases)
  • Transactional databases
  • Flat files, etc.

• Frequent pattern
  A pattern (such as an itemset, subsequence, or structure) that appears frequently in a data set. For example, a set of items, such as computer and printer, that frequently appear together in a transaction data set is a frequent itemset.
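How often an itemset "appears together" can be measured as the fraction of transactions that contain all of its items. The following is a minimal sketch of that count; the function name `relative_frequency` and the transaction data are invented here for illustration and do not come from the presentation.

```python
def relative_frequency(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Hypothetical transaction data, invented for illustration only.
transactions = [
    {"computer", "printer", "paper"},
    {"computer", "printer"},
    {"computer", "antivirus"},
    {"printer", "paper"},
    {"computer", "printer", "antivirus"},
]

# {computer, printer} occurs in 3 of the 5 transactions.
print(relative_frequency({"computer", "printer"}, transactions))
```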


Motivation

• The motivation for working with frequent pattern mining is to make huge amounts of transaction data more useful.

• This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.

• The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.


Problem definition

• To mine frequent patterns we chose a well-known algorithm, the “Apriori algorithm”.
• For example, suppose we have the following information:
  Computer => antivirus software [support = 2%]
• This means that computers and antivirus software are sold together in only 2% of all transactions.
• Apriori calls an itemset frequent if it meets the minimum support threshold (min-sup), e.g. 60%.
• Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
• For a low min-sup and a large number of items, there is a huge number of item combinations, which can take hours or even days to mine when the algorithm runs on a single machine.
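The level-wise search and the use of the Apriori property for pruning can be sketched as follows. This is a minimal single-machine illustration, not the implementation used in the project; the function name `apriori`, its signature, and the sample data `demo` are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support} for every itemset with support >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_only(candidates):
        # Count each candidate with one pass over the data and keep the frequent ones.
        result = {}
        for c in candidates:
            sup = sum(1 for t in transactions if c <= t) / n
            if sup >= min_sup:
                result[c] = sup
        return result

    items = {item for t in transactions for item in t}
    level = frequent_only({frozenset([i]) for i in items})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent_only(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical transactions, invented for illustration only.
demo = [
    {"computer", "printer", "antivirus"},
    {"computer", "printer"},
    {"computer", "antivirus"},
    {"computer", "printer", "paper"},
    {"printer", "paper"},
]

for itemset, sup in sorted(apriori(demo, 0.6).items(), key=lambda p: sorted(p[0])):
    print(sorted(itemset), sup)
```

With min-sup = 60%, only {computer}, {printer}, and {computer, printer} survive; no candidate containing "antivirus" or "paper" is generated beyond the first level, because those single items are already infrequent.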


What we propose

• We intend to run the Apriori algorithm in a distributed environment to find frequent patterns in data.
• This can be done by setting up a grid and running the algorithm on it.
• But we have two objectives:
  • Use the resources available on a local network.
  • Make this implementation a genuinely useful, easy-to-use application for any business manager, to assist them in making crucial business decisions.
• So we propose an application that provides users with frequent itemsets by mining all their transactions, and that uses the available local resources efficiently to finish this task faster.


Components and Architecture

[Figure: component diagram showing «Subnet Pool Builder», «Sink» {EventClass=True}, «Splitter», «Client Distributor», Client1, Client2, Client3, ..., Client n, «Merger», and «Output Receiver», connected through «delegate» connectors.]


Client components

• Info Provider
• Input Receiver
• Apriori Process


Implementation approach

• Resource allocation
  • Scan to find all available subnets.
  • Gather resource information for the available subnets.
  • Assign each client a weight based on its performance.
  • Sort the client list by assigned weight.
  • Create the subnet pool based on a threshold weight.
  • Publish this information to the components that have subscribed to it.
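The weighting, sorting, thresholding, and publishing steps of resource allocation can be sketched as below. The slides do not give the scoring formula or the resource fields, so `ClientInfo`, `weight`, `build_pool`, and `PoolPublisher` are all hypothetical names, the formula is an arbitrary placeholder, and the network scan is replaced by a hard-coded client list.

```python
from dataclasses import dataclass


@dataclass
class ClientInfo:
    # Hypothetical resource fields; the presentation does not list what is gathered.
    name: str
    cpu_ghz: float
    free_ram_gb: float


def weight(client):
    # Placeholder scoring formula, assumed for illustration only.
    return client.cpu_ghz * 2.0 + client.free_ram_gb


def build_pool(clients, threshold):
    """Score clients, sort by weight (highest first), keep those at or above the threshold."""
    scored = sorted(((weight(c), c) for c in clients), key=lambda p: p[0], reverse=True)
    return [(c.name, w) for w, c in scored if w >= threshold]


class PoolPublisher:
    """Minimal publish/subscribe helper: every subscriber is called with the new pool."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, pool):
        for callback in self._subscribers:
            callback(pool)


# Stand-in for the scan and info-gathering steps.
clients = [
    ClientInfo("client1", cpu_ghz=2.0, free_ram_gb=4.0),
    ClientInfo("client2", cpu_ghz=3.0, free_ram_gb=8.0),
    ClientInfo("client3", cpu_ghz=1.0, free_ram_gb=1.0),
]

publisher = PoolPublisher()
publisher.subscribe(lambda pool: print("pool updated:", pool))
publisher.publish(build_pool(clients, threshold=5.0))
```

Here client3 falls below the threshold and is left out of the pool, while client2 is ranked ahead of client1.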


Implementation approach (cont.)

• Work distribution
  • Generate specific inputs for each client.
  • Send these inputs to the clients for processing.

• Run Apriori
  • Based on the received inputs, the Apriori algorithm residing on each client is executed.

• Return output
  • Return the generated output from the clients to the server.

• Merge final result
  • The server waits for output files from all clients.
  • It collects all the output.
  • It merges the output into a single output file.
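The distribute, run, return, and merge steps above can be sketched on one machine as follows. The slides do not say how client inputs are generated or how outputs are merged, so this assumes one common scheme: transactions are partitioned in proportion to client weights, each client counts the same candidate itemsets locally, and the server sums the counts. The names `split_by_weight`, `client_count`, and `merge` are hypothetical, clients are simulated by plain function calls, and only a single counting pass is shown.

```python
from collections import Counter


def split_by_weight(transactions, weights):
    """Partition transactions into contiguous chunks sized in proportion to client weights."""
    total = sum(weights.values())
    names = list(weights)
    chunks, start, cumulative = {}, 0, 0.0
    for i, name in enumerate(names):
        cumulative += weights[name]
        # The last client takes whatever remains, so no transaction is dropped.
        if i == len(names) - 1:
            end = len(transactions)
        else:
            end = round(len(transactions) * cumulative / total)
        chunks[name] = transactions[start:end]
        start = end
    return chunks


def client_count(chunk, candidates):
    """Work done on one client: count local transactions containing each candidate."""
    counts = Counter()
    for t in chunk:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts


def merge(outputs, n_transactions, min_sup):
    """Server side: sum the per-client counts and keep itemsets meeting min_sup."""
    total = Counter()
    for out in outputs:
        total.update(out)
    return {c: n / n_transactions for c, n in total.items()
            if n / n_transactions >= min_sup}


# Hypothetical data and weights, invented for illustration only.
data = [frozenset(t) for t in (
    {"computer", "printer", "antivirus"},
    {"computer", "printer"},
    {"computer", "antivirus"},
    {"computer", "printer", "paper"},
    {"printer", "paper"},
)]
pool_weights = {"client2": 14.0, "client1": 8.0}
cands = [frozenset({"computer"}), frozenset({"printer"}),
         frozenset({"computer", "printer"}), frozenset({"paper"})]

chunks = split_by_weight(data, pool_weights)
outputs = [client_count(chunk, cands) for chunk in chunks.values()]
merged = merge(outputs, len(data), min_sup=0.6)
print({tuple(sorted(c)): s for c, s in merged.items()})
```

Summing raw counts rather than per-client supports keeps the merged result identical to what a single machine would compute over the full data set.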


Results

• The graph compares the Apriori implementation on a single machine with the implementation in the distributed environment for different input sets.

[Figure: chart of time (vertical axis, 0 to 70) against input sets 1 to 5, with two series, both labeled "time" in the legend.]


Usage

• In our current implementation, the user can view all the itemsets that meet their minimum support requirement.

• In the next phase of implementation, we intend to let users use their own itemsets.


References

• Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2nd Edition, Harcourt.

• R. Sumithra and Dr. Sujni Pau, “Using distributed apriori association rule and classical apriori mining algorithms for grid based knowledge discovery”, Second International Conference on Computing, Communication and Networking Technologies, 2010.

• M. A. Mottalib, Kazi Shamsul Arefin, Mohammad Majharul Islam, Md. Arif Rahman, and Sabbeer Ahmed Abeer, “Performance Analysis of Distributed Association Rule Mining with Apriori Algorithm”, International Journal of Computer Theory and Engineering, Vol. 3, No. 4, August 2011.


Questions?

