Parallel Mining of Closed Sequential Patterns
Shengnan Cong, Jiawei Han, David Padua
Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, Illinois, USA, 2005
Advisor: Jia-Ling Koh
Speaker: Chun-Wei Hsieh
Introduction
Numerous applications:
– DNA sequences, analysis of web logs, customer shopping sequences, XML query access patterns…
Closed sequential patterns
– retain all the information of the complete set of frequent sequential patterns
– are more compact
Many applications are time-critical and involve huge volumes of data.
Sequential Algorithm-BIDE
Step 1: Identify the frequent 1-sequences
Step 2: Project the dataset along each frequent 1-sequence
Step 3: Mine each resulting projected dataset
Sequential Algorithm-BIDE
The projected dataset for sequence AB is {C, CB, C, BCA}.
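The projection above can be reproduced with a short Python sketch. The four-sequence dataset is an assumption, chosen so that the result matches the slide. `project` and `projected_dataset` are hypothetical helper names. Each sequence is treated as a string of single items, the prefix is matched at its earliest occurrence as a subsequence, and the remainder of the sequence is kept.

```python
def project(sequence, prefix):
    """Return the suffix of `sequence` after the earliest match of `prefix`.

    `prefix` is matched as a subsequence (items need not be adjacent).
    Returns None when the sequence does not contain the prefix.
    """
    pos = 0
    for item in prefix:
        idx = sequence.find(item, pos)
        if idx == -1:
            return None
        pos = idx + 1
    return sequence[pos:]


def projected_dataset(dataset, prefix):
    """Project every sequence that contains `prefix`; skip the others."""
    suffixes = (project(seq, prefix) for seq in dataset)
    return [s for s in suffixes if s is not None]


# Assumed example dataset (not stated in the slide text).
dataset = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(projected_dataset(dataset, "AB"))
```

With this assumed dataset the call yields the suffixes C, CB, C and BCA, in that order.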
Task Decomposition
1. Each processor counts the occurrences of 1-sequences in a different part of the dataset. A global add reduction is executed to obtain the overall counts.
2. Build pseudo-projections. This is done in parallel by assigning a different part of the dataset to each processor. The pseudo-projections are communicated to all processors via an all-to-all broadcast.
3. Use dynamic scheduling to distribute the processing of the projections across processors.
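Step 1 can be sketched in Python as follows. This is a single-process simulation, not a message-passing implementation: each "processor" is just a slice of the dataset, and the global add reduction is a sum of per-partition counters. The dataset, the round-robin split and the names `local_counts`, `global_add_reduction` and `split` are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def local_counts(partition):
    """Count how many sequences in one partition contain each item."""
    counts = Counter()
    for seq in partition:
        counts.update(set(seq))  # a sequence supports an item at most once
    return counts


def global_add_reduction(per_processor_counts):
    """Element-wise sum of the local counters (stands in for the reduction)."""
    return reduce(lambda a, b: a + b, per_processor_counts, Counter())


def split(dataset, n_processors):
    """Assumed round-robin partitioning of the dataset."""
    return [dataset[i::n_processors] for i in range(n_processors)]


dataset = ["CAABC", "ABCB", "CABC", "ABBCA"]  # assumed example data
partitions = split(dataset, 2)
total = global_add_reduction(local_counts(p) for p in partitions)
print(dict(total))
```

In a real cluster the sum would typically be performed by the communication library's reduction primitive; the sketch only shows that the overall counts are the sum of the local ones.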
Task Decomposition
In the second step, it is more efficient to implement the broadcast using a virtual ring structure.
Assume there are N processors. Processor K
– only receives the package from processor ((K-1) mod N)
– only sends the package to processor ((K+1) mod N)
The broadcast needs (N-1) send-receive steps and consumes no more than 0.5% of the mining time.
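The ring scheme can be simulated in a few lines of Python. This is a sketch under simplifying assumptions: all processors live in one process, a "package" is any Python value, and `ring_all_to_all` is a hypothetical name. In every step each processor forwards the package it received most recently, so after N-1 steps every processor holds all N packages.

```python
def ring_all_to_all(local_data):
    """Simulate an all-to-all broadcast over a virtual ring.

    local_data[k] is the package initially owned by processor k.
    Returns (received, steps), where received[k] maps source rank -> package.
    """
    n = len(local_data)
    received = [{k: local_data[k]} for k in range(n)]
    # in_flight[k] is the (source, package) pair processor k forwards next.
    in_flight = [(k, local_data[k]) for k in range(n)]
    steps = 0
    for _ in range(n - 1):
        arriving = [None] * n
        for k in range(n):
            arriving[(k + 1) % n] = in_flight[k]  # K sends to (K+1) mod N
        for k in range(n):
            src, package = arriving[k]            # K receives from (K-1) mod N
            received[k][src] = package
        in_flight = arriving
        steps += 1
    return received, steps


received, steps = ring_all_to_all(["p0", "p1", "p2", "p3"])
print(steps, received[0])
```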
Task Scheduling
1. A master processor maintains a queue of pseudo-projection identifiers. Each of the other processors is initially assigned a projection.
2. After mining a projection, a processor sends a request to the master processor for another projection.
3. This process continues until the queue of projections is empty.
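The scheduling loop above can be illustrated with Python threads. This is only a sketch: a shared `queue.Queue` plays the role of the master's queue, taking an item from it stands in for the request message, and `run_dynamic_schedule` and the `mine` callback are hypothetical names, not part of the original system.

```python
import queue
import threading


def run_dynamic_schedule(projection_ids, n_workers, mine):
    """Hand out projections one at a time until the queue is empty.

    Returns a dict: projection id -> (worker rank, result of mine(id)).
    """
    pending = queue.Queue()
    for pid in projection_ids:
        pending.put(pid)

    results = {}
    lock = threading.Lock()

    def worker(rank):
        while True:
            try:
                pid = pending.get_nowait()  # "request" another projection
            except queue.Empty:
                return                      # queue is empty: stop
            outcome = mine(pid)
            with lock:
                results[pid] = (rank, outcome)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Toy stand-in for mining: the "result" is just the squared identifier.
results = run_dynamic_schedule(range(10), n_workers=3, mine=lambda pid: pid * pid)
print(sorted(results))
```

Because a worker asks for new work only when it finishes, fast workers naturally process more projections than slow ones, which is the point of dynamic scheduling.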
Task Scheduling
If the largest subtask takes 25% of the total mining time, the best possible speedup is only 4 regardless of the number of processors available.
To improve dynamic scheduling, the approach is to identify the projections that require a long mining time and to decompose them.
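The speedup bound can be checked with a small calculation. The sketch assumes the total sequential time is normalised to 1, that subtasks are indivisible, and that there is no scheduling or communication overhead. The parallel time is then at least the larger of the biggest subtask and an even split across processors. `best_possible_speedup` is a hypothetical helper name.

```python
def best_possible_speedup(largest_fraction, n_processors):
    """Upper bound on speedup when the largest subtask cannot be split.

    largest_fraction: share of the total mining time taken by the largest
    subtask (0 < largest_fraction <= 1).
    """
    parallel_time = max(largest_fraction, 1.0 / n_processors)
    return 1.0 / parallel_time


print(best_possible_speedup(0.25, 64))  # bounded by the largest subtask
print(best_possible_speedup(0.25, 2))   # bounded by the processor count
```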
Relative Mining Time Estimation
Random sampling
– selects a random subset of the projections
– is not accurate if the overhead is kept small
Selective sampling
– uses every sequence of the projections
– discards the infrequent 1-sequences and the last L frequent 1-sequences of each sequence (L = a given fraction t × the average length of the sequences in the dataset)
Selective sampling
For example,
– assume (A : 4), (B : 4), (C : 4), (D : 3), (E : 3), (F : 3), (G : 1) are the 1-sequences with their support counts
– the support threshold = 4
– the average length of the sequences in the dataset = 4
– suppose t = 75%
L = 4 × 0.75 = 3
Given the sequence AABCACDCFDB, selective sampling reduces it to AABCA: removing the infrequent items D and F leaves AABCACCB, and removing the last 3 items leaves AABCA.
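The example can be replayed with a short Python sketch. It assumes sequences are strings of single items and that L is rounded down to an integer (the slide does not say how fractional values are handled). `selective_sample` is a hypothetical helper name.

```python
def selective_sample(sequence, support, min_support, t, avg_length):
    """Reduce one sequence as described for selective sampling.

    1. Drop items whose support is below the threshold.
    2. Drop the last L remaining items, with L = t * avg_length
       (rounded down here, which is an assumption).
    """
    L = int(t * avg_length)
    frequent = [item for item in sequence if support.get(item, 0) >= min_support]
    kept = frequent[:max(0, len(frequent) - L)]
    return "".join(kept)


support = {"A": 4, "B": 4, "C": 4, "D": 3, "E": 3, "F": 3, "G": 1}
sample = selective_sample("AABCACDCFDB", support, min_support=4, t=0.75, avg_length=4)
print(sample)
```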
Relative Mining Time Estimation
Par-CSP Algorithm
Experiments
64 nodes
OS: Redhat Linux 7.2
CPU: 1GHz Intel Pentium 3
RAM: 1GB
Compiler: GNU g++ 2.96
Experiments
• Synthetic dataset: IBM dataset generator
• Real dataset: Gazelle, Web click-stream