5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz...
![Page 1: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/1.jpg)
5/29/2008 AI Lab @ UEC in Japan
Chapter 12
Clustering: Large Databases
Written by Farial Shahnaz
Presented by Zhao Xinyou
Data Mining Technology
![Page 2: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/2.jpg)
Contents
Introduction
Ideas behind three major approaches to scalable clustering {Divide-and-Conquer, Incremental, Parallel}
Three algorithms for scalable clustering {BIRCH, DBSCAN, CURE}
Application
![Page 3: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/3.jpg)
Introduction: Common method
Common method for clustering: visit all the data in the database and analyze it.
Time: computational complexity is O(n*n).
Memory: all the data must be loaded into main memory.
PP133
[Figure: Time/Memory plotted against Data; the data set is huge, on the order of millions of records]
![Page 4: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/4.jpg)
Motivation: Clustering for large databases
[Figure: two plots of Time/Memory versus Data, labeled f(x): O(n*n) and f(x): O(n), joined by "Method ???"]
PP134
![Page 5: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/5.jpg)
Requirement: Clustering for large databases
[Figure: two plots of Time/Memory versus Data, labeled f(x): O(n*n) and f(x): O(n), joined by "Method ???"]
PP134
No more than one scan of the database (preferably less).
Process each record only once.
Work with limited memory.
Can suspend, stop, and resume.
Can update the results when data are inserted or removed.
Can use different techniques to scan the database.
During execution, the method should provide its status and the 'best' answer so far.
![Page 6: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/6.jpg)
Major approaches for scalable clustering
Divide-and-Conquer approach
Parallel clustering approach
Incremental clustering approach
PP135
![Page 7: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/7.jpg)
Divide-and-Conquer approach
Definition.
Divide-and-conquer is a problem-solving approach in which we:
divide the problem into sub-problems,
recursively conquer or solve each sub-problem, and
then combine the sub-problem solutions to obtain a solution to the original problem.
PP135
Key Assumptions
1. Problem solutions can be constructed using subproblem solutions.
2. Subproblem solutions are independent of one another.
9*9 Sudoku
![Page 8: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/8.jpg)
Parallel clustering approach
Idea: divide the data into small sets and process each small set on a different machine (derived from Divide-and-Conquer).
PP136-137
![Page 9: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/9.jpg)
Explanation about Divide-and-Conquer
Records: n. Aim: k clusters.
[Figure: Divide splits the n records into p partitions P0, P1, ..., Pi of n/p records each; Conquer clusters each partition into k clusters, giving kp clusters in total; a final Conquer (Merging) step reduces the kp clusters to k clusters]
Divide is some algorithm; Conquer is some algorithm.
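The divide, conquer, and merge flow in the diagram can be sketched as follows. This is a minimal illustration, not the method from the chapter: the slide leaves both the Divide and the Conquer algorithms open, so the sketch assumes equal-sized slices for Divide and a plain k-means for Conquer. The names `kmeans` and `divide_and_conquer_clusters` are hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means on tuples (assumed Conquer step)."""
    if len(points) < k:
        raise ValueError("need at least k points")
    # Deterministic start: k points spread evenly through the input order.
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for pt in points:
            nearest = min(range(k), key=lambda c: math.dist(pt, centers[c]))
            groups[nearest].append(pt)
        # New center = mean of the group; an empty group keeps its old center.
        centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
    return centers


def divide_and_conquer_clusters(records, k, p):
    """Split n records into p parts, find k centers per part, merge kp centers into k."""
    size = math.ceil(len(records) / p)
    partitions = [records[i:i + size] for i in range(0, len(records), size)]  # Divide
    partial = [c for part in partitions for c in kmeans(part, k)]             # Conquer
    return kmeans(partial, k)                                                 # Merging


records = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers = divide_and_conquer_clusters(records, k=2, p=2)
```

Only the kp partial centers are held during the merging step, which is what keeps the memory requirement below that of clustering all n records at once.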
![Page 10: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/10.jpg)
Application
Sorting: quick-sort and merge sort
Fast Fourier transforms
Tower of Hanoi puzzle
Matrix multiplication
...
PP135
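As an illustration of the sorting application, here is a short merge sort sketch. It shows the three phases of divide-and-conquer: split the list, sort each half recursively, and combine the two sorted halves. The function name `merge_sort` is chosen for this example only.

```python
def merge_sort(items):
    """Sort a list by divide-and-conquer; returns a new list."""
    if len(items) <= 1:
        return list(items)          # a list of 0 or 1 items is already sorted
    mid = len(items) // 2
    left = merge_sort(items[:mid])  # divide and conquer the left half
    right = merge_sort(items[mid:])  # divide and conquer the right half
    # Combine: repeatedly take the smaller head element of the two halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


result = merge_sort([5, 2, 9, 1, 5, 6])
```

The two halves are sorted independently of each other, which matches the key assumption that sub-problem solutions do not depend on one another.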
![Page 11: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/11.jpg)
CURE - Divide-and-Conquer
1. Get the size n of the set D and partition D into p groups (each containing n/p elements).
2. Cluster each group pi into k groups, using a heap and a k-d tree.
3. Delete the unrelated nodes from the heap and the k-d tree.
4. Cluster the partial clusters to get the final clusters.
PP140-141
![Page 12: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/12.jpg)
Heap
PP140-141
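A heap keeps the entry with the smallest key at the top, so the next item to process is available without scanning everything. The sketch below uses Python's standard `heapq` module. It assumes, purely for illustration, that each entry is a pair of (distance to the nearest other cluster, cluster name); the slide does not say how the heap is keyed, and `closest_first` is a hypothetical helper.

```python
import heapq


def closest_first(entries):
    """Return (distance, cluster) entries in order of increasing distance."""
    heap = list(entries)
    heapq.heapify(heap)                     # build a min-heap in place
    order = []
    while heap:
        order.append(heapq.heappop(heap))   # always removes the smallest entry
    return order


order = closest_first([(2.5, "c"), (0.7, "a"), (1.2, "b")])
```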
![Page 13: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/13.jpg)
k-D Tree
Technically, the letter k refers to the number of dimensions
PP140-141
3-dimensional kd-tree
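A k-d tree can be sketched in a few lines: each level splits the points on one coordinate, cycling through the k dimensions. The code below is a minimal illustration with a nearest-neighbor query, not the structure used in the chapter; the dictionary node layout and the names `build_kdtree` and `nearest` are assumptions made for this example.

```python
import math


def build_kdtree(points, depth=0):
    """Build a k-d tree; the splitting axis cycles through the k dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda pt: pt[axis])
    mid = len(points) // 2                       # median point becomes the node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or math.dist(target, node["point"]) < math.dist(target, best):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best so far.
    if abs(diff) < math.dist(target, best):
        best = nearest(far, target, best)
    return best


points = [(2, 3, 1), (5, 4, 7), (9, 6, 2), (4, 7, 9), (8, 1, 5), (7, 2, 6)]
tree = build_kdtree(points)
closest = nearest(tree, (9, 2, 5))
```

With 3-tuples as input, the tree is 3-dimensional: the root splits on the first coordinate, its children on the second, and their children on the third.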
![Page 14: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/14.jpg)
K-D Tree
PP140-141
![Page 15: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/15.jpg)
CURE - Divide-and-Conquer
PP140-141
[Figure: alternating "Nearest" and "Merge" steps]
![Page 16: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/16.jpg)
Incremental clustering approach
Idea: scan all the data in the database and compare each record with the existing clusters. If a similar cluster is found, assign the record to that cluster; otherwise, create a new cluster. Continue until no data remain.
Steps:
S = {}; // set of clusters, initially empty
do {
    read one record d;
    r = find_similarity_cluster(d, S);
    if (r exists)
        assign d to the cluster r
    else
        Add_cluster(d, S);
} until (no record in database);
PP135-136
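The steps above can be made concrete with a small sketch. The slide does not define what "similar" means, so this example assumes that a record is similar to a cluster when its Euclidean distance to the cluster centroid is at most a `threshold`. Both the parameter and the dictionary layout of a cluster are hypothetical.

```python
import math


def incremental_clustering(records, threshold):
    """Single pass: assign each record to the closest similar cluster or start a new one."""
    clusters = []                               # S = {}
    for d in records:                           # read one record d
        best, best_dist = None, threshold
        for c in clusters:                      # look for the closest similar cluster
            dist = math.dist(d, c["centroid"])
            if dist <= best_dist:
                best, best_dist = c, dist
        if best is not None:                    # assign d to the cluster r
            best["members"].append(d)
            n = len(best["members"])
            best["centroid"] = tuple(sum(xs) / n for xs in zip(*best["members"]))
        else:                                   # otherwise add a new cluster
            clusters.append({"centroid": tuple(d), "members": [d]})
    return clusters


records = [(0, 0), (0, 1), (10, 10), (1, 0), (10, 11)]
clusters = incremental_clustering(records, threshold=2.0)
```

Each record is read once and the clusters are usable after every record, so the loop can be stopped and resumed, and new records can be added later. The result depends on the order in which the records arrive.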
![Page 17: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/17.jpg)
Application: Incremental clustering approach
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
![Page 18: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/18.jpg)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Based on distance measurements, compute the similarity between a record and a cluster, and output the clusters.
Inner Cluster
Among Cluster
PP137-138
![Page 19: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/19.jpg)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Inner Cluster
Among Cluster
PP137-138
![Page 20: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/20.jpg)
Related Definition
Cluster: {xi}, where i = 1, 2, …, N
CF (Clustering Feature): a triple (N, LS, SS), where N is the number of data points, LS is the linear sum of the N data points, and SS is their square sum.
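The value of the CF triple is that it is additive and still sufficient to recover summary statistics such as the centroid and the radius. The sketch below illustrates this under one common convention, assumed here: SS is stored as a single scalar, the sum of the squared norms of the points. The function names are hypothetical.

```python
import math


def cf_of(points):
    """Clustering feature (N, LS, SS) of a list of points."""
    n = len(points)
    ls = tuple(sum(xs) for xs in zip(*points))        # linear sum, per dimension
    ss = sum(x * x for pt in points for x in pt)      # scalar square sum (assumed convention)
    return (n, ls, ss)


def cf_merge(a, b):
    """CF of the union of two disjoint clusters: add the triples component-wise."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])


def cf_centroid(cf):
    n, ls, _ = cf
    return tuple(x / n for x in ls)


def cf_radius(cf):
    """Root mean squared distance to the centroid: sqrt(SS/N - ||LS/N||^2)."""
    n, ls, ss = cf
    squared = ss / n - sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(squared, 0.0))               # guard against rounding below zero


a = cf_of([(1, 2), (3, 4)])
b = cf_of([(5, 6)])
merged = cf_merge(a, b)
```

Because merging only adds three numbers per cluster, the individual points never need to be kept in memory once their CF has been computed.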
![Page 21: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/21.jpg)
Related Definition
CF tree = (B, T)
B = (CFi, childi) if it is an internal node in a cluster.
B = (CFi, prev, next) if it is an external or leaf node in a cluster.
T: threshold for all leaf nodes, which must satisfy mean distance D < T.
![Page 22: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/22.jpg)
Algorithm for BIRCH
![Page 23: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/23.jpg)
DBSCAN
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Ex1: We want to classify the houses along a river in a spatial photo.
Ex2:
![Page 24: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/24.jpg)
Definition for DBSCAN
Eps-neighborhood of a point
The Eps-neighborhood of a point p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p,q) ≤ Eps}
Minimum Number (MinPts)
MinPts is the minimum number of data points in any cluster.
![Page 25: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/25.jpg)
Definition for DBSCAN
Directly density-reachable
A point p is directly density-reachable from a point q with respect to Eps and MinPts if:
1) p ∈ NEps(q);
2) |NEps(q)| ≥ MinPts.
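The two conditions translate almost directly into code. The sketch below assumes points are tuples, D is a list of points, and dist is the Euclidean distance; the function names are chosen for this example.

```python
import math


def eps_neighborhood(D, p, eps):
    """NEps(p): all points q in D with dist(p, q) <= Eps (p itself included)."""
    return [q for q in D if math.dist(p, q) <= eps]


def directly_density_reachable(D, p, q, eps, min_pts):
    """True if p is in NEps(q) and NEps(q) holds at least MinPts points."""
    neighborhood = eps_neighborhood(D, q, eps)
    return p in neighborhood and len(neighborhood) >= min_pts


D = [(0, 0), (0, 1), (1, 0), (5, 5)]
forward = directly_density_reachable(D, (0, 1), (0, 0), eps=1.0, min_pts=3)
backward = directly_density_reachable(D, (0, 0), (0, 1), eps=1.0, min_pts=3)
```

In this data set `forward` is true and `backward` is false: (0, 0) has three points in its neighborhood, but (0, 1) has only two, so the relation is not symmetric.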
![Page 26: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/26.jpg)
Definition for DBSCAN
Density-reachable
A point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p1, p2, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi.
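Density-reachability can be checked with a breadth-first search. The sketch below takes the chain to start at q and end at p, which is an assumption about the chain direction, and it extends the chain only through points whose Eps-neighborhood holds at least MinPts points. Points are assumed to be tuples in a list D, and `density_reachable` is a hypothetical name.

```python
import math
from collections import deque


def density_reachable(D, p, q, eps, min_pts):
    """True if a chain of directly density-reachable steps leads from q to p."""
    def neighborhood(x):
        return [y for y in D if math.dist(x, y) <= eps]

    seen = {q}
    queue = deque([q])
    while queue:
        current = queue.popleft()
        nbrs = neighborhood(current)
        if len(nbrs) < min_pts:
            continue                 # too few neighbors: the chain cannot pass through here
        for nxt in nbrs:
            if nxt == p:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


D = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
forward = density_reachable(D, (3, 0), (1, 0), eps=1.0, min_pts=3)
backward = density_reachable(D, (1, 0), (3, 0), eps=1.0, min_pts=3)
```

Here (3, 0) is reached from (1, 0) through (2, 0), but not the other way round, because (3, 0) has only two points in its neighborhood.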
![Page 28: 5/29/2008AI Lab @ UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.](https://reader035.fdocuments.net/reader035/viewer/2022070413/5697bfbf1a28abf838ca3454/html5/thumbnails/28.jpg)
Algorithm of DBSCAN
Input: D = {t1, t2, …, tn}, MinPts, Eps
Output: K = {K1, K2, …, Kk}
k = 0;
for i = 1 to n do
    if ti is not in a cluster then
        X = {tj | tj is density-reachable from ti}
        if X is a valid cluster then
            k = k + 1;
            Kk = X;
        end if
    end if
end for
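A runnable sketch of the algorithm is given below. It makes two assumptions that the slide does not spell out: X is treated as a valid cluster when the starting point has at least MinPts points in its Eps-neighborhood, and points that end up in no cluster are labeled as noise with -1. The output is one label per point rather than a list of sets.

```python
import math


def dbscan(D, eps, min_pts):
    """Return one label per point: a cluster number from 0 upward, or -1 for noise."""
    def neighborhood(i):
        return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]

    labels = [None] * len(D)         # None = not yet visited
    k = 0
    for i in range(len(D)):
        if labels[i] is not None:    # ti is already in a cluster (or marked as noise)
            continue
        nbrs = neighborhood(i)
        if len(nbrs) < min_pts:      # X would not be a valid cluster
            labels[i] = -1
            continue
        labels[i] = k
        frontier = list(nbrs)        # collect everything density-reachable from ti
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = k        # earlier noise point turns out to be a border point
            if labels[j] is not None:
                continue
            labels[j] = k
            j_nbrs = neighborhood(j)
            if len(j_nbrs) >= min_pts:
                frontier.extend(j_nbrs)
        k += 1
    return labels


D = [(0, 0), (0, 1), (1, 0), (1, 1),
     (10, 10), (10, 11), (11, 10), (11, 11),
     (50, 50)]
labels = dbscan(D, eps=1.5, min_pts=3)
```

Computing each neighborhood by a full scan makes this sketch O(n*n); a spatial index would be needed to get below that on a large database.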