Very large data sets
![Page 1: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/1.jpg)
Very large data sets
Pasi Fränti
Clustering methods: Part 10
Speech and Image Processing Unit
School of Computing
University of Eastern Finland
5.5.2014
![Page 2: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/2.jpg)
Methods for large data sets
• Birch
• Clarans
• On-line EM
• Scalable EM
• GMG
Let’s study GMG (no material for the others)
![Page 3: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/3.jpg)
Gradual model generator (GMG) [Kärkkäinen & Fränti, 2007: Pattern Recognition]
[Figure: GMG block diagram with the labels Data, Buffer, Model, Model size reduction, Model generation, Generated model, Selection, Post-processing, Output models]
![Page 4: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/4.jpg)
Goal of the GMG algorithm
[Figure: two panels, EM and GMG]
![Page 5: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/5.jpg)
Contours of probability density distributions
[Figure: two panels, EM and GMG]
![Page 6: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/6.jpg)
Model update
[Figure: two panels, Before update and After update]
• New data points are mapped immediately when input.
• Points too far (from any model) will remain in buffer.
• Buffered points are re-tested when new models are created.
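The update rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes spherical Gaussian components with a fixed variance, a distance threshold `max_dist` measured in standard deviations, and an incremental mean update. The names `Component`, `process_point`, `retest_buffer` and `max_dist` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    mean: list        # component centre (list of floats)
    var: float        # assumed spherical variance, kept fixed in this sketch
    count: int = 1    # number of points mapped to this component

    def distance(self, x):
        # Euclidean distance scaled by the standard deviation (assumption)
        d2 = sum((a - m) ** 2 for a, m in zip(x, self.mean))
        return math.sqrt(d2 / self.var)

    def absorb(self, x):
        # Incremental (running) mean update with the new point
        self.count += 1
        self.mean = [m + (a - m) / self.count for m, a in zip(self.mean, x)]


def process_point(x, components, buffer, max_dist=3.0):
    """Map x to the nearest component if close enough, else buffer it."""
    if components:
        nearest = min(components, key=lambda c: c.distance(x))
        if nearest.distance(x) <= max_dist:
            nearest.absorb(x)
            return True
    buffer.append(x)   # too far from every component: stays in the buffer
    return False


def retest_buffer(components, buffer, max_dist=3.0):
    """Re-test buffered points, e.g. after a new component was created."""
    pending = list(buffer)
    buffer.clear()
    for x in pending:
        process_point(x, components, buffer, max_dist)
```

Calling `retest_buffer` after each new component is created mirrors the third bullet: points that were too far from every earlier model get another chance to be mapped.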
![Page 7: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/7.jpg)
Generating new components
[Figure: two panels, Data in buffer and Selected points and a new component]
• When the buffer is full, selected points are used to generate new components.
• The most compact k-neighborhood is selected as the seed for a new component.
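A minimal sketch of the seed selection, under stated assumptions: the slides do not define "most compact", so here compactness is taken to be the sum of squared Euclidean distances from a point to its k nearest neighbours in the buffer, and the new component's mean is the centroid of the selected points. The function names are hypothetical.

```python
import math


def most_compact_neighborhood(buffer, k):
    """Return (seed_index, neighbor_indices) of the tightest k-neighborhood."""
    if len(buffer) <= k:
        raise ValueError("buffer must hold more than k points")
    best = None
    for i, p in enumerate(buffer):
        # k nearest neighbours of p among the other buffered points
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(buffer) if j != i
        )[:k]
        # Assumed compactness measure: sum of squared neighbour distances
        spread = sum(d * d for d, _ in nearest)
        if best is None or spread < best[0]:
            best = (spread, i, [j for _, j in nearest])
    _, seed, neighbors = best
    return seed, neighbors


def seed_component(buffer, k):
    """Build the initial mean of a new component from the selected points."""
    seed, neighbors = most_compact_neighborhood(buffer, k)
    members = [buffer[seed]] + [buffer[j] for j in neighbors]
    dim = len(members[0])
    mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
    return mean, members
```

The search compares every pair of buffered points, so its cost is quadratic in the buffer size. That is acceptable only because the buffer is bounded, unlike the data stream itself.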
![Page 8: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/8.jpg)
Example
[Figure: plot with both axes ranging from 0 to 1]
![Page 9: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/9.jpg)
Example
[Figure: plot with both axes ranging from 0 to 1]
![Page 10: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/10.jpg)
Example
![Page 11: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/11.jpg)
Example
![Page 12: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/12.jpg)
Example
![Page 13: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/13.jpg)
Example
![Page 14: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/14.jpg)
Post-processing
[Figure: Model before processing; both axes ranging from 0 to 1]
![Page 15: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/15.jpg)
Post-processing
[Figure: two panels, Model before processing and Updated model; both axes ranging from 0 to 1]
![Page 16: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/16.jpg)
Post-processing
[Figure: two panels, Model before processing and Updated model + data; both axes ranging from 0 to 1]
![Page 17: Very large data sets](https://reader036.fdocuments.net/reader036/viewer/2022062308/56812ddc550346895d932de8/html5/thumbnails/17.jpg)
Literature
1. I. Kärkkäinen, P. Fränti, Gradual Model Generator for Single-Pass Clustering, Pattern Recognition 40(3) (2007) 784-795.
2. P. Bradley, U. Fayyad, C. Reina, Clustering Very Large Databases Using EM Mixture Models, Proc. of the 15th Int. Conf. on Pattern Recognition, vol. 2, 2000, pp. 76-80.
3. R. Ng, J. Han, CLARANS: A Method for Clustering Objects for Spatial Data Mining, IEEE Trans. Knowledge & Data Engineering 14(5) (2002) 1003-1016.
4. M. Sato, S. Ishii, On-line EM Algorithm for the Normalized Gaussian Network, Neural Computation 12(2) (2000) 407-432.
5. T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: A New Data Clustering Algorithm and Its Applications, Data Mining and Knowledge Discovery 1(2) (1997) 141-182.