Statistical Data Mining: A Short Course for the Army ...
Statistical Data Mining: A Short Course for the Army Conference on Applied Statistics
Edward J. Wegman, George Mason University
Jeffrey L. Solka, Naval Surface Warfare Center
Statistical Data Mining Agenda
• Introduction and Complexity
• Data Preparation and Compression
• Databases and Data Mining via Association Rules
• Clustering, Classification, and Discrimination
• Pattern Recognition and Intrusion Detection
• Color Theory and Design
• Visual Data Mining
• CrystalVision Installation and Practice
Introduction to Data Mining
Introduction to Data Mining
• What is Data Mining All About
• Hierarchy of Data Set Size
• Computational Complexity and Feasibility
• Data Mining Defined & Contrasted with EDA
• Examples
Introduction to Data Mining
• Why Data Mining
• What is Knowledge Discovery in Databases
• Potential Applications
  • Fraud Detection
  • Manufacturing Processes
  • Targeting Markets
  • Scientific Data Analysis
  • Risk Management
  • Web Intelligence
Introduction to Data Mining
Data Mining: On what kind of data?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced
  • Object-relational
  • Spatial, Temporal, Spatiotemporal
  • Text, WWW
  • Heterogeneous, Legacy, Distributed
Introduction to Data Mining
Data Mining: Why now?
• Confluence of multiple disciplines
  • Database systems, data warehouses, OLAP
  • Machine learning
  • Statistical and data analysis methods
  • Visualization
  • Mathematical programming
  • High performance computing
Introduction to Data Mining
Why do we need data mining?
• Large number of records (cases) ($10^{8}$ to $10^{12}$ bytes)
• High dimensional data (variables) (10 to $10^{4}$ attributes)
• How do you explore millions of records, tens or hundreds of fields, and find patterns?
Introduction to Data Mining
Why do we need data mining?
• Only a small portion, typically 5% to 10%, of the collected data is ever analyzed.
• Data that may never be explored continues to be collected out of fear that something that may prove important in the future may be missing.
• The magnitude of the data precludes most traditional analysis (more on complexity later).
Introduction to Data Mining
KDD and data mining have roots in traditional database technology.
As databases grow, the ability of the decision support process to exploit traditional (i.e., Boolean) query languages is limited.
• Many queries of interest are difficult or impossible to state in traditional query languages:
  • “Find all cases of fraud in IRS tax returns.”
  • “Find all individuals likely to ignore Census questionnaires.”
  • “Find all documents relating to this customer’s problem.”
Complexity
Complexity
| Descriptor | Data Set Size in Bytes | Storage Mode |
|---|---|---|
| Tiny | $10^{2}$ | Piece of Paper |
| Small | $10^{4}$ | A Few Pieces of Paper |
| Medium | $10^{6}$ | A Floppy Disk |
| Large | $10^{8}$ | Hard Disk |
| Huge | $10^{10}$ | Multiple Hard Disks |
| Massive | $10^{12}$ | Robotic Magnetic Tape Storage Silos |
| Supermassive | $10^{15}$ | Distributed Data Archives |

The Huber-Wegman Taxonomy of Data Set Sizes
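The taxonomy can be read as a lookup on the order of magnitude of a data set's size. The sketch below is illustrative only: the function name `classify` and the rule of picking the nearest descriptor on a log scale are assumptions, not part of the original taxonomy, which lists nominal sizes and not boundaries.

```python
import math

# Nominal sizes in bytes, copied from the taxonomy table.
TAXONOMY = [
    ("tiny", 1e2),
    ("small", 1e4),
    ("medium", 1e6),
    ("large", 1e8),
    ("huge", 1e10),
    ("massive", 1e12),
    ("supermassive", 1e15),
]


def classify(size_bytes: float) -> str:
    """Return the descriptor whose nominal size is nearest on a log10 scale.

    The nearest-on-log-scale rule is an assumption made for this sketch.
    """
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    log_size = math.log10(size_bytes)
    name, _ = min(TAXONOMY, key=lambda item: abs(log_size - math.log10(item[1])))
    return name
```

For example, `classify(5e9)` falls closest to the "huge" row, since 5 gigabytes is nearer to $10^{10}$ than to $10^{8}$ on a log scale.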
Complexity
Algorithmic Complexity
• $O(n)$: Calculate Means, Variances, Kernel Density Estimates
• $O(n \log(n))$: Calculate Fast Fourier Transforms
• $O(nc)$: Calculate Singular Value Decomposition of an $r \times c$ Matrix; Solve a Multiple Linear Regression
• $O(n^{2})$: Solve most Clustering Algorithms
• $O(a^{n})$: Detect Multivariate Outliers
Complexity
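Table 2: Number of Operations for Algorithms of Various Computational Complexities and Various Data Set Sizes

| n | $n^{1/2}$ | $n$ | $n \log(n)$ | $n^{3/2}$ | $n^{2}$ |
|---|---|---|---|---|---|
| tiny | $10$ | $10^{2}$ | $2 \times 10^{2}$ | $10^{3}$ | $10^{4}$ |
| small | $10^{2}$ | $10^{4}$ | $4 \times 10^{4}$ | $10^{6}$ | $10^{8}$ |
| medium | $10^{3}$ | $10^{6}$ | $6 \times 10^{6}$ | $10^{9}$ | $10^{12}$ |
| large | $10^{4}$ | $10^{8}$ | $8 \times 10^{8}$ | $10^{12}$ | $10^{16}$ |
| huge | $10^{5}$ | $10^{10}$ | $10^{11}$ | $10^{15}$ | $10^{20}$ |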
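The entries of Table 2 follow directly from the data set sizes in the taxonomy. A minimal sketch of that arithmetic is below; the names `SIZES`, `COMPLEXITIES`, and `operation_counts` are illustrative, and the base-10 logarithm is an inference from the tabulated values (for example, $2 \times 10^{2}$ for $n = 10^{2}$), not something the slides state.

```python
import math

# Data set sizes n (in bytes) for the five descriptors used in Table 2.
SIZES = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

# Operation count as a function of n for each complexity class.
# log is taken base 10, which is what the tabulated values imply.
COMPLEXITIES = {
    "n^(1/2)": lambda n: math.sqrt(n),
    "n": lambda n: n,
    "n log(n)": lambda n: n * math.log10(n),
    "n^(3/2)": lambda n: n ** 1.5,
    "n^2": lambda n: n ** 2,
}


def operation_counts() -> dict:
    """Return {descriptor: {complexity: number of operations}}."""
    return {
        name: {label: f(n) for label, f in COMPLEXITIES.items()}
        for name, n in SIZES.items()
    }


if __name__ == "__main__":
    for name, row in operation_counts().items():
        cells = "  ".join(f"{label}={ops:.0e}" for label, ops in row.items())
        print(f"{name:7s} {cells}")
```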
Complexity
Table 4: Computational Feasibility on a Pentium PC (10 megaflop performance assumed)

| n | $n^{1/2}$ | $n$ | $n \log(n)$ | $n^{3/2}$ | $n^{2}$ |
|---|---|---|---|---|---|
| tiny | $10^{-6}$ seconds | $10^{-5}$ seconds | $2 \times 10^{-5}$ seconds | .0001 seconds | .001 seconds |
| small | $10^{-5}$ seconds | .001 seconds | .004 seconds | .1 seconds | 10 seconds |
| medium | .0001 seconds | .1 seconds | .6 seconds | 1.67 minutes | 1.16 days |
| large | .001 seconds | 10 seconds | 1.3 minutes | 1.16 days | 31.7 years |
| huge | .01 seconds | 16.7 minutes | 2.78 hours | 3.17 years | 317,000 years |
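Each cell in a feasibility table of this kind is the operation count divided by the assumed floating-point rate. The sketch below reproduces that division and converts the result to a readable unit. The helper names are illustrative, and a 365-day year is an assumption that happens to match the Pentium figures.

```python
def seconds_needed(operations: float, flops: float = 1e7) -> float:
    """Time in seconds for `operations` at `flops` operations per second.

    The default of 1e7 corresponds to the 10 megaflop Pentium PC assumption.
    """
    return operations / flops


def humanize(seconds: float) -> str:
    """Express a duration in the largest unit that is at least 1."""
    units = [
        ("years", 365 * 86400),  # assumption: 365-day year
        ("days", 86400),
        ("hours", 3600),
        ("minutes", 60),
    ]
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.3g} {name}"
    return f"{seconds:.3g} seconds"


if __name__ == "__main__":
    # n^2 algorithm on a medium (1e6 byte) data set: 1e12 operations.
    print(humanize(seconds_needed(1e12)))
    # The same workload at a different assumed rate, e.g. 1e12 flops.
    print(humanize(seconds_needed(1e12, flops=1e12)))
```

Passing a different `flops` value gives the corresponding figure for a faster machine, which is the only quantity that changes from one feasibility table to the next.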
Complexity
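Table 5: Computational Feasibility on a Silicon Graphics Onyx Workstation (300 megaflop performance assumed)

| n | $n^{1/2}$ | $n$ | $n \log(n)$ | $n^{3/2}$ | $n^{2}$ |
|---|---|---|---|---|---|
| tiny | $3.3 \times 10^{-8}$ seconds | $3.3 \times 10^{-7}$ seconds | $6.7 \times 10^{-7}$ seconds | $3.3 \times 10^{-6}$ seconds | $3.3 \times 10^{-5}$ seconds |
| small | $3.3 \times 10^{-7}$ seconds | $3.3 \times 10^{-5}$ seconds | $1.3 \times 10^{-4}$ seconds | $3.3 \times 10^{-3}$ seconds | .33 seconds |
| medium | $3.3 \times 10^{-6}$ seconds | $3.3 \times 10^{-3}$ seconds | .02 seconds | 3.3 seconds | 55 minutes |
| large | $3.3 \times 10^{-5}$ seconds | .33 seconds | 2.7 seconds | 55 minutes | 1.04 years |
| huge | $3.3 \times 10^{-4}$ seconds | 33 seconds | 5.5 minutes | 38.2 days | 10,464 years |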
Complexity
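Table 6: Computational Feasibility on an Intel Paragon XP/S A4 (4.2 gigaflop performance assumed)

| n | $n^{1/2}$ | $n$ | $n \log(n)$ | $n^{3/2}$ | $n^{2}$ |
|---|---|---|---|---|---|
| tiny | $2.4 \times 10^{-9}$ seconds | $2.4 \times 10^{-8}$ seconds | $4.8 \times 10^{-8}$ seconds | $2.4 \times 10^{-7}$ seconds | $2.4 \times 10^{-6}$ seconds |
| small | $2.4 \times 10^{-8}$ seconds | $2.4 \times 10^{-6}$ seconds | $9.5 \times 10^{-6}$ seconds | $2.4 \times 10^{-4}$ seconds | .024 seconds |
| medium | $2.4 \times 10^{-7}$ seconds | $2.4 \times 10^{-4}$ seconds | .0014 seconds | .24 seconds | 4.0 minutes |
| large | $2.4 \times 10^{-6}$ seconds | .024 seconds | .19 seconds | 4.0 minutes | 27.8 days |
| huge | $2.4 \times 10^{-5}$ seconds | 2.4 seconds | 24 seconds | 66.7 hours | 761 years |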
Complexity
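Table 7: Computational Feasibility on a Teraflop Grand Challenge Computer (1000 gigaflop performance assumed)

| n | $n^{1/2}$ | $n$ | $n \log(n)$ | $n^{3/2}$ | $n^{2}$ |
|---|---|---|---|---|---|
| tiny | $10^{-11}$ seconds | $10^{-10}$ seconds | $2 \times 10^{-10}$ seconds | $10^{-9}$ seconds | $10^{-8}$ seconds |
| small | $10^{-10}$ seconds | $10^{-8}$ seconds | $4 \times 10^{-8}$ seconds | $10^{-6}$ seconds | $10^{-4}$ seconds |
| medium | $10^{-9}$ seconds | $10^{-6}$ seconds | $6 \times 10^{-6}$ seconds | .001 seconds | 1 second |
| large | $10^{-8}$ seconds | $10^{-4}$ seconds | $8 \times 10^{-4}$ seconds | 1 second | 2.8 hours |
| huge | $10^{-7}$ seconds | .01 seconds | .1 seconds | 16.7 minutes | 3.2 years |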
Complexity
Table 8: Types of Computers for Interactive Feasibility (Response Time < 1 second)

| n | $n^{1/2}$ | $n$ | $n \log(n)$ | $n^{3/2}$ | $n^{2}$ |
|---|---|---|---|---|---|
| tiny | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| small | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Super Computer |
| medium | Personal Computer | Personal Computer | Personal Computer | Super Computer | Teraflop Computer |
| large | Personal Computer | Workstation | Super Computer | Teraflop Computer | --- |
| huge | Personal Computer | Super Computer | Teraflop Computer | --- | --- |
Complexity
Table 9: Types of Computers for Feasibility (Response Time < 1 week)

| n | $n^{1/2}$ | $n$ | $n \log(n)$ | $n^{3/2}$ | $n^{2}$ |
|---|---|---|---|---|---|
| tiny | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| small | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| medium | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Personal Computer |
| large | Personal Computer | Personal Computer | Personal Computer | Personal Computer | Teraflop Computer |
| huge | Personal Computer | Personal Computer | Personal Computer | Super Computer | --- |
Complexity
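Table 10: Transfer Rates for a Variety of Data Transfer Regimes

| n | standard ethernet, 10 megabits/sec | fast ethernet, 100 megabits/sec | hard disk transfer, 2027 kilobytes/sec | cache transfer @ 200 megahertz |
|---|---|---|---|---|
| rate | $1.25 \times 10^{6}$ bytes/sec | $1.25 \times 10^{7}$ bytes/sec | $2.027 \times 10^{6}$ bytes/sec | $2 \times 10^{8}$ bytes/sec |
| tiny | $8 \times 10^{-5}$ seconds | $8 \times 10^{-6}$ seconds | $4.9 \times 10^{-5}$ seconds | $5 \times 10^{-7}$ seconds |
| small | $8 \times 10^{-3}$ seconds | $8 \times 10^{-4}$ seconds | $4.9 \times 10^{-3}$ seconds | $5 \times 10^{-5}$ seconds |
| medium | .8 seconds | .08 seconds | .49 seconds | $5 \times 10^{-3}$ seconds |
| large | 1.3 minutes | 8 seconds | 49 seconds | .5 seconds |
| huge | 2.2 hours | 13.3 minutes | 1.36 hours | 50 seconds |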
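The transfer times in Table 10 are the data set size divided by the transfer rate in bytes per second. A minimal sketch of that calculation follows; the dictionary and function names are illustrative, and treating the 200 megahertz cache as moving one byte per clock cycle is an assumption read off the table's $2 \times 10^{8}$ bytes/sec figure.

```python
# Transfer rates in bytes per second.
RATES_BYTES_PER_SEC = {
    "standard ethernet": 10e6 / 8,   # 10 megabits/sec -> 1.25e6 bytes/sec
    "fast ethernet": 100e6 / 8,      # 100 megabits/sec -> 1.25e7 bytes/sec
    "hard disk": 2027e3,             # 2027 kilobytes/sec
    "cache @ 200 MHz": 200e6,        # assumption: one byte per clock cycle
}


def transfer_seconds(size_bytes: float, medium: str) -> float:
    """Seconds needed to move `size_bytes` over the named medium."""
    return size_bytes / RATES_BYTES_PER_SEC[medium]


if __name__ == "__main__":
    sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}
    for name, size in sizes.items():
        row = ", ".join(
            f"{medium}: {transfer_seconds(size, medium):.3g} s"
            for medium in RATES_BYTES_PER_SEC
        )
        print(f"{name:7s} {row}")
```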
Complexity
Table 11: Resolvable Number of Pixels Across Screen for Several Viewing Scenarios

| | 19 inch monitor @ 24 inches | 25 inch TV @ 12 feet | 15 foot screen @ 20 feet | immersion |
|---|---|---|---|---|
| Angle | 39.005° | 9.922° | 41.112° | 140° |
| 5 seconds of arc resolution (Valyus) | 28,084 | 7,144 | 29,601 | 100,800 |
| 1 minute of arc resolution | 2,340 | 595 | 2,467 | 8,400 |
| 3.6 minutes of arc resolution (Wegman) | 650 | 165 | 685 | 2,333 |
| 4.38 minutes of arc resolution (Maar 1) | 534 | 136 | 563 | 1,918 |
| .486 minutes of arc/foveal cone (Maar 2) | 4,815 | 1,225 | 5,076 | 17,284 |
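The pixel counts in Table 11 are consistent with dividing the viewing angle by the assumed angular resolution of the eye. The sketch below shows that calculation under stated assumptions: the function names are illustrative, and the viewing-angle formula $2\arctan(w / 2d)$ for a flat screen of width $w$ at distance $d$ is a standard geometric relation that matches the TV and projection-screen columns, not a formula given in the slides.

```python
import math


def viewing_angle_deg(width: float, distance: float) -> float:
    """Angle subtended by a flat screen of `width` seen from `distance`.

    Both arguments must use the same length unit.
    """
    return math.degrees(2.0 * math.atan(width / (2.0 * distance)))


def resolvable_pixels(view_angle_deg: float, resolution_arcmin: float) -> int:
    """Number of resolvable pixels across the field of view.

    `resolution_arcmin` is the assumed angular resolution in minutes of arc.
    """
    return round(view_angle_deg * 60.0 / resolution_arcmin)


if __name__ == "__main__":
    # 15 foot screen viewed from 20 feet.
    angle = viewing_angle_deg(15.0, 20.0)
    print(f"angle = {angle:.3f} degrees")
    # Immersion (140 degrees) at 3.6 minutes of arc.
    print(resolvable_pixels(140.0, 3.6))
```

A resolution given in seconds of arc is passed as a fraction of a minute, for example `5 / 60` for the 5 seconds of arc row.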
Complexity
Scenarios
• Typical high resolution workstations, 1280x1024 = $1.31 \times 10^{6}$ pixels
• Realistic using Wegman, immersion, 4:5 aspect ratio, 2333x1866 = $4.35 \times 10^{6}$ pixels
• Very optimistic using 1 minute arc, immersion, 4:5 aspect ratio, 8400x6720 = $5.65 \times 10^{7}$ pixels
• Wildly optimistic using Maar(2), immersion, 4:5 aspect ratio, 17,284x13,828 = $2.39 \times 10^{8}$ pixels
Massive Data Sets
One Terabyte Dataset
vs
One Million Megabyte Data Sets
Both difficult to analyze, but for different reasons
Massive Data Sets: Commonly Used Language
• Data Mining = DM
• Knowledge Discovery in Databases = KDD
• Massive Data Sets = MD
• Data Analysis = DA
Massive Data Sets
DM ≠ MD
DM ≠ DA
Even DA + MD ≠ DM
1. Computationally Feasible Algorithms
2. Little or No Human Intervention
Data Mining of Massive Datasets
Data Mining is a kind of Exploratory Data Analysis with Little or No Human Interaction using Computationally Feasible Techniques,
i.e., the Attempt to find Interesting Structure unknown a priori
Massive Data Sets
Major Issues
• Complexity
• Non-homogeneity
Examples
• Huber’s Air Traffic Control
• Highway Maintenance
• Ultrasonic NDE
Massive Data Sets
Air Traffic Control
• 6 to 12 radar stations, several hundred aircraft, 64-byte record per radar per aircraft per antenna turn
• megabyte of data per minute
Massive Data Sets
Highway Maintenance
• Maintenance records and measurements of road quality for several decades
• Records of uneven quality
• Records missing
Massive Data Sets
NDE using Ultrasound
• Inspection of cast iron projectiles
• Time series of length 256, 360 degrees, 550 levels = 50,688,000 observations per projectile
• Several thousand projectiles per day
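The per-projectile figure is the product of the three sampling dimensions, and the daily volume scales with throughput. A short sketch of the arithmetic is below; the function names are illustrative, and the value of 2,000 projectiles per day is a hypothetical stand-in for "several thousand", which the slides do not pin down.

```python
SERIES_LENGTH = 256   # samples per time series
ANGLES = 360          # one series per degree of rotation
LEVELS = 550          # inspection levels along the projectile


def observations_per_projectile() -> int:
    """Total ultrasound observations recorded for one projectile."""
    return SERIES_LENGTH * ANGLES * LEVELS


def observations_per_day(projectiles_per_day: int) -> int:
    """Daily observation count for a given inspection throughput."""
    return observations_per_projectile() * projectiles_per_day


if __name__ == "__main__":
    print(f"{observations_per_projectile():,} observations per projectile")
    # Hypothetical throughput; the slides say only "several thousand".
    print(f"{observations_per_day(2000):,} observations per day")
```

At the hypothetical rate of 2,000 projectiles per day, this is on the order of $10^{11}$ observations per day, which places a single day's inspection data between the "huge" and "massive" descriptors even at one byte per observation.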
Massive Data Sets: A Distinction
Human Analysis of the Structure of Data and Pitfalls
vs
Human Analysis of the Data Itself
• Limits of the human visual system (HVS) and computational complexity limit the latter
• The former is the basis for design of the analysis engine