Statistical Data Mining - 3 Edward J. Wegman A Short Course for Interface ‘01.
Statistical Data Mining - 3
Edward J. Wegman
A Short Course for Interface ‘01
Visual Data Mining
Outline of Lecture
Visual Complexity
Description of Basic Techniques
  Parallel Coordinates
  Grand Tour
  Saturation Brushing
Illustrations of Basic Techniques
  Rapid Data Editing, Density Estimation (Pollen Data)
  Inverse Regression, Tree Structured Decision Rules (Bank Data)
  Classification & Clustering (SALAD Data & Artificial Nose)
  Structural Inference (PRIM 7 Data)
  Data Mining (BLS Cereal Scanner Data)
  Cluster Trees (Oronsay Sand Particle Size Data)
Visual Complexity
Scenarios
Typical high resolution workstations, 1280x1024 = 1.31x10^6 pixels
Realistic using Wegman, immersion, 4:5 aspect ratio, 2333x1866 = 4.35x10^6 pixels
Very optimistic using 1 minute arc, immersion, 4:5 aspect ratio, 8400x6720 = 5.64x10^7 pixels
Wildly optimistic using Maar(2), immersion, 4:5 aspect ratio, 17,284x13,828 = 2.39x10^8 pixels
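As a quick arithmetic check, each pixel count is simply width times height. The sketch below recomputes the counts from the dimensions quoted for the four scenarios; the dictionary keys are shorthand labels chosen here, not the lecture's terminology.

```python
# Recompute pixel counts (width x height) for the four display scenarios.
scenarios = {
    "workstation": (1280, 1024),
    "realistic": (2333, 1866),
    "very optimistic": (8400, 6720),
    "wildly optimistic": (17284, 13828),
}

pixel_counts = {name: w * h for name, (w, h) in scenarios.items()}

for name, count in pixel_counts.items():
    # Show the exact product and the same value in scientific notation.
    print(f"{name:18s} {count:>12,d}  ({count:.2e})")
```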
Visual Complexity
Visualization for Data Mining can realistically hope to deal with somewhere on the order of 10^6 to 10^7 observations. This coincides with the approximate limits for interactive computing of O(n^2) algorithms and for data transfer. This also roughly corresponds to the number of foveal cones in the eye.
Methodologies for Visual Data Mining
Parallel Coordinates
  Effective Method for High Dimensional Data
  High Dimensions = Multiple Attributes
Grand Tour
  Generalized Rotation in High Dimensions
  In Depth Study of High Dimensional Data
Saturation Brushing
  Effective Method for Large Data Sets
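To make the parallel coordinate idea concrete: an observation with d attributes is drawn as a polyline that crosses d parallel axes, one axis per attribute, with each attribute rescaled to the length of its own axis. The following is a minimal sketch in plain Python that only computes the polyline vertices. The function name `parallel_coordinates` and the sample records are hypothetical, and no rendering is attempted.

```python
def parallel_coordinates(rows):
    """Map each d-dimensional row to a polyline of (axis_index, position) points.

    Each attribute is min-max scaled to [0, 1] so that all axes share one
    range; a constant attribute is placed at the middle of its axis (0.5).
    """
    d = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(d)]
    highs = [max(r[j] for r in rows) for j in range(d)]
    polylines = []
    for r in rows:
        line = []
        for j in range(d):
            span = highs[j] - lows[j]
            position = (r[j] - lows[j]) / span if span else 0.5
            line.append((j, position))
        polylines.append(line)
    return polylines

# Made-up records with three attributes each (illustrative only).
data = [(1.0, 200.0, 0.5), (2.0, 100.0, 0.7), (3.0, 150.0, 0.9)]
for line in parallel_coordinates(data):
    print(line)
```

Because every attribute gets its own axis, adding a dimension only adds one more axis to the display, which is why the method scales to many attributes.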
Visual Data Mining Techniques
Multidimensional Data Visualization
  Scatterplot matrix
  Parallel coordinate plots
  3-D stereoscopic scatterplots
  Grand tour on all plot devices
  Density plots
  Linked views
  Saturation brushing
  Pruning and cropping
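Saturation brushing can be thought of as painting every observation with a very faint color, so that heavily overplotted regions build up toward full saturation while sparse regions stay dim. The sketch below illustrates that accumulation on a small pixel grid. It assumes simple additive blending with a cap at 1.0; the function name, grid size, and `increment` value are hypothetical choices made for the example, not the lecture's implementation.

```python
def saturation_brush(points, width, height, increment=0.05):
    """Accumulate a faint intensity per point into a pixel grid, capped at 1.0.

    points are (x, y) pairs already scaled to [0, 1]. Pixels hit by many
    points approach full saturation, so dense regions stand out.
    """
    grid = [[0.0] * width for _ in range(height)]
    for x, y in points:
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        grid[row][col] = min(1.0, grid[row][col] + increment)
    return grid

# Illustrative input: 30 overplotted points plus one isolated point.
pts = [(0.5, 0.5)] * 30 + [(0.1, 0.9)]
image = saturation_brush(pts, width=4, height=4)
for row in image:
    print(" ".join(f"{v:.2f}" for v in row))
```

The overplotted pixel reaches the cap while the isolated point stays faint, which is the property that makes the technique useful for large data sets.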
Crystal Vision
Data Editing and Density Estimation
Pollen Data
  3848 points
  5 dimensions
Inverse Regression and Tree Structured Decision Rules with Financial Data
Bank Demographic Data in 8 Dimensions with 12,000+ points
Classification and Clustering Using SALAD Data
Chemical Agent Detection Data in 13 Dimensions with 10,000+ points
Artificial Dog Nose
19 dimensional time series in 2 spectral bands
60 time steps for 300 chemical species
Time series in two spectral bands for same chemical species
Phase loop
Orthogonal components
After grand tour, orthogonal variables x2*, x9*, x15*, x16*, x18* separate the two spectral bands
Four chemical species, target highlighted in red
Target species separated by x1*, x3*, x5*, x6*, x11*, x15*
PRIM-7
7 dimensional high energy physics data
500 data points
pi-meson proton interaction
Structural Inference Using PRIM 7 Data
Scanner Data for Breakfast Cereals
5.5 gigabytes of scanner data in relational database
Price, sales volume, promotion, store, chain, PSU, UPC
Work done at BLS
Phase 1 – Basic Data Analysis – Single Month
Phase 2 – Price Relative Effects – 1 Year
Phase 3 – Churning Effects – 5 Years
Promotion has huge impact on sales volume
Stores not randomized
Aggressive promotion pays
Phase 2
Outliers belong to same chain
Promotion both years
Range of items with no promotion
One chain ceased promotions
Phase 3
Churning comes from both new items and new stores
Churning Effects: Red: PR=0, Blue: PR>0, Green: PR=infinity
New items tend to have higher prices
Many discontinued items have high expenditures
Effect of item churning
Removing Store Birth-Death Effects
Outlier due to price coding error
Effects of Cereal Types
Quantity Effects
Sands of Time Data
300 Samples of Sand Data from Oronsay Island in the Scottish Hebrides
Sands of Time - Objective
“The mesolithic shell middens on the island of Oronsay are one of the most important archeological sites in Britain. It is of considerable interest to determine their position with respect to the mesolithic coastline. If the sand below the midden were beach sand and the sand from the upper layers dune sand, this would indicate a seaward shift of the beach-dune interface.”
Flenley and Olbricht, 1993
Sands of Time - Objective
Cluster samples of modern sand into “beach-like” or “dune-like” sand.
Classify archeological sand samples as to whether they are beach sand or dune sand.
Sands of Time – Parametric Analysis
Historical strategy is to fit parametric distributions and compare modern and archeological sands based on parameters.
Weibull, 1933; lognormal (breakage models), log-hyperbolic, log-skew-Laplace, 1937, Barndorff-Nielsen, 1977.
Models have 2 to 4 parameters; the theory is developed, but practice is problematic.
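As an illustration of the parametric strategy, the sketch below fits the simplest of the listed models, the two-parameter lognormal, by maximum likelihood and compares two samples through their fitted parameters. The particle sizes are synthetic values generated only to exercise the code; they are not the Oronsay measurements, and the names `fit_lognormal`, `sample_a`, and `sample_b` are hypothetical.

```python
import math
import random

def fit_lognormal(sizes):
    """Maximum-likelihood lognormal fit: mean and std. dev. of log(size)."""
    logs = [math.log(s) for s in sizes]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

# SYNTHETIC particle sizes, generated only to exercise the code.
rng = random.Random(0)
sample_a = [rng.lognormvariate(-1.0, 0.3) for _ in range(2000)]
sample_b = [rng.lognormvariate(-0.5, 0.5) for _ in range(2000)]

for name, sample in (("sample_a", sample_a), ("sample_b", sample_b)):
    mu, sigma = fit_lognormal(sample)
    print(f"{name}: mu = {mu:.2f}, sigma = {sigma:.2f}")
```

Comparing samples then reduces to comparing a handful of fitted numbers, which is convenient but discards any structure the chosen distribution cannot represent.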
Sands of Time - Graphical Analysis
Multidimensional Parallel Coordinate Display Combined with Grand Tour.
BRUSH-TOUR strategy
  Clusters recognized by gaps in any horizontal axis.
  Brush existing clusters with colors.
  Execute grand tour until new clusters appear, brush again.
  Continue until clusters are exhausted.
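The BRUSH-TOUR loop described above is interactive and visual, but its logic can be sketched in code under some loud simplifications. In the sketch below, a numeric gap threshold (`min_gap`) stands in for the analyst spotting a gap by eye, integer labels stand in for brush colors, and a fixed sweep of directions through each coordinate plane stands in for the grand tour. That sweep is a crude substitute, not a true grand tour path, and all function names are hypothetical.

```python
import math
from itertools import combinations

def split_at_gaps(values, min_gap):
    """Return groups of positions in `values`, split where sorted neighbours
    differ by more than min_gap."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > min_gap:
            groups.append(current)
            current = []
        current.append(nxt)
    groups.append(current)
    return groups

def tour_directions(dim, step_degrees=15):
    """Crude stand-in for a grand tour: sweep a unit vector through each
    coordinate plane in fixed angular steps (not a true grand tour path)."""
    for j, k in combinations(range(dim), 2):
        for deg in range(0, 180, step_degrees):
            axis = [0.0] * dim
            axis[j] = math.cos(math.radians(deg))
            axis[k] = math.sin(math.radians(deg))
            yield axis

def brush_tour(points, min_gap):
    """Label ('brush') points, adding a new label whenever some view shows a
    gap wider than min_gap inside an existing brushed group."""
    labels = [0] * len(points)
    for axis in tour_directions(len(points[0])):
        proj = [sum(a * x for a, x in zip(axis, p)) for p in points]
        next_label = max(labels) + 1
        for label in sorted(set(labels)):
            members = [i for i, l in enumerate(labels) if l == label]
            groups = split_at_gaps([proj[i] for i in members], min_gap)
            for group in groups[1:]:  # the first group keeps its colour
                for pos in group:
                    labels[members[pos]] = next_label
                next_label += 1
    return labels

# Two made-up parallel strips that overlap on both original axes.
strip_a = [(0.2 * i, 0.2 * i) for i in range(11)]
strip_b = [(0.2 * i, 0.2 * i + 1.0) for i in range(11)]
labels = brush_tour(strip_a + strip_b, min_gap=0.5)
print(labels)
```

In this toy input neither original axis shows a gap wider than the threshold, but the view rotated by 135 degrees does, which is the situation the tour step is meant to expose.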
Mining the Sands of Time
Sands of Time - Conclusions
Sands from the CC site and the CNG site have considerably different particle size distributions and cannot be effectively aggregated.
Data at small and at large particle dimensions is too quantized to be used effectively.
The visually based BRUSH-TOUR strategy is extremely effective at clustering.
Sands of Time - Conclusions Continued
Midden sands are neither modern beach sands nor modern dune sands.
Midden sands are more similar to modern dune sands.
This result does not support the seaward-shift-of-the-beach-dune-interface hypothesis, but suggests the middens were always in the dunes.