Statistical Data Mining - 3 Edward J. Wegman A Short Course for Interface ‘01.

Statistical Data Mining - 3

Edward J. Wegman

A Short Course for Interface ‘01

Visual Data Mining

Outline of Lecture

Visual Complexity

Description of Basic Techniques
  Parallel Coordinates
  Grand Tour
  Saturation Brushing

Illustrations of Basic Techniques
  Rapid Data Editing, Density Estimation (Pollen Data)
  Inverse Regression, Tree Structured Decision Rules (Bank Data)
  Classification & Clustering (SALAD Data & Artificial Nose)
  Structural Inference (PRIM 7 Data)
  Data Mining (BLS Cereal Scanner Data)
  Cluster Trees (Oronsay Sand Particle Size Data)

Visual Complexity

Scenarios

Typical high resolution workstations, 1280x1024 = 1.31x10^6 pixels

Realistic using Wegman, immersion, 4:5 aspect ratio, 2333x1866 = 4.35x10^6 pixels

Very optimistic using 1 minute arc, immersion, 4:5 aspect ratio, 8400x6720 = 5.64x10^7 pixels

Wildly optimistic using Maar(2), immersion, 4:5 aspect ratio, 17,284x13,828 = 2.39x10^8 pixels
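The pixel totals for the four scenarios are plain width-times-height products. A minimal sketch that recomputes them is given here; the labels are shortened from the slide, and the names `SCENARIOS`, `pixel_count` and `PIXELS` are illustrative only.

```python
# Display scenarios from the slide: (label, width, height) in pixels.
SCENARIOS = [
    ("Typical high resolution workstation", 1280, 1024),
    ("Realistic (Wegman), immersion, 4:5", 2333, 1866),
    ("Very optimistic (1 minute arc), immersion, 4:5", 8400, 6720),
    ("Wildly optimistic (Maar(2)), immersion, 4:5", 17284, 13828),
]


def pixel_count(width, height):
    """Total number of addressable pixels on a width x height display."""
    return width * height


# Map each scenario label to its pixel total.
PIXELS = {label: pixel_count(w, h) for label, w, h in SCENARIOS}

for label, count in PIXELS.items():
    # Print the exact total and the same value in scientific notation.
    print(f"{label}: {count:,} pixels ({count:.2e})")
```

Each pixel can show at most one distinguishable observation, so these totals bound how many points a single static display can resolve.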

Visual Complexity

Visualization for Data Mining can realistically hope to deal with somewhere on the order of 10^6 to 10^7 observations. This coincides with the approximate limits for interactive computing of O(n^2) algorithms and for data transfer. This also roughly corresponds to the number of foveal cones in the eye.

Methodologies for Visual Data Mining

Parallel Coordinates
  Effective Method for High Dimensional Data
  High Dimensions = Multiple Attributes

Grand Tour
  Generalized Rotation in High Dimensions
  In Depth Study of High Dimensional Data

Saturation Brushing
  Effective Method for Large Data Sets
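The slide only names saturation brushing. One common reading is that each point is drawn with a very small amount of color saturation and overplotted points add up until the pixel is fully saturated, so that dense regions stand out in large data sets. The sketch below follows that reading; the function name, the grid representation and the `increment` value are assumptions for illustration, not the method used in the course software.

```python
def saturation_brush(points, width, height, increment=0.05):
    """Accumulate a small saturation increment per plotted point.

    points: iterable of (x, y) pairs already scaled to the unit square.
    Returns a height x width grid of saturations clipped at 1.0.
    The increment of 0.05 is a hypothetical choice.
    """
    grid = [[0.0] * width for _ in range(height)]
    for x, y in points:
        # Map unit-square coordinates to a pixel, clamping the upper edge.
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        # Overplotting adds saturation until the pixel is fully saturated.
        grid[row][col] = min(1.0, grid[row][col] + increment)
    return grid


# Thirty overplotted points in one pixel and a single isolated point.
points = [(0.1, 0.1)] * 30 + [(0.9, 0.9)]
grid = saturation_brush(points, width=4, height=4)
print(grid[0][0], grid[3][3])
```

The heavily overplotted pixel ends fully saturated while the isolated point stays faint, which is the contrast that makes the approach usable when many observations share a pixel.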

Visual Data Mining Techniques

Multidimensional Data Visualization
  Scatterplot matrix
  Parallel coordinate plots
  3-D stereoscopic scatterplots
  Grand tour on all plot devices
  Density plots
  Linked views
  Saturation brushing
  Pruning and cropping
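Of the techniques listed, the parallel coordinate plot is the easiest to sketch in a few lines: each d-dimensional observation becomes a polyline with one vertex per axis, and each axis is scaled independently. The following is a minimal illustration of that mapping under min-max scaling; the function name and the scaling rule are assumptions, and no drawing is done.

```python
def parallel_coordinate_polylines(rows):
    """Map each d-dimensional observation to d vertices (axis_index, scaled_value).

    Each axis is min-max scaled to [0, 1] on its own, so attributes measured
    in different units can share one display.
    """
    dims = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(dims)]
    highs = [max(r[j] for r in rows) for j in range(dims)]
    polylines = []
    for r in rows:
        line = []
        for j in range(dims):
            span = highs[j] - lows[j]
            # A constant attribute is placed at mid-axis.
            scaled = (r[j] - lows[j]) / span if span else 0.5
            line.append((j, scaled))
        polylines.append(line)
    return polylines


# Three observations with three attributes on very different scales.
data = [
    (1.0, 10.0, 100.0),
    (2.0, 30.0, 300.0),
    (3.0, 20.0, 200.0),
]
lines = parallel_coordinate_polylines(data)
for line in lines:
    print(line)
```

Connecting the vertices of each polyline across the parallel axes gives the display; adding a dimension adds one more axis rather than a new plot.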

Crystal Vision

[Four figure-only slides]

Data Editing and Density Estimation

Pollen Data
  3848 points
  5 dimensions

[Six figure-only slides titled "Pollen Data"]

Inverse Regression and Tree Structured Decision Rules with Financial Data

Bank Demographic Data in 8 Dimensions with 12,000+ points

[Three figure-only slides titled "Inverse Regression and Tree Structured Decision Rules with Financial Data"]

Classification and Clustering Using SALAD Data

Chemical Agent Detection Data in 13 Dimensions with 10,000+ points

[Two figure-only slides titled "Classification and Clustering Using SALAD Data"]

Artificial Dog Nose

19 dimensional time series in 2 spectral bands
60 time steps for 300 chemical species

Artificial Dog Nose

Time series in two spectral bands for same chemical species

Artificial Dog Nose

Phase loop

Artificial Dog Nose

Orthogonal components

Artificial Dog Nose

After grand tour, orthogonal variables x2*, x9*, x15*, x16*, x18* separate the two spectral bands
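The starred variables are the coordinates of the data after a grand tour rotation. A minimal sketch of one way to generate such rotations is given here: one plane rotation per pair of axes, with angles advancing at incommensurate rates (a torus-style tour). This is an illustrative stand-in, not necessarily the algorithm used to produce the slide; `RATES`, `tour_rotation` and the three-dimensional toy data are assumptions.

```python
import math


def identity(dim):
    """dim x dim identity matrix as nested lists."""
    return [[1.0 if r == c else 0.0 for c in range(dim)] for r in range(dim)]


def matmul(a, b):
    """Plain nested-list matrix product a @ b."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def plane_rotation(dim, i, j, angle):
    """Rotation by `angle` in the plane spanned by axes i and j."""
    rot = identity(dim)
    rot[i][i] = rot[j][j] = math.cos(angle)
    rot[i][j] = -math.sin(angle)
    rot[j][i] = math.sin(angle)
    return rot


def tour_rotation(dim, t, rates):
    """Compose one plane rotation per axis pair; each angle is rate * t."""
    pairs = [(i, j) for i in range(dim) for j in range(i + 1, dim)]
    rot = identity(dim)
    for (i, j), rate in zip(pairs, rates):
        rot = matmul(rot, plane_rotation(dim, i, j, rate * t))
    return rot


def rotate(points, rot):
    """Apply the rotation to each row vector, giving the variables x1*, x2*, ..."""
    return [[sum(p[k] * rot[k][c] for k in range(len(p)))
             for c in range(len(rot[0]))] for p in points]


# Hypothetical incommensurate angular rates, one per axis pair in 3 dimensions.
RATES = [math.sqrt(2), math.sqrt(3), math.sqrt(5)]
points = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
# Three successive frames of the tour.
frames = [rotate(points, tour_rotation(3, t, RATES)) for t in (0.0, 0.1, 0.2)]
```

Because every frame is an orthogonal rotation, distances between observations are preserved; only the axes onto which the data are projected change, which is why a separation can appear in a few rotated variables.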

Artificial Dog Nose

Four chemical species, target highlighted in red

Artificial Dog Nose

Target species separated by x1*, x3*, x5*, x6*, x11*, x15*

PRIM-7

7 dimensional high energy physics data

500 data points

pi-meson proton interaction

Structural Inference Using PRIM 7 Data

[Five figure-only slides]

Scanner Data for Breakfast Cereals

5.5 gigabytes of scanner data in relational database
Price, sales volume, promotion, store, chain, PSU, UPC
Work done at BLS

Phase 1 – Basic Data Analysis – Single Month
Phase 2 – Price Relative Effects – 1 Year
Phase 3 – Churning Effects – 5 Years

Scanner Data for Breakfast Cereals

Promotion has huge impact on sales volume

Scanner Data for Breakfast Cereals

Stores not randomized

Scanner Data for Breakfast Cereals

Aggressive promotion pays

[Two figure-only slides titled "Scanner Data for Breakfast Cereals"]

Scanner Data for Breakfast Cereals

Phase 2

[One figure-only slide titled "Scanner Data for Breakfast Cereals"]

Scanner Data for Breakfast Cereals

Outliers belong to same chain

Scanner Data for Breakfast Cereals

Promotion both years

Scanner Data for Breakfast Cereals

Range of items with no promotion

Scanner Data for Breakfast Cereals

One chain ceased promotions

Scanner Data for Breakfast Cereals

Phase 3

Scanner Data for Breakfast Cereals

Churning comes from both new items and new stores

Scanner Data for Breakfast Cereals

Churning Effects: Red: PR=0, Blue: PR>0, Green: PR=infinity
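The slide does not define PR. Assuming it stands for the price relative named in Phase 2, taken as the ratio of the current-period price to the previous-period price with an absent item recorded at price zero, the three colors separate discontinued items, continuing items and new items. The sketch below encodes that reading; both function names and the zero-price convention are assumptions.

```python
import math


def price_relative(previous_price, current_price):
    """Assumed definition: current price divided by previous price.

    A price of 0.0 marks an item absent in that period (an assumption),
    so a new item gets PR = infinity and a discontinued item gets PR = 0.
    """
    if previous_price == 0 and current_price == 0:
        raise ValueError("item absent in both periods")
    if previous_price == 0:
        return math.inf
    return current_price / previous_price


def churn_color(pr):
    """Color coding from the slide: red PR=0, blue PR>0, green PR=infinity."""
    if pr == 0:
        return "red"
    if math.isinf(pr):
        return "green"
    return "blue"


# (previous price, current price) for three hypothetical items.
items = {"discontinued": (2.0, 0.0), "continuing": (2.0, 3.0), "new": (0.0, 3.0)}
colors = {name: churn_color(price_relative(p, c)) for name, (p, c) in items.items()}
print(colors)
```

Under this reading the red and green groups are exactly the items that enter or leave the sample between periods, which is the churning the slides go on to examine.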

Scanner Data for Breakfast Cereals

New items tend to have higher prices

Scanner Data for Breakfast Cereals

Many discontinued items have high expenditures

Scanner Data for Breakfast Cereals

Effect of item churning

Scanner Data for Breakfast Cereals

Removing Store Birth-Death Effects

Scanner Data for Breakfast Cereals

Outlier due to price coding error

Scanner Data for Breakfast Cereals

Effects of Cereal Types

Scanner Data for Breakfast Cereals

Quantity Effects

Sands of Time Data

300 Samples of Sand Data from Oronsay Island in the Scottish Hebrides

Sands of Time - Objective

“The mesolithic shell middens on the island of Oronsay are one of the most important archeological sites in Britain. It is of considerable interest to determine their position with respect to the mesolithic coastline. If the sand below the midden were beach sand and the sand from the upper layers dune sand, this would indicate a seaward shift of the beach-dune interface.”

Flenley and Olbricht, 1993

Sands of Time - Objective

Cluster samples of modern sand into “beach-like” or “dune-like” sand.

Classify archeological sand samples as to whether they are beach sand or dune sand.

Sands of Time – Parametric Analysis

Historical strategy is to fit parametric distributions and compare modern and archeological sands based on parameters.

Weibull, 1933; lognormal (breakage models), log-hyperbolic, log-skew-Laplace, 1937, Barndorff-Nielsen, 1977.

Models have 2 to 4 parameters; theory developed, practice problematic.

Sands of Time - Graphical Analysis

Multidimensional Parallel Coordinate Display Combined with Grand Tour.

BRUSH-TOUR strategy
  Clusters recognized by gaps in any horizontal axis.
  Brush existing clusters with colors.
  Execute grand tour until new clusters appear, brush again.
  Continue until clusters are exhausted.
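In the BRUSH-TOUR strategy the analyst spots the gaps by eye. As a rough automated stand-in for a single brushing step, the sketch below splits the observations on one axis wherever consecutive sorted values are separated by more than a threshold, then assigns each group a color. The threshold `min_gap`, the palette and both function names are hypothetical; in the full strategy this step would be repeated after each grand tour rotation.

```python
def split_at_gaps(values, min_gap):
    """Group observation indices on one axis, starting a new group at each gap > min_gap."""
    if not values:
        return []
    order = sorted(range(len(values)), key=lambda i: values[i])
    clusters = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > min_gap:
            clusters.append([])  # a gap on the axis separates two clusters
        clusters[-1].append(cur)
    return clusters


def brush(values, min_gap, palette=("red", "green", "blue", "yellow")):
    """Assign one color per cluster found on this axis (colors cycle if needed)."""
    colors = [None] * len(values)
    for k, cluster in enumerate(split_at_gaps(values, min_gap)):
        for i in cluster:
            colors[i] = palette[k % len(palette)]
    return colors


# Values of seven observations on one parallel-coordinate axis.
axis = [0.10, 0.12, 0.15, 0.70, 0.72, 0.11, 0.74]
colors = brush(axis, min_gap=0.3)
print(colors)
```

Colors assigned in one step travel with the observations through later rotations, so clusters that only separate in a rotated view can be brushed within groups that are already marked.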

Mining the Sands of Time

[Seven figure-only slides]

Sands of Time - Conclusions

Sands from the CC site and the CNG site have considerably different particle size distributions and cannot be effectively aggregated.

Data at small and at large particle dimensions is too quantized to be used effectively.

The visual based BRUSH-TOUR strategy is extremely effective at clustering.

Sands of Time - Conclusions Continued

Midden sands are neither modern beach sands nor modern dune sands.

Midden sands are more similar to modern dune sands.

This result does not support the seaward-shift-of-the-beach-dune-interface hypothesis, but suggests the middens were always in the dunes.