Data Mining Technology

Copyright Insightful Corporation. All rights reserved. www.insightful.com Data Mining Technology Conference Insightful Corporation Jim Walter, Vice President of Research & Development Brand Niemann, Computer Scientist, US EPA

Transcript of Data Mining Technology

Page 1: Data Mining Technology

Copyright Insightful Corporation. All rights reserved. www.insightful.com

Data Mining Technology Conference

Insightful Corporation

Jim Walter, Vice President of Research & Development

Brand Niemann, Computer Scientist, US EPA

Page 2: Data Mining Technology

Agenda

A Brief History of Analysis
Data Mining Content Types
  Numbers
  Text
  Image & Signal
Brand Niemann – XML
Time Permitting
  Technology adoption - what to look for
  Demonstration

Page 3: Data Mining Technology

Historical Context

[Figure: timeline of analysis stages]

Stages: Simple Summarization; Complex Calculation; Storage & Retrieval; Complex Reporting; Data Mining; Fusion & Synthesis

Time axis: 1900… 1950 1960 1970 1980 1990 2000

Technologies: Hollerith card; Language (e.g. FTN); Hierarchical & Relational; OLAP / BI; Machine Learning; ??

Page 9: Data Mining Technology

Trend: Increasing Complexity

[Figure: Complexity plotted against Time. Stages: Collection; Aggregation; Simple Reporting; Flexible Reporting; Relationships & Interactions. User role: Clerk to Information Worker.]

Page 15: Data Mining Technology

Technology Adoption Model

[Figure: adoption curve, Adoption Density plotted against Time. Segments: Techies; Early Adopter; Early Majority; Late Majority; Technology Laggards. A Chasm is marked on the curve.]

Page 18: Data Mining Technology

Technology Adoption Model

[Figure: adoption curve, Adoption Density plotted against Time. Segments: Early Adopter; Early Majority; Late Majority; Technology Laggards. Technologies placed along the curve: DBMS; OLAP; BI; Keyword Search; Numeric Data Mining; TextCat; Image Mining; Text Mining; Q&A; Fusion. Legend: Text & Image (Unstructured); Numeric (Structured); Fusion (future).]

Page 19: Data Mining Technology

Trend: From big to HUGE!

Year                            2001         2002         2003         2004
Total Warehouse Size            500 GB       1 TB         10 TB        20 TB
Weekly Transaction Volume       15 GB        50 GB        200 GB       500 GB
Daily Web Log Volume            2 GB         8 GB         20 GB        30 GB
Weekly Promotion Data Records   400 Million  1.2 Billion  3 Billion    4 Billion
Daily Promotion Data Records    50 Million   100 Million  200 Million  300 Million
Weekly Batch Window Constraint  8 Hours      8 Hours      8 Hours      8 Hours
Daily Batch Window Constraint   10 Hours     10 Hours     10 Hours     10 Hours

Customers   Demographic  Behavioral  Time-Based  Explode                   Data Size
100,000     300          450         1,000       8,750    7,000,000,000    7 GB
1,000,000   300          450         1,000       8,750    70,000,000,000   70 GB
10,000,000  300          450         1,000       8,750    700,000,000,000  700 GB
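The data sizes in the customer table can be reproduced with simple arithmetic. The sketch below is an interpretation, not something the slide states: it treats the 8,750 in the Explode column as the number of values held per customer and assumes 8 bytes per value, which is the figure that makes the numbers agree. The function name is hypothetical.

```python
# Hypothetical helper: estimate the size of a customer-by-value table.
# BYTES_PER_VALUE = 8 is an assumption (the slide does not state it); it is
# the value that reproduces the 7 GB, 70 GB, and 700 GB figures.
BYTES_PER_VALUE = 8


def table_size_bytes(customers: int, values_per_customer: int,
                     bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Return customers x values per customer x bytes per value."""
    return customers * values_per_customer * bytes_per_value


for customers in (100_000, 1_000_000, 10_000_000):
    size = table_size_bytes(customers, 8_750)
    print(f"{customers:>10,} customers: {size:,} bytes ({size / 1e9:g} GB)")
```

Under these assumptions the size grows linearly with the customer count, so each tenfold increase in customers adds a factor of ten to storage.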

Page 20: Data Mining Technology

Why is Data Mining Needed?

Very large data (large N)
Many dimensions (large P)
Complicating factors
  Time
  Space
  Seasonality
  Interactions
Key relationships not yet known
  Frequently changing
Need to project forward

Trends

Increasing complexity

Increasing diversity

Increasing scale

Page 21: Data Mining Technology

Content Types

Numeric

Text

Image & Signal

Fusion XML

Page 22: Data Mining Technology

[Figure: data preparation and modeling workflow]

Enterprise Data Silos: SCM; ERP; CRM
Flat File Extracts: SCM; ERP; CRM
Preprocess: Clean; Merge; Transform; Aggregate; Rollup
DM DW: MnthlyTrxSum; Cust Rec
Build View
  • De-Normalize
  • Cust = Row
  • Create feature
Build Model
Evaluate Model
Deploy

Life cycle stages shown alongside: Access; Clean, Interpret & Extract; Train; Evaluate; Predict
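The Build View step (de-normalize, one row per customer, create features) can be illustrated with a minimal sketch. The data, field names, and the `build_view` function are hypothetical and only show the shape of the operation, not any particular product's implementation.

```python
from collections import defaultdict

# Hypothetical transaction extract: (customer_id, month, amount).
transactions = [
    ("c1", "2003-01", 20.0),
    ("c1", "2003-02", 35.0),
    ("c2", "2003-01", 5.0),
]
# Hypothetical customer records: customer_id -> age.
customers = {"c1": 34, "c2": 51}


def build_view(transactions, customers):
    """Return one de-normalized row per customer with derived features."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cust_id, _month, amount in transactions:
        totals[cust_id] += amount
        counts[cust_id] += 1
    view = []
    for cust_id, age in sorted(customers.items()):
        n = counts[cust_id]
        view.append({
            "customer_id": cust_id,
            "age": age,
            "total_spend": totals[cust_id],
            "n_transactions": n,
            # Created feature: average spend per transaction (0 if none).
            "avg_spend": totals[cust_id] / n if n else 0.0,
        })
    return view


view = build_view(transactions, customers)
for row in view:
    print(row)
```

Each customer ends up as a single row, which is the layout most modeling tools expect as input.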

Page 28: Data Mining Technology

Data Access

Delimited text
Fixed format text
Common vendor fmts
  S-PLUS, SAS, SPSS…
  Excel, Lotus, Access…
DBMS
  ODBC/JDBC
  Native
Domain specific
User-created

50%

[Figure: data table with annotations]

Cases, Customers, Citizens

Features, variables, columns

Independent, predictor

Dependent, response
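As a rough illustration of the first two access paths, the sketch below reads the same two cases from delimited text and from fixed format text using only the Python standard library. The sample data and the column layout are hypothetical.

```python
import csv
import io


def read_delimited(text, delimiter=","):
    """Parse delimited text whose first line is a header."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def read_fixed_width(text, fields):
    """Parse fixed format text; fields is a list of (name, start, end)."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append({name: line[start:end].strip()
                         for name, start, end in fields})
    return rows


delimited = "id,age,income\n1,34,52000\n2,51,61000\n"
fixed = "1  34 52000\n2  51 61000\n"
# Hypothetical column layout: character offsets for each field.
layout = [("id", 0, 3), ("age", 3, 6), ("income", 6, 11)]

rows_a = read_delimited(delimited)
rows_b = read_fixed_width(fixed, layout)
print(rows_a == rows_b)
```

Either way, each case becomes a row and each variable a named column; all values arrive as strings and still need type conversion.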

Page 29: Data Mining Technology

Initial Exploration

Scatter Plot

Page 30: Data Mining Technology

Secondary Exploration

Color Plot

Note “noise” at the boundary

Page 31: Data Mining Technology

Training

Page 32: Data Mining Technology

Evaluate the Training effectiveness

Page 33: Data Mining Technology
Page 34: Data Mining Technology

Page 35: Data Mining Technology

Data Mining Life Cycle

(1) Access
(2) Exploration & Feature Extraction
(3) Training
(4) Evaluation
(5) Prediction
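The five stages can be walked through in a few lines. This is a toy sketch under stated assumptions: the records are invented, the derived feature is arbitrary, and the "model" is just a threshold halfway between the two class means, standing in for whatever learner a real tool would train.

```python
# (1) Access: hypothetical in-memory records (age, income, purchased).
records = [
    (25, 30_000, 0), (32, 42_000, 0), (41, 58_000, 1), (52, 75_000, 1),
    (29, 35_000, 0), (47, 66_000, 1), (38, 40_000, 0), (55, 80_000, 1),
]


# (2) Exploration & feature extraction: one derived feature per case.
def feature(age, income):
    return age * income / 1e6


data = [(feature(a, i), y) for a, i, y in records]
train, test = data[:6], data[6:]


# (3) Training: threshold halfway between the two class means.
# Assumes class 1 has the higher mean, which holds for this toy data.
def fit(rows):
    def mean(vals):
        return sum(vals) / len(vals)
    m0 = mean([x for x, y in rows if y == 0])
    m1 = mean([x for x, y in rows if y == 1])
    return (m0 + m1) / 2


# (5) Prediction: apply the trained threshold to one case.
def predict(threshold, x):
    return 1 if x > threshold else 0


# (4) Evaluation: accuracy on rows held out from training.
def accuracy(threshold, rows):
    return sum(predict(threshold, x) == y for x, y in rows) / len(rows)


threshold = fit(train)
print("held-out accuracy:", accuracy(threshold, test))
print("prediction for a new case:", predict(threshold, feature(45, 70_000)))
```

Keeping evaluation on held-out rows separate from training is the point of stage (4); accuracy measured on the training rows alone would be optimistic.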

Page 36: Data Mining Technology

Text Mining Applications

[Figure: architecture diagram. Labels: GUI; VISUALIZATION; INTERPRETATION; QUESTION ANSWERING; EXPLORATORY SEARCH; OTHER TEXT MINING; DATA EXTRACTION; INFORMATION EXTRACTION; DATABASE; DOCUMENTS; LINGUISTIC PRIMITIVES]

Page 37: Data Mining Technology

KEY DIFFERENTIATOR: Creating Structure from Unstructured Text

DATA EXTRACTION

INFORMATION EXTRACTION

DATABASE

DOCUMENTS

LINGUISTIC PRIMITIVES

Page 38: Data Mining Technology

Classical Structured Analysis Techniques

[Figure: architecture diagram. Labels: GUI; VISUALIZATION; INTERPRETATION; REGRESSION; NEURAL NETS; CART…; PATTERN MATCHING; LINGUISTIC NORMALIZATION; LINGUISTIC PRIMITIVES; DATABASE]

Page 39: Data Mining Technology

Current State of Art

data

features

structural relationships

information

• morphological normalization
• semantic normalization

syntactic normalization: governing verb of each sentence, subject, object, etc.

facts databanks

“NEXT GENERATION” SEARCH STOPS HERE!

Page 40: Data Mining Technology

SEARCH ENGINES STORE WORDS

DEEP EXTRACTION STORES FACTS

Nothing can reconstruct original facts from a count of keywords.

Instead, InFact stores facts…

"From 1949 to 1960 China was in alliance with the Soviet Union, although this relationship was already under severe strain in the late 1950s. There followed, in 1960-72, a period of isolation, during which China sought to identify itself as a natural leader of the developing world in its resistance to "US imperialism". From 1972 China found itself in de facto alliance with the US against perceived Soviet expansionism. That epoch came to a definitive end in 1989, when relations with the Soviet Union were normalized and the Beijing massacre introduced new and severe strains into Sino-US relations."
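The contrast between storing words and storing facts can be made concrete with one clause from the passage. The sketch below is illustrative only: the fact record is written by hand rather than extracted automatically, and its field names are hypothetical, not InFact's schema.

```python
import re
from collections import Counter

sentence = "From 1972 China found itself in de facto alliance with the US"

# What a keyword index keeps: word counts, with order and roles discarded.
keyword_counts = Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

# What a fact store keeps: a hand-written (not automatically extracted)
# record; the field names are hypothetical.
fact = {
    "subject": "China",
    "relation": "alliance with",
    "object": "US",
    "from_year": 1972,
}


def allied_with(facts, who):
    """Answer 'who was <who> allied with?' from stored facts."""
    return [f["object"] for f in facts
            if f["subject"] == who and f["relation"] == "alliance with"]


print(keyword_counts["china"], keyword_counts["us"])
print(allied_with([fact], "China"))
```

The counts show that both names occur, but only the fact record says who was allied with whom and from when.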

Page 41: Data Mining Technology

Q&A Example

Page 42: Data Mining Technology

Page 43: Data Mining Technology

Page 44: Data Mining Technology

EXPLORATORY SEARCH

Page 45: Data Mining Technology

TEXT MINING BASED ON SHALLOW IE

[Figure: entities extracted from text, grouped by type]

QUANTITIES: 176,000; 60 cents
COMPANY, ORGANIZATION, COUNTRY NAMES: UBL; China; Boeing; CEA; LVNL
PRODUCTS: JDAM; UK Army Helicopter; Wedgetail Aircraft

Page 46: Data Mining Technology

TEXT MINING BASED ON DEEP IE

[Figure: extracted entities, grouped by type, linked by extracted relations]

QUANTITIES: 176,000; 60 cents
COMPANY, ORGANIZATION, COUNTRY NAMES: UBL; China; Boeing; CEA; LVNL
PRODUCTS: JDAM; UK Army Helicopter; Wedgetail Aircraft
Relations: cooperate; collaborate; roll out; buy; test; drop; lay off; mothball

Page 47: Data Mining Technology

Integrating Text Mining & Visualization

Page 48: Data Mining Technology

Integrating Text Mining & Visualization

Page 49: Data Mining Technology

Image Mining Common Themes

Application Domains:

• Medical Imaging CT, MR, US
• Microscopic Imaging
• Video Processing
• Machine Vision
• Document Imaging
• Remote Sensing
• Tactical Imaging IR, EO
• More…

Insightful Imaging Library

Segmentation

Enhancement

Feature Extraction

Clustering

Classification

Registration

Page 50: Data Mining Technology

Segmentation of Noisy Images

Problem: Noisy image data is hard to interpret and leads to inconsistent outcomes.

Application: Prostate outlining in ultrasound images.

Solution: Non-linear model fitting enables more efficient and consistent delineation for improving cancer treatment planning.

Technology suitable for other image processing applications with high noise, such as SONAR images.

Page 51: Data Mining Technology

Two Observers’ Delineation of the Prostate on Ultrasound Images

Manual Delineation: Note the large variation between the observers

Delineation Using Imaging Library Technology: The inter-observer variation is significantly lower

Page 52: Data Mining Technology

BRINGING IMAGES INTO THE REALM OF STATISTICS

Searching for patterns and objects in images

Analyzing image properties with statistical tools

Organizing databases of images (satellite/medical)

Interpreting and Classifying visual information

Page 53: Data Mining Technology

AUTOMATED FEATURE EXTRACTION FROM IMAGES

[Figure: hierarchical tile indices at MULTIPLE SCALES (0 1 2 3; 20 21 22 23; 210 211 212 213) and MULTIPLE FEATURES (Index 1 … Index k)]
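The index labels in the figure (0 to 3, then 20 to 23, then 210 to 213) resemble quadtree paths, where each digit picks one quadrant at the next finer scale. Assuming that reading, the sketch below computes a single feature, mean intensity, for every tile at every scale. The quadrant ordering, the function name, and the sample image are all assumptions; a real system would compute many features per tile.

```python
def tile_features(image, max_depth):
    """Map quadtree index strings ('0', '21', ...) to mean tile intensity."""
    features = {}

    def visit(r0, c0, size, index, depth):
        if index:  # skip the root, which has no index string
            values = [image[r][c]
                      for r in range(r0, r0 + size)
                      for c in range(c0, c0 + size)]
            features[index] = sum(values) / len(values)
        if depth == max_depth or size == 1:
            return
        half = size // 2
        # Quadrant order 0..3: top-left, top-right, bottom-left,
        # bottom-right (an assumption, not taken from the slide).
        offsets = [(0, 0), (0, half), (half, 0), (half, half)]
        for q, (dr, dc) in enumerate(offsets):
            visit(r0 + dr, c0 + dc, half, index + str(q), depth + 1)

    visit(0, 0, len(image), "", 0)
    return features


# Hypothetical 4x4 grayscale image (side length must be a power of two).
image = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 1, 3],
    [4, 4, 5, 7],
]
features = tile_features(image, max_depth=2)
for index in sorted(features):
    print(index, features[index])
```

Because coarse and fine tiles share index prefixes, a search can start at a coarse scale and descend only into tiles whose features look promising.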

Page 54: Data Mining Technology

SEARCH FOR AIR STRIPS

ANALYZE WITH S-PLUS

CLASSIFY TISSUES AND DISEASES

COURTESY: U. OF DELAWARE & EAST TENNESSEE STATE UNIVERSITY

Page 55: Data Mining Technology

Fusion: Future Model for Analyzing Data

Text Mining: information extraction from unstructured data

Text Image Video

Presentation and Analysis

Data Mining & Prediction

Data Warehouse

Data Integration

Page 56: Data Mining Technology

Critical Issues

Support all required content
  Numeric, text, image, signal…
Support full life cycle
Visualization
“Platform” – look for language-based tools
Scalable – look for a pipeline construct
Extensible – check out the architecture

Page 67: Data Mining Technology

Scalability: Pipeline

Training

Scoring
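One common way to realize a pipeline construct is to stream data through the stages in fixed-size chunks, so memory use depends on the chunk size rather than the data set size. The sketch below uses Python generators to show the idea; the stage names, the data, and the threshold "model" are hypothetical and do not describe any specific product.

```python
from itertools import islice


def read_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def add_feature(chunks):
    """Append a derived feature (age * income) to every row."""
    for chunk in chunks:
        yield [(age, income, age * income) for age, income in chunk]


def score(chunks, threshold):
    """Stand-in for a trained model: flag rows above the threshold."""
    for chunk in chunks:
        yield [1 if feat > threshold else 0 for _age, _income, feat in chunk]


# Hypothetical source of 10,000 (age, income) rows, generated lazily.
source = ((20 + i % 50, 30_000 + 1_000 * (i % 40)) for i in range(10_000))

flagged = 0
for scores in score(add_feature(read_chunks(source, 1_000)), 2_000_000):
    flagged += sum(scores)
print("rows flagged:", flagged)
```

Only one chunk of 1,000 rows is in memory at any time, and the same chain of stages works unchanged on a source of any length.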

Page 68: Data Mining Technology

1-D Visualization

Histograms
Bar charts
Pie charts
Density
Dot
…

Page 69: Data Mining Technology

2-D Visualization

Scatterplot
Line
Box
Strip
QQ
…

Page 70: Data Mining Technology

3-D Visualization

Contour
Level
Surface
Cloud
…

Page 71: Data Mining Technology

Useful Variations

Color plot
Shape plot
Scatterplot matrix
Overlays
Trellis (conditioning)
Should be automated
Should be extensible
Rotation

Page 72: Data Mining Technology

Eye Candy

Slick-looking
Unused dimensions
Hard to interpret
3-D Bar
3-D Pie
Brush & spin
Multi-plane plots

Page 73: Data Mining Technology

Issues

Language based vs. menu-driven
Platform vs. application (build vs. use)
Open ended vs. point solution
Command line
Extensibility
  Open methods
  C++/Java extensibility
  XML/Web Services
Visual programming metaphor
Pipeline architecture
PL1 vs. object oriented
Scalability & Interactivity

Page 77: Data Mining Technology

Architecture & Technologies

[Figure: layered architecture. Layers: Application; Platform; Library. Components: Graphical User Interface; Library API; Algorithm; I/O; Data; Platform API; Pipeline Interpreter; Viz. User is marked at two points.]

Technologies named on the diagram: COM, OLE, DDE; ASP, JSP, EJB; C++, Java, some Ftn, some 4gl; Java & C++; XML

Page 78: Data Mining Technology

Extensible Architecture

Engine

Vendor API

User Code (Algorithms)

User Code (GUI)

Page 79: Data Mining Technology

What influences Success?

Data mining is typically I/O bound
  RAID disk systems
  Locate analysis near data (i.e. not across network)
  Databases & warehouses often too slow – ETL
  Use sampling – especially during exploration
Data is often very dirty
  Data mining tools typically offer sophisticated methods – use them
  Discarding data can skew results
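For the sampling advice, one standard technique is reservoir sampling, which draws a uniform random sample in a single pass without knowing the data size in advance. The slides do not name a method, so this is a swapped-in illustration rather than the presenter's approach; the function name and parameters are hypothetical.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep row i with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Hypothetical stream of 100,000 row identifiers.
sample = reservoir_sample(range(100_000), k=1_000, seed=0)
print(len(sample), min(sample), max(sample))
```

Because the data is read once and only k rows are held in memory, this suits the I/O-bound setting: exploration can run on the sample while the full data stays on disk.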

Page 80: Data Mining Technology

Desktop Performance

Components should be validated on data sets as large as 25GB, with as many as 125,000,000 rows and 5,000 columns.

Reading and writing files operate at ~5.0 MB/s. E.g., read or write 7,000,000 rows by 30 columns data set (1.2 GB) in less than 3 ½ minutes.

~ 6 ½ minutes to train an ensemble of trees on a 1,000,000 rows by 30 columns data set (180 MB).

Scoring components (predictors) should perform at ~500,000 rows per minute.

Page 81: Data Mining Technology

Conclusions

Data is becoming larger, more complex and more diverse in form
Data mining is needed to extract complex relationships from large, high dimension data
Data mining is being applied to all content types including numeric, text, image and signal data
Unstructured data (e.g. text, image) must be structured to be analyzed…

Page 82: Data Mining Technology

Deciding on a Solution

Content breadth of offerings and skills
  Numeric, text, image & signal
Product features
  Support full life cycle
  Language-based
  Scalable
  Visualization
  Extensible

Page 83: Data Mining Technology

Questions for audience

Who knows something about ML?
  Ignorant
  Knowledgeable
  Expert
Tree example
  Customer data
  Reporting (list all customers, age, income, purchase)
  Sort and report
  Create new variable (age*income)
  Viz (scatter, color plots)
  Tree learning
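The tree example can be sketched end to end in a few lines: report the customers, sort them, create the age*income variable, and learn a one-split tree. The customer data is invented, and the split search (a single threshold chosen by weighted Gini impurity) is a minimal stand-in for a full tree learner, not the demonstration that was actually given.

```python
# Hypothetical customer data: (name, age, income, purchase).
customers = [
    ("Ann", 25, 30_000, 0), ("Bob", 52, 75_000, 1), ("Cy", 41, 58_000, 1),
    ("Di", 32, 42_000, 0), ("Ed", 47, 66_000, 1), ("Flo", 29, 35_000, 0),
]

# Reporting: sort by income and list every customer.
for name, age, income, purchase in sorted(customers, key=lambda c: c[2]):
    print(f"{name:<4}{age:>4}{income:>8,}{purchase:>3}")

# New variable: age * income.
rows = [(age * income, purchase) for _name, age, income, purchase in customers]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows):
    """Return (threshold, weighted Gini) of the best single split."""
    values = sorted({x for x, _ in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        impurity = (len(left) * gini(left)
                    + len(right) * gini(right)) / len(rows)
        if best is None or impurity < best[1]:
            best = (t, impurity)
    return best


threshold, impurity = best_split(rows)
print("split at age*income >", threshold, "with impurity", impurity)
```

A full tree learner repeats this split search recursively on each side until the groups are pure or too small to split further.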