Data Mining Technology
Copyright Insightful Corporation. All rights reserved. www.insightful.com
Data Mining Technology Conference
Insightful Corporation
Jim Walter, Vice President of Research & Development
Brand Niemann, Computer Scientist, US EPA
Agenda
A Brief History Of Analysis
Data Mining Content Types
Numbers
Text
Image & Signal
Brand Niemann – XML
Time Permitting
Technology adoption - what to look for
Demonstration
Historical Context
Simple Summarization
Complex Calculation
Storage & Retrieval
Complex Reporting
Data Mining
Fusion & Synthesis
1900… 1950 1960 1970 1980 1990 2000
Hollerith card | Language (e.g. FTN) | Hierarchical & Relational | OLAP / BI | Machine Learning | ??
Trend: Increasing Complexity
Time
Collection
Complexity
Aggregation
Simple Reporting
Flexible Reporting
Relationships & Interactions
Clerk
Information Worker
Technology Adoption Model
Techies
Early Adopter
Early Majority
Late Majority
Technology Laggards
Chasm
Axes: Adoption Density (vertical), Time (horizontal)
Technology Adoption Model
Early Adopter
Early Majority
Late Majority
Technology Laggards
Axes: Adoption Density (vertical), Time (horizontal)
Numeric (Structured): DBMS, OLAP, BI, Numeric Data Mining
Text & Image (Unstructured): Keyword Search, Text Cat, Text Mining, Image Mining, Q&A
Fusion (future): Fusion
Trend: From big to HUGE!
Year                            2001          2002          2003          2004
Total Warehouse Size            500 GB        1 TB          10 TB         20 TB
Weekly Transaction Volume       15 GB         50 GB         200 GB        500 GB
Daily Web Log Volume            2 GB          8 GB          20 GB         30 GB
Weekly Promotion Data Records   400 Million   1.2 Billion   3 Billion     4 Billion
Daily Promotion Data Records    50 Million    100 Million   200 Million   300 Million
Weekly Batch Window Constraint  8 Hours       8 Hours       8 Hours       8 Hours
Daily Batch Window Constraint   10 Hours      10 Hours      10 Hours      10 Hours

Customers    Demographic  Behavioral  Time-Based  Explode  Data Size
100,000      300          450         1,000       8,750    7,000,000,000 (7 GB)
1,000,000    300          450         1,000       8,750    70,000,000,000 (70 GB)
10,000,000   300          450         1,000       8,750    700,000,000,000 (700 GB)
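The data-size figures in the customer table appear to follow from simple arithmetic. The sketch below reproduces them under two assumptions that are not stated on the slide and are inferred only because they match the listed numbers: the "Explode" value is 5 times the sum of the three attribute counts, and each stored value occupies 8 bytes.

```python
# Assumed, not stated on the slide: 8 bytes per value and an
# "explode" multiplier of 5 applied to the attribute total.
BYTES_PER_VALUE = 8
EXPLODE_MULTIPLIER = 5

def exploded_columns(demographic, behavioral, time_based):
    """Columns per customer after the assumed expansion step."""
    return EXPLODE_MULTIPLIER * (demographic + behavioral + time_based)

def data_size_bytes(customers, columns):
    """Total size if every customer row stores `columns` values."""
    return customers * columns * BYTES_PER_VALUE

columns = exploded_columns(300, 450, 1_000)
sizes = {n: data_size_bytes(n, columns) for n in (100_000, 1_000_000, 10_000_000)}
for customers, size in sizes.items():
    print(f"{customers:>10,} customers -> {size:,} bytes ({size / 1e9:.0f} GB)")
```

The point of the table survives any particular choice of constants: data size grows linearly with the customer count once the per-customer column count is fixed.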
Why is Data Mining Needed?
Very large data (large N)
Many dimensions (large P)
Complicating factors: time, space, seasonality, interactions
Key relationships not yet known, frequently changing
Need to project forward
Trends
Increasing complexity
Increasing diversity
Increasing scale
Content Types
Numeric
Text
Image & Signal
Fusion XML
Enterprise Data Silos: SCM, ERP, CRM, …
Flat File Extracts: SCM, ERP, CRM, …
Preprocess: Clean, Merge, Transform, Aggregate, Rollup
DM DW: Mnthly Trx Sum, Cust Rec, …
Build View
• De-Normalize
• Cust = Row
• Create feature
Build Model
Evaluate Model
Deploy
Life cycle stages: Access; Clean, Interpret & Extract; Train; Evaluate; Predict
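The Build View step (de-normalize, one row per customer, create features) can be sketched in a few lines. The record layout, field names, and the derived feature below are hypothetical and chosen only for illustration; the slides do not specify them.

```python
from collections import defaultdict

# Hypothetical extracts: one customer record table and one table of
# monthly transaction summaries (several rows per customer).
customers = [
    {"cust_id": 1, "age": 34, "income": 52_000},
    {"cust_id": 2, "age": 58, "income": 87_000},
]
monthly_trx = [
    {"cust_id": 1, "month": "2003-01", "total": 120.0},
    {"cust_id": 1, "month": "2003-02", "total": 80.0},
    {"cust_id": 2, "month": "2003-01", "total": 300.0},
]

def build_view(customers, monthly_trx):
    """De-normalize to one row per customer and add derived features."""
    totals = defaultdict(list)
    for trx in monthly_trx:
        totals[trx["cust_id"]].append(trx["total"])

    view = []
    for cust in customers:
        spend = totals.get(cust["cust_id"], [])
        row = dict(cust)                      # start from the customer record
        row["months_active"] = len(spend)     # aggregate
        row["total_spend"] = sum(spend)       # rollup
        # Created feature: average monthly spend (0.0 if no transactions).
        row["avg_monthly_spend"] = sum(spend) / len(spend) if spend else 0.0
        view.append(row)
    return view

view = build_view(customers, monthly_trx)
for row in view:
    print(row)
```

Each output row is a flat record that a modeling step can consume directly, which is the purpose of the "Cust = Row" bullet.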
Data Access
Delimited text
Fixed format text
Common vendor formats: S-PLUS, SAS, SPSS…; Excel, Lotus, Access…
DBMS: ODBC/JDBC, native
Domain specific
User-created
50%
Cases, customers, citizens
Features, variables, columns
Independent, predictor
Dependent, response
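The first two access formats on this slide, delimited text and fixed format text, differ only in how a line is cut into fields. The sketch below reads the same records from both; the sample data, column names, and field positions are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data; the column names and widths are assumptions.
delimited = "id,age,income\n1,34,52000\n2,58,87000\n"
fixed = "0001 34  52000\n0002 58  87000\n"
# (name, start, end) character positions for the fixed-format layout.
FIXED_LAYOUT = [("id", 0, 4), ("age", 5, 7), ("income", 9, 14)]

def read_delimited(text):
    """Parse comma-delimited text with a header row into dicts."""
    return [
        {name: int(value) for name, value in row.items()}
        for row in csv.DictReader(io.StringIO(text))
    ]

def read_fixed(text, layout):
    """Parse fixed-format text by slicing each line at known positions."""
    return [
        {name: int(line[start:end]) for name, start, end in layout}
        for line in text.splitlines()
    ]

rows_a = read_delimited(delimited)
rows_b = read_fixed(fixed, FIXED_LAYOUT)
print(rows_a == rows_b)
```

Both readers return one dict per case (row) with one key per variable (column), so later steps do not need to know which format the data came from.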
Initial Exploration
Scatter Plot
Secondary Exploration
Color Plot
Note “noise” at the boundary
Training
Evaluate the Training effectiveness
Data Mining Life Cycle
(1) Access
(2) Exploration & Feature Extraction
(3) Training
(4) Evaluation
(5) Prediction
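The five life cycle stages can be walked through end to end on a toy problem. This is a minimal sketch, not the presenter's method: the data is invented, and the model is a nearest-centroid classifier chosen purely for brevity.

```python
# Toy, hypothetical data: (age, income in thousands, bought?).
RAW = [
    (25, 30, 0), (30, 35, 0), (35, 40, 0), (28, 32, 0),
    (50, 90, 1), (55, 95, 1), (60, 85, 1), (52, 88, 1),
]

def access():
    """(1) Access: here the data is already in memory."""
    return list(RAW)

def extract_features(rows):
    """(2) Feature extraction: scale both inputs to comparable ranges."""
    return [((age / 100, income / 100), label) for age, income, label in rows]

def train(examples):
    """(3) Training: one centroid (mean point) per class."""
    sums = {}
    for (x, y), label in examples:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(model, point):
    """(5) Prediction: the class whose centroid is closest."""
    x, y = point
    return min(model, key=lambda c: (model[c][0] - x) ** 2 + (model[c][1] - y) ** 2)

def evaluate(model, examples):
    """(4) Evaluation: fraction of held-out examples classified correctly."""
    hits = sum(predict(model, point) == label for point, label in examples)
    return hits / len(examples)

examples = extract_features(access())
train_set, test_set = examples[::2], examples[1::2]   # simple alternating split
model = train(train_set)
accuracy = evaluate(model, test_set)
new_customer = predict(model, (0.45, 0.80))
print(accuracy, new_customer)
```

Evaluation uses rows that were held out of training, which is what separates stage (4) from simply re-scoring the training data.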
Text Mining Applications
VISUALIZATION
INTERPRETATION
DATA EXTRACTION
OTHER TEXT MINING
QUESTION ANSWERING
EXPLORATORY SEARCH
INFORMATION EXTRACTION
GUI
DATABASE
DOCUMENTS
LINGUISTIC PRIMITIVES
KEY DIFFERENTIATOR: Creating Structure from Unstructured Text
DATA EXTRACTION
INFORMATION EXTRACTION
DATABASE
DOCUMENTS
LINGUISTIC PRIMITIVES
Classical Structured Analysis Techniques
VISUALIZATION
INTERPRETATION
LINGUISTIC PRIMITIVES
REGRESSION
NEURAL NETS
CART…
PATTERN MATCHING
LINGUISTIC NORMALIZATION
GUI
DATABASE
Current State of Art
data
features
structural relationships
information
• morphological normalization
• semantic normalization
syntactic normalization: governing verb of each sentence, subject, object, etc.
facts databanks
“NEXT GENERATION” SEARCH STOPS HERE!
SEARCH ENGINES STORE WORDS
DEEP EXTRACTION STORES FACTS
Nothing can reconstruct original facts from a count of keywords.
Instead, InFact stores facts…
"From 1949 to 1960 China was in alliance with the Soviet Union, although this relationship was already under severe strain in the late 1950s. There followed, in 1960-72, a period of isolation, during which China sought to identify itself as a natural leader of the developing world in its resistance to "US imperialism". From 1972 China found itself in de facto alliance with the US against perceived Soviet expansionism. That epoch came to a definitive end in 1989, when relations with the Soviet Union were normalized and the Beijing massacre introduced new and severe strains into Sino-US relations."
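The contrast between storing words and storing facts can be made concrete with the China passage quoted above. In this sketch the tuple layout (subject, relation, object, years) is a hypothetical illustration and not InFact's actual representation; the dates come from the quoted text.

```python
from collections import Counter

sentence = "From 1949 to 1960 China was in alliance with the Soviet Union"
swapped = "From 1949 to 1960 the Soviet Union was in alliance with China"

def keyword_counts(text):
    """What a keyword index keeps: how often each word occurs."""
    return Counter(text.lower().split())

# What a fact store keeps: who stood in which relation to whom, and when.
# The tuple layout is an assumption made for this example.
facts = [
    ("China", "in alliance with", "Soviet Union", (1949, 1960)),
    ("China", "in de facto alliance with", "US", (1972, 1989)),
]

def allies_of(subject, year):
    """Return the partners recorded for `subject` in a given year."""
    return [obj for subj, _rel, obj, (start, end) in facts
            if subj == subject and start <= year <= end]

print(keyword_counts(sentence) == keyword_counts(swapped))
print(allies_of("China", 1955), allies_of("China", 1980), allies_of("China", 1965))
```

Two differently ordered sentences produce identical keyword counts, so word order and roles are lost, while the fact tuples can still answer a dated question such as "who was China allied with in 1955?".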
Q&A Example
EXPLORATORY SEARCH
TEXT MINING BASED ON SHALLOW IE
QUANTITIES
COMPANY, ORGANIZATION, COUNTRY NAMES
PRODUCTS
UBL
China
176,000
60 cents
Boeing
CEA
LVNL
JDAM
UK Army Helicopter
Wedgetail Aircraft
TEXT MINING BASED ON DEEP IE
QUANTITIES
COMPANY, ORGANIZATION, COUNTRY NAMES
PRODUCTS
UBL
China
176,000
60 cents
Boeing
CEA
LVNL
JDAM
UK Army Helicopter
Wedgetail Aircraft
cooperate
collaborate
roll out
buy
test
drop
lay off
mothball
Integrating Text Mining & Visualization
Image Mining Common Themes
Application Domains:
• Medical Imaging: CT, MR, US
• Microscopic Imaging
• Video Processing
• Machine Vision
• Document Imaging
• Remote Sensing
• Tactical Imaging: IR, EO
• More…
Insightful Imaging Library
Segmentation
Enhancement
Feature Extraction
Clustering
Classification
Registration
Segmentation of Noisy Images
Problem: Noisy image data is hard to interpret and leads to inconsistent outcomes.
Application: Prostate outlining in ultrasound images.
Solution: Non-linear model fitting enables more efficient and consistent delineation for improving cancer treatment planning.
Technology suitable for other image processing applications with high noise, such as SONAR images.
Two Observers’ Delineation of the Prostate on Ultrasound Images
Manual Delineation: Note the large variation between the observers
Delineation Using Imaging Library Technology: The inter-observer variation is significantly lower
BRINGING IMAGES INTO THE REALM OF STATISTICS
Searching for patterns and objects in images
Analyzing image properties with statistical tools
Organizing databases of images (satellite/medical)
Interpreting and Classifying visual information
AUTOMATED FEATURE EXTRACTION FROM IMAGES
[Figure: hierarchical image decomposition with nested cell labels (0, 1, 2, 3; 20, 21, 22, 23; 210, 211, 212, 213), indexed from Index 1 to Index k across MULTIPLE FEATURES and MULTIPLE SCALES]
SEARCH FOR AIR STRIPS
ANALYZE WITH S-PLUS
CLASSIFY TISSUES AND DISEASES
COURTESY: U. OF DELAWARE & EAST TENNESSEE STATE UNIVERSITY
Fusion: Future Model for Analyzing Data
Text Mining: information extraction from unstructured data
Text Image Video
Presentation and Analysis
Data Mining & Prediction
Data Warehouse
Data Integration
Critical Issues
Support all required content: numeric, text, image, signal…
Support full life cycle
Visualization
“Platform” – look for language-based tools
Scalable – look for a pipeline construct
Extensible – check out the architecture
Scalability: Pipeline
Training
Scoring
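The pipeline slides carry no text beyond their titles, so the sketch below rests on an assumption: that "pipeline" here means passing data block by block through a chain of nodes, so the full data set never has to sit in memory. The node names, the data, and the trivial threshold model are all hypothetical.

```python
def read_chunks(rows, chunk_size):
    """Source node: yield the data one block of rows at a time."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]

def drop_missing(chunks):
    """Filter node: remove rows with missing values, block by block."""
    for chunk in chunks:
        yield [row for row in chunk if None not in row]

def train_mean_threshold(chunks):
    """Training node: a running mean of x, updated per block."""
    total, count = 0.0, 0
    for chunk in chunks:
        total += sum(x for x, _label in chunk)
        count += len(chunk)
    return total / count

def score(chunks, threshold):
    """Scoring node: label each row as above (1) or below (0) the threshold."""
    for chunk in chunks:
        yield [int(x > threshold) for x, _label in chunk]

# Hypothetical rows of (x, label); None marks a missing value.
rows = [(1.0, 0), (2.0, 0), (None, 1), (9.0, 1), (8.0, 1), (3.0, 0)]

threshold = train_mean_threshold(drop_missing(read_chunks(rows, chunk_size=2)))
scores = [s
          for block in score(drop_missing(read_chunks(rows, chunk_size=2)), threshold)
          for s in block]
print(threshold, scores)
```

Because every node is a generator, only one block is held at a time, and the same source and filter nodes are reused unchanged for both the training pass and the scoring pass.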
1-D Visualization
Histograms
Barcharts
Piecharts
Density
Dot
…
2-D Visualization
Scatterplot
Line
Box
Strip
QQ
…
3-D Visualization
Contour
Level
Surface
Cloud
…
Useful Variations
Color plot
Shape plot
Scatterplot matrix
Overlays
Trellis (conditioning)
Should be automated
Should be extensible
Rotation
Eye Candy
Slick-looking
Unused dimensions
Hard to interpret
3-D Bar
3-D Pie
Brush & spin
Multi-plane plots
Issues
Language based vs. menu-driven
Platform vs. application (build vs. use)
Open ended vs. point solution
Command line
Extensibility: open methods, C++/Java extensibility, XML/Web Services
Visual programming metaphor
Pipeline architecture
PL1 vs. object oriented
Scalability & interactivity
Architecture & Technologies
Application
Platform
Library
Graphical User Interface
Library API
Algorithm, I/O, Data
Platform API
Pipeline Interpreter
Viz
User
User
COM, OLE, DDE; ASP, JSP, EJB
C++, Java, some Ftn, some 4gl
Java & C++
XML
Extensible Architecture
Engine
Vendor API
User Code (Algorithms)
User Code (GUI)
What influences Success?
Data mining is typically I/O bound
• RAID disk systems
• Locate analysis near data (i.e. not across network)
• Databases & warehouses often too slow – ETL
• Use sampling – especially during exploration
Data is often very dirty
• Data mining tools typically offer sophisticated methods – use them
• Discarding data can skew results
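The advice to use sampling during exploration can be followed even when the data is too large to load or its size is not known in advance. One common way is reservoir sampling, shown here as a sketch; the slides do not name a particular sampling method, and the generated stream merely stands in for rows read from disk.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Keep a uniform random sample of k rows from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                sample[j] = row         # replace with probability k / (i + 1)
    return sample

stream = (n * n for n in range(100_000))   # stands in for rows read from disk
sample = reservoir_sample(stream, k=1_000)
print(len(sample))
```

The data is read once, sequentially, which suits an I/O-bound workload, and memory use is fixed by the sample size rather than by the size of the source.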
Desktop Performance
Components should be validated on data sets as large as 25GB, with as many as 125,000,000 rows and 5,000 columns.
Reading and writing files operate at ~5.0 MB/s. E.g., read or write 7,000,000 rows by 30 columns data set (1.2 GB) in less than 3 ½ minutes.
~ 6 ½ minutes to train an ensemble of trees on a 1,000,000 rows by 30 columns data set (180 MB).
Scoring components (predictors) should perform at ~500,000 rows per minute.
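The performance figures on this slide can be checked against each other with a little arithmetic. The sketch assumes decimal units (1 MB = 1,000,000 bytes, 1 GB = 1,000 MB); the slide does not say which convention it uses.

```python
MB = 1_000_000  # decimal megabytes assumed

def seconds_to_transfer(size_bytes, mb_per_second):
    """Time to read or write a file at a given sustained rate."""
    return size_bytes / (mb_per_second * MB)

def implied_mb_per_second(size_bytes, seconds):
    """Sustained rate needed to move a file in a given time."""
    return size_bytes / seconds / MB

size = 1.2e9                                           # the 1.2 GB data set
at_quoted_rate = seconds_to_transfer(size, 5.0)        # time at ~5.0 MB/s
needed_rate = implied_mb_per_second(size, 3.5 * 60)    # rate for 3 1/2 minutes
bytes_per_value = size / (7_000_000 * 30)              # average storage per cell
scoring_minutes = 7_000_000 / 500_000                  # scoring the same rows

print(f"{at_quoted_rate:.0f} s at 5.0 MB/s")
print(f"{needed_rate:.1f} MB/s needed for 3.5 minutes")
print(f"{bytes_per_value:.1f} bytes per value")
print(f"{scoring_minutes:.0f} minutes to score 7,000,000 rows")
```

Under these assumptions the 1.2 GB example takes about 240 seconds at 5.0 MB/s, so finishing in under 3 ½ minutes implies a rate closer to 5.7 MB/s; the two figures on the slide should be read as approximate.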
Conclusions
Data is becoming larger, more complex and more diverse in form
Data mining is needed to extract complex relationships from large, high-dimensional data
Data mining is being applied to all content types including numeric, text, image and signal data
Unstructured data (e.g. text, image) must be structured to be analyzed…
Deciding on a Solution
Content breadth of offerings and skills: numeric, text, image & signal
Product features: support full life cycle, language-based, scalable, visualization, extensible
Questions for audience
Who knows something about ML? Ignorant / Knowledgeable / Expert
Tree example
• Customer data
• Reporting (list all customers, age, income, purchase)
• Sort and report
• Create new variable (age*income)
• Viz (scatter, color plots)
• Tree learning
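The tree example steps (report, sort, create the age*income variable, learn a tree) can be sketched on invented customer data. The customers below are hypothetical, and the "tree" is reduced to a single split chosen by Gini impurity, which is one common splitting criterion and not necessarily the one used in the demonstration.

```python
# Hypothetical customer data: (name, age, income in thousands, purchased?).
customers = [
    ("Ann", 25, 30, 0), ("Bob", 32, 45, 0), ("Cho", 41, 38, 0),
    ("Dee", 38, 70, 1), ("Eli", 52, 64, 1), ("Fay", 60, 90, 1),
]

# Sort and report: list customers by income, highest first.
report = sorted(customers, key=lambda c: c[2], reverse=True)

# Create new variable: age * income, paired with the purchase label.
rows = [(age * income, bought) for _name, age, income, bought in customers]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows):
    """Tree learning, one level deep: find the threshold on the new
    variable that minimizes the weighted Gini impurity of the two sides."""
    values = sorted({value for value, _ in rows})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in rows if value <= threshold]
        right = [label for value, label in rows if value > threshold]
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

threshold, cost = best_split(rows)
print(report[0][0], threshold, cost)
```

A full tree learner would apply the same search recursively to each side of the split and across every candidate variable, not just the derived one.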