Data Mining for Information Retrieval, Business and Scientific … · 2010-10-21 · Data Mining...
June 2010 Indo-Spain ICT Workshop 1
Data Mining for Information Retrieval, Business and Scientific Applications
Jayant Haritsa
Database Systems Lab, SERC
Indian Institute of Science
DM Research Map in India
• IIIT Bangalore
– Srinath Srinivasa [Graph, Doc]
• IISc Bangalore (CSA, EE, SERC)
– M Narasimha Murthy [ML, ARM]
– C Bhattacharyya [ML, Bio, Hand]
– Shirish Shevade [SVM]
– P S Sastry [Temporal, DecTrees]
– Jayant Haritsa [ARM, Privacy]
• IIT Bombay (CSE, Chemical)
– Sunita Sarawagi [IR]
– Soumen Chakrabarti [Web]
– Saketa Nath [ML]
– Pramod Wangikar [Bio]
• IIT Madras (CSE)
– P Sreenivasa Kumar [Text, XML]
– D Janaki Ram [S/W Engg]
– B Ravindran [Text, Tutoring, Motion]
• IIIT Hyderabad
– Kamal Karlapalem [Ecom, Cluster]
– Vikram Pudi [ARM, Multimedia]
– P Krishna Reddy [ARM]
• IIT Delhi (CSE)
– S K Gupta [ARM, Security]
• IIT Kanpur (CSE)
– Arnab Bhattacharya [Spatial, Temporal, Bio]
• IIT Kharagpur (CSE)
– Pabitra Mitra [ML]
Temporal Data Mining [P S Sastry]
Temporal Data Mining: Analyzing Symbolic Time Series Data
• A temporal data mining framework
• Data is a sequence of events. Each event has a type and a time of occurrence.
• ‘Patterns’ in the formalism are episodes: partially ordered sets of event types.
• Episode occurrence: events in the data that conform to the partial order.
TDM (contd)
• Algorithms are developed to discover frequent episodes of different structures. Can handle additional temporal constraints.
• Statistical theory developed to assess significance under various null hypothesis models.
• Unified view of counting-based CS and model-based Stats approaches [equivalence between frequent episodes and HMMs]
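To make the episode idea concrete, here is a minimal sketch of frequency counting for the simplest case, a serial episode (a totally ordered list of event types). It counts non-overlapped occurrences with a single left-to-right pass. The function name, the event encoding as `(type, time)` pairs, and the choice of the non-overlapped frequency measure are assumptions made for illustration; the slides do not specify the algorithms actually used, and temporal constraints are not handled here.

```python
def count_non_overlapped(events, episode):
    """Count non-overlapped occurrences of a serial episode.

    events:  list of (event_type, time) pairs, sorted by time.
    episode: non-empty tuple of event types that must occur in this order.
    """
    count = 0
    position = 0  # index of the episode element we are waiting for
    for event_type, _time in events:
        if event_type == episode[position]:
            position += 1
            if position == len(episode):
                # One full occurrence seen; start looking for the next one.
                count += 1
                position = 0
    return count


# Hypothetical fault-code stream: (code, timestamp)
events = [("A", 1), ("C", 2), ("B", 3), ("A", 4), ("A", 5), ("B", 7), ("C", 8)]

print(count_non_overlapped(events, ("A", "B")))       # 2
print(count_non_overlapped(events, ("A", "B", "C")))  # 1
```

An episode would then be called frequent when this count exceeds a user-chosen threshold.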
Example Data Sequences
• Fault report logs in manufacturing plants: root cause diagnostics
• Multi-neuronal spike train data: discovering microcircuits or groups of neurons with ‘strong’ interactions
• Web navigation data: prediction of user behaviour
An Example Application
• Data: Sequence of fault codes in an assembly plant
• Event types: fault codes, date and time
• Episodes: Causative (?) Chains
• Rules derived from episodes provide help in root cause diagnosis
• Currently in use in some engine assembly plants of General Motors
Application: Status logs from Assembly Plants
[Figure labels: Fault Logs, Fault Correlations, Root cause diagnostics]
Associated Publications
• S Laxman, PS Sastry & KP Unnikrishnan, Discovering frequent episodes and learning HMMs: A formal connection, IEEE Trans Knowledge Data Engg, Vol 17, pp 1505-1517, November 2005.
• S Laxman, PS Sastry & KP Unnikrishnan, Learning frequent generalized episodes when events persist for different durations, IEEE Trans Knowledge Data Engg, Vol 19, pp 1188-1201, September 2007.
• D Patnaik, PS Sastry & KP Unnikrishnan, Inferring neuronal network connectivity from spike data: A temporal data mining approach, Scientific Programming (special issue on biological data mining), Vol 16, pp 49-77, 2008.
Relevant Publications (contd)
• C Diekman, PS Sastry & KP Unnikrishnan, Statistical significance of sequential firing patterns in multi-neuronal spike trains, J. Neuroscience Methods, Vol 182, pp 279-284, 2009.
• PS Sastry & KP Unnikrishnan, Conditional probability based significance test for sequential patterns in multi-neuronal spike trains, Neural Computation, Vol 22, pp 1025-1059, April 2010.
• KP Unnikrishnan, BQ Shadid, PS Sastry & S Laxman, Root cause diagnostics using temporal data mining, US Patent No 7509234, issued on 24 March 2009.
Online Generation of Web Tables [Prof. Sunita Sarawagi]
The semi-structured web
[Figure: example web pages labeled Table, List, Formatted list, Regular page]
Queries in WWT
• Query by example
– Alan Turing | Turing Machine
– E. F. Codd | Relational Databases
• Query by description
– Inventor | Computer science concept
Answer: Table with ranked rows

| Inventor | Computer Science Concept |
|---|---|
| Alan Turing | Turing Machine |
| Seymour Cray | Supercomputer |
| E. F. Codd | Relational Databases |
| Tim Berners-Lee | WWW |
| Charles Babbage | Babbage Engine |
WWT Architecture
[Architecture diagram. Component labels: Web, Extract record sources, Store, Content+context index (Offline); User, Query Table, Query Builder, Keyword Query, Index, Source L1,…,Lk; Type Inference, Type system Hierarchy, Ontology, Annotate; Resolver builder, Resolver; Extractor, Record labeler, CRF models; Tables T1,…,Tk; Consolidator, Cell resolver, Row resolver, Consolidated Table; Ranker, Row and cell scores, Final consolidated table]
Experiments
Aim: Reconstruct Wikipedia tables from only a few sample rows.
Sample queries
− TV Series: Character name, Actor name, Season
− Oil spills: Tanker, Region, Time
− Golden Globe Awards: Actor, Movie, Year
− Dadasaheb Phalke Awards: Person, Year
− Parrots: common name, scientific name, family
Accuracy of joint labeling
Dataset
− Manually labeled: 450 tables spanning general Web and Wikipedia
− Automatically labeled: 650 tables from Wikipedia where cells have entity links
[Bar chart: F1 accuracy (0 to 1) for the methods LCA, Majority and Ours; x-axis labels Entity, Types, relations, Entity, grouped under Manual and Automatic]
Summary
• Amazing amount of quality information on the semi-structured web.
• WWT
− Online: structure interpretation at query time
− Domain-independent methods for extraction and coreference resolution
− Relies heavily on unsupervised statistical learning
  · Joint training for exploiting overlap during extraction
  · Graphical model for table annotation
  · Collective column labeling for descriptive queries
  · Bayesian network for consolidation
  · Page rank + confidence from a probabilistic extractor for ranking
Further Details
Gupta and Sarawagi, VLDB 2009
Association Rule Mining [Jayant Haritsa]
Data Layouts for AR Mining
• Table Organization
– Horizontal (transactions / rows)
– Vertical (items / columns)
• Data Representation
– Value List (only presence)
– Bit Vector (presence and absence)
[Figure: a value list (1 2 5) shown alongside a bit vector]
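The two axes above (organization and representation) can be illustrated on a toy database. The sketch below is only an illustration: the transaction data, item universe and function names are made up, and it is not the storage format of any particular system.

```python
# Toy database in horizontal value-list form: transaction id -> items present.
transactions = {1: [1, 2, 5], 2: [2, 3], 3: [1, 2, 3, 5]}
items = [1, 2, 3, 4, 5]  # the item universe (schema)


def horizontal_bit_vectors(transactions, items):
    """One row per transaction; one bit per item (presence and absence)."""
    return {tid: [1 if i in row else 0 for i in items]
            for tid, row in transactions.items()}


def vertical_value_lists(transactions, items):
    """One column per item; each column lists the ids of transactions containing it."""
    return {i: [tid for tid, row in sorted(transactions.items()) if i in row]
            for i in items}


def vertical_bit_vectors(transactions, items):
    """One column per item; one bit per transaction."""
    tids = sorted(transactions)
    return {i: [1 if i in transactions[t] else 0 for t in tids] for i in items}


print(horizontal_bit_vectors(transactions, items)[1])  # [1, 1, 0, 0, 1]
print(vertical_value_lists(transactions, items)[2])    # [1, 2, 3]
print(vertical_bit_vectors(transactions, items)[5])    # [1, 0, 1]
```

Note that the horizontal rows have one entry per item in the schema, while the vertical columns have one entry per transaction in the database.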
Data Layout Combinations
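[Figure: grid of organization × representation combinations, with the cells used by Apriori and by Our Approach marked]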
Why Vertical Organization?
• “Natural” for association rule mining’s goal of discovering correlated items (columns)
• Support counting simple (set intersections)
• No excess baggage from disk (automatic and immediate reduction of database after each scan)
• Ideal for parallel implementations (asynchronous, not level-wise, computation)
– counting of AB can start before item C has been fully counted
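The "support counting by set intersection" point can be sketched as follows. Each item's column is held as a Python integer used as a bit vector (bit t set means transaction t contains the item), so the support of an itemset is the number of set bits in the bitwise AND of its columns. The helper names and the toy columns are assumptions for illustration only.

```python
def tid_vector(tids):
    """Pack a collection of transaction ids into an integer bit vector."""
    vector = 0
    for t in tids:
        vector |= 1 << t
    return vector


def support(*columns):
    """Support of an itemset = number of transactions present in every column."""
    combined = columns[0]
    for column in columns[1:]:
        combined &= column  # set intersection as bitwise AND
    return bin(combined).count("1")


# Hypothetical item columns over transactions 0..4
A = tid_vector([0, 1, 3, 4])
B = tid_vector([1, 2, 3])
C = tid_vector([0, 3])

print(support(A, B))     # 2  (transactions 1 and 3)
print(support(A, B, C))  # 1  (transaction 3)
```

Because `support(A, B)` touches only columns A and B, it does not need to wait for column C, which is the asynchrony the last bullet refers to.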
Why Bit-Vector Representation?
• Tremendous scope for compression, especially since databases are typically sparse
• Vertical orientation offers better compression than horizontal since column lengths are proportional to size of database whereas row lengths are proportional to size of schema
• In fact, compressed VTV (vertical tid-vector) occupies much less space than HIL (horizontal item-list)
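To see why sparse bit vectors compress well, consider plain run-length encoding: a long column that is mostly zeros collapses into a handful of run lengths. This is a generic stand-in chosen for illustration, not the encoding used by the authors, which the slides do not describe.

```python
def run_lengths(bits):
    """Encode a bit vector as lengths of alternating runs, starting with a run of 0s."""
    runs = []
    current = 0
    length = 0
    for b in bits:
        if b == current:
            length += 1
        else:
            runs.append(length)
            current = b
            length = 1
    runs.append(length)
    return runs


def decode(runs):
    """Inverse of run_lengths."""
    bits = []
    value = 0
    for length in runs:
        bits.extend([value] * length)
        value = 1 - value
    return bits


# A sparse column: 38 transactions, only 3 contain the item.
column = [0] * 10 + [1] + [0] * 20 + [1, 1] + [0] * 5

encoded = run_lengths(column)
print(encoded)                    # [10, 1, 20, 2, 5]
print(decode(encoded) == column)  # True
```

The longer the column relative to the number of 1s, the longer the zero runs, which is consistent with the slide's argument that database-length columns compress better than schema-length rows.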
Contributions
• Compressed VTV data layout: “Snake”
• VIPER snake mining algorithm
– (Vertical Itemset Partitioning for Efficient Rule-extraction)
– several optimizations for snake generation, intersection, counting and storage
– “general-purpose” (prior vertical algorithms have restrictions on DB size, shape, contents, mining process)
– substantial performance improvement (response time, disk space, disk traffic)
– in some cases, beats “optimal” horizontal!
• Presented in ACM SIGMOD 2000
Privacy Preserving Mining
JUNE 2008 Slide 30 S N Bose Centre
A Typical Web-Service Form (e.g. Amazon.com)
State-of-the-Art
• To cater to such privacy concerns, users are forced to resort to falsifying their data
– (E.g.: There appear to be numerous grandmothers from Bangalore downloading rock music, because the actual clients, young male IISc students, have falsified their age and gender)
• But then, the models become meaningless …
– Garbage in, Garbage out
Design Wish-list
• High Privacy
– User-visible (i.e. should be done at user site)
• Highly accurate models
– Association Rules correctly identified
• Efficiency
– Data collection / Mining-process
[Figure: User → Service Provider → Mining Company, annotated with Privacy, Accuracy, Efficiency]
The Digital Divide
Data Privacy versus Accurate Models
Desire “Globally accurate, locally private”, but
Bridging the Divide
[Figure: User Data → Data Distortion Procedure → Distribution Reconstruction Procedure → Mining Algorithm → Accurate Models]
Optimal Distortion Matrix
• Symmetric positive-definite Toeplitz matrix, with Gamma diagonal
• Results in matrix with lowest “condition number” (ratio of maximum to minimum eigenvalues)
– makes reconstruction least sensitive to the variance in the distribution of the distorted database
– order-of-magnitude accuracy improvements
• Dependent column perturbation
• γ determines privacy level
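A minimal sketch of the distort-then-reconstruct idea, under a stated assumption: the slide does not give the matrix entries, so the code assumes a "gamma-diagonal" matrix over a domain of n values with γx on the diagonal and x everywhere else, where x = 1/(γ + n − 1) so that every column sums to 1. Under that assumption the eigenvalues are 1 (once) and x(γ − 1) (n − 1 times), so for γ > 1 the condition number is (γ + n − 1)/(γ − 1), and the inverse has a closed form. All function names are hypothetical, and the example works on expected counts rather than on randomly perturbed records.

```python
def gamma_diagonal(n, gamma):
    """Assumed distortion matrix: A[v][u] = P(record with value u is reported as v)."""
    x = 1.0 / (gamma + n - 1)
    return [[gamma * x if v == u else x for u in range(n)] for v in range(n)]


def expected_distorted_counts(true_counts, gamma):
    """Expected histogram after distortion: Y = A X."""
    n = len(true_counts)
    A = gamma_diagonal(n, gamma)
    return [sum(A[v][u] * true_counts[u] for u in range(n)) for v in range(n)]


def reconstruct_counts(distorted_counts, gamma):
    """Solve A X = Y using the closed-form inverse of the assumed matrix."""
    n = len(distorted_counts)
    x = 1.0 / (gamma + n - 1)
    total = sum(distorted_counts)
    return [(y - x * total) / (x * (gamma - 1)) for y in distorted_counts]


def condition_number(n, gamma):
    """Largest eigenvalue (1) divided by smallest (x * (gamma - 1)), for gamma > 1."""
    return (gamma + n - 1) / (gamma - 1)


true_counts = [50.0, 30.0, 20.0]  # hypothetical histogram over a 3-value attribute
gamma = 19.0

distorted = expected_distorted_counts(true_counts, gamma)
recovered = reconstruct_counts(distorted, gamma)

print([round(c, 6) for c in recovered])      # [50.0, 30.0, 20.0]
print(round(condition_number(3, gamma), 4))  # 1.1667
```

With real perturbed data the distorted counts deviate from their expectation, and a condition number close to 1 limits how much that deviation is amplified during reconstruction. A smaller γ flattens the matrix (more privacy) but raises the condition number.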
Summary
• By careful mathematical design, it is possible to simultaneously achieve user data privacy, accurate statistical models, and good runtime efficiency.
Further Details
• “Maintaining Data Privacy in Association Rule Mining” [VLDB 2002]
• “On Addressing Efficiency Concerns in Privacy-Preserving Mining” [DASFAA 2004]
• “A Framework for High-Accuracy Privacy-Preserving Mining” [ICDE 2005]
• All available at http://dsl.serc.iisc.ernet.in
Questions?