10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data...
![Page 1: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/1.jpg)
04/19/23 1
Data Mining: Concepts and Techniques
— Chapter 10 —10.3.1 Mining Text and Web Data (I)
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
Acknowledgements: Based on the slides by students at CS512 (Spring 2009)
![Page 2: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/2.jpg)
Outline
- Introduction to Information Retrieval (Rui Li)
- Text categorization (Parikshit Sondhi)
- Web link analysis (Kavita Ganesan)
- Mining and Searching Structured Data on the Web (Bo Zhao)
![Page 4: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/4.jpg)
What Is Information Retrieval?
Information retrieval
- There exists a collection of text documents
- The user gives a query to express an information need
- A retrieval system returns relevant documents to the user
Typical IR systems
- Online library catalogs
- Online document management systems
- Web search engines (e.g., Google)
Information retrieval vs. database systems
- Unstructured/free text vs. structured data
- Ambiguous vs. well-defined semantics
- Incomplete vs. complete specification
- Relevant documents vs. matched records
- No transaction management vs. transaction management
![Page 5: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/5.jpg)
Typical IR System Architecture
[Figure: IR system architecture. Components: User, query, docs, Tokenizer, Indexer, Index, Query Rep, Doc Rep (Index), Scorer, results.]
![Page 6: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/6.jpg)
Document Representation
- A document can be described by a set of representative keywords called index terms.
- Different index terms have varying relevance when used to describe document contents.
- Steps:
  - Tokenize the document into words
  - Remove stop words using a stop word list, e.g., “is”, “a”, “or”
  - Apply a word stemmer: several words are small syntactic variants of each other since they share a common word stem, e.g., drug, drugs, drugged
  - Calculate the term weights based on word frequency
- Query representation is a similar process
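The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop word list is a tiny hypothetical sample, and `naive_stem` is a toy suffix stripper standing in for a real stemmer; raw counts are used as the term weights.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"is", "a", "or", "the", "and", "of"}

def naive_stem(word):
    """Toy stemmer (hypothetical): strip one common suffix, then undo consonant doubling."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "drugg" -> "drug".
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

def represent(text):
    """Tokenize, drop stop words, stem, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(naive_stem(t) for t in kept)

doc_rep = represent("A drug is drugged or drugs")
```

A query can be passed through the same `represent` function, which mirrors the remark that query representation is a similar process.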
![Page 7: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/7.jpg)
Indexing
Inverted index
- Maintains two hash- or B+-tree indexed tables:
  - document_table: a set of document records <doc_id, postings_list>
  - term_table: a set of term records <term, postings_list>
- Answer query: find all docs associated with one or a set of terms
- Advantages: easy to implement; effective for fetching documents that contain a specific term
- Disadvantages: does not handle synonymy and polysemy well, and posting lists could be too long (storage could be very large)
Other index techniques: e.g., signature file
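A minimal in-memory sketch of the two tables follows. Plain Python dictionaries stand in for the hash- or B+-tree indexed tables, and the function names `build_index` and `docs_with_all` are illustrative assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """docs maps doc_id -> list of terms. Returns (document_table, term_table)."""
    document_table = {}            # doc_id -> terms occurring in the document
    term_table = defaultdict(list) # term -> doc_ids containing the term
    for doc_id in sorted(docs):
        terms = sorted(set(docs[doc_id]))
        document_table[doc_id] = terms
        for term in terms:
            term_table[term].append(doc_id)
    return document_table, dict(term_table)

def docs_with_all(term_table, terms):
    """Answer a query: ids of documents that contain every query term."""
    postings = [set(term_table.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: ["text", "mining"], 2: ["data", "mining"], 3: ["text", "data"]}
document_table, term_table = build_index(docs)
```

Looking up a term is a single dictionary access, which is why fetching documents for a specific term is cheap; the sketch does nothing about synonymy or polysemy.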
![Page 8: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/8.jpg)
Ranking Model
- The basic question: given a query, how do we know if document A is more relevant than B?
- Relevance = Similarity
  - Query and document are represented similarly
  - A query can be regarded as a “document”
  - Relevance(d, q) ∝ similarity(d, q)
- Key issues
  - How to represent the query/document?
  - How to define the similarity measure?
- Typical models
  - Boolean model
  - Vector space model
  - Language model
![Page 9: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/9.jpg)
The Notion of Relevance
Relevance
- Similarity: (Rep(q), Rep(d))
  - Different rep & similarity
    - Vector space model (Salton et al., 75)
    - Prob. distr. model (Wong & Yao, 89)
    - …
- Probability of relevance: P(r=1|q,d), r ∈ {0,1}
  - Regression model (Fox 83)
  - Generative model
    - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation: LM approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a)
- Probabilistic inference: P(d→q) or P(q→d)
  - Different inference system
    - Prob. concept space model (Wong & Yao, 95)
    - Inference network model (Turtle & Croft, 91)
![Page 10: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/10.jpg)
Vector Space Model
- Represent a doc/query by a term vector
  - Term: basic concept, e.g., word or phrase
  - Each term defines one dimension, and N terms define a high-dimensional space
  - Each element of the vector corresponds to a term weight, i.e., the “importance” of the term
[Figure: documents D1 to D11 and the query plotted in a three-dimensional term space with axes Java, Microsoft, and Starbucks.]
![Page 11: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/11.jpg)
04/19/23Data Mining: Principles and Algorithms
How to Assign Weights
- Two-fold heuristics based on frequency
  - TF (term frequency)
    - More frequent within a document → more relevant to semantics
    - e.g., “query” vs. “commercial”
  - IDF (inverse document frequency)
    - Less frequent among documents → more discriminative
    - e.g., “algebra” vs. “science”
- TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
  - Frequent within doc → high TF → high weight
  - Selective among docs → high IDF → high weight
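A small sketch of the weighting rule weight(t, d) = TF(t, d) * IDF(t). The slide does not fix the exact formulas, so two assumptions are made here: TF is the raw count of the term in the document, and IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs is a list of term lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # count each term once per document
    weights = []
    for terms in docs:
        tf = Counter(terms)
        # Assumed variant: raw TF times log(N / df).
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["algebra", "science", "science"], ["science", "query"], ["science"]]
weights = tf_idf(docs)
```

In this toy collection "science" occurs in every document, so its IDF and hence its weight is zero, while the rarer "algebra" receives a positive weight.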
![Page 12: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/12.jpg)
How to Measure Similarity?
- Given two documents
- Similarity definition
  - dot product
  - normalized dot product (or cosine)
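Both similarity definitions can be written directly over sparse term-weight vectors. The sketch below assumes each vector is a dictionary from term to weight; the example weights are made up for illustration.

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors given as {term: weight} dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cosine(u, v):
    """Normalized dot product; defined as 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

d = {"java": 3.0, "starbucks": 4.0}
q = {"java": 1.0}
similarity = cosine(d, q)  # 3 / (5 * 1)
```

Unlike the raw dot product, the cosine does not change when a vector is scaled, so long documents are not favored merely for their length.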
![Page 13: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/13.jpg)
Advantages and Disadvantages of VS Model
Advantages:
- Empirically effective! (Top TREC performance)
- Intuitive
- Easy to implement
- Well-studied/most evaluated
Disadvantages:
- Assumes term independence
- Assumes query and document are the same
- Lack of “predictive adequacy”
  - Arbitrary term weighting
  - Arbitrary similarity measure
![Page 14: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/14.jpg)
Language Models for Retrieval (Ponte & Croft 98)
- Each document has its own language model:
  - Text mining paper: … text ?, mining ?, association ?, clustering ?, … food ?, …
  - Food nutrition paper: … food ?, nutrition ?, healthy ?, diet ?, …
- Query = “data mining algorithms”
- Which model would most likely have generated this query?
![Page 15: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/15.jpg)
Text Generation with Unigram LM
(Unigram) language model p(w|θ)
- Topic 1: Text mining: … text 0.2, mining 0.1, association 0.01, clustering 0.02, … food 0.00001, …
- Topic 2: Health: … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
Sampling from a model generates a document:
- Topic 1 → text mining paper
- Topic 2 → food nutrition paper
![Page 16: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/16.jpg)
Estimation of Unigram LM
(Unigram) language model p(w|θ) = ?
Document: a “text mining paper” (total #words = 100)
- Word counts: text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1
- Estimated model:
  - text: 10/100
  - mining: 5/100
  - association: 3/100
  - database: 3/100
  - …
  - query: 1/100
  - …
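The estimate shown here is the relative frequency of each word in the document (the maximum likelihood estimate). A minimal sketch follows; the word counts are taken from the slide, and the placeholder token "other" is a hypothetical filler used only to bring the document to 100 words.

```python
from collections import Counter

def estimate_unigram_lm(words):
    """Maximum likelihood estimate: p(w) = count(w) / total number of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Counts from the slide; "other" is filler (an assumption) so the total is 100.
paper = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
         + ["algorithm"] * 2 + ["query"] + ["efficient"] + ["other"] * 75)
model = estimate_unigram_lm(paper)
```

Any word that never occurs in the document gets probability zero under this estimate, which is one reason smoothing matters for retrieval.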
![Page 17: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/17.jpg)
Ranking Docs by Query Likelihood
[Figure: each document d1, d2, …, dN has its own doc LM; the query q is scored against each one by its query likelihood p(q|d1), p(q|d2), …, p(q|dN).]
![Page 18: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/18.jpg)
Retrieval as Language Model Estimation
- Document ranking based on query likelihood:

  $$\log p(q \mid d) = \sum_{i=1}^{n} \log p(w_i \mid d), \quad \text{where } q = w_1 w_2 \ldots w_n$$

  Here p(w_i|d) is the document language model.
- Retrieval problem ≈ estimation of p(w_i|d)
- Smoothing is an important issue, and distinguishes different approaches
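The query likelihood score can be sketched as a sum of log probabilities under a smoothed document model. The slide does not name a smoothing method, so this example assumes linear interpolation with a collection model (Jelinek-Mercer smoothing) and an arbitrary mixing weight `lam = 0.5`; the toy documents are made up, and documents are assumed to be non-empty.

```python
import math
from collections import Counter

def log_query_likelihood(query, doc, collection, lam=0.5):
    """log p(q|d) = sum_i log p(w_i|d), with p(w|d) smoothed by the collection model."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for w in query:
        p_doc = doc_counts[w] / len(doc)
        p_coll = coll_counts[w] / len(collection)
        p = (1 - lam) * p_doc + lam * p_coll  # assumed Jelinek-Mercer interpolation
        if p == 0.0:
            return float("-inf")  # word unseen in the whole collection
        score += math.log(p)
    return score

d1 = ["text", "mining", "data", "mining", "algorithms"]
d2 = ["food", "nutrition", "healthy", "diet", "food"]
collection = d1 + d2
query = ["data", "mining", "algorithms"]
ranking = sorted([("d1", d1), ("d2", d2)],
                 key=lambda item: log_query_likelihood(query, item[1], collection),
                 reverse=True)
```

Without smoothing, d2 would score negative infinity for this query; with it, d2 still receives a finite but lower score than d1.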
![Page 19: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/19.jpg)
Basic Measures for Text Retrieval
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)

  $$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

  $$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$

[Figure: Venn diagram inside All Documents showing the Relevant set, the Retrieved set, and their overlap (Relevant & Retrieved).]
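Both measures follow directly from set intersection. In this sketch the document ids are made up, and returning 0.0 for an empty denominator is a convention chosen here, not something the slide specifies.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    # Convention assumed here: a measure with an empty denominator is 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were retrieved (recall 1/2).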
![Page 20: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/20.jpg)
Acknowledgement
Some slides come from Professor Jiawei Han’s CS512 course slides and from Professor Chengxiang Zhai’s CS410 course slides (language model part).
![Page 21: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/21.jpg)
Text Categorization
By: Parikshit Sondhi
Computer Science, University of Illinois at Urbana-Champaign
Some slides have been adapted from Prof. Han's presentation
![Page 22: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/22.jpg)
Document Classification: Motivation
- News article classification
- Automatic email filtering
- Webpage classification
- Word sense disambiguation
- … …
![Page 23: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/23.jpg)
Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard classification (supervised learning) problem
[Figure: a Categorization System assigns documents to categories such as Sports, Business, Education, Science, …]
![Page 24: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/24.jpg)
Document Classification: Problem Definition
- Need to assign a Boolean value {0,1} to each entry of the decision matrix
- C = {c1, ..., cm} is a set of pre-defined categories
- D = {d1, ..., dn} is a set of documents to be categorized
- aij = 1: dj belongs to ci
- aij = 0: dj does not belong to ci
A Tutorial on Automated Text Categorisation, Fabrizio Sebastiani, Pisa (Italy)
![Page 25: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/25.jpg)
Flavors of Classification
- Single label
  - For a given di, at most one (di, ci) is true
  - Train a system which takes a di and C as input and outputs a ci
- Multi-label
  - For a given di, zero, one or more (di, ci) can be true
  - Train a system which takes a di and C as input and outputs C’, a subset of C
- Binary
  - Build a separate system for each ci, such that it takes a di as input and outputs a Boolean value for (di, ci)
  - The most general approach
  - Based on the assumption that the decision on (di, ci) is independent of (di, cj)
![Page 26: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/26.jpg)
Classification Methods
- Manual: typically rule-based (KE approach)
  - Does not scale up (labor-intensive, rule inconsistency)
  - May be appropriate for special data on a particular domain
- Automatic: typically exploiting machine learning techniques
  - Vector space model based
    - Prototype-based (Rocchio)
    - K-nearest neighbor (KNN)
    - Decision-tree (learn rules)
    - Neural networks (learn non-linear classifier)
    - Support vector machines (SVM)
  - Probabilistic or generative model based
    - Naïve Bayes classifier
![Page 27: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/27.jpg)
Steps in Document Classification
Classification process
- Data preprocessing, e.g., term extraction, dimensionality reduction, feature selection, etc.
- Definition of training set and test sets
- Creation of the classification model using the selected classification algorithm
- Classification model validation
- Classification of new/unknown text documents
![Page 28: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/28.jpg)
Taking an Example: TFIDF Classifiers
![Page 29: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/29.jpg)
Vector Space Model
- Represent a doc by a term vector
  - Term: basic concept, e.g., word or phrase
  - Each term defines one dimension
  - N terms define an N-dimensional space
  - Each element of the vector corresponds to a term weight
  - E.g., d = (x1, …, xN), where xi is the “importance” of term i
- A new document is assigned to the most likely category based on vector similarity (e.g., based on the cosine formula).
![Page 30: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/30.jpg)
VS Model: Illustration
[Figure: Category 1 (C1), Category 2 (C2), Category 3 (C3), and a new doc plotted in a term space with axes Java, Microsoft, and Starbucks.]
![Page 31: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/31.jpg)
TFIDF Classifier
- The basic idea of the algorithm is to represent each document d as a vector d = (d(1), ..., d(|F|)) in a vector space so that documents with similar content have similar vectors.
- Each dimension of the vector space represents a word selected by the feature selection process.
- d(i) for a document d is calculated as a combination of the statistics TF(w, d) and DF(w). d(i) is called the weight of word wi in document d.
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA
![Page 32: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/32.jpg)
Representation
- Each distinct word is a feature, with the number of times the word occurs in the document as its value. This value is usually a function of TF(w, d) and IDF(w).
- To avoid unnecessarily large feature vectors, words are considered as features only if they occur in the training data at least m times (e.g., m = 3).
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA
![Page 33: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/33.jpg)
Preprocessing: Feature Selection
- All available features vs. a "good" subset
- The problem of finding a "good" subset of features is called feature selection
- Feature selection methods:
  1. Pruning of infrequent words
     - Words are only considered as features if they occur at least a few times in the training data.
  2. Pruning of high-frequency words
     - This technique is supposed to eliminate non-content words like "the", "and", "for".
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA
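The two pruning methods can be sketched as a single filter over training-set word counts. The thresholds are assumptions for illustration: `min_count=3` follows the "m = 3" example given earlier in the deck, while dropping the `n_most_frequent` top words is just one possible way to prune high-frequency words.

```python
from collections import Counter

def select_features(training_docs, min_count=3, n_most_frequent=2):
    """Keep words seen at least min_count times, minus the n most frequent words."""
    counts = Counter(w for doc in training_docs for w in doc)
    # Pruning of high-frequency words (assumed rule: drop the top-n words).
    too_frequent = {w for w, _ in counts.most_common(n_most_frequent)}
    # Pruning of infrequent words.
    return {w for w, c in counts.items() if c >= min_count and w not in too_frequent}

training_docs = [
    ["the"] * 6 + ["and"] * 5,
    ["mining"] * 3 + ["text"] * 3 + ["rare"],
]
features = select_features(training_docs)
```

In this toy training set "the" and "and" are removed as high-frequency words and "rare" as an infrequent one, leaving "mining" and "text" as features.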
![Page 34: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/34.jpg)
Classification: TFIDF Classifier
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA
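As a rough illustration of a prototype-based classifier over TF-IDF vectors, the sketch below builds one prototype per category by summing that category's training vectors and assigns a new document to the category whose prototype has the highest cosine similarity. This is a simplified variant and an assumption on my part: the Rocchio formulation in the cited paper also weights negative examples, which is omitted here, and the example vectors are made up.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def train_prototypes(labeled_vectors):
    """Sum the vectors of each category (positive examples only, a simplification)."""
    sums = defaultdict(lambda: defaultdict(float))
    for label, vec in labeled_vectors:
        for term, weight in vec.items():
            sums[label][term] += weight
    return {label: dict(vec) for label, vec in sums.items()}

def classify(prototypes, vec):
    """Assign the category whose prototype is most similar to the document vector."""
    return max(prototypes, key=lambda c: cosine(prototypes[c], vec))

training = [
    ("tech", {"java": 2.0, "microsoft": 1.0}),
    ("tech", {"microsoft": 3.0}),
    ("coffee", {"starbucks": 2.0, "java": 1.0}),
]
prototypes = train_prototypes(training)
label = classify(prototypes, {"microsoft": 1.0, "java": 1.0})
```

Because cosine similarity ignores vector length, summing the training vectors gives the same decisions as averaging them.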
![Page 35: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/35.jpg)
Evaluations
- Effectiveness measure
  - Classic: precision & recall
    - Precision
    - Recall
![Page 36: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/36.jpg)
Evaluation (cont'd)
- Benchmarks
  - Classic: Reuters collection
    - A set of newswire stories classified under categories related to economics
- Effectiveness
  - Difficulties of strict comparison
    - different parameter settings
    - different “split” (or selection) between training and testing
    - various optimizations … …
  - However, widely recognizable
    - Best: boosting-based committee classifier & SVM
    - Worst: Naïve Bayes classifier
  - Need to consider other factors, especially efficiency
![Page 37: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/37.jpg)
Document Classification: Approach Comparisons
![Page 38: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/38.jpg)
Document Clustering
- Motivation
  - Automatically group related documents based on their contents
  - No predetermined training sets or taxonomies
  - Generate a taxonomy at runtime
- Most popular clustering methods are:
  - K-Means clustering
  - Agglomerative hierarchical clustering
  - EM (Gaussian Mixture)
  - …
![Page 39: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/39.jpg)
The Steps and Algorithms
Clustering process
- Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
- Hierarchical clustering: compute similarities by applying clustering algorithms
- Model-based clustering (neural network approach): clusters are represented by “exemplars” (e.g., SOM)
![Page 40: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/40.jpg)
K-Means clustering
- Given:
  - a set of documents (e.g., TFIDF vectors)
  - a distance measure (e.g., cosine)
  - K (number of groups)
- For each of the K groups, initialize its centroid with a random document
- While not converged:
  - Assign each document to the nearest group (represented by its centroid)
  - For each group, calculate the new centroid (group mass point, the average document in the group)
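The loop above can be sketched as follows, using dense vectors and cosine similarity as the closeness measure. Several details are assumptions made for this example: convergence is detected when assignments stop changing, an empty group keeps its previous centroid, and the optional `init` argument (indices of the starting documents) exists only to make runs reproducible; by default the centroids are random documents.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def kmeans(docs, k, init=None, seed=0, max_iter=100):
    """Return (labels, centroids) for a list of dense document vectors."""
    rng = random.Random(seed)
    chosen = init if init is not None else rng.sample(range(len(docs)), k)
    centroids = [list(docs[i]) for i in chosen]
    labels = None
    for _ in range(max_iter):
        # Assign each document to the group with the most similar centroid.
        new_labels = [max(range(k), key=lambda c: cosine(d, centroids[c])) for d in docs]
        if new_labels == labels:
            break  # assignments unchanged: converged
        labels = new_labels
        # Recompute each centroid as the average document of its group.
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

docs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
labels, centroids = kmeans(docs, k=2, init=[0, 3])
```

With the two starting documents taken from different groups, the first three and last three vectors end up in separate clusters after a single reassignment.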
![Page 41: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/41.jpg)
Slide adapted from Dr. Andrew Moore’s Presentation
![Page 42: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/42.jpg)
Summary: Text Categorization
- Wide application domain
- Comparable effectiveness to professionals
  - Manual TC is not 100% accurate and is unlikely to improve substantially
  - Automated TC (A.T.C.) is growing at a steady pace
- Prospects and extensions
  - Very noisy text, such as text from O.C.R.
  - Speech transcripts
![Page 43: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/43.jpg)
References
- Fabrizio Sebastiani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol. 34, No. 1, March 2002.
- Yiming Yang, “An evaluation of statistical approaches to text categorization”, Journal of Information Retrieval, 1:67-88, 1999.
- Yiming Yang and Xin Liu, “A re-examination of text categorization methods”, Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, 1999.
![Page 44: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/44.jpg)
Thank You
![Page 45: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/45.jpg)
Web Link Analysis
By Kavita Ganesan
![Page 46: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/46.jpg)
RECAP
What is ranking in information retrieval?
[Figure: Doc 1, Doc 2, Doc 3, and Doc 4 returned for a search performed on Google.]
![Page 47: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/47.jpg)
RECAP
What is ranking in information retrieval?
[Figure: the results Doc 1, Doc 2, Doc 3, and Doc 4 of a search performed on Google, labeled Ranked 1st, Ranked 2nd, Ranked 3rd, and Ranked 4th.]
![Page 48: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/48.jpg)
Why is ranking important?
- Users tend to look at the top few results
  - make sure that good matches are at the very top
- Fast access to information!
  - savvy users want results immediately
- What happens if pages are poorly ranked?
  - important matches are missed
  - poor user retention
![Page 49: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/49.jpg)
Ranking in Text Information Retrieval
- Before the web existed
  - Each document treated as a bag of words
  - Minimal structure
  - Ranking heuristics
    - Solely based on words in the documents
    - E.g., term frequency, inverse document frequency
- After the web was born
  - Documents
    - have structure
    - contain hyperlinks
    - contain components like title, author, abstract, sections, references
- Question is: can we leverage this information to improve ranking?
![Page 50: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/50.jpg)
Exploiting inter-document links
What does a link tell us?
- Description (“anchor text”)
- Links indicate the utility of a doc
- Hub
- Authority
![Page 51: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/51.jpg)
Link Analysis Algorithms
Hyperlink analysis to rank documents
- PageRank
- HITS
![Page 52: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/52.jpg)
PageRank
- Based on the idea of a ‘random surfer’
  - the likelihood that a person randomly clicking on links will arrive at any particular page
- Pages are represented as Markov chain states
- The probability of moving from one page to another is modelled as a state transition probability
![Page 53: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/53.jpg)
PageRank
Ex: four pages A, B, C, and D

State transition matrix (row = current page, column = next page):

|   | A   | B   | C   | D   |
|---|-----|-----|-----|-----|
| A | 0   | 1/2 | 1/2 | 0   |
| B | 1/2 | 0   | 0   | 1/2 |
| C | 1   | 0   | 0   | 0   |
| D | 1/2 | 0   | 1/2 | 0   |

PR(A) = ½ * PR(B) + 1 * PR(C) + ½ * PR(D)
![Page 54: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/54.jpg)
PageRank
- The PageRank value for any page u can be expressed as

  $$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$

- L(v) = number of outbound links of page v
- PR(v) = PageRank of page v
- B_u = set of pages linking to page u
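The formula can be iterated until the values stop changing. The sketch below applies it to the four-page graph implied by the example's state transition matrix (A links to B and C, B to A and D, C to A, D to A and C). It follows the formula as stated, without the damping factor that practical PageRank implementations usually add, and it assumes every page has at least one outbound link; the fixed iteration count is an arbitrary choice.

```python
def pagerank(out_links, iterations=500):
    """Iterate PR(u) = sum over v in B_u of PR(v) / L(v), starting from a uniform vector."""
    pages = sorted(out_links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_pr = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            share = pr[v] / len(targets)  # PR(v) / L(v)
            for u in targets:
                new_pr[u] += share        # v is in B_u for each of its targets u
        pr = new_pr
    return pr

# Graph read off the example's transition matrix (rows = current page).
out_links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["A", "C"]}
ranks = pagerank(out_links)
```

For this graph the iteration settles at PR(A) = 8/19, PR(B) = 4/19, PR(C) = 5/19, and PR(D) = 2/19, which satisfies the example equation for PR(A).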
![Page 55: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/55.jpg)
HITS
- HITS = Hyperlink-Induced Topic Search
- Developed by Jon Michael Kleinberg from Cornell
- The algorithm distinguishes two types of pages:
  - Authority: pages that provide important, trustworthy information on a given topic
  - Hub: pages that contain links to authorities
- Authorities and hubs exhibit a mutually reinforcing relationship:
  - a better hub points to many good authorities, and
  - a better authority is pointed to by many good hubs
![Page 56: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/56.jpg)
HITS algorithm
1. Start with each node (page) having a hub score and an authority score of 1
2. Run the Authority Update Rule
3. Run the Hub Update Rule
4. Normalize the values:
  divide each Hub score by the sum of all Hub scores
  divide each Authority score by the sum of all Authority scores
5. Repeat from the second step as necessary
![Page 57: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/57.jpg)
HITS algorithm: Authority Update
A node's Authority score = the sum of the Hub scores of each node that points to it.
A page has high authority if it is linked to by pages that are recognized as Hubs for information.
[Figure: nodes B, C, and D each point to node A]
authority(A) = h(B) + h(C) + h(D)
![Page 58: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/58.jpg)
HITS algorithm: Hub Update
A node's Hub score = the sum of the Authority scores of each node that it points to.
A page is a good hub if it links to pages that have high authority.
[Figure: node A points to nodes E, F, and G]
hub(A) = a(E) + a(F) + a(G)
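The HITS steps and the two update rules can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the graph combines the two figure examples (B, C, D point to A; A points to E, F, G), a fixed iteration count stands in for "repeat as necessary", and the function name `hits` is hypothetical.

```python
def hits(links, iterations=20):
    """Run the HITS steps: authority update, hub update, normalize, repeat.

    links maps each node to the list of nodes it points to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}   # step 1: every score starts at 1
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the nodes pointing to n.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum of authority scores of the nodes n points to.
        hub = {n: sum(auth[dst] for dst in links.get(n, [])) for n in nodes}
        # Normalize: divide each score by the sum of all scores of its kind.
        hub_total = sum(hub.values()) or 1.0
        auth_total = sum(auth.values()) or 1.0
        hub = {n: h / hub_total for n, h in hub.items()}
        auth = {n: a / auth_total for n, a in auth.items()}
    return hub, auth


# Assumed example graph built from the two figures.
links = {"B": ["A"], "C": ["A"], "D": ["A"], "A": ["E", "F", "G"]}
hub, auth = hits(links)
```

In this small graph A ends up with the highest authority score (0.5), while A, B, C, and D share the hub score equally (0.25 each) and E, F, and G have hub score 0 because they link to nothing.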
![Page 59: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/59.jpg)
PageRank vs HITS

| HITS | PageRank |
| --- | --- |
| Iterative algorithm based on linkage of documents on the web | Iterative algorithm based on linkage of documents on the web |
| Executed at query time (authority and hub scores are query specific), so it takes a performance hit | Pre-computed |
| Computes two scores per document, hub and authority | Computes a single score |
![Page 60: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/60.jpg)
End
![Page 61: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/61.jpg)
AUTHORITY PAGE
Kevin Chang
Cheng Zhai
Marianne Winslet
ibm.com
berkeley.edu
Stanford.edu
If a page is popular, then it must be an important page
![Page 62: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/62.jpg)
Mining and Searching Structured Data on the Web
Bo Zhao
Department of Computer Science
University of Illinois at Urbana-Champaign
![Page 63: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/63.jpg)
Structured Data are EVERYWHERE!
Deep Web: databases behind websites (aa.com)
Web 2.0 Contents: Flickr, Del.icio.us tags
Google Base: structured data portals
Surface Web: emails, org, country…
![Page 64: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/64.jpg)
Solutions
Deep Web Data Integration
  Vertical Search Engines
  On-the-fly Meta-querying Systems
  Pay-As-You-Go Integration
Deep Web Surfacing
Entity Search on the Surface Web
![Page 65: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/65.jpg)
Vertical Search Engines: “Warehousing” approach
Academic Search
  Libra@MSRA
  DBLife@WISC
  ArnetMiner@Tsinghua
Many other domains
  Shopping
  Events
  Apartments
  …
![Page 66: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/66.jpg)
Vertical Search Engines: “Warehousing” approach, e.g., Libra Academic Search [NieZW+05] (courtesy MSRA)
Integrating information from multiple types of sources
Ranking papers, conferences, and authors for a given query
Handling structured queries
[Figure: source types shown: Web databases, PS/DOC files, journal homepages, author homepages, conference homepages]
![Page 67: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/67.jpg)
On-the-fly Meta-querying Systems
MetaQuerier@UIUC
WISE-Integrator
  http://www.data.binghamton.edu:8080/wise-integrator/
Commercial Systems
  http://www.cheaptickets.com
  http://pipl.com
  …
![Page 68: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/68.jpg)
On-the-fly Meta-querying Systems, e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05]
[Figure (MetaQuerier@UIUC): two components, FIND sources (db of dbs) and QUERY sources (unified query interface), over sources such as Amazon.com, Cars.com, 411localte.com, Apartments.com]
![Page 69: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/69.jpg)
Technical Challenges
Source Modeling & Selection
  How to describe a source and find the right sources for query answering?
Schema Matching
  How to match the schematic structures between sources?
Source Querying, Crawling, and Object Ranking
  How to query a source? How to crawl all objects and to search them?
Data Extraction
  How to extract result pages into relations?
![Page 70: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/70.jpg)
Source Modeling & Selection: How to describe a source and find the right sources for query answering
Focus: Discovery of sources
  Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05]
Focus: Extraction of source models
  Hidden grammar-based parsing [ZhangHC04]
  Proximity-based extraction [HeMY+04]
  Classification to align with a given taxonomy [HessK03, Kushmerick03]
Focus: Organization of sources and query routing
  Offline clustering [HeTC04, PengMH+04]
  Online search for query routing [KabraLC05]
![Page 71: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/71.jpg)
Form Extraction: the Problem
Output all the conditions; for each:
  Grouping elements (into query conditions)
  Tagging elements with their “semantic roles”: attribute, operator, value
![Page 72: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/72.jpg)
Schema Matching: How to match the schematic structures between sources
Focus: Matching a large number of interface schemas, often in a holistic way
  Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05]
  Query probing [WangWL+04]
  Clustering [HeMY+03, WuYD+04]
  Corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06]
Focus: Constructing unified interfaces
  As a global generative model [HeC03]
  Cluster-merge-select [HeMY+03]
![Page 73: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/73.jpg)
WISE-Integrator: Cluster-Merge-Represent [HeMY+03]
![Page 74: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/74.jpg)
Source Querying: How to query a source? How to crawl all objects and to search them?
1. Metaquerying model
  Focus: On-the-fly querying
    MetaQuerier Query Assistant [ZhangHC05]
2. Vertical-search-engine model
  Focus: Source crawling to collect objects
    Form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06]
  Focus: Object search and ranking [NieZW+05]
![Page 75: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/75.jpg)
On-the-fly Querying: [ZhangHC05] Type-locality based Predicate Translation
Correspondences occur within localities
Translation by type-handler
[Figure: components shown: Source predicate s, Target template P, Type Recognizer, Domain Specific Handler, Text Handler, Numeric Handler, Datetime Handler, Predicate Mapper, Target Predicate t*]
![Page 76: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/76.jpg)
Source Crawling by Query Selection [WuWL+06]

| Author | Title | Category |
| --- | --- | --- |
| Ullman | Compiler | System |
| Ullman | Data Mining | Application |
| Ullman | Automata | Theory |
| Han | Data Mining | Application |

[Figure: graph over the attribute values Ullman, Han, Compiler, Automata, Data Mining, Application, Theory, System]
Conceptually, the DB is a graph:
  Node: attribute value
  Edge: occurrence relationship
Crawling is transformed into a graph traversal problem: find a set of nodes N in the graph G such that for every node i in G there exists a node j in N with j -> i, and the total cost of the nodes in N is minimal.
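The node-selection problem stated above is a weighted dominating-set problem, which is commonly approximated greedily. The sketch below is one such approximation and is not claimed to be the method of [WuWL+06]. It makes three labeled assumptions: an edge links two values that occur in the same record, a selected node also covers itself, and every query costs 1 unless a `cost` dictionary is given. The helper names are hypothetical.

```python
from itertools import combinations


def build_graph(records):
    """Node = attribute value; edge = two values occurring in the same record."""
    graph = {}
    for record in records:
        for value in record:
            graph.setdefault(value, set())
        for a, b in combinations(record, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_query_selection(graph, cost=None):
    """Greedily pick query values until every node is covered.

    Assumption: a chosen node covers itself and all of its neighbors.
    Each round picks the node with the lowest cost per newly covered node;
    ties are broken alphabetically so the result is deterministic.
    """
    cost = cost or {node: 1.0 for node in graph}
    uncovered = set(graph)
    chosen = []
    while uncovered:
        def price(node):
            gain = len(({node} | graph[node]) & uncovered)
            return (cost[node] / gain if gain else float("inf"), node)

        best = min(graph, key=price)
        chosen.append(best)
        uncovered -= {best} | graph[best]
    return chosen


records = [
    ("Ullman", "Compiler", "System"),
    ("Ullman", "Data Mining", "Application"),
    ("Ullman", "Automata", "Theory"),
    ("Han", "Data Mining", "Application"),
]
graph = build_graph(records)
queries = greedy_query_selection(graph)
```

With unit costs the sketch selects "Ullman" first, since that query reaches seven of the eight values, and then one more value that reaches "Han", so two queries retrieve all four records.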
![Page 77: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/77.jpg)
Object Ranking - Object Relationship Graph [NieZW+05]
Popularity Propagation Factor for each type of relationship link
Popularity of an object is also affected by the popularity of the Web pages containing the object
![Page 78: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/78.jpg)
Data Extraction: How to extract result pages into relations
[Figure: a Mediator on top of three Wrappers]
Focus: Semi-automatic wrapper construction
Techniques:
  Wrapper-mediator architecture [Wiederhold92]
  Manual construction
  Semi-automatic, learning-based: HLRT [KushmerickWD97], Stalker [MusleaMK99], Softmealy [HsuD98]
![Page 79: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/79.jpg)
Data Extraction: How to extract result pages into relations
[Figure: a Mediator on top of three Wrappers]
Focus: Even more automatic approaches
Techniques:
  Semi-automatic, learning-based: [ZhaoMWRY05], [IRMKS06]
  Automatic, syntax-based: RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05]
![Page 80: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/80.jpg)
You can only afford to Pay As You Go
Data Integration Solution
  Build data integration systems with deep web sources
  Reformulate user queries at search-time
  Build data integration for every domain of interest
Impractical for web search!
  Cannot query sources too often
  Precise content description required
  Too many domains of interest?
  Mediated schema design is infeasible!
![Page 81: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/81.jpg)
Web Search Queries and Users
Web Queries are typically keyword queries
Data integration solutions assume structured queries
Web users do not typically care if results are structured or unstructured
User attention restricted to small number of portals (~1)
![Page 82: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/82.jpg)
PAYGO Architecture
There can be many, potentially ill-defined, domains: Mediated Schema → Schema Clusters
Precise mappings cannot be created to all data sources: Exact Mappings → Approximate Mappings
Users prefer keyword queries to structured queries: Query Reformulation → Query Routing
Data sources are diverse and mappings approximate: Exact Answers → Heterogeneous Result Ranking
Uncertainty everywhere!
![Page 83: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/83.jpg)
Pay As You Go in PAYGO
Integration is a continuous process
  A priori integration is impossible
  Understanding of mappings/sources/ranking/etc. evolves over time
Mechanisms to facilitate evolution over time
  Automatic schema clustering and matching
  Implicit use of user feedback, e.g., from result clicks
  Result variations to elicit disambiguating user feedback
Queries are always answered with best effort
  “Pay” more by correcting/creating semantic mappings
![Page 84: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/84.jpg)
Query Routing Example
Query: “honda civic 2007 review”
Keyword Analysis: make, model, year, attribute
Domain Selection: vehicle
Query Construction: vehicle (mk:honda, md:civic, yr:2007, review:?)
Source Selection: car-reviews-by-year.com > car-reviews.com > car-prices.com
Result Ranking
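The routing steps for this query can be mimicked with a toy pipeline. Everything in the sketch is hypothetical and chosen only to reproduce the slide's example: the keyword lexicon, the single `vehicle` domain, the attribute sets assigned to the three sources, and the overlap-count ranking are illustrative assumptions, not the PAYGO implementation. Result ranking is omitted.

```python
import re

# Hypothetical keyword-to-attribute lexicon.
LEXICON = {
    "honda": "mk", "toyota": "mk",
    "civic": "md", "corolla": "md",
    "review": "review", "price": "price",
}
# Hypothetical domain and source descriptions.
DOMAIN_ATTRS = {"vehicle": {"mk", "md", "yr", "review", "price"}}
SOURCES = {
    "car-reviews-by-year.com": {"mk", "md", "yr", "review"},
    "car-reviews.com": {"mk", "md", "review"},
    "car-prices.com": {"mk", "md", "price"},
}


def analyze_keywords(query):
    """Keyword analysis: tag each known keyword with an attribute."""
    tagged = []
    for token in query.lower().split():
        if re.fullmatch(r"(19|20)\d{2}", token):
            tagged.append((token, "yr"))
        elif token in LEXICON:
            tagged.append((token, LEXICON[token]))
    return tagged


def select_domain(tagged):
    """Domain selection: the domain sharing the most attributes with the query."""
    attrs = {attr for _, attr in tagged}
    return max(DOMAIN_ATTRS, key=lambda d: len(DOMAIN_ATTRS[d] & attrs))


def construct_query(tagged):
    """Query construction: a keyword naming an attribute becomes the requested field."""
    return {attr: ("?" if token == attr else token) for token, attr in tagged}


def select_sources(structured):
    """Source selection: rank sources by how many query attributes they cover."""
    attrs = set(structured)
    return sorted(SOURCES, key=lambda s: len(SOURCES[s] & attrs), reverse=True)


tagged = analyze_keywords("honda civic 2007 review")
domain = select_domain(tagged)
structured = construct_query(tagged)
ranked_sources = select_sources(structured)
```

Under these assumptions the pipeline yields the domain `vehicle`, the structured query mk:honda, md:civic, yr:2007, review:?, and the same source order as the slide.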
![Page 85: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/85.jpg)
Surfacing the Deep Web – A More Practical Solution?
Pre-compute all interesting form submissions for each HTML form
  Each form submission corresponds to a distinct URL
  Add URLs for each form submission into the search engine index
Enables the reuse of existing search engine infrastructure
  Deep-web URLs are like any other URL (GET method)
Reduced load on deep-web sites
  Sites are queried only in response to user clicks on search results
  Search engine performance is not dependent on the deep-web source
![Page 86: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/86.jpg)
Surfacing Challenges
1. Predicting the appropriate values for text inputs
  Valid input values are required for retrieving data
  e.g., ingredients on recipes.com and zip codes on borderstores.com
2. Predicting the correct input combinations
  Generating all possible URLs is wasteful and unnecessary
  Cars.com has ~500K listings, but 250M possible queries
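To make the second challenge concrete, the sketch below enumerates every combination of input values for a GET form and turns each into a URL. The form address `http://example.com/search`, the input names, and the candidate values are all hypothetical; a real surfacing system would prune these combinations rather than generate them all.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action_url, inputs):
    """Yield one GET URL per combination of candidate input values.

    inputs maps each form input name to its list of candidate values.
    """
    names = list(inputs)
    for combo in product(*(inputs[name] for name in names)):
        yield action_url + "?" + urlencode(dict(zip(names, combo)))


# Hypothetical form with two inputs and two candidate values each.
form_inputs = {"make": ["honda", "toyota"], "year": ["2006", "2007"]}
urls = list(surface_urls("http://example.com/search", form_inputs))
```

The number of URLs is the product of the value counts of all inputs (here 2 × 2 = 4), which is why a form with many inputs can have far more possible submissions than it has records behind it.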
![Page 87: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/87.jpg)
Google’s Deep-Web crawling system
Affects more than 1000 queries per second
Enables access to more than a million Deep-Web sites
Spans 50+ languages and 100+ domains
Results served from 400K+ distinct forms per day
Results validate the utility of Deep-Web content
Other systems: http://www.deeppeep.org/ …
![Page 88: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/88.jpg)
Searching for Structured Data on the Surface Web: EntitySearch@UIUC
Entity Extraction and Indexing
Ranking Entities Directly:
  Contextual: utilize entities' surrounding context
  Uncertain: extractions are not perfect
  Holistic: many pieces of evidence from multiple sources
  Discriminative: Web pages are of varying quality
  Associative: tell true associations from accidental ones
Other systems:
  NAGA (http://www.mpi-inf.mpg.de/~kasneci/naga/)
  Correlator (http://correlator.sandbox.yahoo.net/)
![Page 89: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/89.jpg)
References
Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web. K. C.-C. Chang. Tutorial, SIGMOD 2006.
EntityRank: Searching Entities Directly and Holistically. T. Cheng, X. Yan, and K. C.-C. Chang. VLDB 2007.
Web-scale Data Integration: You can only afford to Pay As You Go. J. Madhavan, S. R. Jeffery, S. Cohen, X. (Luna) Dong, D. Ko, C. Yu, and A. Halevy. CIDR 2007.
Google's Deep-Web Crawl. J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy. VLDB 2008.
![Page 90: 10/14/2015 1 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data (I) Jiawei Han and Micheline Kamber Department of Computer.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649ebb5503460f94bc366e/html5/thumbnails/90.jpg)
Thank you!