Text Mining with Machine Learning Techniques
Ping-Tsun Chang
Intelligent Systems Laboratory
Computer Science and Information Engineering
National Taiwan University
Text Analysis
• Language Identification
• Classification
• Clustering
• Summarization
• Feature Selection
Text Mining
• Text mining is about looking for patterns in natural language text
  – Natural Language Processing
• May be defined as the process of analyzing text to extract information from it for particular purposes.
  – Information Extraction
  – Information Retrieval
Text Mining and Knowledge Management
• A recent study indicated that 80% of a company's information is contained in text documents
  – emails, memos, customer correspondence, and reports
• The ability to distil this untapped source of information provides substantial competitive advantages for a company to succeed in the era of a knowledge-based economy.
Text Mining Applications
• Customer profile analysis
  – mining incoming emails for customer complaints and feedback.
• Patent analysis
  – analyzing patent databases for major technology players, trends, and opportunities.
• Information dissemination
  – organizing and summarizing trade news and reports for personalized information services.
• Company resource planning
  – mining a company's reports and correspondence for activities, status, and problems reported.
Text Categorization: Problem Definition
• Text categorization is the problem of automatically assigning predefined categories to free text documents
  – Document classification
  – Web page classification
  – News classification
Information Retrieval
• Full text is hard to process, but it is the complete representation of a document
• Logical view of documents
• Models
  – Boolean Model
  – Vector Model
  – Probabilistic Model
• Think of text as patterns?
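Evaluation
[Figure: overlapping sets Retrieved and Relevant. a = retrieved and relevant, b = retrieved but not relevant, d = relevant but not retrieved, c = neither]
$$\text{Precision} = \frac{a}{a + b}$$
$$\text{Recall} = \frac{a}{a + d}$$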
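The two ratios above can be checked on a small example. This is a minimal sketch, assuming documents are identified by integer ids; the function name `precision_recall` and the sample ids are illustrative and not part of the original slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # retrieved and relevant
    b = len(retrieved - relevant)   # retrieved but not relevant
    d = len(relevant - retrieved)   # relevant but not retrieved
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + d) if a + d else 0.0
    return precision, recall

# Hypothetical query result: 4 documents retrieved, 6 documents relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[3, 4, 5, 6, 7, 8])
print(p, r)
```

Region c (neither retrieved nor relevant) does not enter either ratio, which is why it is never counted here.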
Pattern Recognition
Sensing → Segmentation → Feature Extraction → Classification → Post-Processing → Decision
Pattern Classification
[Figure: two classes C1 and C2 in a feature space with axes f1 and f2]
Machine Learning
• Using computers to help us perform induction from large amounts of complex pattern data
• Bayesian Learning
• Instance-Based Learning
  – K-Nearest Neighbors
• Neural Networks
• Support Vector Machine
Feature Selection (I)
• Information Gain
$$IG(C, K_i) = E(C) - E(C \mid K_i)$$
$$= -\sum_{k=1}^{|c|} P(C_k)\log P(C_k) + P(K_i = 1)\sum_{k=1}^{|c|} P(C_k \mid K_i = 1)\log P(C_k \mid K_i = 1) + P(K_i = 0)\sum_{k=1}^{|c|} P(C_k \mid K_i = 0)\log P(C_k \mid K_i = 0)$$
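Information gain can be computed directly from document counts. The following is a rough sketch, assuming each document is a pair of a term set and a class label, that term presence is binary, and that logarithms are base 2 (the slides do not state the base). The toy corpus and all function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def class_distribution(docs, classes):
    """Relative frequency of each class among (terms, label) pairs."""
    return [sum(1 for _, label in docs if label == c) / len(docs) for c in classes]

def information_gain(docs, term):
    """Class entropy minus the expected class entropy after observing the term."""
    classes = sorted({label for _, label in docs})
    present = [d for d in docs if term in d[0]]
    absent = [d for d in docs if term not in d[0]]
    gain = entropy(class_distribution(docs, classes))
    for subset in (present, absent):
        if subset:  # an empty branch has probability zero and adds nothing
            weight = len(subset) / len(docs)
            gain -= weight * entropy(class_distribution(subset, classes))
    return gain

# Hypothetical toy corpus: (set of terms, class label)
toy_docs = [
    ({"goal", "match"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"vote", "party"}, "politics"),
    ({"vote", "match"}, "politics"),
]
ig_goal = information_gain(toy_docs, "goal")    # term splits the classes perfectly
ig_match = information_gain(toy_docs, "match")  # term is uninformative
```

A term that separates the classes perfectly recovers the full class entropy, while a term spread evenly across classes gains nothing, which is the basis for ranking terms during feature selection.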
Feature Selection (II)
• Mutual Information
$$MI(k_t, C) = \log\frac{1}{P(C)} - \log\frac{1}{P(C \mid k_t)} = \log\frac{P(k_t, C)}{P(k_t)\,P(C)}$$
• CHI-Square
$$\chi^2(k_t, C) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$
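Both scores can be derived from a 2x2 contingency table. The sketch below assumes the usual convention, which the slides do not spell out: A counts documents that contain the term and belong to the class, B those that contain the term but belong to another class, C those in the class without the term, D those with neither, and N = A + B + C + D. Probabilities are estimated as relative frequencies, and the natural logarithm is an assumption.

```python
import math

def contingency(docs, term, label):
    """Count the four cells (A, B, C, D) for one term and one class."""
    A = B = C = D = 0
    for terms, doc_label in docs:
        has_term, in_class = term in terms, doc_label == label
        if has_term and in_class:
            A += 1
        elif has_term:
            B += 1
        elif in_class:
            C += 1
        else:
            D += 1
    return A, B, C, D

def mutual_information(A, B, C, D):
    """log( P(k, C) / (P(k) P(C)) ), estimated from counts."""
    N = A + B + C + D
    if A == 0:
        return float("-inf")  # term and class never co-occur
    return math.log((A * N) / ((A + B) * (A + C)))

def chi_square(A, B, C, D):
    """N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# Hypothetical toy corpus: (set of terms, class label)
toy_docs = [
    ({"goal", "match"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"vote", "party"}, "politics"),
    ({"vote", "match"}, "politics"),
]
cells = contingency(toy_docs, "goal", "sports")
mi = mutual_information(*cells)
chi2 = chi_square(*cells)
```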
Weighting Scheme: TF·IDF
$$w(k_i, d) = \frac{\bigl(1 + \log tf(k_i, d)\bigr)\,\log(N / n_i)}{\|d\|}$$
$$\|d\| = \sqrt{\sum_{k_i \in d} w(k_i, d)^2}$$
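A possible implementation of this weighting is sketched below. It assumes N is the number of documents, the document-frequency count is the number of documents containing the term, and the norm ||d|| is taken over the weights before normalization so that every document vector ends up with unit length. These readings are assumptions, as are the function name and the sample documents.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # documents per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Unnormalized weight: (1 + log tf) * log(N / df)
        raw = {t: (1 + math.log(c)) * math.log(N / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors

# Hypothetical tokenized documents
sample_docs = [["svm", "svm", "text"], ["knn", "text"], ["svm", "kernel"]]
vectors = tfidf_vectors(sample_docs)
```

In the first sample document, "svm" occurs twice and "text" once, and both appear in two of the three documents, so "svm" receives the larger weight.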
Similarity Evaluation
• Cosine-like scheme
$$sim(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\|\,\|d_j\|} = \frac{\sum_{l=1}^{T} w_{li}\,w_{lj}}{\sqrt{\sum_{l=1}^{T} w_{li}^2}\;\sqrt{\sum_{l=1}^{T} w_{lj}^2}}$$
[Figure: document vectors d_i and d_j]
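For sparse document vectors the cosine measure reduces to a few sums. This is a small sketch, assuming each document is stored as a dictionary from term to weight; that representation and the function name are illustrative choices.

```python
import math

def cosine_similarity(di, dj):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(w * dj.get(t, 0.0) for t, w in di.items())
    norm_i = math.sqrt(sum(w * w for w in di.values()))
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_i * norm_j)

same = cosine_similarity({"a": 1.0, "b": 1.0}, {"a": 2.0, "b": 2.0})
partial = cosine_similarity({"a": 1.0}, {"a": 1.0, "b": 1.0})
disjoint = cosine_similarity({"a": 1.0}, {"b": 1.0})
```

Because the measure depends only on direction, scaling a document vector (for example by document length) leaves the similarity unchanged.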
Machine Learning Approaches: Bayesian Classifier
$$\arg\max_{c}\,\Bigl(P(C)\prod_{i=1}^{n} P(K_i \mid C)\Bigr)$$
$$P(K_i \mid C) = \frac{1 + N(K_i \mid C)}{T + N(C)}$$
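The decision rule and the smoothed estimate can be put together in a few lines. This sketch assumes that N(K_i | C) is the number of occurrences of term K_i in training documents of class C, that N(C) is the total number of term occurrences in class C, and that T is the vocabulary size; the slides do not define these symbols, so treat that reading as an assumption. Log-probabilities are summed instead of multiplying probabilities, which gives the same arg max while avoiding underflow.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns (priors, term_counts, totals, vocab)."""
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    for tokens, label in docs:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    totals = {c: sum(term_counts[c].values()) for c in class_docs}
    return priors, term_counts, totals, vocab

def classify_nb(model, tokens):
    """Return the class maximizing log P(C) + sum_i log P(K_i | C)."""
    priors, term_counts, totals, vocab = model
    T = len(vocab)

    def log_score(c):
        score = math.log(priors[c])
        for t in tokens:
            # Add-one smoothing: (1 + N(K_i | C)) / (T + N(C))
            score += math.log((1 + term_counts[c][t]) / (T + totals[c]))
        return score

    return max(priors, key=log_score)

# Hypothetical training set
training = [
    (["goal", "match", "team"], "sports"),
    (["goal", "score"], "sports"),
    (["vote", "party"], "politics"),
    (["vote", "election", "party"], "politics"),
]
model = train_nb(training)
label_a = classify_nb(model, ["goal", "team"])
label_b = classify_nb(model, ["vote"])
```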
Machine Learning Approaches: kNN Classifier
$$C(d, c_j) = \sum_{d_i \in kNN} sim(d, d_i)\,C(d_i, c_j)$$
[Figure: query document d with unknown class among labeled neighbors]
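A similarity-weighted vote over the k nearest neighbors can be sketched as follows. The code assumes cosine similarity over sparse {term: weight} vectors, and that the class indicator of a training document is 1 for its own class and 0 otherwise. The value k = 3, the helper names, and the sample data are illustrative assumptions.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(d, labeled, k=3):
    """Score each class by the summed similarity of its members among the k nearest."""
    neighbors = sorted(labeled, key=lambda item: cosine(d, item[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, label in neighbors:
        scores[label] += cosine(d, vec)  # each neighbor votes only for its own class
    return max(scores, key=scores.get)

# Hypothetical labeled documents
labeled_docs = [
    ({"goal": 1.0, "team": 1.0}, "sports"),
    ({"goal": 1.0, "match": 1.0}, "sports"),
    ({"vote": 1.0, "party": 1.0}, "politics"),
    ({"vote": 1.0, "election": 1.0}, "politics"),
]
predicted = knn_classify({"goal": 1.0, "score": 1.0}, labeled_docs, k=3)
```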
Machine Learning Approaches: Support Vector Machine
• Basic hypotheses: consistent hypotheses of the version space
• Project the original training data in space X to a higher-dimensional feature space F via a Mercer kernel operator K
$$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$$
Compare: SVM and Traditional Learners
• Traditional learner
• SVM accesses the hypothesis space!
[Figure: distributions P(h), P(h|D1), and P(h|D1^D2) plotted over the hypothesis axis]
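SVM Learning in Feature Spaces
$$x = (x_1, \ldots, x_n) \mapsto \phi(x) = (\phi_1(x), \ldots, \phi_d(x))$$
Example:
$$x = (p_1^x, p_1^y, p_1^z, p_2^x, p_2^y, p_2^z, m_1, m_2)$$
$$\phi(x) = \Bigl(\sum_{i \in \{x, y, z\}} (p_1^i - p_2^i)^2,\; m_1,\; m_2\Bigr)$$
[Figure: mapping from input space X to feature space F]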
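The example map can be written out explicitly. This sketch assumes the input is an 8-tuple holding the coordinates of two points followed by two masses, and that the first feature is the squared distance between the points; the function name `phi` and the sample values are illustrative.

```python
def phi(x):
    """Map (p1x, p1y, p1z, p2x, p2y, p2z, m1, m2) to (squared distance, m1, m2)."""
    p1, p2, m1, m2 = x[0:3], x[3:6], x[6], x[7]
    dist_sq = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return (dist_sq, m1, m2)

# Hypothetical input: two points three units apart, with masses 5 and 7
features = phi((0.0, 0.0, 0.0, 1.0, 2.0, 2.0, 5.0, 7.0))
```

The map reduces eight input coordinates to three features, so a relationship that depends only on the distance and the two masses becomes much simpler to express in the feature space.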
Support Vector Machine (cont’d)
• Nonlinear
  – Example: XOR Problem
• Natural Language is Nonlinear!
[Figure: the XOR problem in the input space (f1, f2) and in a transformed feature space]
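The XOR problem shows why a kernel helps: no line separates the two classes in the input plane, yet a decision function of the form f(x) = sum_i alpha_i K(x_i, x) can. The sketch below is not SVM training. It swaps in a kernel perceptron, a simpler learner that produces coefficients for the same functional form, with a degree-2 polynomial kernel chosen as an assumption. All names are illustrative.

```python
def poly_kernel(u, v, degree=2):
    """Polynomial kernel (1 + u.v)^degree."""
    return (1 + sum(ui * vi for ui, vi in zip(u, v))) ** degree

def decision(alphas, points, x, kernel=poly_kernel):
    """f(x) = sum_i alpha_i K(x_i, x); the sign of alpha_i carries the label."""
    return sum(a * kernel(p, x) for a, p in zip(alphas, points))

def kernel_perceptron(points, labels, kernel=poly_kernel, max_epochs=100):
    """Mistake-driven updates: add the label to alpha_j whenever x_j is misclassified."""
    alphas = [0.0] * len(points)
    for _ in range(max_epochs):
        mistakes = 0
        for j, (x, y) in enumerate(zip(points, labels)):
            if y * decision(alphas, points, x, kernel) <= 0:
                alphas[j] += y
                mistakes += 1
        if mistakes == 0:
            break  # every training point is on the correct side
    return alphas

xor_points = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [-1, 1, 1, -1]
alphas = kernel_perceptron(xor_points, xor_labels)
predictions = [1 if decision(alphas, xor_points, x) > 0 else -1 for x in xor_points]
```

An SVM would additionally choose the coefficients that maximize the margin; the perceptron only finds some separating solution.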
Support Vector Machine (cont’d)
• Consistent hypotheses
• Maximum margin
• Support vectors
Statistical Learning Theory
[Figure: a Generator draws x from P(x); a Supervisor returns y according to P(y|x); the Learner F(x) receives x and outputs y*]
Support Vector Machine: Linear Discriminant Functions
• Linear discriminant space
• Hyperplane
$$g(y) = a^t y$$
$$z_k\,g(y_k) \ge 1, \quad k = 1, \ldots, n$$
[Figure: hyperplane in the (y1, y2) plane with regions labeled g(y) > 1 and g(y) < 1]
Learning of Support Vector Machine
• Maximize margin
• Minimize ||a||
$$L(a, \alpha) = \frac{1}{2}\|a\|^2 - \sum_{k=1}^{n} \alpha_k \bigl[z_k\,a^t y_k - 1\bigr]$$
Optimal hyperplane
$$\frac{z_k\,g(y_k)}{\|a\|} \ge b, \quad k = 1, \ldots, n$$
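The quantities on this slide can be evaluated for a candidate weight vector. This sketch assumes g(y) = a^t y with labels z_k in {-1, +1}, and that samples are given in augmented form with a leading 1 so that the first component of a acts as a bias. It only evaluates the margin and the Lagrangian; it does not solve the optimization. The sample values and multipliers are made up for illustration.

```python
import math

def g(a, y):
    """Linear discriminant g(y) = a^t y."""
    return sum(ai * yi for ai, yi in zip(a, y))

def margin(a, samples):
    """Smallest value of z_k g(y_k) / ||a|| over all samples (y_k, z_k)."""
    norm = math.sqrt(sum(ai * ai for ai in a))
    return min(z * g(a, y) / norm for y, z in samples)

def lagrangian(a, alphas, samples):
    """L(a, alpha) = 0.5 ||a||^2 - sum_k alpha_k [z_k a^t y_k - 1]."""
    penalty = sum(alpha * (z * g(a, y) - 1) for alpha, (y, z) in zip(alphas, samples))
    return 0.5 * sum(ai * ai for ai in a) - penalty

# Hypothetical 1-D problem in augmented form (1, x): x = 2 is positive, x = 0 is negative
samples = [((1.0, 2.0), 1), ((1.0, 0.0), -1)]
a = (-1.0, 1.0)
m = margin(a, samples)
L = lagrangian(a, (0.5, 0.5), samples)
```

Both samples satisfy z_k g(y_k) = 1 exactly, so the bracketed terms vanish and the Lagrangian reduces to 0.5 ||a||^2, which is the behavior expected of support vectors at the optimum.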
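Version Space
• Hypothesis Space H
$$H = \Bigl\{ f \;\Big|\; f(x) = \frac{w \cdot \Phi(x)}{\|w\|},\; w \in W \Bigr\}$$
• Version Space V
$$V = \{ f \in H \mid \forall i \in \{1, \ldots, n\}:\; y_i\,f(x_i) > 0 \}$$
[Figure: version space V shown as a region inside hypothesis space H]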
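Membership in the version space is a simple consistency test. The sketch below assumes the identity feature map (Phi(x) = x), so each hypothesis is a weight vector w that classifies by the sign of the normalized inner product; the function names and sample data are illustrative.

```python
import math

def f(w, x):
    """f(x) = (w . x) / ||w||, with the identity feature map assumed."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return sum(wi * xi for wi, xi in zip(w, x)) / norm

def in_version_space(w, labeled):
    """True if w classifies every labeled example (x_i, y_i) correctly."""
    return all(y * f(w, x) > 0 for x, y in labeled)

# Hypothetical labeled data
labeled = [((1.0, 2.0), 1), ((-1.0, -0.5), -1)]
consistent = in_version_space((1.0, 0.0), labeled)
inconsistent = in_version_space((0.0, -1.0), labeled)
```

Each new labeled example adds one more constraint, so the version space can only shrink as training data arrives.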
Support Vector Machine Active Learning
• Why Support Vector Machine?
  – Text categorization involves large amounts of data
  – Traditional learning causes over-fitting
  – Language is complex and nonlinear
• Why Active Learning?
  – Labeling instances is time-consuming and costly
  – Reduce the need for labeled training instances
Active Learning: History
• Text Classification [Rocchio, 71] [Dumais, 98]
• Support Vector Machine [Vapnik, 82]
• Text Classification with Support Vector Machine [Joachims, 98] [Dumais, 98]
• Pool-Based Active Learning [Lewis, Gale ‘94] [McCallum, Nigam ‘98]
• The Nature of Statistical Learning Theory [Vapnik, 95]
• Automated Text Categorization Using Support Vector Machine [Kwok, 98]
Active Learning
• Pool-based active learning has a pool U of unlabeled instances
• Active learner l has three components (f, q, X)
  – f: classifier x -> {-1, 1}
  – q: querying function q(X); given the labeled training set X, it decides which instance in U to query next.
  – X: training data, labeled.
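The three components (f, q, X) fit together in a short loop. This sketch makes several assumptions that are not in the slides: a plain perceptron stands in for the SVM as the classifier f, the querying function q picks the pool instance closest to the current hyperplane, and a hypothetical `oracle` function plays the human labeler.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(labeled, dim, epochs=100):
    """Stand-in classifier f: returns weights w, with prediction sign(w . x)."""
    w = [0.0] * dim
    for _ in range(epochs):
        mistakes = 0
        for x, y in labeled:
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def query(w, pool):
    """Querying function q: the unlabeled instance nearest the decision boundary."""
    return min(pool, key=lambda x: abs(dot(w, x)))

def active_learn(pool, oracle, seed, rounds):
    """Repeatedly query one instance from the pool U, label it, and retrain on X."""
    labeled = list(seed)   # X: labeled training data
    pool = list(pool)      # U: unlabeled pool
    dim = len(seed[0][0])
    w = train_perceptron(labeled, dim)
    for _ in range(rounds):
        if not pool:
            break
        x = query(w, pool)
        pool.remove(x)
        labeled.append((x, oracle(x)))
        w = train_perceptron(labeled, dim)
    return w, labeled

# Hypothetical setup: the true label is the sign of the first coordinate
oracle = lambda x: 1 if x[0] > 0 else -1
pool = [(-3.0, 1.0), (-0.5, -1.0), (0.4, 2.0), (2.0, -1.0), (3.0, 0.5)]
seed = [((-2.0, 0.0), -1), ((2.5, 0.0), 1)]
w, labeled = active_learn(pool, oracle, seed, rounds=3)
```

Only three of the five pool instances are labeled here, which illustrates the motivation for active learning: the learner spends labeling effort on the instances it is least sure about.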
Active Learning (cont’d)
• Main difference: querying component q.
• How to choose the next unlabeled instance to query?
• Resulting Version Space
$$V_i^- = V_i \cap \{ w \in W \mid -(w \cdot \Phi(x_{i+1})) > 0 \}$$
$$V_i^+ = V_i \cap \{ w \in W \mid +(w \cdot \Phi(x_{i+1})) > 0 \}$$
Active Learner
• Active learner l* always queries instances whose corresponding hyperplanes in parameter space W halve the area of the current version space
$$\forall i \in \mathbb{N}: \quad \sup_{P \in \mathcal{P}} E_P\bigl[Area(V_i^*)\bigr] \le \sup_{P \in \mathcal{P}} E_P\bigl[Area(V_i)\bigr]$$
Experiments: Bayesian Classifier
Comparison of Learning Methods
[Figure: Precision (0.2 to 1) versus Training Data Size (0 to 60) for SVM, kNN, NB, and NNet]
Conclusions
• Text mining extracts knowledge from text.
• Support Vector Machine is among the best statistics-based machine learning methods
• Natural Language Understanding is still an open problem