Text Mining with Machine Learning Techniques


Page 1: Text Mining with Machine Learning Techniques

Text Mining with Machine Learning Techniques

Ping-Tsun Chang

Intelligent Systems Laboratory

Computer Science and Information Engineering

National Taiwan University

Page 2: Text Mining with Machine Learning Techniques

Text Analysis

• Language Identification

• Classification

• Clustering

• Summarization

• Feature Selection

Page 3: Text Mining with Machine Learning Techniques

Text Mining

• Text mining is about looking for patterns in natural-language text
– Natural Language Processing

• It may be defined as the process of analyzing text to extract information for particular purposes
– Information Extraction
– Information Retrieval

Page 4: Text Mining with Machine Learning Techniques

Text Mining and Knowledge Management

• A recent study indicated that 80% of a company's information is contained in text documents
– emails, memos, customer correspondence, and reports

• The ability to distil this untapped source of information provides substantial competitive advantages for a company to succeed in the era of a knowledge-based economy.

Page 5: Text Mining with Machine Learning Techniques

Text Mining Applications

• Customer profile analysis
– mining incoming emails for customers' complaints and feedback.

• Patent analysis
– analyzing patent databases for major technology players, trends, and opportunities.

• Information dissemination
– organizing and summarizing trade news and reports for personalized information services.

• Company resource planning
– mining a company's reports and correspondence for activities, status, and problems reported.

Page 6: Text Mining with Machine Learning Techniques

Text Categorization: Problem Definition

• Text categorization is the problem of automatically assigning predefined categories to free-text documents
– Document classification
– Web page classification
– News classification

Page 7: Text Mining with Machine Learning Techniques

Information Retrieval

• Full text is hard to process, but it is a complete representation of the document

• Logical view of documents

• Models
– Boolean Model
– Vector Model
– Probabilistic Model

• Can we think of text as patterns?

Page 8: Text Mining with Machine Learning Techniques

Evaluation

[Figure: Venn diagram of the Retrieved and Relevant document sets, with regions labeled a, b, c, and d]

$$\text{Precision} = \frac{a}{a + b} \qquad \text{Recall} = \frac{a}{a + d}$$
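The two ratios on this slide can be sketched in a few lines of Python. The reading of the regions is inferred from the formulas rather than stated on the slide: a is taken to be the documents that are both retrieved and relevant, b those retrieved but not relevant, and d those relevant but not retrieved. The function name and the counts are hypothetical.

```python
def precision_recall(a, b, d):
    """Return (precision, recall) from region counts.

    a: retrieved and relevant
    b: retrieved but not relevant   (assumed reading of the diagram)
    d: relevant but not retrieved   (assumed reading of the diagram)
    """
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + d) if (a + d) else 0.0
    return precision, recall


# Hypothetical counts, chosen only for illustration.
p, r = precision_recall(a=30, b=10, d=20)
print(f"precision={p:.2f} recall={r:.2f}")
```

Region c does not enter either ratio, so it is not a parameter here.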

Page 9: Text Mining with Machine Learning Techniques

Pattern Recognition

Sensing → Segmentation → Feature Extraction → Classification → Post-Processing → Decision

Page 10: Text Mining with Machine Learning Techniques

Pattern Classification

[Figure: two classes, C1 and C2, plotted in a feature space with axes f1 and f2]

Page 11: Text Mining with Machine Learning Techniques

Machine Learning

• Using computers to help us perform induction from large amounts of complex pattern data

• Bayesian Learning

• Instance-Based Learning
– K-Nearest Neighbors

• Neural Networks

• Support Vector Machine

Page 12: Text Mining with Machine Learning Techniques

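Feature Selection (I)

• Information Gain

$$\begin{aligned}
IG(C, K_i) &= E(C) - E(C \mid K_i) \\
&= -\sum_{k=1}^{|c|} P(C_k)\log P(C_k) \\
&\quad + P(K_i = 1)\sum_{k=1}^{|c|} P(C_k \mid K_i = 1)\log P(C_k \mid K_i = 1) \\
&\quad + P(K_i = 0)\sum_{k=1}^{|c|} P(C_k \mid K_i = 0)\log P(C_k \mid K_i = 0)
\end{aligned}$$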
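As a rough illustration of information gain, the sketch below computes the class entropy minus the class entropy conditioned on whether a term is present in a document. The data layout (a set of terms plus a label per document), the function names, the toy corpus, and the choice of base-2 logarithms are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """E(C) = -sum_k P(C_k) log2 P(C_k), estimated from a list of labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, term):
    """IG = E(C) - E(C | K), with K = 1 when `term` occurs in a document."""
    labels = [label for _, label in docs]
    present = [label for words, label in docs if term in words]
    absent = [label for words, label in docs if term not in words]
    p_present = len(present) / len(docs)
    conditional = (p_present * entropy(present)
                   + (1 - p_present) * entropy(absent))
    return entropy(labels) - conditional


# Hypothetical four-document corpus.
docs = [
    ({"goal", "match"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"vote", "party"}, "politics"),
    ({"vote", "team"}, "politics"),
]
print(information_gain(docs, "goal"))  # term that separates the classes
print(information_gain(docs, "team"))  # term spread evenly over the classes
```

Terms with the highest gain would be kept as features; terms with gain near zero carry little class information.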

Page 13: Text Mining with Machine Learning Techniques

Feature Selection (II)

• Mutual Information

$$MI(K_t, C) = \log\frac{1}{P(C)} - \log\frac{1}{P(C \mid K_t)} = \log\frac{P(K_t, C)}{P(K_t)\,P(C)}$$

• CHI-Square

$$\chi^2(K_t, C) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$
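Both scores can be computed from a 2×2 contingency table of document counts. The slide does not define A, B, C, D, and N, so the sketch below assumes the usual convention: A documents contain the term and belong to the class, B contain the term but belong to other classes, C belong to the class without the term, D are neither, and N = A + B + C + D. The function names and counts are hypothetical.

```python
import math


def mutual_information(A, B, C, D):
    """log( P(k, c) / (P(k) P(c)) ) estimated from document counts."""
    N = A + B + C + D
    if A == 0:
        return float("-inf")  # term never co-occurs with the class
    return math.log((A * N) / ((A + B) * (A + C)))


def chi_square(A, B, C, D):
    """N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    N = A + B + C + D
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    if denominator == 0:
        return 0.0
    return N * (A * D - C * B) ** 2 / denominator


# Hypothetical counts: the term is strongly associated with the class.
print(mutual_information(40, 10, 10, 40))
print(chi_square(40, 10, 10, 40))
# Hypothetical counts: term and class are independent.
print(mutual_information(25, 25, 25, 25), chi_square(25, 25, 25, 25))
```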

Page 14: Text Mining with Machine Learning Techniques

Weighting Scheme: TF·IDF

$$w(k_i, d) = \frac{\bigl(1 + \log tf(k_i, d)\bigr)\,\log(N / n_i)}{\|d\|}$$

$$\|d\| = \sqrt{\sum_{k_i \in d} w(k_i, d)^2}$$
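A minimal sketch of this weighting follows. It assumes that N is the number of documents, that n_i is the number of documents containing term k_i, that logarithms are natural, and that the norm is computed from the weights before normalization; none of these details are spelled out on the slide. The function name and corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    N = len(docs)
    # n_i: number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: (1 + math.log(c)) * math.log(N / doc_freq[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0)
                        for t, w in raw.items()})
    return vectors


# Hypothetical corpus; "text" occurs in every document.
corpus = [
    ["text", "mining", "text"],
    ["text", "svm"],
    ["text", "kernel", "svm"],
]
for vector in tfidf_vectors(corpus):
    print(vector)
```

A term that appears in every document gets weight zero because log(N / n_i) vanishes, which is the intended effect of the IDF factor.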

Page 15: Text Mining with Machine Learning Techniques

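Similarity Evaluation

• Cosine-like scheme

$$sim(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\|\,\|d_j\|} = \frac{\sum_{l=1}^{T} w_{li}\, w_{lj}}{\sqrt{\sum_{l=1}^{T} w_{li}^2}\ \sqrt{\sum_{l=1}^{T} w_{lj}^2}}$$

[Figure: document vectors d_i and d_j and the angle between them]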
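The cosine measure between two weighted document vectors can be sketched as below. Representing each document as a sparse `{term: weight}` dictionary is an assumption made for the example, as are the function name and the sample vectors.

```python
import math


def cosine_similarity(d_i, d_j):
    """Dot product of two sparse vectors divided by the product of their norms."""
    shared_terms = set(d_i) & set(d_j)
    dot = sum(d_i[t] * d_j[t] for t in shared_terms)
    norm_i = math.sqrt(sum(w * w for w in d_i.values()))
    norm_j = math.sqrt(sum(w * w for w in d_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)


# Hypothetical term weights.
doc_a = {"text": 1.0, "mining": 1.0}
doc_b = {"text": 1.0}
doc_c = {"kernel": 2.0}
print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```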

Page 16: Text Mining with Machine Learning Techniques

Machine Learning Approaches: Bayesian Classifier

$$\arg\max_{C}\Bigl( P(C)\prod_{i=1}^{n} P(K_i \mid C) \Bigr)$$

$$P(K_i \mid C) = \frac{1 + N(K_i \mid C)}{T + N(C)}$$
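A compact naive Bayes sketch in this spirit is shown below. It assumes that N(K_i | C) counts occurrences of term K_i in the training documents of class C, that N(C) is the total number of term occurrences in that class, and that T is the vocabulary size; the slide does not define these symbols. Scores are summed in log space to avoid underflow. All names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (token_list, label) pairs."""
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)            # N(K_i | C)
    for tokens, label in docs:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    priors = {c: n / len(docs) for c, n in class_docs.items()}      # P(C)
    totals = {c: sum(term_counts[c].values()) for c in class_docs}  # N(C)
    return priors, term_counts, totals, len(vocab)


def classify(tokens, model):
    """Return the class maximizing log P(C) + sum_i log P(K_i | C)."""
    priors, term_counts, totals, T = model
    best_label, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for t in tokens:
            score += math.log((1 + term_counts[c][t]) / (T + totals[c]))
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Hypothetical training set.
training = [
    (["goal", "match", "team"], "sports"),
    (["goal", "score"], "sports"),
    (["vote", "party"], "politics"),
    (["vote", "election", "party"], "politics"),
]
model = train_naive_bayes(training)
print(classify(["goal", "team"], model))
print(classify(["vote"], model))
```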

Page 17: Text Mining with Machine Learning Techniques

Machine Learning Approaches: kNN Classifier

$$C(d, c_j) = \sum_{d_i \in kNN(d)} sim(d, d_i)\, C(d_i, c_j)$$

[Figure: a new document d with unknown class among its labeled neighbors]
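The similarity-weighted vote can be sketched as follows, assuming cosine similarity as sim and treating C(d_i, c_j) as 1 when training document d_i carries label c_j and 0 otherwise. The sparse-dictionary representation, the function names, and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_scores(d, training, k):
    """Score each class by summing sim(d, d_i) over the k nearest neighbors."""
    neighbors = sorted(training, key=lambda item: cosine(d, item[0]),
                       reverse=True)[:k]
    scores = defaultdict(float)
    for d_i, label in neighbors:
        scores[label] += cosine(d, d_i)
    return dict(scores)


# Hypothetical labeled documents.
training = [
    ({"goal": 1.0, "team": 1.0}, "sports"),
    ({"goal": 1.0, "match": 1.0}, "sports"),
    ({"vote": 1.0, "party": 1.0}, "politics"),
    ({"vote": 1.0, "team": 1.0}, "politics"),
]
scores = knn_scores({"goal": 1.0, "team": 1.0}, training, k=3)
print(max(scores, key=scores.get), scores)
```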

Page 18: Text Mining with Machine Learning Techniques

Machine Learning Approaches: Support Vector Machine

• Basic hypotheses: consistent hypotheses of the version space

• Project the original training data in space X to a higher-dimensional feature space F via a Mercer kernel operator K

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$$
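To make the kernel expansion concrete, the sketch below evaluates f(x) as a weighted sum of kernel values between x and the training points. It assumes the coefficients α_i are signed, so that each one already absorbs its training label. The kernels, points, and coefficients are hypothetical and are not the result of training.

```python
def linear_kernel(u, v):
    """Plain dot product."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2):
    """(1 + u . v)^degree, one common Mercer kernel."""
    return (1 + linear_kernel(u, v)) ** degree


def decision_value(x, points, alphas, kernel):
    """f(x) = sum_i alpha_i * K(x_i, x)."""
    return sum(alpha * kernel(x_i, x) for alpha, x_i in zip(alphas, points))


# Hypothetical training points and signed coefficients.
points = [(1.0, 0.0), (0.0, 1.0)]
alphas = [1.0, -1.0]
x = (2.0, 1.0)
print(decision_value(x, points, alphas, linear_kernel))
print(decision_value(x, points, alphas, polynomial_kernel))
```

The sign of the returned value would give the predicted class; swapping the kernel changes the feature space without changing the rest of the code.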

Page 19: Text Mining with Machine Learning Techniques

Compare: SVM and Traditional Learners

• Traditional Learner

• SVM accesses the hypothesis space!

[Figure: three distributions over hypotheses: P(h), P(h|D1), and P(h|D1^D2)]

Page 20: Text Mining with Machine Learning Techniques

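SVM Learning in Feature Spaces

$$x = (x_1, \dots, x_n) \mapsto \phi(x) = (\phi_1(x), \dots, \phi_d(x))$$

Example:

$$x = (p_1^x, p_1^y, p_1^z, p_2^x, p_2^y, p_2^z, m_1, m_2)$$

$$\phi(x) = \Bigl(\sum_{i \in \{x, y, z\}} (p_1^i - p_2^i)^2,\ m_1,\ m_2\Bigr)$$

[Figure: mapping from the input space X to the feature space F]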
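As a small illustration of a feature map, the sketch below assumes the example on this slide takes the coordinates of two points p1 and p2 together with two scalars m1 and m2, and maps them to the squared distance between the points plus the two scalars. That reading of the slide is an assumption, and the function name and sample values are hypothetical.

```python
def feature_map(p1, p2, m1, m2):
    """Map eight raw inputs (two 3-D points, two scalars) to three features."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return (squared_distance, m1, m2)


# Hypothetical input: eight raw coordinates reduce to three features.
print(feature_map((0.0, 0.0, 0.0), (1.0, 2.0, 2.0), 5.0, 3.0))
```

The point of such a map is that a relationship that is awkward in the raw coordinates can become simple in the new features.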

Page 21: Text Mining with Machine Learning Techniques

Support Vector Machine (cont’d)

• Nonlinear
– Example: XOR Problem

• Natural Language is Nonlinear!

[Figure: the XOR problem shown in the (f1, f2) plane and in a transformed space whose axes involve the product f1 f2 and squared features]
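The XOR problem is a standard case where no line in the (f1, f2) plane separates the classes, yet a linear rule works once a product feature is added. The sketch below uses a hypothetical map (f1, f2) to (f1, f2, f1·f2) with hand-picked weights; it is one possible mapping chosen for illustration and not necessarily the one drawn on the slide.

```python
def phi(f1, f2):
    """Hypothetical feature map: append the product of the two inputs."""
    return (f1, f2, f1 * f2)


# Hand-picked linear rule in the mapped space: f1 + f2 - 2*f1*f2 - 0.5 > 0.
WEIGHTS = (1.0, 1.0, -2.0)
BIAS = -0.5


def predict_xor(f1, f2):
    """Linear threshold applied to phi(f1, f2)."""
    activation = sum(w * v for w, v in zip(WEIGHTS, phi(f1, f2))) + BIAS
    return 1 if activation > 0 else 0


for f1 in (0, 1):
    for f2 in (0, 1):
        print(f1, f2, predict_xor(f1, f2))
```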

Page 22: Text Mining with Machine Learning Techniques

Support Vector Machine (cont’d)

• Consistent hypotheses

• Maximum margin

• Support Vector

Page 23: Text Mining with Machine Learning Techniques

Statistical Learning Theory

[Figure: a Generator draws x from P(x); a Supervisor returns y according to P(y|x); the Learner F(x) receives x and y and outputs y*]

Page 24: Text Mining with Machine Learning Techniques

Support Vector Machine: Linear Discriminant Functions

• Linear discriminant space

• Hyperplane

$$g(y) = a^t y$$

$$z_k\, g(y_k) \ge 1, \quad k = 1, \dots, n$$

[Figure: a hyperplane in the (y1, y2) plane separating the regions labeled g(y) > 1 and g(y) < 1]

Page 25: Text Mining with Machine Learning Techniques

Learning of Support Vector Machine

• Maximize margin

• Minimize ||a||

$$L(a, \alpha) = \frac{1}{2}\|a\|^2 - \sum_{k=1}^{n} \alpha_k \bigl[ z_k\, a^t y_k - 1 \bigr]$$

$$\frac{z_k\, g(y_k)}{\|a\|} \ge b, \quad k = 1, \dots, n$$

[Figure: optimal hyperplane]
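A short worked step may help connect the two bullets. Assuming the Lagrangian on this slide has the standard form L(a, α) = ½||a||² − Σ_k α_k [z_k aᵗ y_k − 1] with multipliers α_k ≥ 0, setting its gradient with respect to a to zero expresses a through the training samples, and substituting back leaves a problem in α alone:

```latex
\begin{aligned}
\frac{\partial L}{\partial a} &= a - \sum_{k=1}^{n} \alpha_k z_k y_k = 0
\quad\Longrightarrow\quad a = \sum_{k=1}^{n} \alpha_k z_k y_k \\
L(\alpha) &= \sum_{k=1}^{n} \alpha_k
 - \frac{1}{2}\sum_{k=1}^{n}\sum_{j=1}^{n} \alpha_k \alpha_j z_k z_j\, y_j^t y_k,
\qquad \alpha_k \ge 0
\end{aligned}
```

Only samples with α_k > 0 contribute to a; these are the support vectors.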

Page 26: Text Mining with Machine Learning Techniques

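Version Space

• Hypothesis Space H

$$H = \left\{ f \;\middle|\; f(x) = \frac{w \cdot \Phi(x)}{\|w\|},\ w \in W \right\}$$

• Version Space V

$$V = \bigl\{ f \in H \;\big|\; \forall i \in \{1, \dots, n\}:\ y_i f(x_i) > 0 \bigr\}$$

[Figure: the version space V as a region inside the hypothesis space H]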

Page 27: Text Mining with Machine Learning Techniques

Support Vector Machine Active Learning

• Why Support Vector Machine?
– Text categorization involves large amounts of data
– Traditional learning causes over-fitting
– Language is complex and nonlinear

• Why Active Learning?
– Labeling instances is time-consuming and costly
– Reduces the need for labeled training instances

Page 28: Text Mining with Machine Learning Techniques

Active Learning: History

Text Classification [Rocchio, 71] [Dumais, 98]

Support Vector Machine [Vapnik, 82]

Text Classification with Support Vector Machine [Joachims, 98] [Dumais, 98]

Pool-Based Active Learning [Lewis, Gale ‘94] [McCallum, Nigam ‘98]

The Nature of Statistical Learning Theory [Vapnik, 95]

Automated Text Categorization Using Support Vector Machine [Kwok, 98]

Page 29: Text Mining with Machine Learning Techniques

Active Learning

• Pool-based active learning has a pool U of unlabeled instances

• An active learner l has three components (f, q, X)
– f: classifier x -> {-1, 1}
– q: querying function q(X); given the current labeled training set X, it decides which instance in U to query next.
– X: training data, labeled.
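The three components can be seen working together in a deliberately small sketch. Here a one-dimensional threshold classifier stands in for the SVM, the querying function picks the unlabeled instance closest to the current decision boundary, and an `oracle` function plays the human labeler. All of these choices, the names, and the data are assumptions made for illustration only.

```python
def fit_threshold(X):
    """f: put the boundary midway between the closest labeled -1 and +1."""
    lo = max(x for x, y in X if y == -1)
    hi = min(x for x, y in X if y == 1)
    return (lo + hi) / 2


def classify(x, threshold):
    """Apply the classifier: x -> {-1, 1}."""
    return 1 if x > threshold else -1


def query(U, threshold):
    """q: choose the unlabeled instance nearest the current boundary."""
    return min(U, key=lambda x: abs(x - threshold))


def active_learn(U, X, oracle, budget):
    """Repeatedly query one instance from pool U and add its label to X."""
    U, X = list(U), list(X)
    for _ in range(budget):
        if not U:
            break
        x = query(U, fit_threshold(X))
        U.remove(x)
        X.append((x, oracle(x)))
    return fit_threshold(X), X


# Hypothetical pool of 99 points and a hidden boundary at 0.37.
pool = [i / 100 for i in range(1, 100)]
seed = [(0.0, -1), (1.0, 1)]


def oracle(x):
    return 1 if x > 0.37 else -1


threshold, labeled = active_learn(pool, seed, oracle, budget=10)
print(threshold, len(labeled) - len(seed))
```

In this toy setting ten queries are enough to classify the whole pool correctly, whereas labeling instances in arbitrary order could require many more.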

Page 30: Text Mining with Machine Learning Techniques

Active Learning (cont’d)

• Main difference: querying component q.

• How to choose the next unlabeled instance to query?

• Resulting Version Space

$$V_i^- = V_i \cap \{ w \in W \mid -(w \cdot \Phi(x_{i+1})) > 0 \}$$

$$V_i^+ = V_i \cap \{ w \in W \mid +(w \cdot \Phi(x_{i+1})) > 0 \}$$

Page 31: Text Mining with Machine Learning Techniques

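Active Learner

• Active learner l* always queries instances whose corresponding hyperplanes in parameter space W halve the area of the current version space

$$\forall i \in \mathbb{N}^+:\quad \sup_{P \in \mathcal{P}} E_P\bigl[\mathrm{Area}(V_i^*)\bigr] \le \sup_{P \in \mathcal{P}} E_P\bigl[\mathrm{Area}(V_i)\bigr]$$

where V_i^* and V_i are the version spaces of l* and of any other active learner after i queries.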
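The halving idea can be illustrated with a discrete stand-in for the parameter space. The sketch below represents W by evenly spaced unit vectors in the plane, keeps those consistent with the labeled data as the version space, and picks the candidate instance whose hyperplane splits that set most evenly. The discretization, the identity feature map, the function names, and the data are all assumptions for illustration; this is not the procedure from the slides.

```python
import math


def unit_vectors(n):
    """Evenly spaced unit vectors on the circle, a discrete stand-in for W."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def version_space(W, labeled):
    """Keep every w that classifies all labeled (x, y) pairs correctly."""
    return [w for w in W if all(y * dot(w, x) > 0 for x, y in labeled)]


def split(V, x):
    """Partition V by the hyperplane that instance x induces in W."""
    V_minus = [w for w in V if dot(w, x) < 0]
    V_plus = [w for w in V if dot(w, x) > 0]
    return V_minus, V_plus


def most_balanced_query(V, pool):
    """Pick the instance whose split of V is closest to half and half."""
    def imbalance(x):
        V_minus, V_plus = split(V, x)
        return abs(len(V_plus) - len(V_minus))
    return min(pool, key=imbalance)


# Hypothetical data: one labeled instance and three candidate queries.
W = unit_vectors(360)
V = version_space(W, [((1.0, 0.0), 1)])
pool = [(1.0, 0.1), (0.0, 1.0), (1.0, 1.0)]
best = most_balanced_query(V, pool)
print(best, [len(part) for part in split(V, best)])
```

Whichever label the chosen instance turns out to have, roughly half of the remaining candidates are eliminated, which is the property the bound above relies on.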

Page 32: Text Mining with Machine Learning Techniques

Experiments: Bayesian Classifier

Page 33: Text Mining with Machine Learning Techniques

Comparison of Learning Methods

[Figure: precision (y-axis, 0.2 to 1) versus training data size (x-axis, 0 to 60) for SVM, kNN, NB, and NNet]

Page 34: Text Mining with Machine Learning Techniques

Conclusions

• Text mining extracts knowledge from text.

• The Support Vector Machine is almost the best statistics-based machine learning method

• Natural Language Understanding is still an open problem