Information filtering, By Hadi Mohammadzadeh


Page 1: Information filtering, By Hadi Mohammadzadeh


Hadi Mohammadzadeh Information Filtering (IF) & Text Classification (TC)

By: Hadi Mohammadzadeh, Institute of Applied Information Processing, University of Ulm

Seminar on

Information Filtering (IF) & Text Classification (TC)

Page 2: Information filtering, By Hadi Mohammadzadeh


Content:

• Information Filtering
– Definition and Terminology
– General Framework
– Performance Evaluation
– Background and State of the Art
– Comparison of Related Tasks
– Summary

Page 3: Information filtering, By Hadi Mohammadzadeh


Definition and Terminology (IF)

• What is the objective of IF? To reduce the user's information load with respect to their areas of interest.

• Is there a general framework for an IF system? Yes. We will describe it and also review some measures for performance evaluation.

Page 4: Information filtering, By Hadi Mohammadzadeh


Definition and Terminology (IF), continued

• Definition (IF): Assume a space of documents $D$. With respect to a specific long-term user interest, we define IF as a mapping

$f : D \to \{0, 1\}$

where 0 means rejecting a document and 1 means accepting a document.

• Another definition for IF: we define IF as a mapping

$f : D \to [0, 1]$

This notation requires a threshold on the relevance score.
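The link between the two definitions can be sketched in a few lines of Python: a relevance score in [0, 1] becomes an accept/reject decision once a threshold is fixed. This is only an illustrative sketch; `make_binary_filter`, `toy_score` and the threshold value 0.25 are hypothetical names and choices that do not come from the slides.

```python
from typing import Callable


def make_binary_filter(score: Callable[[str], float],
                       threshold: float) -> Callable[[str], int]:
    """Turn a relevance score f: D -> [0, 1] into a decision f: D -> {0, 1}."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")

    def decide(document: str) -> int:
        # 1 = accept the document, 0 = reject it
        return 1 if score(document) >= threshold else 0

    return decide


def toy_score(document: str) -> float:
    """Hypothetical relevance score: share of tokens equal to 'filtering'."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    return sum(token == "filtering" for token in tokens) / len(tokens)


# Assumed threshold; in practice it would be tuned for the application.
accept = make_binary_filter(toy_score, threshold=0.25)
```

Calling `accept("information filtering")` accepts the document, because half of its tokens match, while a document without the word is rejected.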

Page 5: Information filtering, By Hadi Mohammadzadeh


Definition and Terminology (IF), continued

• IF is a process that is actively conducted by humans, with or without the assistance of a machine, in order to cope with information overload.

Process:
1. Collecting documents
2. Detecting relevant documents
3. Presenting the result to the user

The goal of an IF system (IFS) is to automate this process.

• Definition (IFS): An IFS automates the process of IF with the goal of reducing information overload.

Page 6: Information filtering, By Hadi Mohammadzadeh


Definition and Terminology (IF), continued

• In any specific IFS, four basic points need to be analyzed:

1. Input: a stream of textual documents, email, newsgroups, images
2. Output: online, batch processing
3. Profile construction: human, machine design
4. The concept of relevance: content-based filtering, collaborative filtering, economic filtering

Page 7: Information filtering, By Hadi Mohammadzadeh


General Framework (IF)

[Diagram: an information need $\nu$ from the user interest space $N$ and a document $d$ from the document space $D$ are compared by human judgment $\alpha_h$, which yields a value in $[0,1]^l$.]

$\alpha_h : N \times D \to [0,1]^l$

is considered as a comparison function and shows the human judgment of the relationship between the user's interest and a document.

Since the user's interest may have many different parts and thus be based on different aspects, there are $l$ different aspects that can each be measured on a numeric scale.

Page 8: Information filtering, By Hadi Mohammadzadeh


General Framework (IF), continued

Based on the result of the comparison function $\alpha_h$, the user can decide whether to reject or to accept a document. For this we define the following function:

$\beta_h : [0,1]^l \to \{0, 1\}$

Final function: for a given information need $\nu \in N$ and any document $d \in D$, information filtering is defined as:

$f(d) = \beta_h(\alpha_h(\nu, d))$

Page 9: Information filtering, By Hadi Mohammadzadeh


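General Framework (IF), continued

[Diagram with two sides.
Human side: an information need from the user interest space $N$ and a document from the document space $D$ are compared by human judgment (subscript $h$), yielding a value in $[0,1]^l$.
System side: the document representation function $D \to R$ maps documents into the document representation space $R$; the profile acquisition function $N \to P$ maps the information need into the user interest representation space $P$ (the profile); the comparison function (subscript $s$) yields a value in $[0,1]^l$.]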

Page 10: Information filtering, By Hadi Mohammadzadeh


General Framework (IF), continued

So, to automate the core of an IFS, every approach must have four basic components:

1. A technique for representing documents, denoted as the document representation function $\rho : D \to R$

2. A technique for representing the user's information need, denoted as the profile acquisition function $\pi : N \to P$

3. A technique for matching the representation of the user's information need against the document representation, denoted as the comparison function $\alpha_s : P \times R \to [0,1]^l$

4. A technique for using the results of this comparison, denoted as the system decision function $\beta_s : [0,1]^l \to \{0, 1\}$
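The four components can be sketched as four small Python functions that are chained into one filter. This is a minimal illustration under strong assumptions, not the method of the seminar: documents are represented as bags of lowercase words, the information need is assumed to be given as a few example documents, the comparison yields a single score (l = 1), and the 0.5 threshold is arbitrary. All function names are hypothetical.

```python
from collections import Counter
from typing import List


def represent(document: str) -> Counter:
    """Document representation function (D -> R): bag of lowercase words."""
    return Counter(document.lower().split())


def acquire_profile(relevant_examples: List[str]) -> Counter:
    """Profile acquisition function (N -> P).

    Assumption: the information need is expressed through example
    documents, whose word counts are summed into one profile.
    """
    profile: Counter = Counter()
    for doc in relevant_examples:
        profile.update(represent(doc))
    return profile


def compare(profile: Counter, doc_repr: Counter) -> float:
    """Comparison function (P x R -> [0, 1], i.e. l = 1).

    Returns the share of the document's word occurrences whose word
    also occurs in the profile.
    """
    total = sum(doc_repr.values())
    if total == 0:
        return 0.0
    shared = sum(n for word, n in doc_repr.items() if word in profile)
    return shared / total


def decide(score: float, threshold: float = 0.5) -> int:
    """System decision function ([0, 1] -> {0, 1}): 1 = accept, 0 = reject."""
    return 1 if score >= threshold else 0


def filter_document(profile: Counter, document: str,
                    threshold: float = 0.5) -> int:
    """Chain the components: decide(compare(profile, represent(document)))."""
    return decide(compare(profile, represent(document)), threshold)


profile = acquire_profile([
    "text classification with machine learning",
    "machine learning for information filtering",
])
```

With this profile, `filter_document(profile, "machine learning for text")` accepts the document, while a document that shares no words with the profile is rejected.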

Page 11: Information filtering, By Hadi Mohammadzadeh


General Framework (IF), continued

Conclusion:

An obvious objective for a filtering system (FS) is that, for all $\nu \in N$ and $d \in D$,

$\beta_h(\alpha_h(\nu, d)) = \beta_s(\alpha_s(\pi(\nu), \rho(d)))$

Page 12: Information filtering, By Hadi Mohammadzadeh


Performance Evaluation (IF)

Performance evaluation (PE) has two aspects:

• Effectiveness: accept relevant documents and reject non-relevant documents.

• Efficiency: the resources that are consumed to produce the filtering output, such as computation time and labeled training data.
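Effectiveness can be quantified by comparing the system's accept/reject decisions with human relevance judgments. The sketch below counts the four possible outcomes and derives precision and recall from them; these two measures are a common choice but are an assumption here, since the slide does not name a specific measure. The function name `effectiveness` is hypothetical.

```python
from typing import Dict, Sequence


def effectiveness(decisions: Sequence[int],
                  relevant: Sequence[int]) -> Dict[str, float]:
    """Compare system decisions (1 = accept) with human judgments (1 = relevant)."""
    if len(decisions) != len(relevant):
        raise ValueError("both sequences must have the same length")

    pairs = list(zip(decisions, relevant))
    tp = sum(1 for s, h in pairs if s == 1 and h == 1)  # relevant, accepted
    fp = sum(1 for s, h in pairs if s == 1 and h == 0)  # non-relevant, accepted
    fn = sum(1 for s, h in pairs if s == 0 and h == 1)  # relevant, rejected
    tn = sum(1 for s, h in pairs if s == 0 and h == 0)  # non-relevant, rejected

    # Precision: share of accepted documents that are relevant.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of relevant documents that were accepted.
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


result = effectiveness(decisions=[1, 1, 0, 0, 1], relevant=[1, 0, 0, 1, 1])
```

In this toy run two of the three accepted documents are relevant and two of the three relevant documents are accepted, so precision and recall are both 2/3.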

Page 13: Information filtering, By Hadi Mohammadzadeh


Content:

• Text Classification
– Definition and Terminology
– Text Representation
  • Text Normalization
  • Term Extraction
  • Dimensionality Reduction
  • Vector Generation
– Text Learning
  • Learning Algorithms
  • Ensembles of Classifiers

Page 14: Information filtering, By Hadi Mohammadzadeh


Definition and Terminology (TC)

• In this section we will:
1. Define the task of TC.
2. Show the relationship between TC and the task of IF.

• Our goal: to use machine learning techniques for automatic TC.

• We also examine:
1. How to represent textual documents (text representation) in a way that is appropriate as input for machine learning algorithms.
2. How to use these machine learning algorithms properly.

Page 15: Information filtering, By Hadi Mohammadzadeh


Definition and Terminology (TC), continued

• TC is a mapping $h : D \to C$ from the space of textual documents $D$ to a fixed set of classes $C = \{c_1, \dots, c_k\}$.

• So the task of TC is to classify documents into a fixed number of predefined classes.

• Information filtering then means the classification of documents as either relevant or non-relevant. Each document is assigned to exactly one class.

Page 16: Information filtering, By Hadi Mohammadzadeh


Definition and Terminology (TC), continued

• What is supervised learning (SL), and what is its task?
o SL is an ML technique for learning a function from training data.
o The task of SL is to predict the value of the function for any valid input object after having seen a number of training examples.

• So assume:
– a set of $n$ labeled training documents $\tilde{D} = \{d_1, \dots, d_n\} \subseteq D$,
– a fixed set of $k$ classes $C = \{c_1, \dots, c_k\}$,
– a target function $T : \tilde{D} \to C$ that assigns each training document to its true class label.

Now the objective of the text learning (TL) task is to induce a classifier $h : D \to C$.

Page 17: Information filtering, By Hadi Mohammadzadeh


Definition and Terminology (TC), continued

Summary of automatic TC so far

• The problem of automatically classifying documents falls into two phases:

1. Learning phase: apply a supervised learning algorithm (SLA) to construct a classifier.
I. Preprocess the text to prepare it as input for the machine learning algorithm (MLA).
II. Use a text learning algorithm (TLA).

2. Classification phase: use this classifier to predict the class label of new documents.
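The two phases can be illustrated with a very small classifier. The sketch below uses multinomial naive Bayes with add-one smoothing purely as an example learner; the slides do not prescribe this algorithm, and the class name, the whitespace tokenizer and the toy training set are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Very simple preprocessing: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Toy multinomial naive Bayes classifier (illustrative only)."""

    def fit(self, labeled_docs: List[Tuple[str, str]]) -> "NaiveBayesTextClassifier":
        # Learning phase: collect class and word statistics.
        self.class_counts = Counter(label for _, label in labeled_docs)
        self.word_counts = defaultdict(Counter)
        for text, label in labeled_docs:
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text: str) -> str:
        # Classification phase: pick the class with the highest log score.
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(self.class_counts[label] / n_docs)
            for word in tokenize(text):
                if word in self.vocabulary:  # ignore unseen words
                    # Add-one (Laplace) smoothing avoids log(0).
                    score += math.log((counts[word] + 1)
                                      / (total + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


training_docs = [
    ("cheap pills buy now", "non-relevant"),
    ("buy cheap watches", "non-relevant"),
    ("seminar on text classification", "relevant"),
    ("information filtering seminar", "relevant"),
]
classifier = NaiveBayesTextClassifier().fit(training_docs)
```

After the learning phase, `classifier.predict("text filtering seminar")` assigns the new document to the relevant class, which is exactly the two-class setting of information filtering.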

Page 18: Information filtering, By Hadi Mohammadzadeh


Text Representation (TC)

• Aim of text representation (TR): to transform a textual document into a format that is suitable as input for an MLA.

Definition: VSM (Vector Space Model)

Let $d \in D$ be a textual document. The representation of $d$ is the document vector $\rho(d) = (w_1, \dots, w_m)^T \in \mathbb{R}^m$, where each dimension corresponds to a distinct term in the document collection and $w_i$ denotes the weight of the $i$-th term. The set of these $m$ index terms, $v = \{t_1, \dots, t_m\}$, is referred to as the vocabulary.
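As a small worked example of the vector space model, assume a hypothetical three-term vocabulary and assume that raw term counts are used as weights (the definition itself leaves the weighting scheme open):

```latex
v = \{t_1, t_2, t_3\} = \{\text{filter}, \text{text}, \text{class}\}, \qquad
d = \text{``text filter text''}
\;\Longrightarrow\;
(w_1, w_2, w_3)^T = (1, 2, 0)^T \in \mathbb{R}^3
```

The term "filter" occurs once, "text" twice, and "class" not at all, so the third component is zero even though the term belongs to the vocabulary.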

Page 19: Information filtering, By Hadi Mohammadzadeh


Text Representation (TR), continued

[Diagram: the document representation function maps each document $d \in D$ of a specific document collection in the document space $D$ to its document vector (feature vector) $\rho(d) = (w_1, \dots, w_i, \dots, w_m)^T \in \mathbb{R}^m$. The dimensions are labeled with the index terms $IT_1, IT_2, \dots, IT_i, \dots, IT_m$.]

Each dimension corresponds to a distinct term. At this point we suppose that index terms are plain text.

Page 20: Information filtering, By Hadi Mohammadzadeh


Text Representation (TR), continued

Steps involved in transforming training documents and new documents into feature vectors:

1. Text Normalization (TN): transforms any type of document into a sequence of word tokens.

2. Term Extraction (TE): all distinct index terms of the training documents are merged to generate a set of candidate index terms, which may potentially be used as the vocabulary.

3. Dimensionality Reduction (DR): since the resulting set of index terms tends to be very large, the dimensionality reduction step aims at reducing the size of the vocabulary.

4. Vector Generation (VG): evaluates the weights of all index terms for any given document.

Page 21: Information filtering, By Hadi Mohammadzadeh


Text Representation (TC), continued

Page 22: Information filtering, By Hadi Mohammadzadeh


Text Representation (TC), continued

Text Normalization (TN)

Output: a sequence of normalized tokens.

There are two steps in the text normalization process:

1. Parse the textual components to produce a sequence of tokens.
2. Normalize the tokens, depending on the application and based on domain-specific knowledge:

• All letters are converted to lower case.
• Punctuation marks at the end of tokens are removed.
• Tokens that contain any non-alphanumeric character may be deleted.
• Even tokens containing numeric characters are often omitted.
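The normalization rules listed above can be sketched as a single Python function. This is an illustrative sketch: splitting on whitespace stands in for the parsing step, and the `drop_numeric` switch is a hypothetical parameter that reflects the fact that the last rule is applied "often", not always.

```python
import string
from typing import List


def normalize(text: str, drop_numeric: bool = True) -> List[str]:
    """Turn raw text into a sequence of normalized tokens."""
    tokens: List[str] = []
    for raw in text.split():                        # step 1: parse into tokens
        token = raw.lower()                         # convert to lower case
        token = token.rstrip(string.punctuation)    # strip trailing punctuation
        if not token:
            continue                                # token was punctuation only
        if not token.isalnum():
            continue                                # contains a non-alphanumeric char
        if drop_numeric and any(ch.isdigit() for ch in token):
            continue                                # contains a numeric character
        tokens.append(token)
    return tokens
```

For example, `normalize("Hello, World! e-mail 2nd test.")` keeps only `hello`, `world` and `test`: the hyphenated token and the token containing a digit are dropped.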

Page 23: Information filtering, By Hadi Mohammadzadeh


Text Representation (TC), continued

Term Extraction (TE)

Output: a sequence of index terms (IT) based on the tokens.

At this point we face one of two situations:

1. No vocabulary exists: the outputs of TE are merged to generate a set of distinct index terms, $\hat{v} = \{\hat{t}_1, \dots, \hat{t}_{\hat{m}}\}$.

2. A vocabulary exists: if an index term is not in the vocabulary, it is omitted. Otherwise, the term frequency vector (TFV) $tf_d = (tf_d(t_1), \dots, tf_d(t_m))^T$ is constructed for document $d$, where $tf_d(t)$ denotes the number of times term $t \in v$ appears in document $d$.
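Both situations of term extraction can be sketched in Python. The first function merges the tokens of all training documents into a candidate vocabulary; the second builds the term frequency vector of one document for a given vocabulary and silently omits terms outside it. The function names and the alphabetical ordering of the vocabulary are assumptions made for this illustration.

```python
from typing import Dict, Iterable, List, Sequence


def build_candidate_vocabulary(tokenized_docs: Iterable[Sequence[str]]) -> List[str]:
    """No vocabulary yet: merge all distinct index terms of the documents."""
    distinct = set()
    for tokens in tokenized_docs:
        distinct.update(tokens)
    return sorted(distinct)  # fixed order, so each term gets a stable dimension


def term_frequency_vector(tokens: Sequence[str],
                          vocabulary: Sequence[str]) -> List[int]:
    """Vocabulary exists: count how often each vocabulary term occurs."""
    index: Dict[str, int] = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokens:
        position = index.get(token)
        if position is not None:   # terms outside the vocabulary are omitted
            vector[position] += 1
    return vector


training_docs = [["text", "filter", "text"], ["class", "text"]]
vocabulary = build_candidate_vocabulary(training_docs)
new_doc_vector = term_frequency_vector(
    ["text", "filter", "text", "unknown"], vocabulary)
```

Here the candidate vocabulary is `class`, `filter`, `text`, and the new document is mapped to the counts 0, 1 and 2; the token `unknown` does not appear in the vector because it is not in the vocabulary.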

Page 24: Information filtering, By Hadi Mohammadzadeh


Text Representation (TC), continued

Dimensionality Reduction (DR)

• Curse of dimensionality: the problem of having too many features.

• When the DR step is applied: only when the vocabulary needs to be constructed, i.e. when the training documents are being processed.

• Objective: to reduce the number of features that are finally used to represent a document.

[Diagram: training data pass through the term extraction step, which yields the candidate vocabulary $\hat{v} = \{\hat{t}_1, \dots, \hat{t}_{\hat{m}}\}$. It is very large. After DR, the vocabulary is $v = \{t_1, \dots, t_m\}$, where $m \ll \hat{m}$.]
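One simple way to shrink a candidate vocabulary is to keep only terms that occur in at least a minimum number of training documents. The sketch below uses this document-frequency criterion as an assumed example; the slide does not say which reduction technique is meant, and `reduce_vocabulary`, `min_doc_freq` and `max_terms` are hypothetical names.

```python
from collections import Counter
from typing import Iterable, List, Optional, Sequence


def reduce_vocabulary(tokenized_docs: Iterable[Sequence[str]],
                      min_doc_freq: int = 2,
                      max_terms: Optional[int] = None) -> List[str]:
    """Select index terms by document frequency (illustrative criterion)."""
    doc_freq: Counter = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))   # count each term once per document

    # Drop rare terms, which are unlikely to help on new documents.
    kept = [term for term, df in doc_freq.items() if df >= min_doc_freq]

    # Prefer frequent terms; break ties alphabetically for reproducibility.
    kept.sort(key=lambda term: (-doc_freq[term], term))
    if max_terms is not None:
        kept = kept[:max_terms]
    return sorted(kept)


training_docs = [
    ["text", "filter", "text"],
    ["text", "class"],
    ["filter", "news"],
]
reduced = reduce_vocabulary(training_docs, min_doc_freq=2)
```

In this toy collection the four candidate terms are reduced to `filter` and `text`, because `class` and `news` each occur in only one document.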