6/11/20151 A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a...

04/18/23 1

A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model

Department of Computer Science

Brigham Young University

Quan Wang

November 2001


Overview

Probabilistic Retrieval Model

– Application ontology

– Document representations

– Ranking documents based on logistic regression analysis

Experimental Results


Application Ontology

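[Figure: car-ads application ontology. The object set Car is connected to Year, Price, Make, Model, Mileage, Feature, and PhoneNr. Edge labels in the diagram: 1:* on each of the seven relationships, plus the participation constraints 0:0.975:1, 0:0.8:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.925:1, and 0:0.45:1.]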


Document Representation

A set of <index term : term frequency> pairs A1:x1, …, An:xn.

A density heuristic value y.

A grouping heuristic value z.

Document d is represented as (x1, …, xn, y, z), written (V, y, z) with V = (x1, …, xn).
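The representation above can be sketched as a small data structure. This is only an illustration: the class name `DocumentRepresentation`, the helper `as_tuple`, the list of recognized terms, and the heuristic values 0.4 and 0.8 are all assumptions made up for the example, not values from the presentation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DocumentRepresentation:
    """Hypothetical container for (x1, ..., xn, y, z)."""
    term_frequencies: Dict[str, float]  # index term Ai -> term frequency xi
    density: float                      # density heuristic value y
    grouping: float                     # grouping heuristic value z

    def as_tuple(self, index_terms: List[str]) -> Tuple[float, ...]:
        """Return (x1, ..., xn, y, z), with the xi ordered by index_terms."""
        xs = [self.term_frequencies.get(term, 0.0) for term in index_terms]
        return (*xs, self.density, self.grouping)


# Index terms taken from the car-ads ontology; everything else is made up.
INDEX_TERMS = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]
recognized = ["Year", "Make", "Model", "Price", "Year", "Make", "PhoneNr"]

doc = DocumentRepresentation(dict(Counter(recognized)), density=0.4, grouping=0.8)
print(doc.as_tuple(INDEX_TERMS))
```

Terms that do not occur in the document get frequency 0, so every document maps to a tuple of the same length.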


Independence Assumption

Under the independence assumption, P(R|x1, …, xn, y, z) is obtained from the individual probabilities P(R|x1), …, P(R|xn), P(R|y), P(R|z).


Logistic Regression

[Figure: plot of P(R|x) against x, showing data points and the curve P(R|xi) for an index term value xi.]

P(R|x) = 1 / (1 + exp(-(C0 + C1 x))), ln(O(R|x)) = C0 + C1 x, where O(R|x) = P(R|x) / (1 - P(R|x)) is the odds of relevance.
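A minimal numeric sketch of the two formulas, assuming made-up coefficients C0 = -2.0 and C1 = 1.5 (the presentation does not list fitted values here). It shows that taking the log odds of the logistic probability recovers the linear term C0 + C1 x.

```python
import math


def relevance_probability(x: float, c0: float, c1: float) -> float:
    """P(R|x) = 1 / (1 + exp(-(c0 + c1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))


def log_odds(p: float) -> float:
    """ln(O) = ln(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))


C0, C1 = -2.0, 1.5  # hypothetical coefficients, for illustration only

for x in (0.0, 1.0, 2.0):
    p = relevance_probability(x, C0, C1)
    # The last two printed values agree up to floating-point error.
    print(x, round(p, 3), round(log_odds(p), 3), C0 + C1 * x)
```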


Probabilistic Retrieval Based on Logistic Regression Analysis

Data processing

Data analysis

Probabilistic retrieval on car-ads application ontology

Correlation relations


Data Processing

The corresponding normalized vector

V’ = (X1’, …….. Xn’) is computed as

V’ = |V| / |u|

V

where V is a document vector and u is an ontology vector.


Data Distributions

[Figure: data distribution plots]


Logistic Regression-1


Logistic Regression-2

[Figure: regression output, labeled with the regression coefficients and the p-values]


Statistical Information: P-Value

A p-value is a significance indicator.

A large p-value indicates either a bad regression model or a statistically insignificant index term.

We should keep only significant index terms.


Select Important Index Terms

Index term    P-value
Year          .679
Make          .002
Model         .074
Mileage       .002
Price         .001
Features      .001
PhoneNr       .034
Density       .052
Grouping      .012
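Selecting index terms by p-value can be sketched as a simple filter. The p-values below are the ones listed for the car-ads application ontology; the cutoffs 0.05 and 0.1 are assumptions chosen for illustration, since the slides do not state which significance level was used, and `select_significant` is a hypothetical helper name.

```python
# P-values as listed for the car-ads application ontology.
P_VALUES = {
    "Year": 0.679, "Make": 0.002, "Model": 0.074, "Mileage": 0.002,
    "Price": 0.001, "Features": 0.001, "PhoneNr": 0.034,
    "Density": 0.052, "Grouping": 0.012,
}


def select_significant(p_values: dict, alpha: float) -> list:
    """Keep the terms whose p-value does not exceed the cutoff alpha."""
    return [term for term, p in p_values.items() if p <= alpha]


# Hypothetical cutoffs: the stricter one drops more terms.
for alpha in (0.05, 0.1):
    print(alpha, select_significant(P_VALUES, alpha))
```

Under either cutoff Year (p = .679) is dropped; Model and Density survive only under the looser cutoff, so the choice of significance level matters for borderline terms.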

The car-ads application ontology

Double S-curve


Probabilistic Retrieval Model

ln(O(R|xi)), ln(O(R|y)), ln(O(R|z))

> 0: relevant

< 0: irrelevant
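One way to picture the sign-based decision is the sketch below. It assumes that the individual log odds are combined by simple summation and that each one is linear in its property value, as in the logistic regression formula; the slides do not spell out the combination rule, and the coefficients, property values, and the treatment of a total of exactly 0 as irrelevant are all made up for illustration.

```python
from typing import Dict, Tuple


def term_log_odds(x: float, c0: float, c1: float) -> float:
    """ln(O(R|x)) = c0 + c1 * x for one document property."""
    return c0 + c1 * x


def classify(values: Dict[str, float],
             coefficients: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """Sum the per-property log odds (an assumption) and test the sign."""
    total = sum(term_log_odds(values[name], c0, c1)
                for name, (c0, c1) in coefficients.items())
    return ("relevant" if total > 0 else "irrelevant"), total


# Hypothetical (C0, C1) pairs and document property values.
COEFFICIENTS = {"Make": (-1.0, 2.0), "Price": (-0.5, 1.5),
                "Density": (-2.0, 4.0), "Grouping": (-1.0, 1.0)}
doc_a = {"Make": 1.0, "Price": 1.0, "Density": 0.6, "Grouping": 0.9}
doc_b = {"Make": 0.0, "Price": 0.2, "Density": 0.1, "Grouping": 0.0}

print(classify(doc_a, COEFFICIENTS))
print(classify(doc_b, COEFFICIENTS))
```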


Correlation Relations

Correlation: some document properties are strongly positively correlated (e.g., Death Date and Birth Date in obituaries).

Correlations are extra information implicitly contained in a document.

Correlation relations handle “patterns”, e.g., the Birth Date-Death Date pair that appears in the obituaries application ontology.


Special Web Documents

Multiple-record Web documents

Similar content and format (e.g., items for sale)

Same lexical object values (e.g., Honda makes both cars and motorcycles)

Eight documents (motorcycle, boat, snowmobile, bicycle) for the car-ads application, and five documents (death notice, bibliography for famous people, find a graveyard, politician died young, famous people died in car accident) for the obituary application.


Experimental Results

             Car-ads    Obituary
Recall       100%       100%
Precision    83.3%*     83.3%
Accuracy     92.9%      92.0%

* Ten of the eighteen negative documents were specially selected.
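For reference, the three measures can be computed from confusion-matrix counts as sketched below. The counts used here (10 true positives, 2 false positives, 0 false negatives, 16 true negatives) are an assumption: they are merely one set of counts that fits eighteen negative documents and reproduces the reported car-ads percentages, and the slides do not state the actual counts.

```python
from typing import Tuple


def evaluate(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float]:
    """Return (recall, precision, accuracy) from confusion-matrix counts."""
    recall = tp / (tp + fn)          # share of relevant documents retrieved
    precision = tp / (tp + fp)       # share of retrieved documents that are relevant
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, accuracy


# Hypothetical counts, chosen only to be consistent with the car-ads column.
recall, precision, accuracy = evaluate(tp=10, fp=2, fn=0, tn=16)
print(f"recall={recall:.1%} precision={precision:.1%} accuracy={accuracy:.1%}")
```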


Conclusions

We propose a probabilistic model that is suitable for classifying multiple-record Web documents.

On a randomly chosen test document set, the model could perform better than the results presented in the thesis.
