Use of Kolmogorov distance identification of web page authorship, topic and domain

David Parry

Auckland University of Technology

New Zealand

Dave.parry@aut.ac.nz

Overview

• Problem Statement

• Kolmogorov distance

• Experimental methods

• Results

• Clustering

• Conclusions

Problem statement

• It is often desirable for information retrieval systems to calculate a measure of similarity between documents.

• Similarity measures generally rely on some sort of parsing, or understanding of documents, but effective parsing often depends on detailed knowledge of document structure.

General-purpose similarity

• Acts on any string of data points.

• Useful for:
  – Clustering
  – Verification
  – Filtering
  – Motif analysis
  – Exception detection

Use of the “zip” technique

• In 2002 Benedetto, Caglioti, & Loreto used the “Zip” compression algorithm to identify the language of documents.

• The technique involved concatenating a file in a known language with a file in an unknown language and comparing the lengths of the resulting zipped files.

• The shortest concatenated zip file occurred when the known file was written in the same language as the unknown file.
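
A minimal Python sketch of this idea follows. It is an illustration under stated assumptions, not the code used by Benedetto et al.: `zlib` stands in for "zip", the names `compressed_len` and `guess_language` are hypothetical, and the score is the growth in compressed size over the reference alone (rather than the raw concatenated length) so that references of different sizes can be compared.

```python
import zlib
from typing import Mapping


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a stand-in for "zip")."""
    return len(zlib.compress(data, 9))


def guess_language(unknown: bytes, references: Mapping[str, bytes]) -> str:
    """Return the label of the reference that `unknown` compresses best against.

    Assumption: we score each reference by how many extra compressed bytes
    are needed when `unknown` is appended to it.
    """
    def extra_bytes(label: str) -> int:
        ref = references[label]
        return compressed_len(ref + unknown) - compressed_len(ref)

    return min(references, key=extra_bytes)


# Hypothetical usage: in practice the references would be much longer texts.
references = {
    "english": b"the rain fell on the roof of the old house while the dog lay by the door",
    "italian": b"la pioggia cade sul tetto della vecchia casa mentre il cane resta alla porta",
}
print(guess_language(b"the dog lay on the roof of the house", references))
```

DEFLATE only looks back over a 32 KB window, so with this stand-in compressor a reference much longer than that adds little beyond its final 32 KB.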

Extensions to this technique

• This approach was also used for author confirmation.

• A hierarchical clustering algorithm was used to construct language trees.

Kolmogorov Distance

Li, Chen, Li, Ma, & Vitányi, 2003: Let C(A|B) be the compressed size of A using the compression dictionary built while compressing B, and vice versa for C(B|A). Let C(A) and C(B) be the compressed lengths of A and B using their own compression dictionaries. The Kolmogorov distance between A and B, D(A,B), is given by:

\[
D(A,B) = \frac{C(A|B) + C(B|A)}{C(A) + C(B)}
\]

Modified approach

• Obtain the two files, file1 and file2.

• Concatenate them in two ways: file1 + file2 = file12 and file2 + file1 = file21.

• Calculate the compressed length of:
  – file1 as zip1
  – file2 as zip2
  – file12 as zip12
  – file21 as zip21

The Kolmogorov distance (D) is then given by:

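\[
D(\mathrm{file}_1, \mathrm{file}_2) = \frac{(\mathrm{zip}_{12} - \mathrm{zip}_{1}) + (\mathrm{zip}_{21} - \mathrm{zip}_{2})}{\mathrm{zip}_{1} + \mathrm{zip}_{2}}
\]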
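
The modified approach can be sketched in Python as below. This is a hedged illustration rather than the author's implementation: `zlib` stands in for the zip tool, the function names are hypothetical, and the formula is read as the extra compressed bytes each file needs when it follows the other, divided by the sum of the two individual compressed lengths.

```python
import zlib


def zip_len(data: bytes) -> int:
    """Compressed length in bytes; zlib is an assumed stand-in for "zip"."""
    return len(zlib.compress(data, 9))


def kolmogorov_distance(file1: bytes, file2: bytes) -> float:
    """Modified Kolmogorov distance between two byte strings."""
    zip1 = zip_len(file1)
    zip2 = zip_len(file2)
    zip12 = zip_len(file1 + file2)  # file1 followed by file2
    zip21 = zip_len(file2 + file1)  # file2 followed by file1
    # (zip12 - zip1) approximates the cost of file2 given file1, and
    # (zip21 - zip2) the cost of file1 given file2.
    return ((zip12 - zip1) + (zip21 - zip2)) / (zip1 + zip2)
```

Written this way the measure is symmetric in its two arguments, and it needs only four calls to the compressor per pair of files, with no access to the compressor's internal dictionary.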

Experiments

• Author Identification from an online discussion board

• Domain detection from sets of WWW pages

• Topic detection from a collection of related WWW pages.

Methods

• Load files from the WWW.

• Compare the test file with 10 others, one of which is {by the same author, from the same domain, on the same topic}.

• Use the modified Kolmogorov distance algorithm.

• Select the combination with the shortest distance.
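
The selection step can be sketched as follows. All names here (`zip_distance`, `closest_candidate`, `hit_rate`) are hypothetical, `zlib` is an assumed stand-in for the zip tool, and file loading is left out; the sketch only shows how one test file is matched against its candidates and how the share of correct matches might be tallied.

```python
import zlib
from typing import Callable, Sequence


def zip_distance(a: bytes, b: bytes) -> float:
    """Compression-based distance; zlib stands in for the "zip" step."""
    def size(data: bytes) -> int:
        return len(zlib.compress(data, 9))

    return (size(a + b) - size(a) + size(b + a) - size(b)) / (size(a) + size(b))


def closest_candidate(
    test_file: bytes,
    candidates: Sequence[bytes],
    distance: Callable[[bytes, bytes], float] = zip_distance,
) -> int:
    """Index of the candidate with the shortest distance to test_file."""
    return min(range(len(candidates)), key=lambda i: distance(test_file, candidates[i]))


def hit_rate(trials: Sequence[tuple]) -> float:
    """Fraction of trials in which the related candidate was the closest.

    Each trial is (test_file, candidates, index_of_related_candidate).
    """
    hits = sum(closest_candidate(t, c) == k for t, c, k in trials)
    return hits / len(trials)
```

With ten candidates per trial, of which one is related, a method that picks at random would be expected to score a hit rate of about 0.10.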

Analysis

• Chi-squared used to analyse the results.

• Not really an IR system, as the number of documents “retrieved” always = 1, from 10.

• Precision can be related to the percentage of times when the lowest Kolmogorov distance is found for the desired outcome.

Results – Authorship

Status             Percent shortest KD   Percent in sample
Author1<>Author2   51.88%                90%
Author1=Author2    48.13%                10%

 

Using Chi-Squared, this result is significant at the p<0.001 level (SPSS 11): χ²(1, N=160) = 258, p<0.001.

160 initial documents, 1600 total.
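
As a rough check, the reported statistic can be reproduced with a one-way chi-squared goodness-of-fit calculation. The sketch below is an assumption about how the test was set up, not a record of the SPSS run: the counts 77 and 83 are inferred from 48.13% and 51.88% of 160 trials, and the expected proportions 10% and 90% come from having one related file among the ten candidates. The helper name `chi_squared_gof` is hypothetical.

```python
import math


def chi_squared_gof(observed, expected_props):
    """Pearson chi-squared statistic for observed counts vs expected proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_props))


# Counts inferred from the slide: 48.13% of 160 trials is 77 same-author hits.
observed = [77, 83]            # [same author closest, different author closest]
expected_props = [0.10, 0.90]  # one of the ten candidates shares the author

chi2 = chi_squared_gof(observed, expected_props)
# Survival function of the chi-squared distribution; this shortcut is valid for df = 1 only.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 1))  # 258.4
print(p_value < 0.001)  # True
```

The same calculation applies to the domain and topic experiments by substituting their counts.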

Web domains sampled

Domain Name   Number of Pages   Average File Length
AUT           2192              58518
OBGYN         203               25937
Microsoft     442               882771
Hon           19                21600
Apple         588               37319
Guardian      234               38326
Total         3678              177411.8

Results – Web domain

Status             Percent lowest KD   Percent in sample
Different Domain   18.75%              90%
Same Domain        81.25%              10%

Using Chi-Squared, this result is significant at the p<0.001 level: χ²(1, N=80) = 451, p<0.001.

80 seed files, from 6 domains

Results – Topics

Source                   Occurrences with shortest distance   Percent in sample
Different topic domain   17.89%                               90%
Same topic domain        82.11%                               10%

χ²(1, N=665) = 3839, p<0.001

Conclusions

• The modified Kolmogorov distance algorithm is capable of identifying related documents more often than chance.

• This distance measure does not rely on parsing or semantic analysis.

• This method may have application as part of an IR system.