
On Compression-Based Text Classification

Yuval Marton1, Ning Wu2 and Lisa Hellerstein2

1) University of Maryland and 2) Polytechnic University

ECIR-05 Santiago de Compostela, Spain. March 2005


Compression for Text Classification??

Proposed over the last ~10 years; why it works is not well understood.

"Compression is stupid! Slow! Non-standard!"

But using off-the-shelf compression tools is easy…

Does it work? (Controversy. A mess.)


Overview

What is text classification? (problem setting)

Compression-based text classification: classification procedures (+ do it yourself!); compression methods (RAR, LZW, and gzip)

Experimental evaluation

Why?? (compression as a character-based method); influence of sub-/super-/non-word features

Conclusions and future work


Text Classification

Given: a training corpus (labeled documents).

Goal: learn how to label new (test) documents.

Our setting:

Single-class: document belongs to exactly one class.

3 topic classification and 3 authorship attribution tasks.


Classification by Compression

Compression programs build a model or dictionary of their input (language modeling).

Better model → better compression

Idea:

Compress a document using different class models.

Label with class achieving highest compression rate.

Minimum Description Length (MDL) principle: select the model minimizing the total length of model + data.


Standard MDL (Teahan & Harper)

Classes 1…n, each with training docs D1, D2, D3; test doc T.

Concat. training data → Ai

Compress Ai → model Mi

Compress T using each Mi

Assign T to its best compressor

… and the winner is… FREEZE! FREEZE!
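The talk uses off-the-shelf compressors (RAR, LZW, gzip); as a rough, hypothetical analogue of "compress T using class model Mi", the sketch below abuses zlib's preset-dictionary feature (zdict) to stand in for a trained class model. All class names and training strings here are made up for illustration, and zdict only sees the last 32K of the class data, so this is a toy approximation of the procedure, not the authors' setup.

```python
import zlib

def model_cost(text: bytes, class_model: bytes) -> int:
    # Use the tail of the class training data as a preset dictionary:
    # a crude stand-in for the compressor's learned model Mi.
    c = zlib.compressobj(level=9, zdict=class_model[-32768:])  # zdict capped at the 32K window
    return len(c.compress(text) + c.flush())

def classify(test_doc: bytes, class_data: dict) -> str:
    # Assign T to the class whose "model" compresses it into the fewest bytes.
    return min(class_data, key=lambda c: model_cost(test_doc, class_data[c]))

training = {  # hypothetical two-class corpus
    "weather": b"rain sun cloud storm wind rain sun cloud " * 50,
    "finance": b"stock bond market price trade stock bond " * 50,
}
print(classify(b"storm wind rain cloud sun", training))  # → weather
```

With a real PPM-style compressor the model covers the whole training corpus; the DEFLATE window caps it here, which is precisely the gzip limitation discussed later.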


Do it yourself

Five minutes on how to classify text documents (e.g., according to their topic or author) using only off-the-shelf compression tools (such as WinZip or RAR)…


AMDL (Khmelev / Kukushkina et al. 2001)

Classes 1…n, each with training docs D1, D2, D3; test doc T.

Concat. training data → Ai

Concat. Ai and T → AiT

Compress each Ai and AiT

Subtract compressed file sizes: vi = |AiT| − |Ai|

Assign T to the class i with minimum vi
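Under AMDL the only primitive needed is "compressed size of a byte string", so the whole procedure fits in a few lines. A minimal sketch using Python's zlib (a DEFLATE implementation, the same family as gzip); class names and training text are illustrative only, and zlib's 32K window means only the tail of a large Ai effectively matters, exactly the gzip weakness noted in the results.

```python
import zlib

def csize(data: bytes) -> int:
    # Compressed size in bytes at maximum compression level.
    return len(zlib.compress(data, 9))

def amdl_classify(test_doc: bytes, class_data: dict) -> str:
    # v_i = |compress(Ai + T)| - |compress(Ai)|: the extra bytes needed
    # to encode T given class i's concatenated training data Ai.
    return min(class_data,
               key=lambda c: csize(class_data[c] + test_doc) - csize(class_data[c]))

training = {  # hypothetical authorship corpus
    "hamilton": b"the federal government energetic executive " * 40,
    "madison": b"the state legislature republican principle " * 40,
}
print(amdl_classify(b"energetic executive federal", training))  # → hamilton
```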


BCN (Benedetto et al. 2002)

Class 1: D1 D2 D3; Class 2: D4 D5 D6; … Class n: D7 D8 D9; test doc T.

Like AMDL, but concat. each doc Dj with T → DjT

Compress each Dj and DjT

Subtract compressed file sizes: vDT = |DjT| − |Dj|

Assign T to the class i of the doc Dj with minimum vDT
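BCN differs from AMDL only in computing the size difference per training document and then taking the single nearest document (1-NN) rather than per-class concatenations. A sketch under the same zlib stand-in assumptions as above; labels and documents are made up for illustration.

```python
import zlib

def csize(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def bcn_classify(test_doc: bytes, docs: list) -> str:
    # docs: list of (label, document) pairs. Per-document distance
    # vDT = |compress(Dj + T)| - |compress(Dj)|; pick the nearest doc (1-NN)
    # and return its class label.
    best_label, _ = min(docs, key=lambda ld: csize(ld[1] + test_doc) - csize(ld[1]))
    return best_label

docs = [  # hypothetical per-document training set
    ("weather", b"rain sun cloud storm wind " * 20),
    ("finance", b"stock bond market price trade " * 20),
]
print(bcn_classify(b"storm wind rain", docs))  # → weather
```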


Compression Methods

Gzip: Lempel-Ziv compression (LZ77). "Dictionary"-based; sliding window, typically 32K.

LZW (Lempel-Ziv-Welch): dictionary-based (16-bit codes); dictionary fills up on big corpora (typically after ~300KB).

RAR (proprietary shareware): PPMII variant on text; Markov model over n-gram frequencies; (almost) unlimited context.
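To see why LZW "fills up", here is a minimal LZW encoder sketch (codes emitted as plain integers rather than packed 16-bit words, so it is illustrative, not a production codec): once the table reaches its code limit, no new phrases are learned, and later text stops improving the model.

```python
def lzw_encode(data: bytes, max_codes: int = 1 << 16) -> list:
    # Start with single-byte entries; grow longer phrases until the table
    # is full, after which no new patterns are learned (the ~300KB saturation).
    table = {bytes([i]): i for i in range(256)}
    out, phrase = [], b""
    for b in data:
        candidate = phrase + bytes([b])
        if candidate in table:
            phrase = candidate          # extend the current phrase
        else:
            out.append(table[phrase])   # emit code for the longest known phrase
            if len(table) < max_codes:  # the 16-bit dictionary fills up here
                table[candidate] = len(table)
            phrase = bytes([b])
    if phrase:
        out.append(table[phrase])
    return out

print(lzw_encode(b"ababab"))  # → [97, 98, 256, 256]
```

Repetitive input quickly reuses learned phrase codes (256 = "ab" above); a saturated table simply stops adapting to new class vocabulary.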


Previous Work

Khmelev et al. (with Kukushkina): Russian authors.

Thaper: LZ78, char- and word-based PPM.

Frank et al.: compression (PPM) bad for topic.

Teahan and Harper: compression (PPM) good.

Benedetto et al.: gzip good for authors.

Goodman: gzip bad!

Khmelev and Teahan: RAR (PPM).

Peng et al.: Markov language models.


Compression Good or Bad?

Scoring: we measured accuracy:

Accuracy = (total # correct classifications) / (total # tests)

(Micro-averaged accuracy.)

Why? Single-class labels, no tuning parameters.
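Micro-averaged accuracy pools every test document into a single count (rather than averaging per-class accuracies); as a one-liner:

```python
def micro_accuracy(results):
    # results: (predicted, true) label pairs pooled over all test docs.
    correct = sum(p == t for p, t in results)
    return correct / len(results)

print(micro_accuracy([("a", "a"), ("b", "a"), ("a", "a"), ("b", "b")]))  # → 0.75
```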


AMDL Results

Corpus             RAR          LZW    GZIP
Author
  Federalist (2)   0.94         0.83   0.67
  Gutenberg-10     0.82         0.65   0.62
  Reuters-9        0.78         0.66   0.79
Topic
  Reuters-10       0.87         0.84   0.83
  10news (20news)  0.96 (0.90)  0.66   0.56 (0.47)
  Sector (105)     0.90         0.61   0.19


RAR is a Star!

RAR is the best-performing method on all but the small Reuters-9 corpus.

Poor performance of gzip on large corpora is due to its 32K sliding window.

Poor performance of LZW: its dictionary fills up after ~300KB (among other reasons).


RAR on Standard Corpora - Comparison

90.5% for RAR on 20news, vs.:

- 89.2% Language Modeling (Peng et al. 2004)
- 86.2% Extended NB (Rennie et al. 2003)
- 82.1% PPMC (Teahan and Harper 2001)

89.6% for RAR on Sector, vs.:

- 93.6% SVM (Zhang and Oles 2001)
- 92.3% Extended NB (Rennie et al. 2003)
- 64.5% Multinomial NB (Ghani 2001)


AMDL vs. BCN

Gzip / BCN: good, due to processing each doc separately with T (1-NN).

Gzip / AMDL: bad.

BCN was slow, probably due to more system calls and disk I/O.

Method       AMDL          BCN
Corpus       RAR    gzip   RAR    gzip
Federalist   0.94   0.67   0.78   0.78
Guten-10     0.82   0.62   0.75   0.72
Reuters-9    0.78   0.79   0.77   0.77


Why Good?!

Compression tools are character-based. (Stupid, remember?)

Better than word-based? WHY? Can they capture sub-word, word, super-word, and non-word features?


Pre-processing

STD: no change to input.

NoP: remove punctuation; replace whitespace (tab, line, paragraph & page breaks) with spaces.

WOS: NoP + word-order scrambling.

RSW: NoP + random-string words.

…and more…

Example: "the more – the better!"
NoP: the more the better
WOS: better the the more
RSW: dqf tmdw dqf lkwe
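A sketch of the three transforms. The tokenization details and the RSW mapping are assumptions: here each distinct word maps consistently to a random lowercase string of the same length, which need not match the paper's exact scheme.

```python
import random
import re
import string

def nop(text: str) -> str:
    # NoP: drop punctuation, collapse all whitespace runs to single spaces.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def wos(text: str, seed: int = 0) -> str:
    # WOS: NoP + scramble word order (keeps sub-word/word info,
    # destroys super-word relations).
    words = nop(text).split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def rsw(text: str, seed: int = 0) -> str:
    # RSW: NoP + replace each distinct word with a consistent random string
    # (keeps word identity and boundaries, destroys sub-word info).
    rng = random.Random(seed)
    mapping, out = {}, []
    for w in nop(text).split():
        if w not in mapping:
            mapping[w] = "".join(rng.choices(string.ascii_lowercase, k=len(w)))
        out.append(mapping[w])
    return " ".join(out)

print(nop("the more -- the better!"))  # → the more the better
print(rsw("the more the better"))      # same random token for both "the"s
```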


Non-words: Punctuation

Intuition:

punctuation usage is characteristic of writing style (authorship attribution).

Results:

Accuracy remained the same, or even increased, in many cases.

RAR is insensitive to punctuation removal.


Super-words: Word-Order Scrambling (WOS)

WOS removes punctuation and scrambles word order: it leaves sub-word and word info intact but destroys super-word relations.

RAR: accuracy declined on all but one corpus; RAR seems to exploit word sequences (n-grams?). An advantage over state-of-the-art bag-of-words methods (e.g., SVM).

LZW & gzip: no consistent accuracy decline.


Summary

Compared the effectiveness of compression for text classification (compression methods × classification procedures).

RAR (PPM) is a star – under AMDL.
- BCN (1-NN) is slow(er) and never better in accuracy.
- Compression is good (Teahan and Harper).
- Character-based Markov models are good (Peng et al.).

Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploits sub-/super-/non-word features.
- RAR benefits from super-word info.
- This suggests word-based methods might benefit from it too.


Future Research

Test / confirm results on more and bigger corpora.

Compare to state-of-the-art techniques:
- Other compression / character-based methods.
- SVM.
- Word-based n-gram language modeling (Peng et al.).
- Word-based compression?

Use standard MDL (Teahan and Harper): faster, better insight.

Sensitivity to class training-data imbalance: when is throwing away data desirable for compression?


Thank you!