On Compression-Based Text Classification
Yuval Marton1, Ning Wu2 and Lisa Hellerstein2
1) University of Maryland and 2) Polytechnic University
ECIR-05 Santiago de Compostela, Spain. March 2005
Compression for Text Classification??
Proposed over the last ~10 years; not well understood why it works.
Compression is stupid! Slow! Non-standard!
But using compression tools is easy…
Does it work? (Controversy; the evidence is a mess.)
Overview
- What's text classification (problem setting)
- Compression-based text classification: classification procedures (+ do it yourself!), compression methods (RAR, LZW, and gzip)
- Experimental evaluation
- Why?? (compression as a character-based method); influence of sub/super/non-word features
- Conclusions and future work
Text Classification
Given a training corpus (labeled documents), learn how to label new (test) documents.
Our setting:
Single-class: document belongs to exactly one class.
3 topic classification and 3 authorship attribution tasks.
Classification by Compression
Compression programs build a model or dictionary of their input (language modeling).
Better model → better compression.
Idea:
- Compress a document using different class models.
- Label with the class achieving the highest compression rate.
Minimum Description Length (MDL) principle: select the model minimizing total length of model + data.
Standard MDL (Teahan & Harper)
For each class i with training documents D1, D2, D3, …:
- Concatenate the training data into Ai.
- Compress Ai to build a model Mi.
- Compress the test document T using each model Mi.
- Assign T to the class of its best compressor.
… and the winner is… FREEZE! FREEZE!
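The standard-MDL procedure above can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: each class model Mi is approximated by a zlib preset dictionary built from the concatenated class text Ai (zlib only consults the last 32 KB of it), and all names here are invented for the example.

```python
import zlib

def mdl_classify(train, test_doc):
    """Assign test_doc to the class whose model compresses it best.

    train: dict mapping class label -> list of training documents (str).
    Each class model M_i is approximated by a zlib preset dictionary
    built from the concatenated class text A_i.
    """
    best_label, best_size = None, float("inf")
    for label, docs in train.items():
        a_i = " ".join(docs).encode("utf-8")            # concatenated training data A_i
        comp = zlib.compressobj(9, zdict=a_i[-32768:])  # "model" M_i (last 32 KB)
        size = len(comp.compress(test_doc.encode("utf-8")) + comp.flush())
        if size < best_size:
            best_label, best_size = label, size
    return best_label
```

A preset dictionary is only a crude stand-in for a real adaptive model such as PPM, but it captures the MDL idea: text resembling the class's training data costs fewer bits.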
Do it yourself
Five minutes on how to classify text documents (e.g., by topic or author) using only off-the-shelf compression tools (such as WinZip or RAR)…
AMDL (Khmelev / Kukushkina et al. 2001)
For each class i with training documents D1, D2, D3, …:
- Concatenate the training data into Ai.
- Concatenate Ai and the test document T, giving AiT.
- Compress each Ai and each AiT.
- Subtract compressed file sizes: vi = |AiT| - |Ai|.
- Assign T to the class i with minimum vi.
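AMDL needs nothing beyond an off-the-shelf compressor. A minimal sketch using Python's gzip module (the paper used external tools; the function names here are illustrative):

```python
import gzip

def gzip_size(data: bytes) -> int:
    # Compressed length in bytes under gzip (DEFLATE) at maximum level.
    return len(gzip.compress(data, compresslevel=9))

def amdl_classify(train, test_doc):
    """AMDL: v_i = |A_iT| - |A_i|; assign T to the class minimizing v_i.

    train: dict mapping class label -> list of training documents (str).
    """
    t = test_doc.encode("utf-8")
    best_label, best_v = None, float("inf")
    for label, docs in train.items():
        a_i = " ".join(docs).encode("utf-8")  # concatenated training data A_i
        v_i = gzip_size(a_i + t) - gzip_size(a_i)
        if v_i < best_v:
            best_label, best_v = label, v_i
    return best_label
```

The gzip header overhead cancels in the subtraction, so vi approximates the extra bits needed to encode T given the class's training data.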
BCN (Benedetto et al. 2002)
Like AMDL, but concatenate each individual training document Dj with T:
- For every document Dj in every class, form DjT.
- Compress each Dj and each DjT.
- Subtract compressed file sizes: vDjT = |DjT| - |Dj|.
- Assign T to the class of the document Dj with minimum vDjT.
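BCN differs only in scoring every training document individually (a 1-nearest-neighbor flavor). A hypothetical sketch, again with gzip standing in for the compressors compared in the paper:

```python
import gzip

def bcn_classify(train, test_doc):
    """BCN: for each training document D_j, v = |D_jT| - |D_j|;
    assign T to the class of the single best-matching document.

    train: dict mapping class label -> list of training documents (str).
    """
    t = test_doc.encode("utf-8")
    best_label, best_v = None, float("inf")
    for label, docs in train.items():
        for doc in docs:
            d_j = doc.encode("utf-8")
            v = len(gzip.compress(d_j + t)) - len(gzip.compress(d_j))
            if v < best_v:
                best_label, best_v = label, v
    return best_label
```

Note the cost: one compressor run per training document per test, which is where the slowness reported below comes from.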
Compression Methods
- Gzip: Lempel-Ziv compression (LZ77). "Dictionary"-based; sliding window typically 32K.
- LZW (Lempel-Ziv-Welch): dictionary-based (16-bit); dictionary fills up on big corpora (typically after ~300KB).
- RAR (proprietary shareware): PPMII variant on text; Markov model over n-gram frequencies; (almost) unlimited memory.
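Gzip's 32K window limit is easy to observe directly: an exact repeat that begins more than 32 KB earlier cannot be matched. A small experiment with synthetic random data (illustrative only):

```python
import random
import zlib

random.seed(0)

def repeat_ratio(n):
    # Incompressible block of n random bytes, followed by an exact repeat.
    block = bytes(random.randrange(256) for _ in range(n))
    return len(zlib.compress(block + block, 9)) / len(zlib.compress(block, 9))

small = repeat_ratio(10_000)   # repeat starts 10 KB back: inside the 32 KB window
large = repeat_ratio(100_000)  # repeat starts 100 KB back: outside the window
# small stays near 1.0 (the repeat is found); large is near 2.0 (it is not)
```

This is the mechanism behind gzip's poor AMDL results on large corpora: once |Ai| far exceeds 32 KB, most of the training data is invisible when T is compressed.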
Previous Work
- Khmelev (and Kukushkina et al.): Russian authors.
- Thaper: LZ78, char- and word-based PPM.
- Frank et al.: compression (PPM) bad for topic. Teahan and Harper: compression (PPM) good.
- Benedetto et al.: gzip good for authors. Goodman: gzip bad!
- Khmelev and Teahan: RAR (PPM).
- Peng et al.: Markov language models.
Compression Good or Bad?
Scoring: we measured accuracy:
Accuracy = (total # correct classifications) / (total # tests)
(micro-averaged accuracy)
Why? Single-class labels, no tuning parameters.
AMDL Results

          Corpus            RAR          LZW    GZIP
  Author  Federalist (2)    0.94         0.83   0.67
          Gutenberg-10      0.82         0.65   0.62
          Reuters-9         0.78         0.66   0.79
  Topic   Reuters-10        0.87         0.84   0.83
          10news (20news)   0.96 (0.90)  0.66   0.56 (0.47)
          Sector (105)      0.90         0.61   0.19
RAR is a Star!
- RAR is the best-performing method on all but the small Reuters-9 corpus.
- Gzip's poor performance on large corpora is due to its 32K sliding window.
- LZW's poor performance: its dictionary fills up after ~300KB (among other reasons).
RAR on Standard Corpora - Comparison
90.5% for RAR on 20news:
- 89.2% Language Modeling (Peng et al. 2004)
- 86.2% Extended NB (Rennie et al. 2003)
- 82.1% PPMC (Teahan and Harper 2001)
89.6% for RAR on Sector:
- 93.6% SVM (Zhang and Oles 2001)
- 92.3% Extended NB (Rennie et al. 2003)
- 64.5% Multinomial NB (Ghani 2001)
AMDL vs. BCN
- Gzip / BCN good: due to processing each document separately with T (1-NN style).
- Gzip / AMDL bad.
- BCN was slow, probably due to more system calls and disk I/O.
            AMDL          BCN
Corpus      RAR    gzip   RAR    gzip
Federalist  0.94   0.67   0.78   0.78
Guten-10    0.82   0.62   0.75   0.72
Reuters-9   0.78   0.79   0.77   0.77
Why Good?!
Compression tools are character-based. (Stupid, remember?)
Better than word-based? WHY? Can they capture:
- sub-word
- word
- super-word
- non-word features?
Pre-processing
- STD: no change to input. ("the more – the better!")
- NoP: remove punctuation; replace white space (tab, line, paragraph & page breaks) with spaces. ("the more the better")
- WOS: NoP + word order scrambling. ("better the the more")
- RSW: NoP + random-string words. ("dqf tmdw dqf lkwe")
…and more…
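These transformations are simple to reproduce. A sketch of hypothetical NoP / WOS / RSW implementations (the exact rules in the paper may differ, e.g. in how punctuation is defined):

```python
import random
import re
import string

def nop(text):
    """NoP: strip punctuation, map all white space to single spaces."""
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def wos(text, seed=0):
    """WOS: NoP plus word-order scrambling (destroys super-word info)."""
    words = nop(text).split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def rsw(text, seed=0):
    """RSW: NoP, then replace each distinct word with a random string of
    the same length, consistently (word identity kept, sub-word info lost)."""
    rng = random.Random(seed)
    mapping, out = {}, []
    for w in nop(text).split():
        if w not in mapping:
            mapping[w] = "".join(rng.choice(string.ascii_lowercase) for _ in w)
        out.append(mapping[w])
    return " ".join(out)
```

Each transform removes one kind of feature while preserving the others, which is what lets the experiments isolate which feature types a compressor exploits.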
Non-words: Punctuation
Intuition:
punctuation usage is characteristic of writing style (authorship attribution).
Results:
Accuracy remained the same, or even increased, in many cases.
RAR is insensitive to punctuation removal.
Super-words: Word Order Scrambling (WOS)
WOS removes punctuation and scrambles word order.
It leaves sub-word and word info intact but destroys super-word relations.
RAR: accuracy declined on all but one corpus; RAR seems to exploit word sequences (n-grams?). An advantage over state-of-the-art bag-of-words methods such as SVM.
LZW & gzip: no consistent accuracy decline.
Summary
- Compared effectiveness of compression for text classification (compression methods × classification procedures).
- RAR (PPM) is a star, under AMDL.
  - BCN (1-NN) is slow(er) and never better in accuracy.
  - Compression good (Teahan and Harper).
  - Character-based Markov models good (Peng et al.).
- Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploits sub/super/non-word features.
  - RAR benefits from super-word info.
  - Suggests word-based methods might benefit from it too.
Future Research
- Test / confirm results on more and bigger corpora.
- Compare to state-of-the-art techniques:
  - Other compression / character-based methods.
  - SVM.
  - Word-based n-gram language modeling (Peng et al.).
  - Word-based compression?
- Use standard MDL (Teahan and Harper): faster, better insight.
- Sensitivity to class training-data imbalance: when is throwing away data desirable for compression?
Thank you!