On Compression-Based Text Classification
Yuval Marton1, Ning Wu2 and Lisa Hellerstein2
1) University of Maryland and 2) Polytechnic University
ECIR-05 Santiago de Compostela, Spain. March 2005
Compression for Text Classification??
Proposed over the last ~10 years; not well understood why it works.
Compression is stupid! Slow! Non-standard!
But using compression tools is easy…
Does it work? (Controversy; the evidence is a mess.)
Overview
- What's text classification (problem setting)
- Compression-based text classification: classification procedures (+ do it yourself!), compression methods (RAR, LZW, and gzip)
- Experimental evaluation
- Why?? (compression as a character-based method); influence of sub/super/non-word features
- Conclusions and future work
Text Classification
Given a training corpus (labeled documents), learn how to label new (test) documents.
Our setting:
Single-class: document belongs to exactly one class.
3 topic classification and 3 authorship attribution tasks.
Classification by Compression
Compression programs build a model or dictionary of their input (language modeling).
Better model → better compression.
Idea:
- Compress a document using different class models.
- Label with the class achieving the highest compression rate.
Minimum Description Length (MDL) principle: select the model minimizing total length of model + data.
Standard MDL (Teahan & Harper)
For each class i with training documents D1, D2, D3, …:
- Concatenate the training data into Ai.
- Compress Ai to build a model Mi.
- Compress the test document T using each model Mi.
- Assign T to the class of its best compressor.
… and the winner is… FREEZE! FREEZE!
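The standard-MDL procedure above can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: each class model Mi is approximated by a zlib preset dictionary built from the concatenated class text Ai (zlib only consults the last 32 KB of it), and all names here are invented for the example.

```python
import zlib

def mdl_classify(train, test_doc):
    """Assign test_doc to the class whose model compresses it best.

    train: dict mapping class label -> list of training documents (str).
    Each class model M_i is approximated by a zlib preset dictionary
    built from the concatenated class text A_i.
    """
    best_label, best_size = None, float("inf")
    for label, docs in train.items():
        a_i = " ".join(docs).encode("utf-8")            # concatenated training data A_i
        comp = zlib.compressobj(9, zdict=a_i[-32768:])  # "model" M_i (last 32 KB)
        size = len(comp.compress(test_doc.encode("utf-8")) + comp.flush())
        if size < best_size:
            best_label, best_size = label, size
    return best_label
```

A preset dictionary is only a crude stand-in for a real adaptive model such as PPM, but it captures the MDL idea: text resembling the class's training data costs fewer bits.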
Do it yourself
Five minutes on how to classify text documents (e.g., by topic or author) using only off-the-shelf compression tools (such as WinZip or RAR)…
AMDL (Khmelev / Kukushkina et al. 2001)
For each class i with training documents D1, D2, D3, …:
- Concatenate the training data into Ai.
- Concatenate Ai and the test document T, giving AiT.
- Compress each Ai and each AiT.
- Subtract compressed file sizes: vi = |AiT| - |Ai|.
- Assign T to the class i with minimum vi.
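AMDL needs nothing beyond an off-the-shelf compressor. A minimal sketch using Python's gzip module (the paper used external tools; the function names here are illustrative):

```python
import gzip

def gzip_size(data: bytes) -> int:
    # Compressed length in bytes under gzip (DEFLATE) at maximum level.
    return len(gzip.compress(data, compresslevel=9))

def amdl_classify(train, test_doc):
    """AMDL: v_i = |A_iT| - |A_i|; assign T to the class minimizing v_i.

    train: dict mapping class label -> list of training documents (str).
    """
    t = test_doc.encode("utf-8")
    best_label, best_v = None, float("inf")
    for label, docs in train.items():
        a_i = " ".join(docs).encode("utf-8")  # concatenated training data A_i
        v_i = gzip_size(a_i + t) - gzip_size(a_i)
        if v_i < best_v:
            best_label, best_v = label, v_i
    return best_label
```

The gzip header overhead cancels in the subtraction, so vi approximates the extra bits needed to encode T given the class's training data.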
BCN (Benedetto et al. 2002)
Like AMDL, but concatenate each individual training document Dj with T:
- For every document Dj in every class, form DjT.
- Compress each Dj and each DjT.
- Subtract compressed file sizes: vDjT = |DjT| - |Dj|.
- Assign T to the class of the document Dj with minimum vDjT.
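BCN differs only in scoring every training document individually (a 1-nearest-neighbor flavor). A hypothetical sketch, again with gzip standing in for the compressors compared in the paper:

```python
import gzip

def bcn_classify(train, test_doc):
    """BCN: for each training document D_j, v = |D_jT| - |D_j|;
    assign T to the class of the single best-matching document.

    train: dict mapping class label -> list of training documents (str).
    """
    t = test_doc.encode("utf-8")
    best_label, best_v = None, float("inf")
    for label, docs in train.items():
        for doc in docs:
            d_j = doc.encode("utf-8")
            v = len(gzip.compress(d_j + t)) - len(gzip.compress(d_j))
            if v < best_v:
                best_label, best_v = label, v
    return best_label
```

Note the cost: one compressor run per training document per test, which is where the slowness reported below comes from.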
Compression Methods
- Gzip: Lempel-Ziv compression (LZ77). "Dictionary"-based; sliding window typically 32K.
- LZW (Lempel-Ziv-Welch): dictionary-based (16-bit); dictionary fills up on big corpora (typically after ~300KB).
- RAR (proprietary shareware): PPMII variant on text; Markov model over n-gram frequencies; (almost) unlimited memory.
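Gzip's 32K window limit is easy to observe directly: an exact repeat that begins more than 32 KB earlier cannot be matched. A small experiment with synthetic random data (illustrative only):

```python
import random
import zlib

random.seed(0)

def repeat_ratio(n):
    # Incompressible block of n random bytes, followed by an exact repeat.
    block = bytes(random.randrange(256) for _ in range(n))
    return len(zlib.compress(block + block, 9)) / len(zlib.compress(block, 9))

small = repeat_ratio(10_000)   # repeat starts 10 KB back: inside the 32 KB window
large = repeat_ratio(100_000)  # repeat starts 100 KB back: outside the window
# small stays near 1.0 (the repeat is found); large is near 2.0 (it is not)
```

This is the mechanism behind gzip's poor AMDL results on large corpora: once |Ai| far exceeds 32 KB, most of the training data is invisible when T is compressed.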
Previous Work
- Khmelev (and Kukushkina et al.): Russian authors.
- Thaper: LZ78, char- and word-based PPM.
- Frank et al.: compression (PPM) bad for topic. Teahan and Harper: compression (PPM) good.
- Benedetto et al.: gzip good for authors. Goodman: gzip bad!
- Khmelev and Teahan: RAR (PPM).
- Peng et al.: Markov language models.
Compression Good or Bad?
Scoring: we measured accuracy:
Accuracy = (total # correct classifications) / (total # tests)
(micro-averaged accuracy)
Why? Single-class labels, no tuning parameters.
AMDL Results

          Corpus            RAR          LZW    GZIP
  Author  Federalist (2)    0.94         0.83   0.67
          Gutenberg-10      0.82         0.65   0.62
          Reuters-9         0.78         0.66   0.79
  Topic   Reuters-10        0.87         0.84   0.83
          10news (20news)   0.96 (0.90)  0.66   0.56 (0.47)
          Sector (105)      0.90         0.61   0.19
RAR is a Star!
- RAR is the best-performing method on all but the small Reuters-9 corpus.
- Gzip's poor performance on large corpora is due to its 32K sliding window.
- LZW's poor performance: its dictionary fills up after ~300KB (among other reasons).
RAR on Standard Corpora - Comparison
90.5% for RAR on 20news:
- 89.2% Language Modeling (Peng et al. 2004)
- 86.2% Extended NB (Rennie et al. 2003)
- 82.1% PPMC (Teahan and Harper 2001)
89.6% for RAR on Sector:
- 93.6% SVM (Zhang and Oles 2001)
- 92.3% Extended NB (Rennie et al. 2003)
- 64.5% Multinomial NB (Ghani 2001)
AMDL vs. BCN
- Gzip / BCN good: due to processing each document separately with T (1-NN style).
- Gzip / AMDL bad.
- BCN was slow, probably due to more system calls and disk I/O.
            AMDL          BCN
Corpus      RAR    gzip   RAR    gzip
Federalist  0.94   0.67   0.78   0.78
Guten-10    0.82   0.62   0.75   0.72
Reuters-9   0.78   0.79   0.77   0.77
Why Good?!
Compression tools are character-based. (Stupid, remember?)
Better than word-based? WHY? Can they capture:
- sub-word
- word
- super-word
- non-word features?
Pre-processing
- STD: no change to input. ("the more – the better!")
- NoP: remove punctuation; replace white space (tab, line, paragraph & page breaks) with spaces. ("the more the better")
- WOS: NoP + word order scrambling. ("better the the more")
- RSW: NoP + random-string words. ("dqf tmdw dqf lkwe")
…and more…
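These transformations are simple to reproduce. A sketch of hypothetical NoP / WOS / RSW implementations (the exact rules in the paper may differ, e.g. in how punctuation is defined):

```python
import random
import re
import string

def nop(text):
    """NoP: strip punctuation, map all white space to single spaces."""
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def wos(text, seed=0):
    """WOS: NoP plus word-order scrambling (destroys super-word info)."""
    words = nop(text).split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def rsw(text, seed=0):
    """RSW: NoP, then replace each distinct word with a random string of
    the same length, consistently (word identity kept, sub-word info lost)."""
    rng = random.Random(seed)
    mapping, out = {}, []
    for w in nop(text).split():
        if w not in mapping:
            mapping[w] = "".join(rng.choice(string.ascii_lowercase) for _ in w)
        out.append(mapping[w])
    return " ".join(out)
```

Each transform removes one kind of feature while preserving the others, which is what lets the experiments isolate which feature types a compressor exploits.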
Non-words: Punctuation
Intuition:
punctuation usage is characteristic of writing style (authorship attribution).
Results:
Accuracy remained the same, or even increased, in many cases.
RAR is insensitive to punctuation removal.
Super-words: Word Order Scrambling (WOS)
WOS removes punctuation and scrambles word order.
It leaves sub-word and word info intact but destroys super-word relations.
RAR: accuracy declined on all but one corpus; RAR seems to exploit word sequences (n-grams?). An advantage over state-of-the-art bag-of-words methods such as SVM.
LZW & gzip: no consistent accuracy decline.
Summary
- Compared effectiveness of compression for text classification (compression methods × classification procedures).
- RAR (PPM) is a star, under AMDL.
  - BCN (1-NN) is slow(er) and never better in accuracy.
  - Compression good (Teahan and Harper).
  - Character-based Markov models good (Peng et al.).
- Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploits sub/super/non-word features.
  - RAR benefits from super-word info.
  - Suggests word-based methods might benefit from it too.
Future Research
- Test / confirm results on more and bigger corpora.
- Compare to state-of-the-art techniques:
  - Other compression / character-based methods.
  - SVM.
  - Word-based n-gram language modeling (Peng et al.).
  - Word-based compression?
- Use standard MDL (Teahan and Harper): faster, better insight.
- Sensitivity to class training-data imbalance: when is throwing away data desirable for compression?
Thank you!