On Compression-Based Text Classification


Transcript of On Compression-Based Text Classification

  • On Compression-Based Text Classification. Yuval Marton (1), Ning Wu (2), and Lisa Hellerstein (2). 1) University of Maryland; 2) Polytechnic University. ECIR-05, Santiago de Compostela, Spain, March 2005.

  • Compression for Text Classification?? Proposed in the last ~10 years; not well understood why it works. Objections: compression is stupid! slow! non-standard! Still, using compression tools is easy. Does it work? (Controversy. Mess.)

  • Overview: What is text classification? (problem setting); compression-based text classification; classification procedures (+ do it yourself!); compression methods (RAR, LZW, and gzip); experimental evaluation; why does it work?? (compression as a character-based method); influence of sub-word/super-word/non-word features; conclusions and future work.

  • Text Classification: Given a training corpus (labeled documents), learn how to label new (test) documents. Our setting is single-class: each document belongs to exactly one class. 3 topic-classification and 3 authorship-attribution tasks.

  • Classification by Compression: Compression programs build a model or dictionary of their input (language modeling); a better model means better compression. Idea: compress a document using different class models, and label it with the class achieving the highest compression rate. Minimum Description Length (MDL) principle: select the model with the shortest length of model + data.

  • Standard MDL (Teahan & Harper): For each class i, concatenate its training documents D1, D2, ... into Ai. Compress each Ai to build a class model Mi, then freeze the model. Compress the test document T using each frozen Mi, and assign T to its best compressor.
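
    Off-the-shelf tools cannot freeze a trained model, so the following is only a rough stand-in: a minimal Python sketch, assuming zlib's preset-dictionary feature as the "frozen" class model (it primes DEFLATE with at most the last 32 KB of the class's training text; all function names here are hypothetical).

      import zlib

      def compressed_size(data: bytes, zdict: bytes = b"") -> int:
          """DEFLATE-compressed size of data, optionally primed with a
          preset dictionary (zlib uses only the last 32 KB of zdict)."""
          if zdict:
              comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                                      zlib.DEF_MEM_LEVEL,
                                      zlib.Z_DEFAULT_STRATEGY, zdict)
          else:
              comp = zlib.compressobj(9)
          return len(comp.compress(data) + comp.flush())

      def classify_mdl(test_doc: bytes, training: dict) -> str:
          """Assign test_doc to the class whose 'model' (a preset dictionary
          built from that class's training bytes) compresses it best."""
          return min(training,
                     key=lambda c: compressed_size(test_doc, training[c]))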

  • Do it yourself: Five minutes on how to classify text documents, e.g., according to their topic or author, using only off-the-shelf compression tools (such as WinZip or RAR).

  • AMDL (Khmelev / Kukushkina et al. 2001): For each class i, concatenate its training documents into Ai, and concatenate Ai with the test document T to get AiT. Compress each Ai and each AiT, and subtract the compressed file sizes: vi = |AiT| - |Ai|. Assign T to the class i with minimal vi.
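
    A minimal in-memory sketch of AMDL, with Python's gzip module standing in for the RAR/LZW/gzip binaries that the experiments run over files (function names are mine):

      import gzip

      def gz_size(data: bytes) -> int:
          # |C(x)|: length of the gzip-compressed byte string.
          return len(gzip.compress(data, compresslevel=9))

      def classify_amdl(test_doc: bytes, training: dict) -> str:
          """AMDL: v_i = |C(A_i + T)| - |C(A_i)|; pick the class with the
          minimal v_i. training maps class name -> concatenated bytes A_i."""
          scores = {cls: gz_size(a + test_doc) - gz_size(a)
                    for cls, a in training.items()}
          return min(scores, key=scores.get)

    For example, classify_amdl(t, {"hamilton": a1, "madison": a2}) would attribute a disputed Federalist paper t to whichever author's concatenated training text compresses it more cheaply.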

  • BCN (Benedetto et al. 2002): Like AMDL, but concatenate each individual training document Dj with T to get DjT. Compress each Dj and each DjT, and subtract the compressed file sizes: vj = |DjT| - |Dj|. Assign T to the class of the document Dj with minimal vj.
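
    The same sketch adapted to BCN, scoring every training document individually; gzip again stands in for the compressors used in the experiments:

      import gzip

      def classify_bcn(test_doc: bytes, docs_by_class: dict) -> str:
          """BCN: v_j = |C(D_j + T)| - |C(D_j)| for every training document
          D_j; return the class of the best single document (a 1-NN rule)."""
          size = lambda b: len(gzip.compress(b, compresslevel=9))
          best_cls, best_v = None, float("inf")
          for cls, docs in docs_by_class.items():
              for d in docs:
                  v = size(d + test_doc) - size(d)
                  if v < best_v:
                      best_cls, best_v = cls, v
          return best_cls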

  • Compression Methods: Gzip: Lempel-Ziv compression (LZ77); dictionary-based; sliding window, typically 32 KB. LZW (Lempel-Ziv-Welch): dictionary-based (16-bit); dictionary fills up on big corpora (typically after ~300 KB). RAR (proprietary shareware): PPMII variant on text; Markov model over n-gram frequencies. Effective memory: gzip ~32 KB; LZW 16-bit (~300 KB); RAR (almost) unlimited.

  • Previous Work: Khmelev et al. (+ Kukushkina): Russian authors. Thaper: LZ78, char- and word-based PPM. Frank et al.: compression (PPM) bad for topic. Teahan and Harper: compression (PPM) good. Benedetto et al.: gzip good for authorship. Goodman: gzip bad! Khmelev and Teahan: RAR (PPM). Peng et al.: Markov language models.

  • Compression: Good or Bad? Scoring: we measured accuracy = (total # correct classifications) / (total # tests), i.e., micro-averaged accuracy. Why this metric? Single-class labels, no tuning parameters.
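
    Spelled out as code, the score pools correct counts over all test documents (a trivial, hypothetical helper):

      def micro_accuracy(predicted, gold):
          """Micro-averaged accuracy: total # correct / total # tests."""
          return sum(p == g for p, g in zip(predicted, gold)) / len(gold)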

  • AMDL Results

    Task    Corpus           RAR          LZW   GZIP
    Author  Federalist (2)   0.94         0.83  0.67
            Gutenberg-10     0.82         0.65  0.62
            Reuters-9        0.78         0.66  0.79
    Topic   Reuters-10       0.87         0.84  0.83
            10news (20news)  0.96 (0.90)  0.66  0.56 (0.47)
            Sector (105)     0.90         0.61  0.19

  • RAR is a Star! RAR is the best-performing method on all but the small Reuters-9 corpus. Poor performance of gzip on large corpora is due to its 32 KB sliding window. Poor performance of LZW: its dictionary fills up after ~300 KB, among other reasons.

  • RAR on Standard Corpora: Comparison. 90.5% for RAR on 20news, vs.: 89.2% language modeling (Peng et al. 2004); 86.2% extended NB (Rennie et al. 2003); 82.1% PPMC (Teahan and Harper 2001). 89.6% for RAR on Sector, vs.: 93.6% SVM (Zhang and Oles 2001); 92.3% extended NB (Rennie et al. 2003); 64.5% multinomial NB (Ghani 2001).

  • AMDL vs. BCN: Gzip/BCN is good, thanks to processing each document separately with T (a 1-NN effect); gzip/AMDL is bad. BCN was slow, probably due to more system calls and disk I/O.

                 AMDL          BCN
    Corpus       RAR    gzip   RAR    gzip
    Federalist   0.94   0.67   0.78   0.78
    Guten-10     0.82   0.62   0.75   0.72
    Reuters-9    0.78   0.79   0.77   0.77

  • Why Good?! Compression tools are character-based. (Stupid, remember?) Better than word-based? WHY? Can they capture sub-word, word, super-word, and non-word features?

  • Pre-processing: STD: no change to input. NoP: remove punctuation; replace whitespace (tab, line, paragraph & page breaks) with spaces. WOS: NoP + word-order scrambling. RSW: NoP + random-string words. Example: "the more the better!" becomes NoP "the more the better"; WOS, e.g., "more the better the"; RSW, e.g., "dqf tmdw dqf lkwe". (A sketch of the transforms follows below.)
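
    A minimal sketch of the three destructive transforms, under my reading of the setup (in particular, that RSW maps each distinct word to one fixed random string; function names are mine):

      import random, re, string

      def nop(text: str) -> str:
          """NoP: strip punctuation; map tabs, line/paragraph/page breaks,
          and other whitespace runs to single spaces."""
          return " ".join(re.sub(r"[^\w\s]", "", text).split())

      def wos(text: str, seed: int = 0) -> str:
          """WOS: NoP + scramble word order (destroys super-word relations,
          keeps word and sub-word info)."""
          words = nop(text).split()
          random.Random(seed).shuffle(words)
          return " ".join(words)

      def rsw(text: str, seed: int = 0) -> str:
          """RSW: NoP + replace each distinct word with a fixed random
          string of the same length (destroys sub-word info)."""
          rng, mapping, out = random.Random(seed), {}, []
          for w in nop(text).split():
              if w not in mapping:
                  mapping[w] = "".join(rng.choice(string.ascii_lowercase)
                                       for _ in range(len(w)))
              out.append(mapping[w])
          return " ".join(out)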

  • Non-words: Punctuation. Intuition: punctuation usage is characteristic of writing style (authorship attribution). Results: accuracy remained the same, or even increased, in many cases; RAR is insensitive to punctuation removal.

  • Super-words: Word-Order Scrambling (WOS). WOS removes punctuation and scrambles word order: it leaves sub-word and word info intact but destroys super-word relations. RAR: accuracy declined on all but one corpus, so RAR seems to exploit word sequences (n-grams?), an advantage over state-of-the-art bag-of-words methods such as SVM. LZW & gzip: no consistent accuracy decline.

  • Summary: Compared the effectiveness of compression for text classification (compression methods x classification procedures). RAR (PPM) is a star under AMDL: BCN (1-NN) is slower and never better in accuracy; compression is good (Teahan and Harper); character-based Markov models are good (Peng et al.). Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploits sub-word/super-word/non-word features. RAR benefits from super-word info, which suggests word-based methods might benefit from it too.

  • Future Research: Test / confirm results on more and bigger corpora. Compare to state-of-the-art techniques: other compression / character-based methods; SVM; word-based n-gram language modeling (Peng et al.); word-based compression? Use Standard MDL (Teahan and Harper): faster, better insight. Sensitivity to class imbalance in the training data. When is throwing away data desirable for compression?

  • Thank you!
