On Compression-Based Text Classification


Transcript of On Compression-Based Text Classification

  • On Compression-Based Text Classification. Yuval Marton (1), Ning Wu (2), and Lisa Hellerstein (2). 1) University of Maryland; 2) Polytechnic University. ECIR-05, Santiago de Compostela, Spain, March 2005.

  • Compression for Text Classification?? Proposed in the last ~10 years; not well understood why it works. Objections: compression is stupid! slow! non-standard! Still, using compression tools is easy. Does it work? (Controversy. Mess.)

  • Overview: What is text classification? (problem setting); compression-based text classification; classification procedures (+ do it yourself!); compression methods (RAR, LZW, and gzip); experimental evaluation; why does it work?? (compression as a character-based method); influence of sub-word/super-word/non-word features; conclusions and future work.

  • Text Classification: Given a training corpus (labeled documents), learn how to label new (test) documents. Our setting is single-class: each document belongs to exactly one class. 3 topic-classification and 3 authorship-attribution tasks.

  • Classification by Compression: Compression programs build a model or dictionary of their input (language modeling); a better model means better compression. Idea: compress a document using different class models, and label it with the class achieving the highest compression rate. Minimum Description Length (MDL) principle: select the model with the shortest length of model + data.

  • Standard MDL (Teahan & Harper): For each class i, concatenate its training documents D1, D2, ... into Ai. Compress each Ai to build a class model Mi, then freeze the model. Compress the test document T using each frozen Mi, and assign T to its best compressor.
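
    Off-the-shelf tools cannot freeze a trained model, so the following is only a rough stand-in: a minimal Python sketch, assuming zlib's preset-dictionary feature as the "frozen" class model (it primes DEFLATE with at most the last 32 KB of the class's training text; all function names here are hypothetical).

      import zlib

      def compressed_size(data: bytes, zdict: bytes = b"") -> int:
          """DEFLATE-compressed size of data, optionally primed with a
          preset dictionary (zlib uses only the last 32 KB of zdict)."""
          if zdict:
              comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                                      zlib.DEF_MEM_LEVEL,
                                      zlib.Z_DEFAULT_STRATEGY, zdict)
          else:
              comp = zlib.compressobj(9)
          return len(comp.compress(data) + comp.flush())

      def classify_mdl(test_doc: bytes, training: dict) -> str:
          """Assign test_doc to the class whose 'model' (a preset dictionary
          built from that class's training bytes) compresses it best."""
          return min(training,
                     key=lambda c: compressed_size(test_doc, training[c]))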

  • Do it yourself: Five minutes on how to classify text documents, e.g., according to their topic or author, using only off-the-shelf compression tools (such as WinZip or RAR).

  • AMDL (Khmelev / Kukushkina et al. 2001): For each class i, concatenate its training documents into Ai, and concatenate Ai with the test document T to get AiT. Compress each Ai and each AiT, and subtract the compressed file sizes: vi = |AiT| - |Ai|. Assign T to the class i with minimal vi.
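
    A minimal in-memory sketch of AMDL, with Python's gzip module standing in for the RAR/LZW/gzip binaries that the experiments run over files (function names are mine):

      import gzip

      def gz_size(data: bytes) -> int:
          # |C(x)|: length of the gzip-compressed byte string.
          return len(gzip.compress(data, compresslevel=9))

      def classify_amdl(test_doc: bytes, training: dict) -> str:
          """AMDL: v_i = |C(A_i + T)| - |C(A_i)|; pick the class with the
          minimal v_i. training maps class name -> concatenated bytes A_i."""
          scores = {cls: gz_size(a + test_doc) - gz_size(a)
                    for cls, a in training.items()}
          return min(scores, key=scores.get)

    For example, classify_amdl(t, {"hamilton": a1, "madison": a2}) would attribute a disputed Federalist paper t to whichever author's concatenated training text compresses it more cheaply.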

  • BCN (Benedetto et al. 2002): Like AMDL, but concatenate each individual training document Dj with T to get DjT. Compress each Dj and each DjT, and subtract the compressed file sizes: vj = |DjT| - |Dj|. Assign T to the class of the document Dj with minimal vj.
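
    The same sketch adapted to BCN, scoring every training document individually; gzip again stands in for the compressors used in the experiments:

      import gzip

      def classify_bcn(test_doc: bytes, docs_by_class: dict) -> str:
          """BCN: v_j = |C(D_j + T)| - |C(D_j)| for every training document
          D_j; return the class of the best single document (a 1-NN rule)."""
          size = lambda b: len(gzip.compress(b, compresslevel=9))
          best_cls, best_v = None, float("inf")
          for cls, docs in docs_by_class.items():
              for d in docs:
                  v = size(d + test_doc) - size(d)
                  if v < best_v:
                      best_cls, best_v = cls, v
          return best_cls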

  • Compression Methods: Gzip: Lempel-Ziv compression (LZ77); dictionary-based; sliding window, typically 32 KB. LZW (Lempel-Ziv-Welch): dictionary-based (16-bit); dictionary fills up on big corpora (typically after ~300 KB). RAR (proprietary shareware): PPMII variant on text; Markov model over n-gram frequencies. Effective memory: gzip ~32 KB; LZW 16-bit (~300 KB); RAR (almost) unlimited.

  • Previous Work: Khmelev et al. (+ Kukushkina): Russian authors. Thaper: LZ78, char- and word-based PPM. Frank et al.: compression (PPM) bad for topic. Teahan and Harper: compression (PPM) good. Benedetto et al.: gzip good for authorship. Goodman: gzip bad! Khmelev and Teahan: RAR (PPM). Peng et al.: Markov language models.

  • Compression: Good or Bad? Scoring: we measured accuracy = (total # correct classifications) / (total # tests), i.e., micro-averaged accuracy. Why this metric? Single-class labels, no tuning parameters.
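
    Spelled out as code, the score pools correct counts over all test documents (a trivial, hypothetical helper):

      def micro_accuracy(predicted, gold):
          """Micro-averaged accuracy: total # correct / total # tests."""
          return sum(p == g for p, g in zip(predicted, gold)) / len(gold)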

  • AMDL Results

    Task    Corpus           RAR          LZW   GZIP
    Author  Federalist (2)   0.94         0.83  0.67
            Gutenberg-10     0.82         0.65  0.62
            Reuters-9        0.78         0.66  0.79
    Topic   Reuters-10       0.87         0.84  0.83
            10news (20news)  0.96 (0.90)  0.66  0.56 (0.47)
            Sector (105)     0.90         0.61  0.19

  • RAR is a Star! RAR is the best-performing method on all but the small Reuters-9 corpus. Poor performance of gzip on large corpora is due to its 32 KB sliding window. Poor performance of LZW: its dictionary fills up after ~300 KB, among other reasons.

  • RAR on Standard Corpora: Comparison. 90.5% for RAR on 20news, vs.: 89.2% language modeling (Peng et al. 2004); 86.2% extended NB (Rennie et al. 2003); 82.1% PPMC (Teahan and Harper 2001). 89.6% for RAR on Sector, vs.: 93.6% SVM (Zhang and Oles 2001); 92.3% extended NB (Rennie et al. 2003); 64.5% multinomial NB (Ghani 2001).

  • AMDL vs. BCN: Gzip/BCN is good, thanks to processing each document separately with T (a 1-NN effect); gzip/AMDL is bad. BCN was slow, probably due to more system calls and disk I/O.

                 AMDL          BCN
    Corpus       RAR    gzip   RAR    gzip
    Federalist   0.94   0.67   0.78   0.78
    Guten-10     0.82   0.62   0.75   0.72
    Reuters-9    0.78   0.79   0.77   0.77

  • Why Good?! Compression tools are character-based. (Stupid, remember?) Better than word-based? WHY? Can they capture sub-word, word, super-word, and non-word features?

  • Pre-processing: STD: no change to input. NoP: remove punctuation; replace whitespace (tab, line, paragraph & page breaks) with spaces. WOS: NoP + word-order scrambling. RSW: NoP + random-string words. Example: "the more the better!" becomes NoP "the more the better"; WOS, e.g., "more the better the"; RSW, e.g., "dqf tmdw dqf lkwe". (A sketch of the transforms follows below.)
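
    A minimal sketch of the three destructive transforms, under my reading of the setup (in particular, that RSW maps each distinct word to one fixed random string; function names are mine):

      import random, re, string

      def nop(text: str) -> str:
          """NoP: strip punctuation; map tabs, line/paragraph/page breaks,
          and other whitespace runs to single spaces."""
          return " ".join(re.sub(r"[^\w\s]", "", text).split())

      def wos(text: str, seed: int = 0) -> str:
          """WOS: NoP + scramble word order (destroys super-word relations,
          keeps word and sub-word info)."""
          words = nop(text).split()
          random.Random(seed).shuffle(words)
          return " ".join(words)

      def rsw(text: str, seed: int = 0) -> str:
          """RSW: NoP + replace each distinct word with a fixed random
          string of the same length (destroys sub-word info)."""
          rng, mapping, out = random.Random(seed), {}, []
          for w in nop(text).split():
              if w not in mapping:
                  mapping[w] = "".join(rng.choice(string.ascii_lowercase)
                                       for _ in range(len(w)))
              out.append(mapping[w])
          return " ".join(out)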

  • Non-words: Punctuation. Intuition: punctuation usage is characteristic of writing style (authorship attribution). Results: accuracy remained the same, or even increased, in many cases; RAR is insensitive to punctuation removal.

  • Super-words: Word-Order Scrambling (WOS). WOS removes punctuation and scrambles word order: it leaves sub-word and word info intact but destroys super-word relations. RAR: accuracy declined on all but one corpus, so RAR seems to exploit word sequences (n-grams?), an advantage over state-of-the-art bag-of-words methods such as SVM. LZW & gzip: no consistent accuracy decline.

  • Summary: Compared the effectiveness of compression for text classification (compression methods x classification procedures). RAR (PPM) is a star under AMDL: BCN (1-NN) is slower and never better in accuracy; compression is good (Teahan and Harper); character-based Markov models are good (Peng et al.). Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploits sub-word/super-word/non-word features. RAR benefits from super-word info, which suggests word-based methods might benefit from it too.

  • Future Research: Test / confirm results on more and bigger corpora. Compare to state-of-the-art techniques: other compression / character-based methods; SVM; word-based n-gram language modeling (Peng et al.); word-based compression? Use Standard MDL (Teahan and Harper): faster, better insight. Sensitivity to class imbalance in the training data. When is throwing away data desirable for compression?

  • Thank you!
