
FEATURE GENERATION FOR TEXTUAL INFORMATION RETRIEVAL USING WORLD KNOWLEDGE

EVGENIY GABRILOVICH

FEATURE GENERATION FOR TEXTUAL INFORMATION RETRIEVAL USING WORLD KNOWLEDGE

RESEARCH THESIS

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

EVGENIY GABRILOVICH

SUBMITTED TO THE SENATE OF THE TECHNION ISRAEL INSTITUTE OF TECHNOLOGY KISLEV, 5767 HAIFA DECEMBER, 2006

THIS RESEARCH THESIS WAS DONE UNDER THE SUPERVISION OF DR. SHAUL MARKOVITCH IN THE DEPARTMENT OF COMPUTER SCIENCE

ACKNOWLEDGMENTS

I am indebted to my advisor, Shaul Markovitch, for the guidance he afforded me during my studies. Shaul gave me plenty of freedom to explore the directions I was most interested in. At the same time, however, he constantly reminded me of the importance of focusing on the essential things and foregoing lower-hanging fruit that carries little benefit in the larger perspective. Shaul's comments on paper drafts were always to the point, and invariably led to much more coherent and simply better papers. From our countless discussions, I adopted a useful habit of always thinking about evaluation before launching a new study. Shaul also provided me with support at the times when my spirit was low, always insisting that there is light at the end of the tunnel. I am thankful to Susan Dumais and Eric Horvitz for hosting me as a summer intern at Microsoft Research. I also thank Arkady Borkovsky and Hadar Shemtov for arranging my summer internship at Yahoo. I am grateful to Alex Gontmakher, my longtime friend and office roommate. Alex offered his help on things too many to mention, from program optimization to idiosyncrasies of Linux. Alex was also solely responsible for persuading me to study Perl, which then quickly became my scripting language of choice, and for spending hours teaching me new Perl idioms. I am thankful to Erez Halahmi, my former manager at Comverse Technology, for teaching me how to manage large software projects. Without his coaching, I would hardly have been able to write tens of thousands of lines of code during my Ph.D. studies, and would likely still be engrossed in debugging.

Throughout my doctoral studies, I benefited from numerous discussions with my friends and colleagues, who shared their wisdom, offered views from the perspective of their fields of expertise, and just helped me when I needed some advice. I am grateful to all of them for their time and assistance: Gregory Begelman, Beata Beigman-Klebanov, Ron Bekkerman, Ido Dagan, Dmitry Davidov, Gideon Dror, Ofer Egozi, Saher Esmeir, Ran El-Yaniv, Anna Feldman, Ariel Felner, Lev Finkelstein, Maayan Geffet, Oren Glickman, Yaniv Hamo, Alon Itai, Shyam Kapur, Richard Kaspersky, Ronny Lempel, Dmitry Leshchiner, Irit Opher, Dan Roth, Eytan Ruppin, Dmitry Rusakov, Fabrizio Sebastiani, Vitaly Skachek, Frank Smadja, Gennady Sterlin, Vitaly Surazhsky, Stan Szpakowicz, Peter Turney, and Shuly Wintner. I am especially thankful to my wife Lena for her unconditional love, for putting up with the countless hours I spent on my thesis work, and for being there when I needed it. Finally, I am deeply beholden to my parents, Margarita Sterlina and Solomon Gabrilovich, for their immeasurable love and support. I am also thankful to my parents for teaching me the value of knowledge and education. No words in any natural language would be sufficient to thank my parents for all they have done for me. This thesis is dedicated to them.

THE GENEROUS FINANCIAL HELP OF THE TECHNION IS GRATEFULLY ACKNOWLEDGED

To my parents, Margarita Efimovna Sterlina and Solomon Isaakovich Gabrilovich

An investment in knowledge always pays the best interest. Benjamin Franklin

Contents

Abstract

1 Introduction
  1.1 Proposed Solution
  1.2 Contributions of This Thesis
  1.3 Thesis Outline

2 Background
  2.1 Text Categorization
    2.1.1 Document Features
    2.1.2 Feature Selection
    2.1.3 Feature Valuation
    2.1.4 Metrics
    2.1.5 Notes on Evaluation
  2.2 Problems With the Bag of Words Approach
  2.3 Feature Generation

3 Feature Generation Methodology
  3.1 Overview
  3.2 Requirements on Suitable Knowledge Repositories
  3.3 Building a Feature Generator
    3.3.1 Attribute Selection
    3.3.2 Feature Generation per se
  3.4 Contextual Feature Generation
    3.4.1 Analyzing Local Contexts
    3.4.2 Feature Selection
    3.4.3 Feature Valuation
    3.4.4 Revisiting the Running Example
  3.5 Using Hierarchically-Structured Knowledge Repositories
  3.6 Using Knowledge Repositories that Define Arbitrary Relations Between Concepts

4 Instantiation of the Feature Generation Methodology for the ODP and Wikipedia
  4.1 Using the Open Directory for Feature Generation
    4.1.1 Multiplying Knowledge Through Web Crawling
    4.1.2 Noise Reduction and Attribute Selection
    4.1.3 Implementation Details
  4.2 Using Wikipedia for Feature Generation
    4.2.1 Wikipedia as a Knowledge Repository
    4.2.2 Feature Generator Design
    4.2.3 Using the Link Structure
    4.2.4 Implementation Details

5 Empirical Evaluation of Feature Generation for Text Categorization
  5.1 Test Collections
    5.1.1 Reuters-21578
    5.1.2 20 Newsgroups (20NG)
    5.1.3 Movie Reviews
    5.1.4 Reuters Corpus Version 1 (RCV1)
    5.1.5 OHSUMED
    5.1.6 Short Documents
    5.1.7 Automatic Acquisition of Data Sets
  5.2 Experimentation Procedure
    5.2.1 Text Categorization Infrastructure
    5.2.2 Baseline Performance of Hogwarts
    5.2.3 Using the Feature Generator
  5.3 ODP-based Feature Generation
    5.3.1 Qualitative Analysis of Feature Generation
    5.3.2 The Effect of Feature Generation
    5.3.3 The Effect of Contextual Analysis
    5.3.4 The Effect of Knowledge Breadth
    5.3.5 The Utility of Feature Selection
    5.3.6 The Effect of Category Size
    5.3.7 The Effect of Feature Generation for Classifying Short Documents
    5.3.8 Processing Time
  5.4 Wikipedia-based Feature Generation
    5.4.1 Qualitative Analysis of Feature Generation
    5.4.2 The Effect of Feature Generation
    5.4.3 The Effect of Knowledge Breadth
    5.4.4 Classifying Short Documents
    5.4.5 Using Inter-article Links as Concept Relations

6 Using Feature Generation for Computing Semantic Relatedness of Texts
  6.1 Explicit Semantic Analysis
  6.2 Empirical Evaluation of Explicit Semantic Analysis
    6.2.1 Test Collections
    6.2.2 The Effect of External Knowledge

7 Related Work
  7.1 Beyond the Bag of Words
  7.2 Feature Generation for Text Categorization
    7.2.1 Feature Generation Using Electronic Dictionaries
    7.2.2 Comparing Knowledge Sources for Feature Generation: ODP versus WordNet
    7.2.3 Using Unlabeled Examples
    7.2.4 Other Related Studies
  7.3 Semantic Similarity and Semantic Relatedness

8 Conclusions

A Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5
  A.1 Introduction
  A.2 Experimental Methodology
    A.2.1 Datasets
    A.2.2 Predicting the Utility of Feature Selection
    A.2.3 Extended Feature Set Based on WordNet
    A.2.4 Feature Selection Algorithms
    A.2.5 Classification Algorithms and Measures
  A.3 E