Chapter 2. Extracting Lexical Features
January 23, 2007, Artificial Intelligence Lab, 조선호
Text: FINDING OUT ABOUT, pages 39-59
Preview

2.1 Building Useful Tools
2.2 Inter-document Parsing
2.3 Intra-document Parsing
  2.3.1 Stemming and Other Morphological Processing
  2.3.2 Noise Words
  2.3.3 Summary
2.4 Example Corpora
2.5 Implementation
  2.5.1 Basic Algorithm
  2.5.2 Fine Points
  2.5.3 Software Libraries
2.1 Building Useful Tools

Introduces the example of an IR system.

The three main phases of search engine development:
1. First phase: convert an arbitrary pile of textual objects into a well-defined corpus of documents, in which the string of terms each document contains is indexed.
2. Second phase: build an efficient data structure by inverting the Index relation ☞ this lets us find all documents containing a particular keyword (more useful than finding which keywords a particular document contains).
3. Third phase: match queries against the index to retrieve the documents most similar to the query.

Extracting lexical features is used mainly in the first and second phases: the goal is to extract a set of features that remain meaningful for later analysis. The specification of this unit-level feature set produced by this step is important.

Level of analysis: documents, words, roots, characters, ...
2.2 Inter-document Parsing

The step that turns a corpus (an arbitrary "pile of text") into individual, searchable documents.

Examples: AI theses (AIT) and email.

Multiple text fields:
- implemented as concatenation, using annotations
- used as proxies in the hitlist
- used for special emphasis

Pre-filters for special document classes:
- deTeX
- HTML and XML parsers (SAX, DOM)

Email, like mark-up languages (TeX, XML, HTML), is an example of structural information carried by how the document is composed ☞ filters exist to extract the meaningful text.
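The pre-filter idea can be sketched for HTML with Python's standard library: strip the tags and keep only the meaningful text, in the spirit of deTeX for TeX documents. This is an illustrative sketch, not the book's implementation.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Pre-filter that keeps only the text content of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags; the tags themselves are dropped.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

parser = TextExtractor()
parser.feed("<html><body><h1>Chapter 2</h1><p>Extracting lexical features.</p></body></html>")
print(parser.text())  # Chapter 2Extracting lexical features.
```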
2.3 Intra-document Parsing

A file can simply be treated as a stream of characters.

Processing a string of characters:
- assemble characters into tokens (tokenizer)
- choose which tokens to index

Lexical analyzer generators (e.g., lex/yacc). The basic idea is a finite state machine: triples of (input state, transition token, output state).
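A minimal tokenizer can be sketched with a regular expression, which plays the role of the finite state machine a generator like lex would compile from token rules (the token definition "runs of letters, lowercased" is an assumption for illustration):

```python
import re

# One token rule: a token is a maximal run of letters.
TOKEN_RE = re.compile(r"[A-Za-z]+")

def tokenize(text):
    """Assemble characters into lowercase word tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]

print(tokenize("Extracting Lexical Features, page 39"))
# ['extracting', 'lexical', 'features', 'page']
```

Note that punctuation and digits are already thrown away at this stage, which is exactly the efficiency/power trade-off described on the next slide.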
Lexical Analyzer

The output of the lexical analyzer is a string of tokens, and all remaining operations work on these tokens. We have already thrown away some information; this makes processing more efficient, but somewhat limits the power of our search.

Use the same lexical analysis for both documents and queries!
Stemming and Other Morphological Processing
- Conflation
- Stemming
- Rewrite rules
- Porter stemmer
- Other approaches
- Phrases
Stemming

Additional processing at the token level (covered earlier this semester). Turn words into a canonical form: "cars" into "car", "children" into "child", "walked" into "walk". This decreases the total number of different tokens to be processed. It decreases the precision of a search, but increases its recall.
Conflation
Stemming

In stemming, suffixes are removed. Examples of reducing plurals to their singular forms:
- WOMAN / WOMEN
- LEAF / LEAVES
- FERRY / FERRIES
- ALUMNUS / ALUMNI
- DATUM / DATA

Rewrite rules
Porter stemmer Rules
Rule matching
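The rewrite-rule idea can be illustrated with the plural rules from step 1a of the Porter stemmer (this is a tiny sketch of rule matching, not the full stemmer): each rule rewrites a suffix, and rules are tried longest-first so the longest matching suffix wins.

```python
# (suffix, replacement) rewrite rules, ordered longest-first.
RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (blocks the bare "s" rule)
    ("s", ""),       # cats     -> cat
]

def stem_step1a(word):
    """Apply the first matching rewrite rule; leave the word unchanged otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(stem_step1a("ponies"))    # poni
print(stem_step1a("caresses"))  # caress
print(stem_step1a("cats"))      # cat
```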
Other approaches
Phrases
Noise Words

a.k.a. stop words, negative dictionaries. These are function words that contribute little or nothing to meaning, and very frequent words generally: if a word occurs in every document, it is not useful in choosing among documents. However, this needs care, because it is corpus-dependent. Often implemented as a discrete list.
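A discrete stop list is simply a set membership test per token. A minimal sketch (the stop list here is illustrative, not a standard negative dictionary):

```python
# Illustrative stop list; in practice it should be tuned to the corpus.
STOP_WORDS = {"the", "a", "an", "of", "you"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["lord", "of", "the", "rings"]))
# ['lord', 'rings']
```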
Summary

- A text document is represented by the words it contains (and their occurrences), e.g., "Lord of the rings" → {"the", "Lord", "rings", "of"}. Highly efficient; makes learning far simpler and easier. The order of words is not that important for certain applications.
- Stemming reduces dimensionality by identifying a word with its root, e.g., flying, flew → fly.
- Stop words identify the most common words, which are unlikely to help with text mining, e.g., "the", "a", "an", "you".
2.4 Example Corpora

We are assuming a fixed corpus. Some sample corpora:
- AIT
- Email (anyone's email)
- Reuters corpus
- Brown corpus

A corpus will contain textual fields, and maybe structured attributes:
- Textual: free, unformatted, no meta-information; NLP is mostly needed here.
- Structured: additional information beyond the content.
AI Theses (AIT)
AIT year Distribution
Structured Fields for Email

An email message's header: From, To, Cc, Subject, Date.
Text Fields for Email

- Subject: format is structured, content is arbitrary. Captures the most critical part of the content. A proxy for the content, but may be inaccurate.
- Body of email: highly irregular, informal English. The entire document, not a summary. Spelling and grammar irregularities; structure and length vary.
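Splitting a message into structured header fields and a free-text body can be sketched with Python's standard email module (the sample message below is made up):

```python
import email

# A made-up RFC 822-style message: header fields, blank line, then the body.
raw = """From: alice@example.com
To: bob@example.com
Subject: lexical features
Date: Tue, 23 Jan 2007 10:00:00 +0900

Please read chapter 2 before the seminar.
"""

msg = email.message_from_string(raw)
print(msg["Subject"])     # the structured Subject field
print(msg.get_payload())  # the free-text body
```

The structured fields come out ready to index as attributes, while the body goes through the ordinary intra-document parsing pipeline.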
2.5 Implementation

Indexing

We have a tokenized, stemmed sequence of words. The next step is to parse each document, extracting index terms. Assume that each token is a word and that we don't want to recognize any structures more complex than single words. When all documents are processed, create the index.
Basic algorithm
Figure 2.4 Basic Posting Data Structure
Basic Indexing Algorithm

For each document in the corpus:
- get the next token
- create or update an entry in a list (doc ID, frequency)

For each token found in the corpus:
- calculate #docs and total frequency
- sort by frequency

This is often called a "reverse index" (inverted index), because it reverses the "words in a document" index into a "documents containing words" index. It may be built on the fly or created after indexing.
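The algorithm above can be sketched in a few lines; the tiny tokenized corpus is an illustrative assumption, and the nested dictionary stands in for the posting data structure of Figure 2.4:

```python
from collections import defaultdict

# Tokenized, stemmed documents keyed by doc ID (made-up data).
corpus = {
    1: ["extracting", "lexical", "features"],
    2: ["lexical", "analysis", "lexical", "features"],
}

# Inverted index: token -> {doc ID: frequency}.
index = defaultdict(lambda: defaultdict(int))
for doc_id, tokens in corpus.items():
    for token in tokens:
        index[token][doc_id] += 1

# Per-token statistics: number of documents and total frequency.
for token, postings in sorted(index.items()):
    print(token, len(postings), sum(postings.values()))
```

Built this way the index is constructed on the fly, one posting update per token occurrence.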
Refined Posting Data Structures
Minimizing OS dependencies
Fine Points

- Dynamic corpora (e.g., the web) require incremental algorithms.
- Higher-resolution data (e.g., character position) supports highlighting, supports phrase searching, and is useful in relevance ranking.
- Giving extra weight to proxy text (typically by doubling or tripling its frequency count).
- Document-type-specific processing: in HTML, we want to ignore tags; in email, we may want to ignore quoted material.
Basic Measures for Text Retrieval

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)

    precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

    recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

[Venn diagram: within All Documents, the Relevant and Retrieved sets overlap in "Relevant & Retrieved"]
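Both measures fall straight out of set operations; a worked sketch on made-up query results:

```python
# Made-up document IDs for one query.
relevant = {1, 2, 3, 4}   # judged relevant to the query
retrieved = {3, 4, 5}     # returned by the system

hit = relevant & retrieved              # {3, 4}: relevant AND retrieved
precision = len(hit) / len(retrieved)   # 2/3: how much of the output is correct
recall = len(hit) / len(relevant)       # 2/4: how much of the truth was found

print(precision, recall)
```

Here retrieving document 5 costs precision, while missing documents 1 and 2 costs recall, which mirrors the stemming trade-off earlier: conflating more word forms retrieves more relevant documents (recall up) at the price of more false hits (precision down).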