Robust Semantics, Information Extraction, and Information Retrieval
description
Transcript of Robust Semantics, Information Extraction, and Information Retrieval
![Page 1: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/1.jpg)
CS 4705
Robust Semantics, Information Extraction, and Information
Retrieval
![Page 2: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/2.jpg)
Problems with Syntax-Driven Semantics
• Syntactic structures often don’t fit semantic structures very well– Important semantic elements often distributed very
differently in trees for sentences that mean ‘the same’
I like soup. Soup is what I like.
– Parse trees contain many structural elements not clearly important to making semantic distinctions
– Syntax driven semantic representations are sometimes pretty verbose
V --> serves )},(),(),((,{ xeServedyeServerServingeIsayex
![Page 3: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/3.jpg)
Alternatives?
• Semantic Grammars• Information Extraction Techniques• Information Retrieval --> Information Extraction
![Page 4: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/4.jpg)
Semantic Grammars
• Alternative to modifying syntactic grammars to deal with semantics too
• Define grammars specifically in terms of the semantic information we want to extract– Domain specific: Rules correspond directly to entities
and activities in the domainI want to go from Boston to Baltimore on Thursday,
September 24th
– Greeting --> {Hello|Hi|Um…}– TripRequest Need-spec travel-verb from City to
City on Date
![Page 5: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/5.jpg)
Predicting User Input
• Semantic grammars rely upon knowledge of the task and (sometimes) constraints on what the user can do, when– Allows them to handle very sophisticated phenomena
I want to go to Boston on Thursday.
I want to leave from there on Friday for Baltimore.
TripRequest Need-spec travel-verb from City on Date for City
Dialogue postulate maps filler for ‘from-city’ to pre-specified from-city
![Page 6: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/6.jpg)
Drawbacks of Semantic Grammars
• Lack of generality– A new one for each application
– Large cost in development time
• Can be very large, depending on how much coverage you want
• If users go outside the grammar, things may break disastrouslyI want to leave from my house.
I want to talk to someone human.
![Page 7: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/7.jpg)
Information Extraction
• Another ‘robust’ alternative• Idea is to ‘extract’ particular types of information
from arbitrary text or transcribed speech• Examples:
– Named entities: people, places, organizations, times, dates
– Telephone numbers
<Organization> MIPS</Organization> Vice President <Person>John Hime</Person>
• Domains: Medical texts, broadcast news, voicemail,...
![Page 8: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/8.jpg)
Appropriate where Semantic Grammars and Syntactic Parsers are Not
• Appropriate where information needs very specific– Question answering systems, gisting of news or mail…– Job ads, financial information, terrorist attacks
• Input too complex and far-ranging to build semantic grammars
• But full-blown syntactic parsers are impractical– Too much ambiguity for arbitrary text– 50 parses or none at all– Too slow for real-time applications
![Page 9: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/9.jpg)
Information Extraction Techniques
• Often use a set of simple templates or frames with slots to be filled in from input text– Ignore everything else– My number is 212-555-1212.– The inventor of the wiggleswort was Capt. John T.
Hart.– The king died in March of 1932.
• Context (neighboring words, capitalization, punctuation) provides cues to help fill in the appropriate slots
![Page 10: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/10.jpg)
The IE Process
• Given a corpus and a target set of items to be extracted:– Clean up the corpus
– Tokenize it
– Do some hand labeling of target items
– Extract some simple features
• POS tags
• Phrase Chunks …
– Do some machine learning to associate features with target items or derive this associate by intuition
– Use e.g. FSTs, simple or cascaded to iteratively annotate the input, eventually identifying the slot fillers
![Page 11: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/11.jpg)
Some examples
• Semantic grammars• Information extraction
![Page 12: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/12.jpg)
Information Retrieval
• How related to NLP?– Operates on language (speech or text)– Does it use linguistic information?
• Stemming• Bag-of-words approach
– Does it make use of document formatting?• Headlines, punctuation, captions
• Collection: a set of documents• Term: a word or phrase• Query: a set of terms
![Page 13: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/13.jpg)
But…what is a term?
• Stop list• Stemming• Homonymy, polysemy, synonymy
![Page 14: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/14.jpg)
Vector Space Model
• Simple versions represent documents and queries as feature vectors, one binary feature for each term in collection
• Is t in this document or query or not?D = (t1,t2,…,tn)Q = (t1,t2,…,tn)• Similarity metric:how many terms does a query
share with each candidate document?
• Weighted terms: term-by-document matrixD = (wt1,wt2,…,wtn)Q = (wt1,wt2,…,wtn)
![Page 15: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/15.jpg)
• How do we compare the vectors?– Normalize each term weight by the number of terms in
the document: how important is each t in D?
– Compute dot product between vectors to see how similar they are
– Cosine of angle: 1 = identity; 0 = no common terms
• How do we get the weights?– Term frequency (tf): how often does t occur in D?
– Inverse document frequency (idf): # docs/ # docs term t occurs in
– tf . idf weighting: weight of term i for doc j is product of frequency of i in j with log of idf in collection
i
i nNidf log
iji idftfw ji ,,
![Page 16: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/16.jpg)
Evaluating IR Performance
• Precision: #rel docs returned/total #docs returned -- how often are you right when you say this document is relevant?
• Recall: #rel docs returned/#rel docs in collection -- how many of the relevant documents do you find?
• F-measure combines P and R RPPR
)(2
![Page 17: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/17.jpg)
Improving Queries
• Relevance feedback: users rate retrieved docs• Query expansion: many techniques
– e.g. add top N docs retrieved to query
• Term clustering: cluster rows of terms to produce synonyms and add to query
![Page 18: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/18.jpg)
IR Tasks
• Ad hoc retrieval: ‘normal’ IR• Routing/categorization: assign new doc to one of
predefined set of categories• Clustering: divide a collection into N clusters• Segmentation: segment text into coherent chunks• Summarization: compress a text by extracting
summary items• Question-answering: find a stretch of text
containing the answer to a question
![Page 19: Robust Semantics, Information Extraction, and Information Retrieval](https://reader035.fdocuments.net/reader035/viewer/2022062410/5681554b550346895dc31980/html5/thumbnails/19.jpg)
Summary
• Many approaches to ‘robust’ semantic analysis– Semantic grammars targeting particular domains
Utterance --> Yes/No Reply
Yes/No Reply --> Yes-Reply | No-Reply
Yes-Reply --> {yes,yeah, right, ok,”you bet”,…}
– Information extraction techniques targeting specific tasks
• Extracting information about terrorist events from news
– Information retrieval techniques --> more like NLP