Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
-
Upload
branden-norman -
Category
Documents
-
view
215 -
download
0
Transcript of Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
![Page 1: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/1.jpg)
Collocations and Information Management Applications
Gregor ErbachSaarland University
Saarbrücken
![Page 2: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/2.jpg)
Outline
• Information Management Applications
• Information Retrieval Techniques
• Categorization, Clustering
• Summarization
• Information Extraction
• Question Answering
• Points for discussion
![Page 3: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/3.jpg)
Well-known Applications of Collocations
• Lexicography
• Machine Translation
• NL Generation
• NL Parsing
• Terminology Extraction
• Foreign Language Teaching
• Speech Recognition
![Page 4: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/4.jpg)
Information Management Applications
• Information Retrieval• Text Categorisation
(by language, topic, author, genre ...)
• Clustering
• Summarisation / Keyword Extraction
• Information Extraction
• Question Answering
![Page 5: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/5.jpg)
Information Retrieval
• Most IR systems don't retrieve information, but documents
• Boolean retrieval: an unordered set of documents are returned as result for a query
• Ranked retrieval: an ordered list of documents is returned; relevance of documents is determined by matching with a query
![Page 6: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/6.jpg)
IR System Model
Documents
RQ
Matching
[0, 1]
Query
Representation RD
![Page 7: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/7.jpg)
Query Languages
• Co-occurence within document information AN D retrieval
• Negation information AND (NOT retrieval)
• Multi-word expression "information retrieval"
• Proximity operators information NEAR retrieval
![Page 8: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/8.jpg)
Evaluation
• Precision
• Recall
• Precision/recall graphs
• 11 point average precision
• TREC (Text Retrieval Conference)
• TREC Tasks: ad-hoc, web, spoken documents, multimedia, cross-language ...
![Page 9: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/9.jpg)
Relevance
• Relevance is matching of a document with an information need expressed through a query
• Relevance is considered as binary and determined by human assessors for document-query pairs
• Relevance is modelled by a similarity measure that compares query and document representations
![Page 10: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/10.jpg)
Document and Query Representations in IR
• Documents and queries are generally represented as a vector of terms weights
• Documents are treated as bags of words
• Preprocessing: stemming or morphological analysis
• POS, chunking, syntax did not improve information retrieval performance
![Page 11: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/11.jpg)
Similarity Measures
• Term weighting: TF , TF/ICF, TF/IDF
• Similarity measures determine how close two documents are, or how alike a document and a query are
• A common similarity measure is the cosine of the angles between the vector representations
![Page 12: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/12.jpg)
Cosine Similarity
term2
term1
4 8
2
6
![Page 13: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/13.jpg)
Result Ranking
• Adjacency or proximity of search terms can be taken into account in ranking of retrieval results
• This accounts for phrases and collocations
• Search terms occurring near each other (e.g. within a paragraph) are more likely to be related than search term occurring in different parts of a document
![Page 14: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/14.jpg)
Latent Semantic Indexing
• LSI: Singular value decomposition, dimensionality reduction
• LSI associates terms that share the same context, i.e. can be substituted
• Applications: information retrieval, cross-language IR, language learning, text categorisation, vocabulary tests
![Page 15: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/15.jpg)
Query Expansion
• Query expansion with related terms (e.g. from WordNet, thesarus)
• Relevance Feedback: Query expansion with terms from relevant document
• Blind Relevance Feedback: Query expansion with terms from top-ranking document. Expansion with co-occurring terms improves precision/recall.
![Page 16: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/16.jpg)
Language Models for IR
• Language models generate queries from documents
• Estimate probability that a given query was generated by a particular document
• Uni-gram language models
• (special case of probabilistic IR)
![Page 17: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/17.jpg)
Cross-language IR
• Methods: document translation, query translation, parallel/comparable corpora
![Page 18: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/18.jpg)
Document Categorization
• Similar techniques to IR (document representation, similarity measures)
• Document base contains categorized documents• New document as query which retrieves the
best matching documents from database• Support Vector Machines achieve
very good performance on various text categorization tasks
![Page 19: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/19.jpg)
Document Clustering
• Similar techniques to IR (document representation, similarity measures)
• Each cluster is represented by a centroid
• Iterative hierarchical grouping of similar documents
![Page 20: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/20.jpg)
Summarization
• Two approaches: Extraction (of sentencs or keywords) and abstraction (summary generation)
• Indicative vs. informative summaries
• Query-independent vs. query-biased summaries
• Evaluation criteria: informativeness, coherence
![Page 21: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/21.jpg)
Information Extraction
• Tasks: named entity extraction, coreference, template extraction
• named entities: person, organisation, location, time, date, money, percentage
• methods: finite-state grammars, finite-state transducers
• evaluation: precision, recall, f-measure
![Page 22: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/22.jpg)
Question Answering
• Answer extraction (passage retrieval) vs. Information extraction + answer generation
• Combination of IR-based and NLP-based approaches (semantic concepts, dependency relations).
• TREC open domain QA evaluation: extract 50-word passage containing the answer to a factual question
![Page 23: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/23.jpg)
Co-location spaces
• Linear (speech, text)
• document (as bag of words)
• hierarchical structure (tree, dependency relations)
• semantic/conceptual space (e.g. WordNet)
• cyberspace (hyperlinks)
![Page 24: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/24.jpg)
Collocations and IM
Collocations are
• multi-word units
• with statistical associations
• with restricted semantic compositionality
Are they useful for information management applications?
![Page 25: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/25.jpg)
Collocations and Document Representations
• Common representations treat terms in the document and query as independent
• Collocations research shows that they are not independent
• Implications?
![Page 26: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/26.jpg)
Collocations / Query Formulation
• Use of collocations for query expansion?(e.g. collocation, corpus, association ...
vs. collocation, facility, service, server, hosting ...)
• Automatic or interactive?
![Page 27: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/27.jpg)
Collocations / Categorisation & Clustering
• How much can category-specific collocations improve performance?
• Collocations for identification of genre, author, dialect?
![Page 28: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/28.jpg)
Collocations / IE
• IE techniques (finite-state shallow parsing) for collocation identification
• Use of collocation in IE grammars (Gewinn machen, Umsatz erzielen ...)
![Page 29: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/29.jpg)
Collocations / QA
• Use of collocations for finding answers (e.g. function-proper_name)
![Page 30: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/30.jpg)
Collocations and Summarisation
• Keyword / key phrase extraction
• Evaluation of coherence: which association measures can be used?
![Page 31: Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.](https://reader031.fdocuments.net/reader031/viewer/2022020307/56649f175503460f94c2e86d/html5/thumbnails/31.jpg)
Questions for Discussion
• Are collocations a useful level of representation for indexing and retrieval?
• Or are they only useful in establishing semantic representations?