![Page 1: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/1.jpg)
Mining the Web to Create Minority Language Corpora
Rayid Ghani, Accenture Technology Labs - Research
Rosie Jones, Carnegie Mellon University
Dunja Mladenic, J. Stefan Institute, Slovenia
![Page 2: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/2.jpg)
Who Needs a Language Specific Corpus?
Language Technology Applications
  Language Modeling
  Speech Recognition
  Machine Translation
Linguistic and Socio-Linguistic Studies
Multilingual Retrieval
![Page 3: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/3.jpg)
What Corpora are Available?
Explicit, marked-up corpora: Linguistic Data Consortium -- 20 languages [Liberman and Cieri 1998]
Search Engines -- implicit language-specific corpora, European languages, Chinese and Japanese
  Excite - 12 languages
  Google - 25 languages
  AltaVista - 25 languages
  Lycos - 25 languages
![Page 4: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/4.jpg)
BUT what about Slovenian? Or Tagalog? Or Tatar?
You’re just out of luck!
![Page 5: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/5.jpg)
The Human Solution
Start from Yahoo->Slovenia…
Crawl www.*.si
Search on the web, look at documents, modify query, analyze documents, modify query,…
Repetitive, time-consuming, requires reasonable familiarity with the language
![Page 6: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/6.jpg)
Task
Given:
  1 Document in Target Language
  1 Other Document (negative example)
  Access to a Web Search Engine
Create a Corpus of the Target Language quickly with no human effort
![Page 7: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/7.jpg)
Algorithm
[Diagram with components: Seed Docs, Query Generator, WWW, Language Filter]
![Page 8: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/8.jpg)
[Diagram with components: Initial Docs, Word Statistics, Build Query, Web, Filter, Relevant, Non-Relevant, Learning]
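The components named in the diagram (initial docs, word statistics, build query, web, filter, relevant and non-relevant sets, learning) suggest an iterative loop. The following is only a rough sketch of such a loop, not the authors' system: `build_corpus`, `search`, `is_target_language`, `select_terms`, and the toy in-memory "web" are all hypothetical names and stand-ins introduced for illustration.

```python
from collections import Counter

def build_corpus(seed_relevant, seed_nonrelevant, search, is_target_language,
                 select_terms, max_queries=10):
    """Hypothetical loop: statistics -> query -> web -> filter -> update."""
    relevant = [seed_relevant]
    nonrelevant = [seed_nonrelevant]
    seen_docs = {seed_relevant, seed_nonrelevant}
    seen_queries = set()
    for _ in range(max_queries):
        # "Word Statistics": word counts over the current document sets.
        rel_counts = Counter(w for d in relevant for w in d.split())
        non_counts = Counter(w for d in nonrelevant for w in d.split())
        # "Build Query": inclusion and exclusion terms.
        include, exclude = select_terms(rel_counts, non_counts)
        query = (tuple(include), tuple(exclude))
        if query in seen_queries:
            break  # a deterministic selector would only repeat itself
        seen_queries.add(query)
        # "Web" + "Filter": fetch documents and sort them into the two sets.
        for doc in search(include, exclude):
            if doc in seen_docs:
                continue
            seen_docs.add(doc)
            if is_target_language(doc):
                relevant.append(doc)
            else:
                nonrelevant.append(doc)
    return relevant, nonrelevant, len(seen_queries)

# Toy stand-ins (assumptions): a four-document "web" and a crude filter.
TOY_WEB = ["dober dan svet", "dober večer", "good day world", "lep dan"]

def toy_search(include, exclude):
    return [d for d in TOY_WEB
            if all(w in d.split() for w in include)
            and not any(w in d.split() for w in exclude)]

def toy_filter(doc):
    # Stand-in for a real language filter: reject known English words.
    return not any(w in ("good", "day", "world") for w in doc.split())

def toy_select(rel_counts, non_counts):
    # One inclusion term (most frequent relevant-only word), one exclusion term.
    include = [w for w, _ in rel_counts.most_common() if w not in non_counts][:1]
    exclude = [w for w, _ in non_counts.most_common(1)]
    return include, exclude

relevant, nonrelevant, n_queries = build_corpus(
    "dober dan", "good day", toy_search, toy_filter, toy_select)
```

With a deterministic selector the toy loop stops as soon as a query repeats; a randomized selector would keep producing new queries until `max_queries` is reached.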
![Page 9: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/9.jpg)
Query Generation
Examine current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to non-relevant ones
A Query consists of m inclusion terms and n exclusion terms, e.g. +intelligence +web -military
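As a minimal sketch of the query format in the example above, the function below joins inclusion and exclusion terms into one query string. The name `format_query` is hypothetical, and the `+`/`-` prefix syntax (with an ASCII hyphen) is assumed from the slide's example rather than taken from any particular search engine's documentation.

```python
def format_query(inclusion_terms, exclusion_terms):
    """Build a '+term ... -term ...' query string (assumed syntax)."""
    parts = ["+" + t for t in inclusion_terms]
    parts += ["-" + t for t in exclusion_terms]
    return " ".join(parts)

query = format_query(["intelligence", "web"], ["military"])
```

Here `query` holds the string from the slide's example, with m = 2 inclusion terms and n = 1 exclusion term.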
![Page 10: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/10.jpg)
Query Term Selection Methods
Uniform (UN) – select k words randomly from the current vocabulary
Term-Frequency (TF) – select top k words ranked according to their frequency
Probabilistic TF (PTF) – select k words with probability proportional to their frequency
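The three frequency-based methods above might be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, words are drawn without replacement, and `counts` is a `collections.Counter` of word frequencies in the current documents.

```python
import random
from collections import Counter

def select_uniform(counts, k, rng):
    """UN: k words drawn uniformly at random from the vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))

def select_tf(counts, k):
    """TF: the k most frequent words."""
    return [w for w, _ in counts.most_common(k)]

def select_ptf(counts, k, rng):
    """PTF: k words sampled with probability proportional to frequency.

    Sampling without replacement is an assumption of this sketch.
    """
    remaining = dict(counts)
    chosen = []
    while remaining and len(chosen) < k:
        words = list(remaining)
        weights = [remaining[w] for w in words]
        pick = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen

counts = Counter("a a a b b c".split())
rng = random.Random(0)  # fixed seed so runs are repeatable
tf_terms = select_tf(counts, 2)
un_terms = select_uniform(counts, 2, rng)
ptf_terms = select_ptf(counts, 3, rng)
```

TF always returns the same terms for the same statistics, while UN and PTF produce varying queries across iterations.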
![Page 11: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/11.jpg)
Query Term Selection Methods
RTFIDF – select top k words according to their RTFIDF scores
Odds-Ratio (OR) – top k words according to their odds-ratio scores
Probabilistic OR (POR) – select k words with probability proportional to their Odds-Ratio scores
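The slides do not spell out the odds-ratio formula. The sketch below assumes a common log-odds form, log[P(w|rel)(1 - P(w|non)) / ((1 - P(w|rel)) P(w|non))], with probabilities estimated from document frequencies and a small smoothing constant `eps`. The estimator, the smoothing, and the function names are all assumptions made for illustration.

```python
import math

def doc_frequencies(docs):
    """Count, for each word, the number of documents containing it."""
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return df

def odds_ratio_scores(relevant_docs, nonrelevant_docs, eps=0.01):
    """Log odds-ratio per word; each doc is a list of tokens (assumed form)."""
    df_rel = doc_frequencies(relevant_docs)
    df_non = doc_frequencies(nonrelevant_docs)
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    scores = {}
    for w in set(df_rel) | set(df_non):
        # Smoothed estimates stay strictly between 0 and 1.
        p_rel = (df_rel.get(w, 0) + eps) / (n_rel + 2 * eps)
        p_non = (df_non.get(w, 0) + eps) / (n_non + 2 * eps)
        scores[w] = math.log(p_rel * (1 - p_non) / ((1 - p_rel) * p_non))
    return scores

def top_k(scores, k):
    """OR: the k highest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

relevant_docs = [["dober", "dan", "web"], ["dober", "večer"]]
nonrelevant_docs = [["good", "day", "web"]]
scores = odds_ratio_scores(relevant_docs, nonrelevant_docs)
best = top_k(scores, 1)
```

Under this form, words typical of the relevant documents score positive and words typical of the non-relevant documents score negative, so the lowest-scoring words are natural candidates for exclusion terms.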
![Page 12: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/12.jpg)
Evaluation
Goal: Collect as many relevant documents as possible while minimizing the cost
Cost
  Number of total documents retrieved from the Web
  Number of distinct Queries issued to the Search Engine
Evaluation Measures
  Percentage of retrieved documents that are relevant
  Number of relevant documents retrieved per unique query
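The two evaluation measures reduce to simple ratios. A small sketch follows; the function names and the sample numbers are hypothetical, not results from the talk.

```python
def precision_pct(relevant_retrieved, total_retrieved):
    """Percentage of retrieved documents that are relevant."""
    if total_retrieved == 0:
        return 0.0
    return 100.0 * relevant_retrieved / total_retrieved

def relevant_per_query(relevant_retrieved, unique_queries):
    """Number of relevant documents retrieved per unique query."""
    if unique_queries == 0:
        return 0.0
    return relevant_retrieved / unique_queries

# Made-up counts, for illustration only.
example_precision = precision_pct(1500, 3000)
example_per_query = relevant_per_query(30, 20)
```

The first measure tracks the cost in retrieved documents, the second the cost in queries issued to the search engine.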
![Page 13: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/13.jpg)
Experimental Setup
Language: Slovenian
Initial documents: 1 web page in Slovenian, 1 in English
Search engine: AltaVista
![Page 14: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/14.jpg)
Results
![Page 15: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/15.jpg)
Results – Precision at 3000
[Chart: Percentage of Target Docs after 3000 Docs Retrieved (axis 0 to 100); query lengths: Length=1, Length=3, Length=5, Length=10; methods: OR, POR, TF, PTF, UN]
![Page 16: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/16.jpg)
Results – Docs Per Query
[Chart: axis 0 to 1.8; query lengths: Length=1, Length=3, Length=5, Length=10; methods: OR, POR, TF, PTF, UN]
![Page 17: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/17.jpg)
Results - Summary
In terms of documents: For lengths 1-3, Odds-Ratio works best
In terms of queries: Odds-Ratio is consistently better than others
Long queries are usually very precise but do not result in a lot of documents (low recall)
![Page 18: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/18.jpg)
Further Experiments
Comparison to AltaVista’s “More Like This”
  Better performance than AltaVista’s feature
Keywords
  Similar results when initializing with keywords instead of documents
Other Languages
  Similar results with Croatian, Czech and Tagalog
![Page 19: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/19.jpg)
Conclusions
Successfully able to build corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines
Not sensitive to initial “seed” documents
System and Corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder
![Page 20: Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.](https://reader036.fdocuments.net/reader036/viewer/2022081519/56649ebe5503460f94bc8ed6/html5/thumbnails/20.jpg)
Ideas for Future Work
Explore other Term-Selection methods
From Language specific corpus to Topic Specific corpus as an alternative to focused spidering
Finding documents matching a user profile – Personal Agent