Luís Sarmento Universidade do Porto (NIAD&R) and Linguateca [email protected]
Setup
* 2.8 GHz Pentium IV
* 2 GB RAM
* 160 GB IDE HD
* Fedora Core 2
* Perl 5.6
* MySQL 5.0.15
* DBI + DBD-Mysql
Optimize Queries…
* Text at sentence level: QA, definition extraction
* 1-4 word window contexts: finding MWEs, collocations
* Word co-occurrence data: WSD, context clustering
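The three query granularities above can be sketched in a few lines. This is an illustrative Python stand-in (the system itself is Perl + MySQL), and the default window size of 4 is an assumption mirroring the 1-4 word contexts on the slide:

```python
from collections import Counter

def window_ngrams(tokens, n):
    """All contiguous n-word windows (n-grams) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cooccurrence_pairs(tokens, window=4):
    """Unordered word pairs co-occurring within `window` tokens of
    each other (window=4 is an assumption, not BACO's documented value)."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + 1 + window]:
            pairs[tuple(sorted((w, v)))] += 1
    return pairs

tokens = "uma base de dados de texto".split()
print(window_ngrams(tokens, 2))          # bigram windows
print(cooccurrence_pairs(tokens).most_common(3))
```

Sentence-level text queries then operate on whole sentences, window n-grams feed MWE/collocation finding, and the pair counts feed WSD and context clustering.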
Global Motivation
* Obtain fast text query methods for a variety of “data-driven” NLP techniques
* Develop practical methods for querying current gigabyte corpora (web collections…)
* Experiment with scalable methods for querying the next generation of terabyte corpora
Statistics

Table          # tuples (millions)   Table size (GB)   Index size (GB)
Metadata       1.529                 0.2               0.05
Sentences      35.575                6.55              5.90
dictionary     6.834                 0.18              0.27
2-grams        54.610                1.50              0.92
3-grams        173.608               5.43              2.97
4-grams        293.130               10.40             6.35
co-occurrence  761.044               20.10             7.56
BACO total     -                     44.4              ~24
BACO: A large database of text and co-occurrences
Some Practical Problems
* How to compile lists of n-grams (2,3,4…) in a 1B word collection?
* How to obtain co-occurrence info for all pairs of words in a 1B word collection?
* Which data structures are best (and easily available in Perl)? Hash tables? Trees? Others (Judy arrays? T-trees?)
* How should all this data be stored and indexed in a standard RDBMS?
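One standard answer to the first two questions, and the one the Stage 2 diagram hints at ("temp files are sorted"), is external sort-based counting: n-grams never all sit in memory at once. A minimal Python sketch of that technique (illustrative, not BACO's actual Perl implementation):

```python
import heapq
import tempfile
from itertools import groupby

def count_ngrams_external(sentences, n=2, batch_size=100000):
    """Count n-grams without holding them all in memory: spill sorted
    batches to temp files, then k-way merge the sorted runs and count
    identical neighbours (classic external sort-and-count)."""
    runs, batch = [], []

    def spill():
        if batch:
            f = tempfile.TemporaryFile(mode="w+")
            f.writelines(g + "\n" for g in sorted(batch))
            f.seek(0)
            runs.append(f)
            batch.clear()

    for sent in sentences:
        toks = sent.split()
        batch.extend(" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1))
        if len(batch) >= batch_size:
            spill()
    spill()

    # Each run is sorted, so merging yields a globally sorted stream;
    # equal neighbours collapse into counts.
    merged = heapq.merge(*runs)
    counts = {g.strip(): sum(1 for _ in grp) for g, grp in groupby(merged)}
    for f in runs:
        f.close()
    return counts
```

The same spill-sort-merge pattern extends to co-occurrence pairs; only the emitted keys change.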
Some conclusions
* RDBMSs are a good alternative for querying gigabyte text collections for NLP purposes
* Complex data pre-processing, data modeling and system tuning may be required
* The current implementation deals with raw text, but the models may be extended to annotated corpora
* Query speed depends on internal details of MySQL's indexing mechanism
* Current performance may be improved by a more efficient database schema and by parallelization
Current Deliverables
* MySQL-encoded database of text, n-grams and co-occurrence pair information
* A Perl module for easily querying BACO instances
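To illustrate what querying such a database looks like, here is a minimal Python/sqlite3 stand-in. The actual deliverable is a Perl module over DBI/DBD-Mysql, and the table and column names below are illustrative, not BACO's real schema:

```python
import sqlite3

# In-memory stand-in for one BACO-style bigram table (hypothetical
# schema: word1, word2, frequency).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ngrams2 (w1 TEXT, w2 TEXT, freq INTEGER)")
db.executemany("INSERT INTO ngrams2 VALUES (?, ?, ?)",
               [("base", "de", 120), ("de", "dados", 95)])
db.execute("CREATE INDEX idx_w1_w2 ON ngrams2 (w1, w2)")  # index-backed lookups

def bigram_freq(w1, w2):
    """Frequency of the bigram (w1, w2), or 0 if unseen."""
    row = db.execute("SELECT freq FROM ngrams2 WHERE w1 = ? AND w2 = ?",
                     (w1, w2)).fetchone()
    return row[0] if row else 0

print(bigram_freq("base", "de"))
```

Because the lookup hits a composite index on (w1, w2), query time stays largely independent of table size, which is the point of pushing the counts into an RDBMS in the first place.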
Stage 1: Data preparation and loading

WPT03 (12 GB) → duplicate removal (by Nuno Seco, [email protected]) → 6 GB, 1.5M docs → sentence splitting + document metadata → tabular format → load data → index data → indexed database (metadata + text sentences)
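The "sentence splitting → tabular format" step of Stage 1 can be sketched as follows. The punctuation-based splitter and the `doc_id<TAB>sentence_id<TAB>sentence` row layout are both assumptions for illustration, not the pipeline's actual rules:

```python
import re

def to_tabular(doc_id, text):
    """Split a document into sentences and emit tab-separated rows
    suitable for bulk loading (e.g. MySQL's LOAD DATA INFILE).
    The naive end-of-sentence heuristic splits after ., ! or ?."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    return ["%d\t%d\t%s" % (doc_id, i, s) for i, s in enumerate(sentences)]

for row in to_tabular(1, "Primeira frase. Segunda frase! Terceira?"):
    print(row)
```

Writing such rows to flat files and bulk-loading them is typically far faster than row-by-row INSERTs, which matters at the 35M-sentence scale reported in the statistics table.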
Stage 2: compiling dictionary + 2,3,4-grams + co-occurrence pairs

* DIC + 2-GRAMS: single pass; 13 iterations, disjoint division based on number of chars
* 3-GRAMS, 4-GRAMS + CO-OC PAIRS: multiple iterations, N documents per iteration; temp files are sorted
* load data → index data → BACO

Final tables: metadata, text sentences, dictionary, 2,3,4-grams, co-occurrence pairs
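The "disjoint division based on number of chars" in Stage 2 can be read as partitioning the vocabulary by word length, so each of the 13 iterations builds its slice of the dictionary independently. A sketch under that reading (the slide does not give BACO's actual bucket boundaries; capping long words into the last bucket is an assumption):

```python
def partition_by_length(words, n_parts=13):
    """Disjoint division of a vocabulary into n_parts buckets by word
    length: words of length 1 go to bucket 0, length 2 to bucket 1,
    and anything of length >= n_parts is capped into the last bucket.
    Buckets are disjoint and jointly cover the whole vocabulary."""
    parts = [set() for _ in range(n_parts)]
    for w in words:
        parts[min(len(w), n_parts) - 1].add(w)
    return parts

parts = partition_by_length(["a", "de", "dados", "co-ocorrencias"])
print([len(p) for p in parts])
```

Because the buckets are disjoint, each iteration's partial counts can be loaded into the database without any cross-iteration merging.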
Linguateca
* Improving processing and research on the Portuguese language
* Fostering collaboration among researchers
* Providing public and free-of-charge tools and resources to the community
http://www.linguateca.pt
WPT03 - A public resource
* WPT03 is a resource built by the XLDB Group (xldb.di.fc.ul.pt) and distributed by Linguateca (www.linguateca.pt)
* 12 GB, 3.7M web documents and ~1.6B words
* Obtained from the Portuguese web search engine TUMBA! http://www.tumba.pt
NIAD&R
* A research group started in 1998 as part of LIACC (the AI lab) at Universidade do Porto
* Research topics: Multi-Agent Systems, E-business Technology, Machine Learning, Robotics, Ontologies
http://www.fe.up.pt/~eol/
BACO: BAse de Co-Ocorrências (Base of Co-occurrences)