Luís Sarmento Universidade do Porto (NIAD&R) and Linguateca [email protected]
Setup
* 2.8 GHz Pentium IV
* 2 GB RAM
* 160 GB IDE HD
* Fedora Core 2
* Perl 5.6
* MySQL 5.0.15
* DBI + DBD-Mysql
Optimize Queries…
* Text at sentence level: QA, definition extraction
* 1-4 word window contexts: finding MWEs, collocations
* Word co-occurrence data: WSD, context clustering
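The three query granularities above can be sketched in a few lines. This is an illustrative Python stand-in (the system itself is Perl + MySQL), and the default window size of 4 is an assumption mirroring the 1-4 word contexts on the slide:

```python
from collections import Counter

def window_ngrams(tokens, n):
    """All contiguous n-word windows (n-grams) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cooccurrence_pairs(tokens, window=4):
    """Unordered word pairs co-occurring within `window` tokens of
    each other (window=4 is an assumption, not BACO's documented value)."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + 1 + window]:
            pairs[tuple(sorted((w, v)))] += 1
    return pairs

tokens = "uma base de dados de texto".split()
print(window_ngrams(tokens, 2))          # bigram windows
print(cooccurrence_pairs(tokens).most_common(3))
```

Sentence-level text queries then operate on whole sentences, window n-grams feed MWE/collocation finding, and the pair counts feed WSD and context clustering.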
Global Motivation
* Obtain fast text query methods for a variety of “data-driven” NLP techniques
* Develop practical methods for querying current gigabyte corpora (web collections…)
* Experiment with scalable methods for querying the next generation of terabyte corpora
Statistics

Table          # tuples (millions)   Table size (GB)   Index size (GB)
Metadata       1.529                 0.2               0.05
Sentences      35.575                6.55              5.90
dictionary     6.834                 0.18              0.27
2-grams        54.610                1.50              0.92
3-grams        173.608               5.43              2.97
4-grams        293.130               10.40             6.35
co-occurrence  761.044               20.10             7.56
BACO total     -                     44.4              ~24
BACO: A large database of text and co-occurrences
Some Practical Problems
* How to compile lists of n-grams (2,3,4…) in a 1B word collection?
* How to obtain co-occurrence info for all pairs of words in a 1B word collection?
* Which data structures are best (and easily available in Perl)? Hash tables? Trees? Others (Judy arrays? T-trees?)
* How should all this data be stored and indexed in a standard RDBMS?
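One standard answer to the first two questions, and the one the Stage 2 diagram hints at ("temp files are sorted"), is external sort-based counting: n-grams never all sit in memory at once. A minimal Python sketch of that technique (illustrative, not BACO's actual Perl implementation):

```python
import heapq
import tempfile
from itertools import groupby

def count_ngrams_external(sentences, n=2, batch_size=100000):
    """Count n-grams without holding them all in memory: spill sorted
    batches to temp files, then k-way merge the sorted runs and count
    identical neighbours (classic external sort-and-count)."""
    runs, batch = [], []

    def spill():
        if batch:
            f = tempfile.TemporaryFile(mode="w+")
            f.writelines(g + "\n" for g in sorted(batch))
            f.seek(0)
            runs.append(f)
            batch.clear()

    for sent in sentences:
        toks = sent.split()
        batch.extend(" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1))
        if len(batch) >= batch_size:
            spill()
    spill()

    # Each run is sorted, so merging yields a globally sorted stream;
    # equal neighbours collapse into counts.
    merged = heapq.merge(*runs)
    counts = {g.strip(): sum(1 for _ in grp) for g, grp in groupby(merged)}
    for f in runs:
        f.close()
    return counts
```

The same spill-sort-merge pattern extends to co-occurrence pairs; only the emitted keys change.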
Some conclusions
* RDBMSs are a good alternative for querying gigabyte text collections for NLP purposes
* Complex data pre-processing, data modeling and system tuning may be required
* The current implementation deals with raw text, but the models may be extended to annotated corpora
* Query speed depends on internal details of MySQL's indexing mechanism
* Current performance may be improved by a more efficient database schema and by parallelization
Current Deliverables
* MySQL-encoded database of text, n-grams and co-occurrence pair information
* A Perl module for easily querying BACO instances
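To illustrate what querying such a database looks like, here is a minimal Python/sqlite3 stand-in. The actual deliverable is a Perl module over DBI/DBD-Mysql, and the table and column names below are illustrative, not BACO's real schema:

```python
import sqlite3

# In-memory stand-in for one BACO-style bigram table (hypothetical
# schema: word1, word2, frequency).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ngrams2 (w1 TEXT, w2 TEXT, freq INTEGER)")
db.executemany("INSERT INTO ngrams2 VALUES (?, ?, ?)",
               [("base", "de", 120), ("de", "dados", 95)])
db.execute("CREATE INDEX idx_w1_w2 ON ngrams2 (w1, w2)")  # index-backed lookups

def bigram_freq(w1, w2):
    """Frequency of the bigram (w1, w2), or 0 if unseen."""
    row = db.execute("SELECT freq FROM ngrams2 WHERE w1 = ? AND w2 = ?",
                     (w1, w2)).fetchone()
    return row[0] if row else 0

print(bigram_freq("base", "de"))
```

Because the lookup hits a composite index on (w1, w2), query time stays largely independent of table size, which is the point of pushing the counts into an RDBMS in the first place.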
Stage 1: Data preparation and loading

WPT03 (12 GB) → duplicate removal (by Nuno Seco, [email protected]) → 6 GB, 1.5M docs → sentence splitting + document metadata → tabular format → load data → index data → indexed database (metadata + text sentences)
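The "sentence splitting → tabular format" step of Stage 1 can be sketched as follows. The punctuation-based splitter and the `doc_id<TAB>sentence_id<TAB>sentence` row layout are both assumptions for illustration, not the pipeline's actual rules:

```python
import re

def to_tabular(doc_id, text):
    """Split a document into sentences and emit tab-separated rows
    suitable for bulk loading (e.g. MySQL's LOAD DATA INFILE).
    The naive end-of-sentence heuristic splits after ., ! or ?."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    return ["%d\t%d\t%s" % (doc_id, i, s) for i, s in enumerate(sentences)]

for row in to_tabular(1, "Primeira frase. Segunda frase! Terceira?"):
    print(row)
```

Writing such rows to flat files and bulk-loading them is typically far faster than row-by-row INSERTs, which matters at the 35M-sentence scale reported in the statistics table.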
Stage 2: compiling dictionary + 2,3,4-grams + co-occurrence pairs

* DIC + 2-GRAMS: single pass; 13 iterations, disjoint division based on number of chars
* 3-GRAMS, 4-GRAMS + CO-OC PAIRS: multiple iterations, N documents per iteration; temp files are sorted
* load data → index data → BACO

Final tables: metadata, text sentences, dictionary, 2,3,4-grams, co-occurrence pairs
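The "disjoint division based on number of chars" in Stage 2 can be read as partitioning the vocabulary by word length, so each of the 13 iterations builds its slice of the dictionary independently. A sketch under that reading (the slide does not give BACO's actual bucket boundaries; capping long words into the last bucket is an assumption):

```python
def partition_by_length(words, n_parts=13):
    """Disjoint division of a vocabulary into n_parts buckets by word
    length: words of length 1 go to bucket 0, length 2 to bucket 1,
    and anything of length >= n_parts is capped into the last bucket.
    Buckets are disjoint and jointly cover the whole vocabulary."""
    parts = [set() for _ in range(n_parts)]
    for w in words:
        parts[min(len(w), n_parts) - 1].add(w)
    return parts

parts = partition_by_length(["a", "de", "dados", "co-ocorrencias"])
print([len(p) for p in parts])
```

Because the buckets are disjoint, each iteration's partial counts can be loaded into the database without any cross-iteration merging.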
Linguateca
* Improving processing and research on the Portuguese language
* Fostering collaboration among researchers
* Providing public and free-of-charge tools and resources to the community
http://www.linguateca.pt
WPT03 - A public resource
* WPT03 is a resource built by the XLDB Group (xldb.di.fc.ul.pt) and distributed by Linguateca (www.linguateca.pt)
* 12 GB, 3.7M web documents and ~1.6B words
* Obtained from the Portuguese web search engine TUMBA! http://www.tumba.pt
NIAD&R
* A research group started in 1998 as part of LIACC (the AI lab) at Universidade do Porto
* Research topics: Multi-Agent Systems, E-business Technology, Machine Learning, Robotics, Ontologies
http://www.fe.up.pt/~eol/
BACO: BAse de Co-Ocorrências (Base of Co-occurrences)