Aspects of NLP Practice

Practical Aspects of NLP Work

Vsevolod Dyomkin, Grammarly

TAAC'2012, Kyiv, Ukraine

Topics

* Practical vs Theoretical NLP work
* Working with Data for NLP
* NLP Tools

A bit about Grammarly

(c) xkcd

An example of what we deal with

Research vs Development

“Trick for productionizing research: read current 3-5 pubs and note the stupid simple thing they all claim to beat, implement that.”

--Jay Kreps, https://twitter.com/jaykreps/status/219977241839411200

NLP practice

R - research work: set a goal → devise an algorithm → train the algorithm → test its accuracy

D - development work: implement the algorithm as an API with sufficient performance and scaling characteristics

Research

1. Set a goal

Business goal:

* Develop a spellchecker that is the best / good enough / better than Word's / etc.

* Develop a set of grammar rules that will catch errors according to MLA Style

* Develop a thesaurus that will produce synonyms relevant to context

Translate it to a measurable goal:

* On a test corpus of 10000 sentences with common errors, achieve a smaller number of false negatives (FNs) and false positives (FPs) than other spellcheckers/Word spellchecker/etc.

* On a corpus of example sentences with each kind of error (and similar sentences without this kind of error), find all sentences with errors and flag no errors in correct sentences

* On a test corpus of 1000 sentences suggest synonyms for all meaningful words that will be considered relevant by human linguists in 90% of the cases
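A measurable goal of this kind comes down to counting outcomes on a labeled test corpus. The following is a minimal sketch, not the method used at Grammarly: `Counts`, `evaluate`, `toy_checker` and the four-sentence corpus are hypothetical names and data made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int = 0  # sentences with an error that were flagged
    fp: int = 0  # correct sentences that were flagged anyway
    fn: int = 0  # sentences with an error that were missed
    tn: int = 0  # correct sentences that were left alone

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        actual = self.tp + self.fn
        return self.tp / actual if actual else 0.0


def evaluate(checker, corpus):
    """corpus: iterable of (sentence, has_error); checker: sentence -> bool."""
    counts = Counts()
    for sentence, has_error in corpus:
        flagged = checker(sentence)
        if flagged and has_error:
            counts.tp += 1
        elif flagged:
            counts.fp += 1
        elif has_error:
            counts.fn += 1
        else:
            counts.tn += 1
    return counts


def toy_checker(sentence: str) -> bool:
    # Stand-in for a real checker: flags only the typo "teh".
    return "teh" in sentence.lower().split()


toy_corpus = [
    ("I saw teh cat.", True),
    ("I saw the cat.", False),
    ("Their going home.", True),  # an error the toy checker cannot see
    ("They are going home.", False),
]

result = evaluate(toy_checker, toy_corpus)
print(result, result.precision, result.recall)
```

Running the same `evaluate` over a competing checker on the same corpus gives the FN/FP comparison that the goal asks for.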

Research

1. Set a goal
2. Devise an algorithm
3. Train & improve the algorithm

http://nlp-class.org

4. Test its performance

ML: one corpus, divided into training, development, test

Often different corpora:

* for training some part of the algorithm
* for testing the whole system
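The single-corpus ML set-up can be sketched as a shuffled three-way split. This is only an illustration: `split_corpus`, the 10% fractions and the fixed seed are assumptions, not figures from the talk.

```python
import random


def split_corpus(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train/dev/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


sentences = [f"sentence {i}" for i in range(100)]
train, dev, test = split_corpus(sentences)
print(len(train), len(dev), len(test))
```

The development part is for tuning while the algorithm is being improved; the test part should be looked at only for the final measurement.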

Theoretical maxima

Theoretical maxima are rarely achievable. Why?

* because you need their data

* domains might differ

Pre/post-processing

What ultimately matters is not crude performance, but...


Acceptance by users (much harder to measure & depends on domain).


The real world is messier than any lab set-up.

Examples of pre-processing

For spellcheck:

* some people tend to write words separated by slashes, like: spell/grammar check

* handling of abbreviations
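Both cases above can be handled before the text ever reaches the spellchecker. A minimal sketch, assuming a hypothetical `tokens_for_spellcheck` helper and a tiny made-up abbreviation list (a real system would need a far larger one):

```python
# Hypothetical abbreviation list, for illustration only.
KNOWN_ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "NLP"}


def tokens_for_spellcheck(text):
    """Yield only the word forms that should be passed to the spellchecker."""
    for chunk in text.split():
        # Known abbreviations are not checked at all.
        if chunk.rstrip(",;:") in KNOWN_ABBREVIATIONS:
            continue
        # "spell/grammar" is two words joined by a slash, not one misspelling.
        for part in chunk.split("/"):
            word = part.strip(".,;:!?()\"'")
            if word:
                yield word


words = list(tokens_for_spellcheck("Run a spell/grammar check, e.g. on NLP papers."))
print(words)
```

Without this step a dictionary lookup would report "spell/grammar" and "e.g." as unknown words, which users would see as false positives.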

Data

“Data is the next Intel Inside.”

--Tim O'Reilly, What Is Web 2.0, http://oreilly.com/web2/archive/what-is-web-20.html?page=3

Categorization of Data

* Structured: small
* Semi-structured: medium
* Unstructured: big

Where to get data?

Well-known sources:

* Penn Treebank
* WordNet
* BNC
* Web1T Google N-gram Corpus
* Linguistic Data Consortium (http://www.ldc.upenn.edu/)

More data

Also well-known sources, but with a twist:

* Wikipedia & Wiktionary, DBpedia
* OpenWeb Common Crawl
* Public APIs of some services: Twitter, Wordnik

Academic resources

* Stanford
* CoNLL
* Oxford (http://www.ota.ox.ac.uk/)
* CMU, MIT,...
* LingPipe, OpenNLP, NLTK,...

Crowd-sourced data

Jonathan Zittrain, The Future of the Internet

http://goo.gl/hs4qB

And remember...

“Data is ten times more powerful than algorithms.”

--Peter Norvig, The Unreasonable Effectiveness of Data, http://youtu.be/yvDCzhbjYWs

Levels of NLP tools

High-level: user services

Middle-level: NLP algorithms

Low-level: data-crunching

Choosing a language

Requirement types:

* Research
* NLP-specific
* Production

Research requirements

* Interactivity
* Mathematical basis
* Expressiveness
* Agility, malleability
* Advanced tools

Specific NLP requirements

* Good support for statistics & number-crunching – Statistical AI

* Good support for working with trees & symbols – Symbolic AI
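The two requirements correspond to two different shapes of data. A small sketch in Python, with made-up helper names (`bigram_counts`, `leaves`) and a hand-written toy parse tree: counting n-grams stands in for the statistical side, walking a nested tree of symbols for the symbolic side.

```python
from collections import Counter


def bigram_counts(tokens):
    """Statistical side: count adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))


# Symbolic side: a parse tree as nested tuples, (label, child, ...);
# a leaf is a plain string.
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "cat")),
        ("VP", ("VBD", "sat")))


def leaves(node):
    """Collect the words of a tree, left to right."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    return [word for child in children for word in leaves(child)]


counts = bigram_counts(["the", "cat", "sat", "on", "the", "cat"])
print(counts[("the", "cat")], leaves(tree))
```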

Production requirements

* Scalability
* Maintainability
* Integrability
* ...

Choose Lisp

(c) xkcd

Lisp FTW

* Truly interactive environment
* Very flexible => DSLs
* Native tree support
* Fast and solid

- No OpenNLP/NLTK

Heterogeneous systems

“Java way” vs. “Unix way”

Create language-agnostic systems that can easily communicate!
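One way to read the "Unix way" here is small components that exchange plain text, for example one JSON object per line, so that each part can be written in a different language. The sketch below is an assumption about how that might look, not a description of any real system: `serve`, `tokenize` and the `id`/`text`/`tokens` field names are all invented for illustration.

```python
import io
import json


def tokenize(text):
    # Placeholder for a real NLP component.
    return text.split()


def serve(instream, outstream):
    """Read one JSON request per line, write one JSON response per line."""
    for line in instream:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)
        response = {"id": request.get("id"), "tokens": tokenize(request["text"])}
        outstream.write(json.dumps(response) + "\n")


# Demonstration with in-memory streams; in a pipeline these would be
# sys.stdin and sys.stdout, or a socket.
requests = io.StringIO('{"id": 1, "text": "Choose a good language"}\n')
responses = io.StringIO()
serve(requests, responses)
reply = json.loads(responses.getvalue())
print(reply)
```

Any client that can write and read lines of JSON can talk to such a component, whatever language it is implemented in.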

Take-aways

* As they say, in theory research and practice are the same, but in practice...

* Data is key. There are 3 types of it. Collect it, and build tools to work with it easily and efficiently

* Choose a good language for R&D: interactive & malleable, with as few barriers as possible

Thanks!

Vsevolod Dyomkin, @vseloved
