NLTK: Natural Language Processing made easy
http://barcampbangalore.org
NLTK
Natural Language Processing made easy
Elvis Joel D’Souza
Gopikrishnan Nambiar
Ashutosh Pandey
WHAT: Session Objective
To introduce the Natural Language Toolkit (NLTK), an open-source library that simplifies the implementation of Natural Language Processing (NLP) in Python.
HOW: Session Layout
This session is divided into 3 parts:
• Python – The programming language
• Natural Language Processing (NLP) – The concept
• Natural Language Toolkit (NLTK) – The tool for NLP implementation in Python
Why Python?
Data Structures
Python has 4 built-in data structures:
1. List
2. Tuple
3. Dictionary
4. Set
List
• A list in Python is an ordered group of items (or elements).
• It is a very general structure, and list elements don't have to be of the same type.
listOfWords = ['this', 'is', 'a', 'list', 'of', 'words']
listOfRandomStuff = [1, 'pen', 'costs', 'Rs.', 6.50]
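As a brief illustration of the two points above, the sketch below indexes, modifies and inspects the lists from this slide. The extra variable names (`first_word`, `last_word`, `element_types`) are only illustrative.

```python
# The two lists from the slide.
listOfWords = ['this', 'is', 'a', 'list', 'of', 'words']
listOfRandomStuff = [1, 'pen', 'costs', 'Rs.', 6.50]

# Lists are ordered, so items can be accessed by position (starting at 0).
first_word = listOfWords[0]
last_word = listOfWords[-1]

# Lists are mutable: items can be added or replaced after creation.
listOfWords.append('indeed')
listOfWords[2] = 'one'

# Elements need not share a type.
element_types = [type(item).__name__ for item in listOfRandomStuff]

print(first_word, last_word)
print(listOfWords)
print(element_types)
```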
Tuple
• A tuple in Python is much like a list except that it is immutable (unchangeable) once created.
• They are generally used for data which should not be edited.
Example: (100, 10, 0.01, 'hundred')
The elements are, in order: the number, its square root, its reciprocal, and the number in words.
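A minimal sketch of what immutability means in practice, using the example tuple from this slide. The name `record` is an assumption made for illustration.

```python
record = (100, 10, 0.01, 'hundred')

# Elements are read by position, just like a list.
number = record[0]
in_words = record[3]

# Trying to change an element raises TypeError because tuples are immutable.
try:
    record[0] = 200
    changed = True
except TypeError:
    changed = False

print(number, in_words, changed)
```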
Return a tuple
def func(x, y):
    # code to compute a and b
    return (a, b)
One very useful application of tuples is returning multiple values from a function. In many other languages, returning multiple values requires creating an object or container of some type.
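The skeleton above can be filled in as follows. This is only a sketch: the function name `square_root_and_reciprocal` and the choice of computed values are assumptions, picked to match the earlier tuple example.

```python
import math

def square_root_and_reciprocal(x):
    """Return two values at once by packing them into a tuple."""
    root = math.sqrt(x)
    reciprocal = 1.0 / x
    return (root, reciprocal)

# The caller can unpack the returned tuple into separate variables.
root, reciprocal = square_root_and_reciprocal(100)
print(root, reciprocal)
```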
Dictionary
• A dictionary in Python is a collection of values which are accessed by key rather than by position.
• Example:
{1: 'one', 2: 'two', 3: 'three'}
• Here, the key is a number and the value is its name in words.
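A short sketch of common dictionary operations, built on the example dictionary from this slide. The variable name `numberNames` is an assumption for illustration.

```python
numberNames = {1: 'one', 2: 'two', 3: 'three'}

# Look up a value by its key.
name_of_two = numberNames[2]

# Add a new key-value pair.
numberNames[4] = 'four'

# get() returns a default instead of raising KeyError for a missing key.
name_of_ten = numberNames.get(10, 'unknown')

for number, name in numberNames.items():
    print(number, name)
```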
Sets
• Python also has an implementation of the mathematical set.
• Unlike sequence objects such as lists and tuples, in which each element is indexed, a set is an unordered collection of objects.
• Sets also cannot have duplicate members - a given object appears in a set 0 or 1 times.
SetOfBrowsers = set(['IE', 'Firefox', 'Opera', 'Chrome'])
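The properties listed above can be checked with a few lines. In this sketch, the second set `installed` is a hypothetical example introduced only to show intersection and difference.

```python
SetOfBrowsers = set(['IE', 'Firefox', 'Opera', 'Chrome'])

# Adding an existing member has no effect: duplicates are not stored.
SetOfBrowsers.add('Firefox')
browser_count = len(SetOfBrowsers)

# Membership tests and the usual mathematical operations are available.
has_opera = 'Opera' in SetOfBrowsers
installed = set(['Firefox', 'Chrome'])   # hypothetical second set
common = SetOfBrowsers & installed       # intersection
others = SetOfBrowsers - installed       # difference

print(browser_count, has_opera, sorted(common), sorted(others))
```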
Control Statements
Decision Control - If
num = 3
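A minimal illustrative sketch of an if statement starting from the assignment above. The branches and the variable `sign` are assumptions, not the slide's original code.

```python
num = 3

# Exactly one branch runs, depending on the value of num.
if num > 0:
    sign = 'positive'
elif num < 0:
    sign = 'negative'
else:
    sign = 'zero'

print(num, 'is', sign)
```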
Loop Control - While
number = 10
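A minimal illustrative sketch of a while loop starting from the assignment above. The countdown and the list `visited` are assumptions, not the slide's original code.

```python
number = 10

# Count down until number reaches zero, recording each value on the way.
visited = []
while number > 0:
    visited.append(number)
    number = number - 1

print(visited)
```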
Loop Control - For
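A minimal illustrative sketch of a for loop (not the slide's original code). It reuses the list of words from the List slide; the variables `lengths` and `total` are assumptions.

```python
listOfWords = ['this', 'is', 'a', 'list', 'of', 'words']

# Iterate directly over the items of a sequence.
lengths = []
for word in listOfWords:
    lengths.append(len(word))

# range(3) supplies the numbers 0, 1, 2 for a counted loop.
total = 0
for i in range(3):
    total = total + i

print(lengths, total)
```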
Functions - Syntax
def functionname(arg1, arg2, ...):
    statement1
    statement2
    return variable
Functions - Example
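A minimal illustrative sketch of a function definition and call (not the slide's original code). The function `count_words` is a hypothetical example.

```python
def count_words(sentence):
    """Return the number of whitespace-separated words in sentence."""
    words = sentence.split()
    return len(words)

result = count_words('the cat chased the dog')
print(result)
```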
Modules
• A module is a file containing Python definitions and statements.
• The file name is the module name with the suffix .py appended.
• A module can be imported by another program to make use of its functionality.
Import
import math
The import keyword tells Python that we need the ‘math’ module.
This statement makes all the functions in this module accessible in the program.
Using Modules – An Example
print(math.sqrt(100))
math is a module; sqrt is a function in that module.
math.sqrt(100) returns 10.0, which is printed to the standard output.
Natural Language Processing
(NLP)
Natural Language Processing
The term natural language processing encompasses a broad set of techniques for automated generation, manipulation, and analysis of natural or human languages
Why NLP
• Applications for processing large amounts of text require NLP expertise
• Index and search large texts
• Speech understanding
• Information extraction
• Automatic summarization
Stemming
• Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, which is generally a written word form.
• The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
• When you apply stemming on 'cats', the result is 'cat'
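To make the idea concrete, here is a sketch of the plural-removal rules from step 1a of the Porter stemming algorithm. It is deliberately incomplete: a full stemmer applies several further steps, and the function name `strip_plural` is an assumption for illustration.

```python
def strip_plural(word):
    """Apply the plural-removal rules of Porter's step 1a to one word."""
    if word.endswith('sses'):
        return word[:-2]      # caresses -> caress
    if word.endswith('ies'):
        return word[:-2]      # ponies -> poni
    if word.endswith('ss'):
        return word           # caress -> caress
    if word.endswith('s'):
        return word[:-1]      # cats -> cat
    return word

for word in ['cats', 'ponies', 'caresses', 'caress', 'dog']:
    print(word, '->', strip_plural(word))
```

Note that 'ponies' becomes 'poni', which is not a valid word; as stated above, a stem only needs to be shared by related words.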
Part-of-speech tagging (POS Tagging)
• Part-of-speech (POS) tag: A word can be classified into one or more lexical or part-of-speech categories, such as nouns, verbs, adjectives, and articles, to name a few.
• A POS tag is a symbol representing such a lexical category, e.g., NN (noun), VB (verb), JJ (adjective), AT (article).
POS tagging - continued
• Given a sentence and a set of POS tags, a common language processing task is to automatically assign POS tags to each word in the sentence.
• State-of-the-art POS taggers can achieve accuracy as high as 96%.
POS Tagging – An Example
The ball is red
The: ARTICLE, ball: NOUN, is: VERB, red: ADJECTIVE
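The simplest possible tagger looks each word up in a table. The sketch below does exactly that for the example sentence; the hand-written `LEXICON`, the function name `tag` and the default tag are all assumptions for illustration, and real taggers learn such information from tagged corpora instead.

```python
# Hypothetical hand-written lexicon covering only the example sentence.
LEXICON = {
    'the': 'ARTICLE',
    'ball': 'NOUN',
    'is': 'VERB',
    'red': 'ADJECTIVE',
}

def tag(sentence, default='NOUN'):
    """Tag each word by dictionary lookup, falling back to a default tag."""
    return [(word, LEXICON.get(word.lower(), default))
            for word in sentence.split()]

tagged = tag('The ball is red')
print(tagged)
```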
Parsing
Parsing a sentence involves the use of linguistic knowledge of a language to discover the way in which a sentence is structured
Parsing – An Example
The boy went home
The: ARTICLE, boy: NOUN, went: VERB, home: NOUN
NP: The boy
VP: went home
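One way to hold such a structure in a program is as nested tuples. The sketch below is an assumption-laden illustration: the root label `S` and the helper names `bracketed` and `leaves` are not from the slides.

```python
# A parse tree as nested tuples: (label, child, child, ...); leaves are strings.
tree = ('S',
        ('NP', ('ARTICLE', 'The'), ('NOUN', 'boy')),
        ('VP', ('VERB', 'went'), ('NOUN', 'home')))

def bracketed(node):
    """Render a tree in bracketed notation."""
    if isinstance(node, str):
        return node
    label, children = node[0], node[1:]
    return '(' + label + ' ' + ' '.join(bracketed(c) for c in children) + ')'

def leaves(node):
    """Collect the words of the sentence from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

print(bracketed(tree))
print(leaves(tree))
```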
Challenges
• We will often imply additional information in spoken language by the way we place stress on words.
• The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.
Depending on which word the speaker stresses, the sentence can have several distinct meanings.
Here goes an example…
• "I never said she stole my money" (stress on "I"): Someone else said it, but I didn't.
• "I never said she stole my money" (stress on "never"): I simply didn't ever say it.
• "I never said she stole my money" (stress on "said"): I might have implied it in some way, but I never explicitly said it.
• "I never said she stole my money" (stress on "she"): I said someone took it; I didn't say it was she.
• "I never said she stole my money" (stress on "stole"): I just said she probably borrowed it.
• "I never said she stole my money" (stress on "my"): I said she stole someone else's money.
• "I never said she stole my money" (stress on "money"): I said she stole something, but not my money.
NLTK
Natural Language Toolkit
Design Goals
Exploring Corpora
A corpus is a large collection of text that is used either to train an NLP program or as input to one.
In NLTK, a corpus can be loaded using the PlaintextCorpusReader class.
Loading your own corpus
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = r'C:\text'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
NLTK Corpora
• Gutenberg corpus
• Brown corpus
• WordNet
• Stopwords
• Shakespeare corpus
• Treebank
• And many more…
Computing with Language: Simple Statistics
Frequency Distributions
>>> from nltk.book import *   # loads text1 (Moby Dick) and FreqDist; prints a list of texts
>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys()   # in NLTK 3, use fdist1.most_common(50) for the most frequent samples
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906
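The idea behind a frequency distribution can be tried without any corpus installed. The sketch below uses `collections.Counter` from the standard library as a stand-in for FreqDist, and a short made-up sentence in place of Moby Dick; both substitutions are assumptions made to keep the example self-contained.

```python
from collections import Counter

# A short stand-in text; the slide uses Moby Dick (text1) instead.
text = 'the whale and the sea and the ship'.split()

# Count how often each token occurs.
counts = Counter(text)

total_outcomes = sum(counts.values())   # number of tokens counted
top_two = counts.most_common(2)         # most frequent tokens first
whale_count = counts['whale']

print(total_outcomes, top_two, whale_count)
```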
Cumulative Frequency Plot for the 50 Most Frequent Words in Moby Dick
POS tagging
WordNet Lemmatizer
Parsing
>>> from nltk.parse import ShiftReduceParser
>>> # grammar is a context-free grammar defined beforehand (not shown on the slide)
>>> sr = ShiftReduceParser(grammar)
>>> sentence1 = 'the cat chased the dog'.split()
>>> sentence2 = 'the cat chased the dog on the rug'.split()
>>> for t in sr.parse(sentence1):   # nbest_parse() in NLTK 2
...     print(t)
(S (NP (DT the) (N cat)) (VP (V chased) (NP (DT the) (N dog))))
Authorship Attribution
An Example
Find nltk @ <python-installation>\Lib\site-packages\nltk
The Road Ahead
Python:
• http://www.python.org
• A Byte of Python, Swaroop CH, http://www.swaroopch.com/notes/python
Natural Language Processing:
• Speech and Language Processing, Jurafsky and Martin
• Foundations of Statistical Natural Language Processing, Manning and Schütze
Natural Language Toolkit:
• http://www.nltk.org (for NLTK Book, Documentation)
• Upcoming book by O'Reilly Publishers