Language Sleuthing HOWTO with NLTK
Transcript of Language Sleuthing HOWTO with NLTK
[Page 1]
Language Sleuthing HOWTO
or
Discovering Interesting Things with Python's Natural Language Toolkit
Brianna Laugher
modernthings.org brianna[@.]laugher.id.au
[Page 2]
why?
Corpus linguistics on web texts
[Page 3]
Because the web is full of language data
Because linguistic techniques can reveal unexpected insights
Because I don't want to have to read everything
[Page 4]
Like... mailing lists
[Page 5]
luv-main as a corpus
✓ Big collection of text
✗ Messy data
✗ Not annotated
[Page 6]
what's interesting?
conversations
topics
change over time
(authors)
[Page 7]
Step 1:
get the data
[Page 8]
wget vs Python script
✓ wget is purpose-built
✓ convenient options like --convert-links
[Page 9]
Meaningful URLs FTW
Sympa/MhonArc:
lists.luv.asn.au/wws/arc/luv-main/2009-04/msg00057.html
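Those meaningful URLs mean the message metadata can be recovered from the path alone. A minimal sketch (the regex is my own, written to match the example path above):

```python
import re

# Matches MhonArc-style archive paths like
# lists.luv.asn.au/wws/arc/luv-main/2009-04/msg00057.html
ARCHIVE_RE = re.compile(
    r'/arc/(?P<list>[^/]+)/(?P<year>\d{4})-(?P<month>\d{2})/msg(?P<num>\d+)\.html')

def parse_archive_url(url):
    """Return a dict of metadata recovered from the URL, or None."""
    m = ARCHIVE_RE.search(url)
    return m.groupdict() if m else None

meta = parse_archive_url(
    'http://lists.luv.asn.au/wws/arc/luv-main/2009-04/msg00057.html')
print(meta)
# → {'list': 'luv-main', 'year': '2009', 'month': '04', 'num': '00057'}
```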
[Page 10]
[Page 11]
Step 2:
clean the data
[Page 12]
Cleaning for what?
Remove archive boilerplate
Remove HTML
Remove quoted text?
Remove signatures?
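The last two items are judgement calls, but both follow strong email conventions: quoted text starts with ">" and signatures come after a "-- " delimiter line. A rough stdlib sketch of that cleaning step:

```python
def clean_body(text):
    """Drop quoted lines and everything after a '-- ' signature delimiter."""
    kept = []
    for line in text.splitlines():
        if line.rstrip() == '--':          # signature delimiter: stop here
            break
        if line.lstrip().startswith('>'):  # quoted reply line: skip it
            continue
        kept.append(line)
    return '\n'.join(kept)

msg = """Thanks, that worked.
> Did you try rebooting?
> It usually helps.
--
Alice
"""
print(clean_body(msg))  # → "Thanks, that worked."
```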
[Page 13]
J.W.
J.W.
W.E.
[Page 14]
Behind the scenes
J.W.
W.E.
[Page 15]
what are we aiming for?
what do NLTK corpora look like?
[Page 16]
Getting NLTK
sudo apt-get install python-nltk
in Ubuntu 10.04
or
sudo apt-get install python-pip
pip install nltk
or from source at nltk.org/download
[Page 17]
Getting NLTK data...
an “NLTKism”
[Page 18]
[Page 19]
NLTK corpora types
[Page 20]
Brown corpus
A CategorizedTagged corpus:
Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb by/in clearing/vbg up/in any/dti possible/jj misconception/nn in/in your/pp$ minds/nns ,/, wherever/wrb you/ppss are/ber ./. The/at collective/nn by/in which/wdt I/ppss address/vb you/ppo in/in the/at title/nn above/rb is/bez neither/cc patronizing/vbg nor/cc jocose/jj but/cc an/at exact/jj industrial/jj term/nn in/in use/nn among/in professional/jj thieves/nns ./.
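Each Brown token is written as word/tag, so splitting each token on its last "/" recovers (word, tag) pairs — for illustration, with plain Python:

```python
# Each token in the Brown corpus is written as word/tag; splitting on the
# last '/' recovers (word, tag) pairs.
line = "Dear/jj Sirs/nns :/: Let/vb me/ppo begin/vb"
pairs = [tok.rsplit('/', 1) for tok in line.split()]
print(pairs[:3])  # → [['Dear', 'jj'], ['Sirs', 'nns'], [':', ':']]
```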
[Page 21]
Inaugural corpus
A Plaintext corpus:
My fellow citizens:
I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.
Forty-four Americans have now taken the presidential oath. ...
[Page 22]
But we still have lots of HTML...
[Page 23]
[Page 24]
BeautifulSoup to the rescue
>>> from BeautifulSoup import BeautifulSoup as BS
>>> data = open(filename, 'r').read()
>>> soup = BS(data)
>>> print '\n'.join(soup.findAll(text=True))
[Page 25]
[Page 26]
notice the blockquote!
[Page 27]
What about blockquotes?

>>> bqs = s.findAll('blockquote')
>>> [bq.extract() for bq in bqs]
>>> print '\n'.join(s.findAll(text=True))

On 05/08/2007, at 12:05 PM, [...] wrote:
If u want it USB bootable, just burn the DSL boot disk to CD and fire it up. Then from the desktop after boot, right click and create the bootable USB key yourself. I havent actually done this myself (only seen the option from the menu), but I am assuming it will be a fairly painless process if you are happy with the stock image. Would be interested in how you go as I have to build 50 USB bootable DSL's in the next couple weeks.
Regards,
[...]
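BeautifulSoup's extract() removes each <blockquote> subtree before the text is collected. For comparison, the same effect with only the standard library's html.parser (my own sketch, not code from the slides):

```python
from html.parser import HTMLParser

class UnquotedText(HTMLParser):
    """Collect text content, skipping anything inside <blockquote>."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level of open blockquotes
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'blockquote':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'blockquote' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:     # only keep text outside all blockquotes
            self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks)

p = UnquotedText()
p.feed('<p>My reply.</p><blockquote>Original <b>quoted</b> text</blockquote><p>Bye.</p>')
print(p.text())  # → "My reply.Bye."
```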
[Page 28]
Step 3:
analyse the data
[Page 29]
Getting it into NLTK
import nltk
path = 'path/to/files'
corpus = nltk.corpus.PlaintextCorpusReader(path, '.*\.html')
[Page 30]
What about our metadata?
Create a Python dictionary that maps filenames to categories, e.g.

categories = {}
categories['2008-12/msg00226.html'] = ['year-2008', 'month-12', 'author-BM<bm@xxxxx>']
... etc.

then...

import nltk
path = 'path/to/files/'
corpus = nltk.corpus.CategorizedPlaintextCorpusReader(path, '.*\.html', cat_map=categories)
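The cat_map dictionary can itself be generated from those meaningful paths. A sketch, noting that the author label is hypothetical here, since in practice it comes from the message headers rather than the path:

```python
import re

def path_categories(fileid, author=None):
    """Derive category labels from a 'YYYY-MM/msgNNNNN.html' file id."""
    m = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})/msg\d+\.html$', fileid)
    cats = []
    if m:
        cats.append('year-' + m.group('year'))
        cats.append('month-' + m.group('month'))
    if author:                         # author comes from the mail headers
        cats.append('author-' + author)
    return cats

categories = {}
categories['2008-12/msg00226.html'] = path_categories(
    '2008-12/msg00226.html', author='BM<bm@xxxxx>')
print(categories)
# → {'2008-12/msg00226.html': ['year-2008', 'month-12', 'author-BM<bm@xxxxx>']}
```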
[Page 31]
Simple categories
cats = corpus.categories()
authorcats = [c for c in cats if c.startswith('author')]
# >>> len(authorcats)
# 608
yearcats = [c for c in cats if c.startswith('year')]
monthcats = [c for c in cats if c.startswith('month')]
[Page 32]
...who are the top posters?

posts = [(len(corpus.fileids(author)), author) for author in authorcats]
posts.sort(reverse=True)
for count, author in posts[:10]:
    print "%5d\t%s" % (count, author)

→
 1304  author-JW
 1294  author-RC
 1243  author-CS
 1030  author-JH
  868  author-DP
  752  author-TWB
  608  author-CS#2
  556  author-TL
  452  author-BM
  412  author-RM

(email me if you're curious to know if you're on it...)
[Page 33]
Frequency distributions
popular = ['ubuntu', 'debian', 'fedora', 'arch']
niche = ['gentoo', 'suse', 'centos', 'redhat']

def getcfd(distros, limit):
    cfd = nltk.ConditionalFreqDist(
        (distro, fileid[:limit])
        for fileid in corpus.fileids()
        for w in corpus.words(fileid)
        for distro in distros
        if w.lower().startswith(distro))
    return cfd

popularcfd = getcfd(popular, 4)  # or 7 for months
popularcfd.plot()
nichecfd = getcfd(niche, 4)
nichecfd.plot()
another “NLTKism”
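nltk.ConditionalFreqDist is, in essence, a dictionary of per-condition frequency counters. The same tally can be sketched with the standard library on toy data (the file ids and words below are invented):

```python
from collections import Counter, defaultdict

# Toy stand-ins for corpus.fileids() / corpus.words(fileid)
files = {
    '2009-04/msg00001.html': ['I', 'installed', 'Ubuntu', 'today'],
    '2009-05/msg00002.html': ['debian', 'vs', 'ubuntu', 'again'],
}
distros = ['ubuntu', 'debian']

# condition -> Counter over 'YYYY-MM' buckets, like a ConditionalFreqDist
cfd = defaultdict(Counter)
for fileid, words in files.items():
    for w in words:
        for distro in distros:
            if w.lower().startswith(distro):
                cfd[distro][fileid[:7]] += 1   # fileid[:7] is the month bucket

print(dict(cfd))
# → {'ubuntu': Counter({'2009-04': 1, '2009-05': 1}), 'debian': Counter({'2009-05': 1})}
```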
[Page 34]
'Popular' distros by month
[Page 35]
'Popular' distros by year
[Page 36]
'Niche' distros by year
[Page 37]
Random text generation
import random

words = [w.lower() for w in corpus.words()]
bigrams = nltk.bigrams(words)
cfd = nltk.ConditionalFreqDist(bigrams)

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print word,
        words = list(cfdist[word])
        word = random.choice(words)

generate_model(cfd, 'hi', num=20)
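The slide code is Python 2 and leans on NLTK; for the curious, the same bigram trick in stand-alone modern Python:

```python
import random
from collections import defaultdict

def build_bigram_model(words):
    """Map each word to the list of words observed immediately after it."""
    model = defaultdict(list)
    for w1, w2 in zip(words, words[1:]):
        model[w1].append(w2)
    return model

def generate(model, word, num=15):
    """Random-walk the bigram model, starting from `word`."""
    out = [word]
    for _ in range(num):
        followers = model.get(word)
        if not followers:          # dead end: no observed successor
            break
        word = random.choice(followers)
        out.append(word)
    return ' '.join(out)

sample = "hi all hi folks hi all again".split()
model = build_bigram_model(sample)
print(generate(model, 'hi', num=5))
```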
[Page 38]
hi...hi allan : ages since apparently yum erased . attempts now venturing into config run ip 10 431 ms 57
hi serg it illegal address entries must *, t close relative info many families continue fi into modem and reinstalled
hi wen and amended :) imageshack does for grade service please blame . warning issued an overall environment consists in
hi folks i accidentally due cause excitingly stupid idiots , deletion flag on adding option ? branded ) mounting them
hi guys do composite required </ emulator in for unattended has info to catalyse a dbus will see atz init3
[Page 39]
hi from Peter...

text = [w.lower() for w in corpus.words(categories=
    [c for c in authorcats if 'PeterL' in c])]
hi everyone , hence the database schema and that run on memberdb on mail store is 12 . yep ,
hi anita , your favourite piece of cpu cycles , he was thinking i hear the middle of failure .
hi anita , same vhost b internal ip / nine seem odd occasion i hazard . 25ghz g4 ibook here
hi everyone , same ) on removes a "-- nicelevel nn " as intended . 00 . main host basis
hi cameron , no biggie . candidates in to upgrade . ubuntu dom0 install if there ! now ). txt
hi cameron , attribution for 30 seconds , and runs out on linux to on www . luv , these
[Page 40]
interesting collocations...or not
text = [w.lower() for w in corpus.words() if w.isalpha()]
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)

→
bufnewfile bufread
busmaster speccycle
cellx celly
cheswick bellovin
cread clocal
curtail atl
dmcrs rsce
mdmmrbc dmost
dmost dmcrs
...
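PMI rewards pairs that occur together far more often than their individual frequencies predict — which is exactly why rare boilerplate tokens (lspci output, config keywords) dominate the top of the list. The scoring itself is simple; a self-contained sketch on toy text:

```python
import math
from collections import Counter

words = 'the cat sat on the mat the cat ran'.split()
bigrams = list(zip(words, words[1:]))

word_fd = Counter(words)       # unigram frequencies
bigram_fd = Counter(bigrams)   # bigram frequencies
n_words = len(words)
n_bigrams = len(bigrams)

def pmi(w1, w2):
    """Pointwise mutual information: log2(p(x,y) / (p(x) * p(y)))."""
    p_xy = bigram_fd[(w1, w2)] / n_bigrams
    p_x = word_fd[w1] / n_words
    p_y = word_fd[w2] / n_words
    return math.log2(p_xy / (p_x * p_y))

scored = sorted(((pmi(a, b), a, b) for a, b in bigram_fd), reverse=True)
print(scored[0])  # the pair most "glued together"
```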
[Page 41]
oblig tag cloud
stopwords = nltk.corpus.stopwords.words('english')
words = [w.lower() for w in corpus.words() if w.isalpha()]
words = [w for w in words if w not in stopwords]
word_fd = nltk.FreqDist(words)
wordmax = word_fd[word_fd.max()]
wordmin = 1000  # YMMV
taglist = word_fd.items()
ranges = getRanges(wordmin, wordmax)
writeCloud(taglist, ranges, 'tags.html')
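The slides don't show getRanges or writeCloud, so their implementation is a guess; presumably getRanges splits the frequency span into bands that writeCloud maps to font sizes. A hypothetical sketch:

```python
def getRanges(wordmin, wordmax, bands=5):
    """Split [wordmin, wordmax] into equal-width (lo, hi) bands."""
    step = max(1, (wordmax - wordmin) // bands)
    ranges = []
    lo = wordmin
    for _ in range(bands - 1):
        ranges.append((lo, lo + step - 1))
        lo += step
    ranges.append((lo, wordmax))  # last band absorbs the remainder
    return ranges

def band_for(count, ranges):
    """Index of the band a frequency falls in (would drive the font size)."""
    for i, (lo, hi) in enumerate(ranges):
        if lo <= count <= hi:
            return i
    return None

ranges = getRanges(1000, 6000)
print(ranges)
# → [(1000, 1999), (2000, 2999), (3000, 3999), (4000, 4999), (5000, 6000)]
```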
[Page 42]
[Page 43]
another one for Peter :)

cats = [c for c in corpus.categories() if 'PeterL' in c]
words = [w.lower() for w in corpus.words(categories=cats) if w.isalpha()]
wordmin = 10
→
[Page 44]
thanks!
for more corpus fun: http://www.nltk.org/
The Book: 'Natural Language Processing with Python', 2nd ed. pub. Jan 2010
These slides are © Brianna Laugher and are released under the Creative Commons Attribution ShareAlike license,
v3.0 unported. The data set is not free, sadly...