Natural Language Processing with Python
Kodliuk Tetiana
www.vitech.com.ua
What is NLP?
Natural language processing (NLP) is the ability of a computer program to understand human language, both spoken and written.
Why NLP?
NUMBERS EVERYWHERE
In the beginning was THE WORD…
The most terrible part: statistics…
How do statistics lie?
World average
• 6.1 trillion text messages/year
• 7 billion people
• 3 messages/day/person
But:
• Teenagers: 50 messages/day
How do statistics lie? 2050
• 9 billion people acting like teenagers
• 450 billion texts/day
• 164 trillion texts/year (6 trillion now)
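The arithmetic behind these two slides can be checked in a few lines. This is a minimal sketch that uses only the figures quoted on the slides and assumes a 365-day year. The computed world average comes out at about 2.4 messages per person per day, which the slide rounds up to 3.

```python
# Figures quoted on the slides.
texts_per_year_now = 6.1e12   # 6.1 trillion messages per year
people_now = 7e9              # 7 billion people
people_2050 = 9e9             # 9 billion people
teen_rate = 50                # messages per day per teenager

# World average today: messages per person per day (365-day year assumed).
per_person_per_day = texts_per_year_now / people_now / 365

# 2050 projection: everyone texts like a teenager.
texts_per_day_2050 = people_2050 * teen_rate
texts_per_year_2050 = texts_per_day_2050 * 365

print(f"today: {per_person_per_day:.1f} messages/day/person")
print(f"2050: {texts_per_day_2050 / 1e9:.0f} billion texts/day")
print(f"2050: {texts_per_year_2050 / 1e12:.0f} trillion texts/year")
```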
Why Python?
WHAT?
Business problems
• Sentiment analysis
• Spam/non-spam detection
• Similar-text search
• Text specialization
If you are a scientist…
Liquid crystal suspensions of carbon nanotubes assisted by organically modified Laponite nanoplatelets
Article Similarity
● How to find similar articles?
● How to find news that interests you?
● How to tell whether two customers are similar?
● How to detect the theme of a text?
Word2Vec - Tomas Mikolov, 2013
Doc2Vec - Quoc Le and Tomas Mikolov, 2014
LDA2Vec - Christopher Moody, 2015
Good solution: Doc2Vec
✓ “Oculist and eye-doctor … occur in almost the same environments”, Z. Harris (1954)
✓ “You shall know a word by the company it keeps!”, Firth (1957)
✓ “Tell me who your friends are and I will tell you who you are”, Ukrainian proverb
Word2Vec
Corpus reading
Vocabulary creation
Sub-sampling
Window moving
Feedforward neural network
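The first four steps above can be sketched in plain Python. This is an illustrative sketch, not the reference implementation: the toy corpus, `min_count`, the threshold `t`, and the helper names `keep_probability` and `windows` are all assumptions made for the example. The keep probability follows the sub-sampling formula from the 2013 Word2Vec paper, where a word with relative frequency f is discarded with probability 1 - sqrt(t / f). The neural network step is left out.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus, one sentence per line.
corpus = [
    "elementary my dear watson",
    "exactly my dear watson",
    "elementary my dear fellow",
]

# 1. Corpus reading: split each line into tokens.
sentences = [line.split() for line in corpus]

# 2. Vocabulary creation: count words and drop the rare ones.
min_count = 2  # hypothetical threshold
counts = Counter(word for sentence in sentences for word in sentence)
vocab = {word: n for word, n in counts.items() if n >= min_count}
total = sum(vocab.values())

# 3. Sub-sampling: frequent words are kept with probability sqrt(t / f).
def keep_probability(word, t=1e-3):
    frequency = vocab[word] / total
    return min(1.0, math.sqrt(t / frequency))

# 4. Window moving: pair every word with its neighbours.
def windows(tokens, window=2):
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield center, context

rng = random.Random(0)  # seeded so repeated runs give the same sample
for sentence in sentences:
    # t is set unrealistically high because the corpus is tiny.
    kept = [w for w in sentence
            if w in vocab and rng.random() < keep_probability(w, t=0.05)]
    for center, context in windows(kept, window=2):
        print(center, context)
```

The (center, context) pairs produced by `windows` are what the network is trained on.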
Word2Vec: CBoW
It is elementary, my dear Watson
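A rough sketch of a single CBoW forward pass on this sentence: the context embeddings are averaged, scored against every vocabulary word, and turned into probabilities with a softmax. The embedding size, the random initialization, and the names `W_in`, `W_out`, and `cbow_probabilities` are assumptions for illustration. Practical Word2Vec implementations replace the full softmax with hierarchical softmax or negative sampling, and the weights here are untrained.

```python
import math
import random

sentence = "it is elementary my dear watson".split()
vocab = sorted(set(sentence))
index = {word: i for i, word in enumerate(vocab)}

dim = 4  # hypothetical embedding size
rng = random.Random(0)
# Input (context) and output (target) embedding tables, randomly initialized.
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def cbow_probabilities(context):
    """Average the context embeddings, then softmax over the vocabulary."""
    hidden = [sum(W_in[index[w]][d] for w in context) / len(context)
              for d in range(dim)]
    scores = [sum(h * o for h, o in zip(hidden, row)) for row in W_out]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return {word: e / norm for word, e in zip(vocab, exps)}

# Predict the centre word "elementary" from two words on either side.
context = ["it", "is", "my", "dear"]
probs = cbow_probabilities(context)
loss = -math.log(probs["elementary"])  # training would minimize this
print(probs, loss)
```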
Word2Vec
it [0.23, 0.45, …… 0.71]
is [0.13, 0.50, …… 0.12]
elementary [0.05, 0.89, …… 0.08]
my [0.65, 0.15, …… 0.41]
dear [0.98, 0.21, …… 0.11]
watson [0.42, 0.12, …… 0.81]
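Once every word has a vector, similarity between words is usually measured with the cosine of the angle between their vectors. The sketch below uses made-up 3-dimensional vectors, since the slide elides the real values; `cosine` and `most_similar` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, chosen only to illustrate the computation.
vectors = {
    "dear": [0.9, 0.1, 0.2],
    "watson": [0.8, 0.2, 0.3],
    "elementary": [0.1, 0.9, 0.1],
}

def most_similar(word):
    """Return the other word whose vector is closest by cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

print(most_similar("dear"))
```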
Word2Vec
Sherlock Holmes cried: “Exactly, my dear Watson!”
Holmes said: “Elementary, my dear fellow! Ho! Elementary”
Then Psmith murmured: “Elementary, my dear Watson, elementary,”
Holmes
Psmith
Watson
fellow
Elementary
Exactly
cried
said
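Why these words pair up can be shown without any neural network: words that appear in similar contexts share neighbours. The sketch below collects the neighbours of each word within a two-word window in the three quotes and compares them with Jaccard overlap. The window size and the helper names `contexts` and `jaccard` are assumptions; Word2Vec learns dense vectors rather than comparing neighbour sets directly.

```python
import re
from collections import defaultdict

sentences = [
    "Sherlock Holmes cried: Exactly, my dear Watson!",
    "Holmes said: Elementary, my dear fellow! Ho! Elementary",
    "Then Psmith murmured: Elementary, my dear Watson, elementary",
]

def contexts(sentences, window=2):
    """Map each word to the set of words seen within `window` positions."""
    ctx = defaultdict(set)
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, word in enumerate(tokens):
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            ctx[word].update(neighbours)
    return ctx

def jaccard(a, b):
    """Share of neighbours the two sets have in common."""
    return len(a & b) / len(a | b)

ctx = contexts(sentences)
for a, b in [("watson", "fellow"), ("exactly", "elementary"), ("cried", "said")]:
    print(a, b, round(jaccard(ctx[a], ctx[b]), 2))
```

In this toy corpus "watson" and "fellow" share three of their four distinct neighbours, which is the kind of signal the embeddings pick up.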
Doc2Vec
Vector for Document
Titanic.txt [0.23, 0.45, …… 0.71]
Room.txt [0.13, 0.50, …… 0.12]
Sangredus.txt [0.05, 0.89, …… 0.08]
Umbriel.txt [0.65, 0.15, …… 0.41]
Dumped.txt [0.98, 0.21, …… 0.11]
Nessa.txt [0.42, 0.12, …… 0.81]
Vector for Word
titanic [0.03, 0.89, …… 0.71]
apartment [0.83, 0.50, …… 0.12]
room [0.55, 0.89, …… 0.08]
parrot [0.62, 0.15, …… 0.41]
nessa [0.08, 0.21, …… 0.11]
word [0.42, 0.12, …… 0.81]
Doc2Vec
[Diagram labels: LDA, Doc2Vec, Word2Vec]
LDA
LDA2Vec
Document = 0.15*programming + 0.25*football + 0.60*beer
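Read as vector arithmetic, the mixture above is a weighted sum of topic vectors, with weights that add up to 1. A minimal sketch, assuming made-up 2-dimensional topic vectors (real ones are learned during training):

```python
# Hypothetical topic vectors, chosen only to make the arithmetic visible.
topics = {
    "programming": [1.0, 0.0],
    "football": [0.0, 1.0],
    "beer": [1.0, 1.0],
}
# Topic proportions from the slide.
weights = {"programming": 0.15, "football": 0.25, "beer": 0.60}

# The proportions form a probability distribution over topics.
assert abs(sum(weights.values()) - 1.0) < 1e-9

# Document vector: weighted sum of the topic vectors, one dimension at a time.
document_vector = [
    sum(weights[name] * vector[d] for name, vector in topics.items())
    for d in range(2)
]
print(document_vector)
```

Because the weights are readable proportions, a document can be described by its dominant topics, here mostly "beer".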
Doc2Vec
TIME FOR PYTHON
Why Python?
• NLTK
• Gensim
• TextBlob
• Urllib
• Pattern
• Orange
• Sklearn
Data Science Flow
Target formulation
Wikipedia parsing
Text cleaning
Model building
Results analysis
Target formulation
Article similarity for Doc2Vec
Topics for LDA2Vec
Wikipedia parsing
Text cleaning
Tokenization
Digit removal
Stop-word removal
Punctuation cleaning
Encoding
Stemming
['ukraine', 'ukrainian', 'ukraina', 'country', 'eastern', 'europe', 'bordered', 'russia', 'east', 'northeast', 'belarus', 'northwest', 'poland', 'slovakia', 'west', 'hungary', 'romania', 'moldova', 'southwest', 'black', 'azov', 'south', 'southeast', 'respectively', 'ukraine', 'currently', 'territorial', 'dispute', 'russia', 'crimean', 'peninsula', 'russia', 'annexed', 'ukraine', 'international', 'community', 'recognise', 'ukrainian', 'including', 'crimea', 'ukraine', 'area', 'making', 'largest', 'country', 'entirely', 'within', 'europe', 'largest', 'country', 'world', 'population', 'million', 'making', 'populous', 'country', 'world']
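A token list like the one above can be produced with a short standard-library pipeline. This is a sketch under assumptions, not the code used in the talk: the stop-word list is a tiny hypothetical one (NLTK ships a fuller list), `naive_stem` is a crude suffix stripper rather than a real stemmer, and the ASCII normalization step would discard non-Latin text. Stemming is off by default because the sample output keeps full forms such as "bordered".

```python
import re
import unicodedata

# Tiny hypothetical stop-word list, for illustration only.
STOPWORDS = {"a", "an", "and", "by", "in", "is", "it", "of", "the", "to"}

def naive_stem(word):
    """Strip a few common English suffixes (stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text, stem=False):
    # Encoding: normalize to ASCII, dropping accents.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Tokenization, digit removal, punctuation cleaning: keep letter runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Optional stemming.
    return [naive_stem(t) for t in tokens] if stem else tokens

print(clean("Ukraine is a country in Eastern Europe, bordered by Russia."))
```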
Doc2Vec: LabeledSentence
“Doc_12” “Robot” “Food_cat” LDA vec
Doc2Vec
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec(size=300, window=10, min_count=10, workers=4, alpha=0.025, min_alpha=0.025)  # gensim >= 4.0 renames size to vector_size
Doc2Vec as Word2Vec
ARTICLE
WORD
MALWARE
Robot
Hobbit
Programmer
Math
LDA2Vec: LabeledSentence
“Doc_12” “Robot” “Food_cat” LDA vec
LDA
import gensim
lda = gensim.models.ldamodel.LdaModel(modelled_corpus, num_topics=20, update_every=100, passes=20, id2word=dictionary, alpha='auto', eval_every=5)
Useful links
1. http://u.cs.biu.ac.il/~yogo/cvsc2015.pdf
2. https://radimrehurek.com/gensim/models/doc2vec.html
3. https://github.com/cemoody/lda2vec
4. https://mebius.io/analysis/intro-to-LDA