Word2vec: From intuition to practice using gensim


WORD2VEC: FROM INTUITION TO PRACTICE USING GENSIM

Edgar Marca
matiskay@gmail.com

Python Peru Meetup, September 1st, 2016, Lima, Perú

About Edgar Marca

# Software Engineer at Love Mondays.
# One of the organizers of the Data Science Lima Meetup.
# Machine Learning and Data Science enthusiast.
# I speak a little Portuguese.


DATA SCIENCE LIMA MEETUP

Data Science Lima Meetup

The numbers

# 5 meetups held, with the 6th just around the corner.
# 410 Datanautas in the Meetup group.
# 329 people in the Facebook group.

Organizers

# Manuel Solorzano.
# Dennis Barreda.
# Freddy Cahuas.
# Edgar Marca.


Data Science Lima Meetup

Figure: Photo of the fifth Data Science Lima Meetup.

DATA

Data Never Sleeps

Figure: How much data is generated every minute? (Source: Data Never Sleeps 3.0, https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/)


NATURAL LANGUAGE PROCESSING

Introduction

# Text is the core business of internet companies today.
# Machine Learning and natural language processing techniques are applied to big datasets to improve search, ranking and many other tasks (spam detection, ad recommendations, email categorization, machine translation, speech recognition, etc.).


Natural Language Processing

Problems with text

# Messy.
# Irregularities of the language.
# Hierarchical structure.
# Sparse nature.


REPRESENTATIONS FOR TEXTS

Contextual Representation


How to learn good representations?


One-hot Representation

One-hot encoding

Represent every word as an R^|V| vector with all 0s and a single 1 at the index of that word.


One-hot Representation

Example: Let V = {the, hotel, nice, motel}. Then

w_the = (1, 0, 0, 0)^T,  w_hotel = (0, 1, 0, 0)^T,  w_nice = (0, 0, 1, 0)^T,  w_motel = (0, 0, 0, 1)^T

We represent each word as a completely independent entity. This word representation does not directly give us any notion of similarity.


One-hot Representation

For instance

⟨w_hotel, w_motel⟩_{R^4} = 0   (1)

⟨w_hotel, w_cat⟩_{R^4} = 0   (2)

We can try to reduce the size of this space from R^4 to something smaller and find a subspace that encodes the relationships between words.
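The lack of similarity is easy to verify numerically. Below is a minimal NumPy sketch (not from the slides) that builds the one-hot vectors for the example vocabulary above and checks that two distinct words always have inner product zero.

```python
import numpy as np

# Example vocabulary from the slides; the order fixes each word's index.
vocabulary = ["the", "hotel", "nice", "motel"]

def one_hot(word, vocabulary):
    """Return the |V|-dimensional one-hot vector for `word`."""
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(word)] = 1.0
    return vector

w_hotel = one_hot("hotel", vocabulary)
w_motel = one_hot("motel", vocabulary)

# 0.0 -- "hotel" and "motel" look completely unrelated in this representation.
print(np.dot(w_hotel, w_motel))
```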


One-hot Representation

Problems

# The dimension depends on the vocabulary size.
# Leads to data sparsity, so we need more data.
# Provides no useful information to the system.
# Encodings are arbitrary.


Bag-of-words representation

# Sum of one-hot codes.
# Ignores the order of words.

Examples:

# vocabulary = (monday, tuesday, is, a, today)
# monday monday = [2, 0, 0, 0, 0]
# today is monday = [1, 0, 1, 0, 1]
# today is tuesday = [0, 1, 1, 0, 1]
# is a monday today = [1, 0, 1, 1, 1]
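A minimal sketch of the same idea in plain Python, using the vocabulary above: counting word occurrences is all a bag-of-words vector does, so word order is lost (the extra sentence "monday is today" is only there to illustrate that).

```python
vocabulary = ["monday", "tuesday", "is", "a", "today"]

def bag_of_words(sentence, vocabulary):
    """Sum of one-hot codes: count how often each vocabulary word appears."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocabulary]

print(bag_of_words("monday monday", vocabulary))       # [2, 0, 0, 0, 0]
print(bag_of_words("today is monday", vocabulary))     # [1, 0, 1, 0, 1]
print(bag_of_words("is a monday today", vocabulary))   # [1, 0, 1, 1, 1]

# Word order is ignored: reordering a sentence gives the same vector.
print(bag_of_words("today is monday", vocabulary) ==
      bag_of_words("monday is today", vocabulary))     # True
```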


Distributional hypothesis

You shall know a word by the company it keeps! (Firth, 1957)


Language Modeling (Unigrams, Bigrams, etc)

A language model is a probabilistic model that assigns a probability to any sequence of n words, P(w_1, w_2, ..., w_n).

Unigrams

Assuming that the word occurrences are completely independent:

P(w_1, w_2, ..., w_n) = ∏_{i=1}^{n} P(w_i)   (3)
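As a rough sketch (not from the slides), unigram probabilities can be estimated by maximum likelihood from corpus counts, and the sequence probability is then just the product of the per-word probabilities. The tiny corpus below is made up purely for illustration.

```python
from collections import Counter

corpus = "today is monday today is tuesday".split()  # toy corpus, purely illustrative
counts = Counter(corpus)
total = len(corpus)

def unigram_prob(word):
    """Maximum-likelihood estimate P(w) = count(w) / N."""
    return counts[word] / total

def unigram_sequence_prob(words):
    """Unigram model: P(w1..wn) = product of P(wi), assuming independence."""
    p = 1.0
    for w in words:
        p *= unigram_prob(w)
    return p

print(unigram_sequence_prob("today is monday".split()))  # (2/6) * (2/6) * (1/6)
```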


Language Modeling (Unigrams, Bigrams, etc)

Bigrams

The probability of the sequence depends on the pairwise probability of a word in the sequence and the word next to it:

P(w_1, w_2, ..., w_n) = ∏_{i=2}^{n} P(w_i | w_{i−1})   (4)
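The bigram version only changes what gets counted: pairs of adjacent words instead of single words. Again a hedged sketch on a made-up toy corpus, not anything shown on the slides.

```python
from collections import Counter

corpus = "today is monday today is tuesday".split()  # toy corpus, purely illustrative

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(word, prev):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def bigram_sequence_prob(words):
    """Bigram model: P(w1..wn) ≈ product over i >= 2 of P(wi | w_{i-1})."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(word, prev)
    return p

print(bigram_sequence_prob("today is monday".split()))  # P(is|today) * P(monday|is)
```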


Word Embeddings

A set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers in a low-dimensional space relative to the vocabulary size (a "continuous space").

# Vector space models (VSMs) represent (embed) words in a continuous vector space.
# Semantically similar words are mapped to nearby points.
# The basic idea is the Distributional Hypothesis: words that appear in the same contexts share semantic meaning.
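A quick way to make "nearby points" concrete is cosine similarity between two embedding vectors. The 3-dimensional vectors below are made-up toy numbers, only to show the computation; real embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; close to 1 means 'nearby'."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy "embeddings", invented for illustration only.
v_hotel = np.array([0.7, 0.1, 0.2])
v_motel = np.array([0.6, 0.2, 0.2])
v_cat   = np.array([0.0, 0.9, 0.1])

print(cosine_similarity(v_hotel, v_motel))  # high: semantically similar words sit nearby
print(cosine_similarity(v_hotel, v_cat))    # low: unrelated words are far apart
```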


WORD2VEC

Distributional hypothesis

You shall know a word by the company it keeps! (Firth, 1957)


Word2Vec

Figure: Two original papers published in association with word2vec by Mikolov et al. (2013)

# Efficient Estimation of Word Representations in Vector Space: https://arxiv.org/abs/1301.3781
# Distributed Representations of Words and Phrases and their Compositionality: https://arxiv.org/abs/1310.4546

Continuous Bag of Words and Skip-gram


Contextual Representation

A word is represented by the contexts in which it is used.


Word Vectors


Word2Vec

# v_king − v_man + v_woman ≈ v_queen
# v_paris − v_france + v_italy ≈ v_rome
# Learns from raw text.
# Made a huge splash in the NLP world.
# Comes pretrained (useful if you don't have any specialized vocabulary).
# Word2vec is a computationally efficient model for learning word embeddings.
# Word2Vec is a successful example of "shallow" learning.
# A very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
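A hedged sketch of the analogy arithmetic above using gensim's KeyedVectors (assuming gensim ≥ 1.0). The filename is only a placeholder: point it at any pretrained word2vec-format file you have downloaded, and note that the available words and their casing depend on that file.

```python
from gensim.models import KeyedVectors

# Placeholder path: any pretrained word2vec-format file works here.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# v_king - v_man + v_woman ≈ v_queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# v_paris - v_france + v_italy ≈ v_rome
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```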

Word2vec


Gensim
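Training a model with gensim takes only a few lines. This is a minimal, illustrative sketch: the toy corpus is made up just to show the call shape, and the parameter names follow gensim 4.x (older releases used size instead of vector_size).

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences. Real training needs far more text.
sentences = [
    ["data", "science", "lima", "meetup"],
    ["word2vec", "learns", "word", "embeddings", "from", "raw", "text"],
    ["gensim", "makes", "word2vec", "easy", "to", "use"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the embeddings (older gensim: size=100)
    window=5,         # context window size
    min_count=1,      # keep every word, since the toy corpus is tiny
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

print(model.wv["word2vec"][:5])           # the learned vector for a word
print(model.wv.most_similar("word2vec"))  # its nearest neighbours in the toy space
```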


APPLICATIONS

What the Fuck Are Trump Supporters Thinking?


# They gathered four million tweets belonging to more than two thousand hard-core Trump supporters.
# Distances between those vectors encoded the semantic distance between their associated words (e.g., the vector representation of the word morons was near idiots but far away from funny).

Link: https://medium.com/adventurous-social-science/what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d


Restaurant Recommendation

http://www.slideshare.net/SudeepDasPhD/recsys-2015-making-meaningful-restaurant-recommendations-at-opentable


Song Recommendations

Link: https://social.shorthand.com/mawsonguy/3CfQA8mj2S/playlist-harvesting

TAKEAWAYS

Takeaways

# If you don't have enough data you can use pre-trained models.
# Remember: garbage in, garbage out.
# Every data set will produce different results.
# Use Word2vec as a feature extractor (a sketch follows below).
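One common way to use word2vec as a feature extractor is to average the word vectors of a document to get a fixed-size feature vector for a downstream classifier. This is a sketch under that assumption, not something shown on the slides; it works with any gensim KeyedVectors instance (e.g. model.wv from the training sketch above).

```python
import numpy as np

def document_vector(tokens, keyed_vectors):
    """Average the word vectors of the tokens that are in the vocabulary."""
    vectors = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    if not vectors:
        # No known words: fall back to a zero vector of the right size.
        return np.zeros(keyed_vectors.vector_size)
    return np.mean(vectors, axis=0)

# Usage with the model trained above (or any KeyedVectors instance):
# features = document_vector("gensim makes word2vec easy".split(), model.wv)
```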


Obrigado (Thank you!)
