Post on 24-May-2020
Natural Language Processing (NLP)
Pradnya Nimkar, ACAS, MAAA
Disclaimer: This presentation is going to be…..wordy!
NLP is everywhere!
Business cases in Insurance:
● Lots of unstructured data
○ Data that does not follow a predefined pattern
○ Accident descriptions, injury descriptions, claim notes, doctor notes, nurse notes, policy terms, etc.
● Claim Triage model
○ Analyze claim notes, accident descriptions, and injury descriptions to identify large losses early on.
● Risk Management Practices
○ Identify and label areas that need attention
● Fraud Models
○ Analyze settlement notes and claim notes to identify fraudulent claims
● Underwriting / Policy Management
○ Avoid costly mistakes by pointing underwriters to inconsistencies in tailor-made wordings
● Claims Management
○ Analyze claims/complaints and direct them to the appropriate claim adjuster
○ Speed up the decision-making process by matching claim notes with existing claims
Why should an actuary care about NLP?
● ASOP 38: Using Models Outside the Actuary’s Area of Expertise (Property and Casualty)
What is Natural Language Processing (NLP)?
● How to program computers to process and analyze large amounts of data centered around human language.
● The focus is to capture the syntactic and semantic meaning of natural language.
History of NLP:
● Dates back to the 1950s
● 1950-1980:
○ Handwritten rules, lots of if...then...statements
○ Hard to maintain
● 1980-Now:
○ Corpus-based / statistical methods
● Now-Future:
○ Deep learning methods + statistical methods
Typical workflow with unstructured data:
Preprocessing - some NLP terminology
Corpus (a collection of texts: paragraphs, papers, books). Example corpus of three claim notes:
WORKER slipped while carrying groceries. Worker fractured his elbow
worker developed carpal tunnel from repetitive typing
worker got traumatized from NLP presentation

Vocabulary (the unique list of words observed):
NLP, WORKER, carpal, carrying, developed, elbow, fractured, from, got, groceries, his, presentation, repetitive, slipped, traumatized, tunnel, typing, while, worker
Preprocessing - Casing (lowercase all text)
WORKER slipped while carrying groceries. Worker fractured his elbow → worker slipped while carrying groceries. worker fractured his elbow
worker developed carpal tunnel from repetitive typing → worker developed carpal tunnel from repetitive typing (unchanged)
worker got traumatized from NLP presentation → worker got traumatized from nlp presentation
Preprocessing - Lemmatization (reduce each word to its canonical form)
WORKER slipped while carrying groceries. Worker fractured his elbow → worker slip while carry grocery. worker fracture his elbow
worker developed carpal tunnel from repetitive typing → worker develop carpal tunnel from repetitive typing
worker got traumatized from NLP presentation → worker get traumatize from nlp presentation
(Each individual word unit above is a token.)
Preprocessing - Stemming (removing affixes to get the stem of each word)
WORKER slipped while carrying groceries. Worker fractured his elbow → worker slip while carri groceri. worker fractur hi elbow
worker developed carpal tunnel from repetitive typing → worker develop carpal tunnel from repetit type
worker got traumatized from NLP presentation → worker got traumat from nlp present
Other preprocessing steps to be considered:
● Part-of-speech tagging
● Removing stop words like a, an, the, in, etc.
● Removing special characters
● Expanding contractions
● Dealing with abbreviations and misspellings
Main take-away: there is a balance to strike between simplification and retention of language nuance; the goal is to encode as much information as possible in the most compact form possible.
Time to go to space... Vector Space
● Two words: word vectors!
● Core idea: map the text to mathematical entities (vectors)
● Vector Space Models (VSMs) are the most common models in NLP
○ They translate raw text into vectors
○ There are many!
● Popular VSMs
○ Sparse representations (do not reduce the vector space):
■ Counts (term frequency)
■ Absence or presence of a word (one-hot encoding)
■ TF-IDF (term frequency-inverse document frequency)
○ Dense representations (reduce the space):
■ LSI / LDA (dimensionality reduction)
■ Word embeddings: e.g., Word2vec (neural net), GloVe
■ Sentence and document embeddings: e.g., Doc2vec, Skip-Thought
Vector Space Model I: Counts or Term Frequency
● Count the number of times each word occurs
● The order of words does not matter
● Hence the term bag of words (BOW)
[Figure: the notes plotted in a 3-dimensional vector space with axes worker, develop, and carpal; note_1 sits at [2, 0, 0] and note_2 at [1, 1, 1].]

          carpal   carry   develop   elbow   worker
note_1    0        1       0         1       2
note_2    1        0       1         0       1
note_3    0        0       0         0       1
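A minimal sketch of how such a count matrix could be built with scikit-learn's CountVectorizer. The note texts below are the lemmatized examples from earlier; the column order of the output may differ from the table above.

from sklearn.feature_extraction.text import CountVectorizer

notes = [
    "worker slip while carry grocery worker fracture his elbow",  # note_1
    "worker develop carpal tunnel from repetitive typing",        # note_2
    "worker get traumatize from nlp presentation",                # note_3
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(notes)    # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(counts.toarray())                     # term counts for each note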
Vector Space Model II: Binary or One-Hot Encoding
● Zipf's law for word distributions
○ Word counts follow a long-tailed distribution
● Record only the presence or absence of a word (see the sketch below)
○ 1 = the term occurs at least once
○ 0 = the term does not occur
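A one-flag variant of the previous sketch: CountVectorizer's binary option keeps only presence/absence instead of raw counts.

from sklearn.feature_extraction.text import CountVectorizer

notes = ["worker slip while carry grocery worker fracture his elbow",
         "worker develop carpal tunnel from repetitive typing",
         "worker get traumatize from nlp presentation"]

# binary=True records only presence (1) or absence (0) of each term
onehot = CountVectorizer(binary=True).fit_transform(notes)
print(onehot.toarray())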
VSM III: Term Frequency-Inverse Document Frequency (TF-IDF)
● Calculates the importance of a term for a particular document:
    tf-idf(t, d) = tf(t, d) * idf(t)
● tf(t, d) is greater when the term is frequent in that particular document.
● idf(t) is greater when the term is rare across ALL the documents (the corpus).
● Different weighting schemes exist for the idf part; the most common is logarithmic:
    idf(t) = log( total number of documents / number of documents in which that term appears )
VSM III: TF-IDF Python implementation example

For the term "worker" in note_1: tf(note_1, worker) = 2, N = 3, df(worker) = 3, which, with the smoothed idf weighting and normalization used in the implementation, gives tf-idf(note_1, worker) = 0.407.

TF-IDF matrix:
          carpal    carry     develop   elbow     worker
note_1    0         0.3451    0         0.3451    0.407
note_2    0.4107    0         0.4107    0         0.2425
note_3    0         0         0         0         0.2660

Count matrix, for comparison:
          carpal    carry     develop   elbow     worker
note_1    0         1         0         1         2
note_2    1         0         1         0         1
note_3    0         0         0         0         1
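A minimal sketch of producing such a TF-IDF matrix with scikit-learn's TfidfVectorizer. Exact values depend on the preprocessing and vectorizer settings, so they may not match the table above exactly.

from sklearn.feature_extraction.text import TfidfVectorizer

notes = ["worker slip while carry grocery worker fracture his elbow",
         "worker develop carpal tunnel from repetitive typing",
         "worker get traumatize from nlp presentation"]

tfidf = TfidfVectorizer()          # default: smoothed logarithmic idf + L2 normalization
X = tfidf.fit_transform(notes)     # rows = notes, columns = vocabulary terms
print(tfidf.get_feature_names_out())
print(X.toarray().round(4))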
VSM III: Other Considerations in TF-IDF
● min_df: removes highly infrequent terms
○ min_df = 0.10 => ignore terms that occur in less than 10% of the documents
○ min_df = 3 => ignore terms that occur in fewer than 3 documents
● max_df: removes terms that occur too frequently
○ max_df = 0.5 => ignore terms that occur in more than 50% of the documents
○ max_df = 5 => ignore terms that occur in more than 5 documents
● N-grams: contiguous sequences of words
○ Try to capture the context of the sentence
○ Bi-grams and tri-grams are common
○ Bi-gram example: "worker developed carpal tunnel from repetitive typing" => (worker developed, developed carpal, carpal tunnel, ..., repetitive typing)
○ These options can be passed directly to the vectorizer, as sketched below.
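A hedged sketch of how these options map onto scikit-learn's TfidfVectorizer parameters; the threshold values are the illustrative numbers from the slide, not recommendations, and list_of_claim_notes is a hypothetical corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    min_df=3,            # drop terms appearing in fewer than 3 documents
    max_df=0.5,          # drop terms appearing in more than 50% of documents
    ngram_range=(1, 2),  # keep unigrams and bi-grams
)
# X = tfidf.fit_transform(list_of_claim_notes)   # list_of_claim_notes: hypothetical corpus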
● Advantages:
○ Simple but surprisingly effective
○ Quick
○ Interpretable
● Disadvantages:
○ Assumes all words are independent (or equidistant), which is not the case in the real world
○ Very sparse representation (sparse is bad here because there are few examples to learn from)
Cosine Similarity
● Any text can be represented as a vector in a V-dimensional vector space.
● Cosine similarity is used for measuring the similarity between two vectors:
○ It measures the cosine of the angle between the two vectors: cos(a, b) = (a · b) / (||a|| ||b||)
○ Cosine is bounded by [-1, 1]: 1 being similar, 0 being dissimilar, and -1 being opposite
● Basic fraud model: rank the other claim notes by their cosine similarity to the claim note of a known fraudulent claim (see the sketch after the figure below)
[Figure: claim notes plotted in vector space; the note associated with a known fraudulent claim is marked, and the claim whose note (shown in red) is closest to it is flagged for investigation.]
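A minimal sketch of this ranking idea, assuming a TF-IDF representation and scikit-learn's cosine_similarity; the note texts and the choice of which note is "fraudulent" are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

notes = ["worker slip while carry grocery worker fracture his elbow",
         "worker develop carpal tunnel from repetitive typing",
         "worker get traumatize from nlp presentation"]

X = TfidfVectorizer().fit_transform(notes)

fraud_index = 0                                    # pretend note_1 is the known fraudulent claim
scores = cosine_similarity(X[fraud_index], X)[0]   # similarity of every note to the fraud note
ranking = scores.argsort()[::-1]                   # most similar notes first
print(ranking, scores.round(3))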
Curse of Dimensionality
● As dimensionality increases, the volume of the space grows so fast that the available data become sparse
● Matrix view
○ Sparse: lots of zero values
○ The zeros do not provide any additional information
○ Arithmetic operations take a lot of time
○ The matrix takes up a lot of memory
● Distance calculations
○ In a high-dimensional vector space, when a measure such as Euclidean distance is defined using many coordinates, there is little difference between the distances of different pairs of samples.
● Answer: reduce the dimensions (dense representations)
Dense Representation:
● Use matrix factorization
○ Singular value decomposition (Latent Semantic Indexing)
○ Non-negative matrix factorization
● Use probabilistic inference
○ Bayesian inference / Latent Dirichlet Allocation
● Use a neural network approach
○ Word2vec (Google model)
○ GloVe
○ FastText (Facebook model)
○ BlazingText (Amazon)
○ Train your own!
Topic Modeling:
Q) What topics best represent the information in these documents?
● Assumptions:
○ Each topic consists of a collection of words
○ Each document consists of a mixture of topics
● Uses:
○ Unsupervised learning algorithms
○ But the output can also be an input to other supervised algorithms
○ The analyst labels the resulting clusters (topics)
Latent Semantic Analysis/Indexing:
● Performs matrix factorization on the term-document matrix
○ The factorization is done using singular value decomposition (SVD)
○ Term-document matrix: the earlier tf-idf matrix, transposed (rows = terms (m), columns = documents (n))
● Singular Value Decomposition, truncated to k topics:
    A (m x n) ≈ U (m x k) * S (k x k) * V^T (k x n)
    where U assigns words to topics, S holds the topic weights, and V^T gives each document's coordinates in the reduced topic (document) space.
● LSI parameter k: the number of dimensions to reduce to
○ Depends on the data size
○ Old standard: 300
○ New standard: 500-1000
LSI Example with k = 2:

Term-document tf-idf matrix (input):
          note_1    note_2    note_3
carpal    0         0.4107    0
carry     0.3451    0         0
develop   0         0.4107    0
elbow     0.3451    0         0
worker    0.407     0.2425    0.266

Word assignment to topics (U, terms x 2 topics):
carpal    0.222    -0.17
carry     0.153     0.311
develop   0.222    -0.17
elbow     0.15      0.311
worker    0.358    -0.23

Topic importance (S, 2 x 2 diagonal):
1.12      0
0         0.96

Topic distribution across documents (V^T, 2 topics x documents):
          note_1    note_2    note_3
topic 1   0.497     0.607     0.618
topic 2   0.865    -0.398    -0.304

Original space → reduced 2-dimensional space:
note_1 ([0, 0.3451, 0, 0.3451, ...]) ===> note_1 ([0.497, 0.865])
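A minimal sketch of performing this reduction with scikit-learn's TruncatedSVD, a common way to run LSA/LSI on a TF-IDF matrix. The tiny corpus and k = 2 are for illustration only, so the numbers will not match the slide exactly.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

notes = ["worker slip while carry grocery worker fracture his elbow",
         "worker develop carpal tunnel from repetitive typing",
         "worker get traumatize from nlp presentation"]

X = TfidfVectorizer().fit_transform(notes)   # documents x terms

lsa = TruncatedSVD(n_components=2)           # k = 2 latent topics
doc_topics = lsa.fit_transform(X)            # each note as a 2-dimensional vector
print(doc_topics.round(3))                   # reduced document coordinates
print(lsa.components_.round(3))              # topic-term loadings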
NNMF (Non-Negative Matrix Factorization):
● Another matrix factorization method!
● Decomposes the document-term matrix into 2 matrices, instead of 3
● Main advantage over SVD:
○ The elements in both factor matrices are non-negative
○ (The input matrix must also have non-negative elements)
● Weakness:
○ The factorization is not unique
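A minimal sketch using scikit-learn's NMF on the same kind of TF-IDF matrix; the corpus and the choice of 2 components are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

notes = ["worker slip while carry grocery worker fracture his elbow",
         "worker develop carpal tunnel from repetitive typing",
         "worker get traumatize from nlp presentation"]

X = TfidfVectorizer().fit_transform(notes)   # non-negative document-term matrix

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)    # documents x topics
H = nmf.components_         # topics x terms
print(W.round(3))
print(H.round(3))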
Topic Modeling I: Latent Dirichlet Allocation
● Developed in 2003
Assumptions:
● There are k latent topics according to which documents are generated.
● A distribution of words for each topic
○ Each topic is represented by a set of terms
○ Models the probability of the topics each word belongs to
○ The same word can appear in multiple topics
● A mixture of topics within each document
Topic Modeling I: Latent Dirichlet Allocation
Topic 1: '0.038*"injury" + 0.027*"neck" + 0.024*"whiplash" + 0.017*"sti" + 0.015*"strain" + 0.011*"spin" + 0.010*"cerv" + 0.010*"low" + 0.009*"whiplash injury" + ...'
Topic 2: '0.019*"anxy" + 0.013*"disord" + 0.012*"depress" + 0.009*"ptsd" + 0.007*"stress" + 0.007*"adjust" + 0.006*"adjust disord" + 0.006*"traum" + 0.006*"post" + 0.005*"shock" + ...'
Topic 3: '0.017*"bru" + 0.015*"rt" + 0.012*"left" + 0.010*"lt" + 0.010*"injury" + 0.009*"abras" + 0.009*"lac" + 0.011*"kne" + 0.007*"cut" + 0.006*"fall" + ...'
Topic 4: '0.014*"rt" + 0.013*"kne" + 0.009*"left" + 0.009*"fract" + 0.007*"right" + 0.006*"ankl" + 0.005*"lt" + 0.005*"dist" + 0.005*"tib" + 0.004*"foot" + ...'

Document Level. Example (stemmed) claim note:
"ip suff whiplash injury and has return to work whiplash injury of the neck musculoliga strain of the back sti of the l should adjust disord w depress mood anxy aggrav pre ex deg chang up low spin whiplash injury of the neck muscul liga strain of the back soft tissu injury of left should"

Inferred topic mixture for this note:
Topic 1    Topic 2    Topic 3    Topic 4
0.411      0.316      0.153      0.120
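A minimal sketch of producing output in this style with gensim's LdaModel. The toy notes below stand in for a real corpus; meaningful topics like the ones above require a large collection of (preprocessed) claim notes, and num_topics=4 simply mirrors the example.

from gensim import corpora, models

notes = ["worker slip while carry grocery worker fracture his elbow",
         "worker develop carpal tunnel from repetitive typing",
         "worker get traumatize from nlp presentation"]

texts = [note.split() for note in notes]                 # tokenized documents
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words representation

lda = models.LdaModel(bow_corpus, num_topics=4, id2word=dictionary, passes=10)
print(lda.print_topics(num_words=10))                    # topics as weighted word lists
print(lda.get_document_topics(bow_corpus[0]))            # topic mixture for the first document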
Drawbacks of topic models:
● Sensitive to pre-processing
● Training time is relatively much longer, and they use more memory
Progression: statistical word counts → topic models (grouping words) → word embeddings
Dense Representation
● In 2013, a team at Google led by Tomas Mikolov created word2vec
● Motivated by Harris's distributional hypothesis from the 1950s: the intuition that similar words have similar contexts; you know words by the "neighbors they keep"
● Other dense word embedding variants exist, such as GloVe, matrix-factorization methods, and fastText
● Similar words end up close together in the space, and you can do vector operations that "make sense" (king - man + woman ≈ queen)
● Can capture synonyms, misspellings, etc., and supports transfer learning
● Drawbacks: one meaning tends to dominate each word's vector even though meaning is context dependent, and the approach is relatively data hungry
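A minimal sketch of training word vectors with gensim's Word2Vec (gensim 4.x style parameters); a real model would need a much larger corpus of claim notes than the toy sentences below.

from gensim.models import Word2Vec

sentences = [
    "worker slip while carry grocery worker fracture his elbow".split(),
    "worker develop carpal tunnel from repetitive typing".split(),
    "worker get traumatize from nlp presentation".split(),
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
print(model.wv["worker"][:5])                    # first few dimensions of the "worker" vector
print(model.wv.most_similar("worker", topn=3))   # nearest neighbours in the embedding space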
Dense Representation Cont.
[Figure: word vectors capture regularities such as verb tense; e.g., the offset between "fall" and "fell" is similar to the offset between "burn" and "burnt".]
Modeling Architecture
[Figure: CBOW vs. Skip-gram, illustrated with "worker slipped on water": CBOW predicts the center word "slipped" from the context words ("worker", "on", "water"), while Skip-gram predicts each context word from "slipped".]
FastText - Dealing with words not in the vocabulary
● Represents each word as a bag of character n-grams, so a vector can be composed even for words never seen during training (useful for misspellings and rare terms).
Dense Representation Cont.
● You might wonder: OK, we now have vectors for each word... how does that work for sentences? Paragraphs?
● Many answers (it's a topic of its own in research and practice):
○ Average the word vectors, concatenate them, or use sentence embeddings / document embeddings (see the averaging sketch below)
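A minimal sketch of the simplest option, averaging word vectors to get a single document vector; model is assumed to be a trained gensim Word2Vec model like the one sketched earlier.

import numpy as np

def average_vector(tokens, model):
    """Average the vectors of the tokens that are in the model's vocabulary."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:                            # no known words: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# note_vec = average_vector("worker develop carpal tunnel".split(), model)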
Recap / Summary
● We saw different ways to turn words into numbers: counts, groups (topics), embeddings
● Trade-off: simplicity + speed vs. complexity + cost
● Implicit: everything is data dependent (Conservation of Garbage)
Additional resources:
● Code for generating some of the demonstrated VSMs:
https://github.com/pradnya-nimkar/CABA-presentation/blob/master/CABA%20NLP%20presentation-May31.ipynb
● LDA paper (Blei et al.):
https://ai.stanford.edu/~ang/papers/nips01-lda.pdf
● Word2vec paper (Mikolov et al.):
https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
● Stanford NLP group:
https://nlp.stanford.edu/
Thank you!