Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance...

52
Text Mining for Economics and Finance Latent Dirichlet Allocation Stephen Hansen Text Mining Lecture 5 1 / 45

Transcript of Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance...

Page 1: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Text Mining for Economics and FinanceLatent Dirichlet Allocation

Stephen Hansen

Text Mining Lecture 5 1 / 45

Page 2: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Introduction

Recall we are interested in mixed-membership modeling, but that the pLSImodel has a huge number of parameters to estimate.

One solution is to adopt a Bayesian approach; the pLSI model with a priordistribution on the document-specific mixing probabilities is called LatentDirichlet Allocation (Blei, Ng, and Jordan 2003).

LDA is widely used within computer science and, increasingly, social sciences.

LDA forms the basis of many, more complicated mixed-membership models.

Text Mining Lecture 5 2 / 45

Page 3: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Latent Dirichlet Allocation—Original

1. Draw θd independently for d = 1, . . . ,D from Dirichlet(α).

2. Each word wd,n in document d is generated from a two-step process:

2.1 Draw topic assignment zd,n from θd .

2.2 Draw wd,n from βzd,n .

Estimate hyperparameters α and term probabilities β1, . . . ,βK .

Text Mining Lecture 5 3 / 45

Page 4: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Latent Dirichlet Allocation—Modified

1. Draw βk independently for k = 1, . . . ,K from Dirichlet(η).

2. Draw θd independently for d = 1, . . . ,D from Dirichlet(α).

3. Each word wd,n in document d is generated from a two-step process:

3.1 Draw topic assignment zd,n from θd .

3.2 Draw wd,n from βzd,n .

Fix scalar values for η and α.

Text Mining Lecture 5 4 / 45

Page 5: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Example statement: Yellen, March 2006,#51

We have noticed a change in the relationship between the core CPI and thechained core CPI, which suggested to us that maybe something is going onrelating to substitution bias at the upper level of the index. You focused on thenonmarket component of the PCE, and I wondered if something unusual mightbe happening with the core CPI relative to other measures.

Text Mining Lecture 5 5 / 45

Page 6: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Example statement: Yellen, March 2006,#51

We have noticed a change in the relationship between the core CPI and thechained core CPI, which suggested to us that maybe something is going onrelating to substitution bias at the upper level of the index. You focused on thenonmarket component of the PCE, and I wondered if something unusual mightbe happening with the core CPI relative to other measures.

Text Mining Lecture 5 5 / 45

Page 7: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Example statement: Yellen, March 2006,#51

We have noticed a change in the relationship between the core CPI and thechained core CPI, which suggested to us that maybe something is going onrelating to substitution bias at the upper level of the index. You focused on thenonmarket component of the PCE, and I wondered if something unusual mightbe happening with the core CPI relative to other measures.

Text Mining Lecture 5 5 / 45

Page 8: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Example statement: Yellen, March 2006,#51

We have 17ticed a 39ange in the 39lationship 1etween the 25re 25I and the41ained 25re 25I, which 25ggested to us that 36ybe 36mething is 38ing on43lating to 25bstitution 20as at the 25per 39vel of the 16dex. You 23cused onthe 25nmarket 25mponent of the 25E, and I 32ndered if 38mething 16usualmight be 4appening with the 25re 25I 16lative to other 25asures.

Text Mining Lecture 5 5 / 45

Page 9: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Distribution of Attention

Text Mining Lecture 5 6 / 45

Page 10: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Topic 25

Text Mining Lecture 5 7 / 45

Page 11: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Advantage of Flexibility

‘measur’ has probability 0.026 in topic 25, and probability 0.021 in topic 11

Text Mining Lecture 5 8 / 45

Page 12: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Topic 11

Text Mining Lecture 5 8 / 45

Page 13: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Advantage of Flexibility

‘measur’ has probability 0.026 in topic 25, and probability 0.021 in topic 11.

It gets assigned to 25 in this statement consistently due to the presence of othertopic 25 words.

In statements containing words on evidence and numbers, it consistently getsassigned to 11.

Sampling algorithm can help place words in their appropriate context.

Text Mining Lecture 5 8 / 45

Page 14: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

External Validation—BBD

50

100

150

200

BB

D

0.0

2.0

4.0

6.0

8.1

Fra

ction o

f F

OM

C 1

& 2

1987m7 1991m7 1995m7 1999m7 2003m7 2007m7

FOMC Meeting Date

BBD Topic 35

50

100

150

200

BB

D

0.0

2.0

4.0

6.0

8.1

Fra

ction o

f F

OM

C 1

& 2

1987m7 1991m7 1995m7 1999m7 2003m7 2007m7

FOMC Meeting Date

BBD Topic 12

Text Mining Lecture 5 9 / 45

Page 15: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Pro-Cyclical Topics0

.05

.1.1

5.2

Fra

ction o

f speaker’s tim

e in F

OM

C2

1987m7 1991m1 1994m7 1998m1 2001m7 2005m1 2008m7

FOMC Meeting Date

Min Median Max

0.0

5.1

.15

.2

Fra

ction o

f speaker’s tim

e in F

OM

C2

1987m7 1991m1 1994m7 1998m1 2001m7 2005m1 2008m7

FOMC Meeting Date

Min Median Max

Text Mining Lecture 5 10 / 45

Page 16: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Counter-Cyclical Topics0

.05

.1.1

5.2

Fra

ction o

f speaker’s tim

e in F

OM

C2

1987m7 1991m1 1994m7 1998m1 2001m7 2005m1 2008m7

FOMC Meeting Date

Min Median Max

0.0

5.1

.15

.2

Fra

ction o

f speaker’s tim

e in F

OM

C2

1987m7 1991m1 1994m7 1998m1 2001m7 2005m1 2008m7

FOMC Meeting Date

Min Median Max

Text Mining Lecture 5 11 / 45

Page 17: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Graphical Models

Consider a probabilistic model with joint distribution f (x) over the randomvariables x = (x1, . . . , xN).

In high-dimensional models, it is useful to summarize relationships amongrandom variables with directed graphs in which nodes represent randomvariables and links between nodes represent dependencies.

A node’s parents are the set of nodes that link to it; a node’s children are thethe set of nodes that it links to.

A Bayesian network is a probabilistic model whose joint distribution can berepresented by a directed acyclic graph (DAG).

The nodes in a DAG can be ordered so that parents precede children.

Text Mining Lecture 5 12 / 45

Page 18: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

LDA as a Bayesian Networkα

θ1

z1,1 z1,N1

w1,1 w1,N1

θD

zD,1 zD,ND

wD,1 wD,ND

β1, . . . ,βK

η

Text Mining Lecture 5 13 / 45

Page 19: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

LDA Plate Diagram

DNd

α

θd

zd ,n

wd ,n

βk

K

η

Text Mining Lecture 5 14 / 45

Page 20: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Graph Properties

Let B = (β1, . . . ,βK ) and T = (θ1, . . . ,θD).

Object Parents Childrenα ∅ Tθd α zdzd,n θd wd,n

wd,n zd,n,B ∅βk η w1, . . . ,wD

η ∅ B

Text Mining Lecture 5 15 / 45

Page 21: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Conditional Independence Property I

In a Bayesian network, nodes are independent of their ancestors conditional ontheir parents.

This means we can write f (x) =N∏i=1

f (xi | parents(xi )), which can greatly

simplify joint distributions.

Applying this formula to LDA yields(∏d

∏n

Pr [wd,n | zd,n,B ]

)(∏d

∏n

Pr [ zd,n | θd ]

)×(∏

d

Pr [θd | α ]

)(∏k

Pr [βk | η ]

)

Text Mining Lecture 5 16 / 45

Page 22: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Conditional Independence Property II

The Markov blanket MB(xi ) of a node xi in a Bayesian network is the set ofnodes consisting of xi ’s parents, children, and children’s parents.

Conditional on its Markov blanket, the node xi is independent of all nodesoutside its Markov blanket.

So f (xi | x−i ) has the same distribution as f (xi | MB(xi )).

Text Mining Lecture 5 17 / 45

Page 23: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Posterior Distribution

The inference problem in LDA is to compute the posterior distribution over z,T, and B given the data w and Dirichlet hyperparameters.

Let’s consider the simpler problem of inferring the latent variables taking theparameters as given. Posterior distribution is

Pr [ z = z ′ | w,T,B ] =Pr [w | z = z ′,T,B ]Pr [ z = z ′ | T,B ]∑z′ Pr [w | z = z ′,T,B ]Pr [ z = z ′ | T,B ]

.

We can compute the numerator easily, and each element of denominator.

But z ′ ∈ {1, . . . ,K}N ⇒ there are KN terms in the sum ⇒ intractable problem.

For example, a 100 word corpus with 50 topics has ≈ 7.88e169 terms.

Text Mining Lecture 5 18 / 45

Page 24: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Posterior Distribution

The inference problem in LDA is to compute the posterior distribution over z,T, and B given the data w and Dirichlet hyperparameters.

Let’s consider the simpler problem of inferring the latent variables taking theparameters as given. Posterior distribution is

Pr [ z = z ′ | w,T,B ] =Pr [w | z = z ′,T,B ]Pr [ z = z ′ | T,B ]∑z′ Pr [w | z = z ′,T,B ]Pr [ z = z ′ | T,B ]

.

We can compute the numerator easily, and each element of denominator.

But z ′ ∈ {1, . . . ,K}N ⇒ there are KN terms in the sum ⇒ intractable problem.

For example, a 100 word corpus with 50 topics has ≈ 7.88e169 terms.

Text Mining Lecture 5 18 / 45

Page 25: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Approximate Inference

Instead of obtaining a closed-form solution for the posterior distribution, wemust approximate it.

Markov chain Monte Carlo methods provide a stochastic approximation to thetrue posterior.

The general idea is to define a Markov chain whose stationary distribution isequivalent to the posterior distribution, which we then draw samples from.

There are several MCMC methods, but we will consider Gibbs sampling.

Text Mining Lecture 5 19 / 45

Page 26: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Gibbs Sampling

We want to draw samples from some joint distribution over x = (x1, . . . , xN)given by f (x) (e.g. a posterior distribution).

Suppose we can compute the conditional distribution fi ≡ f (xi | x−i ).

Then we can use the following algorithm:

1. Randomly allocate an initial value for x, say x0

2. Let S be the number of iterations to run chain. For each s ∈ {1, . . . ,S},draw x si according to

x si ∼ f (xi | x s1 , . . . , x si−1, xs−1i+1 , . . . , x

s−1N ).

3. Discard initial iterations (burn in), and collect samples from every mth(thinning interval) iteration thereafter.

4. Use collected samples to approximate joint distribution, or relateddistributions and moments.

Text Mining Lecture 5 20 / 45

Page 27: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Sampling Equations for θd

The Markov blanket of θd is:

The parent α.

The children zd .

So we need to draw samples from Pr [θd | α, zd ]. This is the posteriordistribution for θd given a fixed value for the vector of allocation variables zd .

We computed this posterior above. Let nd,k be the number of words indocument d that have topic allocation k .

Then Pr [θd | α, zd ] = Dir (α+ nd,1, . . . , α+ nd,K ).

Text Mining Lecture 5 21 / 45

Page 28: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Sampling Equations for βk

The Markov blanket of βk is:

The parent η.

The children w = (w1, . . . ,wD).

The children’s parents:1. z = (z1, . . . , zD).2. B−k .

Consider Pr [βk | η,w, z ]. Only the allocation variables assigned to k—andtheir associated words—are informative about βk .

Let mk,v be the number of times topic k allocation variables generate term v .

Then Pr [βk | η,w, z ] = Dir (η +mk,1, . . . , η +mk,V ).

Text Mining Lecture 5 22 / 45

Page 29: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Sampling Equations for Allocations

The Markov blanket of zd,n is:

The parent θd .

The child wd,n.

The child’s parents β1, . . . ,βK .

Pr [ zd,n = k | wd,n = v ,B,θd ] =

Pr [wd,n = v | zd,n = k ,B,θd ]Pr [ zd,n = k | B,θd ]∑k Pr [wd,n = v | zd,n = k ,B,θd ]Pr [ zd,n = k | B,θd ]

=θkdβ

vk∑

k θkdβ

vk

.

Text Mining Lecture 5 23 / 45

Page 30: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Summary

To complete one iteration of Gibbs sampling, we need to:

1. Sample from a multinomial distribution N times for the topic allocationvariables.

2. Sample from a Dirichlet D times for the document-specific mixingprobabilities.

3. Sample from a Dirichlet K times for the topic-specific term probabilities.

Sampling from these distributions is standard, and implemented in manyprogramming languages.

Text Mining Lecture 5 24 / 45

Page 31: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Collapsed Sampling

Collapsed sampling refers to analytically integrating out some variables in thejoint likelihood and sampling the remainder.

This tends to be more efficient because we reduce the dimensionality of thespace we sample from.

Griffiths and Steyvers (2004) proposed a collapsed sampler for LDA thatintegrates out the T and B terms and samples only z.

For details see Heinrich (2009) and technical appendix of Hansen, McMahon,and Prat (2015).

Text Mining Lecture 5 25 / 45

Page 32: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Collapsed Sampling Equation for LDA

The sampling equation for the nth allocation variable in document d is:

Pr[zd,n = k

∣∣ z−(d,n),w, α, η ] ∝ m−k,vd,n + η∑v m−k,v + ηV

(n−d,k + α

)where the − superscript denotes counts excluding (d , n) term.

Probability term n in document d is assigned to topic k is increasing in:1. The number of other terms in document d that are currently assigned to k .2. The number of other occurrences of the term vd,n in the entire corpus that

are currently assigned to k.

Both mean that terms that regularly co-occur in documents will be groupedtogether to form topics.

Property 1 means that terms within a document will tend to be groupedtogether into few topics rather than spread across many separate topics.

Text Mining Lecture 5 26 / 45

Page 33: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Collapsed Sampling Equation for LDA

The sampling equation for the nth allocation variable in document d is:

Pr[zd,n = k

∣∣ z−(d,n),w, α, η ] ∝ m−k,vd,n + η∑v m−k,v + ηV

(n−d,k + α

)where the − superscript denotes counts excluding (d , n) term.

Probability term n in document d is assigned to topic k is increasing in:1. The number of other terms in document d that are currently assigned to k .2. The number of other occurrences of the term vd,n in the entire corpus that

are currently assigned to k .

Both mean that terms that regularly co-occur in documents will be groupedtogether to form topics.

Property 1 means that terms within a document will tend to be groupedtogether into few topics rather than spread across many separate topics.

Text Mining Lecture 5 26 / 45

Page 34: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Predictive Distributions

Collapsed sampling gives the distribution of the allocation variables, but we alsocare about variables we integrated out.

Their predictive distributions are easy to form given topic assignments:

β̂k,v =mk,v + η∑V

v=1 (mk,v + η)and θ̂d,k =

nd,k + α∑Kk=1 (nd,k + α)

.

Text Mining Lecture 5 27 / 45

Page 35: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

LDA on Survey Data

Recall from the first lecture that text data is one instance of count data.

Although typically applied to natural language, LDA is in principle applicable toany count data.

We recently applied it to CEO survey data to estimate management “behaviors”with K = 2.

Can visualize results in terms of likelihood ratios of marginals over specific datafeatures.

Text Mining Lecture 5 28 / 45

Page 36: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Activity Type

Text Mining Lecture 5 29 / 45

Page 37: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Planning; Size; Number of Functions

Text Mining Lecture 5 30 / 45

Page 38: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Insiders vs Outsiders

Text Mining Lecture 5 31 / 45

Page 39: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Functions

Text Mining Lecture 5 32 / 45

Page 40: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Model Selection

There are three parameters to set to run the Gibbs sampling algorithm: numberof topics K and hyperparameters α, η.

Priors don’t receive too much attention in literature. Griffiths and Steyversrecommend η = 200/V and α = 50/K . Smaller values will tend to generatemore concentrated distributions. (See also Wallach et. al. 2009).

K is less clear. Two potential goals:

1. Predict text well. Statistical criteria to select K .

2. Interpretability. General versus specific.

Text Mining Lecture 5 33 / 45

Page 41: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Formalizing Interpretablility

Chang et. al. (2009) propose an objective way of determining whether topicsare indeed interpretable.

Two tests:1. Word intrusion. Form set of words consisting of top five words from topic k

+ word with low probability in topic k . Ask subjects to identify insertedword.

2. Topic intrusion. Show subjects a snippet of a document + top three topicsassociated to it + randomly drawn other topic. Ask to identify insertedtopic.

Estimate LDA and other topic models on NYT and Wikipedia articles forK = 50, 100, 150.

Text Mining Lecture 5 34 / 45

Page 42: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Results

Text Mining Lecture 5 35 / 45

Page 43: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Implications

Topics seem objectively interpretable in many contexts.

Tradeoff between goodness-of-fit and interpretablility, which is generally moreimportant in social science.

Potential development of statistical models in future to explicitly maximizeinterpretablility.

Text Mining Lecture 5 36 / 45

Page 44: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Chain Convergence and Selection

Determining when a chain has converged can be tricky.

One approach is to measure how well different states of the chain predict thedata, and determine convergence in terms of its stability.

Standard practice is run chains from different starting values, after which youcan select the best-fit chain for analysis.

For LDA, a typical goodness-of-fit measure is perplexity, given by

exp

−∑Dd=1

∑Vv=1 xd,v log

(∑Kk=1 θ̂d,k β̂k,v

)∑D

d=1 Nd

.

Text Mining Lecture 5 37 / 45

Page 45: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Perplexity 1

Text Mining Lecture 5 38 / 45

Page 46: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Perplexity 2

Text Mining Lecture 5 39 / 45

Page 47: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Out-of-Sample Documents

We are sometimes interested in obtaining the document-topic distribution forout-of-sample documents.

We can perform Gibbs sampling treating estimated topics as fixed

Pr[zd,n = k

∣∣ z−(d,n),wd , α, η]∝ β̂k,vd,n

[n−d,k + α

]for each out-of-sample document d .

Only 10-20 iterations necessary since topics already estimated.

Text Mining Lecture 5 40 / 45

Page 48: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Dictionary Methods + LDA

The terms in dictionaries come labeled, so can be seen as a type of supervisedapproach to information retrieval.

One can combine dictionary methods with the output of LDA to weight wordscounts by topic.

Recent application to minutes of the Federal Reserve to extract index ofeconomic situation and forward guidance.

First step is to run 15-topic model and identify two separate kinds of topic.

Text Mining Lecture 5 41 / 45

Page 49: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Example Topic

Text Mining Lecture 5 42 / 45

Page 50: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Monetary Measures of Tone

Contraction Expansiondecreas* increas*decelerat* accelerat*slow* fast*weak* strong*low* high*loss* gain*contract* expand*

Text Mining Lecture 5 43 / 45

Page 51: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Indices

Text Mining Lecture 5 44 / 45

Page 52: Text Mining for Economics and Finance Latent Dirichlet ... · Text Mining for Economics and Finance Latent Dirichlet Allocation StephenHansen Text Mining Lecture 5 1/45

Conclusion

Key ideas from this lecture:

1. Latent Dirichlet Allocation. Influential in computer science, many potentialapplications in economics.

2. Graphical models to simplify and visualize complex joint likelihoods.

3. Gibbs sampling to stochastically approximate posterior.

4. Interpretable output in several domains.

Topic for future: incorporate economic structure into LDA.

Text Mining Lecture 5 45 / 45