Finding scientific topics Tom Griffiths Stanford University Mark Steyvers UC Irvine.


Finding scientific topics

Tom Griffiths, Stanford University

Mark Steyvers, UC Irvine

Why map knowledge?

• Quickly grasp important themes in a new field

• Synthesize content of an existing field

• Discover targets for funding and research

INFORMATION OVERLOAD

Apoptosis + Plant Biology

Apoptosis + Medicine

probabilistic generative model

statistical inference

1. A generative model for documents

2. Discovering topics with Gibbs sampling

3. Results
– Topics and classes
– Mapping science
– Topic dynamics

4. Future directions
– Tagging abstracts

A generative model for documents

• Each document a mixture of topics

• Each word chosen from a single topic

• $P(w \mid z = j)$ from parameters $\phi^{(j)}$

• $P(z = j)$ from parameters $\theta^{(d)}$

(Blei, Ng, & Jordan, 2003)

A generative model for documents

topic 1: P(w|z = 1) = $\phi^{(1)}$
topic 2: P(w|z = 2) = $\phi^{(2)}$

w             topic 1   topic 2
HEART         0.2       0.0
LOVE          0.2       0.0
SOUL          0.2       0.0
TEARS         0.2       0.0
JOY           0.2       0.0
SCIENTIFIC    0.0       0.2
KNOWLEDGE     0.0       0.2
WORK          0.0       0.2
RESEARCH      0.0       0.2
MATHEMATICS   0.0       0.2

Choose mixture weights for each document, generate “bag of words”

$\theta$ = {P(z = 1), P(z = 2)}

{0, 1}: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK

{0.25, 0.75}: SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART

{0.5, 0.5}: MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART

{0.75, 0.25}: WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL

{1, 0}: TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
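The generative process on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the two toy topics from the previous slide; the names `TOPICS` and `generate_document` are hypothetical, and the mixture weights are fixed by hand rather than drawn from a Dirichlet prior as in full LDA.

```python
import random

# Topic-word distributions from the two-topic example (topic 1 and topic 2).
TOPICS = {
    1: {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2},
    2: {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
        "RESEARCH": 0.2, "MATHEMATICS": 0.2},
}


def generate_document(theta, n_words, rng):
    """Draw a bag of words: pick a topic z per token, then a word from that topic.

    theta is (P(z = 1), P(z = 2)); here it is supplied by hand (an assumption).
    """
    words = []
    for _ in range(n_words):
        z = rng.choices([1, 2], weights=theta)[0]
        vocab = list(TOPICS[z])
        probs = [TOPICS[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return words


rng = random.Random(0)
for theta in [(0, 1), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1, 0)]:
    print(theta, " ".join(generate_document(theta, 10, rng)))
```

With weights {0, 1} every word comes from topic 2, and with {1, 0} every word comes from topic 1; the intermediate weights give mixed documents.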

A generative model for documents

• Called Latent Dirichlet Allocation (LDA)

• Introduced by Blei, Ng, and Jordan (2003), reinterpretation of PLSI (Hofmann, 2001)

(Graphical model: topic assignments z, each generating a word w)

SVD (Dumais, Landauer): [words × documents] = U [words × dims] · D [dims × dims] · V [dims × documents]

LDA: P(w) [words × documents] = P(w|z) [words × topics] · P(z) [topics × documents]


Inverting the generative model

• Maximum likelihood estimation (EM)

• Variational EM (Blei, Ng & Jordan, 2003)

• Bayesian inference

Bayesian inference

$$P(\mathbf{z} \mid \mathbf{w}) = \frac{P(\mathbf{w}, \mathbf{z})}{\sum_{\mathbf{z}} P(\mathbf{w}, \mathbf{z})}$$

• Sum in the denominator over $T^n$ terms

• Full posterior only tractable up to a normalizing constant


Markov chain Monte Carlo

• Sample from a Markov chain which converges to target distribution

• Allows sampling from an unnormalized posterior distribution

• Can compute approximate statistics from intractable distributions
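To make the second bullet concrete, here is a minimal sketch of sampling from an unnormalized distribution. It uses a Metropolis sampler with a uniform proposal, which is a different MCMC method from the Gibbs sampler used in this talk, chosen only because it is short. The weights and the function name `metropolis` are hypothetical.

```python
import random


def metropolis(unnormalized, states, n_steps, rng):
    """Metropolis sampler with a uniform (symmetric) proposal over a finite state set."""
    current = rng.choice(states)
    samples = []
    for _ in range(n_steps):
        proposal = rng.choice(states)
        # Accept with probability min(1, p(proposal) / p(current)). Only the
        # ratio matters, so the normalizing constant is never computed.
        if rng.random() * unnormalized(current) < unnormalized(proposal):
            current = proposal
        samples.append(current)
    return samples


# Hypothetical unnormalized weights; the normalized probabilities are 1/6, 2/6, 3/6.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
rng = random.Random(0)
samples = metropolis(weights.get, list(weights), 20000, rng)
kept = samples[1000:]  # discard an initial burn-in
freq = {s: kept.count(s) / len(kept) for s in weights}
print(freq)
```

The empirical frequencies approximate the normalized probabilities even though the sampler only ever evaluates ratios of unnormalized weights.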

A visual example: Bars

pixel = word, image = document

sample each pixel from a mixture of topics

Interpretable decomposition

• SVD gives a basis for the data, but not an interpretable one

• The true basis is not orthogonal, so rotation does no good

Bayesian model selection

• How many topics do we need?

• A Bayesian would consider the posterior:

$$P(T \mid \mathbf{w}) \propto P(\mathbf{w} \mid T)\, P(T)$$

• Computing $P(\mathbf{w} \mid T)$ involves summing over assignments z
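Once estimates of log P(w|T) are available for several values of T, turning them into a posterior over T is a small normalization step. The sketch below assumes a uniform prior over the candidate values unless one is supplied, and the log-evidence numbers are made up for illustration; they are not results from the talk. It does not address the hard part, which is estimating P(w|T) itself.

```python
import math


def posterior_over_T(log_evidence, log_prior=None):
    """Return P(T | w) from log P(w | T), using log-sum-exp for numerical stability."""
    Ts = sorted(log_evidence)
    if log_prior is None:
        # Uniform prior over the candidate values of T (an assumption).
        log_prior = {T: 0.0 for T in Ts}
    log_joint = [log_evidence[T] + log_prior[T] for T in Ts]
    m = max(log_joint)
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint))
    return {T: math.exp(v - log_norm) for T, v in zip(Ts, log_joint)}


# Hypothetical log P(w | T) values, for illustration only.
log_evidence = {10: -1050.0, 100: -1000.0, 300: -1003.0}
post = posterior_over_T(log_evidence)
print(post)
```

Working in log space matters here because corpus-level likelihoods are far too small to represent directly as floating-point probabilities.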

Bayesian model selection

(Figure: P(w|T) against corpus (w), for T = 10 and T = 100)

Back to the bars


Corpus preprocessing

• Used all D = 28,154 abstracts from 1991-2001

• Used any word occurring in at least five abstracts, not on “stop” list (W = 20,551)

• Segmentation by any delimiting character, total of n = 3,026,970 word tokens in corpus

• Also, PNAS class designations for 2001 (thanks to Kevin Boyack)

Running the algorithm

• Memory requirements linear in T(W+D), runtime proportional to nT

• T = 50, 100, 200, 300, 400, 500, 600, (1000)

• Ran 8 chains for each T, burn-in of 1000 iterations, 10 samples/chain at a lag of 100

• All runs completed in under 30 hours on BlueHorizon supercomputer at San Diego
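The scaling claims above can be checked with a little arithmetic on the corpus sizes from the preprocessing slide. This is a rough sketch: the function names are hypothetical, the counts are table entries rather than bytes, and the iteration total assumes the first retained sample is taken right after burn-in, which the slide does not state.

```python
D = 28_154        # abstracts
W = 20_551        # vocabulary size
n = 3_026_970     # word tokens
topic_counts = [50, 100, 200, 300, 400, 500, 600, 1000]


def count_cells(T):
    """Entries in the word-topic (W x T) and document-topic (D x T) count tables."""
    return T * (W + D)


def updates_per_sweep(T):
    """Each of the n tokens evaluates T topic probabilities in one Gibbs sweep."""
    return n * T


# Burn-in plus the lags between the 10 retained samples (assumes the first
# sample is taken immediately after burn-in).
iterations_per_chain = 1000 + (10 - 1) * 100
samples_per_T = 8 * 10  # 8 chains, 10 samples each

for T in topic_counts:
    print(T, count_cells(T), updates_per_sweep(T))
print("samples per T:", samples_per_T, "iterations per chain:", iterations_per_chain)
```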


How many topics?


A selection of topics

FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL

HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL

MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE

STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE

NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC

TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

A selection of topics

PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS

ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS

CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES

MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY

STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST

MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN

MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS


Topics and classes

• PNAS authors provide class designations
– major: Biological, Physical, Social Sciences
– minor: 33 separate disciplines*

• Find topics diagnostic of classes
– validate “reality” of classes
– show topics pick out meaningful structure (classes, and the relations between them)

Topic 210: SYNAPTIC NEURONS POSTSYNAPTIC HIPPOCAMPAL SYNAPSES LTP PRESYNAPTIC TRANSMISSION POTENTIATION PLASTICITY EXCITATORY RELEASE DENDRITIC PYRAMIDAL HIPPOCAMPUS DENDRITES CA1 STIMULATION TERMINALS SYNAPSE

Topic 201: RESISTANCE RESISTANT DRUG DRUGS SENSITIVE MDR MULTIDRUG SUSCEPTIBLE SELECTED GLYCOPROTEIN SENSITIVITY PGP AGENTS CONFERS MDR1 CYTOTOXIC CONFERRED CHEMOTHERAPEUTIC EFFLUX INCREASED

Topic 280: SPECIES SELECTION EVOLUTION GENETIC POPULATIONS POPULATION VARIATION NATURAL EVOLUTIONARY FITNESS ADAPTIVE RATES THEORY TRAITS DIVERSITY EXPECTED NEUTRAL EVOLVED COMPETITION HISTORY

Topic 222: CORTEX BRAIN SUBJECTS TASK AREAS REGIONS FUNCTIONAL LEFT MEMORY TEMPORAL IMAGING PREFRONTAL CEREBRAL TASKS FRONTAL AREA TOMOGRAPHY EMISSION POSITRON CORTICAL

Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE EARTH ECOLOGICAL CHANGE TIME ECOSYSTEM

Topic 39: THEORY TIME SPACE GIVEN PROBLEM SHAPE SIMPLE DIMENSIONAL PAPER NUMBER CASE LOCAL TERMS SYMMETRY RANDOM EQUATION CLASSICAL COMPLEXITY NUMERICAL PROPERTIES


Mapping science

• Topics provide dimensionality reduction

• Some applications require visualization (and even lower dimensionality)

• Low-dimensional representation from methods for analysis of compositional data


Topic dynamics

• We have the distribution over topics for abstracts from 1991 to 2001

• Analysis of dynamics:
– perform linear trend analysis for each topic
– “hot topics” go up, “cold topics” go down
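A linear trend analysis of this kind can be sketched as an ordinary least-squares fit of each topic's yearly weight against the year, with the sign of the slope separating hot from cold. The series below are invented for illustration, and `linear_trend` is a hypothetical helper name; the talk does not specify how the yearly values were aggregated or whether a significance test was applied.

```python
def linear_trend(years, values):
    """Ordinary least-squares slope and intercept of values against years."""
    count = len(years)
    mean_x = sum(years) / count
    mean_y = sum(values) / count
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


years = list(range(1991, 2002))
# Hypothetical mean topic proportions per year, for illustration only.
theta_by_year = {
    "rising": [0.002 + 0.0004 * i for i in range(len(years))],
    "falling": [0.008 - 0.0005 * i for i in range(len(years))],
}
for name, series in theta_by_year.items():
    slope, _ = linear_trend(years, series)
    print(name, "hot" if slope > 0 else "cold")
```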

Hot topics

Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE

Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED

Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION

Cold topics

Topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY

Topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED

Topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING


Future directions

• Including different kinds of knowledge
– citations (Hofmann & Cohn, 2001)
– author, title, keywords, other fields
– word order information

• An example: scientific syntax and semantics

Scientific syntax and semantics

Factorization of language based on statistical dependency patterns:

• semantics: probabilistic topics (long-range, document-specific dependencies)

• syntax: probabilistic regular grammar (short-range dependencies, constant across all documents)

(Graphical model: topic assignments z, words w, and class assignments x)

(Figure: a composite model with three classes, x = 1, x = 2, and x = 3. One class is semantic and emits a word from a topic; the other two are syntactic and emit function words. The arcs between classes carry transition probabilities 0.9, 0.1, 0.2, 0.8, 0.7, 0.3.)

Topics in the semantic class:

z = 1 (0.4): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2

z = 2 (0.6): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2

Syntactic classes:

THE 0.6, A 0.3, MANY 0.1

OF 0.6, FOR 0.3, BETWEEN 0.1

Generated sequence, one word per step:

THE …

THE LOVE …

THE LOVE OF …

THE LOVE OF RESEARCH …
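The walk through "THE LOVE OF RESEARCH" can be imitated with a small simulation. This is a sketch under loud assumptions: the emission probabilities come from the slides, but the class labels `DET`, `SEM`, and `PREP` are hypothetical names, and the assignment of the six transition probabilities to particular arcs is a guess, because the figure's arcs are not recoverable from the text.

```python
import random

TOPIC_WEIGHTS = {1: 0.4, 2: 0.6}
TOPICS = {
    1: {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2},
    2: {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
        "RESEARCH": 0.2, "MATHEMATICS": 0.2},
}
SYNTACTIC = {
    "DET": {"THE": 0.6, "A": 0.3, "MANY": 0.1},
    "PREP": {"OF": 0.6, "FOR": 0.3, "BETWEEN": 0.1},
}
# ASSUMPTION: which arc carries which probability is not given in the text.
TRANSITIONS = {
    "DET": {"SEM": 0.9, "PREP": 0.1},
    "SEM": {"PREP": 0.8, "DET": 0.2},
    "PREP": {"DET": 0.7, "SEM": 0.3},
}


def draw(dist, rng):
    """Sample one key of dist with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(n_words, rng, start="DET"):
    """Emit n_words by walking the class chain; the semantic class picks a topic first."""
    state, words = start, []
    for _ in range(n_words):
        if state == "SEM":
            z = draw(TOPIC_WEIGHTS, rng)
            words.append(draw(TOPICS[z], rng))
        else:
            words.append(draw(SYNTACTIC[state], rng))
        state = draw(TRANSITIONS[state], rng)
    return words


print(" ".join(generate(8, random.Random(0))))
```

In the full model the topic weights would vary by document while the class transitions stay fixed across documents, which is the long-range versus short-range split described earlier.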

Semantic topics

Topic 29: AGE LIFE AGING OLD YOUNG CRE AGED SENESCENCE MORTALITY AGES CR INFANTS SPAN MEN WOMEN SENESCENT LOXP INDIVIDUALS CHILDREN NORMAL

Topic 46: SELECTION POPULATION SPECIES POPULATIONS GENETIC EVOLUTION SIZE NATURAL VARIATION FITNESS MUTATION PER NUCLEOTIDE RATES RATE HYBRID DIVERSITY SUBSTITUTION SPECIATION EVOLUTIONARY

Topic 51: LOCI LOCUS ALLELES ALLELE GENETIC LINKAGE POLYMORPHISM CHROMOSOME MARKERS SUSCEPTIBILITY ALLELIC POLYMORPHIC POLYMORPHISMS RESTRICTION FRAGMENT HAPLOTYPE GENE LENGTH DISEASE MICROSATELLITE

Topic 71: TUMOR CANCER TUMORS BREAST HUMAN CARCINOMA PROSTATE MELANOMA CANCERS NORMAL COLON LUNG APC MAMMARY CARCINOMAS MALIGNANT CELL GROWTH METASTATIC EPITHELIAL

Topic 115: MALE FEMALE MALES FEMALES SPERM SEX SEXUAL MATING REPRODUCTIVE OFFSPRING PHEROMONE SOCIAL EGG BEHAVIOR EGGS FERTILIZATION MATERNAL PATERNAL FERTILITY GERM

Topic 125: MEMORY LEARNING BRAIN TASK CORTEX SUBJECTS LEFT RIGHT SONG TASKS HIPPOCAMPAL PERFORMANCE SPATIAL PREFRONTAL COGNITIVE TRAINING TOMOGRAPHY FRONTAL MOTOR EMISSION

Syntactic classes

Class 5: IN FOR ON BETWEEN DURING AMONG FROM UNDER WITHIN THROUGHOUT THROUGH TOWARD INTO AT INVOLVING AFTER ACROSS AGAINST WHEN ALONG

Class 8: ARE WERE WAS IS WHEN REMAIN REMAINS REMAINED PREVIOUSLY BECOME BECAME BEING BUT GIVE MERE APPEARED APPEAR ALLOWED NORMALLY EACH

Class 14: THE THIS ITS THEIR AN EACH ONE ANY INCREASED EXOGENOUS OUR RECOMBINANT ENDOGENOUS TOTAL PURIFIED TILE FULL CHRONIC ANOTHER EXCESS

Class 25: SUGGEST INDICATE SUGGESTING SUGGESTS SHOWED REVEALED SHOW DEMONSTRATE INDICATING PROVIDE SUPPORT INDICATES PROVIDES INDICATED DEMONSTRATED SHOWS SO REVEAL DEMONSTRATES SUGGESTED

Class 26: LEVELS NUMBER LEVEL RATE TIME CONCENTRATIONS VARIETY RANGE CONCENTRATION DOSE FAMILY SET FREQUENCY SERIES AMOUNTS RATES CLASS VALUES AMOUNT SITES

Class 30: RESULTS ANALYSIS DATA STUDIES STUDY FINDINGS EXPERIMENTS OBSERVATIONS HYPOTHESIS ANALYSES ASSAYS POSSIBILITY MICROSCOPY PAPER WORK EVIDENCE FINDING MUTAGENESIS OBSERVATION MEASUREMENTS

Class 33: BEEN MAY CAN COULD WELL DID DOES DO MIGHT SHOULD WILL WOULD MUST CANNOT THEY ALSO BECOME MAG LIKELY

Abstract tagging

• Highlight important words in text, to reduce demands on information users

• Can be done to identify different content:
– words assigned to most prevalent topic reveal important themes (see the paper!)
– with syntactic/semantic factorization, we can highlight words that determine semantic content

(PNAS, 1991, vol. 88, 4874-4876)

(the number after each ^ is the topic or syntactic class assigned to that word)

A^23 generalized^49 fundamental^11 theorem^20 of^4 natural^46 selection^46 is^32 derived^17 for^5 populations^46 incorporating^22 both^39 genetic^46 and^37 cultural^46 transmission^46. The^14 phenotype^15 is^32 determined^17 by^42 an^23 arbitrary^49 number^26 of^4 multiallelic^52 loci^40 with^22 two^39-factor^148 epistasis^46 and^37 an^23 arbitrary^49 linkage^11 map^20, as^43 well^33 as^43 by^42 cultural^46 transmission^46 from^22 the^14 parents^46. Generations^46 are^8 discrete^49 but^37 partially^19 overlapping^24, and^37 mating^46 may^33 be^44 nonrandom^17 at^9 either^39 the^14 genotypic^46 or^37 the^14 phenotypic^46 level^46 (or^37 both^39). I^12 show^34 that^47 cultural^46 transmission^46 has^18 several^39 important^49 implications^6 for^5 the^14 evolution^46 of^4 population^46 fitness^46, most^36 notably^4 that^47 there^41 is^32 a^23 time^26 lag^7 in^22 the^14 response^28 to^31 selection^46 such^9 that^47 the^14 future^137 evolution^46 depends^29 on^21 the^14 past^24 selection^46 history^46 of^4 the^14 population^46.

(graylevel = “semanticity”, the probability of using LDA over HMM)


(PNAS, 1996, vol. 93, 14628-14631)

The^14 ''shape^7'' of^4 a^23 female^115 mating^115 preference^125 is^32 the^14 relationship^7 between^4 a^23 male^115 trait^15 and^37 the^14 probability^7 of^4 acceptance^21 as^43 a^23 mating^115 partner^20. The^14 shape^7 of^4 preferences^115 is^32 important^49 in^5 many^39 models^6 of^4 sexual^115 selection^46, mate^115 recognition^125, communication^9, and^37 speciation^46, yet^50 it^41 has^18 rarely^19 been^33 measured^17 precisely^19. Here^12 I^9 examine^34 preference^7 shape^7 for^5 male^115 calling^115 song^125 in^22 a^23 bushcricket*^13 (katydid*^48). Preferences^115 change^46 dramatically^19 between^22 races^46 of^4 a^23 species^15, from^22 strongly^19 directional^11 to^31 broadly^19 stabilizing^45 (but^50 with^21 a^23 net^49 directional^46 effect^46). Preference^115 shape^46 generally^19 matches^10 the^14 distribution^16 of^4 the^14 male^115 trait^15. This^41 is^32 compatible^29 with^21 a^23 coevolutionary^46 model^20 of^4 signal^9-preference^115 evolution^46, although^50 it^41 does^33 nor^37 rule^20 out^17 an^23 alternative^11 model^20, sensory^125 exploitation^150. Preference^46 shapes^40 are^8 shown^35 to^31 be^44 genetic^11 in^5 origin^7.

Conclusion

• Probabilistic generative models can reveal the structure of knowledge domains

• We can use these models to
– identify important themes
– synthesize content
– discover targets for funding and research
– reduce the demands on information users

Gibbs sampling

For variables $\mathbf{z} = z_1, z_2, \ldots, z_n$

Draw $z_i^{(t+1)}$ from $P(z_i \mid \mathbf{z}_{-i}, \mathbf{w})$

$\mathbf{z}_{-i} = z_1^{(t+1)}, z_2^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_n^{(t)}$

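Gibbs sampling

• Need full conditional distributions for variables

• Since we only sample z we need

$$P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$

where $n^{(w_i)}_{-i,j}$ is the number of times word $w_i$ is assigned to topic j, $n^{(d_i)}_{-i,j}$ is the number of times topic j is used in document $d_i$ (both counts exclude the current assignment $z_i$), and $\alpha$, $\beta$ are the Dirichlet hyperparameters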
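The sampler described here can be sketched compactly using only the two count tables named above. This is a minimal illustration, not the authors' implementation: it assumes the standard collapsed form with symmetric Dirichlet hyperparameters `alpha` and `beta`, documents given as lists of integer word ids, and a hypothetical function name `gibbs_lda`. The per-document denominator is dropped because it is the same for every topic.

```python
import random


def gibbs_lda(docs, T, W, alpha, beta, n_iter, rng):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids in [0, W)."""
    n_wt = [[0] * T for _ in range(W)]   # times word w is assigned to topic j
    n_dt = [[0] * T for _ in docs]       # times topic j is used in document d
    n_t = [0] * T                        # total tokens assigned to topic j
    z = []
    for d, doc in enumerate(docs):       # random initial assignments
        z_d = []
        for w in doc:
            j = rng.randrange(T)
            z_d.append(j)
            n_wt[w][j] += 1
            n_dt[d][j] += 1
            n_t[j] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # Remove the current assignment so the counts exclude token i.
                n_wt[w][j] -= 1
                n_dt[d][j] -= 1
                n_t[j] -= 1
                # Unnormalized full conditional for each candidate topic k.
                weights = [
                    (n_wt[w][k] + beta) / (n_t[k] + W * beta) * (n_dt[d][k] + alpha)
                    for k in range(T)
                ]
                j = rng.choices(range(T), weights=weights)[0]
                z[d][i] = j
                n_wt[w][j] += 1
                n_dt[d][j] += 1
                n_t[j] += 1
    return z, n_wt, n_dt


vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
docs = [
    [9, 6, 8, 7, 9, 8, 7, 5, 9, 7],
    [4, 1, 2, 3, 1, 3, 2, 2, 3, 4],
    [9, 0, 8, 1, 9, 7, 3, 2, 6, 0],
]
z, n_wt, n_dt = gibbs_lda(docs, T=2, W=len(vocab), alpha=1.0, beta=0.1,
                          n_iter=200, rng=random.Random(0))
for j in range(2):
    print(j, [vocab[w] for w in range(len(vocab)) if n_wt[w][j] > 0])
```

The hyperparameter values and the toy corpus are arbitrary choices for the sketch. Memory is dominated by the two count tables, which matches the T(W+D) scaling quoted in the talk.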

Gibbs sampling

                           z_i at iteration
i     w_i           d_i    1    2    …    1000
1     MATHEMATICS   1      2    2         2
2     KNOWLEDGE     1      2    1         2
3     RESEARCH      1      1    1         2
4     WORK          1      2    2         1
5     MATHEMATICS   1      1    2         2
6     RESEARCH      1      2    2         2
7     WORK          1      2    2         2
8     SCIENTIFIC    1      1    1         1
9     MATHEMATICS   1      2    2         2
10    WORK          1      1    2         2
11    SCIENTIFIC    2      1    1         2
12    KNOWLEDGE     2      1    2         2
...   ...           ...    ...  ...       ...
50    JOY           5      2    1         1

Each iteration resamples the assignments one word at a time, in order: at iteration 2, z_1 = 2, then z_2 = 1, z_3 = 1, z_4 = 2, and so on.