A Comparison of Different Strategies for Automated Semantic Document Annotation


Transcript of A Comparison of Different Strategies for Automated Semantic Document Annotation

Page 1: A Comparison of Different Strategies for Automated Semantic Document Annotation

Chifumi Nishioka, [email protected], K-CAP 2015

Gregor Große-Bölting, Chifumi Nishioka, Ansgar Scherp

A Comparison of Different Strategies for Automated Semantic Document Annotation

Page 2: A Comparison of Different Strategies for Automated Semantic Document Annotation


Motivation [1/2]

• Document annotation
  – Helps users and search engines find documents
  – Requires a huge amount of human effort
  – e.g., subject indexers at ZBW have labeled 1.6 million scientific documents in economics

• Semantic document annotation
  – Documents are annotated with semantic entities
  – e.g., PubMed with MeSH, ACM DL with ACM CCS

Focus on semantic document annotation

Necessity of automated document annotation

Page 3: A Comparison of Different Strategies for Automated Semantic Document Annotation


Motivation [2/2]

• Small-scale experiments so far
  – Compare only a small number of strategies
  – Use datasets containing a few hundred documents

• Comparison of 43 strategies for document annotation within our experiment framework
  – The largest number of strategies compared so far

• Experiments with three datasets from different domains
  – Contain the full texts of 100,000 documents annotated by subject indexers
  – The largest dataset of scientific publications used for this task

We conducted the largest-scale experiment so far

Page 4: A Comparison of Different Strategies for Automated Semantic Document Annotation


Experiment Framework

Strategies are composed of methods from concept extraction, concept activation, and annotation selection

1. Concept Extraction: detect concepts (candidate annotations) in each document

2. Concept Activation: compute a score for each concept of a document

3. Annotation Selection: select annotations from the concepts of each document

4. Evaluation: measure the performance of strategies against the ground truth
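A minimal sketch of how these steps compose into one strategy; this is not the authors' implementation, and all function and variable names below are hypothetical:

```python
# Minimal sketch of the experiment framework; all names are hypothetical and
# only illustrate how the three steps plus evaluation compose.
from typing import Callable, Dict, List, Set

ExtractFn  = Callable[[str], List[str]]               # document text -> candidate concepts
ActivateFn = Callable[[List[str]], Dict[str, float]]  # concepts -> concept scores
SelectFn   = Callable[[Dict[str, float]], Set[str]]   # concept scores -> chosen annotations

def run_strategy(doc: str, extract: ExtractFn, activate: ActivateFn,
                 select: SelectFn) -> Set[str]:
    """One strategy = one extraction x one activation x one selection method."""
    concepts = extract(doc)          # 1. concept extraction
    scores = activate(concepts)      # 2. concept activation
    return select(scores)            # 3. annotation selection

def f_measure(predicted: Set[str], gold: Set[str]) -> float:
    """4. evaluation against the ground-truth annotations of the subject indexers."""
    if not predicted or not gold:
        return 0.0
    p = len(predicted & gold) / len(predicted)
    r = len(predicted & gold) / len(gold)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0
```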

Page 5: A Comparison of Different Strategies for Automated Semantic Document Annotation


Research Questions

• Research questions addressed with the experiment framework:

(I) Which strategy performs best?

(II) Which concept extraction method performs best?

(III) Which concept activation method performs best?

(IV) Which annotation selection method performs best?

Page 6: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Extraction [1/2]

• Entity
  – Extract entities from documents using a domain-specific knowledge base
  – Domain-specific knowledge base:
    • Entities (subjects) of a specific domain (e.g., medicine)
    • One or more labels for each entity
    • Relationships between entities
  – Detect entities by string matching against entity labels

• Tri-gram
  – Extract contiguous sequences of one, two, and three words from a document
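A minimal sketch of the Tri-gram extraction step; the tokenizer is a simplifying assumption, not the one used in the paper:

```python
# Minimal Tri-gram extraction sketch; the naive tokenizer is an assumption.
import re
from typing import List

def ngrams(text: str, max_n: int = 3) -> List[str]:
    """Extract contiguous word sequences of length 1..max_n (uni-, bi-, tri-grams)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

print(ngrams("The financial crisis hit banks"))  # e.g. 'financial crisis', 'crisis hit banks', ...
```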

Page 7: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Extraction [2/2]

• RAKE (Rapid Automatic Keyword Extraction) [Rose et al. 10]
  – Unsupervised method for extracting keywords
  – Incorporates co-occurrence and frequency of words

• LDA (Latent Dirichlet Allocation) [Blei et al. 03]
  – Unsupervised topic modeling method for inferring latent topics in a document corpus
  – Topic model:
    • Topic: a probability distribution over words
    • Document: a probability distribution over topics
  – Treats a topic as a concept
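A minimal sketch of treating LDA topics as concepts, using scikit-learn rather than the authors' tooling; the toy corpus and the number of topics are assumptions:

```python
# Minimal LDA-as-concepts sketch using scikit-learn (not the authors' tooling);
# n_components and the toy corpus are assumptions for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["banks raise the interest rate during a financial crisis",
        "the central bank lowers the interest rate",
        "keyword extraction from scientific documents"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)        # each row: probability distribution over topics

# Each topic is treated as one concept; its probability is the concept's score.
for topic_id, prob in enumerate(doc_topics[0]):
    print(f"doc 0, topic/concept {topic_id}: score {prob:.2f}")
```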

Page 8: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [1/6]

• Three types of concept activation methods
  – Statistical methods
    • Baseline
    • Use only directly mentioned concepts
  – Hierarchy-based methods
    • Reveal concepts that are not mentioned explicitly, using a hierarchical knowledge base
  – Graph-based methods
    • Use only directly mentioned concepts
    • Represent concept co-occurrences as a graph

Example concept sequence of a document: Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate

[Figure: co-occurrence graph over the concepts Tax, Bank, Interest Rate, Financial Crisis, Central Bank]

Page 9: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [2/6]

• Statistical methods
  – Frequency
    • Depends on the Concept Extraction method:
      – the number of appearances (Entity and Tri-gram)
      – the score output by RAKE (RAKE)
      – the probability of a topic for a document (LDA)

    score_freq(c, d) = freq(c, d)

  – CF-IDF [Goossen et al. 11]
    • An extension of TF-IDF that replaces words with concepts
    • Lower scores for concepts that appear in many documents

    score_cfidf(c, d) = cf(c, d) · log( |D| / |{d ∈ D : c ∈ d}| )
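A minimal CF-IDF sketch following the formula above; the toy corpus is an assumption:

```python
# Minimal CF-IDF sketch following score_cfidf(c, d) = cf(c, d) * log(|D| / df(c));
# the toy corpus is an assumption.
import math
from collections import Counter
from typing import Dict, List

def cf_idf(doc_concepts: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    cf = Counter(doc_concepts)                       # concept frequency in this document
    n_docs = len(corpus)
    df = Counter(c for d in corpus for c in set(d))  # document frequency of each concept
    return {c: cf[c] * math.log(n_docs / df[c]) for c in cf}

corpus = [["bank", "interest rate", "bank", "tax"],
          ["tax", "income"],
          ["bank", "financial crisis"]]
print(cf_idf(corpus[0], corpus))   # 'interest rate' scores highest, 'bank'/'tax' lower
```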

Page 10: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [3/6]

• Hierarchy-based methods [1/2]
  – Base Activation

    score_base(c, d) = freq(c, d) + λ · Σ_{c_i ∈ C_l(c)} score_base(c_i, d)

    • C_l(c): the set of child concepts of a concept c
    • λ: decay parameter
    • e.g., see the concept hierarchy below and the sketch that follows

[Figure: example concept hierarchy with "World Wide Web" at the top, "Web Searching" and "Web Mining" one level below, and "Social Recommendation", "Social Tagging", "Site Wrapping", and "Web Log Analysis" as leaf concepts; three concepts are marked c1, c2, c3]
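A minimal Base Activation sketch; the toy hierarchy, the children lookup, and the value of λ are assumptions:

```python
# Minimal Base Activation sketch; the hierarchy and the lambda value are assumptions.
from typing import Dict, List

hierarchy: Dict[str, List[str]] = {          # parent -> child concepts (toy hierarchy)
    "World Wide Web": ["Web Searching", "Web Mining"],
    "Web Searching": ["Social Recommendation", "Social Tagging"],
    "Web Mining": ["Site Wrapping", "Web Log Analysis"],
}

def score_base(c: str, freq: Dict[str, int], lam: float = 0.5) -> float:
    """score_base(c, d) = freq(c, d) + lam * sum(score_base(child, d))."""
    children = hierarchy.get(c, [])
    return freq.get(c, 0) + lam * sum(score_base(ch, freq, lam) for ch in children)

freq = {"Social Tagging": 3, "Web Log Analysis": 1}      # concepts found in one document
print(score_base("World Wide Web", freq))  # the parent gets activated via its descendants
```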

Page 11: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [4/6]

• Hierarchy-based methods [2/2]
  – Branch Activation

    score_branch(c, d) = freq(c, d) + λ · B_N · Σ_{c_i ∈ C_l(c)} score_branch(c_i, d)

    • B_N: reciprocal of the number of concepts located one level above a concept

  – OneHop Activation

    score_onehop(c, d) = freq(c, d)                                      if |C_l(c) ∩ C_d| ≥ 2
    score_onehop(c, d) = freq(c, d) + λ · Σ_{c_i ∈ C_l(c)} freq(c_i, d)  otherwise

    • C_d: the set of concepts in a document d
    • Activates concepts within a maximum distance of one hop
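A minimal Branch Activation sketch on the same kind of toy hierarchy; how B_N is derived (from an assumed parent lookup) and the value of λ are assumptions:

```python
# Minimal Branch Activation sketch; parents(), children() and lambda are assumptions.
from typing import Dict, List

children: Dict[str, List[str]] = {
    "World Wide Web": ["Web Searching", "Web Mining"],
    "Web Searching": ["Social Recommendation", "Social Tagging"],
}
parents: Dict[str, List[str]] = {             # inverse of the children relation
    ch: [p] for p, chs in children.items() for ch in chs
}

def score_branch(c: str, freq: Dict[str, int], lam: float = 0.5) -> float:
    """score_branch(c, d) = freq(c, d) + lam * B_N * sum(score_branch(child, d)),
    where B_N is taken here as the reciprocal of the number of parents of c."""
    b_n = 1.0 / len(parents.get(c, [None]))   # concepts without parents: B_N = 1
    kids = children.get(c, [])
    return freq.get(c, 0) + lam * b_n * sum(score_branch(k, freq, lam) for k in kids)

freq = {"Social Tagging": 2}
print(score_branch("World Wide Web", freq))
```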

Page 12: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [5/6]

• Graph-based methods [1/2]
  – Degree [Zouaq et al. 12]

    score_degree(c, d) = degree(c, d)

    • degree(c, d): the number of edges linked to the concept c in the co-occurrence graph of d
    • e.g., see the co-occurrence graph below

  – HITS [Kleinberg 99; Zouaq et al. 12]
    • Link analysis algorithm for search engines [Kleinberg 99]

    score_hits(c, d) = hub(c, d) + auth(c, d)

[Figure: co-occurrence graph over the concepts Tax, Bank, Interest Rate, Financial Crisis, Central Bank]
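A minimal sketch of the graph-based scores; networkx and the co-occurrence window (adjacent concepts in the document's concept sequence) are assumptions, not prescribed by the paper:

```python
# Minimal graph-based activation sketch; networkx and the co-occurrence window
# (adjacent concepts in the concept sequence) are assumptions.
import networkx as nx

concepts = ["Bank", "Interest Rate", "Financial Crisis", "Bank",
            "Central Bank", "Tax", "Interest Rate"]

G = nx.Graph()
G.add_edges_from(zip(concepts, concepts[1:]))   # edge between co-occurring concepts

degree_scores = dict(G.degree())                # score_degree(c, d) = degree(c, d)
hubs, authorities = nx.hits(G)                  # score_hits(c, d) = hub + authority
hits_scores = {c: hubs[c] + authorities[c] for c in G}

print(sorted(degree_scores.items(), key=lambda kv: -kv[1]))
```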

Page 13: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [6/6]

• Graph-based methods [2/2]
  – PageRank [Page et al. 99; Mihalcea & Paul 04]
    • Link analysis algorithm for search engines
    • Based on the intuition that a node linked from many important nodes is itself more important

    score_page(c, d) = (1 − μ) + μ · Σ_{c_i ∈ C_in(c)} score_page(c_i, d) / |C_out(c_i)|

    • C_in(c): the set of concepts connected to c by incoming edges
    • C_out(c): the set of concepts connected to c by outgoing edges
    • μ: damping factor
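A minimal PageRank sketch over the same kind of toy co-occurrence graph; networkx and the damping value 0.85 (a common default, not a value stated on the slides) are assumptions:

```python
# Minimal PageRank sketch; networkx and alpha=0.85 (corresponding to mu above)
# are assumptions.
import networkx as nx

concepts = ["Bank", "Interest Rate", "Financial Crisis", "Bank",
            "Central Bank", "Tax", "Interest Rate"]
G = nx.Graph()
G.add_edges_from(zip(concepts, concepts[1:]))

scores = nx.pagerank(G, alpha=0.85)     # score_page(c, d) for every concept c in d
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```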

Page 14: A Comparison of Different Strategies for Automated Semantic Document Annotation


Annotation Selection

• Top-5 and Top-10
  – Select the concepts whose scores are ranked in the top k

• k Nearest Neighbors (kNN) [Huang et al. 11]
  – Based on the assumption that documents with similar concepts share similar annotations
  1. Compute similarity scores between the target document and all documents that already have annotations
  2. Select the union of the annotations of the k nearest documents

Example (k = 2; see also the sketch below): the target document is compared with four annotated documents; its two nearest neighbors (similarity 0.60, annotated with Marketing and Competition law; similarity 0.49, annotated with Finance and China) are more similar than the remaining documents (similarities 0.45 and 0.42, annotated with Central bank, Law, Financial crisis and with Human resource, Leadership).
– Selected annotations: Finance; China; Marketing; Competition law
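A minimal kNN annotation-selection sketch; the data structures (concept-score vectors per document and an annotation lookup) are assumptions:

```python
# Minimal kNN annotation selection sketch; the data structures are assumptions.
import math
from typing import Dict, Set

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(a[c] * b.get(c, 0.0) for c in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_select(target: Dict[str, float],
               annotated: Dict[str, Dict[str, float]],    # doc id -> concept scores
               annotations: Dict[str, Set[str]],          # doc id -> gold annotations
               k: int = 2) -> Set[str]:
    """Union of the annotations of the k documents most similar to the target."""
    ranked = sorted(annotated, key=lambda d: cosine(target, annotated[d]), reverse=True)
    return set().union(*(annotations[d] for d in ranked[:k]))
```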

Page 15: A Comparison of Different Strategies for Automated Semantic Document Annotation


Configurations [1/5]

[Diagram: a strategy combines one Concept Extraction method (Entity, Tri-gram, RAKE, LDA), one Concept Activation method (Statistical: 2 methods, Hierarchy-based: 3 methods, Graph-based: 3 methods), and one Annotation Selection method (Top-k: 2 methods, kNN: 1 method)]

Page 16: A Comparison of Different Strategies for Automated Semantic Document Annotation


Configurations [2/5]

24 strategies

[Diagram: Entity extraction combined with all 8 Concept Activation methods and all 3 Annotation Selection methods yields 24 strategies]

Page 17: A Comparison of Different Strategies for Automated Semantic Document Annotation


Configurations [3/5]

15 strategies

[Diagram: Tri-gram extraction combined with the 2 Statistical and the 3 Graph-based activation methods and all 3 Annotation Selection methods adds 15 strategies]

Page 18: A Comparison of Different Strategies for Automated Semantic Document Annotation


Configurations [4/5]

3 strategies

[Diagram: RAKE extraction combined with the Frequency activation method and all 3 Annotation Selection methods adds 3 strategies]

Page 19: A Comparison of Different Strategies for Automated Semantic Document Annotation


Configurations [5/5]

[Diagram: LDA extraction combined with the Frequency activation method and kNN selection adds 1 strategy]

43 strategies in total (24 + 15 + 3 + 1)

Page 20: A Comparison of Different Strategies for Automated Semantic Document Annotation


Datasets and Metrics of Experiments

                    | Economics     | Political Science  | Computer Science
publication source  | ZBW           | FIV                | SemEval 2010
# of publications   | 62,924        | 28,324             | 244
# of annotations    | 5.26 (± 1.84) | 12.00 (± 4.02)     | 5.05 (± 2.41)
knowledge base      | STW           | European Thesaurus | ACM CCS
# of entities       | 6,335         | 7,912              | 2,299
# of labels         | 11,679        | 8,421              | 9,086

• Computer Science: SemEval 2010 dataset [Kim et al. 10]
  – Publications are originally annotated with keywords
  – We converted the keywords to entities by string matching
• All publications and entity labels are in English
• We use the full texts of the publications
• All annotations are used as ground truth
• Evaluation metrics: Precision, Recall, F-measure

Page 21: A Comparison of Different Strategies for Automated Semantic Document Annotation


(I) Best Performing Strategies

• Economics and Political Science datasets
  – Best strategy: Entity × HITS × kNN
  – F-measure: 0.39 (Economics), 0.28 (Political Science)
• Computer Science dataset
  – Best strategy: Entity × Degree × kNN
  – F-measure: 0.33 (Computer Science)
• The graph-based methods do not differ much from each other

In general, the document annotation strategy Entity × Graph-based method × kNN performs best

Page 22: A Comparison of Different Strategies for Automated Semantic Document Annotation


(II) Influence of Concept Extraction

• Concept Extraction method: Entity
  – Uses domain-specific knowledge bases
  – Knowledge bases are freely available and of high quality
  – 32 thesauri are listed in the W3C SKOS Datasets

For Concept Extraction, Entity consistently outperforms Tri-gram, RAKE, and LDA

Page 23: A Comparison of Different Strategies for Automated Semantic Document Annotation


(III) Influence of Concept Activation

• Poor performance of the hierarchy-based methods
  – We use full texts in the experiments
  – Full texts contain so many different concepts (on average 203.80 unique entities, SD: 24.50) that additional concepts do not need to be activated
  – However, OneHop can work as well as the graph-based methods
    • It only activates concepts within a distance of one hop

For Concept Activation, graph-based methods are better than statistical and hierarchy-based methods

Page 24: A Comparison of Different Strategies for Automated Semantic Document Annotation


(IV) Influence of Annotation Selection

• kNN
  – Requires no learning process
  – Confirms the assumption that documents with similar concepts share similar annotations

For Annotation Selection, kNN enhances the performance

Page 25: A Comparison of Different Strategies for Automated Semantic Document Annotation


Conclusion

• Large-scale experiment on automated semantic document annotation for scientific publications
• Best strategy: Entity × Graph-based method × kNN
  – A novel combination of methods
• Best concept extraction method: Entity
• Best concept activation methods: Graph-based methods
  – OneHop achieves similar performance and requires less computation

Page 26: A Comparison of Different Strategies for Automated Semantic Document Annotation


Thank you! Questions?

Page 27: A Comparison of Different Strategies for Automated Semantic Document Annotation


Appendix

Page 28: A Comparison of Different Strategies for Automated Semantic Document Annotation


Research Questions

• Research questions addressed with the experiment framework:

(I) Which strategy performs best?

(II) Which concept extraction method performs best?

(III) Which concept activation method performs best?

(IV) Which annotation selection method performs best?

Page 29: A Comparison of Different Strategies for Automated Semantic Document Annotation


LDA (Latent Dirichlet Allocation)

[Figure: illustration of probabilistic topic models; source: D. M. Blei. Probabilistic topic models, CACM, 2012]

Page 30: A Comparison of Different Strategies for Automated Semantic Document Annotation


Entity Extraction and Conversion

• Entity extraction
  – String matching with entity labels
  – Longer entity labels are matched first
    • e.g., from the text "financial crisis is …", only the entity "financial crisis" is detected (not "crisis")
• Converting to entities
  – Tri-gram and RAKE extract words and keywords
  – These are converted to entities by string matching with entity labels before annotation selection
  – If no matching entity label is found, the word or keyword is discarded
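A minimal sketch of longest-label-first matching; the label set and the matching details are assumptions:

```python
# Minimal longest-label-first entity matching sketch; the label set is an assumption.
import re
from typing import Dict, List

labels_to_entity: Dict[str, str] = {        # entity label -> entity (e.g. thesaurus id)
    "financial crisis": "stw:financial-crisis",
    "crisis": "stw:crisis",
    "bank": "stw:bank",
}

def match_entities(text: str) -> List[str]:
    """Match longer labels first so 'financial crisis' wins over 'crisis'."""
    text, found = text.lower(), []
    for label in sorted(labels_to_entity, key=len, reverse=True):
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text):
            found.append(labels_to_entity[label])
            text = re.sub(pattern, " ", text)   # remove matched span to block sub-matches
    return found

print(match_entities("The financial crisis forced every bank to react."))
# -> ['stw:financial-crisis', 'stw:bank']  ('crisis' is not matched separately)
```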

Page 31: A Comparison of Different Strategies for Automated Semantic Document Annotation


kNN [1/2]

• Similarity measure
  – Each document is represented as a vector in which each element is the score of a concept
  – Cosine similarity is used as the similarity measure

Example vectors over the concepts GDP, Immigration, Population, Bank, Interest rate, Canada:
  d1 = (0.3, 0.5, 0.8, 0.1, 0.0, 0.5)
  d2 = (0.6, 0.0, 0.4, 0.8, 0.4, 0.2)
→ cosine similarity between d1 and d2
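A minimal numeric check of the cosine similarity between the two example vectors above (numpy is an assumption):

```python
# Cosine similarity of the two example document vectors shown above.
import numpy as np

d1 = np.array([0.3, 0.5, 0.8, 0.1, 0.0, 0.5])   # GDP, Immigration, Population, Bank, Interest rate, Canada
d2 = np.array([0.6, 0.0, 0.4, 0.8, 0.4, 0.2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # ~0.52
```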

Page 32: A Comparison of Different Strategies for Automated Semantic Document Annotation


kNN [2/2]

[Figure: the target document and the four annotated neighbor documents from the previous example, with similarities 0.60, 0.49, 0.45, 0.42]

• k = 1: selected annotations are Marketing and Competition law (annotations of the most similar document, similarity 0.60)
• k = 2: selected annotations are Marketing, Competition law, Finance, and China (union of the annotations of the two most similar documents, similarities 0.60 and 0.49)

Page 33: A Comparison of Different Strategies for Automated Semantic Document Annotation


Evaluation Metrics

• Precision

  precision = |{relevant annotations} ∩ {retrieved annotations}| / |{retrieved annotations}|

• Recall

  recall = |{relevant annotations} ∩ {retrieved annotations}| / |{relevant annotations}|

• F-measure

  F-measure = 2 · precision · recall / (precision + recall)

Page 34: A Comparison of Different Strategies for Automated Semantic Document Annotation


Datasets

• Economics dataset: 11 GB
• Political Science dataset: 3.8 GB

Page 35: A Comparison of Different Strategies for Automated Semantic Document Annotation


Experiments

• Preprocessing of documents
  – lemmatization
  – stop word removal
• 10-fold cross-validation (see the split sketch below)
  – split a dataset into 10 equal-sized subsets
  – 8 subsets for training
  – 1 subset for testing
  – 1 subset for parameter optimization
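A minimal sketch of the described 8/1/1 split; the fold-assignment details are assumptions:

```python
# Minimal sketch of the 10-fold split described above (8 training / 1 test /
# 1 parameter-optimization subset per fold); fold assignment details are assumptions.
import random

def ten_fold_splits(doc_ids, seed=0):
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::10] for i in range(10)]         # 10 roughly equal-sized subsets
    for i in range(10):
        test = folds[i]
        dev = folds[(i + 1) % 10]                   # held out for parameter tuning
        train = [d for j, f in enumerate(folds) if j not in (i, (i + 1) % 10) for d in f]
        yield train, dev, test

for train, dev, test in ten_fold_splits(range(100)):
    assert len(train) == 80 and len(dev) == 10 and len(test) == 10
```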

Page 36: A Comparison of Different Strategies for Automated Semantic Document Annotation


Result Table: Entity [1/2]

Economics
Method      | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency   | .14 (.17) / .14 (.15) / .13 (.15) | .22 (.20) / .11 (.10) / .14 (.12) | .08 (.21) / .08 (.21) / .08 (.21)
CF-IDF      | .19 (.19) / .18 (.17) / .18 (.16) | .24 (.21) / .12 (.10) / .15 (.12) | .29 (.32) / .30 (.32) / .29 (.31)
Base Act.   | .10 (.14) / .09 (.13) / .09 (.13) | .18 (.19) / .09 (.09) / .12 (.11) | .20 (.30) / .20 (.30) / .20 (.29)
Branch Act. | .08 (.14) / .08 (.12) / .08 (.12) | .17 (.19) / .08 (.09) / .11 (.11) | .17 (.28) / .17 (.28) / .17 (.27)
OneHop      | .12 (.16) / .12 (.14) / .12 (.14) | .19 (.19) / .09 (.09) / .12 (.11) | .35 (.34) / .36 (.34) / .35 (.33)
Degree      | .15 (.17) / .14 (.15) / .14 (.15) | .23 (.20) / .11 (.09) / .14 (.12) | .39 (.33) / .40 (.33) / .38 (.32)
HITS        | .14 (.17) / .14 (.15) / .14 (.15) | .23 (.20) / .11 (.10) / .14 (.12) | .40 (.32) / .40 (.32) / .39 (.31)
PageRank    | .14 (.17) / .14 (.15) / .14 (.15) | .22 (.20) / .11 (.09) / .14 (.12) | .39 (.33) / .40 (.33) / .38 (.32)

Political Science
Method      | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency   | .12 (.11) / .18 (.16) / .14 (.12) | .15 (.13) / .12 (.10) / .13 (.10) | .14 (.17) / .05 (.07) / .07 (.09)
CF-IDF      | .05 (.07) / .12 (.16) / .07 (.10) | .07 (.09) / .08 (.10) / .07 (.09) | .24 (.22) / .14 (.14) / .17 (.16)
Base Act.   | .05 (.08) / .10 (.13) / .07 (.09) | .10 (.10) / .10 (.09) / .09 (.09) | .14 (.19) / .07 (.10) / .09 (.12)
Branch Act. | .04 (.07) / .08 (.12) / .05 (.08) | .09 (.09) / .09 (.09) / .08 (.09) | .12 (.17) / .06 (.10) / .08 (.11)
OneHop      | .10 (.09) / .21 (.17) / .13 (.11) | .13 (.11) / .14 (.11) / .13 (.10) | .27 (.21) / .26 (.21) / .25 (.19)
Degree      | .10 (.09) / .21 (.17) / .13 (.11) | .13 (.11) / .14 (.11) / .13 (.10) | .29 (.21) / .28 (.21) / .27 (.19)
HITS        | .10 (.09) / .21 (.17) / .13 (.11) | .13 (.11) / .14 (.11) / .13 (.10) | .30 (.22) / .29 (.21) / .28 (.20)
PageRank    | .10 (.09) / .20 (.17) / .13 (.11) | .13 (.10) / .14 (.11) / .13 (.10) | .29 (.22) / .29 (.21) / .27 (.20)

Page 37: A Comparison of Different Strategies for Automated Semantic Document Annotation


Result Table: Entity [2/2]

Computer Science
Method      | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency   | .18 (.21) / .14 (.15) / .15 (.16) | .22 (.22) / .08 (.08) / .12 (.11) | .49 (.28) / .24 (.16) / .30 (.17)
CF-IDF      | .02 (.08) / .02 (.06) / .02 (.06) | .03 (.11) / .01 (.04) / .02 (.05) | .47 (.29) / .23 (.17) / .29 (.18)
Base Act.   | .17 (.20) / .13 (.14) / .14 (.15) | .22 (.22) / .08 (.08) / .12 (.11) | .49 (.28) / .22 (.15) / .29 (.17)
Branch Act. | .17 (.20) / .12 (.14) / .14 (.15) | .21 (.22) / .08 (.08) / .11 (.11) | .50 (.28) / .22 (.15) / .29 (.17)
OneHop      | .17 (.20) / .13 (.14) / .14 (.15) | .21 (.22) / .08 (.08) / .11 (.11) | .42 (.30) / .25 (.21) / .29 (.20)
Degree      | .17 (.21) / .13 (.15) / .14 (.16) | .22 (.22) / .08 (.08) / .12 (.11) | .49 (.28) / .27 (.17) / .33 (.18)
HITS        | .18 (.21) / .14 (.15) / .15 (.16) | .21 (.22) / .08 (.08) / .11 (.11) | .48 (.31) / .27 (.18) / .32 (.20)
PageRank    | .17 (.21) / .13 (.15) / .14 (.16) | .22 (.22) / .08 (.08) / .12 (.11) | .50 (.29) / .25 (.15) / .31 (.18)

Page 38: A Comparison of Different Strategies for Automated Semantic Document Annotation


Result Table: Tri-gram

Economics
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .12 (.15) / .12 (.14) / .11 (.14) | .19 (.19) / .10 (.10) / .13 (.12) | .08 (.22) / .08 (.22) / .08 (.21)
CF-IDF    | .10 (.12) / .10 (.12) / .09 (.11) | .17 (.17) / .08 (.10) / .12 (.12) | .07 (.20) / .06 (.22) / .06 (.20)
Degree    | .03 (.09) / .03 (.08) / .03 (.08) | .03 (.09) / .03 (.08) / .03 (.08) | .07 (.21) / .07 (.21) / .07 (.20)
HITS      | .02 (.06) / .02 (.06) / .02 (.06) | .02 (.06) / .02 (.06) / .02 (.06) | .08 (.22) / .08 (.22) / .07 (.21)
PageRank  | .03 (.09) / .03 (.08) / .03 (.08) | .03 (.09) / .03 (.08) / .03 (.08) | .10 (.20) / .04 (.08) / .05 (.11)

Political Science
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .06 (.08) / .14 (.16) / .08 (.10) | .10 (.10) / .11 (.11) / .10 (.09) | .08 (.14) / .05 (.08) / .06 (.09)
CF-IDF    | .05 (.05) / .06 (.07) / .05 (.06) | .09 (.10) / .09 (.10) / .08 (.09) | .09 (.15) / .04 (.08) / .06 (.10)
Degree    | .01 (.03) / .03 (.07) / .01 (.04) | .01 (.03) / .03 (.07) / .01 (.04) | .11 (.14) / .03 (.05) / .05 (.07)
HITS      | .01 (.03) / .02 (.06) / .01 (.03) | .01 (.03) / .00 (.06) / .01 (.03) | .12 (.14) / .04 (.06) / .06 (.08)
PageRank  | .01 (.04) / .03 (.08) / .02 (.05) | .01 (.04) / .03 (.08) / .02 (.05) | .08 (.12) / .03 (.05) / .04 (.06)

Computer Science
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .26 (.24) / .20 (.18) / .22 (.19) | .54 (.30) / .20 (.13) / .29 (.17) | .44 (.28) / .25 (.18) / .30 (.19)
CF-IDF    | .23 (.24) / .18 (.18) / .19 (.19) | .54 (.29) / .22 (.14) / .30 (.17) | .48 (.28) / .20 (.14) / .26 (.15)
Degree    | .09 (.15) / .07 (.11) / .07 (.12) | .13 (.19) / .05 (.07) / .07 (.09) | .48 (.29) / .23 (.16) / .29 (.18)
HITS      | .05 (.14) / .04 (.09) / .04 (.10) | .11 (.18) / .04 (.06) / .06 (.09) | .39 (.29) / .26 (.21) / .28 (.19)
PageRank  | .02 (.06) / .02 (.05) / .02 (.06) | .03 (.08) / .01 (.03) / .02 (.05) | .46 (.29) / .25 (.18) / .30 (.18)

Page 39: A Comparison of Different Strategies for Automated Semantic Document Annotation


Result Table: RAKE

Economics
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .08 (.14) / .08 (.12) / .08 (.12) | .15 (.18) / .07 (.08) / .10 (.11) | .34 (.33) / .34 (.33) / .33 (.32)

Political Science
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .04 (.07) / .08 (.13) / .05 (.08) | .07 (.09) / .08 (.09) / .07 (.08) | .31 (.23) / .18 (.15) / .22 (.17)

Computer Science
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .24 (.24) / .17 (.16) / .19 (.17) | .42 (.28) / .15 (.10) / .22 (.14) | .42 (.27) / .20 (.13) / .25 (.15)

Page 40: A Comparison of Different Strategies for Automated Semantic Document Annotation


Result Table: LDA

Economics
Method    | kNN (Recall / Precision / F)
Frequency | .19 (.30) / .19 (.30) / .19 (.30)

Political Science
Method    | kNN (Recall / Precision / F)
Frequency | .15 (.19) / .15 (.18) / .14 (.17)

Computer Science
Method    | kNN (Recall / Precision / F)
Frequency | .28 (.27) / .24 (.23) / .24 (.22)

Page 41: A Comparison of Different Strategies for Automated Semantic Document Annotation


Materials

• Code
  – https://github.com/ggb/ShortStories
• Datasets
  – Economics and Political Science
    • not publicly available yet
    • contact us directly if you are interested
  – Computer Science
    • publicly available

Page 42: A Comparison of Different Strategies for Automated Semantic Document Annotation


Presentation

• K-CAP 2015
  – International Conference on Knowledge Capture
  – Scope:
    • Knowledge Acquisition / Capture
    • Knowledge Extraction from Text
    • Semantic Web
    • Knowledge Engineering and Modelling
    • …
• Time slot
  – Presentation: 25 minutes
  – Q & A: 5 minutes

Page 43: A Comparison of Different Strategies for Automated Semantic Document Annotation


Reference

• [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 2003.
• [Blei 12] D. M. Blei. Probabilistic topic models. CACM, 2012.
• [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U. Kaymak. News personalization using the CF-IDF semantic recommender. WIMS, 2011.
• [Grosse-Bolting et al. 15] G. Grosse-Bolting, C. Nishioka, and A. Scherp. Generic process for extracting user profiles from social media using hierarchical knowledge bases. ICSC, 2015.
• [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for annotating biomedical articles. JAMIA, 2011.
• [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User interests identification on Twitter using a hierarchical knowledge base. ESWC, 2014.
• [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. International Workshop on Semantic Evaluation, 2010.

Page 44: A Comparison of Different Strategies for Automated Semantic Document Annotation


Reference

• [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999.
• [Mihalcea & Paul 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. EMNLP, 2004.
• [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
• [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents. Text Mining, 2010.
• [Zouaq et al. 12] A. Zouaq, D. Gasevic, and M. Hatala. Voting theory for concept detection. ESWC, 2012.