A Comparison of Different Strategies for Automated Semantic Document Annotation


Transcript of A Comparison of Different Strategies for Automated Semantic Document Annotation

Page 1: A Comparison of Different Strategies for Automated Semantic Document Annotation

Chifumi Nishioka, [email protected], K-CAP 2015

Gregor Große-Bölting, Chifumi Nishioka, Ansgar Scherp

A Comparison of Different Strategies for Automated Semantic Document Annotation

Page 2: A Comparison of Different Strategies for Automated Semantic Document Annotation


Motivation [1/2]

• Document annotation
  – Helps users and search engines find documents
  – Requires a huge amount of human effort
  – e.g., subject indexers at ZBW have labeled 1.6 million scientific documents in economics

• Semantic document annotation
  – Documents are annotated with semantic entities
  – e.g., PubMed with MeSH, ACM DL with ACM CCS

Focus on semantic document annotation

Necessity of automated document annotation

Page 3: A Comparison of Different Strategies for Automated Semantic Document Annotation


Motivation [2/2]

• Small-scale experiments so far
  – Compare only a small number of strategies
  – Use datasets containing a few hundred documents

• Comparison of 43 strategies for document annotation within our experiment framework
  – The largest number of strategies compared so far

• Experiments with three datasets from different domains
  – Contain the full texts of 100,000 documents annotated by subject indexers
  – The largest dataset of scientific publications used for this task

We conducted the largest-scale experiment so far

Page 4: A Comparison of Different Strategies for Automated Semantic Document Annotation


Experiment Framework

Strategies are composed of methods from concept extraction, concept activation, and annotation selection

1. Concept Extraction: detect concepts (candidate annotations) in each document

2. Concept Activation: compute a score for each concept of a document

3. Annotation Selection: select annotations from the concepts of each document

4. Evaluation: measure the performance of strategies against the ground truth
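A minimal sketch of how these steps compose into one strategy; this is not the authors' implementation, and all function and variable names below are hypothetical:

```python
# Minimal sketch of the experiment framework; all names are hypothetical and
# only illustrate how the three steps plus evaluation compose.
from typing import Callable, Dict, List, Set

ExtractFn  = Callable[[str], List[str]]               # document text -> candidate concepts
ActivateFn = Callable[[List[str]], Dict[str, float]]  # concepts -> concept scores
SelectFn   = Callable[[Dict[str, float]], Set[str]]   # concept scores -> chosen annotations

def run_strategy(doc: str, extract: ExtractFn, activate: ActivateFn,
                 select: SelectFn) -> Set[str]:
    """One strategy = one extraction x one activation x one selection method."""
    concepts = extract(doc)          # 1. concept extraction
    scores = activate(concepts)      # 2. concept activation
    return select(scores)            # 3. annotation selection

def f_measure(predicted: Set[str], gold: Set[str]) -> float:
    """4. evaluation against the ground-truth annotations of the subject indexers."""
    if not predicted or not gold:
        return 0.0
    p = len(predicted & gold) / len(predicted)
    r = len(predicted & gold) / len(gold)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0
```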

Page 5: A Comparison of Different Strategies for Automated Semantic Document Annotation


Research Questions

• Research questions addressed with the experiment framework:

(I) Which strategy performs best?

(II) Which concept extraction method performs best?

(III) Which concept activation method performs best?

(IV) Which annotation selection method performs best?

Page 6: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Extraction [1/2]

• Entity
  – Extract entities from documents using a domain-specific knowledge base
  – Domain-specific knowledge base:
    • Entities (subjects) of a specific domain (e.g., medicine)
    • One or more labels for each entity
    • Relationships between entities
  – Detect entities by string matching against entity labels

• Tri-gram
  – Extract contiguous sequences of one, two, and three words from a document
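A minimal sketch of the Tri-gram extraction step; the tokenizer is a simplifying assumption, not the one used in the paper:

```python
# Minimal Tri-gram extraction sketch; the naive tokenizer is an assumption.
import re
from typing import List

def ngrams(text: str, max_n: int = 3) -> List[str]:
    """Extract contiguous word sequences of length 1..max_n (uni-, bi-, tri-grams)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

print(ngrams("The financial crisis hit banks"))  # e.g. 'financial crisis', 'crisis hit banks', ...
```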

Page 7: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Extraction [2/2]

• RAKE (Rapid Automatic Keyword Extraction) [Rose et al. 10]
  – Unsupervised method for extracting keywords
  – Incorporates co-occurrence and frequency of words

• LDA (Latent Dirichlet Allocation) [Blei et al. 03]
  – Unsupervised topic modeling method for inferring latent topics in a document corpus
  – Topic model:
    • Topic: a probability distribution over words
    • Document: a probability distribution over topics
  – Treats a topic as a concept
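A minimal sketch of treating LDA topics as concepts, using scikit-learn rather than the authors' tooling; the toy corpus and the number of topics are assumptions:

```python
# Minimal LDA-as-concepts sketch using scikit-learn (not the authors' tooling);
# n_components and the toy corpus are assumptions for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["banks raise the interest rate during a financial crisis",
        "the central bank lowers the interest rate",
        "keyword extraction from scientific documents"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)        # each row: probability distribution over topics

# Each topic is treated as one concept; its probability is the concept's score.
for topic_id, prob in enumerate(doc_topics[0]):
    print(f"doc 0, topic/concept {topic_id}: score {prob:.2f}")
```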

Page 8: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [1/6]

• Three types of concept activation methods
  – Statistical methods
    • Baseline
    • Use only directly mentioned concepts
  – Hierarchy-based methods
    • Reveal concepts that are not mentioned explicitly, using a hierarchical knowledge base
  – Graph-based methods
    • Use only directly mentioned concepts
    • Represent concept co-occurrences as a graph

Example concept sequence of a document: Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate

[Figure: co-occurrence graph over the concepts Tax, Bank, Interest Rate, Financial Crisis, Central Bank]

Page 9: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [2/6]

• Statistical methods
  – Frequency
    • Depends on the Concept Extraction method:
      – the number of appearances (Entity and Tri-gram)
      – the score output by RAKE (RAKE)
      – the probability of a topic for a document (LDA)

    score_freq(c, d) = freq(c, d)

  – CF-IDF [Goossen et al. 11]
    • An extension of TF-IDF that replaces words with concepts
    • Lower scores for concepts that appear in many documents

    score_cfidf(c, d) = cf(c, d) · log( |D| / |{d ∈ D : c ∈ d}| )
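A minimal CF-IDF sketch following the formula above; the toy corpus is an assumption:

```python
# Minimal CF-IDF sketch following score_cfidf(c, d) = cf(c, d) * log(|D| / df(c));
# the toy corpus is an assumption.
import math
from collections import Counter
from typing import Dict, List

def cf_idf(doc_concepts: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    cf = Counter(doc_concepts)                       # concept frequency in this document
    n_docs = len(corpus)
    df = Counter(c for d in corpus for c in set(d))  # document frequency of each concept
    return {c: cf[c] * math.log(n_docs / df[c]) for c in cf}

corpus = [["bank", "interest rate", "bank", "tax"],
          ["tax", "income"],
          ["bank", "financial crisis"]]
print(cf_idf(corpus[0], corpus))   # 'interest rate' scores highest, 'bank'/'tax' lower
```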

Page 10: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [3/6]

• Hierarchy-based methods [1/2]
  – Base Activation

    score_base(c, d) = freq(c, d) + λ · Σ_{c_i ∈ C_l(c)} score_base(c_i, d)

    • C_l(c): the set of child concepts of a concept c
    • λ: decay parameter
    • e.g., see the concept hierarchy below and the sketch that follows

[Figure: example concept hierarchy with "World Wide Web" at the top, "Web Searching" and "Web Mining" one level below, and "Social Recommendation", "Social Tagging", "Site Wrapping", and "Web Log Analysis" as leaf concepts; three concepts are marked c1, c2, c3]
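A minimal Base Activation sketch; the toy hierarchy, the children lookup, and the value of λ are assumptions:

```python
# Minimal Base Activation sketch; the hierarchy and the lambda value are assumptions.
from typing import Dict, List

hierarchy: Dict[str, List[str]] = {          # parent -> child concepts (toy hierarchy)
    "World Wide Web": ["Web Searching", "Web Mining"],
    "Web Searching": ["Social Recommendation", "Social Tagging"],
    "Web Mining": ["Site Wrapping", "Web Log Analysis"],
}

def score_base(c: str, freq: Dict[str, int], lam: float = 0.5) -> float:
    """score_base(c, d) = freq(c, d) + lam * sum(score_base(child, d))."""
    children = hierarchy.get(c, [])
    return freq.get(c, 0) + lam * sum(score_base(ch, freq, lam) for ch in children)

freq = {"Social Tagging": 3, "Web Log Analysis": 1}      # concepts found in one document
print(score_base("World Wide Web", freq))  # the parent gets activated via its descendants
```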

Page 11: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [4/6]

• Hierarchy-based methods [2/2]
  – Branch Activation

    score_branch(c, d) = freq(c, d) + λ · B_N · Σ_{c_i ∈ C_l(c)} score_branch(c_i, d)

    • B_N: reciprocal of the number of concepts located one level above a concept

  – OneHop Activation

    score_onehop(c, d) = freq(c, d)                                      if |C_l(c) ∩ C_d| ≥ 2
    score_onehop(c, d) = freq(c, d) + λ · Σ_{c_i ∈ C_l(c)} freq(c_i, d)  otherwise

    • C_d: the set of concepts in a document d
    • Activates concepts within a maximum distance of one hop
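A minimal Branch Activation sketch on the same kind of toy hierarchy; how B_N is derived (from an assumed parent lookup) and the value of λ are assumptions:

```python
# Minimal Branch Activation sketch; parents(), children() and lambda are assumptions.
from typing import Dict, List

children: Dict[str, List[str]] = {
    "World Wide Web": ["Web Searching", "Web Mining"],
    "Web Searching": ["Social Recommendation", "Social Tagging"],
}
parents: Dict[str, List[str]] = {             # inverse of the children relation
    ch: [p] for p, chs in children.items() for ch in chs
}

def score_branch(c: str, freq: Dict[str, int], lam: float = 0.5) -> float:
    """score_branch(c, d) = freq(c, d) + lam * B_N * sum(score_branch(child, d)),
    where B_N is taken here as the reciprocal of the number of parents of c."""
    b_n = 1.0 / len(parents.get(c, [None]))   # concepts without parents: B_N = 1
    kids = children.get(c, [])
    return freq.get(c, 0) + lam * b_n * sum(score_branch(k, freq, lam) for k in kids)

freq = {"Social Tagging": 2}
print(score_branch("World Wide Web", freq))
```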

Page 12: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [5/6]

• Graph-based methods [1/2]
  – Degree [Zouaq et al. 12]

    score_degree(c, d) = degree(c, d)

    • degree(c, d): the number of edges linked to the concept c in the co-occurrence graph of d
    • e.g., see the co-occurrence graph below

  – HITS [Kleinberg 99; Zouaq et al. 12]
    • Link analysis algorithm for search engines [Kleinberg 99]

    score_hits(c, d) = hub(c, d) + auth(c, d)

[Figure: co-occurrence graph over the concepts Tax, Bank, Interest Rate, Financial Crisis, Central Bank]
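A minimal sketch of the graph-based scores; networkx and the co-occurrence window (adjacent concepts in the document's concept sequence) are assumptions, not prescribed by the paper:

```python
# Minimal graph-based activation sketch; networkx and the co-occurrence window
# (adjacent concepts in the concept sequence) are assumptions.
import networkx as nx

concepts = ["Bank", "Interest Rate", "Financial Crisis", "Bank",
            "Central Bank", "Tax", "Interest Rate"]

G = nx.Graph()
G.add_edges_from(zip(concepts, concepts[1:]))   # edge between co-occurring concepts

degree_scores = dict(G.degree())                # score_degree(c, d) = degree(c, d)
hubs, authorities = nx.hits(G)                  # score_hits(c, d) = hub + authority
hits_scores = {c: hubs[c] + authorities[c] for c in G}

print(sorted(degree_scores.items(), key=lambda kv: -kv[1]))
```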

Page 13: A Comparison of Different Strategies for Automated Semantic Document Annotation


Concept Activation [6/6]

• Graph-based methods [2/2]
  – PageRank [Page et al. 99; Mihalcea & Paul 04]
    • Link analysis algorithm for search engines
    • Based on the intuition that a node linked from many important nodes is itself more important

    score_page(c, d) = (1 − μ) + μ · Σ_{c_i ∈ C_in(c)} score_page(c_i, d) / |C_out(c_i)|

    • C_in(c): the set of concepts connected to c by incoming edges
    • C_out(c): the set of concepts connected to c by outgoing edges
    • μ: damping factor
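A minimal PageRank sketch over the same kind of toy co-occurrence graph; networkx and the damping value 0.85 (a common default, not a value stated on the slides) are assumptions:

```python
# Minimal PageRank sketch; networkx and alpha=0.85 (corresponding to mu above)
# are assumptions.
import networkx as nx

concepts = ["Bank", "Interest Rate", "Financial Crisis", "Bank",
            "Central Bank", "Tax", "Interest Rate"]
G = nx.Graph()
G.add_edges_from(zip(concepts, concepts[1:]))

scores = nx.pagerank(G, alpha=0.85)     # score_page(c, d) for every concept c in d
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```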

Page 14: A Comparison of Different Strategies for Automated Semantic Document Annotation


Annotation Selection

• Top-5 and Top-10
  – Select the concepts whose scores are ranked in the top k

• k Nearest Neighbors (kNN) [Huang et al. 11]
  – Based on the assumption that documents with similar concepts share similar annotations
  1. Compute similarity scores between the target document and all documents that already have annotations
  2. Select the union of the annotations of the k nearest documents

Example (k = 2; see also the sketch below): the target document is compared with four annotated documents; its two nearest neighbors (similarity 0.60, annotated with Marketing and Competition law; similarity 0.49, annotated with Finance and China) are more similar than the remaining documents (similarities 0.45 and 0.42, annotated with Central bank, Law, Financial crisis and with Human resource, Leadership).
– Selected annotations: Finance; China; Marketing; Competition law
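A minimal kNN annotation-selection sketch; the data structures (concept-score vectors per document and an annotation lookup) are assumptions:

```python
# Minimal kNN annotation selection sketch; the data structures are assumptions.
import math
from typing import Dict, Set

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(a[c] * b.get(c, 0.0) for c in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_select(target: Dict[str, float],
               annotated: Dict[str, Dict[str, float]],    # doc id -> concept scores
               annotations: Dict[str, Set[str]],          # doc id -> gold annotations
               k: int = 2) -> Set[str]:
    """Union of the annotations of the k documents most similar to the target."""
    ranked = sorted(annotated, key=lambda d: cosine(target, annotated[d]), reverse=True)
    return set().union(*(annotations[d] for d in ranked[:k]))
```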

Page 15: A Comparison of Different Strategies for Automated Semantic Document Annotation


Configurations [1/5]

[Diagram: a strategy combines one Concept Extraction method (Entity, Tri-gram, RAKE, LDA), one Concept Activation method (Statistical: 2 methods, Hierarchy-based: 3 methods, Graph-based: 3 methods), and one Annotation Selection method (Top-k: 2 methods, kNN: 1 method)]

Page 16: A Comparison of Different Strategies for Automated Semantic Document Annotation


Configurations [2/5]

24 strategies

[Diagram: Entity extraction combined with all 8 Concept Activation methods and all 3 Annotation Selection methods yields 24 strategies]

Page 17: A Comparison of Different Strategies for Automated Semantic Document Annotation


Configurations [3/5]

15 strategies

[Diagram: Tri-gram extraction combined with the 2 Statistical and the 3 Graph-based activation methods and all 3 Annotation Selection methods adds 15 strategies]

Page 18: A Comparison of Different Strategies for Automated Semantic Document Annotation


Configurations [4/5]

3 strategies

[Diagram: RAKE extraction combined with the Frequency activation method and all 3 Annotation Selection methods adds 3 strategies]

Page 19: A Comparison of Different Strategies for Automated Semantic Document Annotation


Configurations [5/5]

[Diagram: LDA extraction combined with the Frequency activation method and kNN selection adds 1 strategy]

43 strategies in total (24 + 15 + 3 + 1)

Page 20: A Comparison of Different Strategies for Automated Semantic Document Annotation


Datasets and Metrics of Experiments

                    | Economics     | Political Science  | Computer Science
publication source  | ZBW           | FIV                | SemEval 2010
# of publications   | 62,924        | 28,324             | 244
# of annotations    | 5.26 (± 1.84) | 12.00 (± 4.02)     | 5.05 (± 2.41)
knowledge base      | STW           | European Thesaurus | ACM CCS
# of entities       | 6,335         | 7,912              | 2,299
# of labels         | 11,679        | 8,421              | 9,086

• Computer Science: SemEval 2010 dataset [Kim et al. 10]
  – Publications are originally annotated with keywords
  – We converted the keywords to entities by string matching
• All publications and entity labels are in English
• We use the full texts of the publications
• All annotations are used as ground truth
• Evaluation metrics: Precision, Recall, F-measure

Page 21: A Comparison of Different Strategies for Automated Semantic Document Annotation


(I) Best Performing Strategies

• Economics and Political Science datasets
  – Best strategy: Entity × HITS × kNN
  – F-measure: 0.39 (Economics), 0.28 (Political Science)
• Computer Science dataset
  – Best strategy: Entity × Degree × kNN
  – F-measure: 0.33 (Computer Science)
• The graph-based methods do not differ much from each other

In general, the document annotation strategy Entity × Graph-based method × kNN performs best

Page 22: A Comparison of Different Strategies for Automated Semantic Document Annotation


(II) Influence of Concept Extraction

• Concept Extraction method: Entity
  – Uses domain-specific knowledge bases
  – Knowledge bases are freely available and of high quality
  – 32 thesauri are listed in the W3C SKOS Datasets

For Concept Extraction, Entity consistently outperforms Tri-gram, RAKE, and LDA

Page 23: A Comparison of Different Strategies for Automated Semantic Document Annotation


(III) Influence of Concept Activation

• Poor performance of the hierarchy-based methods
  – We use full texts in the experiments
  – Full texts contain so many different concepts (on average 203.80 unique entities, SD: 24.50) that additional concepts do not need to be activated
  – However, OneHop can work as well as the graph-based methods
    • It only activates concepts within a distance of one hop

For Concept Activation, graph-based methods are better than statistical and hierarchy-based methods

Page 24: A Comparison of Different Strategies for Automated Semantic Document Annotation


(IV) Influence of Annotation Selection

• kNN
  – Requires no learning process
  – Confirms the assumption that documents with similar concepts share similar annotations

For Annotation Selection, kNN enhances the performance

Page 25: A Comparison of Different Strategies for Automated Semantic Document Annotation


Conclusion

• Large-scale experiment on automated semantic document annotation for scientific publications
• Best strategy: Entity × Graph-based method × kNN
  – A novel combination of methods
• Best concept extraction method: Entity
• Best concept activation methods: Graph-based methods
  – OneHop achieves similar performance and requires less computation

Page 26: A Comparison of Different Strategies for Automated Semantic Document Annotation


Thank you! Questions?

Page 27: A Comparison of Different Strategies for Automated Semantic Document Annotation


Appendix

Page 28: A Comparison of Different Strategies for Automated Semantic Document Annotation


Research Questions

• Research questions addressed with the experiment framework:

(I) Which strategy performs best?

(II) Which concept extraction method performs best?

(III) Which concept activation method performs best?

(IV) Which annotation selection method performs best?

Page 29: A Comparison of Different Strategies for Automated Semantic Document Annotation


LDA (Latent Dirichlet Allocation)

[Figure: illustration of probabilistic topic models; source: D. M. Blei. Probabilistic topic models, CACM, 2012]

Page 30: A Comparison of Different Strategies for Automated Semantic Document Annotation


Entity Extraction and Conversion

• Entity extraction
  – String matching with entity labels
  – Longer entity labels are matched first
    • e.g., from the text "financial crisis is …", only the entity "financial crisis" is detected (not "crisis")
• Converting to entities
  – Tri-gram and RAKE extract words and keywords
  – These are converted to entities by string matching with entity labels before annotation selection
  – If no matching entity label is found, the word or keyword is discarded
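A minimal sketch of longest-label-first matching; the label set and the matching details are assumptions:

```python
# Minimal longest-label-first entity matching sketch; the label set is an assumption.
import re
from typing import Dict, List

labels_to_entity: Dict[str, str] = {        # entity label -> entity (e.g. thesaurus id)
    "financial crisis": "stw:financial-crisis",
    "crisis": "stw:crisis",
    "bank": "stw:bank",
}

def match_entities(text: str) -> List[str]:
    """Match longer labels first so 'financial crisis' wins over 'crisis'."""
    text, found = text.lower(), []
    for label in sorted(labels_to_entity, key=len, reverse=True):
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text):
            found.append(labels_to_entity[label])
            text = re.sub(pattern, " ", text)   # remove matched span to block sub-matches
    return found

print(match_entities("The financial crisis forced every bank to react."))
# -> ['stw:financial-crisis', 'stw:bank']  ('crisis' is not matched separately)
```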

Page 31: A Comparison of Different Strategies for Automated Semantic Document Annotation


kNN [1/2]

• Similarity measure
  – Each document is represented as a vector in which each element is the score of a concept
  – Cosine similarity is used as the similarity measure

Example vectors over the concepts GDP, Immigration, Population, Bank, Interest rate, Canada:
  d1 = (0.3, 0.5, 0.8, 0.1, 0.0, 0.5)
  d2 = (0.6, 0.0, 0.4, 0.8, 0.4, 0.2)
→ cosine similarity between d1 and d2
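A minimal numeric check of the cosine similarity between the two example vectors above (numpy is an assumption):

```python
# Cosine similarity of the two example document vectors shown above.
import numpy as np

d1 = np.array([0.3, 0.5, 0.8, 0.1, 0.0, 0.5])   # GDP, Immigration, Population, Bank, Interest rate, Canada
d2 = np.array([0.6, 0.0, 0.4, 0.8, 0.4, 0.2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # ~0.52
```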

Page 32: A Comparison of Different Strategies for Automated Semantic Document Annotation


kNN [2/2]

[Figure: the target document and the four annotated neighbor documents from the previous example, with similarities 0.60, 0.49, 0.45, 0.42]

• k = 1: selected annotations are Marketing and Competition law (annotations of the most similar document, similarity 0.60)
• k = 2: selected annotations are Marketing, Competition law, Finance, and China (union of the annotations of the two most similar documents, similarities 0.60 and 0.49)

Page 33: A Comparison of Different Strategies for Automated Semantic Document Annotation


Evaluation Metrics

• Precision

  precision = |{relevant annotations} ∩ {retrieved annotations}| / |{retrieved annotations}|

• Recall

  recall = |{relevant annotations} ∩ {retrieved annotations}| / |{relevant annotations}|

• F-measure

  F-measure = 2 · precision · recall / (precision + recall)

Page 34: A Comparison of Different Strategies for Automated Semantic Document Annotation


Datasets

• Economics dataset: 11 GB
• Political Science dataset: 3.8 GB

Page 35: A Comparison of Different Strategies for Automated Semantic Document Annotation


Experiments

• Preprocessing of documents
  – lemmatization
  – stop word removal
• 10-fold cross-validation (see the split sketch below)
  – split a dataset into 10 equal-sized subsets
  – 8 subsets for training
  – 1 subset for testing
  – 1 subset for parameter optimization
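A minimal sketch of the described 8/1/1 split; the fold-assignment details are assumptions:

```python
# Minimal sketch of the 10-fold split described above (8 training / 1 test /
# 1 parameter-optimization subset per fold); fold assignment details are assumptions.
import random

def ten_fold_splits(doc_ids, seed=0):
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::10] for i in range(10)]         # 10 roughly equal-sized subsets
    for i in range(10):
        test = folds[i]
        dev = folds[(i + 1) % 10]                   # held out for parameter tuning
        train = [d for j, f in enumerate(folds) if j not in (i, (i + 1) % 10) for d in f]
        yield train, dev, test

for train, dev, test in ten_fold_splits(range(100)):
    assert len(train) == 80 and len(dev) == 10 and len(test) == 10
```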

Page 36: A Comparison of Different Strategies for Automated Semantic Document Annotation


Result Table: Entity [1/2]

Economics
Method      | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency   | .14 (.17) / .14 (.15) / .13 (.15) | .22 (.20) / .11 (.10) / .14 (.12) | .08 (.21) / .08 (.21) / .08 (.21)
CF-IDF      | .19 (.19) / .18 (.17) / .18 (.16) | .24 (.21) / .12 (.10) / .15 (.12) | .29 (.32) / .30 (.32) / .29 (.31)
Base Act.   | .10 (.14) / .09 (.13) / .09 (.13) | .18 (.19) / .09 (.09) / .12 (.11) | .20 (.30) / .20 (.30) / .20 (.29)
Branch Act. | .08 (.14) / .08 (.12) / .08 (.12) | .17 (.19) / .08 (.09) / .11 (.11) | .17 (.28) / .17 (.28) / .17 (.27)
OneHop      | .12 (.16) / .12 (.14) / .12 (.14) | .19 (.19) / .09 (.09) / .12 (.11) | .35 (.34) / .36 (.34) / .35 (.33)
Degree      | .15 (.17) / .14 (.15) / .14 (.15) | .23 (.20) / .11 (.09) / .14 (.12) | .39 (.33) / .40 (.33) / .38 (.32)
HITS        | .14 (.17) / .14 (.15) / .14 (.15) | .23 (.20) / .11 (.10) / .14 (.12) | .40 (.32) / .40 (.32) / .39 (.31)
PageRank    | .14 (.17) / .14 (.15) / .14 (.15) | .22 (.20) / .11 (.09) / .14 (.12) | .39 (.33) / .40 (.33) / .38 (.32)

Political Science
Method      | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency   | .12 (.11) / .18 (.16) / .14 (.12) | .15 (.13) / .12 (.10) / .13 (.10) | .14 (.17) / .05 (.07) / .07 (.09)
CF-IDF      | .05 (.07) / .12 (.16) / .07 (.10) | .07 (.09) / .08 (.10) / .07 (.09) | .24 (.22) / .14 (.14) / .17 (.16)
Base Act.   | .05 (.08) / .10 (.13) / .07 (.09) | .10 (.10) / .10 (.09) / .09 (.09) | .14 (.19) / .07 (.10) / .09 (.12)
Branch Act. | .04 (.07) / .08 (.12) / .05 (.08) | .09 (.09) / .09 (.09) / .08 (.09) | .12 (.17) / .06 (.10) / .08 (.11)
OneHop      | .10 (.09) / .21 (.17) / .13 (.11) | .13 (.11) / .14 (.11) / .13 (.10) | .27 (.21) / .26 (.21) / .25 (.19)
Degree      | .10 (.09) / .21 (.17) / .13 (.11) | .13 (.11) / .14 (.11) / .13 (.10) | .29 (.21) / .28 (.21) / .27 (.19)
HITS        | .10 (.09) / .21 (.17) / .13 (.11) | .13 (.11) / .14 (.11) / .13 (.10) | .30 (.22) / .29 (.21) / .28 (.20)
PageRank    | .10 (.09) / .20 (.17) / .13 (.11) | .13 (.10) / .14 (.11) / .13 (.10) | .29 (.22) / .29 (.21) / .27 (.20)

Page 37: A Comparison of Different Strategies for Automated Semantic Document Annotation


Result Table: Entity [2/2]

Computer Science
Method      | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency   | .18 (.21) / .14 (.15) / .15 (.16) | .22 (.22) / .08 (.08) / .12 (.11) | .49 (.28) / .24 (.16) / .30 (.17)
CF-IDF      | .02 (.08) / .02 (.06) / .02 (.06) | .03 (.11) / .01 (.04) / .02 (.05) | .47 (.29) / .23 (.17) / .29 (.18)
Base Act.   | .17 (.20) / .13 (.14) / .14 (.15) | .22 (.22) / .08 (.08) / .12 (.11) | .49 (.28) / .22 (.15) / .29 (.17)
Branch Act. | .17 (.20) / .12 (.14) / .14 (.15) | .21 (.22) / .08 (.08) / .11 (.11) | .50 (.28) / .22 (.15) / .29 (.17)
OneHop      | .17 (.20) / .13 (.14) / .14 (.15) | .21 (.22) / .08 (.08) / .11 (.11) | .42 (.30) / .25 (.21) / .29 (.20)
Degree      | .17 (.21) / .13 (.15) / .14 (.16) | .22 (.22) / .08 (.08) / .12 (.11) | .49 (.28) / .27 (.17) / .33 (.18)
HITS        | .18 (.21) / .14 (.15) / .15 (.16) | .21 (.22) / .08 (.08) / .11 (.11) | .48 (.31) / .27 (.18) / .32 (.20)
PageRank    | .17 (.21) / .13 (.15) / .14 (.16) | .22 (.22) / .08 (.08) / .12 (.11) | .50 (.29) / .25 (.15) / .31 (.18)

Page 38: A Comparison of Different Strategies for Automated Semantic Document Annotation


Result Table: Tri-gram

Economics
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .12 (.15) / .12 (.14) / .11 (.14) | .19 (.19) / .10 (.10) / .13 (.12) | .08 (.22) / .08 (.22) / .08 (.21)
CF-IDF    | .10 (.12) / .10 (.12) / .09 (.11) | .17 (.17) / .08 (.10) / .12 (.12) | .07 (.20) / .06 (.22) / .06 (.20)
Degree    | .03 (.09) / .03 (.08) / .03 (.08) | .03 (.09) / .03 (.08) / .03 (.08) | .07 (.21) / .07 (.21) / .07 (.20)
HITS      | .02 (.06) / .02 (.06) / .02 (.06) | .02 (.06) / .02 (.06) / .02 (.06) | .08 (.22) / .08 (.22) / .07 (.21)
PageRank  | .03 (.09) / .03 (.08) / .03 (.08) | .03 (.09) / .03 (.08) / .03 (.08) | .10 (.20) / .04 (.08) / .05 (.11)

Political Science
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .06 (.08) / .14 (.16) / .08 (.10) | .10 (.10) / .11 (.11) / .10 (.09) | .08 (.14) / .05 (.08) / .06 (.09)
CF-IDF    | .05 (.05) / .06 (.07) / .05 (.06) | .09 (.10) / .09 (.10) / .08 (.09) | .09 (.15) / .04 (.08) / .06 (.10)
Degree    | .01 (.03) / .03 (.07) / .01 (.04) | .01 (.03) / .03 (.07) / .01 (.04) | .11 (.14) / .03 (.05) / .05 (.07)
HITS      | .01 (.03) / .02 (.06) / .01 (.03) | .01 (.03) / .00 (.06) / .01 (.03) | .12 (.14) / .04 (.06) / .06 (.08)
PageRank  | .01 (.04) / .03 (.08) / .02 (.05) | .01 (.04) / .03 (.08) / .02 (.05) | .08 (.12) / .03 (.05) / .04 (.06)

Computer Science
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .26 (.24) / .20 (.18) / .22 (.19) | .54 (.30) / .20 (.13) / .29 (.17) | .44 (.28) / .25 (.18) / .30 (.19)
CF-IDF    | .23 (.24) / .18 (.18) / .19 (.19) | .54 (.29) / .22 (.14) / .30 (.17) | .48 (.28) / .20 (.14) / .26 (.15)
Degree    | .09 (.15) / .07 (.11) / .07 (.12) | .13 (.19) / .05 (.07) / .07 (.09) | .48 (.29) / .23 (.16) / .29 (.18)
HITS      | .05 (.14) / .04 (.09) / .04 (.10) | .11 (.18) / .04 (.06) / .06 (.09) | .39 (.29) / .26 (.21) / .28 (.19)
PageRank  | .02 (.06) / .02 (.05) / .02 (.06) | .03 (.08) / .01 (.03) / .02 (.05) | .46 (.29) / .25 (.18) / .30 (.18)

Page 39: A Comparison of Different Strategies for Automated Semantic Document Annotation


Result Table: RAKE

Economics
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .08 (.14) / .08 (.12) / .08 (.12) | .15 (.18) / .07 (.08) / .10 (.11) | .34 (.33) / .34 (.33) / .33 (.32)

Political Science
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .04 (.07) / .08 (.13) / .05 (.08) | .07 (.09) / .08 (.09) / .07 (.08) | .31 (.23) / .18 (.15) / .22 (.17)

Computer Science
Method    | top-5 (Recall / Precision / F)    | top-10 (Recall / Precision / F)   | kNN (Recall / Precision / F)
Frequency | .24 (.24) / .17 (.16) / .19 (.17) | .42 (.28) / .15 (.10) / .22 (.14) | .42 (.27) / .20 (.13) / .25 (.15)

Page 40: A Comparison of Different Strategies for Automated Semantic Document Annotation


Result Table: LDA

Economics
Method    | kNN (Recall / Precision / F)
Frequency | .19 (.30) / .19 (.30) / .19 (.30)

Political Science
Method    | kNN (Recall / Precision / F)
Frequency | .15 (.19) / .15 (.18) / .14 (.17)

Computer Science
Method    | kNN (Recall / Precision / F)
Frequency | .28 (.27) / .24 (.23) / .24 (.22)

Page 41: A Comparison of Different Strategies for Automated Semantic Document Annotation


Materials

• Code
  – https://github.com/ggb/ShortStories
• Datasets
  – Economics and Political Science
    • not publicly available yet
    • contact us directly if you are interested
  – Computer Science
    • publicly available

Page 42: A Comparison of Different Strategies for Automated Semantic Document Annotation


Presentation

• K-CAP 2015
  – International Conference on Knowledge Capture
  – Scope:
    • Knowledge Acquisition / Capture
    • Knowledge Extraction from Text
    • Semantic Web
    • Knowledge Engineering and Modelling
    • …
• Time slot
  – Presentation: 25 minutes
  – Q & A: 5 minutes

Page 43: A Comparison of Different Strategies for Automated Semantic Document Annotation


Reference

• [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 2003.
• [Blei 12] D. M. Blei. Probabilistic topic models. CACM, 2012.
• [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U. Kaymak. News personalization using the CF-IDF semantic recommender. WIMS, 2011.
• [Grosse-Bolting et al. 15] G. Grosse-Bolting, C. Nishioka, and A. Scherp. Generic process for extracting user profiles from social media using hierarchical knowledge bases. ICSC, 2015.
• [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for annotating biomedical articles. JAMIA, 2011.
• [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User interests identification on Twitter using a hierarchical knowledge base. ESWC, 2014.
• [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. International Workshop on Semantic Evaluation, 2010.

Page 44: A Comparison of Different Strategies for Automated Semantic Document Annotation


Reference

• [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999.
• [Mihalcea & Paul 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. EMNLP, 2004.
• [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
• [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents. Text Mining, 2010.
• [Zouaq et al. 12] A. Zouaq, D. Gasevic, and M. Hatala. Voting theory for concept detection. ESWC, 2012.