Are you ready for the golden age of text mining?

John McNaught
Deputy Director, National Centre for Text Mining
University of Manchester
John.McNaught@manchester.ac.uk


London Info International 2

Overview

• Text mining in a nutshell
• Enriching content, enhancing search, enabling discovery, reducing costs
• Interoperability and evaluation
• The C change


How do we (humans) discover?

• Find, read, learn, analyse a lot
• Ask “What if…?”
• Construct hypotheses, test them
  – Explore many avenues, associations
• Work collaboratively
• Share results and data with others
  – Reproducibility validation
• Integrate heterogeneous data/information/knowledge
• (vs. Serendipity: by lucky accident)


Barriers to discovery

• Find: document oriented, too many hits
• Read: too much to read, even if we find relevant hits
• Learn: growth too fast to keep up, or to know most things
• Analyse: duplication of efforts, many new results to document
• Construct hypotheses: hard, can’t tell which are most promising, or whether any have been missed
• Share: primary vehicles are documents and curated databases (massive curation backlog)
• Integrate: document often the key, hard to link in to different worlds of data, information, knowledge


How does TM aid discovery?

• Find: more precise, relevant information, within and across documents
• Read: much faster than human
• Learn: extracts, packages, links, synthesises, summarises, reduces burden
• Analyse: recognises duplication; clusters, classifies, drives semantic author aids
• Construct hypotheses: rapidly finds and ranks unknown associations for testing
• Share: reduces curation effort, complements and validates databases
• Integrate: links documents deeply into worlds of data, information and knowledge


Text mining in a nutshell

[Figure labels: Other data; Applications: Semantic search, Data mining]


[Figure: levels of text analysis, the techniques behind them, and the questions they help answer]

Levels and techniques:
• Words: wordform co-occurrence, pattern matching, …
• Terms: term recognition and normalisation
• Entities: named entity recognition
• Relations: relation extraction
• Events: event extraction
• Associations
• Metaknowledge extraction
• Data mining, clustering

Questions:
• What is known about this disease, protein, person?
• What is linked with X?
• {Who, what} Xed {whom, what} where, when and how?
• What if…?
• Keyword search
• Is X possible, certain, probable, suggested, past, to come?
• What is this paper about?

Increased sophistication? Increased customisation!


A complex space

Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Arabic, Urdu, Japanese, Korean, …

Tasks: Translation, Information extraction, Semantic search, Question answering, Sentiment analysis, Summarization, Knowledge discovery, Database curation, Systematic reviewing, Pathway reconstruction, …

Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …

Text Types: Scientific articles (full papers/abstracts), Social media, Patents, Clinical records, EMR, Books, theses, reports, Newswire, …

Technology: Tokenizers, Sentence splitters, Paragraph splitters, NP chunkers, Syntactic parsers, Semantic parsers, NE recognizers, Relation extractors, Event extractors, …

Resources (mono- and multilingual): Gazetteers, Annotated corpora, Lexicons, Terminologies, Wordnets, Thesauri, Ontologies, Grammars, …

Diversity of Languages and Language Resources, including temporal diversity
Diversity of Contexts
Diversity of Applications


Europe’s Languages and Language Technology support

[Figure: languages grouped along a scale from good support through Language Technology to weak or no support (no ‘excellent’ support)]

• English
• Dutch, French, German, Italian, Spanish
• Catalan, Czech, Finnish, Hungarian, Polish, Portuguese, Swedish
• Basque, Bulgarian, Danish, Galician, Greek, Norwegian, Romanian, Slovak, Slovene
• Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian

http://www.meta-net.eu


Enhancing historical collections

• If you have a domain collection going back centuries
  – How easy is it for users to find answers to research questions?
• Language evolves, terms come and go, concepts drift, …
• TM can enhance collections in many ways
  – Handling temporal aspects of language is key
  – Enabling event-based semantic search


Looking into the past

• Semantic search for historians of medicine
  – Treatment and prevention of diseases over time
  – Medical and public health perspectives
• British Medical Journal archive (from 1840)
  – Around 350K articles
• London Medical Officer of Health reports (1848-1972) (Wellcome Library)
  – Around 5,000 reports from different boroughs


In historical collections, same concept expressed by different terms across different time periods

Users miss information due to unfamiliar terminology

TM to extract/link diachronic synonyms, organize in thesaurus

Use diachronic thesaurus for time-sensitive search
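
As a rough illustration of time-sensitive search, the sketch below expands a query with the terms that were in use during the period a user is interested in. It is not the system described in the talk: the thesaurus structure, the concept key, the function name and especially the year ranges are assumptions invented for the example. Only the term pair (“pulmonary tuberculosis” and its historical synonym “pulmonary phthisis”) comes from the search example in this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatedTerm:
    term: str
    first_year: int  # assumed first year of common use (illustrative only)
    last_year: int   # assumed last year of common use (illustrative only)


# Hypothetical diachronic thesaurus: one concept, several period-specific terms.
THESAURUS = {
    "tuberculosis_of_lung": [
        DatedTerm("pulmonary phthisis", 1840, 1920),
        DatedTerm("pulmonary tuberculosis", 1880, 2014),
    ],
}


def expand_query(query: str, start: int, end: int) -> list[str]:
    """Return the terms for the query's concept that were in use in [start, end]."""
    wanted = query.lower()
    for terms in THESAURUS.values():
        if any(t.term == wanted for t in terms):
            return [t.term for t in terms
                    if t.first_year <= end and t.last_year >= start]
    return [query]  # term not in the thesaurus: search as typed


print(expand_query("pulmonary tuberculosis", 1850, 1870))
print(expand_query("pulmonary tuberculosis", 1900, 1910))
```

With these invented date ranges, a search restricted to 1850-1870 is rewritten to the older term alone, while a search over 1900-1910 uses both terms.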


Traditional search

User searches for “pulmonary tuberculosis” but doesn’t know historical synonym “pulmonary phthisis”

User expands query

Narrow down results according to faceted search (facets derived both from document metadata and from text mining)

System automatically suggests related terms

Distribution of “pulmonary tuberculosis” and “pulmonary phthisis” across time

(A mock-up for user feedback)


Analysing events of interest to historians

Type: Affect
Description: An entity or event is affected, infected, changed or transformed, possibly by another entity or event
Participants:
  – Cause: of the affection
  – Target: entity or event affected
  – Subject: medical subject affected

Type: Cause
Description: An entity or event results in manifestation of another entity or event
Participants:
  – Cause: of the event
  – Result: resulting entity or event
  – Subject: medical subject affected
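
One way to picture how such event types could be stored and searched is sketched below. The class, field and function names are hypothetical, and the example event is invented rather than taken from the historical collections. Only the two event types and their participant roles follow the table above.

```python
from dataclasses import dataclass, field

# Participant roles allowed for each event type, following the table above.
ROLES = {
    "Affect": ("Cause", "Target", "Subject"),
    "Cause": ("Cause", "Result", "Subject"),
}


@dataclass
class Event:
    event_type: str
    trigger: str  # the word in the text that signals the event
    participants: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject roles that the event type does not define.
        unknown = set(self.participants) - set(ROLES[self.event_type])
        if unknown:
            raise ValueError(
                f"roles not valid for {self.event_type}: {sorted(unknown)}")


def find_events(events, role, value):
    """Event-based search: return events in which `role` is filled by `value`."""
    return [e for e in events if e.participants.get(role) == value]


# Invented example, for illustration only.
EVENTS = [
    Event("Cause", "led to",
          {"Cause": "overcrowding", "Result": "typhus", "Subject": "children"}),
]

print(find_events(EVENTS, "Result", "typhus"))
```

Searching by role rather than by keyword is what lets a query such as "what caused typhus?" be separated from "what did typhus cause?".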


Classic case of working together

• End user (typically) not a text miner
• Text miner (typically) not a domain expert
• Requirements and evaluation: challenge for both
• Need to work together to understand
  – How TM can help, what it can and cannot do
  – What questions are of interest
  – What role human has
  – What outcomes are desirable
  – What existing resources can be exploited


http://miningbiodiversity.org


Mining Biodiversity

Aim
• Transform Biodiversity Heritage Library into a next-generation social digital library
• 130,000 volumes of digitised legacy literature

A multi-disciplinary approach
1. Text Mining
2. Machine learning
3. Data visualisation
4. History of Science
5. Environmental History & Studies
6. Library and Information Science
7. Social Media


Mining Biodiversity

Semantic metadata extraction to support search

Observation

Habitation

Nutrition


Finding evidence

• Event extraction can drive semantic search as we’ve seen. We can go a step further…
• Example: application for Europe PubMed Central
• Deeply analyse documents
• Index relationships
• Key off search term, to dynamically generate from indexed relationships questions that have known answers
  – Not auto-completion
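
The idea of generating questions from indexed relationships can be sketched roughly as follows. This is not how EvidenceFinder is implemented: the triple format, the question templates and the function names are assumptions, and the three relationships are illustrative placeholders. What the sketch tries to show is that every suggested question is built from a stored relationship, so it is guaranteed to have at least one known answer, unlike auto-completion.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples, standing in for
# relationships extracted from full text. Illustrative only.
TRIPLES = [
    ("Mad1", "binds", "Mad2"),
    ("Mad2", "binds", "Cdc20"),
    ("Mad1", "regulates", "Mad2"),
]


def build_index(triples):
    """Index each triple under every entity it mentions (case-insensitive)."""
    index = defaultdict(list)
    for subj, rel, obj in triples:
        index[subj.lower()].append((subj, rel, obj))
        index[obj.lower()].append((subj, rel, obj))
    return index


def suggest_questions(term, index):
    """Map each generated question to its known answers for a search term."""
    questions = defaultdict(set)
    for subj, rel, obj in index.get(term.lower(), []):
        if subj.lower() == term.lower():
            questions[f"{subj} {rel} what?"].add(obj)
        if obj.lower() == term.lower():
            questions[f"What {rel} {obj}?"].add(subj)
    return {q: sorted(a) for q, a in questions.items()}


INDEX = build_index(TRIPLES)
for question, answers in suggest_questions("mad2", INDEX).items():
    print(question, answers)
```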


EvidenceFinder: a new way to discover

83,717,24 Sentences about genes, proteins, diseases & metabolites
2,550,328 Documents

How can you tell if an article is relevant to you in your listed search results? Are there hidden gems in the full-text literature that you might be missing? Are there smarter ways to browse the biomedical literature?

Europe PMC’s EvidenceFinder enriches your literature exploration by suggesting questions alongside your search results, providing a way to find information buried in full text articles that is directly relevant to you. This helps you identify articles and research that you might have overlooked through direct key word searching.

http://europepmc.org/


Finding unknown associations

• Need massive amounts of text to find unknown associations, generate hypotheses

• Must go across collections: silos irrelevant to researcher

• Must go across disciplines: cognate and distant – all can shed light

• Information often available in literature many years before, but unsuspected as not explicitly written down
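
A very simple way to surface candidate associations that are never stated explicitly is to look for two concepts that do not co-occur in any document but share intermediate concepts. The sketch below does this over toy data. The "documents" are invented concept sets that echo a well-known literature-based discovery example, and the scoring (number of shared intermediates) is an assumption chosen for brevity, not the ranking used by any tool mentioned in this talk.

```python
from collections import defaultdict
from itertools import combinations

# Toy "documents": each is the set of concepts it mentions (invented data).
DOCS = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "Raynaud's disease"},
    {"fish oil", "platelet aggregation"},
    {"platelet aggregation", "Raynaud's disease"},
    {"aspirin", "platelet aggregation"},
]


def cooccurrence(docs):
    """Map each concept to the set of concepts it shares a document with."""
    links = defaultdict(set)
    for doc in docs:
        for a, b in combinations(sorted(doc), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def hidden_associations(start, docs):
    """Rank concepts never seen with `start` by number of shared intermediates."""
    links = cooccurrence(docs)
    direct = links[start]
    candidates = defaultdict(set)
    for mid in direct:
        for target in links[mid]:
            if target != start and target not in direct:
                candidates[target].add(mid)
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


for target, via in hidden_associations("fish oil", DOCS):
    print(target, "via", sorted(via))
```

Each candidate is a hypothesis for a researcher to test, not a finding; going across collections and disciplines simply gives the co-occurrence graph more intermediates to work with.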


Reproducing a finding - reported (11/2011) in Nature Medicine - with FACTA+, using MEDLINE prior to date

Info=degree of surprise

http://www.nactem.ac.uk/facta-visualizer/

SGK1 gene, enzyme and symptom: high level of enzyme = infertile; low level = miscarriage


Building models

• In many domains, build models to understand relationships and processes
• Rely on literature to provide evidence
• Slow, laborious work
• Example: reconstruction of biological pathways


Nodes: 652

Links: 444

600 papers were read to construct the pathway: “inevitable gaps” due to manual methods

Oda & Kitano (2006) in Mol Syst Biol


Mapping reactions and text: PathText

Link to text mining results (green icon)


Building models based on textual evidence

1. The mitotic arrest-deficient protein Mad1 forms a complex with Mad2, which is required for imposing mitotic arrest on cells in which the spindle assembly is perturbed. PMID: 18981471

2. Mad1, an upstream regulator of Mad2, forms a tight core complex with Mad2 and facilitates Mad2 binding to Cdc20. PMID: 18318601
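
To make concrete how sentences like these can become evidence for a link in a model, here is a deliberately naive sketch that pulls "forms a complex with" relations out of the two sentences above using one hand-written pattern. The pattern and function name are assumptions made for illustration. Real text mining systems rely on parsing and trained relation or event extractors rather than a single regular expression.

```python
import re

# The two evidence sentences quoted above (numbering and PMIDs omitted).
SENTENCES = [
    "The mitotic arrest-deficient protein Mad1 forms a complex with Mad2, "
    "which is required for imposing mitotic arrest on cells in which the "
    "spindle assembly is perturbed.",
    "Mad1, an upstream regulator of Mad2, forms a tight core complex with "
    "Mad2 and facilitates Mad2 binding to Cdc20.",
]

# Hand-written pattern (illustrative assumption):
#   <X>[, apposition,] forms a [modifiers] complex with <Y>
COMPLEX = re.compile(
    r"(\w+)(?:, [^,]+,)? forms a (?:\w+ )*?complex with (\w+)"
)


def extract_complexes(sentences):
    """Return (partner, partner) pairs asserted to form a complex."""
    pairs = []
    for sentence in sentences:
        match = COMPLEX.search(sentence)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


print(extract_complexes(SENTENCES))
```

Both sentences yield the same Mad1/Mad2 pair, which is the kind of repeated, citable evidence that can be attached to a reaction in a pathway.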


Systematic reviews, etc.

• Systematic reviews, evidence-based public health (EBPH) reviews
  – Balanced reviews to aid policy, guideline, best practice development
• Trade-offs: cost, time available, number of hits to screen/retain, number of full texts to read
  – May miss relevant items
• EBPH reviews: complex questions, exploration of scope required
• Even basic TM can save 75% of manual effort (EPPI-Centre, IoE)
• Use of TM to identify, rank, cluster most relevant items
• NaCTeM & Univ Liverpool currently working with NICE on supporting EBPH reviewers


Interoperability and evaluation

• TM involves many processes and resources
• May be no need to customise, just to select from repositories of available tools and resources
• But tools and resources often incompatible at linguistic/semantic levels
• Difficult to mix and match, to find best combination for task at hand
• Hence drive towards interoperability to enable users to get best out of TM


A tool can show different results when trained on one corpus and tested on another, compared to training and testing on the same corpus

Scores by training data and test data corpus (diagonal: same corpus for both)

            AIMed  GENETAG  GENIA GGP  PennBioIE   PIR
AIMed        89.5     38.5       63.3       40.8  54.7
GENETAG      58.4     75.2       43.1       31.3  56.0
GENIA GGP    66.3     31.0       90.7       34.1  42.6
PennBioIE    65.9     41.2       55.4       84.1  54.0
PIR          54.3     42.0       49.0       37.0  83.6
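
The size of the gap can be read off the table with a little arithmetic. The sketch below averages the same-corpus scores (the diagonal) and the cross-corpus scores (everything else). The numbers are copied from the table above; the function name is hypothetical. The slide text does not make clear which axis is training data and which is test data, and the calculation does not depend on it.

```python
CORPORA = ["AIMed", "GENETAG", "GENIA GGP", "PennBioIE", "PIR"]

# Scores copied from the table above; row i and column j both index CORPORA.
SCORES = [
    [89.5, 38.5, 63.3, 40.8, 54.7],
    [58.4, 75.2, 43.1, 31.3, 56.0],
    [66.3, 31.0, 90.7, 34.1, 42.6],
    [65.9, 41.2, 55.4, 84.1, 54.0],
    [54.3, 42.0, 49.0, 37.0, 83.6],
]


def same_vs_cross(scores):
    """Return (mean same-corpus score, mean cross-corpus score)."""
    n = len(scores)
    same = [scores[i][i] for i in range(n)]
    cross = [scores[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(same) / len(same), sum(cross) / len(cross)


same_mean, cross_mean = same_vs_cross(SCORES)
print(f"same corpus: {same_mean:.1f}, cross corpus: {cross_mean:.1f}")
```

On these figures the same-corpus mean is about 84.6 and the cross-corpus mean about 47.9, which is the point of the slide: a score reported on a tool's own training corpus says little about how it will behave on your texts.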

Importance of evaluating tools


Text mining workflows: rapid TM development, interoperability, common data representation, sharable type system, evaluation

Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. (2011). U-Compare: a modular NLP workflow construction and evaluation system. IBM Journal of Research and Development.

Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.


U-Compare: Evaluate and Compare TM Workflows

[Figure: from a library of components (Sentence Splitter A, Sentence Splitter B, POS tagger A, POS tagger B, NER), alternative pipelines are assembled as Workflow A, Workflow B and Workflow C, and each is scored (F-Score A, F-Score B, F-Score C)]

Components shown:
• Sentence splitters: UIMA SS, OpenNLP SS, GENIA SS
• Tokenizers: UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer
• Taggers: GENIA Tagger, Stepp Tagger, OpenNLP Tagger
• NER: ABNER, MedT-NER, GENIA Tagger as NER
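
The "mix and match" idea can be sketched as enumerating every pipeline that picks one component per stage. This does not use U-Compare's actual interface: the dictionary layout and function name are assumptions, and the component names are only labels borrowed from the slide, with no processing behind them.

```python
from itertools import product

# Component labels as shown on the slide; here they are plain strings.
LIBRARY = {
    "sentence_splitter": ["Sentence Splitter A", "Sentence Splitter B"],
    "pos_tagger": ["POS tagger A", "POS tagger B"],
    "ner": ["NER"],
}


def enumerate_workflows(library):
    """Yield every pipeline that picks one component per stage, in stage order."""
    stages = list(library)
    for combo in product(*(library[stage] for stage in stages)):
        yield dict(zip(stages, combo))


WORKFLOWS = list(enumerate_workflows(LIBRARY))
for number, workflow in enumerate(WORKFLOWS, start=1):
    print(number, " -> ".join(workflow.values()))
```

Two splitters, two taggers and one NER component give four candidate workflows. The number grows multiplicatively with each added alternative, which is why interoperable components and automated scoring matter.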


• Web-based application
• Interactive creation of workflows
• Cloud and high-performance computing

• Integrated TM/NLP processing system
• GUI for workflow creation
• Library of ready-to-use processing components
• Statistics, visualizations, developer APIs
• Supports UIMA and sharable type system
• http://argo.nactem.ac.uk


Workflow Editor


Evaluation of Chemical NER workflows

Supplies gold standard corpus

Removes gold annotations so that they can be created automatically

Combinations of syntactic and semantic components create annotations

Compares and reports precision, recall and F1 of the different branches against the gold standard corpus
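
For readers less familiar with the three measures, the sketch below computes precision, recall and F1 for two workflow branches against a gold standard. It assumes exact matching of (start, end, label) spans, which is only one of several possible matching criteria. The annotation values and branch names are invented for illustration.

```python
def prf1(gold, predicted):
    """Precision, recall and F1 of predicted annotations against gold ones."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Annotations as (start, end, label) spans; all values are invented.
GOLD = {(0, 7, "CHEMICAL"), (15, 22, "CHEMICAL"), (30, 37, "CHEMICAL")}
BRANCHES = {
    "branch_1": {(0, 7, "CHEMICAL"), (15, 22, "CHEMICAL")},
    "branch_2": {(0, 7, "CHEMICAL"), (16, 22, "CHEMICAL"),
                 (30, 37, "CHEMICAL"), (40, 45, "CHEMICAL")},
}

REPORT = {name: prf1(GOLD, predicted) for name, predicted in BRANCHES.items()}
for name, (p, r, f) in REPORT.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Here the first branch finds fewer annotations but all of them are correct, while the second finds more but with boundary errors and a spurious span, so it scores lower on F1 despite proposing more.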


The C change in TM in the UK

• 1/7/2014: Copyright exception for text and data mining for non-commercial purposes
• 1/10/2014: Copyright exception for quotation
• If you have lawful access to any text, you can now
  – Copy it for non-commercial text mining purposes
  – Display/communicate results (e.g., annotations, associations) of TM to others
  – Illustrate results with snippets from text (quotations)
• None of this can be overridden by contract (licence, Ts&Cs)
• https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375954/Research.pdf


Current state in the EU

• Copyright and licensing in relation to TM is a hot topic
• “The right to read is the right to mine” (Open Knowledge Foundation)
• Hope on the horizon:
  – EC President Jean-Claude Juncker to take steps within his first 6 months to modernise copyright rules “in light of digital revolution and changed consumer behaviour”


Take home messages

• Text mining can be applied in any domain and for many tasks
• In text mining, no one size fits all
  – Text miners and users must work closely together
• Content (at least in UK) can be mined on a massive scale for non-commercial purposes
  – but even a modest collection can benefit from text mining
• Who is your text mining champion?


Contact and Acknowledgements

• www.nactem.ac.uk
• Funders and sponsors: MRC, AHRC, JISC, BBSRC, ESRC, NIH, DARPA, Europe PubMed Central funders (Wellcome Trust + 25 funders), NHS, European Commission
• Previous funding from: AstraZeneca, Pfizer, Elsevier, Nature Publishing Group, BBC
