Supporting the Research Process The NaCTeM Text Mining Service William Black Informatics, Manchester


Supporting the Research Process

The NaCTeM Text Mining Service

William Black, Informatics, Manchester


Contents

• What is Text Mining/What is NaCTeM?

• Approaches/Methods

• Text Mining Tasks

– IE, Argumentative Zoning, Terminology Discovery

• End-user services for researchers

• NaCTeM activities with social scientists


What is Text Mining?

• Knowledge discovery from textual sources

– Primary sources

• Documents, News, Web

– Scientific Literatures

• Using NLP, Ontologies, IR on a large scale


What is the Text Mining Centre? http://www.nactem.ac.uk

• Established in 2004 in response to a JISC/EPSRC/BBSRC initiative

• A Manchester and Liverpool collaboration

– Formerly also UMIST, Salford

– Accommodated in the Manchester Interdisciplinary Biocentre (MIB)

• Develop a variety of national services based on the application to biological sciences, with deployment from Autumn 2006

• Initially in biological sciences, with a second focus on social science during 2006-7


Text Mining - Approaches

• Distinguished from IR by semantic analysis leading to extraction of entities, facts, events, not mere documents.

• Distinguished from the Semantic Web by use of automated analysis based on robust natural language processing.

• A wide variety of methods and analyses ranging from domain-independent to domain-specific.


Methods of Text Mining

• Pipelined processes performing increasing levels of analysis, common to all approaches

– Document structure analysis, tokenization, tagging, phrasal chunking, named entity recognition/classification, fact and event extraction.

– Indexed to provide conceptual IR services
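
The pipelined stages above can be sketched as a chain of small functions. This is only an illustrative toy, not NaCTeM's implementation: the `LEXICON` and `GAZETTEER` tables, the function names, and the "verb between two entities" fact rule are all assumptions standing in for trained components.

```python
import re
from collections import defaultdict

# Hypothetical toy resources; a real system would use trained models.
LEXICON = {"the": "DT", "protein": "NN", "binds": "VB", "activates": "VB"}
GAZETTEER = {"p53": "GENE", "MDM2": "GENE"}

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens):
    """Assign a coarse part-of-speech tag, defaulting to noun."""
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]

def recognise_entities(tagged):
    """Attach an entity class to tokens found in the gazetteer."""
    return [(tok, pos, GAZETTEER.get(tok)) for tok, pos in tagged]

def extract_facts(annotated):
    """Emit (entity, verb, entity) triples when a verb sits between two entities."""
    facts = []
    for i in range(1, len(annotated) - 1):
        left, mid, right = annotated[i - 1], annotated[i], annotated[i + 1]
        if mid[1] == "VB" and left[2] and right[2]:
            facts.append((left[0], mid[0], right[0]))
    return facts

def run_pipeline(text):
    """Run each stage on the output of the previous one."""
    return extract_facts(recognise_entities(tag(tokenize(text))))

def build_index(documents):
    """Index document ids by the entities that take part in extracted facts."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for subj, _verb, obj in run_pipeline(text):
            index[subj].add(doc_id)
            index[obj].add(doc_id)
    return index
```

Because the index is keyed on extracted entities rather than raw words, a lookup returns only documents in which the entity takes part in a fact, which is the sense in which the slide calls the IR service "conceptual".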


Sample text mining sub-tasks

• Named entity recognition and classification.

• Terminology discovery and ontology maintenance

• Information extraction (IE) in limited domains - for intelligence analysts and scientists

• Summarization - informative, tailored, multilingual, multi-document

• Open-domain IE and QA

• Association mining over databases of extracted facts.


Illustrations of IE on successive full-page screenshots

• Named entity phrase bracketing

• Named entity extraction

• Fact extraction and slot filling

• An application to a research literature


Terminology Discovery - Ananiadou, NaCTeM

• A form of unsupervised learning, whose only required resource is a general purpose PoS tagger.

• Can be applied to text in any language, domain or genre to reveal terminology on the basis of phrasehood and distribution.

• TerMine will be among the first deployed NaCTeM tools.
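
As a rough illustration of discovering terms from "phrasehood and distribution", the sketch below takes sentences that are already PoS-tagged, collects multi-word adjective/noun runs ending in a noun (phrasehood), and ranks them by frequency (distribution). It does not reproduce TerMine's actual scoring; the tag names `JJ`/`NN`, the function names, and the plain frequency ranking are assumptions made for the example.

```python
from collections import Counter

def candidate_terms(tagged_sentence):
    """Yield maximal adjective/noun runs that end in a noun and span 2+ words."""
    run = []
    # A sentinel item flushes the final run at the end of the sentence.
    for word, pos in list(tagged_sentence) + [("", "END")]:
        if pos in ("JJ", "NN"):
            run.append((word.lower(), pos))
            continue
        # Trim trailing adjectives so the candidate ends in a noun.
        while run and run[-1][1] != "NN":
            run.pop()
        if len(run) >= 2:
            yield " ".join(w for w, _ in run)
        run = []

def rank_terms(tagged_sentences):
    """Count candidates over the corpus and return them most frequent first."""
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(candidate_terms(sentence))
    return counts.most_common()
```

The only resource consumed is the tagger output, which is why an approach of this kind carries over to other languages, domains and genres.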



Argumentative Zoning

Simone Teufel, Cambridge Computing Lab

• BKG: General scientific background (yellow)

• OTH: Neutral descr’s of others’ work (orange)

• OWN: Neutral descr’s of own, new work (blue)

• AIM: Stmts of particular aim of current paper (pink)

• TXT: Stmts of textual org. of current paper (red)

• CTR: Contrastive or comparative stmts incl. explicit mention of weaknesses of other work (green)

• BAS: Stmts that own work is based on other work (purple)
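
The seven zone labels above form a small closed tag set, which can be written down as an enumeration. The cue-phrase tagger that follows is a deliberately naive stand-in to show how sentences map to zones; the cue list and the `zone_of` function are hypothetical and are not taken from Teufel's work, which this transcript does not describe in detail.

```python
from enum import Enum

class Zone(Enum):
    BKG = "general scientific background"
    OTH = "neutral description of others' work"
    OWN = "neutral description of own, new work"
    AIM = "statement of the particular aim of the current paper"
    TXT = "statement of textual organisation of the current paper"
    CTR = "contrastive or comparative statement"
    BAS = "statement that own work is based on other work"

# Hypothetical cue phrases, checked in order; the first match wins.
# OTH has no cue in this toy list.
CUES = [
    ("in this paper we", Zone.AIM),
    ("is organised as follows", Zone.TXT),
    ("however", Zone.CTR),
    ("we build on", Zone.BAS),
    ("we ", Zone.OWN),
]

def zone_of(sentence, default=Zone.BKG):
    """Return the zone of the first matching cue, or the default zone."""
    lowered = sentence.lower()
    for cue, zone in CUES:
        if cue in lowered:
            return zone
    return default
```

A realistic zoner would be a trained classifier over many features; the point here is only the shape of the output, one zone label per sentence.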


Argumentative Zoning Example


End-user services based on full NLP and conceptual indexing

• Two conceptual IR services based on prior full-scale NLP analysis of Medline at Tsujii Lab, University of Tokyo

– InfoPubMed: A complex tool supporting a research workflow for literature review and knowledge discovery/hypothesis generation

– Medie: A simple IR interface as intuitive as Google, but returning fact-bearing sentences, which are more than document surrogates.


Gene/gene products you are interested in


Fields

By clicking this button, you can restrict search fields.

By clicking this button, you can restrict species.

GeneBoxes


Drag this GeneBox to the Interaction Viewer


Drag this InteractionBox to ContentViewer


Sentence Box

Property indicating that the co-occurrence in the sentence is direct evidence of interaction

Property indicating that the co-occurrence in the sentence is a mere co-occurrence


Possible end-user service based on AZ


More than Google’s PageRank™, because the links are typed.
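
One way to picture what typed links add is a citation graph whose edges carry a label. The sketch below assumes, for illustration only, that the labels are Argumentative Zoning categories such as CTR and BAS attached to the citing sentence; the class and method names are hypothetical and do not describe any deployed service.

```python
class TypedCitationGraph:
    """Citation graph in which every link records why the paper was cited."""

    def __init__(self):
        # cited paper -> list of (citing paper, link type)
        self._incoming = {}

    def add_citation(self, citing, cited, link_type):
        self._incoming.setdefault(cited, []).append((citing, link_type))

    def citation_count(self, paper):
        """Untyped view: how many links point at the paper."""
        return len(self._incoming.get(paper, []))

    def citing(self, paper, link_type):
        """Typed view: papers that cite `paper` with the given link type."""
        return [c for c, t in self._incoming.get(paper, []) if t == link_type]
```

With untyped links, three citations look alike; with typed links, a reader can ask separately which papers build on a result and which ones criticise it.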


NaCTeM and Social Science/Humanities

• In Year 3 (from Oct 2006), develop pilot service aimed at social science.

• Local links with NCESS

• Preparatory invited workshop held in May, 2006.

• Text-mining and Digitised C19th Research Resources Workshop with British Library


Workshop on Text Mining in Social Sciences

Presentations available at NaCTeM Web page

– Bridging qualitative and quantitative methods for social sciences using text mining techniques (Sophia Ananiadou)

– Text Mining Activities at the National Centre (Sophia Ananiadou, Jun-ichi Tsujii, Paul Watry)

– Smart Qualitative Data: Methods and Community Tools for Data Mark-Up SQUAD (Louise Corti)

– Author Identification (Katerina T. Frantzi)

– Sentiment Analysis and Financial Grids (Lee Gillam)

– Concordances and semi-automatic coding in qualitative analysis: possibilities and barriers (Graham R. Gibbs)

– Bridging quantitative and qualitative methods for social sciences using text mining techniques (Tetsuya Nasukawa)

– Computer-Assisted Content Analysis (Andrew Wilson)


NaCTeM status

• NaCTeM is almost at the end of its tool development phase

• Moving to deployment of services this Autumn

• Will include domain-independent terminology management from the outset

• Other applications of interest to social science researchers will be appearing approx. 1 year from now.