Supporting the Research Process The NaCTeM Text Mining Service William Black Informatics,...

Post on 28-Mar-2015


Supporting the Research Process

The NaCTeM Text Mining Service

William Black
Informatics, Manchester

Contents

• What is Text Mining/What is NaCTeM?
• Approaches/Methods
• Text Mining Tasks
– IE, Argumentative Zoning, Terminology Discovery
• End-user services for researchers
• NaCTeM activities with social scientists

What is Text Mining?

• Knowledge discovery from textual sources
– Primary sources
   • Documents, News, Web
– Scientific Literatures
• Using NLP, Ontologies, IR on a large scale

What is the Text Mining Centre? http://www.nactem.ac.uk

• Established in 2004 in response to a JISC/EPSRC/BBSRC initiative

• A Manchester and Liverpool collaboration
– Formerly also UMIST, Salford
– Accommodated in the Manchester Interdisciplinary Biocentre (MIB)

• Develop a variety of national services based on the application to biological sciences, with deployment from Autumn 2006

• Initially in biological sciences, with a second focus on social science during 2006-7

Text Mining - Approaches

• Distinguished from IR by semantic analysis leading to extraction of entities, facts, events, not mere documents.

• Distinguished from the Semantic Web by use of automated analysis based on robust natural language processing.

• A wide variety of methods and analyses ranging from domain-independent to domain-specific.

Methods of Text Mining

• Pipelined processes performing increasing levels of analysis, common to all approaches
– Document structure analysis, tokenization, tagging, phrasal chunking, named entity recognition/classification, fact and event extraction.
– Indexed to provide conceptual IR services
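The pipelined idea can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not NaCTeM's implementation: the gazetteer, the capitalisation-based tagger and all function names are hypothetical, and stages such as document structure analysis and phrasal chunking are omitted. It only shows each stage consuming the previous stage's output, with the final result indexed by concept rather than by word.

```python
import re
from collections import defaultdict

# Hypothetical toy gazetteer, invented for illustration only.
GAZETTEER = {
    "Manchester": "LOCATION",
    "Liverpool": "LOCATION",
    "NaCTeM": "ORGANIZATION",
}

def tokenize(text):
    """Stage 1: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens):
    """Stage 2: crude stand-in for a PoS tagger (capitalised -> 'NNP')."""
    return [(tok, "NNP" if tok[0].isupper() else "X") for tok in tokens]

def recognise_entities(tagged):
    """Stage 3: classify proper-noun tokens found in the gazetteer."""
    return [(tok, GAZETTEER[tok])
            for tok, pos in tagged
            if pos == "NNP" and tok in GAZETTEER]

def build_index(docs):
    """Stage 4: map (entity type, entity) to the ids of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok, etype in recognise_entities(tag(tokenize(text))):
            index[(etype, tok)].add(doc_id)
    return index

docs = {
    "d1": "NaCTeM is based in Manchester.",
    "d2": "Liverpool is a partner.",
}
index = build_index(docs)
```

A query such as `index[("LOCATION", "Manchester")]` then retrieves documents by the entity they mention, which is the sense in which the index is conceptual.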

Sample text mining sub-tasks

• Named entity recognition and classification.
• Terminology discovery and ontology maintenance
• Information extraction (IE) in limited domains - for intelligence analysts and scientists
• Summarization - informative, tailored, multilingual, multi-document
• Open-domain IE and QA
• Association mining over databases of extracted facts.

Illustrations of IE on successive full-page screenshots

• Named entity phrase bracketing

• Named entity extraction

• Fact extraction and slot filling

• An application to a research literature

Terminology Discovery - Ananiadou, NaCTeM

• A form of unsupervised learning, whose only required resource is a general purpose PoS tagger.

• Can be applied to text in any language, domain or genre to reveal terminology on the basis of phrasehood and distribution.

• TerMine will be among the first deployed NaCTeM tools.
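As a rough illustration of discovering terms from PoS-tagged text alone, the sketch below collects adjective/noun sequences that end in a noun (a stand-in for "phrasehood") and ranks them by raw frequency (a stand-in for "distribution"). This is not TerMine's actual scoring; the tag names, function names and hand-tagged input are assumptions made for the example.

```python
from collections import Counter

def candidate_terms(tagged_tokens):
    """Yield multi-word runs of ADJ/NOUN tokens that end in a NOUN."""
    run = []
    # A sentinel token flushes the final run at the end of the sentence.
    for word, pos in list(tagged_tokens) + [("", "END")]:
        if pos in ("ADJ", "NOUN"):
            run.append((word.lower(), pos))
            continue
        # Trim trailing adjectives so the candidate ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if len(run) >= 2:
            yield " ".join(w for w, _ in run)
        run = []

def rank_terms(tagged_sentences):
    """Count candidates over all sentences, most frequent first."""
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(candidate_terms(sentence))
    return counts.most_common()

# Hand-tagged input, standing in for the output of a general-purpose tagger.
sentences = [
    [("Text", "NOUN"), ("mining", "NOUN"), ("supports", "VERB"),
     ("knowledge", "NOUN"), ("discovery", "NOUN")],
    [("We", "PRON"), ("use", "VERB"), ("text", "NOUN"), ("mining", "NOUN")],
]
ranked = rank_terms(sentences)
```

Because the only resource is the tagger output, the same sketch could be pointed at any language or domain for which a tagger exists.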


Argumentative Zoning - Simone Teufel, Cambridge Computing Lab

• BKG: General scientific background (yellow)
• OTH: Neutral descriptions of others' work (orange)
• OWN: Neutral descriptions of own, new work (blue)
• AIM: Statements of the particular aim of the current paper (pink)
• TXT: Statements of the textual organisation of the current paper (red)
• CTR: Contrastive or comparative statements, including explicit mention of weaknesses of other work (green)
• BAS: Statements that own work is based on other work (purple)
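The zone scheme above can be held as a small lookup table. The sketch below pairs it with a toy cue-phrase labeller, purely to show what assigning a zone to a sentence looks like. The cue phrases, the default zone and the function name are hypothetical, and simple substring matching is far cruder than a real zoning system.

```python
# Zone codes, descriptions and display colours as listed on the slide.
ZONES = {
    "BKG": ("General scientific background", "yellow"),
    "OTH": ("Neutral descriptions of others' work", "orange"),
    "OWN": ("Neutral descriptions of own, new work", "blue"),
    "AIM": ("Statements of the aim of the current paper", "pink"),
    "TXT": ("Statements of textual organisation", "red"),
    "CTR": ("Contrastive or comparative statements", "green"),
    "BAS": ("Statements that own work is based on other work", "purple"),
}

# Hypothetical cue phrases, invented for illustration; checked in order.
CUES = [
    ("in this paper we", "AIM"),
    ("is organised as follows", "TXT"),
    ("we build on", "BAS"),
    ("however", "CTR"),
]

def zone_of(sentence, default="OWN"):
    """Return the zone of the first matching cue, else an arbitrary default."""
    lowered = sentence.lower()
    for cue, zone in CUES:
        if cue in lowered:
            return zone
    return default

label = zone_of("However, their method fails on long sentences.")
colour = ZONES[label][1]
```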

Argumentative Zoning Example


End-user services based on full NLP and conceptual indexing

• Two conceptual IR services based on prior full-scale NLP analysis of Medline at Tsujii Lab, University of Tokyo

– InfoPubMed: A complex tool supporting a research workflow for literature review and knowledge discovery/hypothesis generation

– Medie: A simple IR interface as intuitive as Google, but returning fact-bearing sentences, which are more than document surrogates.

Gene/gene products you are interested in

Fields

By clicking this button, you can restrict search fields.

By clicking this button, you can restrict species.

GeneBoxes

Drag this GeneBox to the Interaction Viewer

Drag this InteractionBox to the ContentViewer

Sentence Box

Property indicating that the co-occurrence in the sentence is direct evidence of interaction

Property indicating that the co-occurrence in the sentence is a mere co-occurrence


Possible end-user service based on AZ


More than Google’s PageRank™, because the links are typed.
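One way to read "typed links" is as citation edges labelled with the kind of relationship, for instance the CTR and BAS zones of Argumentative Zoning. The sketch below assumes that reading. The paper ids, link data and function name are hypothetical, and the slide does not specify how such a service would be built.

```python
from collections import defaultdict

# (citing paper, cited paper, link type); hypothetical data for illustration.
LINKS = [
    ("P2", "P1", "BAS"),  # P2 is based on P1
    ("P3", "P1", "CTR"),  # P3 contrasts itself with P1
    ("P4", "P1", "BAS"),
    ("P4", "P2", "CTR"),
]

def incoming_by_type(links):
    """Group the papers citing each paper by the type of the citation."""
    table = defaultdict(lambda: defaultdict(list))
    for citing, cited, link_type in links:
        table[cited][link_type].append(citing)
    return table

table = incoming_by_type(LINKS)
built_on_p1 = table["P1"]["BAS"]
critical_of_p1 = table["P1"]["CTR"]
```

An untyped link count would report three citations of P1; the typed table separates the two papers that build on it from the one that contrasts with it.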

NaCTeM and Social Science/Humanities

• In Year 3 (from Oct 2006), develop pilot service aimed at social science.

• Local links with NCESS
• Preparatory invited workshop held in May, 2006.
• Text-mining and Digitised C19th Research Resources Workshop with British Library

Workshop on Text Mining in Social Sciences

Presentations available at NaCTeM Web page

– Bridging qualitative and quantitative methods for social sciences using text mining techniques (Sophia Ananiadou)

– Text Mining Activities at the National Centre (Sophia Ananiadou, Jun-ichi Tsujii, Paul Watry)

– Smart Qualitative Data: Methods and Community Tools for Data Mark-Up SQUAD (Louise Corti)

– Author Identification (Katerina T. Frantzi)

– Sentiment Analysis and Financial Grids (Lee Gillam)

– Concordances and semi-automatic coding in qualitative analysis: possibilities and barriers (Graham R. Gibbs)

– Bridging quantitative and qualitative methods for social sciences using text mining techniques (Tetsuya Nasukawa)

– Computer-Assisted Content Analysis (Andrew Wilson)

NaCTeM status

• NaCTeM is almost at the end of its tool development phase

• Moving to deployment of services this Autumn

• Will include domain-independent terminology management from the outset

• Other applications of interest to social science researchers will be appearing approx. 1 year from now.