Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Transcript
Page 1: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Data mining dissertations
Adventures and Experiences in the World of Chemistry

Antony Williams
CLIR/DLF Postdoctoral Fellowship Summer Seminar, July 2014

Page 2: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

What a small world…

Page 3: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Who’s got an ORCID?
• Who has heard of, or is involved with, AltMetrics?
• Who has edited a Wikipedia page?
• Who has direct experience of text mining?

• All slides already on Slideshare here:
• www.slideshare.net/AntonyWilliams

Before we start….

Page 4: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Context – why do we want to mine data?

• Our experiences in extracting theses:
– Text and data mining
– Chemistry as an example
– Before you start
– Resources and tools

Contents

Page 5: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Let’s map together all historical chemistry data and build systems to integrate

• Heck, let’s integrate chemistry and biology data and add in disease data too

• Let’s model the data and see if we can extract new relationships – quantitative and qualitative

• Let’s make it all available on the web

Taking on a big challenge…

Page 6: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 7: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• We’re going to map the world
• We’re going to take photos of as many places as we can and link them together
• We’ll let people annotate and curate the map
• Then let’s make it available free on the web
• We’ll make it available for decision making
• Put it on Mobile Devices, Give it Away

What about this….

Page 8: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

I’m from here…on Google

Page 9: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Wikipedia

Page 10: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Wikipedia

Page 11: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 12: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 13: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 14: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

The Power of Contribution

Page 15: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

How do you spell Afonwen?

Page 16: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

And there’s Denbigh…

Page 17: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• So the world can be mapped…
• We can enter a 3D world within the map
• We can add annotations
• We can use the data, reference it, we can extract it, we can make decisions with it
• And we can do it on our lap, in our hands
• Let’s do this for chemistry…

Whoa…

Page 18: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Once upon a time we built a database….

In a basement not far away…

Page 19: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

ChemSpider

Page 20: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

ChemSpider and Data Validation

Page 21: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Dictionary Linking

Page 22: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Dictionary Linking

Page 23: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• This is not new, you know the story…
• So much data of value contained within a publication and delivered in a PDF form
• “PDF files, and especially unclear licensing, don’t allow me at the data so I can rework, reuse, repurpose, text mine etc.”

• “I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with capabilities I need, and the publishers should just do it”

Data in a Scientific Publication

Page 24: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

It is so difficult to navigate…

What’s the structure?

Are they in our file?

What’s similar?

What’s the target?

Pharmacology data?

Known Pathways?

Working On Now?

Connections to disease?

Expressed in right cell type?

Competitors?

IP?

Page 25: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Manage “all” of the chemistry data associated with chemical substances

• Data to be downloadable, reusable, interactive
• Build a platform that enables the scientist

• Data storage, validation, standardization and curation

• Collaborative data sharing

• Provide data platform that can enable and enhance publishing of scientific papers

We set a vision…

Page 26: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Every compound from every article at RSC is extracted, in a database, and linked

• Chemical properties are extracted, databased and used for predictive models

• Data tables are downloadable, interactive and not just “dumb-PDFs”

• …and what can we extract from chemistry theses too?

XXX Years from Now at RSC

Page 27: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• We are seen as one of the repositories for published AND unpublished research data

• An intuitive platform for research data management in the cloud

• Individual, collaborative and public data management of diverse data in the cloud

• …and where all data referenced in a thesis is available at a button click

XXX Years from Now at RSC

Page 28: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• But how does it map onto your domain??

So this is chemistry…

Page 29: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Mining as an allegory

Page 30: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• You have a mountain of stuff which contains valuable nuggets

• You (more or less) know what you’re looking for

• You know what you’re going to do with it once you have it

Mining as an allegory - intent

Page 31: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• You get lots of stuff out

• It requires sifting and grading

• It’s a triumph if you manage to extract 80-90% of what is there

• You will go back to the heap and redo it

Mining as an allegory - result
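
The "80-90% of what is there" figure is what text miners would usually call recall. A minimal sketch of how it might be scored, assuming you have a small hand-annotated gold set to compare against. The names below are made up for illustration and do not come from the talk:

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for mined items against a hand-made gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: how much of what we dug out is real ore
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: how much of the ore in the heap we actually got out
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: five names annotated by hand; the miner finds four
# of them plus one false hit ("reflux" is not a compound).
gold = {"aspirin", "benzene", "thionyl chloride", "epichlorohydrin", "iloprost"}
extracted = {"aspirin", "benzene", "thionyl chloride", "iloprost", "reflux"}
precision, recall = precision_recall(extracted, gold)
```

Scoring a small sample by hand before sending in the machines is one way to decide whether a second pass over the heap is worth the effort.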

Page 32: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• That which is easy to get out - is well known and unlikely to be novel

• The novel and interesting stuff is likely to be rare and not easily defined

Mining as an allegory - effort

Page 33: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Do the initial investigations by hand

• Send in the machines later

• Still needs some humans tweaking

Mining as allegory - automation

Page 34: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Context

Page 35: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• From Utopia Documents team
• Good at extracting structure from typeset PDFs
• http://pdfx.cs.man.ac.uk/

PDFX

Page 36: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

OCR recognition

• Underlining doesn’t help OCR

• In this case it was the only signpost to the department, supervisor and funding details

Page 37: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Hardcopy
• Scanned and OCR’d PDF
• PDF derived from Word
• Word or LaTeX

• …and for OCR not all are born equal
• …and of course history and language are a major influence. “Oil of vitriol”

Building blocks to mine…

Page 38: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Ontologies, taxonomies, dictionaries
• But these are very domain focussed…

• As an example, Open PHACTS spend a lot of effort mapping biology to chemistry to disease over many data sources

More building blocks

Page 39: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 40: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Provide a controlled vocabulary – what your data describes, where it came from

• Provide a shared vocabulary for integrating with other people’s data

What can ontologies do for me?

Page 41: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Questions to ask:

(1) Has someone already produced an ontology covering your area? (Places to look: BioPortal, OBO Foundry.)

(2) Do they take requests?

(3) Are they responsive?

(4) Is the ontology kept up to date?

It is early days for ontologies, and any ontology will almost certainly be a long way from complete!

Best practices: experiences from biomedical ontologies

Page 42: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Best that these don’t change
• Best that everyone calls them the same things
• Best that they are unambiguous

• Meanwhile, back in the real world

What things are you looking for?

Page 43: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Place names – somewhat ambiguous
• Species names – can change with time
• Diseases – every pharmaceutical company has a different list

• People – can be very ambiguous: Authors and researchers are hard to map…except for Google it seems!

How easy?

Page 46: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Thankfully people follow…

Page 48: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

ORCID take up???

Page 49: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• All publications easily connected but also
– Important in early scientific career – consider every data point contribution, every “research object”
– Every article
– Every presentation
– Thesis and dissertation
– Provenance….and feeding AltMetrics

So the benefits of ORCIDs?

Page 50: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

The Alt-Metrics Manifesto

Page 51: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 52: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

AltMetrics via Plum Analytics

Page 53: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Usage, Citations, Social Media

Page 54: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Detailed Usage Statistics

Page 55: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Indexed and Searchable

Page 56: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

ORCIDS for reputation…

Page 57: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Tinman - mutant fly embryos lack a heart.

Van Gogh - hair-like bristles on wings have a swirling pattern.

INDY - acronym for I'm Not Dead Yet, they live twice as long as normal; from the scene in the movie "Monty Python and the Holy Grail"

Ken and Barbie - males and females lack external genitalia.

Tribbles - some cells divide uncontrollably

Cheap date - flies are extra-sensitive to alcohol.

Cleopatra - flies die when Cleopatra gene interacts with another gene, Asp.

Kojak - no hairs on wings.

Maggie - fly development is arrested; named after Maggie Simpson, whose development also seems to be arrested.

Oh my..Fruitfly gene names

http://stlists.blogspot.co.uk/2005/05/fruitfly-gene-names.html

Page 58: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 59: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• those that belong to the Emperor,
• embalmed ones,
• those that are trained,
• suckling pigs,
• mermaids,
• fabulous ones,
• stray dogs,
• those included in the present classification,
• those that tremble as if they were mad,
• innumerable ones,
• those drawn with a very fine camelhair brush,
• others,
• those that have just broken a flower vase,
• those that from a long way off look like flies.

Allegedly from “Celestial Emporium of Benevolent Knowledge”
The Analytical Language of John Wilkins, Jorge Luis Borges

Animal classification

Page 60: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Are you just identifying entities?
• Are you looking for sentiment?
• In chemistry, names will lead you to a recipe for synthesis, and analytical data about that compound

Classification after “things”

Page 61: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Used to aid discovery - directly
• Used to aid discovery - indirectly
• Extract data in electronic form for reuse
• Needs to be use case driven – why, then what/how comes later

End result

Page 62: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Automation can give good results
• Especially when looked at in bulk
• Less easy to judge at the article level

• People accept discovery is fuzzy
• Not so with data points
• (but maybe can screen out)

Quality

Page 63: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Chemical names are both difficult and rewarding.

• Difficult in the sense that they can break standard software.

• Rewarding in the sense that you can extract useful information about the molecule they’re referring to without a dictionary.

• Some examples…

Chemistry-specific challenges and opportunities

Page 64: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 65: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 66: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 67: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 68: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 69: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• …and it gets worse

Page 70: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

A series of mono and di-N-2,3-epoxypropyl N-phenylhydrazones have been prepared on a large scale by reaction of the corresponding N-phenylhydrazones of 9-ethyl-3-carbazolecarbaldehyde, 9-ethyl-3,6-carbazoledicarbaldehyde, 4-dimethyl-amino-, 4-diethylamino-, 4-benzylethylamino-, 4-(diphenylamino)-, 4-(4,4-4′-dimethyl-diphenylamino)-, 4-(4-formyldiphenylamino)- and 4-(4-formyl-4′-methyldiphenyl-amino)benzaldehyde with epichlorohydrin in the presence of KOH and anhydrous Na(2)SO(4).

From Molecules, via the BioNLP list

Annotate this...

Page 71: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

How many explicit compounds?

• How many numbered compounds actually are named in a given paper?

• iloprost (1)
• tributyl-1-hexynylstannane (2)
• the desired 2-heptyne (3)
• methyl–Pd(II) iodide 4 or 4′
• alkynylstannane 5
• the hypervalent stannate 6
• (alkynyl)(methyl)Pd(II) complex 7
• the desired methylalkyne 8
• compounds 9–14
• the stannyl precursors 15 and 16
• methylated compounds 17 and 18
• stannyl precursor 19
• iloprost methyl ester 20

• “iloprost methyl ester” is the real name, but you need to know that iloprost is a monocarboxylic acid!

Page 72: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
Page 73: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Names from structures

• Systematic names can be generated FROM chemical structures algorithmically

Page 74: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

General-purpose parsers do NOT get chemical names

Visualization by bpodgursky.com using d3.js; parsing by Stanford’s CoreNLP.

Page 75: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

But names can reverse back to structures…

Page 76: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• OPSIN (chemical name to structure) http://opsin.ch.cam.ac.uk/

Tools to try

Page 77: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Not all names are systematic..
Antony Williams vs Identifiers

Passport ID

Dad, Tony, others

SSN

Green Card

License
5 email addresses
ChemSpiderman (blog, Twitter account, Facebook, Friendfeed)
OpenID….

Page 78: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Many Names, One Structure

Page 79: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Aspirin on ChemSpider

Page 80: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Unique Structure Identifiers
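
The "many names, one structure" idea can be sketched as a synonym dictionary that maps every name onto one structure identifier. This is only an illustration: it assumes the identifier is a Standard InChIKey (the slide does not say which identifier it shows), and the tiny dictionary and function names are hypothetical, not ChemSpider's implementation:

```python
# Standard InChIKey for aspirin; every synonym below points at the same key.
ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"

# Hypothetical, hand-made synonym dictionary (lower-cased names as keys).
NAME_TO_INCHIKEY = {
    "aspirin": ASPIRIN_KEY,
    "acetylsalicylic acid": ASPIRIN_KEY,
    "2-acetoxybenzoic acid": ASPIRIN_KEY,
    "2-(acetyloxy)benzoic acid": ASPIRIN_KEY,
}


def resolve(name):
    """Return the identifier for a name, or None if the name is unknown."""
    return NAME_TO_INCHIKEY.get(name.strip().lower())


def same_structure(name_a, name_b):
    """True only when both names resolve to the same known identifier."""
    key_a, key_b = resolve(name_a), resolve(name_b)
    return key_a is not None and key_a == key_b
```

Names missing from the dictionary return None rather than a guess, which leaves room for a name to structure converter to be tried next.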

Page 81: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Structure Searching the Web

Page 82: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Certainly happens with Welsh!

Page 83: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• All of the tasks below are possible to varying extents. Pioneered on journal abstracts and journal full text.
– Named entity recognition: what is this about? Where are the places mentioned? Who are the people?
– Clustering and classification: which other dissertations are like this one? What genres of dissertations are there?
– Event extraction: what processes (chemical reactions, gene expression) occur? What are the participants?
– Citation analysis: who do dissertations cite?
– What sentiments towards the citations do authors express?

Dissertation analysis
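
As a rough illustration of the first task, named entity recognition can be started with nothing more than a dictionary lookup. The sketch below assumes a small hand-made dictionary; the entries, the labels and the sample sentence are hypothetical, and real systems add far more than this:

```python
import re


def build_tagger(dictionary):
    """Build a longest-match, case-insensitive dictionary tagger."""
    # Longest names first, so "thionyl chloride" wins over plain "chloride"
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )

    def tag(text):
        # Each hit is (character offset, matched text, entity label)
        return [
            (m.start(), m.group(0), dictionary[m.group(0).lower()])
            for m in pattern.finditer(text)
        ]

    return tag


# Hypothetical dictionary: keys must be lower case
DICTIONARY = {
    "thionyl chloride": "CHEMICAL",
    "chloride": "CHEMICAL",
    "benzene": "CHEMICAL",
    "denbigh": "PLACE",
}
tag = build_tagger(DICTIONARY)
hits = tag("Thionyl chloride and benzene were charged into a vessel in Denbigh.")
```

A lookup like this only finds what is already in the dictionary, which is why the novel names are the ones that need the extra effort.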

Page 84: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Dissertation copyright varies
• Institution
• Author
• Published or not?

Copyright issues

Page 85: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Probably less structured than papers
• Not much work has been done here before

Dissertation specifics

Page 86: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• For example:
• Stylometrics (to find out who wrote this)
• Language identification
• What else?....

• In addition to above, there are different tasks we can perform on scientific publications and dissertations

Digital Humanities textual analysis tasks
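
Language identification is one of the simpler tasks to try by hand first. A minimal sketch, assuming that counting a few very common function words is enough to tell languages apart; the word lists are short illustrative samples, and practical tools generally use character n-gram models instead:

```python
import re

# Hypothetical, deliberately tiny function-word lists
STOPWORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "was", "with"},
    "german": {"der", "die", "und", "das", "ist", "mit", "von", "wurde"},
    "french": {"le", "la", "les", "et", "des", "est", "dans", "avec"},
}


def guess_language(text):
    """Return the language whose function words occur most often, or None."""
    # Runs of letters only (no digits or underscores), lower-cased
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        language: sum(token in words for token in tokens)
        for language, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

The same counting approach, applied to one author's function-word habits rather than to a language, is a common starting point for stylometrics.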

Page 87: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• We would LOVE to bring data out of our archive

• What could we do?
• Find chemical names and generate structures
• Find chemical images and generate structures
• Find reactions – and make a database!
• Find data (MP, BP, LogP) and host. Build models!
• Find figures and database them
• Find spectra (and link to structures)
• Validate the data algorithmically

“Data enable” publications?
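
To make "find data (MP, BP, LogP)" concrete, here is a minimal sketch of pulling melting points out of running text with a regular expression. The pattern and the sample sentence are assumptions made for illustration; real experimental sections vary far more than this pattern allows:

```python
import re

# Matches "mp 135-136 °C", "m.p. 72.5 °C", "melting point: 101 °C", etc.
MELTING_POINT = re.compile(
    r"\b(?:m\.?p\.?|melting point)\s*[:=]?\s*"
    r"(\d+(?:\.\d+)?)"                    # single value or low end of a range
    r"(?:\s*[-–]\s*(\d+(?:\.\d+)?))?"     # optional high end of a range
    r"\s*°\s*C",
    re.IGNORECASE,
)


def extract_melting_points(text):
    """Return a list of (low, high) melting points in degrees Celsius."""
    results = []
    for low, high in MELTING_POINT.findall(text):
        low_value = float(low)
        # A single value is stored as a zero-width range
        high_value = float(high) if high else low_value
        results.append((low_value, high_value))
    return results


# Hypothetical sentence in the style of an experimental section
sample = "White solid, mp 135-136 °C; the ester had m.p. 72.5 °C."
melting_points = extract_melting_points(sample)
```

Values extracted this way still need validating, for example by rejecting ranges where the low end exceeds the high end.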

Page 88: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

RSC Archive – since 1841

Page 89: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Text Mining

The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .

The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .

After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue

Page 90: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry


Page 91: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)

Text spectra?
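
A text spectrum like the one above is regular enough to be turned back into numbers. A minimal sketch, assuming that every decimal number after "δ =" is a chemical shift; the 0 to 250 ppm sanity range is also an assumption chosen for illustration:

```python
import re

NMR_13C = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), "
    "99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)


def parse_13c(report):
    """Split a text spectrum into its header and a list of shifts in ppm."""
    header, _, body = report.partition("δ =")
    # Digits in labels such as CH3 or CH2 carry no decimal point, so they are skipped
    shifts = [float(value) for value in re.findall(r"\d+\.\d+", body)]
    return header.strip(), shifts


header, shifts = parse_13c(NMR_13C)
# Flag anything outside the assumed plausible range for a 13C shift
suspicious = [shift for shift in shifts if not 0.0 <= shift <= 250.0]
```

The number of shifts found can then be compared with the number of carbon environments expected from the structure.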

Page 92: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
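
The 1H listing carries more structure: each peak has a shift, a multiplicity and a proton count. A minimal sketch, assuming peaks are always written as "shift (multiplicity, nH, ...)"; the pattern is an illustration and would need extending for other reporting styles:

```python
import re

NMR_1H = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)

# Shift (or shift range), then "(multiplicity, nH"
PEAK = re.compile(r"(\d+\.\d+(?:[–-]\d+\.\d+)?)\s*\(([a-z]+),\s*(\d+)H")


def parse_1h(report):
    """Return a list of (shift text, multiplicity, proton count) tuples."""
    _, _, body = report.partition("δ =")
    return [(shift, mult, int(count)) for shift, mult, count in PEAK.findall(body)]


peaks = parse_1h(NMR_1H)
total_protons = sum(count for _, _, count in peaks)
```

Comparing the total proton count with the hydrogen count in the molecular formula is the kind of check an experimental data checker can run automatically.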

Page 93: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Turn “Figures” Into Data

Page 94: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Make it interactive

Page 95: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

SO MANY reactions!

Page 96: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Reactions From Patents

Page 97: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Experimental data checker

Page 98: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• http://chemicaltagger.ch.cam.ac.uk/

Tools to try: ChemicalTagger

Page 99: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Tools to try: ChemicalTagger

Page 100: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• ChemicalTagger

Tools to try

Page 101: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

How is DERA going?

• We have text-mined all 21st century articles… >100k articles from 2000-2013

• Marked up with XML and published onto the HTML forms of the articles

• Required multiple iterations based on dictionaries, markup, text mining iterations

• New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!

Page 102: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Work in Progress

Page 103: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Work in Progress

Page 104: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Work in Progress

Page 105: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Work in Progress

Page 106: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

[Diagram labels: Text-mining; Dictionary (ontologies); RSC ontologies (methods, reactions); Dictionary (chemistry); Curated dictionaries for known names; Unknown names: automated name to structure conversion (ACD N2S, OPSIN); Marked-up XML; Production processes; XML ready for publication; Chemical structures SD file; CDX integration (coming soon)]

Is It Easy?

Page 107: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Our Supporting Ontologies

Page 108: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• The ‘National Compound Collection’

• Extracting compounds manually from theses
• 700 theses, 44,000 compounds (growing…)
• 4 months, 12 UK institutions
• Deposited into ChemSpider

A pilot examining theses

Page 109: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Screening for interesting drug candidates
• Mapping the chain from author to institution to data to industry
• British Library involved (EThOS collection)
• Build a business model for this

Pilot objectives

Page 110: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Funders encouraging submission from new dissertations

• Mining of old collections (mostly automated, likely to need manual QA)

• Extension to other areas of chemical science

…and future (ideal)

Page 111: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• Don’t reinvent the wheel
• Research your domain to find work already underway and test tools for value/utility

In your domain???

Most Domains are Active

Page 112: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

A good place to start

Page 113: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• NaCTeM tools for e.g. sentiment analysis http://www.nactem.ac.uk/opminpackage/opinion_analysis

Tools to try

Page 114: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

• NaCTeM tools for e.g. sentiment analysis

Tools to try

Page 115: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

There is always something new

Page 116: Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

Thank you