Data Mining Dissertations and Adventures and Experiences in the World of Chemistry


This presentation was given at the CLIR/DLF Postdoctoral Fellowship Summer Seminar at Bryn Mawr College in Pennsylvania on July 29th, 2014. The intention was to communicate what we are doing in text and data mining in chemistry, specifically around mining the RSC publication archive and chemistry dissertations and theses. How would these experiences map over to the humanities?

Transcript of Data Mining Dissertations and Adventures and Experiences in the World of Chemistry

  • 1. Data mining dissertations: Adventures and Experiences in the World of Chemistry. Antony Williams. CLIR/DLF Postdoctoral Fellowship Summer Seminar, July 2014

2. What a small world
3. Who's got an ORCID? Who has heard of, or been involved with, AltMetrics? Who has edited a Wikipedia page? Who has direct experience of text mining? All slides already on Slideshare here: Before we start.
4. Context: why do we want to mine data? Our experiences in extracting theses: Text and data mining. Chemistry as an example. Before you start. Resources and tools. Contents.
5. Let's map together all historical chemistry data and build systems to integrate. Heck, let's integrate chemistry and biology data and add in disease data too. Let's model the data and see if we can extract new relationships, quantitative and qualitative. Let's make it all available on the web. Taking on a big challenge.
6. We're going to map the world. We're going to take photos of as many places as we can and link them together. We'll let people annotate and curate the map. Then let's make it available free on the web. We'll make it available for decision making. Put it on Mobile Devices, Give it Away. What about this.
7. I'm from here... on Google
8. Wikipedia
9. Wikipedia
10. The Power of Contribution
11. How do you spell Afonwen?
12. And there's Denbigh
13. So the world can be mapped. We can enter a 3D world within the map. We can add annotations. We can use the data, reference it, we can extract it, we can make decisions with it. And we can do it on our lap, in our hands. Let's do this for chemistry. Whoa.
14. Once upon a time we built a database. In a basement not far away
15. ChemSpider
16. ChemSpider and Data Validation
17. Dictionary Linking
18. Dictionary Linking
19. This is not new, you know the story. So much data of value is contained within a publication and delivered in PDF form. PDF files, and especially unclear licensing, don't let me get at the data so I can rework, reuse, repurpose, text mine etc. "I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with capabilities I need, and the publishers should just do it." Data in a Scientific Publication.
20.
It is so difficult to navigate. What's the structure? Are they in our file? What's similar? What's the target? Pharmacology data? Known pathways? Working on now? Connections to disease? Expressed in right cell type? Competitors? IP?
21. Manage all of the chemistry data associated with chemical substances. Data to be downloadable, reusable, interactive. Build a platform that enables the scientist. Data storage, validation, standardization and curation. Collaborative data sharing. Provide a data platform that can enable and enhance publishing of scientific papers. We set a vision.
22. Every compound from every article at RSC is extracted, in a database, and linked. Chemical properties are extracted, databased and used for predictive models. Data tables are downloadable, interactive and not just dumb PDFs... and what can we extract from chemistry theses too? XXX Years from Now at RSC.
23. We are seen as one of the repositories for published AND unpublished research data. An intuitive platform for research data management in the cloud. Individual, collaborative and public data management of diverse data in the cloud... and where all data referenced in a thesis is available at a button click. XXX Years from Now at RSC.
24. But how does it map onto your domain? So this is chemistry.
25. Mining as an allegory
26. You have a mountain of stuff which contains valuable nuggets. You (more or less) know what you're looking for. You know what you're going to do with it once you have it. Mining as an allegory - intent.
27. You get lots of stuff out. It requires sifting and grading. It's a triumph if you manage to extract 80-90% of what is there. You will go back to the heap and redo it. Mining as an allegory - result.
28. That which is easy to get out is well known and unlikely to be novel. The novel and interesting stuff is likely to be rare and not easily defined. Mining as an allegory - effort.
29.
Do the initial investigations by hand. Send in the machines later. Still needs some humans tweaking. Mining as an allegory - automation.
30. Context
31. From the Utopia Documents team. Good at extracting structure from typeset PDFs. PDFX.
32. OCR recognition. Underlining doesn't help OCR. In this case it was the only signpost to the department, supervisor and funding details.
33. Hardcopy. Scanned and OCR'd PDF. PDF derived from Word. Word or LaTeX. ...and for OCR not all are born equal... and of course history and language are a major influence: "Oil of vitriol". Building blocks to mine.
34. Ontologies, taxonomies, dictionaries. But these are very domain focussed. As an example, Open PHACTS spends a lot of effort mapping biology to chemistry to disease over many data sources. More building blocks.
35. Provide a controlled vocabulary: what your data describes, where it came from. Provide a shared vocabulary for integrating with other people's data. What can ontologies do for me?
36. Questions to ask: (1) Has someone already produced an ontology covering your area? (Places to look: BioPortal, OBO Foundry.) (2) Do they take requests? (3) Are they responsive? (4) Is the ontology kept up to date? Early days for ontologies, and any ontology will almost certainly be a long way from complete! Best practices: experiences from biomedical ontologies.
37. Best that these don't change. Best that everyone calls them the same things. Best that they are unambiguous. Meanwhile, back in the real world... What things are you looking for?
38. Place names: somewhat ambiguous. Species names can change with time. Diseases: every pharmaceutical company has a different list. People can be very ambiguous: authors and researchers are hard to map... except for Google, it seems! How easy?
39.
40.
41. Thankfully people follow
42. Google Scholar Citations
43. ORCID take up???
44.
All publications easily connected, but also... Important in early scientific career: consider every data point contribution, every research object. Every article. Every presentation. Thesis and dissertation. Provenance... and feeding AltMetrics. So the benefits of ORCIDs?
45. The Alt-Metrics Manifesto
46. AltMetrics via Plum Analytics
47. Usage, Citations, Social Media
48. Detailed Usage Statistics
49. Indexed and Searchable
50. ORCIDs for reputation
51. Tinman - mutant fly embryos lack a heart. Van Gogh - hair-like bristles on wings have a swirling pattern. INDY - acronym for I'm Not Dead Yet, they live twice as long as normal; from the scene in the movie "Monty Python and the Holy Grail". Ken and Barbie - males and females lack external genitalia. Tribbles - some cells divide uncontrollably. Cheap date - flies are extra-sensitive to alcohol. Cleopatra - flies die when Cleopatra gene interacts with another gene, Asp. Kojak - no hairs on wings. Maggie - fly development is arrested; named after Maggie Simpson, whose development also seems to be arrested. Oh my... Fruit fly gene names.
52. Those that belong to the Emperor, embalmed ones, those that are trained, suckling pigs, mermaids, fabulous ones, stray dogs, those included in the present classification, those that tremble as if they were mad, innumerable ones, those drawn with a very fine camelhair brush, others, those that have just broken a flower vase, those that from a long way off look like flies. Allegedly from the Celestial Emporium of Benevolent Knowledge. "The Analytical Language of John Wilkins", Jorge Luis Borges. Animal classification.
53. Are you just identifying entities? Are you looking for sentiment? In chemistry, names will lead you to a recipe for synthesis, and analytical data about that compound. Classification after things.
54. Used to aid discovery - directly. Used to aid discovery - indirectly. Extract data in electronic form for reuse. Needs to be use case driven: why, then what/how comes later. End result.
55.
Automation can give good results. Especially looked at in bulk. Less easy to judge at the article level. People accept discovery is fuzzy. Not so with data points (but maybe can screen out). Quality.
56. Chemical names are both difficult and rewarding. Difficult in the sense that they can break standard software. Rewarding in the sense that you can extract useful information about the molecule they're referring to without a dictionary. Some examples... Chemistry-specific challenges and opportunities.
57. ...and it gets worse
58. A series of mono and di-N-2,3-epoxypropyl N-phenylhydrazones have been prepared on a large scale by reaction of the corresponding N-phenylhydrazones of 9-ethyl-3-carbazolecarbaldehyde, 9-ethyl-3,6-carbazoledicarbaldehyde, 4-dimethyl-amino-, 4-diethylamino-, 4-benzylethylamino-, 4-(diphenylamino)-, 4-(4,4-4-dimethyl-diphenylamino)-, 4-(4-formyldiphenylamino)- and 4-(4-formyl-4-methyldiphenyl-amino)benzaldehyde with epichlorohydrin in the presence of KOH and anhydrous Na(2)SO(4). From Molecules, via the BioNLP list. Annotate this...
59. How many explicit compounds? How many numbered compounds actually are named in a given paper? iloprost (1); tributyl-1-hexynylstannane (2); the desired 2-heptyne (3); methylPd(II) iodide 4 or 4; alkynylstannane 5; the hypervalent stannate 6; (alkynyl)(methyl)Pd(II) complex 7; the desired methylalkyne 8; compounds 9–14; the stannyl precursors 15 and 16; methylated compounds 17 and 18; stannyl precursor 19; iloprost methyl ester 20. "iloprost methyl ester" is the real name, but you need to know that iloprost is a monocarboxylic acid!
60. Names from structures. Systematic names can be generated FROM chemical structures algorithmical