ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature
![Page 1: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/1.jpg)
Extracting patterns of database and software usage from the bioinformatics literature
Geraint Duck, Goran Nenadic, Andy Brass, David L. Robertson and Robert Stevens
The University of Manchester, UK
http://www.cs.man.ac.uk/~duckg/
http://bionerds.sourceforge.net/networks/
![Page 2: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/2.jpg)
Introduction
• Methods are fundamental to science
  – Judgement
  – Replication
  – Extension
• Methods in bioinformatics:
  – In silico: Data and tools
  – Workflows
• Objective representation
• Sharing and reuse
![Page 3: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/3.jpg)
Bioinformatics
• Resource focused domain: “Resourceome”
  – Our research suggests:
    • Around 200,000 unique resources in the literature
    • Over 4 million mentions
    • … and still growing!
• Resource/method search and selection…
  – Best-practice
  – Common-practice
• What are the main patterns in bioinformatics resources, and associated methods?
![Page 4: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/4.jpg)
Approach
• Use bioinformatics literature (to answer this question)
• Extract database and software mentions
• Combine resources to form pairs
• Combine pairs to form patterns
  – Common-practice
  – Method?
[Figure: example resources combined into pairs and patterns: PHYLIP, ClustalW, Modeller, BLAST, PROCHECK]
![Page 5: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/5.jpg)
Document Collection
• PubMed Central open-access full-text articles
• Bioinformatics[MeSH]
• 22,376 articles
• 67 journals
• 3 journals were > 50% of total documents
[Figure: chart with years 1998 to 2014 on the x-axis and counts from 0 to 4,500 on the y-axis; axis labels not recoverable]
![Page 6: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/6.jpg)
bioNerDS
• bioNerDS
  – Bioinformatics named entity recogniser for databases and software
  – Full-text; Mention level
  – Rule-based
  – F-score 63-91%
  – Previously compared resource usage in:
    • Genome Biology
    • BMC Bioinformatics
• Networks filter:
  – 702,937 total mentions
  – 167,697 document level mentions
  – 31,053 unique names
  – 93% single mention
• Duck et al. (2013) BMC Bioinformatics
http://bionerds.sourceforge.net/
![Page 7: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/7.jpg)
bioNerDS
Genome Biology
• “Biological” focus
  – GenBank
  – Ensembl
  – GEO
  – GO
BMC Bioinformatics
• “Resource” focus
  – R
  – PDB
  – PubMed
![Page 8: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/8.jpg)
Mention Filtering
• Filter out resources mentioned in fewer than 2 documents
  – Removed 25% of mentions
  – Removes less likely names
• Generic resources
  – R
  – Bioconductor
• Categorise to database/software
  – Removed some ‘unknown’ resources
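The minimum-document filter can be illustrated with a minimal Python sketch. The input layout (a list of `(doc_id, resource_name)` tuples), the function name `filter_rare_resources` and the sample data are assumptions made for illustration only; they are not taken from bioNerDS.

```python
from collections import defaultdict


def filter_rare_resources(mentions, min_documents=2):
    """Keep mentions of resources that occur in at least `min_documents` documents.

    `mentions` is a list of (doc_id, resource_name) tuples (hypothetical layout).
    """
    docs_per_resource = defaultdict(set)
    for doc_id, name in mentions:
        docs_per_resource[name].add(doc_id)
    return [
        (doc_id, name)
        for doc_id, name in mentions
        if len(docs_per_resource[name]) >= min_documents
    ]


# Illustrative data only: "MyTool" occurs in a single document and is dropped.
mentions = [
    ("doc1", "BLAST"),
    ("doc1", "MyTool"),
    ("doc2", "BLAST"),
    ("doc2", "ClustalW"),
    ("doc3", "ClustalW"),
]
kept = filter_rare_resources(mentions)
```

Counting distinct documents rather than raw mentions means a name repeated many times inside one article is still treated as a single-document resource.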
![Page 9: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/9.jpg)
Methods Sections
• Removed resources not in the methods section
  – Method or non-method
• Regular expression based title detection
  – Tested on 100 articles
  – Precision: 97%; Recall: 79%
• Resulting in:
  – 69,466 database mentions (1,711 unique)
  – 65,451 software mentions (3,289 unique)
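Regular-expression title detection might look roughly like the Python sketch below. The slides do not give the expressions actually used, so the pattern `METHODS_TITLE` and the helper `is_methods_title` are hypothetical and only show the general idea of classifying a section title as method or non-method.

```python
import re

# Hypothetical pattern (an assumption, not the one used in the study):
# an optional section number, an optional "Materials and" prefix, then a
# methods-like heading word at the start of the title.
METHODS_TITLE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"
    r"(?:materials?\s+(?:and|&)\s+)?"
    r"(?:methods?|methodology|experimental\s+procedures)\b",
    re.IGNORECASE,
)


def is_methods_title(title):
    """Return True if a section title looks like a methods heading."""
    return METHODS_TITLE.search(title) is not None


titles = ["Background", "Materials and Methods", "2. Methods", "Results and Discussion"]
methods_titles = [t for t in titles if is_methods_title(t)]
```

A strict pattern of this kind would be expected to favour precision over recall, since unusual methods headings are simply not matched.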
![Page 10: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/10.jpg)
Extracting Pairs
• Co-occurrence within text
• Two sets of pairs:
  – Software only pairs
  – Database and software pairs (any combination of)
• This provided us with:
  – 22,880 software pairs (13,965 unique)
  – 54,562 database/software pairs (29,066 unique)
• Removed pairs only within a single document
  – 53% of the software pairs
  – 46% of the database/software pairs
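A minimal Python sketch of pair extraction follows. The slides only say that pairs come from co-occurrence within text, so two choices here are assumptions: a pair is formed from two consecutive, distinct mentions in order of appearance, and the single-document filter ignores direction so that both orders of a pairing survive together. All function names and the sample data are hypothetical.

```python
from collections import defaultdict


def extract_pairs(doc_mentions):
    """Map each ordered pair (first, second) to the set of documents containing it.

    `doc_mentions` maps a document id to resource names in order of appearance.
    Assumption: a pair is two consecutive, distinct mentions.
    """
    docs_per_pair = defaultdict(set)
    for doc_id, names in doc_mentions.items():
        for first, second in zip(names, names[1:]):
            if first != second:
                docs_per_pair[(first, second)].add(doc_id)
    return docs_per_pair


def drop_single_document_pairs(docs_per_pair):
    """Remove pairings that occur in only one document (direction ignored)."""
    docs_unordered = defaultdict(set)
    for pair, docs in docs_per_pair.items():
        docs_unordered[frozenset(pair)] |= docs
    return {
        pair: docs
        for pair, docs in docs_per_pair.items()
        if len(docs_unordered[frozenset(pair)]) > 1
    }


# Illustrative data only.
doc_mentions = {
    "d1": ["BLAST", "ClustalW", "PHYLIP"],
    "d2": ["BLAST", "ClustalW"],
    "d3": ["ClustalW", "BLAST"],
    "d4": ["Modeller", "PROCHECK"],
}
pairs = drop_single_document_pairs(extract_pairs(doc_mentions))
```

Keeping the document sets per ordered pair makes it easy to count how often each order occurs in a later step.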
![Page 11: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/11.jpg)
Common Pairs
• With sufficient data, the most common order of a pairing is the correct one…
• Binomial test
  – Null hypothesis: each order is equally likely
• Two confidence thresholds:
  – 95%
    • 2,518 software pairs (145 unique)
    • 7,001 database/software pairs (297 unique)
  – 99%
    • 1,450 software pairs (55 unique)
    • 3,383 database/software pairs (95 unique)
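The order test can be sketched in Python as an exact binomial test with success probability 0.5. Whether the original analysis was one-sided or two-sided is not stated on the slide; the two-sided form below, and the names `order_p_value` and `dominant_order`, are assumptions for illustration.

```python
from math import comb


def order_p_value(forward, backward):
    """Exact two-sided binomial p-value under the null that both orders are equally likely."""
    n = forward + backward
    k = max(forward, backward)
    # Upper tail P(X >= k) for X ~ Binomial(n, 0.5); doubled because the
    # distribution is symmetric, and capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def dominant_order(a, b, a_then_b, b_then_a, confidence=0.95):
    """Return the significantly more common order of a pairing, or None."""
    if order_p_value(a_then_b, b_then_a) >= 1 - confidence:
        return None
    return (a, b) if a_then_b > b_then_a else (b, a)


# Illustrative counts only: 6 documents with one order, none with the reverse.
p = order_p_value(6, 0)
at_95 = dominant_order("BLAST", "ClustalW", 6, 0, confidence=0.95)
at_99 = dominant_order("BLAST", "ClustalW", 6, 0, confidence=0.99)
```

With these illustrative counts the pair passes the 95% threshold but not the 99% one, which mirrors how the stricter threshold leaves fewer unique pairs.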
![Page 12: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/12.jpg)
Most Common Pairs

Software only pairs

| Directed Pair | Count | % |
|---|---|---|
| BLAST → ClustalW | 205 | 14.1 |
| BLAST → PSI-BLAST | 103 | 7.1 |
| Phred → Phrap | 89 | 6.1 |
| ClustalW → MEGA | 77 | 5.3 |
| Cluster → Tree View | 75 | 5.2 |
| Phrap → Consed | 51 | 3.5 |
| Modeller → PROCHECK | 44 | 3.0 |
| BLAST → ClustalX | 43 | 3.0 |
| ClustalW → PHYLIP | 41 | 2.8 |
| BLAST → MUSCLE | 40 | 2.8 |

Software and database pairs

| Directed Pair | Count | % |
|---|---|---|
| GO → KEGG | 350 | 10.3 |
| BLAST → GO | 195 | 5.8 |
| BLAST → ClustalW | 150 | 4.4 |
| GEO → GO | 129 | 3.8 |
| Phred → Phrap | 89 | 2.6 |
| BLAST → PSI-BLAST | 87 | 2.6 |
| PDB → Modeller | 85 | 2.5 |
| Swiss-Prot → TrEMBL | 82 | 2.4 |
| Ensembl → BioMart | 82 | 2.4 |
| ClustalW → MEGA | 77 | 2.3 |
![Page 13: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/13.jpg)
![Page 14: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/14.jpg)
![Page 15: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/15.jpg)
![Page 16: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/16.jpg)
![Page 17: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/17.jpg)
![Page 18: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/18.jpg)
![Page 19: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/19.jpg)
Resource Patterns
Databases
• Data sources
• GO is an exception
  – Major sink
  – Data annotation
• Numerous ‘same’ links
  – Enumeration in text?
Software
• Data sinks
• Represents the primary in silico pipeline(s)
• Again, sequence alignment is central
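One way to make "source" and "sink" concrete is to compare weighted out-degree and in-degree in the directed pair network. The Python sketch below is an assumption about how this could be measured, not the authors' method; `degree_balance` is a hypothetical helper. The four input rows reuse counts from the pair tables purely as sample input, so the output says nothing about the full network.

```python
from collections import Counter


def degree_balance(pair_counts):
    """Return weighted out-degree minus in-degree for every resource.

    Positive values suggest a source (mostly feeds other resources);
    negative values suggest a sink (mostly receives from other resources).
    """
    out_weight, in_weight = Counter(), Counter()
    for (src, dst), count in pair_counts.items():
        out_weight[src] += count
        in_weight[dst] += count
    nodes = set(out_weight) | set(in_weight)
    return {node: out_weight[node] - in_weight[node] for node in nodes}


# Sample input only: four directed pairs with their counts.
pair_counts = {
    ("GEO", "GO"): 129,
    ("BLAST", "GO"): 195,
    ("PDB", "Modeller"): 85,
    ("Modeller", "PROCHECK"): 44,
}
balance = degree_balance(pair_counts)
sinks = sorted(node for node, value in balance.items() if value < 0)
```

In this small sample GO receives links from both GEO and BLAST, so it has the most negative balance.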
![Page 20: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/20.jpg)
Patterns through Time
2004 to 2006
2007 to 2009
![Page 21: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/21.jpg)
Patterns through Time
2010 to 2012
![Page 22: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/22.jpg)
Phylogenetics Patterns
• Case-study…
• Eales et al. (2008) BMC Bioinformatics, 9, 359
  – Mapped phylogenetics methods into 4 steps:
    • Sequence Alignment
    • Tree Inference
    • Statistical Testing
    • Tree Visualisation
  – Using the same corpus selection, we built a network…
    • PubMed search for “phylogen*” in titles or abstracts
![Page 23: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/23.jpg)
Phylogenetics Patterns
![Page 24: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/24.jpg)
Phylogenetics Patterns
• Our automated extraction can recreate these steps
  – Given some ambiguous resources
• Encouraging…
  – Viable in silico pattern extraction
  – “Common practice”
• Next step: Apply this to other (sub-)domains
![Page 25: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/25.jpg)
Conclusion
• Can extract patterns of resource usage
  – Can we describe the method through these?
• High level overview of common-practice
  – With lower thresholds, can access resources specific (but “common”) to different subdomains
  – Not best-practice…
• Workflows?
  – Requires increased granularity
  – Could help inform their creation
![Page 26: ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature](https://reader033.fdocuments.net/reader033/viewer/2022042702/55d02259bb61eb836f8b4606/html5/thumbnails/26.jpg)
Thank-you
• Acknowledgements
  – Co-authors
  – Manchester IT Services
    • Computational facilities
  – Funding:
  – Travel: