Data mining and data linking

Transcript of Data mining and data linking

Page 1: Data mining and data linking

Data mining and data linking

Page 2: Data mining and data linking
Page 3: Data mining and data linking

Getting data from papers (beyond the PDF)

http://dx.doi.org/10.1016/j.ympev.2009.07.011

Page 4: Data mining and data linking

Extracting tables

Page 5: Data mining and data linking

Tables from paper as comma separated values (CSV)

Taxon and institutional voucher,Locality ID,Collection locality,Geographic coordinates/approximate location,Elevation (m),GenBank accession number 12S,16S,COI,c-myc
1. UTA A-52449,1,"Puntarenas, CR","(10°18′N, 84°48′W)",1520,EF562312,EF562365,None,EF562417
2. MVZ 149813,2,"Puntarenas, CR","(10°18′N, 84°42′W)",1500,EF562319,EF562373,EF562386,EF562430
3. FMNH 257669,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562320,EF562372,EF562380,EF562432
4. FMNH 257670,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562317,EF562336,EF562376,EF562421
5. FMNH 257671,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562314,EF562374,EF562409,None
6. FMNH 257672,1,"Puntarenas, CR","(10°18′N, 84°47′W)",1500,EF562318,None,EF562382,None
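Once the table is in CSV form it can be read by a program. The following is a minimal Python sketch using the standard csv module on two rows copied from the table. The short column names are assumptions made for this example, not the headings used in the paper.

```python
import csv
import io

# Two rows copied from the slide. The column names are shortened for this
# sketch and are not the paper's own headings.
raw = '''voucher,locality_id,locality,coordinates,elevation_m,12S,16S,COI,c-myc
1. UTA A-52449,1,"Puntarenas, CR","(10°18′N, 84°48′W)",1520,EF562312,EF562365,None,EF562417
2. MVZ 149813,2,"Puntarenas, CR","(10°18′N, 84°42′W)",1500,EF562319,EF562373,EF562386,EF562430
'''

# The csv module keeps quoted fields such as "Puntarenas, CR" in one piece.
# The literal text "None" marks a missing sequence, so turn it into a real None.
rows = [
    {key: (None if value == "None" else value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]

for row in rows:
    print(row["voucher"], "|", row["locality"], "|", row["COI"])
```

Splitting each line on commas by hand would break the locality and coordinate fields, which is why a CSV parser is used here.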

Page 6: Data mining and data linking

Cleaning data

Page 7: Data mining and data linking

(10°18’N, 84°42’W)

We can read this, but a computer would prefer just numbers

2. MVZ 149813,2,"Puntarenas, CR","(10°18′N, 84°42′W)",1500,EF562319,EF562373,EF562386,EF562430
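One way to turn a coordinate string like this into numbers is sketched below in Python. It assumes the string always holds latitude first and longitude second, in degrees and whole minutes; the function name to_decimal is made up for this example.

```python
import re

# Degrees, minutes and a hemisphere letter, e.g. 10°18′N. The minute mark may
# be a prime (′), a straight apostrophe (') or a curly quote (’).
DM_PATTERN = re.compile(r"(\d+)\s*°\s*(\d+)\s*[′'’]\s*([NSEW])")


def to_decimal(text):
    """Return (latitude, longitude) in decimal degrees, or None if no match."""
    parts = DM_PATTERN.findall(text)
    if len(parts) != 2:
        return None
    values = []
    for degrees, minutes, hemisphere in parts:
        value = int(degrees) + int(minutes) / 60
        if hemisphere in "SW":
            value = -value  # south and west are negative
        values.append(round(value, 4))
    return tuple(values)


print(to_decimal("(10°18′N, 84°42′W)"))  # (10.3, -84.7)
```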

Page 8: Data mining and data linking

Tools for cleaning data

• Spreadsheets like Excel and Google Docs can be used to clean data using simple formulas (such as combining cells)

• Google Refine offers regular expressions, filtering, and the ability to call external services

Page 9: Data mining and data linking

Achatina fulica (giant African snail)

Page 10: Data mining and data linking

Reconciliation services

• By default Google Refine uses Freebase

• But we can add our own services…

Page 11: Data mining and data linking

Names reconciled using uBio and Google Refine

Page 12: Data mining and data linking

What can we do with data mining?

Page 13: Data mining and data linking

Extract information on ecological relationships

Page 14: Data mining and data linking
Page 15: Data mining and data linking

Text mining

Page 16: Data mining and data linking

Morphological and molecular description of Haematoloechus meridionalis n. sp. (Digenea: Plagiorchioidea: Haematoloechidae) from Rana vaillanti brocchi of Guanacaste, Costa Rica

Halipegus eschi n. sp. (Digenea: Hemiuridae) in Rana vaillanti from Guanacaste Province, Costa Rica

Haematoloechus danbrooksi n. sp. (Digenea: Plagiorchioidea) from Rana vaillanti from Los Tuxtlas, Veracruz, Mexico

Page 17: Data mining and data linking

<parasite name> (n. sp.) from <host name>
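The title pattern can be written as a regular expression. This Python sketch is one possible version, not a tested production rule: it assumes both names are two-word Latin binomials, and it accepts "in" as well as "from" because one of the three titles uses "in".

```python
import re

TITLE_PATTERN = re.compile(
    r"(?P<parasite>[A-Z][a-z]+ [a-z]+) n\. sp\."  # new species name
    r"(?: \([^)]*\))?"                            # optional classification in brackets
    r" (?:from|in) "
    r"(?P<host>[A-Z][a-z]+ [a-z]+)"               # host binomial
)

titles = [
    "Morphological and molecular description of Haematoloechus meridionalis "
    "n. sp. (Digenea: Plagiorchioidea: Haematoloechidae) from Rana vaillanti "
    "brocchi of Guanacaste, Costa Rica",
    "Halipegus eschi n. sp. (Digenea: Hemiuridae) in Rana vaillanti from "
    "Guanacaste Province, Costa Rica",
    "Haematoloechus danbrooksi n. sp. (Digenea: Plagiorchioidea) from Rana "
    "vaillanti from Los Tuxtlas, Veracruz, Mexico",
]

associations = []
for title in titles:
    match = TITLE_PATTERN.search(title)
    if match:
        associations.append((match.group("parasite"), match.group("host")))

for parasite, host in associations:
    print(parasite, "->", host)
```

Note that the subspecies name in the first title (brocchi) is dropped, because the sketch only captures two words for the host.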

Page 18: Data mining and data linking

Sources of host-parasite associations

• Titles of papers

• Sequence databases (GenBank)

Page 19: Data mining and data linking

What do crustaceans live on?

Green plants

Bacteria

Fungi

Vertebrates

Arthropods

Page 20: Data mining and data linking

What do insects live on?

Green plants

Bacteria

Fungi

Vertebrates

Arthropods

Page 21: Data mining and data linking

Host names in GenBank

• acorn gall on Quercus pyrenaica
• Aconitum napellus
• Aconitum napellus L.
• Acinonyx jubatus (Cheetah)
• Actinidia chinensis Hort 16A
• Alces alces (intermediate host)
• alfalfa
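Host strings like these need tidying before they can be compared. The Python sketch below pulls out the first thing that looks like a Latin binomial (a capitalised word followed by a lower-case word) and groups the raw strings under it. This is a rough heuristic chosen for illustration, and the helper name find_binomial is made up.

```python
import re
from collections import defaultdict

# A capitalised genus followed by a lower-case species epithet.
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]{2,}\b")


def find_binomial(host):
    """Return the first binomial-looking name in the string, or None."""
    match = BINOMIAL.search(host)
    return match.group(0) if match else None


hosts = [
    "acorn gall on Quercus pyrenaica",
    "Aconitum napellus",
    "Aconitum napellus L.",
    "Acinonyx jubatus (Cheetah)",
    "Actinidia chinensis Hort 16A",
    "Alces alces (intermediate host)",
    "alfalfa",
]

# Group the raw strings by the name found in them.
grouped = defaultdict(list)
for host in hosts:
    grouped[find_binomial(host)].append(host)

for name, raw_strings in grouped.items():
    print(name, "<-", raw_strings)
```

Common names such as "alfalfa" end up under None and would need a lookup service to resolve. Any capitalised English word followed by a lower-case word would also match, so the results still need checking.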

Page 22: Data mining and data linking

Extracting links between data sets

Page 23: Data mining and data linking

http://iphylo.org/~rpage/challenge

Page 24: Data mining and data linking
Page 25: Data mining and data linking

Citation links

Page 26: Data mining and data linking

Are there other kinds of links?

Page 27: Data mining and data linking

data linking

Page 28: Data mining and data linking

Extracting these links

• Look for GenBank sequences

• Look for specimen identifiers

• Look for taxonomic names

• Look for geographic localities
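As a rough illustration of the first two bullets, the Python sketch below searches a piece of text for GenBank accession numbers and specimen codes. The sample sentence is made up from values in the voucher table. Both patterns are assumptions: the accession pattern covers only the one-letter-plus-five-digit and two-letter-plus-six-digit forms, and the specimen pattern only fits codes shaped like the three museum codes seen in the table.

```python
import re

# One letter + 5 digits, or two letters + 6 digits (e.g. EF562319).
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")

# A 2-5 letter acronym, a space, an optional "A-" style prefix, then digits
# (e.g. MVZ 149813, UTA A-52449).
SPECIMEN = re.compile(r"\b[A-Z]{2,5} (?:[A-Z]-)?\d+\b")

# Made-up sentence built from values in the table.
text = (
    "Specimen MVZ 149813 (sequences EF562319, EF562373, EF562386) "
    "and UTA A-52449 (EF562312)"
)

accessions = ACCESSION.findall(text)
specimens = SPECIMEN.findall(text)

print(accessions)  # ['EF562319', 'EF562373', 'EF562386', 'EF562312']
print(specimens)   # ['MVZ 149813', 'UTA A-52449']
```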

Page 29: Data mining and data linking

Regular expressions to the rescue!

Page 30: Data mining and data linking

Regular expressions

• Rules for matching strings

• Allow for approximate or variable matches

• More flexible than “search and replace”

• [0-9]{4} matches a string with four digits (such as a year)
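A short Python sketch of the four-digit pattern, using a made-up sample string. It also shows how easily the pattern matches the wrong thing: without word boundaries it picks up the first four digits of a longer number.

```python
import re

# Made-up sample: a year and a six-digit specimen number.
text = "Published in 2009; voucher MVZ 149813"

# Plain pattern: any run of four digits, even inside a longer number.
loose = re.findall(r"[0-9]{4}", text)

# With word boundaries: exactly four digits standing on their own.
strict = re.findall(r"\b[0-9]{4}\b", text)

print(loose)   # ['2009', '1498']
print(strict)  # ['2009']
```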

Page 31: Data mining and data linking

demo

Page 32: Data mining and data linking

Perils of data mining

(matching the wrong things)

Page 33: Data mining and data linking

Taxa found in one paper

Image search on taxonomic name

Page 34: Data mining and data linking

Electra pilosa

Page 35: Data mining and data linking

Carmen Electra versus Electra

(guess which one is more popular?)

Page 36: Data mining and data linking

But what about this?

Page 37: Data mining and data linking

Homo sapiens

Page 38: Data mining and data linking

AJ711044

Page 39: Data mining and data linking

should be AJ971044

Page 40: Data mining and data linking

Error in paper led to wrong image

How do I fix this error in the paper?

Page 41: Data mining and data linking

Is there a better way to make these links?

(what if they were made for us?)

Page 42: Data mining and data linking

Digital Object Identifier (DOI)

Page 43: Data mining and data linking
Page 44: Data mining and data linking

Identifies a publication

Page 45: Data mining and data linking

Globally unique

Page 46: Data mining and data linking

10.1016/j.ympev.2006.04.006
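A DOI becomes a link by adding it to a resolver address. The Python sketch below builds that link without going online. The doi.org resolver is the current standard one (the older dx.doi.org address used earlier also resolves); the pattern used to check the DOI is a loose assumption and the function name doi_to_url is made up.

```python
import re
from urllib.parse import quote

# Loose shape check: "10.", a registrant number, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi, resolver="https://doi.org/"):
    """Return a resolvable URL for a DOI, or raise ValueError if it looks wrong."""
    if not DOI_PATTERN.match(doi):
        raise ValueError("does not look like a DOI: " + repr(doi))
    # Percent-encode any unusual characters but keep the slash.
    return resolver + quote(doi, safe="/")


print(doi_to_url("10.1016/j.ympev.2006.04.006"))
# https://doi.org/10.1016/j.ympev.2006.04.006
```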

Page 47: Data mining and data linking

Paper

Page 48: Data mining and data linking

Why have DOIs?

Page 49: Data mining and data linking

Link rot

Page 50: Data mining and data linking

Refs

Page 51: Data mining and data linking
Page 52: Data mining and data linking
Page 53: Data mining and data linking

2006

Cites

2006

Page 54: Data mining and data linking

Forward Cites

2006 2009

Page 55: Data mining and data linking

Shoulders of giants

Page 56: Data mining and data linking

progress is incremental

Page 57: Data mining and data linking

reuse past results

Page 58: Data mining and data linking

Forward Cites

2006 2008

Page 59: Data mining and data linking
Page 60: Data mining and data linking

Species

Genes

Page 61: Data mining and data linking

data linking

Page 62: Data mining and data linking

Data citation

Page 63: Data mining and data linking
Page 64: Data mining and data linking

Linked data

• Use the same globally unique identifier for the same thing (e.g., use a DOI for a paper)

• Identifier can be resolved (put it in a browser and get something back)

• Use the same terms to describe the same thing
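The benefit of shared identifiers can be shown with a small Python sketch. Two separate data sets are joined only because both use the same specimen code. The values come from the voucher table; the field names and the way the data is split into two sets are assumptions for this example.

```python
# Data set 1: sequences, each pointing at the specimen it came from.
sequences = [
    {"accession": "EF562319", "gene": "12S", "voucher": "MVZ 149813"},
    {"accession": "EF562312", "gene": "12S", "voucher": "UTA A-52449"},
    # Same specimen as the first row, but the code is written differently.
    {"accession": "EF562373", "gene": "16S", "voucher": "MVZ149813"},
]

# Data set 2: specimens, keyed by the specimen code.
specimens = {
    "MVZ 149813": {"locality": "Puntarenas, CR", "elevation_m": 1500},
    "UTA A-52449": {"locality": "Puntarenas, CR", "elevation_m": 1520},
}

# Join the two sets on the shared identifier.
linked = [
    {**seq, **specimens[seq["voucher"]]}
    for seq in sequences
    if seq["voucher"] in specimens
]
unlinked = [seq["accession"] for seq in sequences if seq["voucher"] not in specimens]

for record in linked:
    print(record["accession"], record["voucher"], record["locality"])
print("No link found for:", unlinked)
```

The third sequence is left unlinked because its specimen code is missing a space, which is the kind of mismatch that consistent identifiers are meant to prevent.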

Page 65: Data mining and data linking
Page 66: Data mining and data linking

What does the future hold?

• Identifiers for data (as well as papers)?

• Citation metrics for data?

• Regular expressions become less important (wishful thinking?)

• Linked data (problem is lack of links)