Workshop finding and accessing data - fiona - lunteren april 18 2016

Genome sharing projects around the world – and how you find data for your research Fiona Nielsen Lunteren, April 18 2016 Slides will be made available online

Transcript of Workshop finding and accessing data - fiona - lunteren april 18 2016

Page 1: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Genome sharing projects around the world

– and how you find data for your research

Fiona Nielsen

Lunteren, April 18 2016

Slides will be made available online

Page 2: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Follow us on Twitter: @repositiveio

Fiona Nielsen, April 18 2016

Find me on Twitter: @glyn_dk

Page 3: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Workshop outline

1. What data are you looking for? And why?

2. Data resources from around the world

3. Tips on how to find and access data

4. Hands-on using Repositive

5. Summary and feedback

Page 4: Workshop   finding and accessing data - fiona - lunteren april 18 2016

1. What data are you looking for?

This workshop will focus on finding and accessing human genomic data.

… And why would you be looking for genomic data for your research?

Are you researching cancer or genetic diseases?

Page 5: Workshop   finding and accessing data - fiona - lunteren april 18 2016

How much data do you need to publish a paper?

2001: 1 human genome

2012: 1000 Genomes (1092 genomes, since increased to ~2500)

2015: UK10K; Icelandic population (2,636 + 100k imputed); Cancer Genome Atlas (~11,000 genomes); ExAC consortium (65,000 exomes)

?

Page 6: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Statistically speaking, you still need 10s of thousands of samples for validation

The more severe the phenotype and the more complete penetrance, the easier it will be for you to find your variant, but

“As the genetic complexity of the disease increases (for example, reduced penetrance and increased locus heterogeneity), issues of statistical power quickly become paramount.” http://www.nature.com/nrg/journal/v15/n5/full/nrg3706.html

But I am just looking at this one disease…
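To get a feel for why sample sizes climb so quickly, here is a minimal sketch of a textbook sample-size estimate for comparing a variant's frequency between cases and controls with a two-sided two-proportion z-test. It is not from the slides: the function name, the example frequencies, the 80% power target and the genome-wide significance threshold of 5e-8 are all assumptions chosen for illustration, and a real study design should be worked out with a statistician.

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_group(p_cases: float, p_controls: float,
                      alpha: float = 5e-8, power: float = 0.8) -> int:
    """Approximate observations needed per group for a two-sided
    two-proportion z-test to distinguish p_cases from p_controls.

    alpha and power are illustrative defaults, not values from the slides.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance quantile
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    p_bar = (p_cases + p_controls) / 2
    null_term = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * sqrt(p_cases * (1 - p_cases)
                             + p_controls * (1 - p_controls))
    return ceil((null_term + alt_term) ** 2 / (p_cases - p_controls) ** 2)


if __name__ == "__main__":
    # Hypothetical variant frequencies: controls fixed at 1%.
    for p_cases in (0.05, 0.02, 0.015):
        n = samples_per_group(p_cases, p_controls=0.01)
        print(f"{p_cases:.3f} in cases vs 0.010 in controls: ~{n} per group")
```

Under these assumptions, the required number per group grows from the low thousands to the tens of thousands as the frequency difference shrinks, which is the pattern the quote describes.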

Page 7: Workshop   finding and accessing data - fiona - lunteren april 18 2016

What can I do?

PRO TIP: involve a statistician early on in your study design!

Page 8: Workshop   finding and accessing data - fiona - lunteren april 18 2016

How can I determine significance?

“One potentially powerful approach is to assess conservation across and within multiple species as whole-genome sequence data become more abundant.”

Look at extreme phenotypes “Sampling cases or controls from the extremes of an appropriate quantitative distribution can often increase power”

Look at non-SNP variants, they are more likely to have functional effects

- “how to account for the technical features of sequencing, such as incomplete sequencing and biased coverage over the genome?”

Page 9: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Think of how you can provide evidence that your result is not just a local technical variation or sampling bias

e.g. data from same cell type, same seq technology, same alignment…

How to account for bias?

PRO TIP: include more reference data in your analysis

Page 10: Workshop   finding and accessing data - fiona - lunteren april 18 2016

• Know what data is available in your lab, your dept, your org

• Survey from Qiagen showed that one of the main reasons researchers collaborate is to get access to data!

How can I access more data for my research?

Page 11: Workshop   finding and accessing data - fiona - lunteren april 18 2016

How can I find collaborators?

PRO TIP: Search for collaborators who have the data you need

PRO TIP: Tell your colleagues and peers what type of data you have in your lab

Page 12: Workshop   finding and accessing data - fiona - lunteren april 18 2016

2. Data resources from around the world

Public repositories

• some you apply for access, especially if data contains clinical info or whole genome PID

• some are open access: GEO, SRA, PGP, OpenSNP, GigaDB, …

• some are consented for general research use, some have specific consent
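As an illustration of how open-access repositories such as SRA and GEO can be searched programmatically, the sketch below builds a query for the NCBI E-utilities `esearch` endpoint and parses its JSON reply. This is not part of the workshop material: the helper names and the example search term are hypothetical, and the `search` function needs network access (NCBI also asks heavy users to register an API key).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "sra", retmax: int = 20) -> str:
    """Build an esearch URL; db can be e.g. 'sra' or 'gds' (GEO DataSets)."""
    query = urlencode({"db": db, "term": term,
                       "retmax": retmax, "retmode": "json"})
    return f"{EUTILS_ESEARCH}?{query}"


def parse_esearch(payload: str) -> tuple[int, list[str]]:
    """Return (total hit count, list of record IDs) from an esearch JSON reply."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), list(result["idlist"])


def search(term: str, db: str = "sra", retmax: int = 20) -> tuple[int, list[str]]:
    """Run the query against NCBI (requires network access)."""
    with urlopen(build_esearch_url(term, db, retmax), timeout=30) as response:
        return parse_esearch(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical search term, for illustration only.
    print(build_esearch_url("breast cancer AND Homo sapiens[Organism]"))
```

The returned IDs only tell you that records exist; for restricted repositories such as EGA or dbGaP you would still need to apply for access before downloading anything.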

Page 13: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Large amounts of data, but not accessible

[Chart, 2006 to 2015]

• 80+ PB sequenced every year, exponential growth rate

• ≈ 0.5 PB sequence available: WGS data available in public repos

Under-utilised data has huge potential for medical research

Page 14: Workshop   finding and accessing data - fiona - lunteren april 18 2016

DATA is fragmented

Page 15: Workshop   finding and accessing data - fiona - lunteren april 18 2016

It may be confusing

Page 16: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Hundreds of data sources…but they aren’t easy to find!

[Chart: number of data sources, Jan-15 to Mar-16, axis 0 to 200; data labels 10, 25, 33, 35, 102, 163]

First 30 data sources listed here: http://dx.doi.org/10.1371/journal.pbio.1002418

Page 17: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Data source content

Assay Types

Dedicated to…

Page 18: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Number of samples in Data sources

[Chart: sample number per data source, log10 scale from 0.2 to 2,000,000]

Top 5:

• GEO (1.8M)

• PMI Cohort Program (1M)

• Auria Biopankki (1M)

• EGA (~0.6M)

• SRA (~0.5M)

Page 19: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Data accessibility

Open access: can download the data straight away or after logging in.

Restricted access: need to apply for access to the data.

Open and restricted: has both Open and Restricted access data within one repository.

Page 20: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Online data source ‘types’

University – Affiliated to a university. Often only members of that university can upload/download to/from it.

Catalogue – Doesn’t have raw data but lists studies/datasets.

Initiative/Consortium – Has a specific purpose/aim. Often focussed on a question or disease.

Repository – Can download from, has data from multiple institutions. Often can also upload your own data there.

Company – For profit organisation. Listing data is not their main purpose.

Biobank – many have sequence data of their biological samples.

Page 21: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Sequenced ethnicities

Aboriginals

African Americans

Africans

Australians

Chinese

Malays

Indians

Danish

Dutch

Estonian

Russian

European Ancestry

Finnish

Icelandic

Japanese

Korean

Latin Americans

Saudi

Swedish

Page 22: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Machines & Data sources

[Map: sequencing machines and data sources per world region; only the label “23 International” is recoverable]

Interesting site to look at: http://omicsmaps.com/stats

Page 23: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Main Repository funders

BGI = 4

EBI = 9

NIH = 10

NCBI = 9

The Broad = 8

Wellcome = 4

EBI total 104 services, 19 repositories http://www.ebi.ac.uk/services/all

NCBI total 67 databases http://www.ncbi.nlm.nih.gov/guide/all/#databases_

Page 24: Workshop   finding and accessing data - fiona - lunteren april 18 2016

3. Tips to find and access data

• Case study: DNA data on Cancer

Page 25: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Case Study – DNA data on Cancer

Repositories you have heard of:

Repository | Data type | Access
ArrayExpress | Expression | Open
GEO | Expression | Open
EGA | Mixed | Restricted
dbGaP | Mixed | Restricted
ENCODE | Healthy reference | Open
1000 Genomes | Healthy reference | Open

Ask around (word of mouth):

Repository | Data type | Access
COSMIC | Somatic mutations & WGS | Open
ClinVar | Variant information | Open
ExAC | Allele freq. but not raw data | Open
SRA | Individual sequences | Open
TCGA | Clinical & high level data | Open
CGHub | Low level data (DNA data) | Restricted
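The repositories listed above can be kept in a small lookup so you can see at a glance which ones you can download from today and which need an application. This is only a sketch: the `DataSource` class and `by_access` helper are hypothetical names, and the entries simply copy the repository, data type and access values from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    data_type: str
    access: str  # "Open" or "Restricted", as listed on the slide


# Entries copied from the case-study slide.
SOURCES = [
    DataSource("ArrayExpress", "Expression", "Open"),
    DataSource("GEO", "Expression", "Open"),
    DataSource("EGA", "Mixed", "Restricted"),
    DataSource("dbGaP", "Mixed", "Restricted"),
    DataSource("ENCODE", "Healthy reference", "Open"),
    DataSource("1000 Genomes", "Healthy reference", "Open"),
    DataSource("COSMIC", "Somatic mutations & WGS", "Open"),
    DataSource("ClinVar", "Variant information", "Open"),
    DataSource("ExAC", "Allele freq. but not raw data", "Open"),
    DataSource("SRA", "Individual sequences", "Open"),
    DataSource("TCGA", "Clinical & high level data", "Open"),
    DataSource("CGHub", "Low level data (DNA data)", "Restricted"),
]


def by_access(sources: list[DataSource], access: str) -> list[str]:
    """Return the names of all sources with the given access level."""
    return [source.name for source in sources if source.access == access]


if __name__ == "__main__":
    print("Apply early for:", ", ".join(by_access(SOURCES, "Restricted")))
    print("Download now:", ", ".join(by_access(SOURCES, "Open")))
```

Listing the restricted sources first is a reminder to start those applications early, since they are the slow part of the process.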

Page 26: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Case Study – DNA data on Cancer

We have identified the first 27 cancer specific data sources

And many more that contain cancer data alongside other data types.

• Abcodia
• AmbryShare
• BRCA Exchange
• Breast Cancer Now Tissue Bank
• Broad Cancer programme datasets
• Cancer Moonshot 2020
• CanGEM
• CGCI
• CGHub
• Chinese cancer genome consortium
• Chinese national human genome centre
• Follicular Lymphoma Genome Data
• G-DOC
• GenoMel
• ICGC
• National Mesothelioma Virtual Bank
• NCIP Hub
• Project GENIE
• Target
• TCGA
• Texas cancer research biobank
• NCI-60
• CCLE
• COSMIC
• Fantom
• Cancer methylome system
• Cancer therapeutics response portal

Page 27: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Registering for CGHub: https://cghub.ucsc.edu/keyfile/newuser.html

1. Register for eRA account

‘Principal signing official’ registers → Email to verify → Email to confirm/deny access to website → Email with temporary password → Change password → Electronic signature

2. Request access to specific dataset of interest

Login → Fill in contact info, complete ‘424’ form (research application form) → Request reviewed by DAC → Email to confirm/deny access to data

3. Download data

Login → Retrieve personal access token → Download!

Page 28: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Often a long process

Bottlenecks:

• Finding relevant and usable data

• Getting authorisation to access data

• Formatting data

• Storing and moving data

We studied the problem by qualitative interviews followed by a survey of researchers in human genetics

Page 29: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Often a long process

Researchers spend months to find and access genomic data, and often choose to not access data at all

T. A. van Schaik et al., The need to redefine genomic data sharing: a focus on data accessibility, Applied & Translational Genomics, 2014, doi:10.1016/j.atg.2014.09.013

Page 30: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Why the barrier?

Page 31: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Why the barrier?

• Benefits: strict governance, review of consent, applicant signs for full responsibility for governance

• Disadvantages: No control of data once access is given, high barrier for access – too high?

Page 32: Workshop   finding and accessing data - fiona - lunteren april 18 2016

How can I save time?

• Start planning your data needs early in your project

• When you find the data you need, start the application

• Use Open Access data

PRO TIP: If you use human genomic data, apply for the General Research Use (GRU) datasets in dbGaP: one application gives access to all the GRU datasets

Page 33: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Not all data is restricted

• Some data is Open Access (this requires specific consent)

• OpenSNP.org (Bastian)

• Personal Genome Projects

• Individuals who put their genomes online, e.g. Manuel Corpas and his family, “the Corpasome”: http://manuelcorpas.com/about/


Page 35: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Personal Genome Project

Project | PGP Harvard | PGP Canada | PGP UK | Genom Austria
Host institution | Harvard Medical School, Boston | SickKids, Toronto | University College London | CeMM Research Center for Molecular Medicine
Principal Investigator | George Church | Stephen Scherer | Stephan Beck | Christoph Bock & Giulio Superti-Furga
Launch year | 2005 | 2012 | 2013 | 2014
Geographic scope | USA, mainly Boston | Canada | United Kingdom | Mainly Austria
Data generated | Whole genome sequencing, upload of additional data possible | Mainly whole genome sequencing | Whole genome sequencing, DNA methylome sequencing, RNA transcriptome sequencing | Mainly whole genome sequencing
Number of genomes | 100s | 10s | 10s | 10s
Data access | http://personalgenomes.org/harvard/data | | | http://genomaustria.at/unser-genom/#genome-der-pionierinnen
Project funding | Discretional funds and corporate sponsoring | Institutional startup funds | Discretional funds and corporate sponsoring | Institutional startup funds
Website | http://personalgenomes.org/harvard/ | http://personalgenomes.org/canada/ | http://personalgenomes.org/uk/ | http://genomaustria.at/

Enrollment eligibility: At least 18 years old, able to make an informed decision, perfect score in the PGP enrollment exam, certain vulnerable groups excluded

Areas of emphasis (three entries, in slide order): Integration with phenotypic data, collaboration with other personal omics initiatives; Genome donations, synergy with massive-scale clinical genome sequencing projects; Genomes and society, genetic literacy, school projects, education

Page 36: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Summary of data access barriers

Data is uploaded to repository

Data is discovered by potential user

Data is accessed by potential user

Page 37: Workshop   finding and accessing data - fiona - lunteren april 18 2016

• “even when researchers are authorised to share data they report reluctance to do so because of the amount of effort required” http://www.sciencedirect.com/science/article/pii/S2212066114000386

• “Clinical geneticists cited a lack of time because their main priority is diagnosing patients. Industrial researchers cited a lack of time because of the pressure to meet the deadlines in their job. Researchers in academia cited both a concern about the potential loss of future publications once unpublished data is shared, and the lack of time and incentive to share data as this does not contribute to their publication record. Researchers from all categories felt that they lacked sufficient resources to make their data available.”

The barrier of making data available

But I do not want to share my data

Page 38: Workshop   finding and accessing data - fiona - lunteren april 18 2016

• If you expect data to be available to you – you have to make your data available too!

• Encourage collaborations: power by numbers

1. Get credit – publish and make your data available

2. Give credit – cite data sources

3. Understand consent – for all uses of clinical data

Best practices

Page 39: Workshop   finding and accessing data - fiona - lunteren april 18 2016

• Use all available tools to make your life easier:

• Data publications give visibility and citations for your data, e.g. GigaScience and Scientific Data

• Figshare, Zenodo, Dryad for sharing open access data

• PhenomeCentral, Matchmaker Exchange for rare disease research

• Repositive for finding data across repositories and making your own data discoverable

Best practices: use the tools

Page 40: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Does data sharing matter at grant proposal evaluation?

Based on: Winning Horizon 2020 with Open Science, http://dx.doi.org/10.5281/zenodo.12247

Best practices: Plan into your grant proposals

Page 41: Workshop   finding and accessing data - fiona - lunteren april 18 2016

“Weakness: Involvement of non-academic beneficiaries is limited”

“Weakness: highly focused on academic activities, and lacks an advanced communication strategy”

“Weakness: limited exposure to non-academic partners & infrastructures”

Excellence

Impact

Implementation

“data accessibility is unclear!”

“data storage & access not considered”

Best practices: Plan into your grant proposals

Page 42: Workshop   finding and accessing data - fiona - lunteren april 18 2016

“Strengths: extensive dissemination of data to the scientific community (open access, databases)”

“outreach activities to a broad audience”

“research software is freely available”

Impact:

Best practices: Plan into your grant proposals

Page 43: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Best practices: Plan into your grant proposals

Page 44: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Make the (research) world a better place by sharing in return

Best practices: Share in return!

Page 45: Workshop   finding and accessing data - fiona - lunteren april 18 2016

• Digital consent: towards automatic processing of applications

• Dynamic consent and power to the patient, e.g. PatientsKnowBest

• Privacy-preserving access to datasets: preserving control and governance with data custodian, lower barrier for access

What the future holds

Page 46: Workshop   finding and accessing data - fiona - lunteren april 18 2016

4. Hands-on session using Repositive

What if finding data was as easy as finding a book on Amazon or booking a hotel on Expedia?

Page 47: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Repositive promotes best practices

Discover new data sources

EASY SEARCH

Page 48: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Repositive promotes best practices

Make your data visible

SHARE KNOWLEDGE

Page 49: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Repositive promotes best practices

Build a data community

BUILD TRUST

Page 50: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Benefit for both sides of data collaboration

Data consumers

• Find relevant data faster

• Feedback from other users through ratings and comments to evaluate data quality

• Find collaborators with data

Data producers

• Make your data visible

• Build credibility as a trusted provider of quality data

• Find collaborators to analyse your data

Page 51: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Live demo http://discover.repositive.io

Use activation code: BioBS16

Page 52: Workshop   finding and accessing data - fiona - lunteren april 18 2016

5. Summary and feedback

• Get credit – publish data

• Give credit – cite data

• Understand consent

Page 53: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Tell us your thoughts: @repositiveio

@glyn_dk

And read more on http://repositive.io

Page 54: Workshop   finding and accessing data - fiona - lunteren april 18 2016

Thank you!