
Dr. Sven Nahnsen, Quantitative Biology Center (QBiC)

Data Management for Quantitative Biology Lecture 2: Basics and Challenges of Biological and Biomedical Data Management

Overview
•  Recap from last week / remaining slides from last week
•  Basics of data management
•  Data management plan
•  Challenges in relation to biomedical research
   -  Tier-1 challenges
      §  Data transfer
      §  Storage concepts
      §  Data sharing/data dissemination
      §  User logging
•  Data privacy considerations

2

Basic concept of data management

[Diagram]
Data: biological data (e.g., NGS, proteomics, metabolomics)
Data management: store, analyze, integrate, … (see data life cycle, DMBoK)
Outcome: generate added value; enable collaboration; sustainability, reproducibility, …
Tier 0 – Tier 3 (with increasing domain specificity): data management plan; data transfer; handling/storage; sharing/dissemination; annotation (metadata); data processing; data access logging; …

3

Data management plan (DMP)
•  There is no standard for a data management plan
•  However, funding agencies, journals and institutions require researchers to have a data management plan
•  The DAMA-DMBOK provides a good orientation for any DMP
•  Not all aspects are relevant to a biological/biomedical research project
•  Essential aspects concern:
   -  Data acquisition, standards, file formats
   -  Data sharing
   -  Data preservation

4

Data management plan

5

Creating a DMP (1/4)
•  https://dmptool.org
   -  Title of study
   -  DMP creator
   -  Affiliation
   -  Time stamp
   -  Copyright

6

Creating a DMP (2/4)
•  https://dmptool.org

-  Data collection, formats and standards

-  Data storage and preservation

-  Dissemination methods

7

Creating a DMP (3/4)
•  https://dmptool.org
   -  Roles
   -  Responsibilities

8

Creating a DMP (4/4)
•  https://dmptool.org
   -  Policies for sharing
   -  Public access

9

Data management plan – other sources
•  A guide to writing a data management plan
   -  http://data.research.cornell.edu

10

Data transfer
•  Why does data have to be moved?
   -  Data generation instruments are usually not located at the data center
   -  If data is shared globally -> big issue
   -  If raw data has to be brought together with other raw data (e.g., mass spec with NGS data)
•  Data transfer may be a security hole
•  There are many different data transfer technologies (a minimal transfer sketch follows below)
   -  FTP (File Transfer Protocol), based on TCP (Transmission Control Protocol)
   -  HTTP (Hypertext Transfer Protocol), based on TCP
   -  R-Sync
   -  FASP (Aspera)
   -  We cannot cover all protocols
•  Best solution: avoid data transfer!

11
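A minimal sketch of moving a file with two of the TCP-based protocols named above, using only the Python standard library; the host names, credentials and file names are hypothetical placeholders, not part of the lecture. Note that plain FTP sends credentials in clear text, one concrete example of why data transfer may be a security hole.

```python
from ftplib import FTP
from urllib.request import urlretrieve

# Download a (hypothetical) reference file over HTTP.
urlretrieve("https://data.example.org/reference/genome.fa.gz", "genome.fa.gz")

# Upload a local result file to a (hypothetical) FTP server.
with FTP("ftp.example.org") as ftp:
    ftp.login("qbic_user", "secret")  # credentials travel unencrypted over plain FTP
    with open("sample_counts.tsv", "rb") as fh:
        ftp.storbinary("STOR sample_counts.tsv", fh)
```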

Data transfer
•  Classical protocols such as FTP, SCP and HTTP work well if the data is in the MB or low GB range
•  Large data sets may not be suitable for such transfer protocols
•  However, compression is an option (reduces the size to be transferred, but requires compute power on the sender and recipient side); see the sketch below
   -  Data Compression Explained (http://mattmahoney.net/dc/dce.html), Matt Mahoney, 2013
   -  Bottlenecks
      §  Time
      §  Memory usage

Science Blogs: Good Math, Bad Math, Chu-Carroll, 2009. http://goo.gl/Mf1G9L

A compression is a function C mapping a string x to a string C(x), typically shorter than x; for lossless compression there must additionally be a decompression function D with D(C(x)) = x.

12
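A minimal sketch of the time-vs-size trade-off, assuming a local (hypothetical) file and using Python's standard gzip module, which is not a tool named on the slide:

```python
import gzip
import os
import shutil
import time

def gzip_file(path: str) -> None:
    """Compress `path` to `path.gz` and report the size reduction and elapsed time."""
    start = time.time()
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks to keep memory usage low
    elapsed = time.time() - start
    before = os.path.getsize(path)
    after = os.path.getsize(path + ".gz")
    print(f"{before:,} B -> {after:,} B ({after / before:.1%} of original) in {elapsed:.1f} s")

gzip_file("sample.fastq")  # hypothetical raw sequencing file
```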

R-Sync
•  Initial release 1998
•  rsync is a widely used utility to keep copies of a file on two computer systems the same
•  Can invoke integrity checks with checksums
•  Example: longitudinal parity byte (see the sketch below)
   -  Break the data into words of n bits
   -  Compute the exclusive OR (XOR) of all words
   -  Append the resulting word to the message
   -  Check integrity by calculating the XOR over all words, including the check word
   -  The result needs to be a word of n zeros

[Diagram: data words W1, W2, … combined with XOR]

http://en.wikipedia.org/wiki/Checksum, accessed 22/04/2015, 5 PM
http://en.wikipedia.org/wiki/Exclusive_or, accessed 22/04/2015, 6 PM

13
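A minimal sketch of the longitudinal parity check described above, assuming 8-bit words (one byte per word); this illustrates the XOR idea only and is not rsync's actual rolling-checksum algorithm:

```python
from functools import reduce

def parity_byte(data: bytes) -> int:
    """XOR of all 8-bit words; appended to the message as a check word."""
    return reduce(lambda a, b: a ^ b, data, 0)

def verify(message_with_check: bytes) -> bool:
    """XOR over data plus check word must be a word of all zeros."""
    return parity_byte(message_with_check) == 0

payload = b"GATTACA"
message = payload + bytes([parity_byte(payload)])
print(verify(message))                       # True
print(verify(b"GATTAGA" + message[-1:]))     # False: a corrupted byte changes the parity
```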

Fast and Secure Protocol (FASP)
•  Developed by Aspera (acquired by IBM)
•  Up to a hundred times faster than FTP and HTTP (both TCP based)
•  Built-in security using open standards
•  Open architecture

14

ASPERA (FASP)
•  Capabilities:
   -  High-performance data transfer software
   -  Enterprise file sync and share (ad hoc)
   -  Email/mobile/cloud integration
•  Founded:
   -  2004 in Emeryville, California
   -  Acquired by IBM in 2014
•  Markets served:
   -  Over 3000 customers in engineering, media & entertainment, government, telecommunications, life sciences, etc.
•  The Aspera difference:
   -  Patented FASP™ transport technology

Slide courtesy of Arnd Kohrs, Aspera, 2015.

15

Challenges with TCP and alternative technologies
•  Distance degrades conditions on all networks
   -  Latency (or round-trip times) increases
   -  Packet losses increase
   -  Fast networks are just as prone to degradation
•  TCP performance degrades with distance
   -  Increased latency and packet loss
•  TCP does not scale with bandwidth
   -  TCP was designed for low bandwidth
   -  Adding more bandwidth does not improve throughput (see the worked example below)
•  Alternative technologies
   -  TCP-based: network latency and packet loss must be low
   -  Modified TCP: improves TCP performance but is insufficient for fast networks
   -  Data caching: inappropriate for many large file transfer workflows
   -  Data compression: time consuming and impractical for certain file types

Slide courtesy of Arnd Kohrs, Aspera, 2015.

16

Latency: the time from the source sending a packet to the destination receiving it.
Packet loss: packet loss occurs when data packets fail to reach the destination.
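To make the "adding bandwidth does not help" point concrete, here is a back-of-the-envelope calculation (not from the slide) using the well-known Mathis et al. approximation, which bounds steady-state TCP throughput by MSS / (RTT · sqrt(p)) for segment size MSS, round-trip time RTT and packet loss rate p:

```python
from math import sqrt

def tcp_throughput_bound_mbit_s(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Mathis et al. approximation: throughput <= MSS / (RTT * sqrt(p)), in Mbit/s."""
    return (mss_bytes * 8 / rtt_s) / sqrt(loss_rate) / 1e6

# Same 0.1% packet loss, increasing distance (round-trip time):
for rtt_ms in (1, 20, 150):
    bound = tcp_throughput_bound_mbit_s(1460, rtt_ms / 1000, 0.001)
    print(f"RTT {rtt_ms:>3} ms -> at most ~{bound:,.0f} Mbit/s per connection")
```

Even on a 10 Gbit/s link, a single long-distance TCP connection with modest packet loss is therefore limited to a few Mbit/s, which is why high-throughput transfers either open many parallel TCP streams or use UDP-based protocols such as FASP.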

fasp™ – High-Performance Data Transport
•  Maximum transfer speed
   -  Optimal end-to-end throughput efficiency
   -  Scales with bandwidth, independent of latency and resilient to packet loss
•  Congestion avoidance and policy control
   -  Automatic, full utilization of available bandwidth
   -  Ready to use on existing networks due to adaptive rate control
•  Uncompromising security and reliability
   -  SSL/SSH2: secure user/endpoint authentication
   -  AES cryptography in transit and at rest
•  Central management, monitoring and control
   -  Real-time progress, performance and bandwidth utilization
   -  Detailed transfer history, logging and manifest
•  Low overhead
   -  Less than 0.1% overhead at 30% packet loss
   -  High performance with large files or large sets of small files
•  Resulting in
   -  Transfers up to orders of magnitude faster than FTP
   -  Precise and predictable transfer times
   -  Extreme scalability (size, bandwidth, distance, number of endpoints, and concurrency)

Slide courtesy of Arnd Kohrs, Aspera, 2015.

17

Data Storage
•  Data archive vs. data backup
   -  Economic factor
   -  Use case example
•  RAID technology (e.g., RAID 1 and RAID 5)

18

Universität Tübingen/Jörg Jäger

Data archive vs. data backup

Archive
•  Stores data that is no longer in day-to-day use but must still be retained
•  Speed of retrieval is not as crucial
•  Archiving time requirements can be many years/decades
•  Data that is archived should contain native (standardized) raw data

Backup
•  Provides rapid recovery of operational data
•  Use cases: data corruption, accidental deletion or disaster recovery (DR) scenarios
•  Speed is crucial
•  Time requirements can be several weeks/months
•  Data is mostly kept in proprietary formats

Data backup vs. archiving: What's the difference?, Antony Adshead, 2009. www.computerweekly.com

19

Data backup vs. archive: Big difference in costs

•  Backup is essential
•  Example:
   -  10 GbE clients
   -  4 daily backups of 100 TB
   -  A full weekly backup saved for 4 weeks
   -  End-of-month backups saved for a year
   -  End-of-year backups saved for seven years
•  This amounts to 25 times the production data
•  Full recovery within 24 h needs 1.2 Gbit/sec (see the calculation after the next slide)

How Archive Can Fix Backup? Spectra Logic, 2011. https://www.spectralogic.com/how-archive-can-fix-backup/

20

Data backup vs. archive: Big difference in costs

•  Consider what data really needs to go into backup (and for how long)
•  From experience, a maximum of 20% is really "hot data"
•  80% of the data can go into the archive
•  Still 25 times the production data, which is now only 0.5 TB, and a 0.24 Gbit/sec connection suffices to recover within 24 h (see the calculation below)

How Archive Can Fix Backup? Spectra Logic, 2011. https://www.spectralogic.com/how-archive-can-fix-backup/

21
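A back-of-the-envelope helper (not from the slides) relating restore volume, recovery window and sustained bandwidth; the two calls translate the bandwidth figures quoted above (1.2 Gbit/s for the full restore, 0.24 Gbit/s once 80% of the data has moved to the archive) back into the data volumes they can move within 24 hours:

```python
def tb_restorable(gbit_per_s: float, window_hours: float) -> float:
    """Data volume (TB) that a sustained rate can restore within the given window."""
    return gbit_per_s * 1e9 * window_hours * 3600 / 8 / 1e12

print(f"1.2  Gbit/s for 24 h moves ~{tb_restorable(1.2, 24):.0f} TB")   # full restore case
print(f"0.24 Gbit/s for 24 h moves ~{tb_restorable(0.24, 24):.1f} TB")  # after archiving the cold 80%
```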

Technologies

Archive
•  Tape storage
   -  Inexpensive; can host large volumes; very robust
   -  Slow (data is read in blocks); requires hardware management
•  Optical media storage
   -  Write once, read many times (no physical contact)
   -  Low capacity and rather slow
•  Disk storage
   -  Random access; falling prices; fast; RAID compatible

Backup
•  Tape and optical media are not really an option
•  Disk storage (magnetic drives)
   -  Fast access is essential; RAID compatibility is essential
   -  Continuous power is needed, and backups will fail during a power outage
•  Solid state drives
   -  Very fast; falling prices
   -  Capacity is still an issue

Data archiving techniques: Choosing the best archive, Pierre Dorion. http://searchdatabackup.techtarget.com/tip/Data-archiving-techniques-Choosing-the-best-archive-media

On the horizon for archive and backup alike: cloud technologies

22

RAID (Redundant Array of Independent Disks)
•  Data storage virtualization
•  Many disks are combined into one logical volume for improved redundancy and performance
•  Note: a RAID is not a backup, nor an archive
•  There are different RAID levels, indicating the level of redundancy
•  RAIDs use the parity concept to enable cost-efficient redundancy (see the sketch below)
   -  A parity bit, or check bit, is a bit added to the end of a string of binary code that indicates whether the number of bits with the value one in the string is even or odd
   -  Distinguish between even and odd parity

http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/Parity_bit

23
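A minimal sketch of even parity over a string of bits, illustrating the check-bit idea in plain Python (not tied to any specific RAID implementation):

```python
def even_parity_bit(bits: str) -> str:
    """Return the parity bit that makes the total number of 1s (data + parity) even."""
    return "1" if bits.count("1") % 2 else "0"

data = "1011001"
stored = data + even_parity_bit(data)       # "10110010": four 1s in total, even parity holds
print(stored, stored.count("1") % 2 == 0)   # True

corrupted = "1010001" + stored[-1]          # a single flipped data bit is detected
print(corrupted.count("1") % 2 == 0)        # False
```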

RAID (Redundant Array of Independent Disks)

http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/Parity_bit

•  There are many different RAID levels (differing in performance, availability and costs). We discuss RAID 1 and RAID 5
•  RAID 1
   -  Data mirroring (no parity)
   -  Good read performance and reliability
   -  But not cost efficient
•  For the assignments you will also need to use other RAID technologies

24

RAID (Redundant Array of Independent Disks)

http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/Parity_bit

•  There are many different RAID levels (differing in performance, availability and costs). We discuss RAID 1 and RAID 5
•  RAID 5
   -  Block-level striping with distributed parity
   -  Can tolerate the loss of one disk (see the reconstruction sketch below)
   -  At least three disks are required

25
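A toy sketch of the RAID 5 idea with three "disks" holding two data blocks plus an XOR parity block; if any single disk is lost, the missing block can be rebuilt from the other two. This is an illustration only, since real RAID 5 stripes data and rotates the parity block across all disks.

```python
def xor_block(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

data1 = b"AAAABBBB"               # block on disk 1
data2 = b"CCCCDDDD"               # block on disk 2
parity = xor_block(data1, data2)  # parity block on disk 3

# Disk 1 fails: its block is recovered from the surviving data block and the parity block.
recovered = xor_block(data2, parity)
print(recovered == data1)         # True
```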

Sharing data
•  Methods for sharing data:
   -  Distinguish between post- and pre-publication sharing
   -  BioMart (biomart.org)
   -  Integrated management solutions (e.g., LabKey Server, openBIS)
   -  Many examples can be found in the Baker paper
•  Decentralized authentication system: OpenID

Nature Methods 9, 39–41 (2012), doi:10.1038/nmeth.1815, published online 28 December 2011

26

BioMart
•  Free software and data services for the scientific community
•  Researchers can set up their own data sources
•  Own scientific data can be exposed to the world of researchers
•  Own data can be federated with data from others

27

Integrated data management tools: LabKey
•  Will be discussed thoroughly in lecture 11
•  Most of these tools are open source (partly with commercial support, e.g., LabKey)
•  Complete workflow: from the data source to sharing
•  Sharing can be public or restricted to dedicated users

28

OpenID
•  Relatively recent concept (OpenID Connect 1.0, released 02/2014)
•  Run by the non-profit OpenID Foundation
•  Authentication via co-operating sites
•  Users can log in without having to enter all their information again
•  Close to a unified web ID
•  Many data sharing platforms in biology and biomedicine are adopting the OpenID concept (e.g., ICGC)
•  Advantages vs. disadvantages will be discussed in the problem sessions

http://openid.net
http://openidexplained.com

[Figure: companies involved in OpenID]

29

OpenID for scientific data
•  Allows logging of the usage of shared data
•  Important step towards guaranteeing intellectual property rights for openly shared data
•  World-wide data usage can become possible
•  Avoids overhead for user management

http://openid.net

30

Data privacy considerations

31

Some facts
•  Surnames are paternally inherited in most Western countries
•  Co-segregation with Y-chromosome haplotypes
•  Breakers: adoption, non-paternity, mutations, …
•  Business model of genetic genealogy companies (e.g., finding the biological father)
•  There are many (big) databases containing haplotype information (e.g., the HapMap project or www.smgf.org)
   §  You enter a combination of Y-STR alleles (Y-chromosome short tandem repeats); see the toy example below
   §  You receive matching records: surnames with geographical location and pedigrees

Definition haplotype: A haplotype is a set of DNA variations, or polymorphisms, that tend to be inherited together. A haplotype can refer to a combination of alleles or to a set of single nucleotide polymorphisms (SNPs) found on the same chromosome.

http://ghr.nlm.nih.gov/glossary=haplotype
Gymrek et al., 2013, Science. Identifying personal genomes by surname inference.

32
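A toy illustration of the matching step described above, using entirely hypothetical records (this is not the SMGF or HapMap interface): a query profile of Y-STR repeat counts is compared against stored records, and the surnames of close matches are returned.

```python
# Hypothetical Y-STR records: marker -> repeat count, plus the donor's surname.
records = [
    {"surname": "Miller", "profile": {"DYS19": 14, "DYS390": 24, "DYS391": 10}},
    {"surname": "Fisher", "profile": {"DYS19": 15, "DYS390": 23, "DYS391": 11}},
    {"surname": "Miller", "profile": {"DYS19": 14, "DYS390": 24, "DYS391": 11}},
]

def matching_surnames(query: dict, max_mismatches: int = 1) -> list:
    """Return surnames whose profiles differ from the query at <= max_mismatches markers."""
    hits = []
    for rec in records:
        mismatches = sum(query[m] != rec["profile"][m] for m in query)
        if mismatches <= max_mismatches:
            hits.append(rec["surname"])
    return hits

print(matching_surnames({"DYS19": 14, "DYS390": 24, "DYS391": 10}))   # ['Miller', 'Miller']
```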

Can surnames be inferred from genome data?
•  Personal genome data is becoming increasingly available
•  Open databases contain genealogy information
•  39 k unique surnames vs. 135 k records (R² = 0.78)
•  Given a haplotype profile, the correct surname can be discovered in 95% of cases
•  If additional demographic data is available (internet searches), the individual identity can almost be assigned completely

Gymrek et al., 2013, Science. Identifying personal genomes by surname inference.

[Figure: records per surname (US data); cumulative distribution of US males matching age, state and surname vs. matching only state and age]

33

NGS for haplotype information
•  100 bp (base pair) read length, PE (paired-end) sequencing, 13x average coverage
•  Haplotype information can be reconstructed with 99% accuracy
•  Using Craig Venter's genome sequence, genealogy analysis revealed the correct surname
•  Surname + date of birth + state reveals the correct individual

Gymrek et al., 2013, Science. Identifying personal genomes by surname inference.

34

Multiple matches
•  If extensive genealogy information is available (e.g., the procedure is common in a family), the search may lead to many candidates
•  Once the surname has been recovered:
   -  Publicly available internet information is added (obituaries, search engine records)
   -  Demographic information from genealogy databases is added
   -  This resulted in full identification of the corresponding individuals
•  Note the implications that social media profiles may have!

Gymrek et al., 2013, Science. Identifying personal genomes by surname inference.

35

Summary – surname inference
•  There is potential for vulnerability
•  Accuracy depends on the Y-chromosome read coverage (longer reads will lead to higher coverage)
•  Gymrek et al. suggest:
   -  Establishing global data sharing policies
   -  Educating participants about potential risks and benefits of genetic studies
   -  Developing legislation on the proper usage of genetic data

Gymrek et al., 2013, Science. Identifying personal genomes by surname inference.
Rodriguez et al., 2013, Science. The Complexities of Genomic Identifiability.

36

Contact:
Quantitative Biology Center (QBiC)
Auf der Morgenstelle 10
72076 Tübingen · Germany
[email protected]

Thanks for listening – See you next week