DNA analysis on your laptop: Spot the differences

DNA analysis on your laptop:
Spot the differences

Tech Tuesday, 11 October 2016

Barbera van [email protected]

https://ebbailey.wordpress.com/general-information/

Things I like

During the day I'm a bioinformatician

In my spare time I ...

Go to concerts and festivals

Cook (all cuisines)

Read (fantasy, popular science/philosophy, Dutch literature)

Make things (sewing, electronics, laser cutting, welding, 3d printing)

Look into self-hosted cloud services

Grow vegetables in my garden

Overview

What is bioinformatics
DNA basics
DNA sequencing
(Large) DNA projects
Public databases
DNA sequence analysis
Your own DNA

What is Bioinformatics?

Extraction of biological knowledge from complex data

How does molecule A interact with protein B?

A schematic visual model of the oxygen-binding process, showing all four monomers and hemes, and protein chains only as diagrammatic coils, to facilitate visualization into the molecule. (http://en.wikipedia.org/wiki/Hemoglobin)

What is bioinformatics?

Image: BII

Understand biological systems

Find interesting bits in (heaps of) complex data

Computer simulations/models to understand what happens

https://phet.colorado.edu/en/simulation/legacy/natural-selection

Bioinformatics tools

... one of the results *might* be a tool you can use

Image: CSI game

It never looks like this though

Image: Oblivion (Universal Pictures)

Or this

Image: Prometheus (Scott Free Productions)

Usually it looks more like this

DNA

https://en.wikipedia.org/wiki/DNA

"Eukaryote DNA-en" by Eukaryote_DNA.svg: *Difference_DNA_RNA-EN.svg: *Difference_DNA_RNA-DE.svg: Sponk (talk)translation: Sponk (talk)Chromosome.svg: *derivative work: Tryphon (talk)Chromosome-upright.png: Original version: Magnus Manske, this version with upright chromosome: User:Dietzel65Animal_cell_structure_en.svg: LadyofHats (Mariana Ruiz)derivative work: Radio89derivative work: Radio89 - This file was derived from Eukaryote DNA.svg:. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:Eukaryote_DNA-en.svg#/media/File:Eukaryote_DNA-en.svg

DNA

ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACC

Differences between species

Differences between people

https://www.broadinstitute.org/news/106

Less than 1 percent

Mutations as we age

https://en.wikipedia.org/wiki/Jeanne_Calment

Mutations caused by environment

Image: http://jjshaver.deviantart.com/art/Smoking-makes-you-look-cool-146734265

How mutations occur

SNP/mutation/variant
ACATTTGCTTCT
ACATCTGCTTCT

Insertion
ACAT-TTGCTTCT
ACATATTGCTTCT

Deletion
ACATTTGCTTCT
ACAT-TGCTTCT
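The three kinds of difference can be told apart by walking along two aligned sequences one column at a time. The snippet is a minimal sketch, assuming the top line of each slide example is the reference and the bottom line the sample, and that '-' marks a gap; the function name `classify_columns` is made up for this illustration.

```python
def classify_columns(reference: str, sample: str) -> list[tuple[int, str, str, str]]:
    """Compare two already-aligned sequences column by column.

    A '-' in the reference marks an insertion in the sample,
    a '-' in the sample marks a deletion, any other mismatch is a substitution.
    Returns (1-based column, kind, reference letter, sample letter).
    """
    if len(reference) != len(sample):
        raise ValueError("aligned sequences must have equal length")
    differences = []
    for column, (ref, alt) in enumerate(zip(reference, sample), start=1):
        if ref == alt:
            continue  # identical letters (or two gaps): nothing to report
        if ref == "-":
            kind = "insertion"
        elif alt == "-":
            kind = "deletion"
        else:
            kind = "substitution"
        differences.append((column, kind, ref, alt))
    return differences


# The three pairs from the slide.
print(classify_columns("ACATTTGCTTCT", "ACATCTGCTTCT"))
print(classify_columns("ACAT-TTGCTTCT", "ACATATTGCTTCT"))
print(classify_columns("ACATTTGCTTCT", "ACAT-TGCTTCT"))
```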

Labsession: Compare two DNA sequences

The game:
- Place as many matching letters as possible opposite each other
- Introduce mutations, insertions and deletions
Scoring scheme:
- Matching letters: +2
- Mismatching letters: -1
- Insertion or deletion: -1
Sum all matching letters, mutations and indels
Get the maximum score
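A computer can play this game with dynamic programming. The snippet is a sketch of global alignment scoring in the style of Needleman-Wunsch, using the scoring scheme from the slide (+2, -1, -1). It returns only the maximum score, not the alignment itself, and the function name `best_alignment_score` is an assumption made for this example.

```python
def best_alignment_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Maximum score over all ways to line up a and b end to end."""
    # previous[j] = best score for the letters of a seen so far against b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # a[:i] against nothing: i gaps
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair,  # put a[i-1] opposite b[j-1]
                previous[j] + gap,       # a[i-1] opposite a gap
                current[j - 1] + gap,    # b[j-1] opposite a gap
            ))
        previous = current
    return previous[-1]


# The three pairs from the mutation slide, with the gap characters removed.
print(best_alignment_score("ACATTTGCTTCT", "ACATCTGCTTCT"))
print(best_alignment_score("ACATTTGCTTCT", "ACATATTGCTTCT"))
print(best_alignment_score("ACATTTGCTTCT", "ACATTGCTTCT"))
```

Keeping only two rows of the score table is enough when just the score is needed; recovering the alignment itself would require keeping the whole table and tracing back through it.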

DNA sequencing
Examples

Detect viral DNA or RNA

Image: https://www.quia.com/jg/1272861list.html

Which gene(s) causes disease X?

http://www.accessexcellence.org/AE/AEPC/NIH/gene14.html

Study of families with a particular disease.
Some people are affected (in grey)
Search for mutations or genes which are involved (bioinformatics)
Estimate the chances that a particular gene is important (biostatistics)

Study migration

http://www.nature.com/nrg/journal/v16/n9/fig_tab/nrg3966_F5.html

Study ancient DNA

Metagenomics and microbial diversity

Study genomic content in a complex mixture of microorganisms (bacteria or viruses in some environment)
Identify new species

https://www.nih.gov/news-events/news-releases/nih-human-microbiome-project-defines-normal-bacterial-makeup-body
10x more microorganism cells than human cells
Only 1-3% of body mass

Plant genomes

DNA sequencing
Technique

https://en.wikipedia.org/wiki/DNA_sequencing

DNA structure

1953: Watson and Crick

Franklin

Wilkins

Manual DNA sequencing

1977: Sanger sequencing

Automated DNA sequencing

~400 sequences per run
Scale-up by using many DNA sequencers in parallel

Sequencing center at the Whitehead Institute

Next generation sequencing

Run: 24 hrs
Data: 0.7 GB

Run: 7-14 days
Data: 120 GB

Run: 3-10 days
Data: 600 GB

2005-now: Next generation sequencing
Millions to billions of sequences

DNA sequencing
Human genome projects

Human Genome Project

http://web.ornl.gov/sci/techresources/Human_Genome/index.shtml

1000 genomes project

http://www.1000genomes.org/

Genome of the Netherlands

http://www.nlgenome.nl/

4000 genomes
6000 exomes
http://www.uk10.org/

The 100K genomes project

The project will focus on patients with a rare disease and their families and patients with cancer. The first samples for sequencing are being taken from patients living in England with discussions taking place with Scotland, Wales and Northern Ireland about potential future involvement.

http://www.genomicsengland.co.uk/

Personal genomes

100,000 genomes plus medical records
http://www.personalgenomes.org/

Genome projects

Human genome project (1 individual)

Exome sequencing (~10 individuals)

Genome of the Netherlands (770 individuals)

1000 genomes project (1000 individuals)

10K UK project (10,000 individuals)
Upgraded to 100,000 genomes

Personal genomes project

Many centers have one or more high-throughput sequencers

http://omicsmaps.com/

Sign @ Wellcome-Sanger, Cambridge, UK

Data size challenges

COMPUTING
Sequencing rate grows faster than Moore's law

STORAGE
Sequencing costs fall faster than data storage costs

Stein, Genome Biology, 2010

Hayden, Nature, 2014

Analysis on PCs and small servers

Cluster and/or Cloud

Databases

Publicly available biological databases

Nucleotide sequence databases

International Nucleotide Sequence Database Collaboration

Daily exchange of sequence data

https://www.ncbi.nlm.nih.gov/
https://www.ebi.ac.uk/
http://www.ddbj.nig.ac.jp/

Nucleotide sequence databases

Image: http://www.davelunt.net/

GenBank

Release 200.0 (12 Feb 2014) has 171,123,749 non-WGS, non-CON records containing 157,943,793,171 base pairs of sequence data. In addition, there are 139,725,795 WGS records containing 591,378,698,544 base pairs of sequence data.
For downloading purposes, please keep in mind that the GenBank flatfiles are approximately 625 GB (sequence files only). The ASN.1 data are approximately 522 GB.

https://www.ncbi.nlm.nih.gov/genbank/statistics/

Labsession: databases

Go to: https://www.ncbi.nlm.nih.gov/genbank/

Search for: NM_000518

Take the first link, click on fasta

Copy/paste the record into Notepad or Word (the header line starting with > and the sequence)

Store it on your desktop as HBB.txt

Do the same for: M25113
Store this as sickle.txt
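The saved files are in FASTA format: a header line starting with '>' followed by one or more lines of sequence. The snippet is a minimal sketch of a reader for such text. The header used in the example is made up, and the letters are the example sequence from the DNA slide, not the downloaded record.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


# Hypothetical record for illustration only.
example = """>example record
ACATTTGCTTCTGACACAACTGTGTT
CACTAGCAACCTCAAACAGACACC
"""

for header, sequence in read_fasta(example).items():
    print(header, len(sequence), sequence[:12])
```

For the lab files, the same function could be fed the contents of HBB.txt and sickle.txt, assuming they were saved as plain text.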

DNA sequence analysis

DNA sequence comparison

Function prediction (similarity, sequence search)
Conservation (motifs, functional blocks)
Localisation (gene finding)
Grouping (genes, protein families)
SNPs and mutations (variations)

DNA sequence comparison

Pairwise alignment: inexact matching of 2 sequences

Multiple sequence alignment: inexact matching of >2 sequences

Database search with DNA sequence

Blast output - alignments

Score: how similar
Expect: could this hit occur by chance
Query: input sequence
Sbjct: database sequence
Numbers: from where to where are the sequences similar
Vertical bars: matching nucleotides
No vertical bar: indicates mismatches.
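The row of vertical bars is easy to reproduce for two aligned sequences. The snippet is a rough sketch that imitates the Query / bars / Sbjct layout; it is not the exact BLAST output format, and the function names are made up for this example.

```python
def match_line(query: str, subject: str) -> str:
    """One character per column: '|' where the letters match, ' ' otherwise."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if q == s and q != "-" else " "
        for q, s in zip(query, subject)
    )


def show_alignment(query: str, subject: str) -> str:
    """Text block with an identity count and the three alignment rows."""
    bars = match_line(query, subject)
    return "\n".join([
        f"Identities = {bars.count('|')}/{len(query)}",
        f"Query  {query}",
        f"       {bars}",
        f"Sbjct  {subject}",
    ])


print(show_alignment("ACATTTGCTTCT", "ACATCTGCTTCT"))
```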

Labsession: sequence alignment (1)

Go to: https://blast.ncbi.nlm.nih.gov/

Choose Nucleotide Blast

Tick the box Align two or more sequences

Copy/paste the HBB.txt sequence in the first box, and the sickle.txt sequence in the second box

Scroll down and click BLAST

Can you spot the differences between the healthy and ill person?

Labsession: sequence alignment (2)

Go to: https://blast.ncbi.nlm.nih.gov/

Choose Nucleotide Blast

Copy/paste the 'unknown' sequence in the box
(Sequence on meetup page)

Scroll down and click BLAST

Variation between people

http://www.nature.com/ng/journal/v43/n9/fig_tab/ng.894_F1.html
http://what-when-how.com/genomics/haplotype-mapping-genomics/

Disease causing variants

Structural variation

Not just mutations, insertions and deletions

Larger 'blocks' of DNA differ

http://www.nature.com/nmeth/journal/v9/n2/full/nmeth.1858.html

How to determine variants

1. Extract DNA & amplify to get enough for measurement

2. Sequence the DNA

3. Map DNA fragments on human reference genome

4. Determine variants compared to reference genome
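The mapping and variant steps can be imitated on a toy scale. The snippet is a deliberately simplified sketch: reads are placed without gaps at the position with the fewest mismatches, and a variant is reported wherever the most common letter among the covering reads differs from the reference. Real pipelines use indexed aligners and statistical callers. The reference is the start of the example sequence from the DNA slide, and the reads are made up for illustration.

```python
from collections import Counter


def best_position(read: str, reference: str) -> int:
    """Ungapped placement of the read with the fewest mismatches."""
    positions = range(len(reference) - len(read) + 1)
    return min(
        positions,
        key=lambda p: sum(base != reference[p + i] for i, base in enumerate(read)),
    )


def call_variants(reads: list[str], reference: str, min_depth: int = 2):
    """Return (1-based position, reference letter, observed letter, depth)."""
    pileup = [Counter() for _ in reference]
    for read in reads:
        start = best_position(read, reference)
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    variants = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        base, _ = counts.most_common(1)[0]
        if base != reference[position]:
            variants.append((position + 1, reference[position], base, depth))
    return variants


reference = "ACATTTGCTTCTGACACAACTGTGTTCACT"
# Made-up reads from a sample that carries a C instead of T at position 5.
reads = ["ACATCTGCTTCT", "ATCTGCTTCTGA", "CTGCTTCTGACA"]
print(call_variants(reads, reference))
```

The `min_depth` threshold is a crude stand-in for the coverage and error considerations that real variant callers model explicitly.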

What could possibly
go wrong?

Errors during the amplification step

Errors during the DNA sequencing process

Errors during mapping of the DNA fragments to the reference genome

Low genome coverage

Reference genome not complete

Etc, etc
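Genome coverage can be estimated before any analysis: the total number of sequenced letters divided by the genome size. The snippet is a back-of-the-envelope sketch. The read count, read length and genome size are assumed round numbers, not figures from this talk, and the uncovered fraction uses the simple Poisson approximation, which ignores real-world biases.

```python
import math


def mean_coverage(read_count: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each position."""
    return read_count * read_length / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson estimate of the fraction of positions hit by no read at all."""
    return math.exp(-coverage)


# Assumed numbers for illustration: 600 million reads of 150 letters,
# genome of roughly 3 billion letters.
coverage = mean_coverage(600_000_000, 150, 3_000_000_000)
print(f"mean coverage: {coverage:.0f}x")
print(f"uncovered at 1x: {expected_uncovered_fraction(1.0):.1%}")
```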

Have your own DNA sequenced

http://isogg.org/wiki/List_of_personal_genomics_companies

Whole genome: $1799-$5000

Whole exome (the protein coding part of the genome): $850-$1000

Mitochondrial DNA or Y-chromosome

Only variants: ~$200

http://www.wikihow.com/Extract-Your-DNA

.. and then compare it with other data

HapMap

1000 genomes and other genome projects

Known (disease) variants

Other animals

Family members

Labsession: what do your variants tell about you?

23andme dataset as example

Geographic location

Neanderthal DNA

Disease risks

Before you continue...
Try everything with a public dataset first!

Why?
1) First take an outsider's look at the data
2) Verify what will happen with your data when you send it to some website

Explore public data

https://my.pgp-hms.org/
Public data > Whole genome sequences and other data

Data type: 23andme (dropdown menu)

Download one of the datasets

Unzip the file

More about this project: http://personalgenomes.org/
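Once unzipped, the dataset is a plain text file that can be explored locally before it goes anywhere near a website. The snippet is a sketch that assumes the usual layout of such files: comment lines starting with '#', then tab-separated columns for rsid, chromosome, position and genotype, with '--' for positions that could not be read. The identifiers in the sample are made up; check the header of the downloaded file before relying on this layout.

```python
import csv
import io
from collections import Counter


def read_genotypes(lines):
    """Yield (rsid, chromosome, position, genotype) from tab-separated lines."""
    data_lines = (
        line for line in lines
        if line.strip() and not line.startswith("#")
    )
    for rsid, chromosome, position, genotype in csv.reader(data_lines, delimiter="\t"):
        yield rsid, chromosome, int(position), genotype


# Made-up sample in the assumed layout; a real file would be opened with open().
sample = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs_example_1\t1\t1000\tAA\n"
    "rs_example_2\t1\t2000\tAG\n"
    "rs_example_3\tMT\t30\tG\n"
    "rs_example_4\t2\t500\t--\n"
)

genotypes = list(read_genotypes(io.StringIO(sample)))
per_chromosome = Counter(chromosome for _, chromosome, _, _ in genotypes)
no_calls = sum(1 for *_, genotype in genotypes if genotype == "--")
print(len(genotypes), dict(per_chromosome), no_calls)
```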

Selection of DNA tools

Interpretome
http://esquilax.stanford.edu/

Ancestry: PCA and Painting

Promethease
http://snpedia.com/index.php/Promethease

Sample report: 23andme v4 (2014)

Codegen
https://codegen.eu/

Try the demo

More tools

http://www.23andyou.com/3rdparty

http://isogg.org/wiki/Autosomal_DNA_tools

Your DNA

What does 'risk' mean?
It is a risk (most of the time), not (always) a definite outcome

Consult a doctor

What is genetic, what is caused by environment?

How accurate is the underlying data?

Image from: http://newproductvisions.com/blog/?p=625

Your DNA

Sequencing is affordable

Please remember: it is identifiable data!

Overview

What is bioinformatics
DNA basics
DNA sequencing
(Large) DNA projects
Public databases
DNA sequence analysis
Your own DNA

TED talk tips

Svante Pääbo: DNA clues to our inner Neanderthal
https://www.ted.com/talks/svante_paeaebo_dna_clues_to_our_inner_neanderthal

Sebastian Kraves: The era of personal DNA testing is here
https://www.ted.com/talks/sebastian_kraves_the_era_of_personal_dna_testing_is_here

Jennifer Doudna: We can now edit our DNA, but let's do it wisely
https://www.ted.com/talks/jennifer_doudna_we_can_now_edit_our_dna_but_let_s_do_it_wisely

Ellen Jorgensen: What you need to know about CRISPR
https://www.ted.com/talks/ellen_jorgensen_what_you_need_to_know_about_crispr

Juan Enriquez: We can reprogram life. How to do it wisely
https://www.ted.com/talks/juan_enriquez_we_can_reprogram_life_how_to_do_it_wisely

Follow up questions

How related are sequences?

Tree of life (example: road trip Boston)

Cancer

Immunology

Mutating viruses

Roadtrip USA

Kosakovsky Pond S, Wadhawan S, Chiaromonte F, Ananda G, Chung WY, Taylor J, Nekrutenko A; Galaxy Team (2009) Windshield splatter analysis with the Galaxy metagenomic pipeline. Genome Research, 19(11), 2144-2153.

Dodge caravan

Duct tape on bumper

Route

Laboratory

DNA sequencing

Phylogenetic tree

Which insects and other species did they encounter and how much are they alike?

Phylogenetic tree

Count species

Bacterium       Route A   Route B   Ratio: B/A
Citrobacter         668       212   0.317
Cronobacter          43        22   0.512
Dickeya               4         1   0.25
Enterobacter       4142      5507   1.33
Enterovibrio          3         1   0.333
Erwinia               2       240   120
Escherichia         811       299   0.369
Francisella           1         1   1
Haemophilus           3         1   0.333
Halomonas            10         4   0.4
Klebsiella        15121      1695   0.112
Kluyvera             14         1   0.071
Marinobacter          3         4   1.333
Pantoea              32        14   0.438
Pectobacterium      122        59   0.484
Photorhabdus         57         1   0.018
Proteus              26         1   0.038
Providencia         122         3   0.025
Pseudomonas        1616       383   0.237

Counted the number of each species on the bumper for routes A and B

Determined the ratio of each species on route B compared to route A (B/A)
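The ratio column is plain division of the route B count by the route A count. The snippet recomputes it for four of the bacteria, as a sketch that assumes the run-together numbers on the slide split into Route A, Route B and ratio in that order (for example Citrobacter: 668, 212, 0.317). The variable names are chosen for this illustration.

```python
# (route A count, route B count), read off the slide under the stated assumption.
counts = {
    "Citrobacter": (668, 212),
    "Enterobacter": (4142, 5507),
    "Erwinia": (2, 240),
    "Klebsiella": (15121, 1695),
}


def ratio_b_over_a(route_a: int, route_b: int) -> float:
    """How many times more (or less) often a bacterium was seen on route B."""
    return route_b / route_a


ratios = {name: ratio_b_over_a(a, b) for name, (a, b) in counts.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14}{ratio:8.3f}")
```

A ratio above 1 means the bacterium was found more often on route B, below 1 more often on route A.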
