DNA analysis on your laptop: Spot the differences
Tech Tuesday, 11 October 2016
Barbera van Schaik [email protected]
https://ebbailey.wordpress.com/general-information/
Things I like
During the day I'm a bioinformatician
In my spare time I ...
Go to concerts and festivals
Cook (all cuisines)
Read (fantasy, popular science/philosophy, Dutch literature)
Make things (sewing, electronics, laser cutting, welding, 3d printing)
Look into self-hosted cloud services
Grow vegetables in my garden
Overview
What is bioinformatics
DNA basics
DNA sequencing
(Large) DNA projects
Public databases
DNA sequence analysis
Your own DNA
What is Bioinformatics?
Extraction of biological knowledge from complex data
How does molecule A interact with protein B?
A schematic visual model of the oxygen-binding process, showing all four monomers and hemes, and protein chains only as diagrammatic coils, to facilitate visualization into the molecule. (http://en.wikipedia.org/wiki/Hemoglobin)
What is bioinformatics?
Image: BII
Understand biological systems
Find interesting bits in (heaps of) complex data
Computer simulations/models to understand what happens
https://phet.colorado.edu/en/simulation/legacy/natural-selection
Bioinformatics tools
... one of the results *might* be a tool you can use
Image: CSI game
It never looks like this though
Image: Oblivion (Universal Pictures)
Or this
Image: Prometheus (Scott Free Productions)
Usually it looks more like this
DNA
https://en.wikipedia.org/wiki/DNA
"Eukaryote DNA-en" by Eukaryote_DNA.svg: Difference_DNA_RNA-EN.svg: Difference_DNA_RNA-DE.svg: Sponk (talk); translation: Sponk (talk); Chromosome.svg: derivative work: Tryphon (talk); Chromosome-upright.png: original version: Magnus Manske, this version with upright chromosome: User:Dietzel65; Animal_cell_structure_en.svg: LadyofHats (Mariana Ruiz); derivative work: Radio89. This file was derived from Eukaryote DNA.svg. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:Eukaryote_DNA-en.svg#/media/File:Eukaryote_DNA-en.svg
DNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACC
Differences between species
Differences between people
https://www.broadinstitute.org/news/106
Less than 1 percent
Mutations as we age
https://en.wikipedia.org/wiki/Jeanne_Calment
Mutations caused by environment
Image: http://jjshaver.deviantart.com/art/Smoking-makes-you-look-cool-146734265
How mutations occur
SNP/mutation/variant
ACATTTGCTTCT
ACATCTGCTTCT
Insertion
ACAT-TTGCTTCT
ACATATTGCTTCT
Deletion
ACATTTGCTTCT
ACAT-TGCTTCT
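Once two sequences are written underneath each other, the three kinds of difference can be told apart mechanically. A minimal Python sketch, assuming both strings are already aligned to the same length and use '-' for a gap; the function name classify_differences is made up for this illustration:

```python
def classify_differences(ref, alt):
    """Compare two aligned sequences of equal length, column by column."""
    if len(ref) != len(alt):
        raise ValueError("aligned sequences must have the same length")
    events = []
    for pos, (r, a) in enumerate(zip(ref, alt), start=1):
        if r == a:
            continue                      # identical column, nothing to report
        if r == "-":
            events.append((pos, "insertion", a))   # extra letter in the second sequence
        elif a == "-":
            events.append((pos, "deletion", r))    # letter missing in the second sequence
        else:
            events.append((pos, "SNP", r + ">" + a))
    return events


# The three sequence pairs from the slide
print(classify_differences("ACATTTGCTTCT", "ACATCTGCTTCT"))
print(classify_differences("ACAT-TTGCTTCT", "ACATATTGCTTCT"))
print(classify_differences("ACATTTGCTTCT", "ACAT-TGCTTCT"))
```

Positions are counted from 1 along the alignment, so a gap column also counts as a position.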
Labsession: Compare two DNA sequences
The game:
Place as many matching letters as possible opposite each other
Introduce mutations, insertions and deletions
Scoring scheme:
- Matching letters: +2
- Mismatching letters: -1
- Insertion or deletion: -1
Sum all matching letters, mutations and indels
Get the maximum score
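The game can also be played by a computer. The sketch below uses dynamic programming (the Needleman-Wunsch approach to global alignment), which is one standard way to find a maximum-scoring alignment; it is not necessarily how you would solve the game by hand. The scores are the ones from the scoring scheme; the function name align and its parameter names are assumptions.

```python
def align(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned seq_a, aligned seq_b) for a global alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: seq_a against gaps only
        score[i][0] = i * gap
    for j in range(1, cols):          # first row: seq_b against gaps only
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,   # letters opposite each other
                              score[i - 1][j] + gap,        # gap in seq_b
                              score[i][j - 1] + gap)        # gap in seq_a

    # Walk back from the bottom-right corner to recover one best alignment
    top, bottom = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                top.append(seq_a[i - 1])
                bottom.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq_a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, line_a, line_b = align("ACATTTGCTTCT", "ACATATTGCTTCT")
print(best)
print(line_a)
print(line_b)
```

When several alignments share the maximum score, this sketch returns just one of them.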
DNA sequencing
Examples
Detect viral DNA or RNA
Image: https://www.quia.com/jg/1272861list.html
Which gene(s) causes disease X?
http://www.accessexcellence.org/AE/AEPC/NIH/gene14.html
Study of families with a particular disease.
Some people are affected (in grey)
Search for mutations or genes which are involved (bioinformatics)
Estimate the chance that a particular gene is important (biostatistics)
Study migration
http://www.nature.com/nrg/journal/v16/n9/fig_tab/nrg3966_F5.html
Study ancient DNA
Metagenomics and microbial diversity
Study genomic content in a complex mixture of microorganisms (bacteria or viruses in some environment)
Identify new species
https://www.nih.gov/news-events/news-releases/nih-human-microbiome-project-defines-normal-bacterial-makeup-body
10x more microorganism cells than human cells
Only 1-3% of body mass
Plant genomes
DNA sequencing
Technique
https://en.wikipedia.org/wiki/DNA_sequencing
DNA structure
1953: Watson and Crick
Franklin
Wilkins
Manual DNA sequencing
1977: Sanger sequencing
Automated DNA sequencing
~400 sequences per run
Scale-up by using many DNA sequencers in parallel
Sequencing center at the Whitehead Institute
Next generation sequencing
Run: 24 hrs, Data: 0.7 GB
Run: 7-14 days, Data: 120 GB
Run: 3-10 days, Data: 600 GB
2005-now: Next generation sequencing
Millions to billions of sequences
DNA sequencing
Human genome projects
Human Genome Project
http://web.ornl.gov/sci/techresources/Human_Genome/index.shtml
1000 genomes project
http://www.1000genomes.org/
Genome of the Netherlands
http://www.nlgenome.nl/
4000 genomes
6000 exomes
http://www.uk10.org/
The 100K genomes project
The project will focus on patients with a rare disease and their families and patients with cancer. The first samples for sequencing are being taken from patients living in England with discussions taking place with Scotland, Wales and Northern Ireland about potential future involvement.
http://www.genomicsengland.co.uk/
Personal genomes
100,000 genomes plus medical records
http://www.personalgenomes.org/
Genome projects
Human genome project (1 individual)
Exome sequencing (~10 individuals)
Genome of the Netherlands (770 individuals)
1000 genomes project (1000 individuals)
10K UK project (10,000 individuals)
Upgraded to 100,000 genomes
Personal genomes project
many centers have one or more high throughput sequencers
http://omicsmaps.com/
Sign @ Wellcome-Sanger, Cambridge, UK
Data size challenges
COMPUTING: Sequencing output grows faster than Moore's law
STORAGE: Sequencing costs drop faster than data storage costs
Stein, Genome Biology, 2010
Hayden, Nature, 2014
Analysis on PCs and small servers
Cluster and/or Cloud
Databases
Publicly available biological databases
Nucleotide sequence databases
International Nucleotide Sequence Database Collaboration
Daily exchange of sequence data
https://www.ncbi.nlm.nih.gov/
https://www.ebi.ac.uk/
http://www.ddbj.nig.ac.jp/
Nucleotide sequence databases
Image: http://www.davelunt.net/
GenBank
Release 200.0 (12 Feb 2014) has 171,123,749 non-WGS, non-CON records containing 157,943,793,171 base pairs of sequence data. In addition, there are 139,725,795 WGS records containing 591,378,698,544 base pairs of sequence data.
For downloading purposes, please keep in mind that the GenBank flatfiles are approximately 625 GB (sequence files only). The ASN.1 data are approximately 522 GB.
https://www.ncbi.nlm.nih.gov/genbank/statistics/
Labsession: databases
Go to: https://www.ncbi.nlm.nih.gov/genbank/
Search for: NM_000518
Take the first link, click on fasta
Copy/paste the record into Notepad or Word: the header line starting with '>' and the sequence
Store it on your desktop as HBB.txt
Do the same for: M25113
Store this as sickle.txt
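The files saved in this lab session are in FASTA format: a header line starting with '>' followed by one or more lines of sequence. A minimal Python sketch for reading such text, assuming plain-text files; the function name parse_fasta and the example header are made up, and the example sequence is the DNA fragment shown earlier in these slides:

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip empty lines
        if line.startswith(">"):
            header = line[1:]              # start of a new record
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    # Sequence lines are wrapped in the file, so join them back together
    return {name: "".join(parts) for name, parts in records.items()}


example = """>example_record hypothetical header for illustration
ACATTTGCTTCTGACACAACTGTGT
TCACTAGCAACCTCAAACAGACACC
"""
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```

For the lab files the same function could be applied to the content of HBB.txt and sickle.txt, for example parse_fasta(open("HBB.txt").read()), provided they were saved as plain text rather than as a Word document.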
DNA sequence analysis
DNA sequence comparison
Function prediction (similarity, sequence search)
Conservation (motifs, functional blocks)
Localisation (gene finding)
Grouping (genes, protein families)
SNPs and mutations (variations)
DNA sequence comparison
Pairwise alignment: inexact matching of 2 sequences
Multiple sequence alignment: inexact matching of >2 sequences
Database search with DNA sequence
Blast output - alignments
Score: how similar
Expect: could this hit occur by chance
Query: input sequence
Sbjct: database sequence
Numbers: from where to where are the sequences similar
Vertical bars: matching nucleotides
No vertical bar: indicates mismatches.
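To get a feeling for the Query, Sbjct and vertical-bar lines, the sketch below draws a simplified imitation of such an alignment block in Python. It is not BLAST and it leaves out the Score and Expect values; the function names match_line and render are made up, and both input strings are assumed to be aligned already (same length, '-' for gaps).

```python
def match_line(query, sbjct):
    """One '|' under every column where both sequences have the same letter."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have the same length")
    return "".join("|" if q == s and q != "-" else " "
                   for q, s in zip(query, sbjct))


def render(query, sbjct):
    """Three text lines: query, vertical bars, subject, with start and end positions."""
    bars = match_line(query, sbjct)
    q_end = len(query) - query.count("-")   # gaps do not count as positions
    s_end = len(sbjct) - sbjct.count("-")
    prefix = "Query  1  "
    return "\n".join([
        prefix + query + "  " + str(q_end),
        " " * len(prefix) + bars,
        "Sbjct  1  " + sbjct + "  " + str(s_end),
    ])


print(render("ACATTTGCTTCT", "ACATCTGCTTCT"))
print("Identities:", match_line("ACATTTGCTTCT", "ACATCTGCTTCT").count("|"))
```

The column without a vertical bar is the mismatch.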
Labsession: sequence alignment (1)
Go to: https://blast.ncbi.nlm.nih.gov/
Choose Nucleotide Blast
Tick the box Align two or more sequences
Copy/paste the HBB.txt sequence in the first box, and the sickle.txt sequence in the second box
Scroll down and click BLAST
Can you spot the differences between the healthy and ill person?
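A single changed letter matters most when it changes the protein. DNA is read in groups of three letters (codons), each coding for one amino acid. The Python sketch below translates two short fragments and reports which codon differs. The two fragments are typed in by hand as an illustration of the start of the beta-globin coding region with the well-known sickle-cell change (GAG to GTG, glutamic acid to valine); treat them as an assumption and use the sequences you downloaded for real work. The function names are made up.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate DNA into one-letter amino acids, starting at the first letter."""
    usable = len(dna) - len(dna) % 3          # ignore an incomplete last codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def codon_changes(healthy, variant):
    """List (codon number, old codon, old amino acid, new codon, new amino acid)."""
    changes = []
    for i in range(0, min(len(healthy), len(variant)) - 2, 3):
        old, new = healthy[i:i + 3], variant[i:i + 3]
        if old != new:
            changes.append((i // 3 + 1, old, CODON_TABLE[old], new, CODON_TABLE[new]))
    return changes


healthy = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"   # illustrative fragment (assumption)
sickle = "ATGGTGCATCTGACTCCTGTGGAGAAGTCT"    # same fragment with one letter changed

print(translate(healthy))
print(translate(sickle))
print(codon_changes(healthy, sickle))
```

Codon numbers here count the start codon ATG as number 1; the literature usually numbers the mature protein, which shifts the position by one.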
Labsession: sequence alignment (2)
Go to: https://blast.ncbi.nlm.nih.gov/
Choose Nucleotide Blast
Copy/paste the 'unknown' sequence in the box (the sequence is on the meetup page)
Scroll down and click BLAST
Variation between people
http://www.nature.com/ng/journal/v43/n9/fig_tab/ng.894_F1.html
http://what-when-how.com/genomics/haplotype-mapping-genomics/
Disease causing variants
Structural variation
Not just mutations, insertions and deletions
Larger 'blocks' of DNA differ
http://www.nature.com/nmeth/journal/v9/n2/full/nmeth.1858.html
How to determine variants
1. Extract DNA & amplify to get enough for measurement
2. Sequence the DNA
3. Map DNA fragments on human reference genome
4. Determine variants compared to reference genome
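The mapping and variant steps can be mimicked on a toy scale. The Python sketch below is a deliberately naive illustration, not what real pipelines do (those use dedicated aligners and variant callers): it slides each read along a short reference, piles up the letters seen at every position, and reports positions where most reads disagree with the reference. All function names, the thresholds and the tiny data set are assumptions; the reference is the start of the DNA fragment shown earlier in these slides.

```python
from collections import Counter, defaultdict


def map_read(read, reference, max_mismatches=1):
    """Return the 0-based offset where the read fits best, or None if it fits nowhere."""
    best_offset, best_mismatches = None, max_mismatches + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset


def call_variants(reads, reference, min_depth=2):
    """Return (position, reference base, observed base, supporting reads, depth)."""
    pileup = defaultdict(Counter)              # position -> counts per base
    for read in reads:
        offset = map_read(read, reference)
        if offset is None:
            continue                           # read could not be placed
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1
    variants = []
    for pos in sorted(pileup):
        depth = sum(pileup[pos].values())
        base, count = pileup[pos].most_common(1)[0]
        # Call a variant only if enough reads cover it and a majority agrees
        if depth >= min_depth and base != reference[pos] and count > depth / 2:
            variants.append((pos + 1, reference[pos], base, count, depth))
    return variants


reference = "ACATTTGCTTCTGACACAACTGTGTTCACT"
sample = "ACATCTGCTTCTGACACAACTGTGTTCACT"      # differs from the reference at position 5
reads = [sample[start:start + 12] for start in (0, 2, 4, 12, 18)]
reads.append("ACTGTGTTCAGT")                   # a read with one sequencing error

print(call_variants(reads, reference))
```

The read with the sequencing error does not produce a variant, because only one of the two reads covering that position supports it.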
What could possibly go wrong?
Errors during the amplification step
Errors during the DNA sequencing process
Errors during mapping of the DNA fragments to the reference genome
Low genome coverage
Reference genome not complete
Etc, etc
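Genome coverage is simple arithmetic: the number of reads times the read length, divided by the genome size. The Python sketch below also estimates the fraction of bases that no read covers at all, using the common idealised assumption that reads land uniformly at random (a Poisson model); real data are less even. The genome size, read length and read counts are example values, not figures from these slides.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each base: c = N * L / G."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(coverage):
    """Fraction of bases expected to get no read, assuming randomly placed reads."""
    return math.exp(-coverage)


GENOME_SIZE = 3_000_000_000   # rough size of the human genome in bases (assumption)
READ_LENGTH = 150             # example read length (assumption)

for num_reads in (20_000_000, 200_000_000, 600_000_000):
    c = mean_coverage(num_reads, READ_LENGTH, GENOME_SIZE)
    missed = expected_uncovered_fraction(c)
    print(f"{num_reads:>11,} reads: {c:5.1f}x coverage, "
          f"{missed:.2%} of bases expected uncovered")
```

At low coverage a large part of the genome is simply not seen, and at positions that are seen only once a sequencing error cannot be told apart from a real variant.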
Have your own DNA sequenced
http://isogg.org/wiki/List_of_personal_genomics_companies
Whole genome: $1799-$5000
Whole exome (the protein coding part of the genome): $850-$1000
Mitochondrial DNA or Y-chromosome
Only variants: ~$200
http://www.wikihow.com/Extract-Your-DNA
.. and then compare it with other data
HapMap
1000 genomes and other genome projects
Known (disease) variants
Other animals
Family members
Labsession: what do your variants tell about you?
23andme dataset as example
Geographic location
Neanderthal DNA
Disease risks
Before you continue...
Try everything with a public dataset first!
Why?
1) First take an outsider's look at the data
2) Verify what will happen with your data when you send it to some website
Explore public data
https://my.pgp-hms.org/
Public data > Whole genome sequences and other data
Data type: 23andme (dropdown menu)
Download one of the datasets
Unzip the file
More about this project: http://personalgenomes.org/
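The unzipped 23andme file is plain text: comment lines starting with '#', followed by one line per variant with four tab-separated columns (rsid, chromosome, position, genotype). A minimal Python sketch for reading it, assuming that layout; the function name parse_23andme is made up and the three data rows are invented for illustration, not taken from a real person:

```python
def parse_23andme(text):
    """Return a dict mapping rsid to (chromosome, position, genotype)."""
    genotypes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                              # skip comments and empty lines
        rsid, chromosome, position, genotype = line.split("\t")[:4]
        genotypes[rsid] = (chromosome, int(position), genotype)
    return genotypes


# Hypothetical excerpt in the same layout (values are invented)
SAMPLE = (
    "# This is a made-up example\n"
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAG\n"
    "rs0000002\t1\t2000\t--\n"
    "rs0000003\tMT\t300\tC\n"
)

data = parse_23andme(SAMPLE)
no_calls = [rsid for rsid, (_, _, genotype) in data.items() if genotype == "--"]
print(len(data), "variants,", len(no_calls), "without a genotype call")
```

Reading the file locally like this is also a way to look at the data before sending it to any website.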
Selection of DNA tools
Interpretome
http://esquilax.stanford.edu/
Ancestry: PCA and Painting
Promethease
http://snpedia.com/index.php/Promethease
Sample report: 23andme v4 (2014)
Codegen
https://codegen.eu/
Try the demo
More tools
http://www.23andyou.com/3rdparty
http://isogg.org/wiki/Autosomal_DNA_tools
Your DNA
What does 'risk' mean?
It is a risk (most of the time), not (always) a definitive destination
Consult a doctor
What is genetic, what is caused by environment?
How accurate is the underlying data?
Image from: http://newproductvisions.com/blog/?p=625
Your DNA
Sequencing is affordable
Please remember: it is identifiable data!
Overview
What is bioinformatics
DNA basics
DNA sequencing
(Large) DNA projects
Public databases
DNA sequence analysis
Your own DNA
TED talk tips
Svante Pääbo: DNA clues to our inner Neanderthal
https://www.ted.com/talks/svante_paeaebo_dna_clues_to_our_inner_neanderthal
Sebastian Kraves: The era of personal DNA testing is here
https://www.ted.com/talks/sebastian_kraves_the_era_of_personal_dna_testing_is_here
Jennifer Doudna: We can now edit our DNA, but let's do it wisely
https://www.ted.com/talks/jennifer_doudna_we_can_now_edit_our_dna_but_let_s_do_it_wisely
Ellen Jorgensen: What you need to know about CRISPR
https://www.ted.com/talks/ellen_jorgensen_what_you_need_to_know_about_crispr
Juan Enriquez: We can reprogram life. How to do it wisely
https://www.ted.com/talks/juan_enriquez_we_can_reprogram_life_how_to_do_it_wisely
Follow up questions
How related are sequences?
Tree of life (example: road trip Boston)
Cancer
Immunology
Mutating viruses
Roadtrip USA
Kosakovsky Pond S, Wadhawan S, Chiaromonte F, Ananda G, Chung WY, Taylor J, Nekrutenko A; Galaxy Team (2009) Windshield splatter analysis with the Galaxy metagenomic pipeline. Genome Research, 19(11), 2144-2153.
Dodge Caravan
Duct tape on bumper
Route
Laboratory
DNA sequencing
Phylogenetic tree
Which insects and other species did they encounter and how much are they alike?
Phylogenetic tree
Count species
Bacterium        Route A   Route B   Ratio: B/A
Citrobacter          668       212        0.317
Cronobacter           43        22        0.512
Dickeya                4         1        0.25
Enterobacter        4142      5507        1.33
Enterovibrio           3         1        0.333
Erwinia                2       240        120
Escherichia          811       299        0.369
Francisella            1         1        1
Haemophilus            3         1        0.333
Halomonas             10         4        0.4
Klebsiella         15121      1695        0.112
Kluyvera              14         1        0.071
Marinobacter           3         4        1.333
Pantoea               32        14        0.438
Pectobacterium       122        59        0.484
Photorhabdus          57         1        0.018
Proteus               26         1        0.038
Providencia          122         3        0.025
Pseudomonas         1616       383        0.237
Counted the number of each species on the bumper for route A and B
Determined the ratio of each species on route B compared to route A (B/A)
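The ratio column is just the count for route B divided by the count for route A. A small Python sketch with a few rows from the table; the function name ratio_b_over_a is made up, and returning None for a zero count on route A is a choice made here, not something taken from the original analysis:

```python
def ratio_b_over_a(count_a, count_b):
    """Ratio of the count on route B to the count on route A."""
    if count_a == 0:
        return None            # ratio is undefined when nothing was seen on route A
    return count_b / count_a


# (route A, route B) counts for a few rows of the table
counts = {
    "Citrobacter": (668, 212),
    "Erwinia": (2, 240),
    "Francisella": (1, 1),
    "Klebsiella": (15121, 1695),
}

ratios = {name: ratio_b_over_a(a, b) for name, (a, b) in counts.items()}
for name in sorted(ratios, key=ratios.get, reverse=True):
    print(f"{name:<12} {ratios[name]:.3f}")
```

A ratio far above 1 (Erwinia) means much more was found on route B, a ratio far below 1 (Klebsiella) much more on route A. Ratios based on very small counts, such as 2 against 240, are sensitive to a single extra hit.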