Bioinformatics and genome analysis (docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo...)
Background
• Bachelor's degree (Laurea Triennale): Biological Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;
• Master's degree (Laurea Specialistica): Biomolecular and Cellular Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;
• PhD in Genetics, University of Leicester, Prof. Mark A. Jobling;
• Post-doctoral fellow, EPHE-MNHN, Paris, Dr. Stefano Mona.
Cusco, March 2009
Muséum national d'Histoire naturelle ‐ Paris
Practical information
• Theory + practice;
• Software and tools;
• Files;
• Slides on the website;
• New topics / topics already covered.
Programme
• Next-generation sequencing (NGS): how, when, why?
• An example of NGS data management and analysis:
  • type of data;
  • files and formats;
  • programs;
  • interpretation of the results;
  • error estimation;
  • when to stop?
• Applications and/or projects on different organisms.
NGS: how, when, why?
[Figure: overview of NGS applications and analysis steps. Labels: DNA-seq; RNA-seq; whole genome; whole transcriptome; capture: exome/custom/cancer; amplicon sequencing; amplicon sequencing: fixed/custom; sequencing; unaligned reads QC; reads trimming; mapping to a reference genome; de novo assembly; mapping refinement; mapping QC; assembly QC; filtering; validation.]
Question: when?
Answer: when it makes sense!
• A 400 bp amplicon in 100 individuals? → Sanger sequencing
• 50 amplicons in 100 individuals? → NGS + target capture
• Gene conversion, repeated elements, recombination breakpoints? → NGS + Sanger sequencing
Question: why?
Answer: your idea for a project!
An example of NGS data management and analysis
Sequencing platforms:
• Nanopore MinION/GridION
• Pacific Biosciences (PacBio)
• Ion Torrent PGM/Proton
• Roche 454
• Illumina MiSeq/HiSeq
Planning the analysis:
• project
• project:application (whole genomes? exomes? target capture? amplicon sequencing?)
• project:application:aim (SNPs, indels, repeated elements, CNVs…)
• project:application:aim:coverage (SNPs, indels, repeated elements, CNVs…)
Project:
• Saccharomyces cerevisiae;
• Genome: 16 chromosomes, ~12.5Mb, ~6200 genes;
• Whole genome sequencing;
• Illumina platform;
• Paired‐end reads, 1 library, 2 lanes.
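As a rough guide when planning coverage for a project like this one, the expected average depth can be estimated as (number of reads × read length) / genome size. The sketch below applies that formula to the ~12.5 Mb yeast genome. The read length and number of reads are hypothetical values chosen only for illustration; they are not taken from the course data.

```python
import math


def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average coverage."""
    total_bases = target_coverage * genome_size
    return math.ceil(total_bases / read_length)


GENOME_SIZE = 12_500_000  # ~12.5 Mb, from the project description
READ_LENGTH = 100         # hypothetical read length (bp)
N_READS = 5_000_000       # hypothetical read count (both mates counted)

coverage = expected_coverage(N_READS, READ_LENGTH, GENOME_SIZE)
print(f"Expected average coverage: {coverage:.1f}x")
print("Reads needed for 30x:", reads_needed(30, READ_LENGTH, GENOME_SIZE))
```

This is only an average: real coverage varies along the genome, which is why the coverage distribution is checked again after alignment.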
Single-end (SE) or paired-end (PE) sequencing.
fragment              ========================================
fragment + adaptors   ~~~========================================~~~
SE read               --------->
PE reads              R1---------> <---------R2
unknown gap           ..................................................
Analysis pipeline (workflow diagram):
Input: raw reads (.fastq)
1. Fastq quality control + trimming: adapters? low-quality bases?
2. Alignment to a reference genome: close reference or limited time? → bwa; distant reference? → stampy. Output: aligned reads (.sam/.bam)
3. BAM refinement: duplicate removal (picard), local realignment (GATK), base recalibration (GATK). Output: aligned reads (.sam/.bam)
4. BAM check and visualization: duplicate metrics (picard), flagstat (samtools), coverage distribution (GATK)
5. Variant calling: SNPs/indels, single/multi-sample (samtools). Output: raw variants (.vcf)
6. Variant filtering and validation: in silico vs in vitro validation (vcftools); variant score recalibration (big datasets, known SNPs/indels). Output: ready-to-use variants (.vcf)
File formats:
• .fa/.fasta: sequences
• .fastq: read data
• .sam (.sai): mapped reads
• .bam (.bai): mapped reads (binary)
• .vcf: variant information
raw reads (.fastq)
@IL29_4505:7:24:8932:6562#2/1
TAACGGTGGGTGAGTGGTAGTAAGTAGAGGGATGGATGGTGGTTCGGAGTGGTATGGTTGAATGGGACAGGGTAACGAGTGGAGAGTAGGGTAATGGAGGGTAAGTTC
+
CDDCDDABBBABABABB@BCACBDABCBBAB@BBCABBBABB?CBCCABABBABBBBABA?ACBAAAAA?BB;BCAABA7AA?B?A??AAA>?A:AA?AA?%?AA@=9
To look at the file, open it in a text editor:
gedit s-6-1.fastq
or view it in the terminal:
more s-6-1.fastq
head s-6-1.fastq
raw reads (.fastq): annotated record
Each record has four lines: the header (starting with @), the read sequence, a separator line (+), and the quality values for each nucleotide (base quality scores).
Header: @IL29_4505:7:24:8932:6562#2/1
• IL29_4505: instrument ID;
• 7: lane;
• 24: tile;
• 8932:6562: coordinates of the cluster;
• #2: index number;
• /1: first mate in the pair (paired-end reads).
Quality values are encoded as ASCII characters, from ! (ASCII 33, lowest) to ~ (ASCII 126, highest):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Illumina 1.8+ uses Phred+33; raw reads typically have scores in the range (0, 41), i.e. characters ! to J.
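The header fields described above can also be extracted programmatically, which is handy for checking which lane a file comes from. The following is a minimal sketch that assumes the older Illumina read-name layout used by the example read (`@instrument:lane:tile:x:y#index/mate`); the regular expression and the function name `parse_header` are illustrative, not part of any tool used in the course.

```python
import re

# Layout assumed: @<instrument>:<lane>:<tile>:<x>:<y>#<index>/<mate>
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<mate>[12])$"
)


def parse_header(header: str) -> dict:
    """Split a FASTQ header of the assumed layout into named fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"Unrecognised header layout: {header!r}")
    fields = match.groupdict()
    # Convert the purely numeric fields to integers; index stays a string.
    for key in ("lane", "tile", "x", "y", "mate"):
        fields[key] = int(fields[key])
    return fields


fields = parse_header("@IL29_4505:7:24:8932:6562#2/1")
print(fields["instrument"], "lane", fields["lane"], "tile", fields["tile"])
```

Headers written by newer Illumina software follow a different layout, so this pattern would raise `ValueError` on them rather than return wrong fields.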
Phred-scale value: $Q = -10 \log_{10} P \;\rightarrow\; P = 10^{-Q/10}$

| Phred quality score (Q) | Probability of incorrect base call (P) | Base call accuracy |
|---|---|---|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10000 | 99.99% |
| 50 | 1 in 100000 | 99.999% |
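To make the encoding concrete, the sketch below decodes a quality string into Phred scores and converts a score into an error probability with P = 10^(-Q/10). It assumes a Phred+33 offset; the example read contains characters below ASCII 64, which is consistent with that offset. The function names are illustrative.

```python
def phred_scores(quality_string: str, offset: int = 33) -> list:
    """Convert each ASCII quality character into its Phred score."""
    return [ord(char) - offset for char in quality_string]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


quality = "CDDCDDABBB"  # first ten quality characters of the example read
scores = phred_scores(quality)
print(scores)

for q in (10, 20, 30, 40):
    print(f"Q{q}: P = {error_probability(q):g}")
```

With this offset the first bases of the example read have scores between 32 and 35, i.e. an error probability below 1 in 1000.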
raw reads (.fastq): exercise
• Move into the folder lane2;
• Open s-7-1.fastq in a text editor (gedit s-7-1.fastq) or in the terminal (more s-7-1.fastq or head s-7-1.fastq);
• Do s-6-1.fastq and s-7-1.fastq come from two different lanes?
1. Fastq quality control + trimming
FastQC: quality control of the raw data coming out of the sequencer
• Evaluation of the quality of the generated data;
• Basic summary statistics of the raw data;
• Several modules to evaluate different features (e.g. adapters, base quality, etc.);
• Feedback (green, orange, red): do not rely on it blindly; think about what it means!
Per base sequence quality: warning
What can we do to improve the quality at the end of the reads?
Read trimming: removal of low-quality bases at the 3' ends of the reads.
[Figure: FastQC per base sequence quality plots; the last position bins shown are 95-99 bp and 90-94 bp.]
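The idea of 3' quality trimming can be sketched in a few lines: starting from the end of the read, drop bases until one reaches the chosen quality threshold. This is a simplified illustration that assumes Phred+33 encoding and a hypothetical threshold of Q20; real trimming programs use more elaborate rules (for example sliding windows), so it should not be read as the behaviour of any specific tool.

```python
def trim_3prime(sequence: str, quality: str, min_q: int = 20, offset: int = 33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    # Walk backwards from the 3' end until a base passes the threshold.
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


seq = "ACGTACGTAC"   # hypothetical read
qual = "IIIIIII5#!"  # hypothetical qualities: Q40 x7, then Q20, Q2, Q0
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(trimmed_seq, trimmed_qual)  # ACGTACGT IIIIIII5
```

Only the last two bases are removed here: the base with Q20 meets the threshold, so trimming stops there.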
Per sequence quality score: pass
Sequence length: pass
Adapter removal
• FastQC flags shown on the first slide: failed, warning
• FastQC flag shown on the second slide: pass
Overrepresented sequences
Removal of overrepresented sequences (PCR primers).
FastQC references:
• Software website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Manual: https://insidedna.me/tool_page_assets/pdf_manual/fastqc.pdf
2‐ Alignment to a reference genome
Alignment: the process of determining the most likely location within the genome for the observed DNA read.
[Figure: raw reads, reference genome]
trade‐off: speed vs sensitivity. The higher the accuracy, the longer the alignment run.
two classes of methods:
Burrows‐Wheeler
• fast
• less robust at high divergence from the reference genome
• e.g. bwa
Hashing
• slow (needs more memory)
• robust at high divergence from the reference genome
• e.g. stampy
the shorter the read, the harder it is to find its location in the genome
large amounts of data: computationally challenging for memory and speed
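To make the hashing idea concrete, here is a minimal Python sketch, not the algorithm used by stampy or any other real aligner. It assumes a toy reference string and a k-mer size of 4, both made up for illustration: every k-mer of the reference is stored in a hash table, and each k-mer of a read then "votes" for the start position it implies. A read with a mismatch still collects votes from its unaffected k-mers, which hints at why hash-based seeding tolerates divergence at the cost of keeping the whole table in memory.

```python
from collections import Counter, defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_starts(read, index, k):
    """Count, for each implied read start, how many k-mers of the read support it."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return votes

reference = "ACGTTGCAAGGCTTACCGGA"   # toy reference, made up for this example
index = build_kmer_index(reference, k=4)

# An exact read and the same read carrying one mismatch (A -> G).
print(candidate_starts("GGCTTACC", index, k=4).most_common(1))
print(candidate_starts("GGCTTGCC", index, k=4).most_common(1))
```

In both cases the best-supported start is the true origin of the read; the mismatched read simply gets fewer votes.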
What if there are several possible places to align your sequencing read?
This may be due to:
• repeated elements in the genome
• low complexity sequences
• reference errors and gaps
MQ (mapping quality) is a Phred‐scaled score of the quality of the alignment:
• high MQ: a single match
• low MQ: the probability of mapping to a different location is high, but there are no perfect multiple matches
• MQ0: perfect multiple matches
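As a worked illustration of the Phred scale, the sketch below converts the probability that a read is placed at the wrong location into an MQ value, MQ = -10 log10(P_wrong), and back. The function names and the cap of 60 are assumptions chosen for this example; each aligner estimates the probability and caps the score in its own way.

```python
import math

def mapping_quality(p_wrong, cap=60):
    """Phred-scale the probability that the reported position is wrong.

    The cap is an assumption for this sketch; aligners choose their own maximum.
    """
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))

def error_probability(mq):
    """Inverse transformation: probability that an alignment with this MQ is wrong."""
    return 10 ** (-mq / 10)

# Two equally good locations: the reported one is wrong with probability ~0.5.
print(mapping_quality(0.5))
# One wrong placement expected in a thousand.
print(mapping_quality(0.001))
print(error_probability(20))
```

A read that matches several locations equally well has a wrong-placement probability close to 1, which is why its MQ approaches 0.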
[Figure sequence: a reference sequence carrying Element 1 and Element 2, reads from Sample_1 aligned to it, and copy‐number labels ("1 copy", "2 copies") for reference and sample]
• Perfect multiple matches → MQ0
• Not a perfect match → low MQ
• False heterozygous call; cluster of heterozygotes
• AluSg7 (label from a figure not recovered here)
2‐ Alignment to a reference genome: reference sequence
create the index of the reference genome (for bwa, samtools and picard)
bwa index: this is an FM‐index, specific to the algorithm behind this aligner
bwa index -a is Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa
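The idea behind an FM‐index can be sketched in a few lines of Python. This is a naive teaching version, not what bwa does internally: it builds the Burrows‐Wheeler transform by sorting all rotations (far too slow for a real genome) and counts occurrences of a pattern with backward search. The function names and the toy sequences are assumptions made for this example.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end of the text)."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def count_occurrences(text, pattern):
    """Count exact matches of pattern in text using backward search on the BWT."""
    last_col = bwt(text)
    first_col = sorted(last_col)
    # first_row[c]: index of the first sorted rotation that starts with character c
    first_row = {}
    for row, c in enumerate(first_col):
        first_row.setdefault(c, row)

    def occ(c, upto):
        # Number of times c appears in the first `upto` characters of the BWT.
        return last_col[:upto].count(c)

    lo, hi = 0, len(last_col)          # current range of matching rotations
    for c in reversed(pattern):        # extend the match one character to the left
        if c not in first_row:
            return 0
        lo = first_row[c] + occ(c, lo)
        hi = first_row[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(bwt("banana"))
print(count_occurrences("GATTACAGATTACA", "ATTA"))
```

A real index replaces the repeated `count` calls with precomputed tables, so that each step of the search takes constant time regardless of genome size.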
index .fai
samtools faidx Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa
The index file stores, for each sequence, its identifier, its length, the offset of the first sequence character in the file, the number of characters per line, and the number of bytes per line.
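To see what these five columns mean in practice, the following Python sketch derives them from a small in‐memory FASTA. It is an illustration only, not a replacement for samtools faidx: the function name and the two toy sequences are made up, and it assumes a well‐formed ASCII file in which all lines of a sequence except the last have the same length.

```python
def fai_records(fasta):
    """Return (name, length, offset, bases_per_line, bytes_per_line) per sequence.

    `fasta` is the raw content of a FASTA file as bytes.
    """
    records = []
    current = None
    offset = 0                                   # byte position of the current line
    for line in fasta.splitlines(keepends=True):
        if line.startswith(b">"):
            if current:
                records.append(tuple(current))
            name = line[1:].split()[0].decode()  # identifier = first word of the header
            # The sequence starts right after the header line.
            current = [name, 0, offset + len(line), 0, 0]
        elif current is not None and line.strip():
            bases = len(line.rstrip(b"\r\n"))
            if current[1] == 0:                  # first sequence line sets the line geometry
                current[3] = bases               # characters per line
                current[4] = len(line)           # bytes per line, newline included
            current[1] += bases
        offset += len(line)
    if current:
        records.append(tuple(current))
    return records

toy = b">chrI test\nACGTACGTAC\nACGTACGTAC\nACGT\n>chrII\nGGCC\n"
for record in fai_records(toy):
    print("\t".join(str(field) for field in record))
```

With the offset and the two line lengths, a tool can jump straight to any coordinate of a sequence without reading the file from the start.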
[Figure: example .fai index with the line length annotated (labels: "50 characters", "60 characters")]
create the dictionary of the reference genome (for samtools, gatk and picard)
dictionary .dict: list of the contigs included in the fasta file of the reference genome
java -jar picard.jar CreateSequenceDictionary REFERENCE=Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa OUTPUT=Saccharomyces_cerevisiae.EF4.68.dna.toplevel.dict
keep the index and dictionary files in the same directory as the reference file!
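A sequence dictionary is a SAM header in which each contig is described by one @SQ line. The Python sketch below builds such a line for a toy contig, using the standard SAM tags SN (name), LN (length), M5 (MD5 checksum of the upper‐case sequence without whitespace) and UR (location of the FASTA). It is only meant to show where the values come from; the function name, the contig and the path are hypothetical, and Picard remains the tool to use for real references.

```python
import hashlib

def sq_line(name, sequence, uri=None):
    """Build one @SQ header line for a contig (illustrative sketch)."""
    seq = "".join(sequence.split()).upper()      # drop line breaks, normalise case
    md5 = hashlib.md5(seq.encode("ascii")).hexdigest()
    fields = ["@SQ", f"SN:{name}", f"LN:{len(seq)}", f"M5:{md5}"]
    if uri:
        fields.append(f"UR:{uri}")
    return "\t".join(fields)

# Hypothetical contig and path, made up for this example.
print(sq_line("chrI", "acgt\nACGT", uri="file:/data/ref/genome.fa"))
```

The checksum lets downstream tools verify that a BAM file and a reference really describe the same sequence, even when the file names differ.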
[Figure: example .dict file with its fields annotated: sequence name, sequence length, path, MD5 checksum]
2‐ Alignment to a reference genome: mapping with bwa‐mem
Three different algorithms:
1. BWA‐backtrack: for Illumina reads up to 100 bp;
2. BWA‐SW: long read support, split alignment;
3. BWA‐MEM: long read support, split alignment, faster, more accurate.
Split read: [figure from Karakoc E et al., 2012]
• paired‐end alignment (lane1);
• it uses the reference genome (.fa) and the reads (.fastq) to create a SAM file;
• option to mark shorter split hits as secondary (not supplementary): -M.
bwa mem [options] [RefSeq] [lane1_fastq1] [lane1_fastq2] > lane1.sam
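When the same command has to be run for several lanes, it can help to assemble it programmatically. The Python sketch below only builds the argument list and prints it; it does not run the aligner. The file names are hypothetical, and the helper name is made up for this example. The two options used, -t (number of threads) and -M (mark shorter split hits as secondary), are bwa mem options.

```python
import shlex

def bwa_mem_command(reference, fastq1, fastq2, threads=1, mark_secondary=True):
    """Assemble a paired-end `bwa mem` call as an argument list."""
    cmd = ["bwa", "mem", "-t", str(threads)]
    if mark_secondary:
        cmd.append("-M")                 # shorter split hits flagged as secondary
    cmd += [reference, fastq1, fastq2]
    return cmd

# Hypothetical file names for one lane.
cmd = bwa_mem_command(
    "Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa",
    "lane1_1.fastq",
    "lane1_2.fastq",
    threads=4,
)
print(" ".join(shlex.quote(arg) for arg in cmd) + " > lane1.sam")

# To actually run it (requires bwa on the PATH and an indexed reference):
#   import subprocess
#   with open("lane1.sam", "w") as out:
#       subprocess.run(cmd, stdout=out, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, and the SAM output is captured by redirecting standard output, exactly as the `>` does on the command line.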