Introduction to Next Generation Sequencing Analysis: Part I
Short Read Mapping and Visualization
Phillip Richmond (@Phil_A_Richmond)
November 23rd, 2016
Workshop outline
1. Introduction
2. Preparing your workshop directory
3. Short Read Mapping “Pipeline”
4. Learn how to use BWA and Samtools
5. Analyze example dataset
6. Visualize example dataset
7. Q & A, work individually on additional samples
Welcome!
● Welcome to the UBC Advanced Research Computing (ARC) Workshop!
● As the first session in what is hopefully a useful series, we are open to comments/critiques on what works/fails
● Info about ARC and who we are
  ○ https://arc.ubc.ca/
● Info about WestGrid for when you fall in love and want to pursue further analysis on the High Performance Compute (HPC) systems
  ○ https://www.westgrid.ca/
Learning Goals
● Learn to interact with WestGrid compute environment and queuing system
● Explore command-line usage of popular Bioinformatics Tools used in an abundance of applications
● Learn about file formats (Fastq, SAM, BAM, Fasta)
● Visualize mapped reads using Integrative Genomics Viewer (IGV)
● Gain confidence in the ability to analyze your own data!
Interactive Experience
We hope this is an interactive experience for all of you.
Questions/Problems can be posted to the group chat in Vidyo, or to this Google Doc:
https://docs.google.com/document/d/15nwI7Bl2Y1Miyk_yE4-WvduAkweYZ1-LEzjRw7p__JM/edit
We have 4 TAs to assist in answering questions and solving problems. At the end of the session, I can address unresolved questions.
Computing via servers
● User interacts with their own desktop
● Through a terminal, they can communicate with the head node
● The head node communicates with the execution nodes through the job scheduler
[Diagram: your terminal connects over ssh to the head node (orcinus.westgrid.ca), which passes job scripts to the job scheduler]
Short Read Sequencing
● Several genomics applications for short-read DNA sequencing and alignment
● Variant/Mutation calling
● Protein:DNA/RNA interactions
  ○ ChIP-seq, CLIP-seq
● 3-D Chromatin Organization
  ○ Capture Hi-C
● Regulatory Sequence Analysis
  ○ MPRA, STARR-seq, CRE-seq, CREST-Seq
● Transcriptional analysis
  ○ GRO-seq, RNA-seq, Ribo-seq, CAGE-seq, 3’-Seq
Figure source: http://bitesizebio.com/13546/sequencing-by-synthesis-explaining-the-illumina-sequencing-technology/
Let’s get started! Login to Orcinus
You should have already attempted this by now, but as a reminder:
1. Open up a terminal (PC: MobaXterm, PuTTY | Mac/Linux: Terminal)
2. Log in to Orcinus
$ ssh <username>@orcinus.westgrid.ca
NOTE: Whenever you see something written inside <>, replace it with the value that applies to you. Whenever a line starts with “$”, it is a command to type at the prompt (do not type the “$” itself).
Example:
$ ssh richmonp@orcinus.westgrid.ca
Orcinus Filesystem Organization
/
  global/
    scratch/
      ARC_Training/
        GENOME/  PROCESS/  Quiz/  RAW_DATA/  SCRIPTS/
    software/
  home/
    user01/  user02/  user.../  richmonp/
  tmp/
  (ignore the rest)
First logging in: Your home directory
Command (print working directory):
$ pwd
Let’s explore: /global/scratch/ARC_Training/
Command:
$ cd /global/scratch/ARC_Training/
Make yourself a “Workshop” directory inside of PROCESS/, title it: <LASTNAME>/
Example:
$ mkdir /global/scratch/ARC_Training/PROCESS/RICHMOND/
This creates RICHMOND/ inside PROCESS/.
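Here is a minimal sketch of the directory step that can be tried on any machine. It assumes a hypothetical `WORKSHOP_ROOT` variable (a temporary directory standing in for /global/scratch/ARC_Training/PROCESS) and a hypothetical `LASTNAME` value; on Orcinus you would point `WORKSHOP_ROOT` at the real path instead.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for /global/scratch/ARC_Training/PROCESS (assumption:
# a temporary directory is used so the sketch runs on any machine).
WORKSHOP_ROOT="$(mktemp -d)"
LASTNAME="SMITH"   # hypothetical; replace with your own last name

# -p creates missing parent directories and does not fail if the directory exists.
mkdir -p "${WORKSHOP_ROOT}/${LASTNAME}"

# Move into the new directory and confirm where we are.
cd "${WORKSHOP_ROOT}/${LASTNAME}"
workdir="$(pwd)"
echo "Working in: ${workdir}"
```

Keeping the path in a variable means the same lines work for every participant once `LASTNAME` is changed.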
Let’s copy some files into your Workshop Directory
Example:
$ cp /global/scratch/ARC_Training/RAW_DATA/NA20845* /global/scratch/ARC_Training/PROCESS/RICHMOND/
RICHMOND/ now contains:
NA20845.chr19.subregion_R1.fastq
NA20845.chr19.subregion_R2.fastq
What are these Files?
● These files come from the 1000 Genomes Project, and represent paired-end sequencing raw-data files
● Lots of data is available in this format through the Sequence Read Archive (SRA)
  ○ https://www.ncbi.nlm.nih.gov/sra
● Fastq (AKA: FastQ, fq) files contain raw sequence “reads”. For paired-end data, the two files are kept in the same order, so the Nth read in the _R1 file has its mate as the Nth read in the _R2 file
● You can look at the contents of the file using the head command:
$ head <filename>
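As a quick sanity check on a read pair, you can count the reads in each file. The sketch below uses two tiny made-up FASTQ files (not the workshop data) and a hypothetical helper named `count_reads`; it relies only on the fact that each FASTQ record takes 4 lines.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp="$(mktemp -d)"

# Two tiny made-up FASTQ files (2 reads each), standing in for the _R1/_R2 pair.
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n' > "${tmp}/toy_R1.fastq"
printf '@read1/2\nCGTA\n+\nIIII\n@read2/2\nGGCA\n+\nIIII\n' > "${tmp}/toy_R2.fastq"

# Each FASTQ record is 4 lines, so reads = lines / 4.
count_reads() {
    local lines
    lines="$(wc -l < "$1")"
    echo $(( lines / 4 ))
}

r1_reads="$(count_reads "${tmp}/toy_R1.fastq")"
r2_reads="$(count_reads "${tmp}/toy_R2.fastq")"
echo "R1 reads: ${r1_reads}, R2 reads: ${r2_reads}"

# Paired files should hold the same number of reads.
if [ "${r1_reads}" -eq "${r2_reads}" ]; then
    pairing="ok"
else
    pairing="mismatch"
fi
echo "Pairing check: ${pairing}"
```

Equal counts do not prove the mates are in the same order, but unequal counts are a clear sign that something went wrong during copying or download.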
FastQ file format
● File extension .fastq or .fq
Example:
@Read_identifier_and_flowcell_info
ACGTCCGGTTNNN…
+
B$!?NP\\\[%&C…

Line 1: Read name
Line 2: Sequence
Line 3: +
Line 4: Quality score

[Plot: quality score vs. probability of error]
https://en.wikipedia.org/wiki/FASTQ_format
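To make the link between a quality character and its error probability concrete, here is a small sketch. It assumes the Phred+33 encoding described on the Wikipedia page above (Q = ASCII code minus 33, P = 10^(-Q/10)); older data may use a different offset. The helper names `phred_score` and `error_prob` are made up for this example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Convert one FASTQ quality character to its Phred score.
# Assumption: Phred+33 encoding, i.e. Q = ASCII code - 33.
phred_score() {
    local ascii
    # printf with a leading single quote yields the character's numeric code.
    ascii="$(printf '%d' "'$1")"
    echo $(( ascii - 33 ))
}

# Error probability for a Phred score: P = 10^(-Q/10).
error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

for ch in '!' '+' '5' 'I'; do
    q="$(phred_score "$ch")"
    echo "char=${ch} Q=${q} P(error)=$(error_prob "$q")"
done
```

Under this encoding, `!` is the lowest possible score (Q=0) and `I` corresponds to Q=40, a 1 in 10,000 chance that the base call is wrong.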
Let’s also explore some human genome files
Files shown under RICHMOND/ (in /global/scratch/ARC_Training/PROCESS/):
genome.fa
genome.fa.amb
genome.fa.ann
genome.fa.bwt
genome.fa.pac
genome.fa.sa
genome.fa.fai
Example:
$ more genome.fa
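Paging through a whole genome with more gets long quickly, so it can help to list only the sequence names. This sketch uses a tiny made-up FASTA file in place of genome.fa; the file name `toy_genome.fa` and its contents are assumptions for illustration only.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp="$(mktemp -d)"

# A tiny made-up FASTA file standing in for genome.fa (real genomes are far larger).
printf '>chr19\nACGTACGTAC\nGGTTAACC\n>chrM\nTTTTAAAA\n' > "${tmp}/toy_genome.fa"

# Header lines start with ">"; list the sequence names without the marker.
seq_names="$(grep '^>' "${tmp}/toy_genome.fa" | sed 's/^>//')"
echo "Sequences:"
echo "${seq_names}"

# Count sequences, and count bases by dropping header lines and newlines.
n_seqs="$(grep -c '^>' "${tmp}/toy_genome.fa")"
n_bases="$(grep -v '^>' "${tmp}/toy_genome.fa" | tr -d '\n' | wc -c | tr -d ' ')"
echo "Number of sequences: ${n_seqs}"
echo "Total bases: ${n_bases}"
```

The same grep on the real genome.fa should print one name per chromosome or contig.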
Pipeline Overview
Read mapping:
  Sample.Reads1.fastq + Sample.Reads2.fastq (raw reads)
  genome.fa* (genome index: genome.fa.ann, genome.fa.amb, genome.fa.pac, genome.fa.bwt, genome.fa.sa)
  → BWA mem → Sample.sam
File format conversion:
  Sample.sam → samtools view → Sample.bam → samtools sort → Sample.sorted.bam → samtools index → Sample.sorted.bam.bai
Visualization:
  Sample.sorted.bam + Sample.sorted.bam.bai → IGV
First: Read mapping
Learning the bwa command
First we need to load the module that has the bwa command in it:
$ module load bio-tools
Next we will call the bwa mem command to see how it’s used:
$ bwa mem
Let’s break down this usage statement:
$ bwa mem [options] <idxbase> <in1.fq> [in2.fq]
[ ] is an optional argument
<> is required and is asking you to replace what’s inside with the appropriate value
Example:
$ bwa mem genome.fa Sample.Reads1.fastq Sample.Reads2.fastq > Sample.sam
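The `<idxbase>` argument only works if the index files from the pipeline overview sit next to genome.fa (in a standard BWA setup they are produced by `bwa index genome.fa`). A minimal sketch of a pre-flight check follows; the helper name `missing_index_files` is hypothetical, and the demo uses empty placeholder files rather than a real index.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Report which of the BWA index files listed on the slide are missing
# for a given <idxbase>. Prints nothing if all are present.
missing_index_files() {
    local idxbase="$1" ext
    for ext in amb ann bwt pac sa; do
        if [ ! -e "${idxbase}.${ext}" ]; then
            echo "${idxbase}.${ext}"
        fi
    done
}

# Demo in a temporary directory with empty placeholder files (not a real index).
tmp="$(mktemp -d)"
touch "${tmp}/genome.fa" "${tmp}/genome.fa.amb" "${tmp}/genome.fa.ann" "${tmp}/genome.fa.bwt"

missing="$(missing_index_files "${tmp}/genome.fa")"
if [ -z "${missing}" ]; then
    echo "Index looks complete; bwa mem can be run."
else
    echo "Missing index files:"
    echo "${missing}"
fi
```

In the demo the .pac and .sa files are deliberately left out, so the check reports them as missing.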
Next: File Format Conversion
Learning the samtools commands
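We will use 3 samtools operations: view, sort, and index (in that order)
Convert SAM to BAM:
$ samtools view -b <in.sam> -o <out.bam>
$ samtools view -b Sample1.sam -o Sample1.bam
Sort the BAM file:
$ samtools sort <in.bam> <out.sorted>
$ samtools sort Sample1.bam Sample1.sorted
NOTE: The second argument to sort is an output prefix, so this writes Sample1.sorted.bam. This prefix form was removed in samtools 1.3; there, the equivalent is: samtools sort -o <out.sorted.bam> <in.bam>
Index the sorted BAM file:
$ samtools index <in.sorted.bam>
$ samtools index Sample1.sorted.bam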
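Since every output name follows from the input name, the three steps can be generated from a single SAM file name. The sketch below only prints the commands (so it runs without samtools installed) and uses the sort syntax shown on the slide; `print_samtools_steps` is a hypothetical helper, not part of samtools.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print (not run) the three samtools commands for one SAM file, deriving
# every output name from the input name with ${var%suffix} expansion.
print_samtools_steps() {
    local sam="$1"
    local prefix="${sam%.sam}"          # Sample1.sam -> Sample1
    echo "samtools view -b ${sam} -o ${prefix}.bam"
    echo "samtools sort ${prefix}.bam ${prefix}.sorted"
    echo "samtools index ${prefix}.sorted.bam"
}

steps="$(print_samtools_steps Sample1.sam)"
echo "${steps}"
```

Printing first and running second is a cheap way to catch a mistyped file name before it reaches the queue.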
Let’s chat briefly about the queue
Interacting with the queue is done with a few commands:
Submit a queue script:
$ qsub <file.pbs>
Check the status of the queue
$ showq
$ qstat
Check the status of your jobs in the queue
$ showq -u <username>
$ showq -u richmonp
The .pbs queue script
● The best resource for understanding queue scripts is:
  ○ https://www.westgrid.ca/support/running_jobs
● Lucky for you, I’ve made a script with the bwa mem and samtools commands in it.
● Copy this script into your Workshop directory:
/global/scratch/ARC_Training/SCRIPTS/MapAndConvert.pbs
Example:
$ cp /global/scratch/ARC_Training/SCRIPTS/MapAndConvert.pbs /global/scratch/ARC_Training/PROCESS/RICHMOND/
Open MapAndConvert.pbs in emacs
#!/bin/bash
#PBS -S /bin/bash

## I want 4 processors
#PBS -l procs=4

## How much RAM does each processor need?
#PBS -l pmem=2000mb

## The maximum walltime that will be used for my job
#PBS -l walltime=00:15:00

## I want email sent when the job begins, ends and aborts (bea)
#PBS -m bea

## Where I want the email to be sent
#PBS -M <email_address>

Make sure you edit this to be your own email address (doesn’t have to be gmail)

$ emacs <filename>
$ emacs /global/scratch/ARC_Training/PROCESS/RICHMOND/MapAndConvert.pbs
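It can help to work out what the header is actually asking the scheduler for. The sketch below is plain arithmetic on the values from the #PBS lines, assuming (as the comment in the script says) that pmem is a per-processor request; the variable names are made up for this example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Values copied from the #PBS lines in the script header.
procs=4
pmem_mb=2000
walltime="00:15:00"

# pmem is requested per processor, so the job's total memory is procs * pmem.
total_mb=$(( procs * pmem_mb ))

# Convert HH:MM:SS walltime to whole minutes; 10# forces base 10 so "08" is not read as octal.
IFS=: read -r hh mm ss <<< "${walltime}"
walltime_min=$(( 10#${hh} * 60 + 10#${mm} + 10#${ss} / 60 ))

echo "Total memory requested: ${total_mb} MB"
echo "Walltime limit: ${walltime_min} minutes"
```

If a job needs more memory or time than this, the matching #PBS line is the place to change it.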
Edit MapAndConvert.pbs, change RICHMOND
## Load the module containing bwa and samtools
module load bio-tools

## Map with BWA
bwa mem -t 4 /global/scratch/ARC_Training/PROCESS/RICHMOND/genome.fa /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion_R1.fastq /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion_R2.fastq > /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sam

## Convert sam to bam using samtools view
samtools view -b /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sam -o /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.bam

## Sort the bam file using samtools sort
samtools sort /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.bam /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sorted

## Index the sorted bam
samtools index /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sorted.bam

Make sure you change all instances of RICHMOND to your own last name
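If you would rather not edit each path by hand, sed can swap every RICHMOND in one pass. This is a sketch under assumptions: it works on a two-line stand-in for MapAndConvert.pbs in a temporary directory, and `LASTNAME` is a hypothetical value you would replace with your own.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp="$(mktemp -d)"
LASTNAME="SMITH"   # hypothetical; use your own last name

# A two-line stand-in for MapAndConvert.pbs (the real script is longer).
cat > "${tmp}/MapAndConvert.pbs" <<'EOF'
samtools view -b /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sam -o /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.bam
samtools index /global/scratch/ARC_Training/PROCESS/RICHMOND/NA20845.chr19.subregion.sorted.bam
EOF

# Replace every occurrence (g flag) and write the result to a new file.
sed "s/RICHMOND/${LASTNAME}/g" "${tmp}/MapAndConvert.pbs" > "${tmp}/MapAndConvert.${LASTNAME}.pbs"

# Count lines that still mention RICHMOND; grep -c exits non-zero on 0 matches, hence "|| true".
remaining="$(grep -c 'RICHMOND' "${tmp}/MapAndConvert.${LASTNAME}.pbs" || true)"
replaced="$(grep -o "${LASTNAME}" "${tmp}/MapAndConvert.${LASTNAME}.pbs" | wc -l | tr -d ' ')"
echo "Lines still containing RICHMOND: ${remaining}"
echo "Occurrences of ${LASTNAME}: ${replaced}"
```

Writing to a new file keeps the original script untouched, so you can compare the two before submitting.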
Now we can run our job in the queue
Submit job using qsub
$ qsub <file.pbs>
$ qsub /global/scratch/ARC_Training/PROCESS/RICHMOND/MapAndConvert.pbs
Check job status using showq or qstat
$ qstat -u <username>
$ qstat -u richmonp
$ showq -u <username>
$ showq -u richmonp
The output SAM file
Header lines start with “@”:
@SQ - Sequence (contig/chromosome) from reference file
@PG - Program information about mapping
@RG - Read group information (we won’t have any here)
After the header, the file is tab-delimited and each line is one alignment record, typically one read. Pairs will be next to each other in the file (e.g. Line 1: Read1, Line 2: Read2).
https://samtools.github.io/hts-specs/SAMv1.pdf
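Because SAM is plain tab-delimited text, standard tools such as grep and awk can pull it apart. The sketch below builds a small made-up SAM file (two header lines and one read pair; the read names, positions, and sequences are invented) and separates header from alignments.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp="$(mktemp -d)"
sam="${tmp}/toy.sam"

# A made-up SAM file: two header lines, then one read pair (tab-delimited).
{
    printf '@SQ\tSN:chr19\tLN:59128983\n'
    printf '@PG\tID:bwa\tPN:bwa\n'
    printf 'read1\t99\tchr19\t1201960\t60\t4M\t=\t1202100\t144\tACGT\tIIII\n'
    printf 'read1\t147\tchr19\t1202100\t60\t4M\t=\t1201960\t-144\tTTGA\tIIII\n'
} > "${sam}"

# Header lines start with "@"; every other line is one alignment record.
n_header="$(grep -c '^@' "${sam}")"
n_reads="$(grep -vc '^@' "${sam}")"
echo "Header lines: ${n_header}, alignment lines: ${n_reads}"

# Columns 1-5 are read name, FLAG, reference name, position, and mapping quality.
awk -F '\t' '!/^@/ { print $1, "flag=" $2, $3 ":" $4, "MAPQ=" $5 }' "${sam}"

# Position of the first alignment in the file.
first_pos="$(awk -F '\t' '!/^@/ { print $4; exit }' "${sam}")"
echo "First alignment starts at: ${first_pos}"
```

Note how both lines of the pair share the read name in column 1, which is how mates are matched up.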
A BAM file is a binary version of the SAM file
We cannot look at these binary files the same way as we look at text files (samtools view can print them as text)
Downstream applications will almost always ask for a .bam file
Sorting by coordinate is required by many downstream applications, including samtools index
The index (.bai) will be required for IGV
Data visualization
Use FileZilla to transfer the sorted BAM file (.sorted.bam) and its index (.sorted.bam.bai) onto your own computer
Open up IGV, and load the file we just created
In the search box, type: chr19:1,201,956-1,242,206
[Screenshot: IGV window with the search box and zoom tool labelled]
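To get a feel for how large that window is, the region string can be taken apart with shell parameter expansion. This is a small sketch that treats both ends of the range as inclusive (an assumption about how the coordinates are counted).

```shell
#!/usr/bin/env bash
set -euo pipefail

region="chr19:1,201,956-1,242,206"

# Drop the thousands separators, then split "chrom:start-end" into its parts.
plain="${region//,/}"            # chr19:1201956-1242206
chrom="${plain%%:*}"             # chr19
coords="${plain#*:}"             # 1201956-1242206
start="${coords%-*}"
end="${coords#*-}"

# Assumption: both ends are part of the window, so the span is end - start + 1.
span=$(( end - start + 1 ))
echo "${chrom} from ${start} to ${end} spans ${span} bp"
```

So the search box entry covers roughly 40 kb of chromosome 19.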
Congratulations!
In the remaining time, please try to repeat what we just did with the other raw data files in /global/scratch/ARC_Training/RAW_DATA/
Also, you can try to use different options with bwa mem:
-k 25 (minimum seed length; default 19)
-B 10 (mismatch penalty; default 4)
-O 12,12 (gap open penalties for deletions and insertions; default 6,6)
Visualize multiple samples at the same time in IGV
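When repeating the mapping for several samples, a loop over the _R1 files saves retyping. The sketch below is a dry run under assumptions: it uses empty placeholder files in a temporary directory instead of RAW_DATA/, the second sample name (SampleB) is made up, and the bwa mem command is printed rather than executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for RAW_DATA/ with empty placeholder files; the second sample
# name is made up for illustration.
raw="$(mktemp -d)"
touch "${raw}/NA20845.chr19.subregion_R1.fastq" "${raw}/NA20845.chr19.subregion_R2.fastq"
touch "${raw}/SampleB.chr19.subregion_R1.fastq" "${raw}/SampleB.chr19.subregion_R2.fastq"

samples=""
for r1 in "${raw}"/*_R1.fastq; do
    r2="${r1%_R1.fastq}_R2.fastq"          # matching mate file
    sample="$(basename "${r1}" _R1.fastq)" # e.g. NA20845.chr19.subregion
    if [ ! -e "${r2}" ]; then
        echo "Skipping ${sample}: no _R2 file" >&2
        continue
    fi
    # Print the mapping command instead of running it.
    echo "bwa mem -t 4 genome.fa ${r1} ${r2} > ${sample}.sam"
    samples="${samples}${sample} "
done
echo "Samples found: ${samples}"
```

Once the printed commands look right, they can be pasted into a copy of the .pbs script, one sample per job.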
Thanks for participating!
We will contact you in a few days with the following:
1. A form for feedback on the course
2. The date of an “office hours” session in the next two weeks regarding this material
3. Information about how to get a full account on WestGrid for future analysis projects for those with temporary logins
Hope to see you all at the next workshop!