P. Tang ( 鄧致剛 ) ; PJ Huang ( 黄栢榕 ) Bioinformatics Center, Chang Gung University .

Databases and Tools for High Throughput Sequencing Analysis

HTseq Platforms

Applications on Biomedical Sciences

or transcriptome

Analysis Strategies: Reference Sequence Alignment (Mapping) vs De novo Assembly

HTseq Experiment

Great… I got my data, now what…

• Data and information management is slowly moving out of infancy in genomics science… it is at the toddler stage…

• The good news
– Some data formats are being widely accepted

• The bad news
– Still many competing standards in some areas
– Interoperability of data standards is almost non-existent
– Governance is questionable

Storage & Computing Power

Next-gen sequencers generate gigabases (Gbp) to terabases (Tbp) of data

Data Format Types

• Raw Sequence Data e.g. fasta

• Aligned data e.g. BAM

• Processed data e.g. BED

Interpreting raw data

How deep should we go?

(a) 80% of yeast genes (genome size: ~12 Mb) were detected at 4 million uniquely mapped RNA-Seq reads, and coverage reaches a plateau afterwards despite increasing sequencing depth. Expressed genes are defined as having at least four independent reads from a 50-bp window at the 3' end.

(b) The number of unique start sites detected starts to reach a plateau when the depth of sequencing reaches 80 million in two mouse transcriptomes. ES, embryonic stem cells; EB, embryoid body.

Nature Reviews Genetics 10, 57-63

Genome Size

De novo assembled rice transcriptome: 1.3 Gb of RNA-Seq data (genome size: ~400 Mb); 85% of assembled unigenes were covered by gene models
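
To put the depth figures on this slide in perspective, nominal fold coverage can be estimated as sequenced bases divided by target size. The sketch below is only a rough normalisation: the function name is hypothetical, and for a transcriptome the more meaningful target would be the total length of expressed transcripts, which the slide does not give, so the genome size is used as a stand-in.

```python
def fold_coverage(total_bases: float, target_size: float) -> float:
    """Nominal fold coverage: sequenced bases divided by target size (in bases)."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return total_bases / target_size


# Figures from the slide: 1.3 Gb of RNA-Seq data, ~400 Mb rice genome.
rice_depth = fold_coverage(1.3e9, 400e6)
print(f"{rice_depth:.2f}x")  # prints "3.25x"
```

Because expression levels vary widely between genes, a nominal depth like this says little about coverage of rare transcripts; the plateau curves in the previous slide are a better guide.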

HTseq Raw Data Format

• fasta (Sanger)
• csfasta (SOLiD)
• fastq (Solexa)
• sff (454)
• … and about 30 other file formats

• http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

SOLiD Color Space

(cs)Fasta/(cs)Fastq

• FASTA
– Header line starting with ">"
– Sequence

• FASTQ
– Adds quality values (QVs) encoded as single-byte ASCII codes

• Most aligners accept FASTA/FASTQ as input
• Issue: the data is voluminous (2 bytes per base for FASTQ)
• Do Phred-scaled values provide the most information?
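
The record layout described above can be illustrated with a minimal FASTQ reader. This is a sketch, not a production parser: it assumes the common unwrapped layout of exactly four lines per record (header, sequence, "+" separator, quality), and the function name is hypothetical. It also makes the "2 bytes per base" point concrete, since every base carries one sequence character and one quality character.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].strip(), seq, qual


# A single made-up record used only for illustration.
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
print(records)
```

Multi-line (wrapped) FASTQ and colour-space csfastq are not handled here; an established library would be the safer choice for real data.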

Fastq: Illumina & Sanger

Fastq: Illumina & NCBI

sff (text format): 454

454 fasta with quality file

454 base quality?

All Platforms Have Errors

Illumina, SOLiD/ABI-Life, Roche 454, Ion Torrent

1. Removal of low-quality bases and low-complexity regions
2. Removal of adaptor sequences
3. Homopolymer-associated base-call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts
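
Since homopolymers are defined here as runs of 3 or more identical bases, such regions can be located with a short regular expression. The sketch below only flags where the runs occur so that nearby indels can be treated with suspicion; it does not correct anything, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

# One base followed by at least two more copies of itself (run length >= 3).
HOMOPOLYMER = re.compile(r"([ACGTN])\1{2,}")


def homopolymer_runs(seq: str) -> List[Tuple[int, str, int]]:
    """Return (start, base, length) for each run of 3+ identical bases."""
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in HOMOPOLYMER.finditer(seq.upper())
    ]


runs = homopolymer_runs("ACGAAAATTTCG")
print(runs)  # prints [(3, 'A', 4), (7, 'T', 3)]
```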

Trace File

• High quality region - NO ambiguities (Ns)
• Medium quality region - SOME ambiguities (Ns)
• Poor quality region - LOW confidence

Quality Control Is Essential

Assessing Quality: Phred scores
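
For reference, a Phred score Q is related to the estimated base-call error probability P by Q = -10 · log10(P). The sketch below converts in both directions; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score implied by an error probability: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

So Q20 corresponds to about 1 error in 100 base calls and Q30 to about 1 in 1,000.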

454 output formats

• .sff (Standard Flowgram Format)
• .fna
• .qual

Illumina output formats

• .seq.txt
• .prb.txt
• Illumina FASTQ (ASCII - 64 is the Illumina score)
• Qseq (ASCII - 64 is the Phred score)
• SCARF (Solexa Compact ASCII Read Format): Illumina single-line format

Phred quality scores

Illumina FastQ

• ASCII value for h = 104
• Quality of base A at position 1 = 104 - 64
• 104 - 64 = 40
• Where 40 is the Phred score
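
The same subtraction can be applied to a whole quality string. The sketch below assumes Phred-scaled scores with an ASCII offset of 64 (the Illumina FASTQ convention described in these slides) and lets the caller pass 33 for Sanger-style FASTQ; the function name and the example strings are made up for illustration.

```python
from typing import List


def decode_qualities(quality_string: str, offset: int = 64) -> List[int]:
    """Convert each quality character to a score by subtracting the ASCII offset."""
    scores = [ord(ch) - offset for ch in quality_string]
    if any(score < 0 for score in scores):
        raise ValueError("negative score: the offset is probably wrong for this file")
    return scores


print(decode_qualities("gfeB"))          # prints [39, 38, 37, 2]
print(decode_qualities("I", offset=33))  # prints [40]
```

Choosing the wrong offset shifts every score by 31, so it is worth checking which encoding a file uses before filtering on quality.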

Quality Control

• Read quality distribution
• Library insert size
• Mapping rate
• Duplication assessment
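
As one illustration of duplication assessment, the fraction of reads that are exact copies of an earlier read can be computed directly from the sequences. This is a simplified proxy under stated assumptions: dedicated tools usually identify duplicates from mapping positions rather than exact sequence identity, and the function name and reads below are made up.

```python
from collections import Counter
from typing import Iterable


def duplication_rate(sequences: Iterable[str]) -> float:
    """Fraction of reads that duplicate an earlier read (exact sequence match)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total


reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]
# 5 reads, 3 distinct sequences, so 2 of 5 reads are duplicates.
rate = duplication_rate(reads)
print(f"{rate:.0%}")  # prints "40%"
```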

Quality Control Tools

NGS QC Toolkit & FastQC

NGS QC Toolkit performs quality checks and filters for high-quality reads

The toolkit is a standalone, open-source application, freely available at http://www.nipgr.res.in/ngsqctoolkit.html

The applications are implemented in the Perl programming language

It supports QC of sequencing data generated on the Roche 454 and Illumina platforms

Additional tools aid QC (sequence format converters and trimming tools) and analysis (statistics tools)

FastQC can be used only for preliminary analysis

http://www.ncbi.nlm.nih.gov/geo/

http://www.ncbi.nlm.nih.gov/gds/

• expression profiling by array
• expression profiling by genome tiling array
• expression profiling by high throughput sequencing
• expression profiling by mpss
• expression profiling by rt pcr
• expression profiling by sage
• expression profiling by snp array
• genome binding/occupancy profiling by array
• genome binding/occupancy profiling by genome tiling array
• genome binding/occupancy profiling by high throughput sequencing
• genome binding/occupancy profiling by snp array
• genome variation profiling by array
• genome variation profiling by genome tiling array
• genome variation profiling by high throughput sequencing
• genome variation profiling by snp array
• methylation profiling by array
• methylation profiling by genome tiling array
• methylation profiling by high throughput sequencing
• methylation profiling by snp array
• non coding rna profiling by array
• non coding rna profiling by genome tiling array
• non coding rna profiling by high throughput sequencing
• other
• protein profiling by mass spec
• protein profiling by protein array
• snp genotyping by snp array
• third party reanalysis

"Illumina Genome Analyzer" AND smallRNA

http://seqanswers.com/