P. Tang (鄧致剛); PJ Huang (黄栢榕), Bioinformatics Center, Chang Gung University.

Databases and Tools for High Throughput Sequencing Analysis

HTseq Platforms

Applications on Biomedical Sciences

or transcriptome

Analysis Strategies: Reference Sequence Alignment (Mapping) vs De novo Assembly

HTseq Experiment

• Data and information management in genomics is slowly moving out of infancy… it is now at the toddler stage

• The good news
  – Some data formats are being widely accepted

• The bad news
  – Still many competing standards in some areas
  – Interoperability of data standards is almost non-existent
  – Governance is questionable

Great… I got my data. Now what?

Storage & Computing Power

Next-gen sequencers generate giga-bp to tera-bp of data

Data Format Types

• Raw Sequence Data e.g. fasta

• Aligned data e.g. BAM

• Processed data e.g. BED

Interpreting raw data

How deep should we go?

(a) 80% of yeast genes (genome size: ~12 Mb) were detected at 4 million uniquely mapped RNA-Seq reads, and coverage reaches a plateau afterwards despite the increasing sequencing depth. Expressed genes are defined as having at least four independent reads from a 50-bp window at the 3' end.

(b) The number of unique start sites detected starts to reach a plateau when the depth of sequencing reaches 80 million reads in two mouse transcriptomes. ES, embryonic stem cells; EB, embryoid body.

Nature Reviews Genetics 10, 57-63

coverage

Genome Size
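The relation between sequencing depth, read length and genome size can be sketched with the standard average-coverage estimate (total sequenced bases divided by target size). This is a minimal illustration, not taken from the slides: the function names and the example numbers are hypothetical, and for RNA-Seq the effective target is the expressed transcriptome rather than the whole genome.

```python
import math


def mean_coverage(n_reads: int, read_length: int, target_size: int) -> float:
    """Average depth = total sequenced bases / size of the target in bases."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return n_reads * read_length / target_size


def reads_needed(target_coverage: float, read_length: int, target_size: int) -> int:
    """Number of reads required to reach a desired average depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * target_size / read_length)


# Hypothetical numbers: 1 million 100-bp reads on a 10 Mb target.
example_depth = mean_coverage(1_000_000, 100, 10_000_000)
# Reads needed for 30x depth on the same target.
example_reads = reads_needed(30, 100, 10_000_000)
```

Average depth says nothing about how evenly reads are distributed, which is why the plateau plots above are a better guide to "how deep" than the formula alone.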

De novo assembled rice transcriptome: 1.3 Gb of RNA-Seq data (genome size: ~400 Mb); 85% of assembled unigenes were covered by gene models

HTseq Raw Data Format

• fasta (Sanger)
• csfasta (SOLiD)
• fastq (Solexa)
• sff (454)
• … and about 30 other file formats

• http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

SOLiD Color Space

(cs)Fasta/(cs)Fastq

• FASTA
  – Header line “>”
  – Sequence

• FASTQ
  – Adds QVs (quality values) encoded as single-byte ASCII codes

• Most aligners accept FASTA/Q as input
• Issue: data is voluminous (2 bytes per base for FASTQ)
• Do PHRED scaled values provide the most information?
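The FASTQ layout described above (header, sequence, separator, one quality character per base) can be read with a few lines of code. The sketch below assumes the common four-lines-per-record layout with no line wrapping; `FastqRecord` and `read_fastq` are illustrative names, not part of any toolkit mentioned in these slides.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str  # one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped, four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], seq, qual)


# Tiny in-memory example (made-up read).
example_text = "@read1\nACGT\n+\nhhhh\n"
example_records = list(read_fastq(io.StringIO(example_text)))
```

Each base costs one byte for the call and one for the quality character, which is where the "2 bytes per base" figure comes from.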

Fastq: Illumina & Sanger

Fastq: Illumina & NCBI

sff (text format): 454

454 fasta with quality file

454 base quality?

Illumina, SOLiD/ABI-Life, Roche 454, Ion Torrent

1. Removal of low quality bases / low complexity regions
2. Removal of adaptor sequences
3. Homopolymer-associated base call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts
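The three clean-up concerns listed above can be sketched as small helper functions. This is a simplified illustration under stated assumptions: trimming looks only at the 3' end, adaptor removal uses exact string matching, and the adaptor string and quality threshold in the example are placeholders rather than values from any vendor protocol or from the tools named in these slides.

```python
import re
from typing import List, Tuple


def trim_low_quality_3prime(seq: str, quals: List[int], min_q: int = 20) -> Tuple[str, List[int]]:
    """Drop bases from the 3' end until one with quality >= min_q is reached."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def remove_adaptor(seq: str, adaptor: str) -> str:
    """Cut the read at the first exact occurrence of the adaptor, if any."""
    idx = seq.find(adaptor)
    return seq if idx == -1 else seq[:idx]


def homopolymer_runs(seq: str, min_len: int = 3) -> List[Tuple[int, str]]:
    """Return (start, run) for every run of min_len or more identical bases."""
    if min_len < 2:
        raise ValueError("min_len must be at least 2")
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group()) for m in pattern.finditer(seq)]


# Made-up inputs for illustration.
trimmed = trim_low_quality_3prime("ACGTAC", [30, 30, 30, 30, 10, 5])
clipped = remove_adaptor("ACGTAGATCGGAAGTT", "AGATCGGAAG")  # placeholder adaptor
runs = homopolymer_runs("ACGTTTTACAAAGG")
```

Real trimmers usually allow mismatches and partial adaptor matches at the read end; exact matching is used here only to keep the idea visible.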

All Platforms have Errors

High quality region - NO ambiguities (Ns)

Trace File

Medium quality region - SOME ambiguities (Ns)

Poor quality region - LOW confidence

Quality Control Is Essential

Assessing Quality: phred scores

454 output formats

Standard flowgram format

Illumina output formats

.seq.txt

.prb.txt

Illumina FASTQ (ASCII – 64 is Illumina score)

Qseq (ASCII – 64 is Phred score)

Illumina single line format: SCARF (Solexa Compact ASCII Read Format)

Phred quality scores

• ASCII value for h = 104
• Quality of base A at position 1 = 104 - 64
• 104 - 64 = 40
• Where 40 is the phred score
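The decoding step can be written out directly. The sketch below assumes the offset-64 encoding described on these slides, with the Sanger/NCBI offset of 33 as an alternative argument; it also uses the standard Phred definition Q = -10 log10(P) to turn a score into an error probability. The function names are illustrative.

```python
from typing import List


def decode_qualities(quality_string: str, offset: int = 64) -> List[int]:
    """Convert ASCII quality characters to integer scores (score = ASCII - offset)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(phred_score: float) -> float:
    """Invert Q = -10 * log10(P) to get the base-call error probability P."""
    return 10 ** (-phred_score / 10)


# 'h' has ASCII value 104, so with offset 64 it decodes to 40.
illumina_scores = decode_qualities("h", offset=64)
# The same score in Sanger-style encoding (offset 33) is the character 'I'.
sanger_scores = decode_qualities("I", offset=33)
p_error = error_probability(40)  # about 1 error in 10,000 base calls
```

Decoding with the wrong offset shifts every score by 31, so it is worth confirming which encoding a file uses before filtering on quality.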

Illumina FastQ

Quality Control

Read quality distribution

Library insert size

Mapping rate

Duplication assessment
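Two of these checks, read quality distribution and duplication assessment, can be approximated from the reads alone. The sketch below is a simplification: it reports mean quality per read position and treats reads with identical sequence as duplicates, whereas mapping-based tools usually define duplicates by alignment position. Function names and inputs are hypothetical.

```python
from typing import Iterable, List, Sequence


def mean_quality_per_position(quality_lists: Iterable[Sequence[int]]) -> List[float]:
    """Average quality score at each read position across all reads."""
    totals: List[int] = []
    counts: List[int] = []
    for quals in quality_lists:
        for i, q in enumerate(quals):
            if i == len(totals):  # first read long enough to reach this position
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def duplication_rate(sequences: Iterable[str]) -> float:
    """Fraction of reads that are exact copies of an earlier read."""
    seqs = list(sequences)
    if not seqs:
        return 0.0
    return 1 - len(set(seqs)) / len(seqs)


# Made-up data for illustration.
example_means = mean_quality_per_position([[40, 30], [20, 30, 10]])
example_dup = duplication_rate(["ACGT", "ACGT", "TTTT", "GGGG"])
```

Insert size and mapping rate need aligned data, so they are not covered by this read-level sketch.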

Quality Control Tools

NGS QC Toolkit & FastQC

NGS QC Toolkit is for quality checking and filtering to obtain high-quality reads

The toolkit is a standalone, open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html

The applications are implemented in the Perl programming language

It supports QC of sequencing data generated on Roche 454 and Illumina platforms

Additional tools aid QC (sequence format converters and trimming tools) and analysis (statistics tools)

FastQC can be used only for preliminary analysis

http://www.ncbi.nlm.nih.gov/geo/

http://www.ncbi.nlm.nih.gov/gds/

• expression profiling by array
• expression profiling by genome tiling array
• expression profiling by high throughput sequencing
• expression profiling by mpss
• expression profiling by rt pcr
• expression profiling by sage
• expression profiling by snp array
• genome binding/occupancy profiling by array
• genome binding/occupancy profiling by genome tiling array
• genome binding/occupancy profiling by high throughput sequencing
• genome binding/occupancy profiling by snp array
• genome variation profiling by array
• genome variation profiling by genome tiling array
• genome variation profiling by high throughput sequencing
• genome variation profiling by snp array
• methylation profiling by array
• methylation profiling by genome tiling array
• methylation profiling by high throughput sequencing
• methylation profiling by snp array
• non coding rna profiling by array
• non coding rna profiling by genome tiling array
• non coding rna profiling by high throughput sequencing
• other
• protein profiling by mass spec
• protein profiling by protein array
• snp genotyping by snp array
• third party reanalysis

"Illumina Genome Analyzer" AND smallRNA
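A query like the one above can also be submitted programmatically. The sketch below only builds a search URL for the NCBI E-utilities `esearch` endpoint against the GEO DataSets database (`db=gds`); it does not send the request. The helper name and the `retmax` default are illustrative choices, and NCBI's usage guidelines should be checked before running queries in bulk.

```python
from urllib.parse import parse_qs, urlencode, urlparse

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(term: str, retmax: int = 20) -> str:
    """Return an esearch URL that looks up `term` in GEO DataSets (db=gds)."""
    query = urlencode({"db": "gds", "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


# The search string shown on the slide, URL-encoded for the request.
example_url = build_geo_search_url('"Illumina Genome Analyzer" AND smallRNA')
# Decoding the query string recovers the original search term.
decoded_term = parse_qs(urlparse(example_url).query)["term"][0]
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the response is an XML list of matching record identifiers.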

http://seqanswers.com/
