Annotation - WDCM · 2019-02-14 · Contig: Sequences assembled from k-mers. Scaffold: linked...


Page 1:
Page 2:

• Sample collection

• DNA extraction

• Library construction

• Sequencing

• Assembly

• Annotation

• Database

• Open access

Global Centers BGI WDCM

Page 3:

Use a 1.5 mL Eppendorf tube (it can be sealed with microplate sealing film).

x Do not use single PCR tubes, 8-well strip PCR tubes, or a PCR plate closed with domed caps.

Page 4:

Place the Eppendorf tubes in a rigid, sealed container.

DNA samples precipitated with ethanol can be shipped at ambient temperature. Any other type of DNA sample should be shipped with enough dry ice.

Sample: clear and colorless. Colored samples will be held by customs.

Page 5:

One hard copy with sample. One soft copy by email.

Page 6:

Sequencing type | Volume (μL) | Total DNA amount (μg) | DNA concentration (ng/μL) | DNA library insert size | Sequencing time | Data analysis time
Fungi/Bacteria Whole Genome Sequencing | 15-100 | ≥1 | ≥12.5 | 270 bp | 20 days | 10 days
Fungi/Bacteria PacBio DNA Sequencing | 15-100 | ≥5 | ≥60 | 20 kb | 28 days | 10 days
Fungi/Bacteria Whole Genome Sequencing | 15-100 | ≥10 | ≥30 | PCR-free, 270 bp | 20 days | 10 days

[Figure: agarose gel electrophoresis image. Marker band sizes (bp): 100, 250, 500, 750, 1000, 2000 and 564, 2027, 2322, 4361, 23130]

Sample Test Method
① Method of concentration determination: ■ Qubit Fluorometer, □ NanoDrop, □ Microplate Reader
② Method of sample integrity test: ■ Agarose Gel Electrophoresis

Page 7:

• Sample Name*

• Species*

• No. of Tubes*

• Concentration (ng/μL)

• Volume (μL)

• Total Quantity (μg)

• One hard copy with sample. One soft copy by email.

• Shipping address
– China: China National GeneBank, Jinsha Road, Dapeng District, Shenzhen, China
– Other countries and regions: 1/F, 16th Dai Fu Street, Tai Po Industrial Estate, Tai Po, Hong Kong

Please remember to fill in the species name and the GC content!

Page 8:

• Data: HiSeq, PacBio
• Assembly: Draft Map, Fine Map, Completed map
• Genomics Component: Gene, Repeat Sequence, ncRNA, Gene Island, Prophage, CRISPR
• Gene Function
– General Gene Annotation: GO, KEGG, NR, Swiss-Prot, TrEMBL, COG
– Animal Pathogens: ARDB, VFDB, PHI, T3SS
– Plant Pathogens: PHI, CAZY, T3SS

• Illumina Data Summary

• HiSeq (270 bp library, PE151), 100X

Page 9:

8.5K -r-xr-xr-x 1 solexa solexa 431 Nov 15 17:59 181113_I12_V300008345_L1_HUMbakMAAAA-519.report

37K -r-xr-xr-x 1 solexa solexa 21K Nov 15 17:59 base.png

13K -r-xr-xr-x 1 solexa solexa 6.6K Nov 15 17:59 qual.png

8.5K -r-xr-xr-x 1 solexa solexa 1.1K Nov 15 17:59 report.htm

37K -r-xr-xr-x 1 solexa solexa 30K Nov 15 17:59 V300008345_L01_519_1.fq.fqStat.txt

3.2G -r-xr-xr-x 1 solexa solexa 3.2G Nov 15 17:59 V300008345_L01_519_1.fq.gz

37K -r-xr-xr-x 1 solexa solexa 30K Nov 15 17:59 V300008345_L01_519_2.fq.fqStat.txt

3.2G -r-xr-xr-x 1 solexa solexa 3.2G Nov 15 18:00 V300008345_L01_519_2.fq.gz

@V300008345L1C001R040000014/1
ATCTGCAAACCAAGTTCTTTCATTACCCGGTCAGTCTGTTTATTCTTTCGGAGATTTCCCAACAACCACATTCCCTCATCGGCAAATACATTCGACAGAC
+
FFF@FGFFFFFEFFFFFFFFFFFFGFFGGFF=BFFFGFFFFFFFF?FFGFF;GFFGGFFGFFFGFFGGFGFFFFGGFGFFFFFFFFFGFGFGGEFFFFFF

FASTQ files: each record has four lines.
1. Information about the read
2. Sequence
3. Separator
4. Phred quality scores encoded in ASCII format
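The four-line layout above can be read with a few lines of Python. This is only a sketch: the function name parse_fastq_record is made up for illustration, and the Phred+33 offset (quality = ASCII code minus 33) is an assumption, since the slides do not state which encoding is used. The example uses the first ten bases and quality characters of the read shown above.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into read ID, sequence and quality scores."""
    header, sequence, separator, quality = [ln.strip() for ln in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding (ASCII code minus 33 gives the quality score).
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


# First ten bases and quality characters of the read shown on the slide.
record = [
    "@V300008345L1C001R040000014/1",
    "ATCTGCAAAC",
    "+",
    "FFF@FGFFFF",
]
read_id, seq, quals = parse_fastq_record(record)
print(read_id, seq, quals)
```

Under the Phred+33 assumption, the character F corresponds to a quality score of 37 and @ to 31.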

Page 10:

Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error

probabilities P. For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called

incorrectly are 1 in 1000.

Phred Quality Score | Probability of incorrect base call | Base call accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10,000 | 99.99%
50 | 1 in 100,000 | 99.999%
60 | 1 in 1,000,000 | 99.9999%
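The logarithmic relation behind this table is the standard Phred definition Q = -10 · log10(P). A short sketch (the function names are illustrative) reproduces the rows:

```python
import math


def phred_to_error_probability(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for a base-calling error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50, 60):
    p = phred_to_error_probability(q)
    # Accuracy is simply 1 - P, printed as a percentage.
    print(q, p, f"{(1 - p) * 100:.4f}%")
```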

Page 11:

[Figure: read filtering with SOAPnuke_filter. Raw data contain good reads together with low quality reads, adapter sequences, host sequences, duplications and N bases; filtering leaves the clean data (good reads).]

Page 12:

Insert Size: The length of inserted fragment; Reads Length: Length of reads; Raw Data: The size of raw data; Adapter: The

proportion of Adapter; Duplication: The proportion of same reads; Total Reads: Total reads number; Filtered Reads: The

proportion of filtered reads; Low Quality Filtered Reads: The proportion of Low quality filtered reads; Clean Data: The size

of reads we delivered.

Sample Name | Insert Size (bp) | Reads Length (bp) | Raw Data (Mb) | Adapter (%) | Duplication (%) | Total Reads (#) | Filtered Reads (%) | Low Quality Filtered Reads (%) | Clean Data (Mb)
JCM.11219 | 270 | (150:150) | 1,308 | 0.65 | 12.96 | 8,723,498 | 14.46 | 0.82 | 1,119
JCM.12580 | 270 | (150:150) | 1,301 | 2.59 | 10.42 | 8,676,870 | 13.69 | 0.66 | 1,123
JCM.13583 | 270 | (150:150) | 1,301 | 0.51 | 12.88 | 8,676,864 | 14.32 | 0.89 | 1,115

Software for filtering : SOAPnuke_filter

Page 13:

The X-axis shows the positions of bases in read1 and

read2. When the base composition is balanced, the A

and T curves overlap and the G and C curves

overlap.

The X-axis shows the positions of bases in read1 and

read2, the Y-axis shows the quality value of each base.

Each point in the graph represents the base quality value

of the corresponding position in a certain read.

BGI_result/Separate/SampleName/1.Cleandata:

|-- SampleName.Illumina_Cleandata.xls [Statistics of Illumina filtering]

|-- SampleName.ISInserSize_Clean.1.fq.gz [Illumina reads 1 compressed file in FASTQ format]

|-- SampleName.ISInserSize_Clean.2.fq.gz [Illumina reads 2 compressed file in FASTQ format]

|-- SampleName.ISInserSize_Clean.base.png [Filtered Illumina reads GC distribution]

|-- SampleName.ISInserSize_Clean.qual.png [Filtered Illumina reads quality distribution]

|-- SampleName.ISInserSize_Raw.base.png [Raw Illumina reads GC distribution]

|-- SampleName.ISInserSize_Raw.qual.png [Raw Illumina reads quality distribution]

Page 14:

Read: Sequences acquired from the sequencer (Illumina).

K-mer: Continuous fragments of reads that are equal in length.

Contig: Sequences assembled from k-mers.

Scaffold: linked contigs with gaps (with the help of PE reads) .

Software for assembly: SOAPdenovo – V.2.04

SPAdes – V.3.12.0
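To make the k-mer definition concrete, the sketch below cuts reads into overlapping k-mers and counts how often each one occurs (its k-mer depth). The function names and the two toy reads are invented for illustration; this is not how SOAPdenovo or SPAdes implement it internally.

```python
from collections import Counter


def kmers(read, k):
    """All overlapping substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Depth of every k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Two overlapping toy reads (hypothetical data).
reads = ["ATCTGCA", "TCTGCAA"]
counts = count_kmers(reads, 5)
print(counts)
```

A read of length L yields L - k + 1 k-mers, so the two 7 bp reads give three 5-mers each, and the 5-mers shared by both reads reach a depth of 2.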

[Figure: reads → k-mers → contigs → scaffold (contigs linked across gaps, shown as NNNN)]

Page 15:

Interpreting the Statistics of Assembly

Seq Type: Assembled contig or scaffold;

Total Number: Number of contigs or scaffolds;

Total Length: The length of the whole assembled sequence.

N50: With the contigs sorted from longest to shortest, the N50 is the length of the contig at which the running total first reaches 50% of the total assembly length. In general, the larger the N50, the more contiguous the assembly.

Sample Name | Seq Type | Total Number (#) | Total Length (bp) | N50 Length (bp) | N90 Length (bp) | Max Length (bp) | Min Length (bp) | Gap Number (bp) | GC Content (%)
JCM.23203 | Scaffold | 53 | 7,526,366 | 483,364 | 141,092 | 894,858 | 506 | 249 | 53.5
JCM.23203 | Contig | 58 | 7,526,117 | 476,707 | 141,092 | 894,858 | 278 | - | 53.5
JCM.30071 | Scaffold | 31 | 4,377,410 | 442,210 | 124,953 | 1,045,258 | 652 | 250 | 36.94
JCM.30071 | Contig | 38 | 4,377,160 | 430,384 | 122,824 | 1,045,258 | 375 | - | 36.94

Page 16:

N50

The N50 statistic describes assembly quality in terms of contiguity. With the contigs sorted from longest to shortest, the N50 is the length of the contig at which the running total first reaches 50% of the total assembly length. It can be thought of as the point of half of the mass of the distribution.

Longer N50 → better assembly
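The N50 definition translates into a few lines of code. The sketch below is a generic implementation (the function name and the toy contig lengths are illustrative); the same function gives N90 when 90 is passed as the percentage.

```python
def n_statistic(lengths, percent=50):
    """Length of the contig at which the running total, longest first,
    first reaches `percent` % of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("no contig lengths given")


# Toy contig lengths in bp (hypothetical data), 300 bp in total.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n_statistic(contigs))      # N50
print(n_statistic(contigs, 90))  # N90
```

For the toy set, the two longest contigs (80 + 70 = 150 bp) already cover half of the 300 bp total, so the N50 is 70 bp.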

Page 17:

Output files:

BGI_result/Separate/SampleName/2.Assembly:

|-- SampleName.kmer.png [Figure of kmer analysis]

|-- SampleName.kmer.stat.xls [Statistics of kmer results]

|-- SampleName.Draft.Assembly.stat.xls [Statistics of assembly results]

|-- SampleName.Draft.genome.Contig.fasta [Assembly result of contigs]

|-- SampleName.Draft.genome.fasta [Assembly result of scaffolds]

|-- SampleName.Draft.genome.ncbi.agp [NCBI uploading file]

|-- SampleName.genome.gb [Genome information in GenBank format]

|-- SampleName.genome.tbl [Genome information in tbl format]

|-- SampleName.coverage_depth.table.xls [Statistics of coverage rate]

|-- SampleName.GC-depth.png [Figure of GC and depth]

Page 18:

Pictures are from lecture slide in “De Novo whole genome assembly”, Lecture 1, Qi Sun, Bioinformatics Facility, Cornell University

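2. K should always be odd, so that no k-mer is identical to its own reverse complement (palindromic sequence):

Given the sequence GCCG, the 2-mers are GC, CC, CG and their reverse complements are GC, GG, CG, so GC and CG are their own reverse complements. The 3-mers are GCC, CCG and their reverse complements are GGC, CGG, so no 3-mer equals its own reverse complement.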
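The GCCG example can be checked exhaustively. The sketch below (function names are illustrative) lists every k-mer that equals its own reverse complement: such k-mers exist for even k but not for odd k, because the middle base of an odd-length k-mer would have to be its own complement.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def self_complementary_kmers(k):
    """All k-mers over ACGT that are identical to their reverse complement."""
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in all_kmers if kmer == reverse_complement(kmer)]


print(reverse_complement("GCC"))    # reverse complement of a 3-mer from GCCG
print(self_complementary_kmers(2))  # palindromic 2-mers
print(self_complementary_kmers(3))  # empty for odd k
```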

Page 19:

What can we learn from the frequency distribution of k-mers?

1. Whether the sequencing process is random (the distribution follows a Poisson distribution if it is)

2. Whether the sequenced genome is contaminated by genomes from other species (by observing the peaks)

3. The estimated size of the sequenced genome (from k-mer depth and k-mer number)
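For point 3, a commonly used estimate is genome size ≈ total k-mer number / peak k-mer depth. The sketch below assumes a histogram that maps each depth to the number of distinct k-mers seen at that depth; the histogram values, the function names and the min_depth cutoff (used here to skip low-depth k-mers that mostly come from sequencing errors) are all assumptions for illustration, not values from the slides.

```python
def peak_depth(histogram, min_depth=2):
    """Depth with the most distinct k-mers, ignoring depths below min_depth."""
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    return max(candidates, key=candidates.get)


def estimate_genome_size(histogram, min_depth=2):
    """Total k-mer number divided by the peak k-mer depth."""
    total_kmers = sum(d * n for d, n in histogram.items() if d >= min_depth)
    return total_kmers / peak_depth(histogram, min_depth)


# Hypothetical histogram: depth -> number of distinct k-mers at that depth.
histogram = {1: 500, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
print(peak_depth(histogram))
print(estimate_genome_size(histogram))
```

A second, separate peak in such a histogram is the kind of signal point 2 refers to.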

Page 20:

Kmer depth = 3 Kmer depth = 2

Page 21:

Note that k-mers are effectively shorter reads.

[Figure: reads split into k-mers]

Page 22:

What can we learn from the GC-Depth graph?

1. Whether the sequencing is random: the sequencing depth distribution should be similar to the k-mer depth distribution (Poisson distribution) if the sequencing is random.

2. Whether the genome is contaminated: there should be only one peak on the Y axis. Two peaks suggest that the sequenced genome is contaminated.

If contamination is found, identify the contaminating genome by aligning the reads against the NT (Nucleotide Sequence Database) database.

Page 23:

Unusual GC content (GC% > 65% or GC% < 35%)

The PCR-free method can be used to improve the assembly.

Sample Name | Seq Type | Total Number (#) | Total Length (bp) | N50 Length (bp) | N90 Length (bp) | Max Length (bp) | Min Length (bp) | Gap Number (bp) | GC Content (%)
KCTC22558 (without PCR-free) | Scaffold | 71 | 3,043,249 | 99,202 | 22,419 | 262,339 | 584 | 31 | 69.6
KCTC22558 (without PCR-free) | Contig | 72 | 3,043,218 | 99,202 | 22,419 | 262,339 | 584 | - | 69.6
KCTC22558 (with PCR-free) | Scaffold | 31 | 3,059,742 | 452,204 | 290,547 | 888,648 | 505 | 148 | 69.40
KCTC22558 (with PCR-free) | Contig | 36 | 3,059,594 | 403,061 | 290,547 | 514,528 | 276 | - | 69.40

[Figure: two panels, "With PCR-free" and "Without PCR-free"]

Page 24:

Contamination

Sample Name | Seq Type | Total Number (#) | Total Length (bp) | N50 Length (bp) | N90 Length (bp) | Max Length (bp) | Min Length (bp) | Gap Number (bp) | GC Content (%)
TBRC2205 | Scaffold | 4,695 | 7,313,005 | 2,051 | 699 | 61,951 | 500 | 12,036 | 72.28
TBRC2205 | Contig | 5,479 | 7,300,969 | 1,920 | 585 | 61,951 | 233 | - | 72.28

Possible contaminating species were predicted after aligning against the Nt database

TaxID | Organism | Cover_Len(bp) | Scaffolds_Len | Coverage(%) | Genomics(%)
2024580 | Plantactinospora sp. | 2350415 | 3106359 | 75.66 | 42.48
261654 | Micromonospora auratinigra | 133798 | 224639 | 59.56 | 3.07
743718 | Isoptericola variabilis | 115063 | 202306 | 56.88 | 2.77
307121 | Micromonospora krabiensis | 104234 | 205038 | 50.84 | 2.8
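The two percentage columns are not defined on the slide, but the numbers are consistent with Coverage(%) = Cover_Len / Scaffolds_Len and Genomics(%) = Scaffolds_Len / total scaffold length (7,313,005 bp for TBRC2205). The sketch below checks that reading against the table; the interpretation is inferred from the values and the function name is illustrative.

```python
def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


assembly_length = 7_313_005  # total scaffold length of TBRC2205 (bp)

# (organism, Cover_Len, Scaffolds_Len) taken from the table above.
hits = [
    ("Plantactinospora sp.", 2_350_415, 3_106_359),
    ("Micromonospora auratinigra", 133_798, 224_639),
    ("Isoptericola variabilis", 115_063, 202_306),
    ("Micromonospora krabiensis", 104_234, 205_038),
]

for organism, cover_len, scaffolds_len in hits:
    coverage = percent(cover_len, scaffolds_len)       # Coverage(%)
    genomics = percent(scaffolds_len, assembly_length)  # Genomics(%)
    print(organism, coverage, genomics)
```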

Page 25:

Statistic of gene prediction

Total Number: The count of genes; Total Length: Total length of all genes; Average Length: Average length

of all genes; GC Content: The content of G and C in gene; Length/Genome Length: The proportion of gene

length in genome.

Sample Name | Genome Size (bp) | Total Number (#) | Total Length (bp) | Average Length (bp) | Length / Genome Length (%) | GC Content (%)
JCM.11219 | 2,439,499 | 2,689 | 2,146,311 | 798.18 | 87.98 | 46.05
JCM.12580 | 3,632,762 | 3,696 | 3,028,884 | 819.5 | 83.38 | 42.75
JCM.13583 | 1,753,463 | 1,678 | 1,527,282 | 910.18 | 87.1 | 56.87
JCM.14719 | 2,406,988 | 2,473 | 2,075,655 | 839.33 | 86.23 | 65.12

Gene prediction software: Glimmer 3.02
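The derived columns follow directly from the definitions above: Average Length is Total Length divided by Total Number, and Length / Genome Length is Total Length divided by Genome Size. A small sketch (the function name is illustrative) reproduces the JCM.11219 row:

```python
def gene_statistics(genome_size, gene_count, total_gene_length):
    """Average gene length (bp) and share of the genome covered by genes (%)."""
    average_length = round(total_gene_length / gene_count, 2)
    coding_percent = round(100 * total_gene_length / genome_size, 2)
    return average_length, coding_percent


# JCM.11219: genome size, number of genes, total gene length (from the table).
average_length, coding_percent = gene_statistics(2_439_499, 2_689, 2_146_311)
print(average_length, coding_percent)
```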

Page 26:

Interpreting the output files from gene prediction

BGI_result/Separate/SampleName/3.Genome_Component/Gene_Predict:

|-- SampleName.Gene.cds.fasta [Predicted genes in CDS format]

|-- SampleName.Gene.pep.fasta [Protein sequences of predicted genes]

|-- SampleName.Gene.gff [Predicted genes in GFF3 format]

|-- SampleName.Gene.cds.png [Figure of Gene length distribution]

|-- SampleName.Gene.stat.xls [Statistics of predicted genes]

[Figures: example contents of JCM.12580.Gene.cds.fasta, JCM.12580.Gene.pep.fasta and SampleName.Gene.gff]

Page 27:

Statistic table of ncRNA prediction

Type: The type of ncRNA;

% in genome: The proportion of length of ncRNA in genome.

Sample Name | Type | Copy Number (#) | Average Length (bp) | Total Length (bp) | % in Genome
JCM.11219 | tRNA | 34 | 85 | 2,890 | 0.1185
JCM.11219 | 5s_rRNA (Denovo) | 1 | 117 | 117 | 0.0047
JCM.11219 | 16s_rRNA (Denovo) | 1 | 1,508 | 1,508 | 0.0618
JCM.11219 | 23s_rRNA (Denovo) | 1 | 3,066 | 3,066 | 0.1256
JCM.11219 | sRNA | 1 | 24 | 24 | 0.001

Page 28:

Tools for ncRNA prediction

rRNA: RNAmmer 1.2. The RNAmmer 1.2 server predicts 5s/8s, 16s/18s, and 23s/28s ribosomal RNA in full genome sequences.

sRNA: Infernal, blastn, Rfam. blastn is used to align the assembled genome against the Rfam database in order to find possible sRNAs.

tRNA: tRNAscan-SE, which searches for tRNA genes in genomic sequences.

Output files from ncRNA prediction

BGI_result/Separate/SampleName/3.Genome_Component/ncRNA_finding:

|-- SampleName.ncRNA.stat.xls [Statistics of ncRNA prediction]

|-- SampleName.denovo.rRNA.fasta [RNAmmer prediction result]

|-- SampleName.denovo.rRNA.gff [RNAmmer prediction result in GFF3 format]

|-- SampleName.sRNA.cmsearch.confident.gff [Filtered result of sRNA prediction]

|-- SampleName.sRNA.cmsearch.confident.nr.gff [Final result of sRNA after duplication removal]

|-- SampleName.tRNA.gff [tRNA prediction result in GFF3 format]

|-- SampleName.tRNA.structure [tRNA secondary structure file]

|-- SampleName.tRNA.xls [tRNA prediction result]

JCM.12580.denovo.rRNA.fasta

JCM.12580.denovo.rRNA.gff

Page 29:

A tandem repeat (TR) is a sequence that contains more than two adjacent repeat units. The length of the repeat unit ranges from 1 bp to 500 bp. Tandem repeats are species-specific, which makes them useful in evolutionary research.

Minisatellite DNA, also called variable number tandem repeats, is a kind of short repeated DNA sequence with a repeat unit of 15-65 bp.

Microsatellite DNA, also called short tandem repeats or simple tandem repeats, has a repeat unit of 2-10 bp. The repeat unit and repeat frequency of microsatellite DNA differ between species, so it can be used as a molecular marker (SSR).

Statistic of predicted repeated sequences in assembled genome

Total Length: Total length of all repeat; % in genome: The proportion of the length of repeat in Genome.

Sample Name | Type | Number (#) | Repeat Size (bp) | Total Length (bp) | In Genome (%)
JCM.13583 | TRF | 49 | 1-127 | 2,662 | 0.1518
JCM.13583 | Minisatellite DNA | 27 | 15-65 | 1,654 | 0.0943
JCM.13583 | Microsatellite DNA | 7 | 3-10 | 553 | 0.0315
JCM.14719 | TRF | 60 | 6-243 | 3,617 | 0.1503
JCM.14719 | Minisatellite DNA | 42 | 15-57 | 2,380 | 0.0989
JCM.14719 | Microsatellite DNA | 5 | 6-10 | 210 | 0.0087

Page 30:

Tools

Tandem Repeats Finder version 4.09, a program to locate and display tandem repeats in DNA sequences.

Output files from repeat sequence prediction:

BGI_result/Separate/SampleName/3.Genome_Component/Repeat_finding:

|-- SampleName.TRF.stat.xls [Statistics of tandem repeat analysis]

|-- SampleName.Microsatellite.DNA.dat.gff [Microsatellite DNA file in GFF3 format]

|-- SampleName.Minisatellite.DNA.dat.gff [Minisatellite DNA file in GFF3 format]

|-- SampleName.trf.dat [Primary results of TRF analysis]

|-- SampleName.trf.dat.gff [*.trf.dat file in GFF3 format]

Microsatellite.DNA.dat.gff file

Minisatellite.DNA.dat.gff file

Page 31:

A microsatellite is a tract of tandemly repeated (i.e. adjacent) DNA motifs that range in length from one to six or up to ten nucleotides (the exact definition and the delineation from the longer minisatellites vary from author to author) and are typically repeated 5-50 times. Microsatellites are distributed throughout the genome.

Microsatellites (simple sequence repeats) in non-coding regions do not have any specific function. This allows them to accumulate mutations unhindered over the generations and gives rise to variability that can be used for DNA fingerprinting and identification purposes. They provide an alternative solution when 16S sequencing cannot distinguish two bacteria at species level.

TATATATATA: dinucleotide microsatellite

GTCGTCGTCGTCGTC: trinucleotide microsatellite

Tetra- and pentanucleotide microsatellites, ……

3rd generation sequencing (PacBio) is required for the accurate identification of SSRs
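As a toy illustration of what a microsatellite search looks for, the sketch below finds perfect tandem repeats with a regular expression. It is not the Tandem Repeats Finder algorithm: the function name, the 2-10 bp unit range and the minimum of five copies are assumptions chosen to match the definitions on these slides, and imperfect repeats are not detected.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=10, min_repeats=5):
    """Return (start, unit, copies) for perfect tandem repeats in seq."""
    # A unit of min_unit..max_unit bases (shortest first), followed by at
    # least min_repeats - 1 further copies of the same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]


# The two example repeats from the slide, embedded in hypothetical flanks.
sequence = "AAC" + "TATATATATA" + "GGA" + "GTCGTCGTCGTCGTC" + "CA"
print(find_microsatellites(sequence))
```

Each result gives the 0-based start position, the repeat unit and the number of copies.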

Page 32:

Function annotation is based on analysis of the protein/gene sequences. Genes are aligned (blasted) against several databases to obtain their corresponding annotations. To ensure biological meaning, the highest-quality alignment result for each gene is chosen as its annotation.
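Choosing one annotation per gene can be sketched as a best-hit selection. The slides do not say how alignment quality is measured, so ranking by alignment score with the E-value as tie-breaker is an assumption here. The field names loosely follow the columns of the blast result tables in this deck, and the first hit in the example is invented to show the selection; only ardb_2438 with score 171 comes from the deck.

```python
def best_hits(hits):
    """Keep, for every gene, the hit with the highest score (lowest E-value on ties)."""
    best = {}
    for hit in hits:
        current = best.get(hit["gene_id"])
        key = (hit["score"], -hit["e_value"])
        if current is None or key > (current["score"], -current["e_value"]):
            best[hit["gene_id"]] = hit
    return best


hits = [
    # Hypothetical weaker hit, added only to show that it is discarded.
    {"gene_id": "JCM.12580GL000615", "subject_id": "hypothetical_hit",
     "score": 90.0, "e_value": 1e-20},
    # Hit reported in the ARDB blast result of this deck.
    {"gene_id": "JCM.12580GL000615", "subject_id": "ardb_2438",
     "score": 171.0, "e_value": 2e-54},
]
chosen = best_hits(hits)
print(chosen["JCM.12580GL000615"]["subject_id"])
```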

Statistic of all annotation

This table shows the number of annotated genes on each database from each sample and their ratio.

Sample Name | Total (#) | VFDB (#) | ARDB (#) | IPR (#) | SWISSPROT (#) | COG (#) | GO (#) | KEGG (#) | NR (#) | T3SS (#) | OverAll (#)
JCM.11219 | 2,689 | 27 (1%) | 1 (0.03%) | 1,823 (67.79%) | 654 (24.32%) | 1,691 (62.88%) | 1,282 (47.67%) | 1,195 (44.44%) | 2,348 (87.31%) | 256 (9.52%) | 2,435 (90.55%)
JCM.12580 | 3,696 | 142 (3.84%) | 18 (0.48%) | 3,011 (81.46%) | 1,864 (50.43%) | 2,483 (67.18%) | 2,157 (58.36%) | 2,080 (56.27%) | 3,357 (90.82%) | 455 (12.31%) | 3,438 (93.01%)
JCM.13583 | 1,678 | 28 (1.66%) | 0 (0%) | 1,403 (83.61%) | 603 (35.93%) | 1,173 (69.9%) | 1,107 (65.97%) | 922 (54.94%) | 1,390 (82.83%) | 139 (8.28%) | 1,528 (91.06%)
JCM.14719 | 2,473 | 148 (5.98%) | 12 (0.48%) | 2,089 (84.47%) | 1,423 (57.54%) | 1,791 (72.42%) | 1,563 (63.2%) | 1,567 (63.36%) | 2,229 (90.13%) | 144 (5.82%) | 2,304 (93.16%)

Page 33:

Databases normally used:

1. VFDB (virulence factor database) version 2017-09

2. ARDB (Antibiotic Resistance Genes Database) version:1.1

3. T3DB (Type 3 secretion system related Database) version:1.0

4. KEGG (Kyoto Encyclopedia of Genes and Genomes) version:81

5. COG (clusters of orthologous groups) version:2014-11-10

6. GO (Gene Ontology) Database releases_2017-09-08

7. Interpro (IPR) version 71.0

8. Swiss-Prot/TrEMBL version:release-2017-07

9. NR(Non-Redundant Protein Sequence Database)

Page 34:

Output files of ARDB annotation

SampleName.ardb.list.anno.xls [ARDB annotation result]

Gene_id | Identity | E_value | Subject_id | Resistance_Type | Resistance_Requirement | Antibiotic_Resistance | Description
JCM.12580GL000615 | 42.92 | 2.00E-54 | ardb_2438 | vanrb | vanb,vanhb,vanxb,vanyb,vansb,vanwb | vancomycin | VanB type vancomycin resistance operon genes, which can synthesize peptidoglycan with modified C-terminal D-Ala-D-Ala to D-alanine--D-lactate.

SampleName.ardb.list.filter.xls [ARDB blast result]

Gene_id | Subject_id | Identity | Align_length | Mismatch | Gap | E_value | Score
JCM.12580GL000615 | ardb_2438 | 42.92 | 226 | 119 | 4 | 2.00E-54 | 171

Page 35:

Gene_id | Identity | E_value | GI_id | Type | Description
JCM.12580GL000448 | 72.13 | 4.00E-159 | gi:118480317 | Predicted | (gtaB) UTP--glucose-1-phosphate uridylyltransferase [Polysaccharide capsule (CVF567)] [Bacillus thuringiensis str. Al Hakam]
JCM.12580GL000544 | 76.32 | 4.00E-109 | gb|NP_465991 | Verified | (clpP) ATP-dependent Clp protease proteolytic subunit [ClpP (VF0074)] [Listeria monocytogenes EGD-e]

SampleName.ardb.list.anno.xls [VFDB annotation result]

Page 36:

Gene_id Description Score is secreted

JCM.12580GL000067 locus=Scaffold1:69668:70753:+ 1 TRUE

JCM.12580GL000150 locus=Scaffold1:149049:149651:+ 1 TRUE

JCM.12580GL000195 locus=Scaffold10:8375:9355:- 1 TRUE

JCM.12580GL001292 locus=Scaffold23:42488:43762:- 0.95314004 TRUE

JCM.12580GL002540 locus=Scaffold5:88857:89477:+ 0.95228901 TRUE

JCM.12580GL001387 locus=Scaffold25:44758:44877:+ 0.95182691 TRUE

SampleName.effectiveT3.std.anno.xls[T3SS annotation result]

Page 37:

Gene_id | Identity | E_value | Accession_id | Locus_id | Description
JCM.12580GL000004 | 83.51 | 0 | Q8EPF3 | SYT_OCEIH | Threonine--tRNA ligase OS=Oceanobacillus iheyensis (strain DSM 14371 / CIP 107618 / JCM 11309 / KCTC 3954 / HTE831) GN=thrS
JCM.12580GL000005 | 88.02 | 4.00E-106 | Q8EPF5 | IF3_OCEIH | Translation initiation factor IF-3 OS=Oceanobacillus iheyensis (strain DSM 14371 / CIP 107618 / JCM 11309 / KCTC 3954 / HTE831) GN=infC
JCM.12580GL000006 | 81.25 | 9.00E-32 | P55874 | RL35_BACSU | 50S ribosomal protein L35 OS=Bacillus subtilis (strain 168) GN=rpmI

SampleName.swissprot.list.anno.xls [SWISSPROT annotation result]

Page 38:

Gene_id | Identity | E_value | Subject_id | Subject_description
JCM.12580GL000001 | 76.96 | 4.00E-105 | WP_066393003.1 | transposase [Bacillus horneckiae]
JCM.12580GL000002 | 86.59 | 0 | WP_010532428.1 | hypothetical protein [Lentibacillus jeotgali]
JCM.12580GL000004 | 90.59 | 0 | WP_010532429.1 | threonine--tRNA ligase [Lentibacillus jeotgali]
JCM.12580GL000005 | 97.6 | 1.00E-117 | WP_083838424.1 | translation initiation factor IF-3 [Lentibacillus jeotgali]

SampleName.Nr.list.anno.xls [NR annotation result]

Page 39:

Visualization of the statistic of COG annotation

Gene_id | Identity | E_value | GI_number | Subject_Number | Taxid | Genome_name | Code | Subject_length | Subject_start | Subject_end | Membership_class | COG_id | Anno | ClassFunction
JCM.12580GL000012 | 78.06 | 0 | 386715028 | YP_006181351.1 | 866895 | Halobacillus_halophilus_DSM_2266_uid162033 | Halhal | 361 | 1 | 361 | 0 | COG1363 | Putative aminopeptidase FrvX | Amino acid transport and metabolism;Carbohydrate transport and metabolism

Page 40:

Gene_id | GO_number | GO_id; GO_description; GO_class
JCM.12580GL000001 | 3 | {GO:0003677; DNA binding; molecular_function} {GO:0004803; transposase activity; molecular_function} {GO:0006313; transposition, DNA-mediated; biological_process}
JCM.12580GL000005 | 2 | {GO:0003743; translation initiation factor activity; molecular_function} {GO:0006413; translational initiation; biological_process}
JCM.12580GL000006 | 4 | {GO:0003735; structural constituent of ribosome; molecular_function} {GO:0005622; intracellular; cellular_component} {GO:0005840; ribosome; cellular_component} {GO:0006412; translation; biological_process}
JCM.12580GL000013 | 2 | {GO:0005524; ATP binding; molecular_function} {GO:0016887; ATPase activity; molecular_function}
JCM.12580GL000017 | 1 | {GO:0030436; asexual sporulation; biological_process}

Gene_id | IPR_number | InterPro_entry; InterPro_description
JCM.12580GL000001 | 1 | {IPR003346; Transposase, IS116/IS110/IS902}
JCM.12580GL000002 | 1 | {IPR014199; Sporulation protein YtxC}
JCM.12580GL000006 | 3 | {IPR001706; Ribosomal protein L35, non-mitochondrial} {IPR018265; Ribosomal protein L35, conserved site} {IPR021137; Ribosomal protein L35}
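The braces-and-semicolons layout of the GO and InterPro columns is easy to split programmatically. The sketch below is a minimal parser under the assumption that descriptions never contain a semicolon or a brace, which holds for the rows shown here but is not guaranteed for the full files; the function name is illustrative.

```python
import re

ENTRY = re.compile(r"\{([^{}]*)\}")


def parse_annotation_field(field):
    """Turn '{a; b; c} {d; e; f}' into a list of tuples of stripped parts."""
    return [
        tuple(part.strip() for part in entry.split(";"))
        for entry in ENTRY.findall(field)
    ]


go_field = ("{GO:0005524; ATP binding; molecular_function} "
            "{GO:0016887; ATPase activity; molecular_function}")
ipr_field = "{IPR003346; Transposase, IS116/IS110/IS902}"

go_terms = parse_annotation_field(go_field)
ipr_entries = parse_annotation_field(ipr_field)
print(go_terms)
print(ipr_entries)
```

The number of parsed entries should equal the GO_number or IPR_number column of the same row, which gives a simple consistency check.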

Visualization of the statistic of GO annotation

InterPro's intention is to provide a one-stop-shop

for protein classification, where all the signatures

produced by the different member databases are

placed into entries within the InterPro database.

Signatures which represent equivalent domains,

sites or families are put into the same entry and

entries can also be related to one another.

Additional information such as a description,

consistent names and Gene Ontology (GO) terms

are associated with each entry, where possible.

Page 41:

Visualization of the statistic of

KEGG annotation

BGI_result/Separate/SampleName/4.Genome_Function/General_Gene_Annotation:

|-- SampleName.kegg.functional_classification_2.pdf [KEGG functional classification figure in PDF format]

|-- SampleName.kegg.functional_classification_2.png [KEGG functional classification figure in PNG format]

|-- SampleName.kegg.list.anno.xls [KEGG annotation result]

|-- SampleName.kegg.list.filter.xls [KEGG blast result]

|-- SampleName.kegg.list.Gene2KEGG.xls [Statistic of KEGG genes and corresponding KoNumber]

|-- SampleName.kegg.list.KEGG2Gene.xls [Statistic of KEGG classifications and corresponding genes]

|-- SampleName.kegg.list.ko.htm [Relative URL of ko]

|-- SampleName.kegg.list.ko.path.xls [Statistic of KEGG pathways and corresponding genes]

|-- SampleName.kegg.list.ko.xls [Description of each ko]

|-- KEGG_MAP.tar.gz [Maps packed file]

File structure of KEGG

annotation

Page 42:

Gene_id | Identity | E_value | Kegg_geneID | Ko_id | Ko_name | Ko_defi | Ko_EC | Ko_class
JCM.12580GL000004 | 89.35 | 0 | lao:AOX59_06705 | K01868 | TARS, thrS | threonyl-tRNA synthetase | 6.1.1.3 | Genetic Information Processing--Translation--Aminoacyl-tRNA biosynthesis
JCM.12580GL000007 | 94.07 | 4.00E-75 | lao:AOX59_06690 | K02887 | RP-L20, MRPL20, rplT | large subunit ribosomal protein L20 | -- | Genetic Information Processing--Translation--Ribosome
JCM.12580GL003664 | 75.44 | 8.00E-123 | lao:AOX59_04735 | K00945 | cmk | CMP/dCMP kinase | 2.7.4.25 | Metabolism--Global and overview maps--Metabolic pathways|Metabolism--Nucleotide metabolism--Pyrimidine metabolism

#Pathway | Count (2080) | Pathway ID | Level 1 | Level 2
Aminoacyl-tRNA biosynthesis | 27 | ko00970 | Genetic Information Processing | Translation

Page 43: