Annotation - WDCM · 2019-02-14 · Contig: Sequences assembled from k-mers. Scaffold: linked...
• Sample collection
• DNA extraction
• Library construction
• Sequencing
• Assembly
• Annotation
• Database
• Open access
Global Centers BGI WDCM
1.5 mL Eppendorf tube (can be sealed with microplate sealing film).
x Do not use single PCR tubes, 8-well strip PCR tubes, or PCR plates sealed with domed caps.
The Eppendorf tube should be placed in a rigid, sealed container.
DNA samples precipitated with ethanol can be shipped at ambient temperature. All other types of DNA sample should be shipped with enough dry ice.
Sample: clear and colorless. Colored samples will be held by customs.
One hard copy with sample. One soft copy by email.
Sequencing type | Volume (μL) | Total DNA amount (μg) | DNA concentration (ng/μL) | DNA library insert size | Sequencing time | Data analysis time
Fungi/Bacteria Whole Genome Sequencing | 15-100 | ≥1 | ≥12.5 | 270 bp | 20 days | 10 days
Fungi/Bacteria PacBio DNA Sequencing | 15-100 | ≥5 | ≥60 | 20 kb | 28 days | 10 days
Fungi/Bacteria Whole Genome Sequencing | 15-100 | ≥10 | ≥30 | PCR-free, 270 bp | 20 days | 10 days
[Figure: agarose gel image; marker band labels 100, 250, 500, 750, 1000, 2000 and 564, 2027, 2322, 4361, 23130]
Sample Test Method
① Method of concentration determination: ■ Qubit Fluorometer, □ NanoDrop, □ Microplate Reader
② Method of sample integrity test: ■ Agarose Gel Electrophoresis
• Sample Name*
• Species*
• No. of Tubes*
• Concentration (ng/μL)
• Volume (μL)
• Total Quantity (μg)
• One hard copy with sample. One soft copy by email.
• Shipping address
– China: China National GeneBank, Jinsha Road, Dapeng District, Shenzhen, China
– Other countries and regions: 1/F, 16th Dai Fu Street, Tai Po Industrial Estate, Tai Po, Hong Kong
Please remember to fill in the species name and the GC content!
Data: Hiseq, PacBio
Assembly: Draft Map, Fine Map, Completed Map
Genomics Component: Gene, Repeat Sequence, ncRNA, Gene Island, Prophage, CRISPR
Gene Function:
– General Gene Annotation: GO, KEGG, NR, Swiss-Prot, TrEMBL, COG
– Animal Pathogens: ARDB, VFDB, PHI, T3SS
– Plant Pathogens: PHI, CAZY, T3SS
• Illumina Data Summary
• Hiseq (270 bp library, PE151), 100X
8.5K -r-xr-xr-x 1 solexa solexa 431 Nov 15 17:59 181113_I12_V300008345_L1_HUMbakMAAAA-519.report
37K -r-xr-xr-x 1 solexa solexa 21K Nov 15 17:59 base.png
13K -r-xr-xr-x 1 solexa solexa 6.6K Nov 15 17:59 qual.png
8.5K -r-xr-xr-x 1 solexa solexa 1.1K Nov 15 17:59 report.htm
37K -r-xr-xr-x 1 solexa solexa 30K Nov 15 17:59 V300008345_L01_519_1.fq.fqStat.txt
3.2G -r-xr-xr-x 1 solexa solexa 3.2G Nov 15 17:59 V300008345_L01_519_1.fq.gz
37K -r-xr-xr-x 1 solexa solexa 30K Nov 15 17:59 V300008345_L01_519_2.fq.fqStat.txt
3.2G -r-xr-xr-x 1 solexa solexa 3.2G Nov 15 18:00 V300008345_L01_519_2.fq.gz
@V300008345L1C001R040000014/1
ATCTGCAAACCAAGTTCTTTCATTACCCGGTCAGTCTGTTTATTCTTTCGGAGATTTCCCAACAACCACATTCCCTCATCGGCAAATACATTCGACAGAC
+
FFF@FGFFFFFEFFFFFFFFFFFFGFFGGFF=BFFFGFFFFFFFF?FFGFF;GFFGGFFGFFFGFFGGFGFFFFGGFGFFFFFFFFFGFGFGGEFFFFFF
FASTQ files: each record has four lines:
1. Information about the read
2. Sequence
3. Separator
4. Phred quality scores encoded in ASCII format
Phred quality scores Q are defined as a property which is logarithmically related to the base-calling error
probabilities P. For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called
incorrectly are 1 in 1000.
Phred Quality Score | Probability of incorrect base call | Base call accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10,000 | 99.99%
50 | 1 in 100,000 | 99.999%
60 | 1 in 1,000,000 | 99.9999%
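The relationship in the table above is Q = -10 · log10(P). The following is a minimal Python sketch of that conversion and of decoding a FASTQ quality string. The function names are illustrative, and the ASCII offset of 33 is an assumption (it is the common Phred+33 encoding; the slides do not state which encoding is used).

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


# First six quality characters of the example read shown earlier.
scores = decode_quality("FFF@FG")
print(scores)
print([phred_to_error_prob(q) for q in scores])
```

Under the Phred+33 assumption, the character F corresponds to a score of 37, which is an error probability of roughly 1 in 5,000.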
[Figure: SOAPnuke_filter workflow. Raw data is filtered to remove low quality reads, adapter sequences, host sequences, duplications and N bases; the remaining good reads form the clean data.]
Insert Size: The length of inserted fragment; Reads Length: Length of reads; Raw Data: The size of raw data; Adapter: The
proportion of Adapter; Duplication: The proportion of same reads; Total Reads: Total reads number; Filtered Reads: The
proportion of filtered reads; Low Quality Filtered Reads: The proportion of Low quality filtered reads; Clean Data: The size
of reads we delivered.
Sample Name | Insert Size (bp) | Reads Length (bp) | Raw Data (Mb) | Adapter (%) | Duplication (%) | Total Reads (#) | Filtered Reads (%) | Low Quality Filtered Reads (%) | Clean Data (Mb)
JCM.11219 | 270 | (150:150) | 1,308 | 0.65 | 12.96 | 8,723,498 | 14.46 | 0.82 | 1,119
JCM.12580 | 270 | (150:150) | 1,301 | 2.59 | 10.42 | 8,676,870 | 13.69 | 0.66 | 1,123
JCM.13583 | 270 | (150:150) | 1,301 | 0.51 | 12.88 | 8,676,864 | 14.32 | 0.89 | 1,115
Software for filtering : SOAPnuke_filter
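The columns of the filtering table appear to be related by simple arithmetic: raw data is roughly total reads × read length, and clean data is roughly raw data × (1 − filtered reads %). The sketch below checks this for JCM.11219. The formulas are inferred from the numbers in the table, not stated in the slides, and the function names are illustrative.

```python
def raw_data_mb(total_reads, read_length):
    """Raw data size in Mb (10^6 bases), assuming all reads have the same length."""
    return total_reads * read_length / 1e6


def clean_data_mb(raw_mb, filtered_reads_pct):
    """Clean data size in Mb after removing the given percentage of reads."""
    return raw_mb * (1 - filtered_reads_pct / 100)


# JCM.11219: 8,723,498 reads of 150 bp, 14.46% of reads filtered.
raw = raw_data_mb(8_723_498, 150)
clean = clean_data_mb(raw, 14.46)
print(int(raw), int(clean))
```

For this sample the truncated values agree with the Raw Data and Clean Data columns (1,308 Mb and 1,119 Mb).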
The X-axis shows the positions of bases in read1 and
read2. When the base composition is balanced, the A
and T curves overlap and the G and C curves
overlap.
The X-axis shows the positions of bases in read1 and
read2, the Y-axis shows the quality value of each base.
Each point in the graph represents the base quality value
of the corresponding position in a certain read.
BGI_result/Separate/SampleName/1.Cleandata:
|-- SampleName.Illumina_Cleandata.xls [Statistics of Illumina filtering]
|-- SampleName.ISInserSize_Clean.1.fq.gz [Illumina reads 1 compressed file in FASTQ format]
|-- SampleName.ISInserSize_Clean.2.fq.gz [Illumina reads 2 compressed file in FASTQ format]
|-- SampleName.ISInserSize_Clean.base.png [Filtered Illumina reads GC distribution]
|-- SampleName.ISInserSize_Clean.qual.png [Filtered Illumina reads quality distribution]
|-- SampleName.ISInserSize_Raw.base.png [Raw Illumina reads GC distribution]
|-- SampleName.ISInserSize_Raw.qual.png [Raw Illumina reads quality distribution]
Read: Sequences acquired from the sequencer (Illumina).
K-mer: Contiguous fragments of reads that are all of equal length (k).
Contig: Sequences assembled from k-mers.
Scaffold: Contigs linked across gaps (with the help of PE reads).
Software for assembly: SOAPdenovo – V.2.04
SPAdes – V.3.12.0
Reads → K-mers → Contigs → Scaffold (contigs joined by NNNN gaps)
Interpreting the Statistics of Assembly
Seq Type: Assembled contig or scaffold;
Total Number: Number of contigs or scaffolds;
Total Length: The length of the whole assembly sequence.
N50: Given a set of contigs sorted from longest to shortest, the N50 is the length of the contig at which the cumulative length first reaches 50% of the total assembly length. In general, the larger the N50, the more contiguous the assembly.
Sample Name | Seq Type | Total Number (#) | Total Length (bp) | N50 Length (bp) | N90 Length (bp) | Max Length (bp) | Min Length (bp) | Gap Number (bp) | GC Content (%)
JCM.23203 | Scaffold | 53 | 7,526,366 | 483,364 | 141,092 | 894,858 | 506 | 249 | 53.5
JCM.23203 | Contig | 58 | 7,526,117 | 476,707 | 141,092 | 894,858 | 278 | - | 53.5
JCM.30071 | Scaffold | 31 | 4,377,410 | 442,210 | 124,953 | 1,045,258 | 652 | 250 | 36.94
JCM.30071 | Contig | 38 | 4,377,160 | 430,384 | 122,824 | 1,045,258 | 375 | - | 36.94
N50
The N50 statistic describes assembly quality in terms of contiguity. Given a set of contigs sorted from longest to shortest, the N50 is the length of the contig at which the cumulative length first reaches 50% of the total assembly length. It can be thought of as the point of half of the mass of the length distribution.
Longer N50 → better assembly
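The N50 definition can be sketched in a few lines of Python. This is a minimal illustration, not the code used by the pipeline; the function name `nx` and the toy contig lengths are made up for the example. The same function gives N90 when the fraction is set to 0.9.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx length: walk the contigs from longest to shortest and
    report the length at which the running total first reaches
    `fraction` of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    return 0  # empty input


# Toy contig lengths (hypothetical), total length 54.
contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print("N50:", nx(contigs))        # 10 + 9 + 8 = 27 reaches half of 54
print("N90:", nx(contigs, 0.9))
```

Because the walk starts from the longest contig, a few long contigs push the N50 up, which is why it is used as a contiguity measure.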
Output files:
BGI_result/Separate/SampleName/2.Assembly:
|-- SampleName.kmer.png [Figure of kmer analysis]
|-- SampleName.kmer.stat.xls [Statistics of kmer results]
|-- SampleName.Draft.Assembly.stat.xls [Statistics of assembly results]
|-- SampleName.Draft.genome.Contig.fasta [Assembly result of contigs]
|-- SampleName.Draft.genome.fasta [Assembly result of scaffolds]
|-- SampleName.Draft.genome.ncbi.agp [NCBI uploading file]
|-- SampleName.genome.gb [Genome information in GenBank format]
|-- SampleName.genome.tbl [Genome information in tbl format]
|-- SampleName.coverage_depth.table.xls [Statistics of coverage rate]
|-- SampleName.GC-depth.png [Figure of GC and depth]
Pictures are from lecture slide in “De Novo whole genome assembly”, Lecture 1, Qi Sun, Bioinformatics Facility, Cornell University
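2. K should always be odd to avoid palindromic sequences:
Given a sequence GCCG, the 2-mers are GC, CC, CG and their reverse complements are GC, GG, CG, so GC and CG are each identical to their own reverse complement. The 3-mers are GCC, CCG and their reverse complements are GGC, CGG. With an odd k, no k-mer can equal its own reverse complement.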
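The GCCG example can be reproduced with a short Python sketch that lists the k-mers of a sequence and flags those that equal their own reverse complement. The helper names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def self_complementary(seq, k):
    """k-mers of seq that are identical to their own reverse complement."""
    return [km for km in kmers(seq, k) if km == reverse_complement(km)]


for k in (2, 3):
    print(k, kmers("GCCG", k), self_complementary("GCCG", k))
```

For odd k the middle base of a k-mer would have to be its own complement, which is impossible, so the second list is always empty.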
What can we learn from the frequency distribution of k-mers?
1. Whether the sequencing process is random (the distribution follows a Poisson distribution if it is)
2. Whether the sequenced genome was contaminated by genomes from other species (by observing the peaks)
3. An estimate of the size of the sequenced genome (from k-mer depth and k-mer number)
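A common way to turn k-mer depth and k-mer number into a genome size estimate is: genome size ≈ total number of k-mers / peak k-mer depth. The sketch below illustrates this on toy data. It is a simplified illustration under stated assumptions: the function names are made up, reverse complements are not merged, and low-depth k-mers caused by sequencing errors are not removed, which real tools normally handle.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(counts):
    """Estimate genome size as total k-mers divided by the peak k-mer depth."""
    depth_histogram = Counter(counts.values())  # depth -> number of distinct k-mers
    peak_depth = max(depth_histogram, key=depth_histogram.get)
    total_kmers = sum(counts.values())
    return total_kmers / peak_depth


# Toy data (hypothetical): the same 6-base fragment sequenced three times.
reads = ["ACGTAC"] * 3
counts = kmer_counts(reads, k=3)
print(estimate_genome_size(counts))
```

A second peak in the depth histogram would correspond to point 2 in the list above, a possible sign of contamination.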
[Figure: reads split into k-mers, with example k-mer depths of 3 and 2. A k-mer is effectively a shorter read.]
What can we learn from the GC-Depth graph?
1. Whether the sequencing is random: if it is, the sequencing depth distribution should resemble the k-mer depth distribution (Poisson distribution).
2. Whether the genome is contaminated: there should be only one peak on the Y axis. Two peaks indicate that the sequenced genome is contaminated.
Contamination found: identify the contaminating genome by aligning the reads against the NT (Nucleotide Sequence Database) database.
Unusual GC content (GC% > 65% or GC% < 35%): the PCR-free method can be used to improve the assembly.
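The GC thresholds can be checked with a small Python sketch. The 35% and 65% cut-offs come from the slide; the function names are illustrative, and ignoring non-ACGT characters such as N is an assumption of this example.

```python
def gc_content(seq):
    """GC content in percent, counting only A, C, G and T (N is ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100 * gc / acgt if acgt else 0.0


def has_unusual_gc(seq, low=35, high=65):
    """True when GC% falls outside the low..high range used on the slide."""
    pct = gc_content(seq)
    return pct < low or pct > high


print(gc_content("ATGC"), has_unusual_gc("ATGC"))
print(gc_content("GGGGGGGCAT"), has_unusual_gc("GGGGGGGCAT"))
```

By this rule the KCTC22558 sample in the next table (GC content about 69%) would be flagged as a candidate for the PCR-free library.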
Sample Name | Seq Type | Total Number (#) | Total Length (bp) | N50 Length (bp) | N90 Length (bp) | Max Length (bp) | Min Length (bp) | Gap Number (bp) | GC Content (%)
KCTC22558 (Without PCR free) | Scaffold | 71 | 3,043,249 | 99,202 | 22,419 | 262,339 | 584 | 31 | 69.6
KCTC22558 (Without PCR free) | Contig | 72 | 3,043,218 | 99,202 | 22,419 | 262,339 | 584 | - | 69.6
KCTC22558 (With PCR free) | Scaffold | 31 | 3,059,742 | 452,204 | 290,547 | 888,648 | 505 | 148 | 69.40
KCTC22558 (With PCR free) | Contig | 36 | 3,059,594 | 403,061 | 290,547 | 514,528 | 276 | - | 69.40
[Figure panels: With PCR-free, Without PCR-free]
Contamination
Sample Name | Seq Type | Total Number (#) | Total Length (bp) | N50 Length (bp) | N90 Length (bp) | Max Length (bp) | Min Length (bp) | Gap Number (bp) | GC Content (%)
TBRC2205 | Scaffold | 4,695 | 7,313,005 | 2,051 | 699 | 61,951 | 500 | 12,036 | 72.28
TBRC2205 | Contig | 5,479 | 7,300,969 | 1,920 | 585 | 61,951 | 233 | - | 72.28
Possible contaminating species were predicted after aligning against the Nt database
TaxID | Organism | Cover_Len(bp) | Scaffolds_Len | Coverage(%) | Genomics(%)
2024580 | Plantactinospora sp. | 2350415 | 3106359 | 75.66 | 42.48
261654 | Micromonospora auratinigra | 133798 | 224639 | 59.56 | 3.07
743718 | Isoptericola variabilis | 115063 | 202306 | 56.88 | 2.77
307121 | Micromonospora krabiensis | 104234 | 205038 | 50.84 | 2.8
Statistic of gene prediction
Total Number: The count of genes; Total Length: Total length of all genes; Average Length: Average length
of all genes; GC Content: The content of G and C in gene; Length/Genome Length: The proportion of gene
length in genome.
Sample Name | Genome Size (bp) | Total Number (#) | Total Length (bp) | Average Length (bp) | Length / Genome Length (%) | GC Content (%)
JCM.11219 | 2,439,499 | 2,689 | 2,146,311 | 798.18 | 87.98 | 46.05
JCM.12580 | 3,632,762 | 3,696 | 3,028,884 | 819.5 | 83.38 | 42.75
JCM.13583 | 1,753,463 | 1,678 | 1,527,282 | 910.18 | 87.1 | 56.87
JCM.14719 | 2,406,988 | 2,473 | 2,075,655 | 839.33 | 86.23 | 65.12
Gene prediction software: Glimmer 3.02
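The derived columns of the gene prediction table follow from the definitions given above it: average length is total gene length divided by gene count, and Length / Genome Length is total gene length divided by genome size. A minimal Python sketch, using the JCM.11219 row as input (the function name is illustrative):

```python
def gene_summary(genome_size, total_gene_length, gene_count):
    """Return (average gene length in bp, gene length as % of genome)."""
    average_length = total_gene_length / gene_count
    coding_pct = 100 * total_gene_length / genome_size
    return round(average_length, 2), round(coding_pct, 2)


# JCM.11219: genome 2,439,499 bp, 2,689 genes, 2,146,311 bp of genes in total.
print(gene_summary(2_439_499, 2_146_311, 2_689))
```

The result matches the Average Length and Length / Genome Length values reported for that sample.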
Interpreting the output files from gene prediction
BGI_result/Separate/SampleName/3.Genome_Component/Gene_Predict:
|-- SampleName.Gene.cds.fasta [Predicted genes in CDS format]
|-- SampleName.Gene.pep.fasta [Protein sequences of predicted genes]
|-- SampleName.Gene.gff [Predicted genes in GFF3 format]
|-- SampleName.Gene.cds.png [Figure of Gene length distribution]
|-- SampleName.Gene.stat.xls [Statistics of predict genes]
JCM.12580.Gene.cds.fasta JCM.12580.Gene.pep.fasta SampleName.Gene.gff
Statistic table of ncRNA prediction
Type: The type of ncRNA;
% in genome: The proportion of length of ncRNA in genome.
Sample Name | Type | Copy Number (#) | Average Length (bp) | Total Length (bp) | % in Genome
JCM.11219 | tRNA | 34 | 85 | 2,890 | 0.1185
JCM.11219 | 5s_rRNA (Denovo) | 1 | 117 | 117 | 0.0047
JCM.11219 | 16s_rRNA (Denovo) | 1 | 1,508 | 1,508 | 0.0618
JCM.11219 | 23s_rRNA (Denovo) | 1 | 3,066 | 3,066 | 0.1256
JCM.11219 | sRNA | 1 | 24 | 24 | 0.001
Tools for ncRNA prediction
rRNA: RNAmmer1.2. The RNAmmer 1.2
server predicts 5s/8s, 16s/18s, and 23s/28s
ribosomal RNA in full genome sequences.
sRNA: Infernal, blastn, Rfam. Use blastn to
align the assembled genome against Rfam
database, in order to find out the possible
sRNA.
tRNA: tRNAscan-SE. Searching for tRNA
genes in genomic sequences.
Output files from ncRNA prediction
BGI_result/Separate/SampleName/3.Genome_Component/ncRNA_finding:
|-- SampleName.ncRNA.stat.xls [Statistics of ncRNA prediction]
|-- SampleName.denovo.rRNA.fasta [RNAmmer prediction result]
|-- SampleName.denovo.rRNA.gff [RNAmmer prediction result in GFF3 format]
|-- SampleName.sRNA.cmsearch.confident.gff [Filtered result of sRNA prediction]
|-- SampleName.sRNA.cmsearch.confident.nr.gff [Final result of sRNA after duplication removal]
|-- SampleName.tRNA.gff [tRNA prediction result in GFF3 format]
|-- SampleName.tRNA.structure [tRNA secondary structure file]
|-- SampleName.tRNA.xls [tRNA prediction result]
JCM.12580.denovo.rRNA.fasta
JCM.12580.denovo.rRNA.gff
A tandem repeat (TR) is a sequence that contains more than two adjacent repeat units. The repeat unit is 1 bp to 500 bp long. Tandem repeats are species-specific, which makes them useful for evolutionary research.
Minisatellite DNA, also called variable number tandem repeats (VNTR), is a kind of short repeated DNA sequence with a repeat unit of 15-65 bp.
Microsatellite DNA, also called short tandem repeats or simple sequence repeats (SSR), has a repeat unit of 2-10 bp. Because the repeat unit and the repeat count differ between species, microsatellites can be used as molecular markers.
Statistic of predicted repeated sequences in assembled genome
Total Length: Total length of all repeat; % in genome: The proportion of the length of repeat in Genome.
Sample Name | Type | Number (#) | Repeat Size (bp) | Total Length (bp) | In Genome (%)
JCM.13583 | TRF | 49 | 1-127 | 2,662 | 0.1518
JCM.13583 | Minisatellite DNA | 27 | 15-65 | 1,654 | 0.0943
JCM.13583 | Microsatellite DNA | 7 | 3-10 | 553 | 0.0315
JCM.14719 | TRF | 60 | 6-243 | 3,617 | 0.1503
JCM.14719 | Minisatellite DNA | 42 | 15-57 | 2,380 | 0.0989
JCM.14719 | Microsatellite DNA | 5 | 6-10 | 210 | 0.0087
Tools
Tandem repeats finder version 4.09, Tandem Repeats Finder is a program to locate and display tandem
repeats in DNA sequences.
Output files from repeat sequence prediction:
BGI_result/Separate/SampleName/3.Genome_Component/Repeat_finding:
|-- SampleName.TRF.stat.xls [Statistics of tandem repeat analysis]
|-- SampleName.Microsatellite.DNA.dat.gff [Microsatellite DNA file in GFF3 format]
|-- SampleName.Minisatellite.DNA.dat.gff [Minisatellite DNA file in GFF3 format]
|-- SampleName.trf.dat [Primary results of TRF analysis]
|-- SampleName.trf.dat.gff [*.trf.dat file in GFF3format]
Microsatellite.DNA.dat.gff file
Minisatellite.DNA.dat.gff file
A microsatellite is a tract of tandemly repeated (i.e. adjacent) DNA motifs that range in length from one to six
or up to ten nucleotides (the exact definition and delineation to the longer minisatellites varies from author to
author), and are typically repeated 5-50 times. Microsatellites are distributed throughout the genome.
Microsatellites (simple sequence repeats) in non-coding regions generally have no specific function. This allows them to accumulate mutations unhindered over the generations, which gives rise to variability that can be used for DNA fingerprinting and identification. They provide an alternative when 16S sequencing cannot distinguish two bacteria at the species level.
TATATATATA: dinucleotide microsatellite
GTCGTCGTCGTCGTC: trinucleotide microsatellite
Tetra- and pentanucleotide microsatellites, ...
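Simple cases like the two examples above can be found with a regular expression that looks for a short unit followed by copies of itself. This is a toy sketch, not a replacement for Tandem Repeats Finder: the function name is made up, the defaults (unit of 2-10 bp, at least 5 copies) follow the ranges quoted in the text, only perfect repeats are found, and a homopolymer run such as AAAAAAAAAA would be reported as a dinucleotide repeat.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=10, min_repeats=5):
    """Return (start, unit, copies) for perfect tandem repeats in seq."""
    # Lazy quantifier: try the shortest unit first, then require the
    # unit to be followed by at least (min_repeats - 1) further copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


print(find_microsatellites("TATATATATA"))
print(find_microsatellites("GTCGTCGTCGTCGTC"))
```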
3rd generation sequencing (PacBio) is required for the accurate identification of SSRs
Function annotation is based on analysis of protein/gene sequences. We align the genes against several databases with BLAST to obtain their corresponding annotations. To ensure biological meaning, the highest-quality alignment for each gene is chosen as its annotation.
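Choosing one alignment per gene can be sketched as follows. The slides do not say how "highest quality" is ranked, so this example assumes the lowest E-value wins and the higher score breaks ties. The field names and the second hit ("hypothetical_hit") are made up for illustration; the first hit uses the ARDB values shown later in this deck.

```python
def best_hits(rows):
    """Keep one alignment per gene: lowest e_value, then highest score.

    rows: iterable of dicts with keys gene_id, subject_id, e_value, score.
    The ranking rule is an assumption, not taken from the pipeline.
    """
    best = {}
    for row in rows:
        current = best.get(row["gene_id"])
        rank = (row["e_value"], -row["score"])
        if current is None or rank < (current["e_value"], -current["score"]):
            best[row["gene_id"]] = row
    return best


hits = [
    # Hypothetical weaker hit for the same gene.
    {"gene_id": "JCM.12580GL000615", "subject_id": "hypothetical_hit",
     "e_value": 1e-10, "score": 60},
    # Values from the ARDB blast result table.
    {"gene_id": "JCM.12580GL000615", "subject_id": "ardb_2438",
     "e_value": 2.00e-54, "score": 171},
]
for gene_id, hit in best_hits(hits).items():
    print(gene_id, hit["subject_id"])
```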
Statistic of all annotation
This table shows the number of annotated genes on each database from each sample and their ratio.
Sample Name | Total (#) | VFDB (#) | ARDB (#) | IPR (#) | SWISSPROT (#) | COG (#) | GO (#) | KEGG (#) | NR (#) | T3SS (#) | OverAll (#)
JCM.11219 | 2,689 | 27 (1%) | 1 (0.03%) | 1,823 (67.79%) | 654 (24.32%) | 1,691 (62.88%) | 1,282 (47.67%) | 1,195 (44.44%) | 2,348 (87.31%) | 256 (9.52%) | 2,435 (90.55%)
JCM.12580 | 3,696 | 142 (3.84%) | 18 (0.48%) | 3,011 (81.46%) | 1,864 (50.43%) | 2,483 (67.18%) | 2,157 (58.36%) | 2,080 (56.27%) | 3,357 (90.82%) | 455 (12.31%) | 3,438 (93.01%)
JCM.13583 | 1,678 | 28 (1.66%) | 0 (0%) | 1,403 (83.61%) | 603 (35.93%) | 1,173 (69.9%) | 1,107 (65.97%) | 922 (54.94%) | 1,390 (82.83%) | 139 (8.28%) | 1,528 (91.06%)
JCM.14719 | 2,473 | 148 (5.98%) | 12 (0.48%) | 2,089 (84.47%) | 1,423 (57.54%) | 1,791 (72.42%) | 1,563 (63.2%) | 1,567 (63.36%) | 2,229 (90.13%) | 144 (5.82%) | 2,304 (93.16%)
Databases normally used:
1. VFDB (virulence factor database) version 2017-09
2. ARDB (Antibiotic Resistance Genes Database) version:1.1
3. T3DB (Type 3 secretion system related Database) version:1.0
4. KEGG (Kyoto Encyclopedia of Genes and Genomes) version:81
5. COG (clusters of orthologous groups) version:2014-11-10
6. GO (Gene Ontology) Database releases_2017-09-08
7. Interpro (IPR) version 71.0
8. Swiss-Prot/TrEMBL version:release-2017-07
9. NR(Non-Redundant Protein Sequence Database)
Output files of ARDB annotation
SampleName.ardb.list.anno.xls [ARDB annotation result]
Gene_id | Identity | E_value | Subject_id | Resistance_Type | Resistance_Requirement | Antibiotic_Resistance | Description
JCM.12580GL000615 | 42.92 | 2.00E-54 | ardb_2438 | vanrb | vanb,vanhb,vanxb,vanyb,vansb,vanwb | vancomycin | VanB type vancomycin resistance operon genes, which can synthesize peptidoglycan with modified C-terminal D-Ala-D-Ala to D-alanine--D-lactate.
SampleName.ardb.list.filter.xls [ARDB blast result]
Gene_id | Subject_id | Identity | Align_length | Mismatch | Gap | E_value | Score
JCM.12580GL000615 | ardb_2438 | 42.92 | 226 | 119 | 4 | 2.00E-54 | 171
SampleName.ardb.list.anno.xls [VFDB annotation result]
Gene_id | Identity | E_value | GI_id | Type | Description
JCM.12580GL000448 | 72.13 | 4.00E-159 | gi:118480317 | Predicted | (gtaB) UTP--glucose-1-phosphate uridylyltransferase [Polysaccharide capsule (CVF567)] [Bacillus thuringiensis str. Al Hakam]
JCM.12580GL000544 | 76.32 | 4.00E-109 | gb|NP_465991 | Verified | (clpP) ATP-dependent Clp protease proteolytic subunit [ClpP (VF0074)] [Listeria monocytogenes EGD-e]
SampleName.effectiveT3.std.anno.xls [T3SS annotation result]
Gene_id | Description | Score | is secreted
JCM.12580GL000067 | locus=Scaffold1:69668:70753:+ | 1 | TRUE
JCM.12580GL000150 | locus=Scaffold1:149049:149651:+ | 1 | TRUE
JCM.12580GL000195 | locus=Scaffold10:8375:9355:- | 1 | TRUE
JCM.12580GL001292 | locus=Scaffold23:42488:43762:- | 0.95314004 | TRUE
JCM.12580GL002540 | locus=Scaffold5:88857:89477:+ | 0.95228901 | TRUE
JCM.12580GL001387 | locus=Scaffold25:44758:44877:+ | 0.95182691 | TRUE
SampleName.swissprot.list.anno.xls [SWISSPROT annotation result]
Gene_id | Identity | E_value | Accession_id | Locus_id | Description
JCM.12580GL000004 | 83.51 | 0 | Q8EPF3 | SYT_OCEIH | Threonine--tRNA ligase OS=Oceanobacillus iheyensis (strain DSM 14371 / CIP 107618 / JCM 11309 / KCTC 3954 / HTE831) GN=thrS
JCM.12580GL000005 | 88.02 | 4.00E-106 | Q8EPF5 | IF3_OCEIH | Translation initiation factor IF-3 OS=Oceanobacillus iheyensis (strain DSM 14371 / CIP 107618 / JCM 11309 / KCTC 3954 / HTE831) GN=infC
JCM.12580GL000006 | 81.25 | 9.00E-32 | P55874 | RL35_BACSU | 50S ribosomal protein L35 OS=Bacillus subtilis (strain 168) GN=rpmI
SampleName.Nr.list.anno.xls [NR annotation result]
Gene_id | Identity | E_value | Subject_id | Subject_description
JCM.12580GL000001 | 76.96 | 4.00E-105 | WP_066393003.1 | transposase [Bacillus horneckiae]
JCM.12580GL000002 | 86.59 | 0 | WP_010532428.1 | hypothetical protein [Lentibacillus jeotgali]
JCM.12580GL000004 | 90.59 | 0 | WP_010532429.1 | threonine--tRNA ligase [Lentibacillus jeotgali]
JCM.12580GL000005 | 97.6 | 1.00E-117 | WP_083838424.1 | translation initiation factor IF-3 [Lentibacillus jeotgali]
Visualization of the statistic of COG annotation
Gene_id | Identity | E_value | GI_number | Subject_Number | Taxid | Genome_name | Code | Subject_length | Subject_start | Subject_end | Membership_class | COG_id | Anno | ClassFunction
JCM.12580GL000012 | 78.06 | 0 | 386715028 | YP_006181351.1 | 866895 | Halobacillus_halophilus_DSM_2266_uid162033 | Halhal | 361 | 1 | 361 | 0 | COG1363 | Putative aminopeptidase FrvX | Amino acid transport and metabolism;Carbohydrate transport and metabolism
Gene_id | GO_number | GO_id; GO_description; GO_class
JCM.12580GL000001 | 3 | {GO:0003677; DNA binding; molecular_function} {GO:0004803; transposase activity; molecular_function} {GO:0006313; transposition, DNA-mediated; biological_process}
JCM.12580GL000005 | 2 | {GO:0003743; translation initiation factor activity; molecular_function} {GO:0006413; translational initiation; biological_process}
JCM.12580GL000006 | 4 | {GO:0003735; structural constituent of ribosome; molecular_function} {GO:0005622; intracellular; cellular_component} {GO:0005840; ribosome; cellular_component} {GO:0006412; translation; biological_process}
JCM.12580GL000013 | 2 | {GO:0005524; ATP binding; molecular_function} {GO:0016887; ATPase activity; molecular_function}
JCM.12580GL000017 | 1 | {GO:0030436; asexual sporulation; biological_process}
Gene_id | IPR_number | InterPro_entry; InterPro_description
JCM.12580GL000001 | 1 | {IPR003346; Transposase, IS116/IS110/IS902}
JCM.12580GL000002 | 1 | {IPR014199; Sporulation protein YtxC}
JCM.12580GL000006 | 3 | {IPR001706; Ribosomal protein L35, non-mitochondrial} {IPR018265; Ribosomal protein L35, conserved site} {IPR021137; Ribosomal protein L35}
Visualization of the statistic of GO annotation
InterPro's intention is to provide a one-stop-shop
for protein classification, where all the signatures
produced by the different member databases are
placed into entries within the InterPro database.
Signatures which represent equivalent domains,
sites or families are put into the same entry and
entries can also be related to one another.
Additional information such as a description,
consistent names and Gene Ontology (GO) terms
are associated with each entry, where possible.
Visualization of the statistic of KEGG annotation
BGI_result/Separate/SampleName/4.Genome_Function/General_Gene_Annotation:
|-- SampleName.kegg.functional_classification_2.pdf [KEGG functional classification figure in PDF format]
|-- SampleName.kegg.functional_classification_2.png [KEGG functional classification figure in PNG format]
|-- SampleName.kegg.list.anno.xls [KEGG annotation result]
|-- SampleName.kegg.list.filter.xls [KEGG blast result]
|-- SampleName.kegg.list.Gene2KEGG.xls [Statistic of KEGG genes and corresponding KoNumber]
|-- SampleName.kegg.list.KEGG2Gene.xls [Statistic of KEGG classifications and corresponding genes]
|-- SampleName.kegg.list.ko.htm [Relative URL of ko]
|-- SampleName.kegg.list.ko.path.xls [Statistic of KEGG pathways and corresponding genes]
|-- SampleName.kegg.list.ko.xls [Description of each ko]
|-- KEGG_MAP.tar.gz [Maps packed file]
File structure of KEGG annotation
Gene_id | Identity | E_value | Kegg_geneID | Ko_id | Ko_name | Ko_defi | Ko_EC | Ko_class
JCM.12580GL000004 | 89.35 | 0 | lao:AOX59_06705 | K01868 | TARS, thrS | threonyl-tRNA synthetase | 6.1.1.3 | Genetic Information Processing--Translation--Aminoacyl-tRNA biosynthesis
JCM.12580GL000007 | 94.07 | 4.00E-75 | lao:AOX59_06690 | K02887 | RP-L20, MRPL20, rplT | large subunit ribosomal protein L20 | -- | Genetic Information Processing--Translation--Ribosome
JCM.12580GL003664 | 75.44 | 8.00E-123 | lao:AOX59_04735 | K00945 | cmk | CMP/dCMP kinase | 2.7.4.25 | Metabolism--Global and overview maps--Metabolic pathways|Metabolism--Nucleotide metabolism--Pyrimidine metabolism
#Pathway | Count (2080) | Pathway ID | Level 1 | Level 2
Aminoacyl-tRNA biosynthesis | 27 | ko00970 | Genetic Information Processing | Translation