Unknown Genes, Community Profiling, & Biotorrents.net
Morgan Langille
UC Davis
Questions
If we wanted to start studying a gene of unknown function, which one(s) should we study first?
How many un-annotated genes could be annotated?
What proportion of unknown genes (hypothetical proteins) are probably not real proteins (e.g. pseudogenes, mis-predicted ORFs, etc.)?
What proportion of unknown gene families are probably phage-related?
Can some of these families (hopefully the top ranking ones) be characterized using non-similarity based bioinformatic approaches?
Outline of project
Genomic Data
Pfam Search
Filter for unknown genes
Build HMMs for unknown genes
Rank Families
•Universality
•Evenness
•Pathogen / Non-pathogen
•Etc.
Create unknown families for
metagenomics data
Identify unknown families that now
merge with known families
Quantify families that are likely
phage
Use several non-similarity based methods to predict family function
•Community Profiling**
•3D structure similarity
•Neighboring genes
Phylogenetic profiling
C. hydrogenoformans
identified presence or
absence of homologs in
all other completely
sequenced genomes
Identified many
hypothetical proteins that
had the same profile as
other sporulation
proteins
Wu, et al., PLOS Genetics, 2005
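The profile-matching idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used in the cited paper: the gene names and the 0/1 presence flags are made-up toy data, and `group_by_profile` is a hypothetical helper that groups genes with identical presence/absence patterns.

```python
from collections import defaultdict


def group_by_profile(presence):
    """Group genes whose presence/absence pattern across genomes is identical.

    presence: dict mapping gene name -> sequence of 0/1 flags, one per genome.
    Returns only the profiles shared by at least two genes.
    """
    groups = defaultdict(list)
    for gene, profile in presence.items():
        groups[tuple(profile)].append(gene)
    return {p: sorted(genes) for p, genes in groups.items() if len(genes) > 1}


# Hypothetical toy data: 1 = homolog present in that genome, 0 = absent.
presence = {
    "known_sporulation_gene_A": (1, 0, 1, 1, 0),
    "hypothetical_protein_1": (1, 0, 1, 1, 0),
    "hypothetical_protein_2": (1, 1, 1, 1, 1),
    "known_sporulation_gene_B": (1, 0, 1, 1, 0),
}

shared = group_by_profile(presence)
for profile, genes in shared.items():
    print(profile, genes)
```

Here the hypothetical protein that shares a profile with the two known sporulation genes becomes a candidate for the same function. Real analyses usually tolerate near-matches rather than requiring identical profiles.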
Community Profiling
KEGG COG
Delong, et al., Science, 2006
Community Profiling
Look across multiple metagenomic
samples
Gene families that have similar profiles
may have similar function
Similar to using co-expression to identify
similar functioning genes
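A minimal sketch of the comparison, assuming each gene family is represented by its hit counts across the same ordered list of samples. The family names and counts are invented, Pearson correlation is one reasonable choice of similarity (the slides do not say which measure was used at this step), and the 0.90 cutoff mirrors the one mentioned later in the talk.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / math.sqrt(var_x * var_y)


def similar_pairs(profiles, cutoff=0.90):
    """Return (family_a, family_b, r) for every pair with r >= cutoff."""
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= cutoff:
            pairs.append((a, b, r))
    return pairs


# Hypothetical counts of each family in four metagenomic samples.
profiles = {
    "PF_A": [10, 0, 5, 20],
    "PF_B": [20, 0, 10, 40],  # rises and falls with PF_A
    "PF_C": [0, 15, 8, 1],    # opposite pattern
}

pairs = similar_pairs(profiles)
for a, b, r in pairs:
    print(a, b, round(r, 3))
```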
So what have I done?
"all metagenomics peptides" from
CAMERA
43M sequences (mostly GOS)
Searched against 11,000 Pfams using
HMMER 3
Used “cluster” to group genes and samples
Results
[Figure: clustered heatmap, metagenomic samples vs. Pfams]
Red = above avg. number of Pfams
Green = below avg. number of Pfams
Have not normalized for:
•Number of sequences per sample
•Number of Pfams
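One way the missing normalization could be done is sketched below. This was not part of the analysis shown; the function names and the toy counts are assumptions. Counts are divided by the number of sequences in each sample, then each Pfam row is centered on its own mean so that positive values correspond to "above average" (red) and negative values to "below average" (green).

```python
def normalize_by_sample_size(counts, sequences_per_sample):
    """Convert raw hit counts into hits per sequence for each sample.

    counts: dict mapping Pfam name -> list of hit counts, one per sample.
    sequences_per_sample: list of total sequence counts, in the same order.
    """
    return {
        pfam: [c / n for c, n in zip(row, sequences_per_sample)]
        for pfam, row in counts.items()
    }


def center_on_row_mean(matrix):
    """Subtract each row's mean so values read as above/below average."""
    centered = {}
    for pfam, row in matrix.items():
        mean = sum(row) / len(row)
        centered[pfam] = [value - mean for value in row]
    return centered


# Hypothetical data: two samples of very different sizes.
counts = {"PF_X": [10, 100], "PF_Y": [5, 200]}
sequences_per_sample = [1000, 10000]

normalized = normalize_by_sample_size(counts, sequences_per_sample)
centered = center_on_row_mean(normalized)
print(normalized)
print(centered)
```

In the toy data, PF_X looks ten times more abundant in the second sample by raw count, but the per-sequence rate is the same in both samples.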
Example of phage Pfams
clustering together
Measuring functional relatedness
Need to measure community profiling performance
The hierarchical clusters were broken into 575 groups using a correlation cutoff of 0.90 or above.
Pfams were mapped to GO terms using pfam2GO
1893 Pfams had no associated GO term
○ 695 of these were Domains of Unknown Function (DUFs)
3377 Pfams had one or more associated GO terms and could be used for further analysis
Only 67 (of 575) clusters contained 4 or more Pfams with at least one GO term
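The filtering step can be sketched as follows. This is an illustration, not the script that produced the numbers above. It assumes mapping lines of the form `Pfam:<accession> <name> > GO:<term name> ; GO:<id>` with `!` marking comment lines; the accessions, GO ids, and cluster contents are made-up toy values.

```python
def parse_pfam2go(lines):
    """Map Pfam accession -> set of GO ids from pfam2go-style lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blanks and comment lines
        left, _, right = line.partition(" > ")
        accession = left.split()[0].split(":", 1)[1]
        go_id = right.rsplit(";", 1)[-1].strip()
        mapping.setdefault(accession, set()).add(go_id)
    return mapping


def usable_clusters(clusters, pfam_to_go, min_annotated=4):
    """Keep clusters with at least min_annotated Pfams that have a GO term."""
    kept = {}
    for cluster_id, pfams in clusters.items():
        annotated = [p for p in pfams if pfam_to_go.get(p)]
        if len(annotated) >= min_annotated:
            kept[cluster_id] = annotated
    return kept


# Hypothetical mapping lines and clusters (not real Pfam or GO entries).
toy_lines = [
    "!toy comment line",
    "Pfam:PFTOY1 toy1 > GO:toy term one ; GO:0000001",
    "Pfam:PFTOY2 toy2 > GO:toy term two ; GO:0000002",
    "Pfam:PFTOY3 toy3 > GO:toy term one ; GO:0000001",
    "Pfam:PFTOY4 toy4 > GO:toy term three ; GO:0000003",
]
clusters = {
    "c1": ["PFTOY1", "PFTOY2", "PFTOY3", "PFTOY4", "DUFTOY"],
    "c2": ["PFTOY1", "DUFTOY"],
}

pfam_to_go = parse_pfam2go(toy_lines)
kept = usable_clusters(clusters, pfam_to_go)
print(kept)
```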
Measuring GO similarity
G-SESAME
Measures the semantic similarity of any two GO
terms
Not downloadable so queries had to be
made to their web server (not fun)
Pair-wise similarity was measured for each
pair of GO terms in each cluster
Had to check that both terms were in the same namespace
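A sketch of the per-cluster scoring, under stated assumptions: `similarity` is a stand-in for whatever returns a semantic similarity between two GO terms (here a hard-coded lookup table, not a call to the G-SESAME server), and the GO ids, namespaces, and scores are invented.

```python
from itertools import combinations


def cluster_score(go_terms, namespace_of, similarity):
    """Mean pairwise similarity over GO term pairs that share a namespace.

    Returns None when the cluster has no comparable pair.
    """
    scores = [
        similarity(a, b)
        for a, b in combinations(sorted(set(go_terms)), 2)
        if namespace_of[a] == namespace_of[b]  # skip cross-namespace pairs
    ]
    return sum(scores) / len(scores) if scores else None


# Hypothetical namespaces and similarity values.
namespace_of = {
    "GO:toy1": "molecular_function",
    "GO:toy2": "molecular_function",
    "GO:toy3": "biological_process",
    "GO:toy4": "biological_process",
}
toy_scores = {
    frozenset(["GO:toy1", "GO:toy2"]): 0.8,
    frozenset(["GO:toy3", "GO:toy4"]): 0.4,
}


def toy_similarity(a, b):
    """Placeholder for a real semantic similarity query."""
    return toy_scores[frozenset([a, b])]


score = cluster_score(list(namespace_of), namespace_of, toy_similarity)
print(score)
```

Only two of the six possible pairs are scored here, because the other four pairs cross namespaces.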
Results
Average G-SESAME scores for each cluster
The average of all cluster averages was 0.484
10 clusters had a score of 0.60 or greater
The data was then randomized by using the same GO terms but in different random clusters, giving an average score of 0.412-0.420 over 4 iterations
Each of the 4 iterations had only 1 or 0 clusters with a score of 0.60 or greater
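The randomization can be sketched like this, assuming it means pooling all GO terms and dealing them back into clusters of the original sizes. The slides do not give the exact procedure, so this is one plausible reading; the cluster contents are toy values and `shuffle_into_clusters` is a hypothetical helper.

```python
import random


def shuffle_into_clusters(clusters, rng):
    """Reassign the same items to random clusters of the same sizes.

    clusters: dict mapping cluster id -> list of GO terms.
    rng: a random.Random instance, so runs can be reproduced.
    """
    pool = [item for members in clusters.values() for item in members]
    rng.shuffle(pool)
    shuffled, start = {}, 0
    for cluster_id, members in clusters.items():
        shuffled[cluster_id] = pool[start:start + len(members)]
        start += len(members)
    return shuffled


# Hypothetical clusters of GO terms.
clusters = {
    "c1": ["GO:toy1", "GO:toy2", "GO:toy3"],
    "c2": ["GO:toy4", "GO:toy5"],
    "c3": ["GO:toy6", "GO:toy7", "GO:toy8", "GO:toy9"],
}

# Four independent randomizations, matching the 4 iterations reported.
runs = [shuffle_into_clusters(clusters, random.Random(seed)) for seed in range(4)]
for run in runs:
    print(run)
```

Each randomized run would then be scored with the same per-cluster average as the real clusters, giving a baseline to compare against.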
Community Profiling Results
[Figure: scatter plot of G-SESAME score (y-axis, 0 to 1) vs. cluster correlation coefficient (x-axis, 0.89 to 0.96)]
• Average of all clusters = 0.49
• 10 clusters are > 0.60
Random Results
[Figure: scatter plot of G-SESAME score (y-axis, 0 to 1) vs. cluster correlation coefficient (x-axis, 0.89 to 0.96), randomized clusters]
• Average of all clusters (4 iterations) = 0.41 - 0.42
• 1 or 0 clusters are > 0.60
BitTorrent
A peer-to-peer file sharing protocol
~ 27-55% of all Internet traffic
Mostly illegal file sharing
Files are shared in small
pieces between several
users
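To make the "small pieces" idea concrete, here is a minimal sketch of how a file's bytes can be cut into fixed-size pieces and each piece hashed with SHA-1, which is how the original BitTorrent protocol lets a downloader verify pieces received from different users. The piece length and data are toy values, and `piece_hashes` is a hypothetical helper, not part of any torrent client.

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int):
    """Split data into fixed-size pieces and return each piece's SHA-1 hex digest.

    The final piece may be shorter than piece_length.
    """
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    return [
        hashlib.sha1(data[i:i + piece_length]).hexdigest()
        for i in range(0, len(data), piece_length)
    ]


# Toy example: 10 bytes cut into pieces of 4 bytes gives pieces of 4, 4, and 2.
data = b"A" * 10
hashes = piece_hashes(data, piece_length=4)
print(len(hashes))
```

Because each piece is verified independently, pieces can arrive in any order and from any peer.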
Torrents for Biology
Why use torrent technology?
1. Download large datasets much faster
2. Searchable central listing
3. Decentralization of data
What is BioTorrents?
A legal file sharing website for scientists
Users can upload their own research results, data, software
Users can browse or search through all datasets
Data is not hosted on BioTorrents
www.biotorrents.net
Browse & Search
Details
Sign Up
Upload
Other Features
Forum
RSS Feed
Top 10
FAQ
Links
Who will upload data?
Everyone!
Realistically,
Large organizations (e.g. NCBI, CAMERA, etc.)
○ May need some convincing to host their data via
torrents in addition to FTP, HTTP, etc.
Scientists that really support open science
○ Sharing data before formally complete and
published
Technical Challenges
Many institutions frown on BitTorrent technology
A port must be opened/forwarded
Client program and computer must be left running
Ensuring data is legal, virus free, etc.
Users that upload many legitimate torrents will provide more confidence to people downloading
Making downloading and uploading easy