Sequence Clustering
![Page 1: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/1.jpg)
Advancing Science with DNA Sequence
Sequence Clustering
MGM Workshop
May 14, 2012
Reducing Search Space in Protein and DNA/RNA Sequence Analysis
Denis Kaznadzey, Prokaryotic Super Program
![Page 2: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/2.jpg)
Sequence Clustering Outline
Classification of Sequences
General Problem of Clustering
Distance Measures
Ab Initio Clustering and Pledging
Performance Considerations
Case Study: Transcriptomics
Introduction to Protein Clustering
![Page 3: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/3.jpg)
Classification as Research Tool
To deal with a huge variety of individual objects:
- Classify into groups of essentially similar objects
- When new data arrives, assign objects to existing groups
- Classify ‘leftovers’
- Occasionally review the entire classification
Problem: What is ‘essentially similar’?
• Finding properties that are important (ontological relevancy)
• Does classification reflect reality in any way?
![Page 4: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/4.jpg)
Classification
Ways to classify objects:
- Spectral methods
- Parametric decomposition
- Clustering
![Page 5: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/5.jpg)
Sequence Data Abundance
In modern biology, the most abundant type of data is sequence:
• DNA
  • Genomic
  • Meta-Genomic
  • Environmental Samples (16S rDNA)
• RNA (cDNA libraries; RNA-Seq)
• Derived Proteins
How to compare sequences?
- Criteria depend on application, e.g. GC content vs. order of bases.
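To make the GC content vs. order of bases contrast concrete, here is a minimal Python sketch. The function names and the two example sequences are hypothetical, chosen only for illustration: the sequences have identical composition but share no base at any position.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (composition only, order is ignored)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def positional_identity(a, b):
    """Fraction of positions carrying the same base (order-sensitive)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical example: s2 is s1 rotated by two bases.
s1 = "ATGCATGC"
s2 = "GCATGCAT"

same_composition = gc_content(s1) == gc_content(s2)   # True: both are 0.5
order_identity = positional_identity(s1, s2)          # 0.0: no position matches
```

A composition-based criterion would call these two sequences identical, while an order-based criterion would call them unrelated, so the choice has to follow the application.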
![Page 6: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/6.jpg)
Sequence Clustering
Select Applications in Genomic Sciences:
Genome Assembly: Binning, Scaffolding
Transcriptomics: RNAseq (read) clustering
Protein Function and Evolution studies: Protein families
Phylogenetic profiling: OTUs
![Page 7: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/7.jpg)
Clustering is Crucial for MetaGenomics
METAGENOMICS
• Thousands of samples
• Hundreds of millions of reads per sample
• Trillions of base pairs
• Billions of genes
Impossible to observe/analyze individually.
Clustering becomes a strict requirement:
- Find what classes of sequences are seen
- Analyze classes rather than individual sequences
![Page 8: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/8.jpg)
MetaGenomics Analysis Tasks
Primary tasks:
• Assess diversity
• Find genes
• Predict functions
• Predict pathways
• Estimate capabilities
Based on sequence comparison.
![Page 10: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/10.jpg)
Clustering in General
- Any clustering is based on a distance in some metric
- Initial clustering is based on pair-wise distances
- Subsequent classification is based on distances from objects to clusters: Pledging
![Page 12: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/12.jpg)
Similarity Metrics
What is “similar”:
• A similarity measure should reflect “reality” as closely as possible
• This “reality” depends on the application:
  • Assembly: find identical sub-strings
  • Orthology detection: identify homologous proteins across species
  • Functional prediction: identify proteins with similar evolutionarily conserved motifs
Measure is:
Identity Percentage
Substitution matrix based
Match to HMM or PSSM
![Page 13: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/13.jpg)
Similarity Measure
Computing similarity measure:
- Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.
- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee
- K-mer statistics: CD-HIT, USEARCH, MUSCLE
- Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ
- Suffix arrays: Bowtie, BWT
- Position-specific scoring matrix: PSI-BLAST, IMPALA
- Hidden Markov Models: HMMER, HHsearch/HHpred, SAM
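As an illustration of the k-mer statistics family, the sketch below computes a Jaccard index over k-mer sets. This is a generic, assumed measure for illustration only; it is not the exact statistic used by any of the tools listed above, and the function names and the default `k=3` are hypothetical.

```python
def kmers(seq, k):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=3):
    """Shared k-mers divided by all k-mers seen in either sequence."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0  # both sequences shorter than k: nothing to compare
    return len(ka & kb) / len(ka | kb)


identical = kmer_jaccard("ACGTAC", "ACGTAC")   # 1.0
unrelated = kmer_jaccard("AAAA", "CCCC")       # 0.0
```

No alignment is computed, which is why measures of this kind scale to large data sets.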
![Page 15: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/15.jpg)
Assembling Clusters
There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology):
• Linkage-based
  • Average linkage
  • Complete linkage
  • Single linkage
• Hierarchy-based
• Fitting function-based (K-means)
• Non-linear classifiers (SOM, etc.)
• Greedy methods (iterative, suboptimal)
![Page 16: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/16.jpg)
Linkage-Based Clustering
Average linkage
Single linkage
Complete linkage
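The three linkage criteria differ only in how the pairwise distances between two clusters are summarized: single linkage takes the minimum, complete linkage the maximum, and average linkage the mean. A minimal Python sketch follows; the function name and the toy one-dimensional data are hypothetical.

```python
from itertools import product


def linkage_distance(cluster_a, cluster_b, dist, method):
    """Distance between two clusters under the chosen linkage criterion."""
    pair_distances = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair of members
        return min(pair_distances)
    if method == "complete":    # farthest pair of members
        return max(pair_distances)
    if method == "average":     # mean over all member pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")


# Toy example: points on a line, distance = absolute difference.
def point_dist(x, y):
    return abs(x - y)


left, right = [0, 1], [4, 6]
single = linkage_distance(left, right, point_dist, "single")      # 3
complete = linkage_distance(left, right, point_dist, "complete")  # 6
average = linkage_distance(left, right, point_dist, "average")    # 4.5
```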
![Page 17: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/17.jpg)
Hierarchical Clustering
- Build a tree representation of relationships
- Cut the branches using some quantitative criteria
![Page 18: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/18.jpg)
Building the Tree
Criterion: more similar sequences appear on closer branches
This goal is not achievable for practical distance measures
[Figure: four objects A, B, C, D connected by edges labeled with distances, shown next to two alternative leaf orders, A B C D and A B D C]
Solutions:
- Approximation methods: neighbor joining, UPGMA
- Search for the optimal tree by explicit criteria (maximum parsimony, maximum likelihood, etc.)
![Page 19: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/19.jpg)
Suboptimal Tree Building
Neighbor joining (in the simplified form given here, corresponds to single-linkage clustering):
- Order edges by distance
- Join in order from short to long, merging branches as needed
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (corresponds to average-linkage clustering):
Starting with all singletons, repeat until one cluster remains:
- For every pair of clusters (A, B), compute the average of distances between every object in A and every object in B
- Merge the pair of clusters with the smallest average distance
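The UPGMA steps can be sketched in a few lines of Python. This is a naive illustration that recomputes all averages at every step, so it is not an efficient implementation; the function name, the nested-tuple tree representation and the distance values are assumptions made for the example.

```python
from itertools import combinations


def upgma(labels, dist):
    """Build a tree as nested tuples; dist maps frozenset({a, b}) to a distance."""
    # Each cluster is (subtree, list of member labels); start with singletons.
    clusters = [(label, [label]) for label in labels]

    def average_distance(i, j):
        members_a, members_b = clusters[i][1], clusters[j][1]
        total = sum(dist[frozenset((x, y))] for x in members_a for y in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: average_distance(*pair))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical pairwise distances for four objects.
pairs = [("A", "B", 1), ("C", "D", 2), ("A", "C", 4),
         ("A", "D", 4), ("B", "C", 3), ("B", "D", 4)]
distances = {frozenset((a, b)): d for a, b, d in pairs}
tree = upgma(["A", "B", "C", "D"], distances)
```

With these distances A and B merge first, then C and D, and finally the two pairs, giving `(("A", "B"), ("C", "D"))`.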
![Page 20: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/20.jpg)
Global Fitting-Function Based
K-means clustering
- Pre-define the number of clusters
- Find an assignment of objects to clusters so that the sum of (squared) distances to the cluster means is minimal
- Computationally hard
- Heuristics are used; application-specific heuristics may be efficient
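One widely used heuristic is Lloyd's iteration: assign every object to its nearest mean, recompute the means, and repeat until nothing changes. The sketch below applies it to one-dimensional points; the function name, the data and the starting centers are hypothetical, and the result depends on the starting centers, which is exactly why this is a heuristic rather than an exact solution.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Lloyd's heuristic on 1-D points; returns (centers, groups)."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group.
        new_centers = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:   # converged: assignments are stable
            break
        centers = new_centers
    return centers, groups


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centers, final_groups = kmeans_1d(points, centers=[0.0, 5.0])
```

Starting from centers 0.0 and 5.0, the iteration settles on centers 2.0 and 11.0 after three passes.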
![Page 21: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/21.jpg)
Non-Linear Methods
Self-Organizing Maps: “self-learning” method
A neural network trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space
![Page 22: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/22.jpg)
Pledging
Based on distance to cluster
-Representative
-Set of representatives (all at extreme)
-Other measure, may be unrelated to the initial one (profile, model)
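A minimal sketch of pledging against a single representative per cluster, assuming a mismatch-count distance on equal-length sequences and a hypothetical cutoff `max_distance`. Objects that are too far from every representative are returned as unpledged (`None`) and would go on to be clustered as leftovers.

```python
def mismatches(a, b):
    """Number of differing positions (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b))


def pledge(obj, representatives, dist, max_distance):
    """Return the id of the closest cluster, or None if none is close enough."""
    best_id, best_d = None, None
    for cluster_id, representative in representatives.items():
        d = dist(obj, representative)
        if best_d is None or d < best_d:
            best_id, best_d = cluster_id, d
    if best_d is not None and best_d <= max_distance:
        return best_id
    return None


# Hypothetical cluster representatives.
reps = {"c1": "ACGTACGT", "c2": "TTTTGGGG"}
assigned = pledge("ACGTACGA", reps, mismatches, max_distance=2)   # "c1"
leftover = pledge("CCCCCCCC", reps, mismatches, max_distance=2)   # None
```

Only one comparison per cluster is needed, instead of one per cluster member, which is where the reduction in search space comes from.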
![Page 24: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/24.jpg)
Performance Considerations
Distance computing is harder than clustering (Bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)
• For large data sets only k-mer and suffix array measures are practical
• However: incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.
• For boolean distance, iterative similarity detection is possible (no off-the-shelf implementations)
• Binning: pre-clustering by rough and fast methods
[Figure: 33 objects, 528 pairs; after binning into 4 groups, 127 pairs]
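The pair counts quoted for 33 objects follow from the n(n-1)/2 formula for all-against-all comparison. The slide does not give the sizes of the 4 groups; the split 11 + 9 + 7 + 6 used below is an assumption, chosen because it is one split of 33 objects that yields exactly 127 within-group pairs.

```python
def pair_count(n):
    """Number of unordered pairs among n objects."""
    return n * (n - 1) // 2


all_pairs = pair_count(33)                     # every object against every other

# Assumed group sizes (not stated on the slide); they sum to 33.
group_sizes = [11, 9, 7, 6]
binned_pairs = sum(pair_count(size) for size in group_sizes)

saving = 1 - binned_pairs / all_pairs          # fraction of comparisons avoided
```

Under this assumed split, binning avoids roughly three quarters of the pairwise comparisons.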
![Page 25: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/25.jpg)
Single Linkage is Fast
Time- and space-efficient clustering method: transitive closure-based
• Requires ‘boolean’ distances (two sequences can be linked or not linked)
• Requires the number of nodes to be known
• Space ~ NodesNo
• Run-time (worst) ~ EdgesNo * AveClustSize
• Run-time (average) ~ EdgesNo * log2(AveClustSize)
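One common way to compute the transitive closure over boolean links is a union-find (disjoint-set) structure, sketched below. The slides do not say which implementation was used, so the run-time estimates above need not apply to this sketch; what it does share is space proportional to the number of nodes and the need to know that number in advance.

```python
def single_linkage(n_nodes, edges):
    """Group nodes 0..n_nodes-1 into clusters connected by the given links."""
    parent = list(range(n_nodes))   # one slot per node: space ~ number of nodes
    size = [1] * n_nodes

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:              # one pass over the links
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue                # already in the same cluster
        if size[root_a] < size[root_b]:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a     # attach the smaller cluster to the larger
        size[root_a] += size[root_b]

    clusters = {}
    for node in range(n_nodes):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


# Hypothetical links between six sequences, identified by index.
result = single_linkage(6, [(0, 1), (1, 2), (3, 4)])
```

Here sequences 0, 1 and 2 end up in one cluster even though 0 and 2 are not directly linked, which is the defining property of single linkage.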
![Page 26: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/26.jpg)
Single Linkage is Prone to Aggregation
Single-linkage clustering killer:
CLUSTER AGGREGATION
In large clusters, even a small number of random links leads to huge conglomerates.
![Page 28: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/28.jpg)
Case Study: RNA-Seq Pipeline
Goals:
1. Compute transcript structures
2. Compute expression profiles (“virtual”)
Reads/ ESTdb
Reads/EST clusters
Reads / clones attributed to particular source/condition
Counting reads originating from different sources
Source / condition specific expression profiles
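The counting step that turns clusters into “virtual” expression profiles can be sketched as follows, assuming each read already carries a cluster assignment and a source/condition label. The function name, the read identifiers and the labels are all hypothetical.

```python
from collections import Counter, defaultdict


def expression_profiles(read_cluster, read_source):
    """Count reads per cluster separately for every source/condition."""
    profiles = defaultdict(Counter)
    for read, cluster in read_cluster.items():
        profiles[read_source[read]][cluster] += 1
    return {source: dict(counts) for source, counts in profiles.items()}


# Hypothetical input: cluster assignment and origin of four reads.
read_cluster = {"r1": "clusterA", "r2": "clusterA", "r3": "clusterB", "r4": "clusterA"}
read_source = {"r1": "leaf", "r2": "root", "r3": "leaf", "r4": "leaf"}

profiles = expression_profiles(read_cluster, read_source)
```

In a real pipeline the raw counts would usually be normalized, for example by library size, before profiles from different sources are compared.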
![Page 29: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/29.jpg)
RNAseq Analysis Solutions
Source: bioinfo.org, Macquarie University, Sydney
![Page 30: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/30.jpg)
RNAseq Clustering
Approach Outline:
1. Detect identities (common segments):
• Compute similarities
• Select the “good” ones
2. Merge sequences into groups with shared segments: SINGLE LINKAGE
Outcome:
The single biggest cluster contains more than 60% of all sequences
(selection by better similarity does not help)
What causes aggregation and how to fight it?
![Page 31: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/31.jpg)
Aggregation in RNA-Seq Clustering
“Bad” identities:
- Pieces of vector constructs / adaptors
- Repeats
- Redundant sequences
- Spurious matches (short infrequent repeats)
- Chimeras (if pre-amplification is used)
![Page 32: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/32.jpg)
Similarities Selection
Computing ‘boolean’ distances:
• Threshold-based
• Additional rules (match arrangement)
% identity + length + arrangement:
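A boolean link decision of this kind can be sketched as a simple predicate. The threshold values are hypothetical defaults, and `arrangement_ok` is a placeholder for whatever match-arrangement rule applies; the slides do not spell that rule out, so it is passed in as an already computed flag.

```python
def is_linked(identity_pct, aligned_length, arrangement_ok=True,
              min_identity=95.0, min_length=50):
    """True if a pairwise match is good enough to link two sequences.

    min_identity and min_length are assumed example thresholds;
    arrangement_ok stands in for an unspecified match-arrangement rule.
    """
    return (identity_pct >= min_identity
            and aligned_length >= min_length
            and arrangement_ok)


good = is_linked(98.5, 120)                            # passes all three criteria
too_short = is_linked(99.0, 30)                        # fails the length threshold
misplaced = is_linked(99.0, 120, arrangement_ok=False)  # fails the arrangement rule
```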
![Page 33: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/33.jpg)
Trimming / Masking
Fighting aggregation
- Vector / adapter trimming:
  - Lucy, Figaro, etc.; integrated in many assembly suites (newbler, velvet, AMOS, CLCbio, etc.)
- Low complexity detection / masking:
  - SEG, DUST, FastQC, WindowMasker, etc.; often integrated in search tools
![Page 34: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/34.jpg)
Repeat Elimination
Regular (tandem) repeats:
• Pre-search masking: based on structure (IMEx, SRF) or on database (TRDB)
• Post-search detection based on similarity properties (multiple parallel threads)
![Page 35: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/35.jpg)
Repeat Elimination
Irregular (long) repeats:
• Database-based: RepeatMasker
• De novo:
  • RepeatScout,
  • orrb,
  • PILER, etc.
(Require genome as input, construct database)
![Page 36: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/36.jpg)
Detecting Chimeras
Detecting chimeric sequences:
• Abundance-based: Perseus, UCHIME
  • Chimeras undergo fewer amplification cycles, so chimera segments in native arrangements are more frequent
• Specific to 16S: ChimeraSlayer, Bellerophon
  • Chimera ‘arms’ are closer to originating clades than the entire chimera
![Page 37: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/37.jpg)
Detecting Chimeras
• Similarity coverage-based: Mira assembler
![Page 38: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/38.jpg)
Detecting Chimeras
• Similarity graph topology-based: dchim
Alignment view Connectivity view
![Page 40: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/40.jpg)
Protein Clustering
Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species
Position-specific scoring matrices and profile HMMs provide better sensitivity, but are SLOW
Similar problems as for RNA-Seq/EST clustering, but their causes are harder to fight
No ‘one size fits all’ solution: manual tuning and curation required for comprehensive results, especially at a large scale
The results of clustering are precious; they are kept as databases (PFAM, COGs, KOGs, eggNOG)
![Page 41: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/41.jpg)
Protein Clustering at JGI
Functional annotation of metagenome genes through protein clusters (IMG):
- Build a set of functionally homogeneous clusters of similar proteins (for annotated genomes)
- Build HMM for each cluster, compose model database
- Pledge metagenome proteins to clusters by matching to models
- Cluster unpledged proteins, build models, update model database
![Page 42: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/42.jpg)
Protein Clustering
Use of Protein Clusters reduces search space, but adds another level of indirection, which is a source of errors, and adds complexity that consumes effort
However, for proteins, which form dense relationship networks, clustering is a great tool
Konstantinos Mavrommatis will elaborate on protein clustering techniques
![Page 43: Sequence Clustering](https://reader035.fdocuments.net/reader035/viewer/2022062222/56814c88550346895db9a51a/html5/thumbnails/43.jpg)
Thank you!