Sequence Clustering

Advancing Science with DNA Sequence


MGM Workshop

May 14, 2012

Reducing Search Space in Protein and DNA/RNA Sequence Analysis

Denis Kaznadzey, Prokaryotic Super Program


Sequence Clustering Outline

Classification of Sequences

General Problem of Clustering

Distance Measures

Ab Initio Clustering and Pledging

Performance Considerations

Case Study: Transcriptomics

Introduction to Protein Clustering


Classification as Research Tool

To deal with a huge variety of individual objects:

- Classify into groups of essentially similar objects

- When new data arrives, assign objects to existing groups

- Classify ‘leftovers’

- Occasionally review the entire classification

Problem: What is ‘essentially similar’?

• Finding properties that are important (ontological relevancy)

• Does classification reflect reality in any way?


Classification

Ways to classify objects:

- Spectral methods

- Parametric decomposition

- Clustering


Sequence Data Abundance

In modern biology, the most abundant type of data is sequence data:

• DNA

  • Genomic

  • Meta-Genomic

  • Environmental Samples (16S rDNA)

• RNA (cDNA libraries; RNA-Seq)

• Derived Proteins

How to compare sequences?

- Criteria depend on the application, e.g. GC content vs. order of bases.


Sequence Clustering

Select Applications in Genomic Sciences:

Genome Assembly: Binning, Scaffolding

Transcriptomics: RNA-Seq (read) clustering

Protein Function and Evolution studies: Protein families

Phylogenetic profiling: OTUs


Clustering is Crucial for MetaGenomics

METAGENOMICS

• Thousands of samples

• Hundreds of millions of reads per sample

• Trillions of base pairs

• Billions of genes

These are impossible to observe or analyze individually. Clustering becomes a strict requirement:

- Find what classes of sequences are seen

- Analyze classes rather than individual sequences


MetaGenomics Analysis Tasks

Primary tasks:

• Assess diversity

• Find genes

• Predict functions

• Predict pathways

• Estimate capabilities

Based on sequence comparison.





Distance Measures






Clustering in General

- Any clustering is based on a distance in some metric

- Initial clustering is based on pair-wise distances

- Subsequent classification is based on distances from objects to clusters (pledging)











Similarity Metrics

What is “similar”:

• The similarity measure should reflect “reality” as closely as possible

• This “reality” depends on the application:

  • Assembly: find identical sub-strings

  • Orthology detection: identify homologous proteins across species

  • Functional prediction: identify proteins with similar evolutionarily conserved motifs

Possible measures:

• Identity percentage

• Substitution matrix based

• Match to HMM or PSSM
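As an illustration of the simplest measure above, identity percentage can be computed from two already-aligned sequences. This is a minimal sketch: the function name is hypothetical, and skipping gapped columns is only one of several conventions in use (tools differ on the denominator).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: the inputs are rows of an existing pairwise alignment,
    so they have equal length. Gapped columns are ignored here.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT-A", "ACCTTA"))  # 4 matches out of 5 compared columns
```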


Similarity Measure

Computing the similarity measure:

- Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.

- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee

- K-mer statistics: CD-HIT, USEARCH, MUSCLE

- Suffix trees (and probabilistic suffix trees): MUMmer, Reputer, CLUSEQ

- Suffix arrays: Bowtie, BWT

- Position-specific scoring matrix: PSI-BLAST, IMPALA

- Hidden Markov Models: HMMER, HHSearch/HHPred, SAM
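To make the k-mer idea concrete, here is a minimal sketch of an alignment-free distance based on shared k-mers. The Jaccard formulation and the function names are assumptions chosen for illustration; the tools listed above use their own word-counting schemes.

```python
def kmer_set(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """1 - |shared k-mers| / |all k-mers|; 0.0 means identical k-mer content."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0  # both sequences are shorter than k
    return 1.0 - len(ka & kb) / len(union)


# Two of the six distinct 3-mers are shared, so the distance is 1 - 2/6.
print(kmer_jaccard_distance("ACGTAC", "ACGTTC"))
```

No alignment is computed, which is why measures of this kind scale to large data sets.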











Assembling Clusters

There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology):

• Linkage-based

  • Average linkage

  • Complete linkage

  • Single linkage

• Hierarchy-based

• Fitting-function-based (k-means)

• Non-linear classifiers (SOM, etc.)

• Greedy methods (iterative, suboptimal)


Linkage-Based Clustering

Average linkage

Single linkage

Complete linkage
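The three linkage types differ only in how the distance between two clusters is derived from the pairwise distances of their members. A minimal sketch (function and parameter names are illustrative):

```python
from itertools import product


def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters under a given linkage rule.

    single   -> smallest pairwise distance (closest members)
    complete -> largest pairwise distance (farthest members)
    average  -> mean of all pairwise distances
    """
    pair_distances = [dist(x, y) for x, y in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pair_distances)
    if method == "complete":
        return max(pair_distances)
    if method == "average":
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")


# Toy example with points on a line; the distance is the absolute difference.
a, b = [0, 1], [3, 5]
for m in ("single", "average", "complete"):
    print(m, linkage_distance(a, b, lambda x, y: abs(x - y), m))
```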


Hierarchical Clustering

- Build a tree representation of relationships

- Cut the branches using some quantitative criteria


Building the Tree

Criterion: more similar sequences appear on closer branches

This goal cannot be met exactly for practical distance measures

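[Figure: four objects A, B, C, D with pairwise distances, shown with two alternative leaf orderings of the tree: A B C D and A B D C]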

Solutions:

- Approximation methods: neighbor joining, UPGMA

- Search for the optimal tree by explicit criteria (maximum parsimony, maximum likelihood, etc.)


Suboptimal Tree Building

Neighbor joining (in the simple nearest-neighbor form described here, it corresponds to single-linkage clustering; the Saitou-Nei neighbor-joining algorithm uses corrected distances instead):

- Order edges by distance

- Join in order from short to long, merging branches as needed

Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (corresponds to average-linkage clustering):

For every pair of clusters (A, B), starting with all singletons:

- Compute the average of distances between every object in A and every object in B

- Merge the pair of clusters with the smallest average distance, and repeat until one cluster remains
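The UPGMA steps above can be sketched as a naive loop. This version recomputes averages from the original pairwise distances at every step, which is simple but slow; the data layout (a dict keyed by `frozenset` pairs) and the function name are assumptions made for the example, and labels are assumed to be unique.

```python
from itertools import combinations


def upgma(labels, dist):
    """Naive UPGMA; returns the merge history as (cluster_a, cluster_b, height).

    dist maps frozenset({x, y}) to the distance between objects x and y.
    Clusters are represented as tuples of member labels.
    """
    clusters = [(label,) for label in labels]  # start with all singletons
    merges = []

    def average_distance(a, b):
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        merges.append((a, b, average_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


toy = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
    frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 10,
}
for a, b, height in upgma("ABCD", toy):
    print(a, b, height)
```

Cutting the resulting tree at a chosen height (for example, keeping only merges below it) yields flat clusters.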


Global Fitting-Function Based

K-means clustering

- Pre-define the number of clusters

- Find an assignment of objects to clusters such that the sum of (squared) distances to the cluster means is minimal

- Computationally hard

- Heuristics are used; application-specific heuristics may be efficient
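The most common heuristic for k-means is Lloyd's iteration: assign each object to the nearest mean, recompute the means, repeat. The sketch below assumes objects are already embedded as numeric vectors (for sequences, e.g. k-mer frequency vectors, which is an assumption of this example) and that the caller supplies the initial centers.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, iterations=100):
    """Lloyd's heuristic; finds a local, not necessarily global, optimum."""
    centers = [tuple(c) for c in centers]
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, groups


centers, groups = kmeans([(0.0,), (1.0,), (9.0,), (10.0,)], [(0.0,), (10.0,)])
print(centers)
```

The result depends on the initial centers, which is one reason application-specific initialization can help.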


Non-Linear Methods

Self-Organizing Maps: a “self-learning” method

A neural network trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space


Pledging

Based on the distance from an object to a cluster, measured against:

- A single representative

- A set of representatives (in the extreme case, all members)

- Another measure, possibly unrelated to the initial one (profile, model)
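For the single-representative case, pledging reduces to a nearest-representative lookup with a cutoff. A minimal sketch, with hypothetical names and a Hamming distance standing in for a real sequence distance:

```python
def pledge(obj, representatives, dist, max_distance):
    """Return the id of the nearest cluster within max_distance, else None.

    representatives maps cluster id -> representative object.
    Objects that return None are the 'leftovers' to be clustered separately.
    """
    best_id, best_d = None, None
    for cluster_id, rep in representatives.items():
        d = dist(obj, rep)
        if best_d is None or d < best_d:
            best_id, best_d = cluster_id, d
    if best_d is not None and best_d <= max_distance:
        return best_id
    return None


def hamming(a, b):
    """Number of mismatching positions; assumes equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


reps = {"c1": "AAAA", "c2": "CCCC"}
print(pledge("AAAT", reps, hamming, max_distance=1))  # close to c1
print(pledge("ACGT", reps, hamming, max_distance=1))  # too far from both
```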












Distance computing is harder than clustering (bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)

• For large data sets, only k-mer and suffix array measures are practical

• However, incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.

• For boolean distances, iterative similarity detection is possible (no off-the-shelf implementations)

• Binning: pre-clustering by rough and fast methods

[Figure: binning example. 33 objects give 528 pairs to compare; after binning into 4 groups, 127 pairs remain]
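A greedy incremental scheme avoids the full distance matrix: each sequence is compared only against the current cluster representatives. The sketch below follows the general longest-first idea popularized by tools such as CD-HIT, but it is not their implementation; the names and the toy distance are assumptions.

```python
def mismatch_distance(a, b):
    """Toy distance: mismatches in the common prefix plus length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def greedy_cluster(sequences, dist, threshold):
    """Greedy incremental clustering.

    Sequences are processed longest first. Each one joins the first
    representative within `threshold`, or becomes a new representative.
    The first element of each returned cluster is its representative.
    """
    representatives = []
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for i, rep in enumerate(representatives):
            if dist(seq, rep) <= threshold:
                clusters[i].append(seq)
                break
        else:
            # No representative was close enough: start a new cluster.
            representatives.append(seq)
            clusters.append([seq])
    return clusters


print(greedy_cluster(["AAAA", "AAAT", "CCCC", "CCCG"], mismatch_distance, 1))
```

With N sequences and R representatives, at most N × R distances are computed instead of N(N-1)/2, at the cost of a result that depends on input order.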


Single Linkage is Fast

Time- and space-efficient clustering method: transitive closure-based

• Requires ‘boolean’ distances (two sequences are either linked or not linked)

• Requires the number of nodes to be known

• Space ~ number of nodes

• Run-time (worst case) ~ number of edges × average cluster size

• Run-time (average) ~ number of edges × log2(average cluster size)
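One common way to compute a transitive closure over boolean links is a disjoint-set (union-find) structure. The slides do not say which implementation was used, so the sketch below is an assumed illustration; its running time need not match the estimates quoted above. It needs the node count up front and stores only per-node arrays, consistent with the requirements listed.

```python
def single_linkage(node_count, links):
    """Single-linkage clusters from boolean links via union-find.

    links is an iterable of (a, b) pairs of node indices that are 'linked'.
    Memory use is proportional to node_count; links can be streamed.
    """
    parent = list(range(node_count))
    size = [1] * node_count

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already in the same cluster
        if size[ra] < size[rb]:
            ra, rb = rb, ra  # attach the smaller tree to the larger one
        parent[rb] = ra
        size[ra] += size[rb]

    clusters = {}
    for node in range(node_count):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


print(single_linkage(6, [(0, 1), (1, 2), (3, 4)]))
```

Note that a single extra link such as (2, 3) would fuse the two multi-member clusters into one.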


Single Linkage is Prone to Aggregation

Single-linkage clustering killer:

CLUSTER AGGREGATION

In large clusters, even a small number of random links leads to huge conglomerates.











Case Study: RNA-Seq Pipeline

Goals:

1. Compute transcript structures

2. Compute expression profiles (“virtual”)

Reads / EST db

Reads/EST clusters

Reads / clones attributed to particular source/condition

Counting reads originating from different sources

Source / condition specific expression profiles


RNAseq Analysis Solutions

Source: bioinfo.org, Macquarie University, Sydney


RNAseq Clustering

Approach outline:

1. Detect identities (common segments):

• Compute similarities

• Select the “good” ones

2. Merge sequences into groups with shared segments: SINGLE LINKAGE

Outcome:

The single biggest cluster contains more than 60% of all sequences

(selection by better similarity does not help)

What causes aggregation and how to fight it?


Aggregation in RNA-Seq Clustering

“Bad” identities:

- Pieces of vector constructs / adaptors

- Repeats

- Redundant sequences

- Spurious matches (short infrequent repeats)

- Chimeras (if pre-amplification is used)


Similarities Selection

Computing ‘boolean’ distances:

• Threshold-based

• Additional rules (match arrangement)

% identity + length + arrangement
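A boolean link can be sketched as a predicate over one pairwise match. Everything specific here is an assumption made for illustration: the `Match` fields, the default thresholds, and the arrangement rule, which accepts only end-to-end or containing overlaps (same orientation) and rejects matches buried inside both sequences, as a repeat would produce.

```python
from dataclasses import dataclass


@dataclass
class Match:
    identity: float  # percent identity of the aligned region
    length: int      # length of the aligned region
    start_a: int     # match start in sequence A (0-based)
    end_a: int       # match end in sequence A (exclusive)
    len_a: int       # total length of sequence A
    start_b: int
    end_b: int
    len_b: int


def is_link(m, min_identity=95.0, min_length=40, max_overhang=5):
    """True if the match passes identity, length, and arrangement rules."""
    if m.identity < min_identity or m.length < min_length:
        return False
    # Arrangement: on each side, at least one sequence must (nearly) end
    # where the match ends; otherwise both sequences diverge around it.
    left_overhang = min(m.start_a, m.start_b)
    right_overhang = min(m.len_a - m.end_a, m.len_b - m.end_b)
    return left_overhang <= max_overhang and right_overhang <= max_overhang


dovetail = Match(98.0, 50, 50, 100, 100, 0, 50, 100)
internal = Match(98.0, 50, 30, 80, 100, 20, 70, 100)
print(is_link(dovetail), is_link(internal))
```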


Trimming / Masking

Fighting aggregation

- Vector / adapter trimming:

  - Lucy, Figaro, etc.; integrated in many assembly suites (Newbler, Velvet, AMOS, CLCbio, etc.)

- Low-complexity detection / masking:

  - SEG, DUST, FastQC, WindowMasker, etc.; often integrated in search tools


Repeat Elimination

Regular (tandem) repeats:

• Pre-search masking: based on structure (IMEx, SRF) or on a database (TRDB)

• Post-search detection based on similarity properties (multiple parallel threads)


Repeat Elimination

Irregular (long) repeats:

• Database-based: RepeatMasker

• De novo (require a genome as input, construct a database):

  • RepeatScout

  • orrb

  • PILER, etc.


Detecting Chimeras

Detecting chimeric sequences:

• Abundance-based: Perseus, UCHIME

  • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangements are more frequent

• Specific to 16S: ChimeraSlayer, Bellerophon

  • Chimera ‘arms’ are closer to their originating clades than the entire chimera is


Detecting Chimeras

• Similarity coverage-based: Mira assembler


Detecting Chimeras

• Similarity graph topology-based: dchim

Alignment view Connectivity view











Protein Clustering

Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species

Position-specific scoring matrices and profile HMMs provide better sensitivity, but are SLOW

The problems are similar to those in RNA-Seq/EST clustering, but their causes are harder to fight

No ‘one size fits all’ solution: manual tuning and curation are required for comprehensive results, especially at a large scale

The results of clustering are precious; they are kept as databases (Pfam, COGs, KOGs, eggNOG)


Protein Clustering at JGI

Functional annotation of metagenome genes through protein clusters (IMG):

- Build a set of functionally homogeneous clusters of similar proteins for annotated genomes

- Build an HMM for each cluster; compose a model database

- Pledge metagenome proteins to clusters by matching them to the models

- Cluster unpledged proteins, build models, update the model database


Protein Clustering

Use of Protein Clusters reduces search space, but adds another level of indirection, which is a source of errors, and adds complexity that consumes effort

However, for proteins, which form dense relationship networks, clustering is a great tool

Konstantinos Mavrommatis will elaborate on protein clustering techniques


Thank you!
