Sequence Clustering

Advancing Science with DNA Sequence. Sequence Clustering: Reducing Search Space in Protein and DNA/RNA Sequence Analysis. Denis Kaznadzey, Prokaryotic Super Program. MGM Workshop, May 14, 2012.

Page 1: Sequence Clustering

Advancing Science with DNA Sequence

Sequence Clustering

MGM Workshop

May 14, 2012

Reducing Search Space in Protein and DNA/RNA Sequence Analysis

Denis Kaznadzey, Prokaryotic Super Program

Page 2: Sequence Clustering

Sequence Clustering Outline

Classification of Sequences

General Problem of Clustering

Distance Measures

Ab Initio Clustering and Pledging

Performance Considerations

Case Study: Transcriptomics

Introduction to Protein Clustering

Page 3: Sequence Clustering

Classification as Research Tool

To deal with a huge variety of individual objects:

- Classify into groups of essentially similar objects

- When new data arrives, assign objects to existing groups

- Classify ‘leftovers’

- Occasionally review the entire classification

Problem: What is ‘essentially similar’?

• Finding properties that are important (ontological relevancy)

• Does classification reflect reality in any way?

Page 4: Sequence Clustering

Classification

Ways to classify objects:

- Spectral methods

- Parametric decomposition

- Clustering

Page 5: Sequence Clustering

Sequence Data Abundance

In modern biology, the most abundant type of data is sequence:

• DNA

  • Genomic

  • Meta-Genomic

  • Environmental Samples (16S rDNA)

• RNA (cDNA libraries; RNA-Seq)

• Derived Proteins

How to compare sequences?

- Criteria depend on the application, e.g. GC content vs. order of bases.

Page 6: Sequence Clustering

Sequence Clustering

Select Applications in Genomic Sciences:

Genome Assembly: Binning, Scaffolding

Transcriptomics: RNAseq (read) clustering

Protein Function and Evolution studies: Protein families

Phylogenetic profiling: OTUs

Page 7: Sequence Clustering

Clustering is Crucial for MetaGenomics

METAGENOMICS

• Thousands of samples

• Hundreds of millions of reads per sample

• Trillions of base pairs

• Billions of genes

These are impossible to observe or analyze individually, so clustering becomes a strict requirement:

- Find what classes of sequences are seen

- Analyze classes rather than individual sequences

Page 8: Sequence Clustering

MetaGenomics Analysis Tasks

Primary tasks:

• Assess diversity

• Find genes

• Predict functions

• Predict pathways

• Estimate capabilities

Based on sequence comparison.

Page 9: Sequence Clustering

Sequence Clustering Outline

Classification of Sequences

General Problem of Clustering

Distance Measures

Ab Initio Clustering and Pledging

Performance Considerations

Case Study: Transcriptomics

Introduction to Protein Clustering

Page 10: Sequence Clustering

Clustering in General

- Any Clustering is based on the Distance in some Metric

- Initial clustering is based on pair-wise distances

- Subsequent classification is based on distances from objects to clusters: Pledging

Page 11: Sequence Clustering

Sequence Clustering Outline

Classification of Sequences

General Problem of Clustering

Distance Measures

Ab Initio Clustering and Pledging

Performance Considerations

Case Study: Transcriptomics

Introduction to Protein Clustering

Page 12: Sequence Clustering

Similarity Metrics

What is “similar”:

• A similarity measure should reflect “reality” as closely as possible

• This “reality” depends on the application:

  • Assembly: find identical sub-strings

  • Orthology detection: identify homologous proteins across species

  • Functional prediction: identify proteins with similar evolutionarily conserved motifs

Possible measures:

Identity Percentage

Substitution matrix based

Match to HMM or PSSM
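
As an illustration of the simplest measure listed above, identity percentage, here is a minimal Python sketch. It assumes the two sequences are already aligned to equal length with `-` as the gap character, and it skips gapped columns; both the function name and that gap convention are assumptions made for this example, since tools differ in how they count gaps.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are ignored
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT-A", "ACCTTA"))
```

A substitution-matrix score would replace the match/mismatch test with a table lookup per column; the loop structure stays the same.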

Page 13: Sequence Clustering

Similarity Measure

Computing similarity measure:

- Edit distance or (ungapped) statistics P-value: BLAST, Fasta, needle, water, etc.

- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-coffee

- K-mer statistics: CD-HIT, USEARCH, MUSCLE

- Suffix trees (and probabilistic suffix trees): MUMmer, Reputer, CLUSEQ

- Suffix Arrays: Bowtie, BWT

- Position-Specific scoring matrix: PSI-Blast, Impala

- Hidden Markov Models: HMMer, HHSearch/HHPred, SAM
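
To make the first and third entries of this list concrete, the sketch below computes a plain edit (Levenshtein) distance and a k-mer set similarity. It is illustrative only: the Jaccard index over k-mer sets is a stand-in chosen for this example and is not the specific statistic used by CD-HIT, USEARCH, or MUSCLE, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def kmers(seq, k):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=3):
    """Shared k-mers divided by all k-mers seen in either sequence."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


print(edit_distance("kitten", "sitting"), kmer_jaccard("ACGTA", "CGTAC"))
```

The edit distance costs time proportional to the product of the two lengths for every pair, while the k-mer comparison only needs set operations, which is why k-mer measures scale to much larger data sets.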

Page 14: Sequence Clustering

Sequence Clustering Outline

Classification of Sequences

General Problem of Clustering

Distance Measures

Ab Initio Clustering and Pledging

Performance Considerations

Case Study: Transcriptomics

Introduction to Protein Clustering

Page 15: Sequence Clustering

Assembling Clusters

There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology):

• Linkage-based

  • Average linkage

  • Complete linkage

  • Single linkage

• Hierarchy-based

• Fitting function-based (K-means)

• Non-linear classifiers (SOM, etc.)

• Greedy methods (iterative, suboptimal)

Page 16: Sequence Clustering

Linkage-Based Clustering

Average linkage

Single linkage

Complete linkage

Page 17: Sequence Clustering

Hierarchical Clustering

- Build a tree representation of relationships

- Cut the branches using some quantitative criteria

Page 18: Sequence Clustering

Building the Tree

Criterion: more similar sequences appear on closer branches

This goal is generally not achievable exactly for practical distance measures

[Figure: four objects A, B, C, D with labeled pairwise distances, and alternative leaf orderings of the resulting tree (A B C D vs. A B D C)]

Solutions:

- Approximation methods: neighbor joining, UPGMA

- Search for the optimal tree by explicit criteria (maximum parsimony, maximum likelihood, etc.)

Page 19: Sequence Clustering

Suboptimal Tree Building

Neighbor joining (as described here, corresponds to single-linkage clustering):

- Order edges by distance

- Join in order from short to long, merging branches as needed

(The Saitou and Nei neighbor-joining algorithm used in phylogenetics is a different procedure.)

Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (corresponds to average-linkage clustering):

For every pair of clusters (A, B), starting with all singletons:

- Compute the average of distances between every object in A and every object in B

- Merge the two clusters with the smallest average distance

- Repeat until a single tree remains
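
The UPGMA procedure can be sketched in a few lines of Python. This is a deliberately naive version, assuming distances are supplied as a dictionary keyed by `frozenset` pairs of labels (a representation chosen for this example); it recomputes every average from the original object distances rather than using the usual incremental update, so it is only suitable for small inputs.

```python
from itertools import combinations


def average_distance(members_a, members_b, dist):
    """Mean distance between every object in A and every object in B."""
    total = sum(dist[frozenset((x, y))] for x in members_a for y in members_b)
    return total / (len(members_a) * len(members_b))


def upgma(labels, dist):
    """Return the tree as nested tuples; expects at least one label."""
    clusters = [(label, [label]) for label in labels]  # (subtree, members)
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(clusters[p[0]][1], clusters[p[1]][1], dist),
        )
        (tree_i, mem_i), (tree_j, mem_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), mem_i + mem_j)
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]


d = {frozenset(p): v for p, v in {
    ("A", "B"): 1, ("C", "D"): 2,
    ("A", "C"): 6, ("A", "D"): 6, ("B", "C"): 6, ("B", "D"): 6,
}.items()}
print(upgma(["A", "B", "C", "D"], d))
```

Replacing the average in `average_distance` with `min` or `max` would give single-linkage or complete-linkage merging under the same loop.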

Page 20: Sequence Clustering

Global Fitting-Function Based

K-means clustering

- Pre-define the number of clusters

- Find an assignment of objects to clusters such that the sum of distances to the cluster means is minimal

- Computationally hard

- Heuristics are used; application-specific heuristics may be efficient
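
The most common heuristic for this problem is Lloyd's iteration, sketched below. Two assumptions are worth stating: objects are numeric tuples (sequences would first have to be embedded as vectors, for example k-mer count vectors), and the initial centers are passed in explicitly so the example is deterministic. The function names are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, centers, iterations=100):
    """Lloyd's heuristic: alternate assignment and mean update until stable."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        new_centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(kmeans(pts, [(1.0, 1.0), (9.0, 1.0)]))
```

The result depends on the starting centers and is only a local optimum, which is the practical face of the "computationally hard" bullet above.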

Page 21: Sequence Clustering

Non-Linear Methods

Self-Organizing Maps: a “self-learning” method

A neural network trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space

Page 22: Sequence Clustering

Pledging

Based on distance to cluster

- Representative

- Set of representatives (all at extreme)

- Other measure, may be unrelated to the initial one (profile, model)
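
A minimal sketch of pledging against a single representative per cluster follows. The Hamming distance and the `max_distance` cutoff are placeholders chosen to keep the example self-contained; any of the distance measures discussed earlier could be passed in instead, and the names used here are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pledge(item, representatives, distance, max_distance):
    """Return the id of the nearest cluster, or None if nothing is close enough."""
    best_id, best_d = None, None
    for cluster_id, rep in representatives.items():
        d = distance(item, rep)
        if best_d is None or d < best_d:
            best_id, best_d = cluster_id, d
    if best_d is not None and best_d <= max_distance:
        return best_id
    return None  # a 'leftover' to be clustered later


reps = {"c1": "AAAA", "c2": "CCCC"}
print(pledge("AAAT", reps, hamming, 1), pledge("GGGG", reps, hamming, 1))
```

Each new object costs one comparison per cluster rather than one per existing object, which is where the reduction in search space comes from.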

Page 23: Sequence Clustering

Sequence Clustering Outline

Classification of Sequences

General Problem of Clustering

Distance Measures

Ab Initio Clustering and Pledging

Performance Considerations

Case Study: Transcriptomics

Introduction to Protein Clustering

Page 24: Sequence Clustering

Performance Considerations

Distance computing is harder than clustering (bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)

• For large data sets only k-mer and suffix array measures are practical

• However: incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.

• For boolean distance, iterative similarity detection is possible (no off-the-shelf implementations)

• Binning: pre-clustering by rough and fast methods

[Figure: 33 objects, 528 pairs; after binning into 4 groups, 127 pairs]
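
The pair counts in the binning figure can be checked with a short calculation: n objects give n(n-1)/2 pairs, and after binning only pairs inside the same group need to be compared. The slide does not state the group sizes; the split 6, 7, 9, 11 used below is an assumption that happens to reproduce the 127 figure.

```python
from math import comb


def all_pairs(n):
    """Number of unordered pairs among n objects."""
    return comb(n, 2)


def within_group_pairs(sizes):
    """Pairs that remain when comparisons are restricted to each group."""
    return sum(comb(s, 2) for s in sizes)


sizes = [6, 7, 9, 11]  # hypothetical group sizes summing to 33
print(all_pairs(sum(sizes)), within_group_pairs(sizes))
```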

Page 25: Sequence Clustering

Single Linkage is Fast

Time- and space- efficient clustering method: transitive closure-based

• Requires ‘boolean’ distances (two sequences are either linked or not linked)

• Requires the number of nodes to be known

• Space ~ NodesNo

• Run-time (worst) ~ EdgesNo * AveClustSize

• Run-time (average) ~ EdgesNo * log2(AveClustSize)
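
One standard way to compute a transitive closure over boolean links is a disjoint-set (union-find) structure, sketched below. This is not necessarily the implementation the slide's run-time estimates refer to; it is shown because it has the same requirements: the node count is known up front, memory is proportional to the number of nodes, and edges can be streamed one at a time.

```python
def single_linkage(n_nodes, edges):
    """Cluster nodes 0..n_nodes-1 by transitive closure of boolean links."""
    parent = list(range(n_nodes))
    size = [1] * n_nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already in the same cluster
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra  # attach the smaller tree under the larger
        size[ra] += size[rb]

    clusters = {}
    for node in range(n_nodes):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


print(single_linkage(6, [(0, 1), (1, 2), (3, 4)]))
```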

Page 26: Sequence Clustering

Single Linkage is Prone to Aggregation

Single-linkage clustering killer:

CLUSTER AGGREGATION

In large clusters, even a small number of random links leads to huge conglomerates.
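
The effect is easy to reproduce on a toy graph: two unrelated families stay separate until a single spurious link is added, at which point single linkage reports one cluster. The node names and links below are invented for illustration.

```python
from collections import defaultdict, deque


def components(nodes, links):
    """Connected components of an undirected graph, found by breadth-first search."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        result.append(sorted(comp))
    return result


nodes = ["a1", "a2", "a3", "b1", "b2", "b3"]
family_a = [("a1", "a2"), ("a2", "a3")]
family_b = [("b1", "b2"), ("b2", "b3")]
spurious = [("a3", "b1")]  # e.g. a shared adaptor fragment

print(len(components(nodes, family_a + family_b)))
print(len(components(nodes, family_a + family_b + spurious)))
```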

Page 27: Sequence Clustering

Sequence Clustering Outline

Classification of Sequences

General Problem of Clustering

Distance Measures

Ab Initio Clustering and Pledging

Performance Considerations

Case Study: Transcriptomics

Introduction to Protein Clustering

Page 28: Sequence Clustering

Case Study: RNA-Seq Pipeline

Goals:

1. Compute transcript structures

2. Compute expression profiles (“virtual”)

Pipeline:

Reads / EST db

→ Reads / EST clusters

→ Reads / clones attributed to particular source / condition

→ Counting reads originating from different sources

→ Source / condition specific expression profiles
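
The counting step at the end of this pipeline amounts to tallying, for each source or condition, how many reads fall into each cluster. A minimal sketch, assuming reads have already been assigned a cluster and a source (the two input dictionaries and all identifiers are hypothetical):

```python
from collections import Counter, defaultdict


def expression_profiles(read_cluster, read_source):
    """Count reads per cluster, separately for each source or condition."""
    profiles = defaultdict(Counter)
    for read, cluster in read_cluster.items():
        profiles[read_source[read]][cluster] += 1
    return {source: dict(counts) for source, counts in profiles.items()}


read_cluster = {"r1": "t1", "r2": "t1", "r3": "t2", "r4": "t1"}
read_source = {"r1": "leaf", "r2": "leaf", "r3": "leaf", "r4": "root"}
print(expression_profiles(read_cluster, read_source))
```

Raw counts like these would normally be normalized for library size and transcript length before comparing conditions; that step is outside the scope of this sketch.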

Page 29: Sequence Clustering

RNAseq Analysis Solutions

Source: bioinfo.org, Macquarie University, Sydney

Page 30: Sequence Clustering

RNAseq Clustering

Approach Outline:

1. Detect identities (common segments):

• Compute similarities

• Select the “good” ones

2. Merge sequences into groups with shared segments: SINGLE LINKAGE

Outcome:

The single biggest cluster contains more than 60% of all sequences

(selection by better similarity does not help)

What causes aggregation and how to fight it?

Page 31: Sequence Clustering

Aggregation in RNA-Seq Clustering

“Bad” identities:

- Pieces of vector constructs / adaptors

- Repeats

- Redundant sequences

- Spurious matches (short infrequent repeats)

- Chimeras (if pre-amplification is used)

Page 32: Sequence Clustering

Similarities Selection

Computing ‘boolean’ distances:

• Threshold-based

• Additional rules (match arrangement)

% identity + length + arrangement:
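
A rule of this kind can be written as a small predicate over a pairwise match. Everything specific in the sketch below is an assumption made for illustration: the `Match` fields, the 0-based half-open coordinates, the threshold values, and the particular arrangement rule (the match must reach within `slack` bases of an end of each sequence, which rejects matches that are internal to both).

```python
from dataclasses import dataclass


@dataclass
class Match:
    identity: float  # percent identity of the aligned region
    length: int      # aligned length
    a_start: int     # match coordinates on sequence A (0-based, half-open)
    a_end: int
    a_len: int       # full length of sequence A
    b_start: int     # match coordinates on sequence B
    b_end: int
    b_len: int       # full length of sequence B


def touches_end(start, end, seq_len, slack):
    """True if the match comes within `slack` bases of either sequence end."""
    return start <= slack or end >= seq_len - slack


def is_linked(m, min_identity=95.0, min_length=40, slack=5):
    """Boolean distance: thresholds on identity and length plus an arrangement rule."""
    if m.identity < min_identity or m.length < min_length:
        return False
    return (touches_end(m.a_start, m.a_end, m.a_len, slack)
            and touches_end(m.b_start, m.b_end, m.b_len, slack))


overlap = Match(98.0, 60, 140, 200, 200, 0, 60, 300)
internal = Match(98.0, 60, 70, 130, 200, 120, 180, 300)
print(is_linked(overlap), is_linked(internal))
```

Matches that sit in the middle of both sequences are typical of repeats and adaptor remnants, so an arrangement rule removes some of the links that cause aggregation.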

Page 33: Sequence Clustering

Trimming / Masking

Fighting aggregation

- Vector / adapter trimming: Lucy, Figaro, etc.; integrated in many assembly suites (newbler, velvet, AMOS, CLCbio, etc.)

- Low complexity detection / masking: SEG, DUST, FastQC, WindowMasker, etc.; often integrated in search tools

Page 34: Sequence Clustering

Repeat Elimination

Regular (tandem) repeats:

• Pre-search masking: based on structure (IMEx, SRF) or on database (TRDB)

• Post-search detection based on similarity properties (multiple parallel threads)

Page 35: Sequence Clustering

Repeat Elimination

Irregular (long) repeats:

• Database-based: RepeatMasker

• De-novo:

  • RepeatScout

  • orrb

  • PILER, etc.

(Require genome as input, construct database)

Page 36: Sequence Clustering

Detecting Chimeras

Detecting chimeric sequences:

• Abundance-based: Perseus, UCHIME

  • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangements are more frequent

• Specific to 16S: ChimeraSlayer, Bellerophon

  • Chimera ‘arms’ are closer to their originating clades than the entire chimera

Page 37: Sequence Clustering

Detecting Chimeras

• Similarity coverage-based: Mira assembler

Page 38: Sequence Clustering

Detecting Chimeras

• Similarity graph topology-based: dchim

[Figure panels: alignment view; connectivity view]

Page 39: Sequence Clustering

Sequence Clustering Outline

Classification of Sequences

General Problem of Clustering

Distance Measures

Ab Initio Clustering and Pledging

Performance Considerations

Case Study: Transcriptomics

Introduction to Protein Clustering

Page 40: Sequence Clustering

Protein Clustering

Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species

Position-specific scoring matrices and profile-HMMs provide better sensitivity, but are SLOW

Similar problems as for RNA-Seq/EST clustering, but their causes are harder to fight

No ‘one size fits all’ solution: manual tuning and curation are required for comprehensive results, especially at a large scale

The results of clustering are precious; they are kept as databases (PFAM, COGs, KOGs, eggNOG)

Page 41: Sequence Clustering

Protein Clustering at JGI

Functional annotation of metagenome genes through protein clusters (IMG):

- Build a set of functionally homogeneous clusters of similar proteins – for annotated genomes

- Build HMM for each cluster, compose model database

- Pledge metagenome proteins to clusters by matching to models

- Cluster unpledged proteins, build models, update model database
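
The pledge-then-cluster cycle can be outlined as a greedy loop. The sketch below is not the IMG implementation: each "model" is just a representative sequence, the score is a k-mer Jaccard similarity standing in for an HMM match, the threshold is arbitrary, and unpledged proteins seed a new cluster immediately instead of being clustered in a separate batch. All names are hypothetical.

```python
def similarity(a, b, k=3):
    """Stand-in score: Jaccard similarity of k-mer sets (NOT an HMM match)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def pledge_and_update(models, proteins, threshold=0.5):
    """Pledge each protein to its best model, or seed a new model if none fits.

    models: dict of cluster id -> representative sequence; updated in place.
    Assumes existing ids do not already use the 'new_N' pattern.
    """
    assignments = {}
    for name, seq in proteins.items():
        best = max(models, key=lambda cid: similarity(seq, models[cid]), default=None)
        if best is None or similarity(seq, models[best]) < threshold:
            best = f"new_{len(models) + 1}"  # unpledged: becomes a new cluster
            models[best] = seq
        assignments[name] = best
    return assignments


models = {"famA": "MKTAYIAKQR"}
proteins = {"p1": "MKTAYIAKQR", "p2": "GGSGGSGGSG", "p3": "GGSGGSGGSA"}
print(pledge_and_update(models, proteins))
```

Because new models are added as soon as a leftover is seen, later proteins can pledge to them, which mirrors the "update model database" step in the list above.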

Page 42: Sequence Clustering

Protein Clustering

Use of Protein Clusters reduces search space, but adds another level of indirection, which is a source of errors, and adds complexity that consumes effort

However, for proteins, which form dense relationship networks, clustering is a great tool

Konstantinos Mavrommatis will elaborate on protein clustering techniques

Page 43: Sequence Clustering

Thank you!