From Expression to Regulation: the online analysis of microarray data Gert Thijs K.U.Leuven, Belgium...




From Expression to Regulation: the online analysis of microarray data
Gert Thijs
K.U.Leuven, Belgium
ESAT-SCD

K.U.Leuven
- Founded in 1425
- Situated in the center of Belgium
- Some numbers: students, researchers, professors
- University Hospital with beds

ESAT-SCD
- Faculty of Engineering
- Mathematical engineering (120)
  - Systems and control
  - Data mining and Neural Nets
  - Biomedical signal processing
  - Telecommunications
  - Bioinformatics
  - Cryptography

Bioinformatics team
- Research in medical informatics and bioinformatics
- Research on algorithmic methods
- Interdisciplinary team
  - 15 researchers (1 full professor, 4 post-docs, 10 Ph.D. students)
  - Engineering, physics, mathematics, computer science, biotech, and medicine
- Collaborative research with molecular biologists and clinicians
  - VIB MicroArray Facility: primary analysis of microarray data
  - University of Gent-VIB, Plant Genetics: motif discovery
  - KUL-VIB, Center for Human Genetics
    - Neuronal development in mouse neurons
    - Targets of PLAG1 (pleiomorphic adenoma gene)
  - KUL, Obstetrics and Gynecology
    - Diagnosis of ovarian tumors from ultrasonography (IOTA)
    - Microarray analysis of ovarian tumor biopsies

Overview
1. Short introduction to microarrays
2. Exploratory analysis of microarray data
3. Clustering gene expression profiles
4. Upstream sequence retrieval
5. Motif finding in sets of co-expressed genes

cDNA microarrays
- Collaboration with VIB microarray facility
- cDNAs (genes, ESTs) spotted on array
- Cy3, Cy5 labeling of samples
- Hybridization (test, control)
- Laser scanning & image analysis
- Arabidopsis, mouse, and human

Microarray experiment
1. Collecting samples
2. Extracting mRNA
3. Labeling
4. Hybridizing
5. Scanning
6. Visualizing

Microarray production
- Clones
- Plasmid preparation
- PCR amplification
- Reordering
- Spotting
- Zoom - pins

From expression to regulation
[Diagram labels: start, Microarrays, Clustering, A1234, Z4321, GenBank, Blast, Gibbs sampler]

Exploratory data analysis

Data exploration
- Subset selection based on
  - Gene Ontology functional classes
  - Keywords, gene names
- Check the expression profiles of individual genes
- Visualization of expression profiles of gene families
- Link to upstream sequence retrieval

[Screenshots: Gene Ontology; Subset selection; Profile inspection; Profile visualization; Sequence Retrieval]

Clustering

Goal of clustering
- Exploration of microarray data
- Form coherent groups of
  - Genes
  - Patient samples (e.g., tumors)
  - Drug or toxin response
- Study these groups to get insight into biological processes
- Genes in the same cluster can have the same function or the same regulation

K-means
- Initialization
  - Choose the number of clusters K and start from random positions for the K centers
- Iteration
  - Assign points to the closest center
  - Move each center to the center of mass of the assigned points
- Termination
  - Stop when the centers have converged or the maximum number of iterations is reached

[Figure: K-means illustrated at initialization, iteration 1, and iteration 3]

Hierarchical clustering
- Construction of gene tree based on correlation matrix

K-means clustering

Need for new clustering algorithms
- Noisy genes deteriorate consistency of profiles in cluster
- All genes forced into cluster

Adaptive quality-based clustering
- For discovery, biologists are looking for highly coherent, reliable clusters
- Other needs for clustering microarray data
  - Fast + limited memory (need to analyze thousands of genes)
  - No need to specify number of clusters in advance
  - Few and intuitive parameters
- AQBC = 2 step algorithm
  - Cluster center localization
  - Cluster radius estimation with EM
- Read more: De Smet et al. (2002) Bioinformatics, in press.

Step 1: localization of cluster center

Step 2: re-estimation of cluster radius
- Distance from cluster center randomly distributed except for small group (= cluster elements)
- Size of cluster can be estimated automatically by EM

Step 3: remove cluster points and look for new cluster
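As a reference point for the K-means procedure described in these slides (initialization, iteration, termination), here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used by the authors: the function name `kmeans`, its parameters, and the choice to initialize the centers from K randomly chosen data points are all hypothetical, and the toy two-dimensional "profiles" stand in for real expression profiles.

```python
import random
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Sequence[float]], k: int,
           max_iter: int = 100, seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Plain K-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initialization: random start positions. Assumption: we draw them
    # from the data points themselves so every center starts near data.
    centers = [list(p) for p in rng.sample(list(points), k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Iteration, part 1: assign each point to the closest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Iteration, part 2: move each center to the center of mass
        # of the points assigned to it.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)])
            else:
                # An empty cluster keeps its previous center (a design choice).
                new_centers.append(centers[c])
        # Termination: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy usage: two well-separated groups of two-point "profiles".
profiles = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(profiles, k=2)
print(centers, labels)
```

Note that every profile receives a label, including noisy ones, and that K has to be supplied by the user. These are the two properties the slides single out as motivation for adaptive quality-based clustering.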
Comparison with K-means

A.Q.B.C.
- User-defined parameters
  - Quality criterion (QC, in %): defines how significantly a cluster should be separated from the background
  - Minimal number of genes in a cluster
- Advantages
  - Outcome not sensitive to parameter setting
  - Number of clusters is determined automatically
  - Based on QC an optimal radius is calculated for each cluster
  - Set of smaller clusters containing genes with highly similar expression profile (fewer false positives)
  - Noisy genes are rejected
- Disadvantages
  - Some information is rejected: clusters too small

K-means
- User-defined parameters
  - Number of clusters
  - Number of iterations
- Advantages
  - Fewer true positives are rejected
- Disadvantages
  - Outcome sensitive towards parameter setting
  - Extensive fine-tuning required to find optimal number of clusters
  - Separation and merging of clusters based on visual inspection and not on statistical foundation
  - No quality criterion: more false positives
  - All genes will be clustered (noisy clusters)

Adaptive Quality-Based Clustering
[Screenshots: Web interface; Cluster results page]

Upstream Sequence Retrieval

Upstream sequence retrieval
1. Identify all genes in cluster based on given accession number and gene name.
2. Delineate upstream region based on sequence annotation.
3. Check for presence of annotated upstream gene.
4. IF upstream gene found THEN select intergenic region ELSE blast gene to find genomic DNA where gene is annotated.
5. Parse blast reports to find intergenic regions.
6. Report results in GFF.
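Steps 2 to 4 and step 6 of the retrieval procedure can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `Gene` record, the functions `select_intergenic` and `to_gff`, the `max_len` cap of 2000 bp, and the layout of the attribute column are hypothetical and are not taken from the tool presented in the slides. The BLAST fallback of steps 4 and 5 is not shown; the sketch only signals when it would be needed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Gene:
    name: str
    seqid: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int
    strand: str  # "+" or "-"


def select_intergenic(target: Gene, genes: List[Gene],
                      max_len: int = 2000) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the intergenic region upstream of `target`.

    Returns None when no annotated upstream gene is found or the region
    is empty; a caller would then fall back to BLAST (not shown).
    """
    others = [g for g in genes
              if g.seqid == target.seqid and g.name != target.name]
    if target.strand == "+":
        # Upstream of a forward-strand gene means lower coordinates.
        ends = [g.end for g in others if g.end < target.start]
        if not ends:
            return None
        start, end = max(ends) + 1, target.start - 1
        start = max(start, end - max_len + 1)   # keep the part nearest the gene
    else:
        # Upstream of a reverse-strand gene means higher coordinates.
        starts = [g.start for g in others if g.start > target.end]
        if not starts:
            return None
        start, end = target.end + 1, min(starts) - 1
        end = min(end, start + max_len - 1)
    return (start, end) if start <= end else None


def to_gff(target: Gene, region: Tuple[int, int], source: str = "sketch") -> str:
    """Format the region as one tab-separated, nine-column GFF-style line."""
    start, end = region
    return "\t".join([target.seqid, source, "upstream_region",
                      str(start), str(end), ".", target.strand, ".",
                      f"gene={target.name}"])


# Toy annotation on a single hypothetical sequence "chr1".
annotation = [
    Gene("g1", "chr1", 100, 500, "+"),
    Gene("g2", "chr1", 1000, 2000, "+"),
    Gene("g3", "chr1", 6000, 7000, "-"),
    Gene("g4", "chr1", 12000, 13000, "+"),
]
for gene in annotation:
    region = select_intergenic(gene, annotation)
    print(gene.name, to_gff(gene, region) if region else "no upstream gene: blast")
```

The sketch ignores overlapping genes and the check that the selected region contains the core promoter; both would need handling in a real pipeline.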
[Screenshots: Gene identification; Selected sequences & genes to be blasted; Results of blast report parsing; Selected sequences]

Motif Finding

Transcriptional regulation
- Complex integration of multiple signals determines gene activity
- Combinatorial control

Identifying regulatory elements from expression data
- Cluster genes from microarray expression data to build clusters of co-expressed genes
- Co-expressed genes may share regulatory mechanisms
- Most regulatory sequences are found in the upstream region of the genes (up to 2 kb in A. thaliana)
- Motifs that are statistically overrepresented in the upstream regions are candidate regulatory sequences

Upstream sequence model
- Motifs are hidden in noisy background sequence.
- Data set contains two types of sequences:
  - Sequences with one or more copies of the common motif.
  - Sequences with no copy of the common motif.

Motif Sampler
- Algorithm based on the original Gibbs Sampling algorithm (Lawrence et al. 1993, Science 262)
- Probabilistic sequence model
- Changes and additions:
  - Use of higher-order background model.
  - Use of probability distribution to estimate number of copies.
  - Different motifs are found and masked in consecutive runs of the algorithm.
- Read more: Thijs et al. (2001) Bioinformatics 17(12); Thijs et al. (2002) J.Comp.Biol. 9(2)

Background model
- Representation of DNA sequence by higher-order Markov chain
- [Diagram labels: core promoter, gene, intergenic region]
- Reliable model can be built from selected intergenic DNA sequences.
- Intergenic sequence = non-coding region between two consecutive genes.
- Only regions that contain the core promoter are selected.

Algorithm: Initialization
- Calculate background model score
- Start from random set of motif positions
- Create initial motif model

Algorithm: iterative procedure
1. Score sequences with current motif model
2. Calculate distribution
3. Sample new alignment position
4. Iterate for fixed number of steps

Algorithm: Convergence
- Select best scoring positions from Wx to create motif and alignment

Motif Sampler
[Screenshots: Motif Sampler; Motif Sampler results page]

Example: Plant wounding
- 150 Arabidopsis genes
- Mechanical plant wounding
- 7 (or 8) time points over a 24h period
- Adaptive quality-based clustering produces 8 clusters of which 4 contain 5 or more genes.
- Search for a motif of length 8 and a motif of length 12 in 4 clusters
- Reymond, P. et al. Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis. Plant Cell 12(5).

Results: Cluster 1
TAArTAAGTCAC    7   TGAGTCA      tissue specific GCN4-motif
                    CGTCA        MeJA-responsive element
ATTCAAATTT      8   ATACAAAT     element associated to GCN4-motif
CTTCTTCGATCT    5   TTCGACC      elicitor responsive element

Results: Cluster 2
CCCGCGTTTCAA    4   CCCCCG       enhancer like element
TTGACyCGy       5   TGACG        MeJA responsive element
                    (T)TGAC(C)   Box-W1, elicitor responsive element
mACGTCACct      7   CGTCA        MeJA responsive element
                    ACGT         Abscisic response element

Results: Cluster 4
wATATATATmTT    5   TATATA       TATA-box like element
TCTwCnTC        9   TCTCCCT      TCCC-motif, part of light responsive element
ATAAATAkGCnT    7   -            -

Results: Cluster 8
yTGACCGTCcsa    9   CCGTCC       meristem specific activation of H4 gene
                    CCGTCC       A-box, light or elicitor responsive element
                    TGACG        MeJA responsive element
                    CGTCA        MeJA responsive element
CACGTGG         5   CACGTG       G-box, light responsive element
                    ACGT         Abscisic acid response element
GCCTymTT        8   -            -
AGAATCAAT       6   -            -

Conclusions
- Gene expression data can reveal useful information on transcriptional regulation.
- Adaptive quality-based clustering finds coherent groups of co-expressed genes.
- Use of higher-order background models improves performance of Motif Sampler.
- INCLUSive enables online analysis from clustering to motif finding.

Acknowledgements
ESAT-SCD
- Prof. Bart De Moor
- Dr. Yves Moreau
- Dr. Kathleen Marchal
- Frank De Smet
- Stein Aerts
- all others
STWW Project
- Pierre Rouzé (VIB Gent, INRA)
- Stephane Rombauts (VIB Gent, INRA)
- Magali Lescot (LGPD, Marseille)
IWT-Vlaanderen