Data Analysis Tools & Techniques – II. In this presentation…… Part 1 – Gene Expression...
-
date post
20-Dec-2015 -
Category
Documents
-
view
213 -
download
0
Transcript of Data Analysis Tools & Techniques – II. In this presentation…… Part 1 – Gene Expression...
In this presentation……
Part 1 – Gene Expression Microarray Data
Part 2 – Global Expression & Sequence Data Analysis
Part 3 – Proteomic Data Analysis
Conversion to matrix
• Whichever platform is used, aim of data processing is to convert the hybridization signals into numbers, which can be used to build a gene expression matrix
• This matrix can be regarded as a table in which the rows represent genes (different features on array) and the columns represent treatments, samples or conditions used in experiment
What do they represent?
• For a dual hybridization experiment using a glass microarray, each of the probes represents a different experimental condition
• In other cases, a whole series of conditions or treatments may be used, e.g. representing a series of concentrations of a particular drug, or a series of developmental time points
Schematic of an idealized expression array, in which the results from 3 experiments are combined. Three genes NG (G1, G2, G3) are labeled on vertical axis and three experimental conditions NC (C1, C2, C3) are labeled on horizontal axis, giving a total of nine data points represented by NC x NG. The shading of each data point represents the level of gene expression, with darker colours representing higher expression levels
C1 C2 C3
G1
G2
G3
Expression profile
• Interpretation of microarray experiment is carried out by grouping data according to similar expression profiles
• It is defined as expression measurements of a given gene over a set of conditions; essentially it means reading along a row of data in the matrix
• Intensity of shading is used to represent expression levels
• With experimental conditions C1 and C2, genes G1 and G2 look functionally similar and G3 appears different. However, if C3 is included, a functional link between genes G1 and G3 can be seen
• Analysis methods are either supervised or unsupervised
Microarray Data Analysis Types
• Gene Selection– find genes for therapeutic targets
• Classification – classify disease based on genes– predict outcome / select best treatment
• Clustering– find new biological classes / refining existing ones– Exploration
• …
Microarray Data Mining Challenges
• too few records (samples), usually < 100 • too many columns (genes), usually > 1,000• Too many columns likely to lead to False positives• for exploration, a large set of all relevant genes is
desired• for diagnostics or identification of therapeutic
targets, smallest reliable set of genes is needed• model needs to be explainable to biologists
Data Mining Methodology is Critical!
Data Mining is a Continuous
Process!
Following Correct Methodology
is Critical!
CRISP-DM methodology
Preparation
Building Classification ModelsGene data
Feature Selection
Model Building
Evaluation
Class data
Supervised analysis method
• Supervised methods are essentially classification systems, i.e. they incorporate some kind of classifier so that expression profiles are assigned to one or more predefined categories
• For instance, supervised analysis of gene expression profiles from different leukemias allows samples to be divided into two distinct subtypes: acute myeloid leukemia (AML) and acute lymphoblastoid leukemia (ALL)
• For example, support vector machine (SVM), learning vector quantization (LVQ), etc.
Unsupervised analysis method
• They have no inbuilt classifiers, so the number and nature of groups depends only on the algorithm used and nature of data themselves
• This type of analysis is known as clustering• For example, k-means, principal component
analysis (PCA), self-organizing maps (SOM), hierarchical clustering, etc.
Feature reduction• Since microarray data sets are so large, classification and
clustering can be laborious and demanding in terms of computer resources
• It is possible to use feature reduction, where non-informative or redundant data points are removed from data set, to make the algorithms run more quickly
• For instance, if two conditions have exactly same effect on gene expression, these data are redundant and one entire column of the matrix can be eliminated
• If the expression of a particular gene is same over a range of conditions, it is neither necessary nor beneficial to use this gene in further analysis because it provides no useful information on differential gene expression. An entire row can be removed
Other feature reduction methods• Several approaches can be used to automatically select such
redundant or non-informative data sets, but a popular method is principal component analysis (also called singular value decomposition)
• Redundant data are combined to form a single, composite data set, thus reducing the dimensions of gene expression matrix and simplifying analysis
• Feature reduction can also be used in supervised analysis methods to reduce number of features required to classify profiles correctly (also called cherry picking)
• In one method, this can be achieved simply by weighting classification features according to their usefulness and eliminating those that are least informative
Microarray data format• Unlike sequence and structural data, there is no
international convention for the representation of data from microarray experiments
• This is due to the wide variation in experimental design, assay platforms and methodologies
• Recently, an initiative to develop a common language for the representation and communication of microarray data has been proposed
• Experiments are described in a standard format called MIAME and communicated using a standardized data exchange model and microarray markup language based on XML
Micro Array Gene Expression Markup Language
• Micro Array Gene Expression Markup Language (MAGE-ML) creates a syntax that can manage the enormous number of variables involved in microarray experiments, and provides a mutually intelligible format to permit data merges or comparisons
• This is a collaborative effort of Lion Bioscience, The Institute for Genomic Research, Rosetta Biosoftware, the Institute for Systems Biology, among others under the chairmanship of Paul Spellman
• This will soon become standard for all microarray experiments world wide being run under different conditions in different labs
• Look for Paul T Spellman et al (2002) for more information on MAGE-ML
Tools for microarray data analysis• Many software applications are available for the
analysis of microarray data and these can be downloaded and installed on local computers
• There are also several resources, Expression Profiler being the most widely used, for microarray data analysis over the Internet
• Several gene expression databases have been constructed for the storage and dissemination of microarray data
• These include the NCBI Gene Expression Omnibus and the EBI ArrayExpress database
From expression data to pathways• Reconstructing molecular pathways from expression
data is a difficult task• One approach is to simulate pathways using a variety
of mathematical models and then choose the model that fits the data
• Reverse engineering is a less demanding approach in which models are built on the basis of the observed behaviour of molecular pathways
• Models using simultaneous differential equations or Boolean networks each suffer from disadvantages, so hybrid models, such as the finite linear state model, are preferred
Representation of molecular pathways
• There are two well-studied ways of representing a molecular pathways– The classical biochemical representation involves
use of simultaneous differential equations– The Boolean network representation
Sequence sampling data analysis
• Differential gene expression can be investigated by sampling random clones from different cDNA libraries, or by sampling EST data, which is obtained by single-pass sequencing of randomly picked cDNA clones and deposited in public or proprietary databases
• Thousands of sequences have to be sampled for such analysis to be statistically significant, even in the case of moderately abundant mRNAs
Global expression data analysis
• Refers to any experiment in which the expression of all genes is monitored simultaneously
• Such experiments generate large amounts of data, but unlike sequence and structural data, there is no universal system for description of gene expression profiles
• Global protein expression data are obtained predominantly as signal intensities on 2D protein gels
RNA expression data analysis
• At the RNA level, expression data may be obtained as digital expression readouts following direct sequence sampling from libraries or databases, or using more sophisticated techniques like SAGE
• Most global RNA expression data, however, are obtained as signal intensities from microarray experiments
SAGE• SAGE is a sequence sampling technique in which
very short sequence tags (9-15 nt) are joined into long concatamers
• The size of the SAGE tag is optimal for high-throughput analysis but genes can still be identified unambiguously
• A concatamer may contain more than 50 tags, and each SAGE sequence is thus equivalent to more than 50 independent cDNA sequencing experiments
• SAGE is therefore appropriate for the analysis of rare mRNAs
Starting points for SAGE analysisResource URL
John Hopkins SAGE site. Includes protocols, access to SAGE data and an extensive bibliography
www.sagenet.org
NCBI SAGE site. Includes tools for data analysis, access to SAGE data, and library of tags and ditags
www.ncbi.nlm.nih.gov/SAGE
Saccharomyces genome database SAGE query site
http://genome-www.stanford.edu/cgi-bin/SGD/SAGE/querySAGE
A useful SAGE site run by Genzyme Molecular Oncology Inc., which owns the license for commercial distribution of SAGE technology
www.genzymemolecularoncology.com/sage
2D protein gels• Global protein expression analysis is achieved using
high resolution 2D gel electrophoresis• In this technique, proteins are separated in the first
dimension by isoelectric focusing in an immobilized pH gradient, and in the second dimension according to molecular mass
• After staining the gel, the resulting pattern of sports is a reproducible fingerprint of proteins in the sample
• Comparison between samples can identify proteins that are differentially expressed, or induced in response to drugs, and so on
• Excised spots are analyzed by MS to characterize proteins
Raw data from 2D-PAGE gels
• 2D-PAGE is a protein separation technique that allows the resolution of thousands of proteins on a single gel, on the basis of charge and mass
• Separated proteins appear as spots, the nature and distribution of which constitute a protein fingerprint of any sample
Data processing• Data extraction from 2D-PAGE gels involves
– staining (to reveal the position of individual protein spots)
– scanning (to obtain a digital image)– spot detection and quantization
• The quality of the image, in terms of spatial and densitometric resolution, is an important factor in accurate spot measurement
• A number of algorithms are used to resolve complex overlapping spots and assemble a final spot list
Gel matching
• To study differential protein expression, a series of 2D-PAGE gels must be compared
• However, minute inconsistencies in gel structure and electrophoretic conditions make it impossible to exactly replicate any experiment
• Sophisticated algorithms are required to follow individual spots through a series of gel, a process known as gel matching
• MELANIE II is a widely used gel-matching software application
Protein expression matrices
• Differential protein expression data are assembled into a protein expression matrix
• This can be used to find distances between particular proteins or treatments, leading to classification or clustering of proteins according to similar expression profiles
2D-PAGE database• Data from 2D-PAGE experiments are
deposited in dedicated 2D-PAGE databases containing digital gel images with links from individual protein spots to useful annotations
• Internet 2D-PAGE databases are indexed at the ExPASy WORLD-2PAGE
• These allow 2D-PAGE data to be shared with scientists around the world, and comparisons between gels can be carried out using Java applets such as Flicker or CAROL
Raw data from mass spectrometry• Raw data from MS experiments are the
mass/charge (m/z) ratios of ions in a vacuum• These are used to determine accurate
molecular masses• The masses can be used in peptide mass
fingerprinting or fragment ion searching to find correlations in protein databases
• Alternatively, peptide ladders can be generated and used to determine protein sequences de novo
Virtual digests• They are theoretical protein cleavage reactions
performed by computers based on known protein sequences and the known specificity of a cleavage agent such as an endoproteinase
• Although many different polypeptides can generate the same peptide digest pattern, in practice a correlation between the masses of two or more peptides produced from the same protein and the theoretical peptides produced in a virtual digest provides very strong evidence for a database match
Dual digests• Dual digests, carried out on the same protein
either separately or sequentially, can provide extra data to correlate experimentally determined molecular masses with less robust data resources such as dbEST
• Alternatively, single digests can be carried out before and after protein modification, or ragged termini can be generated from proteins with clustered arginine and lysine residues, providing the masses of multiple fragments to use as database search terms
Database search tools• Algorithms for database searching may attempt to
match the experimentally determined mass of a peptide or peptide fragment to mass predicted from sequence database entries. The program SEQUEST works on this principle
• Alternatively, the amino acid composition of a particular peptide or peptide fragment can be predicted from its mass
• The order of amino acids cannot be predicted, so all permitted permutations are used as a database search query. The program Lutkefisk works on this principle
Limitations of MS analysis• Failure of MS data to elicit a high-confidence
hit on a sequence database may not always reflect the absence of that protein from database
• In some cases, it may reflect the presence of unknown or unanticipated post-translational modifications, or it may be caused by non-specific proteolysis or contaminating proteins
• Imperfect matches may be generated if the experimental protein itself is absent from the database but a close homolog, with a related sequence, is present
WWW resources for MS based protein identificationResource URL Features and comments
CBRG, ETH-Zurich cbrg.inf.ethz.ch/Masssearch.html Peptide mass search
European Molecular Biology Laboratory, Heidelberg
www.mann.embl-heidelberg/Services/PeptideSearch/PeptideSearchIntro.html
Peptide mass and fragment ion search
ExPASy www.expasy.ch/tools/#proteome Peptide mass and fragment ion search
Mascot www.matrix-science.com/cgi/index.pl?page/home.html
Peptide mass and fragment ion search
Rockfeller University, New York
prowl.rockfeller.edu Peptide mass and fragment ion search
SEQNET, Daresbury, UK
www.seqnet.dl.ac.uk/Bioinformatics/welapp/mowse
Peptide mass and fragment ion search
University of California
prospector.ucsf.edu
dontatello.ucsf.edu
Peptide mass (MS-Fit) and fragment ion (MS-Tag) search
University of Washington
thompson.mbt.washington.edu/sequest
Instruction on how to get SEQUEST fragment ion search program
Standard format• Scope of bioinformatics has widened to include analysis of
gene and protein expression data• Standard format has been adopted for representation of 2D
gel electrophoresis (2D-PAGE) protein gels but there is no similar convention for microarrays, even though microarray experiments produce some of the largest data sets bioinformatics has to deal with
• This reflects different array platforms available (i.e. nylon macroarrays, spotted glass microarrays, high-density oligonucleotide chips) and large amount of variation in experimental design, hybridization protocols and data gathering techniques
Recent development• Recently, there has been an international effort to
develop a common language for communication of microarray data
• Requirements for this language are that it should be minimal but it should convey enough information to enable experiment to be repeated, if necessary
• The convention is known as MIAME (minimum information about a microarray experiment) devised by MAGE group (microarray and gene expression group)
MIAME standard
• Incorporates six elements– Overall experimental design– Array design (identification of each spot on
each array)– Probe source and labeling method– Hybridization procedures and parameters– Measurement procedure (including
normalization methods)– Control types, values and specifications
Contents of MIAME standard
• A data exchange model (MAGE-Object Model or MAGE-OM) is modeled using unified modeling language (UML)
• A data exchange format (MAGE-Markup Language or MAGE-ML) uses extensible markup language (XML)
For more information visit the Microarray Gene Expression Database (MGED) website at http://www.mged.org
Analysis software and resourcesURL Product(s) Comments
http://genome-www4.stanford.edu/Microarray/SMD/restech.html
Cluster, Xcluster, SAM, Scanalyze, many more
Extensive list of software resources from Stanford University and other sources, both downloadable and WWW-based
http://ihome.cukh.edu.hk/~b400559/arraysoft.html
Cluster, Cleaver, GeneSpring, Genesis, many more
Comprehensive list of downloadable and WWW-based software of microarray analysis and data mining, plus links to gene expression databases
http://ep.ebi.ac.uk/EP Expression Profiler Very powerful suite of programs from EBI for analysis and clustering of expression data
http://www.ncgr.org/genex GeneX GeneX gene expression database is an integrated tool set for analysis and comparison of microarray data
Analysis software and resources
URL Product(s) Comments
http://bioinfo.cnio.es/dnarray/analysis
DNA arrays analysis tools
A suite of programs from National Spanish Cancer Centre (CNIO) including two-sample correlation plot, hierarchical clustering, SOM, SVM, tree viewers, etc.
http://www.ncbi.nlm.nih.gov/geo
NCBI Gene Expression Omnibus
Gene expression and hybridization database; could be searched directly or through Entrez ProbeSet search interface
http://www.ebi.ac.uk/microarray/ArrayExpress/arrayexpress.html
ArrayExpress EBI microarray gene expression database, developed by MGED and supports MIAME
More on microarray chips
• Protein chip market expected to be of $ 700 million by 2006
• Chips for agricultural purposes will be great demand
• Peptide microarray chips– Silicon based micro-fluidics chips– 2000 to 4000 peptide sequence on a 1.5 cm2 chip
• Protein – Secreted– Membranal
Accuracy of new tech chips
• New software technologies can reduce the inter-experiment variability from 1500-200 genes down to 10-15 genes by identification and suppression of background noise in producing microarray data
• They can be used for high throughput sequencing, protein detection and SNP analysis
• Reduces error rate of false positives from 30 % down to 1 %
• Current DNA chips III are equipped to handle multiple mRNA transcripts
Front-end and back-end processing
• This term is widely used by biotech industry
• Front end DNA microarray processes– Sample preparation– Microarray production
• Back end DNA microarray processes– Hybridization– Imaging and analysis
DNA chip test• Cancers can act differently even when they look the
same. To decide how to treat breast tumors, doctors look at a range of indicators such as whether the cancer has spread to nearby lymph nodes, tumor size, and certain characteristics of the tumor cells. However, none of these factors is very accurate
• The DNA chip test reveals how 70 genes turned on or off in the cancer cells
• According to Netherlands Cancer Institute, the tumors most likely to spread usually show a different pattern of gene expression than their less dangerous counterparts