presentationIDC - 14MAY2015
Intron detection in time-course tiling array data
The big-data analytics challenge: combining statistical and algorithmic perspectives
Anat Reiner-Benaim
Department of Statistics, University of Haifa
IDC, May 14, 2015
Outline
- Data science: definition? Who needs it?
- The elements of data science
- Analysis: modeling, software
- Examples:
  - Scheduling: prediction of runtime
  - Genetics: detection of rare events
What is data science?
From Wikipedia: Data science is the study of the generalizable extraction of knowledge from data.
More from Wikipedia:
- Builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing and high performance computing.
- Goal: extracting meaning from data and creating data products.
- Not restricted to big data, although the fact that data is scaling up makes big data an important aspect of data science.
Data science: who needs it?
Anyone who has (big) data, e.g.:
- Cellular industry: phones, apps, advertisers
- Internet: search engines, social media, marketing, advertisers
- Computer networks and server systems
- Cyber security
- Credit cards
- Banks
- Health care providers
- Life science: genome, proteome
- TV and related
- Weather forecast
The elements of data science
Big data technologies:
- Store, preprocess: NoSQL database (e.g. Cassandra); DFS (Distributed File System) (e.g. Hadoop, Spark, GraphLab)
- Dump to SQL: SQL database (e.g. MySQL, SAS-SQL)
Big data analytics:
- Analyze: apply sophisticated methods (statistical modeling, machine learning algorithms)
Data analysis: first, define the problem
- How can I decide that an item in a manufacturing process is faulty?
- What is the difference between the new machine and the old one?
- What are the factors that affect system load?
- How can I predict memory/runtime of a program?
- How can I predict that a customer will churn?
- What is the chance that the phone/web user will click my advertisement?
- What is the chance that the current ATM user is committing fraud?
- What is the chance of snow this week?
Modeling
Possible goals:
- Predicting, classifying: (logistic) regression, LDA, QDA, Naive Bayes, neural networks, CART, random forests, SVM, KNN
- Clustering: hierarchical, K-means, mixture models, HMM, PCA
- Anomaly detection, peak detection: scan statistic, outlier detection methods
- A/B testing (actually a two-sample comparison): parametric tests (normal, t, chi-square, ANOVA); non-parametric tests (signed-rank, rank-sum, Kruskal-Wallis)
- Identifying trends, cycles: regression, time series
Choosing models
- Type of variables: continuous, ordinal, categorical.
- Statistical assumptions: normality, equal variance, independence.
- Missing data
- Stability
Learning tools
- Bootstrap: repeatedly fit the model on resampled data.
- Bagging (bootstrap aggregation): combine models fitted on bootstrap samples to reduce instability.
- Boosting: combine a set of weak learners to create a single strong learner.
- Regularization: counter over-fitting by restriction (e.g. limit regression to linear or low-degree polynomial).
- Utility/cost function: evaluate performance, compare models.
These are typically iterative procedures, combined with the modeling procedures. They help optimize the model and evaluate its performance.
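The bootstrap and bagging ideas listed above can be sketched in a few lines of Python. This is a minimal illustration on made-up runtimes, using the median as a stand-in for "the model"; the function names and the data are assumptions for illustration, not part of the talk.

```python
import random
import statistics

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data, with replacement."""
    return [rng.choice(data) for _ in data]

def bootstrap_estimates(data, estimator, n_boot, rng):
    """Refit the estimator on n_boot resampled data sets."""
    return [estimator(bootstrap_sample(data, rng)) for _ in range(n_boot)]

def bagged_estimate(data, estimator, n_boot, rng):
    """Bagging: average the estimates obtained from the bootstrap samples."""
    return statistics.fmean(bootstrap_estimates(data, estimator, n_boot, rng))

rng = random.Random(0)
# Hypothetical runtimes in seconds, with one extreme value.
runtimes = [12, 15, 9, 30, 22, 18, 11, 14, 400, 16]
estimates = bootstrap_estimates(runtimes, statistics.median, 500, rng)
bagged = bagged_estimate(runtimes, statistics.median, 500, rng)
spread = statistics.stdev(estimates)  # bootstrap standard error of the median
```

The spread of the bootstrap estimates gives a rough measure of how unstable the estimator is on this data set.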
More to consider: control statistical error due to large-scale analysis
- Multiple statistical tests lead to inflated statistical error.
- Control the FDR? FDR = expected proportion of false findings (e.g. features) among all findings.
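One common way to control the FDR is the Benjamini-Hochberg step-up procedure. The sketch below is a minimal standard-library version; the function name and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected by the BH step-up procedure at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    return sorted(order[:k_max])

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
rejected = benjamini_hochberg(p_values, q=0.05)
```

With these ten p-values and q = 0.05, only the two smallest are rejected, although five of them fall below 0.05 individually.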
The R software
- Open source programming language and environment for statistical computing.
- Widely used among statisticians for developing statistical software (packages) and for data analysis.
- Increasingly popular among all data professionals.
Advantages:
- Contains the most up-to-date statistical models and machine learning algorithms.
- Methods are based on research, compiled and documented.
- Contains Hadoop functions (package rhdfs).
- Very convenient for plain programming, scripting, simulations, visualization.
- Friendly interface (e.g. RStudio).
The R project site
Examples
- Runtime prediction (manufacturing, scheduling)
- Anomaly/peak detection (fraud, electronics, genetics)
- Diagnostics (biotech, healthcare)
- Epistatic detection (genetics)
Example 1: Classification of Job Runtime in Intel
Joint work with: Anna Grabarnick, University of Haifa; Edi Shmueli, Intel
Job processing
[Diagram: users submit jobs to a job scheduler, which passes them to servers. The scheduler decides: which server? which queue?]
Job schedulers
- Algorithms aimed at efficiently queuing and distributing jobs among servers, thereby improving system utilization.
- Popular scheduling algorithms (e.g. backfilling) use information on how long the jobs are expected to run.
- In serial job systems, scheduling performance can be improved by merely separating the short jobs from the long ones and assigning them to different queues in the system. This helps reduce the likelihood that short jobs will be delayed behind long ones, and thus improves overall system throughput.
Job processing
[Diagram: users submit jobs; each job is classified as short or long before it reaches the scheduler and the servers.]
The problem
Main purpose: classify jobs into short and long durations.
Questions:
- How can the classes be defined?
- How can the jobs be classified?
Available data
Two traces obtained from one of Intel's data centers:
- ~1 million jobs executed during a period of 10 consecutive days. Used for model training.
- ~755,000 jobs executed during a period of 7 consecutive days. Used for model validation.
Aside from runtime information, 9 categorical variables were available:
Analysis steps
- Exploratory visualization of the data.
- Class construction and characterization.
- Classification:
  - Choice of a classification model.
  - Optimize model.
  - Validate model.
[Figure: panels labeled A1, B4, A3 and A2]
Runtime distribution
[Figure: runtime distribution in seconds; panels: all observations, and Wtime < 15,000 sec]
Runtime: log transformation
Constructing classes by the mixture model
Mixture distribution parameter estimation
Parameter estimation (cont'd)
Parameter estimation: additional notes
We obtain the following estimates:
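As an illustration of how classes can be constructed from a mixture model, the sketch below fits a two-component normal mixture to log runtimes with the EM algorithm. It assumes normal components on the log scale and runs on synthetic data; the component form, starting values and all names are assumptions for illustration, and the numbers it produces are unrelated to the Intel traces.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_short(xi, p, short, long_):
    """Posterior probability that an observation belongs to the 'short' component."""
    a = p * normal_pdf(xi, *short)
    b = (1 - p) * normal_pdf(xi, *long_)
    return a / (a + b) if a + b > 0 else 0.5

def fit_two_normal_mixture(x, n_iter=200):
    """EM for p * N(mu1, s1) + (1 - p) * N(mu2, s2); component 1 is the 'short' class."""
    xs = sorted(x)
    half = len(xs) // 2
    # Crude starting values: lower half versus upper half of the sorted data.
    mu1 = sum(xs[:half]) / half
    mu2 = sum(xs[half:]) / (len(xs) - half)
    s1 = s2 = (xs[-1] - xs[0]) / 4 or 1.0
    p = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability of 'short' for every observation.
        w = [posterior_short(xi, p, (mu1, s1), (mu2, s2)) for xi in x]
        # M-step: weighted updates of the proportion, means and standard deviations.
        n1 = sum(w)
        n2 = len(x) - n1
        p = n1 / len(x)
        mu1 = sum(wi * xi for wi, xi in zip(w, x)) / n1
        mu2 = sum((1 - wi) * xi for wi, xi in zip(w, x)) / n2
        s1 = math.sqrt(sum(wi * (xi - mu1) ** 2 for wi, xi in zip(w, x)) / n1)
        s2 = math.sqrt(sum((1 - wi) * (xi - mu2) ** 2 for wi, xi in zip(w, x)) / n2)
    return p, (mu1, s1), (mu2, s2)

rng = random.Random(1)
# Synthetic log runtimes: 300 'short' jobs and 200 'long' jobs.
log_runtime = [rng.gauss(3, 0.5) for _ in range(300)] + [rng.gauss(8, 1) for _ in range(200)]
p, short, long_ = fit_two_normal_mixture(log_runtime)
```

A job can then be labeled short when `posterior_short` exceeds a chosen mixture probability threshold.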
Building a classifier: the learning algorithm
- Fit a model on training data: model/feature selection.
- Evaluate the model on testing data.
- Summarize model performance: ROC, misclassification rates, fit (F test, SSE).
- Compare models.
- Validate on validation set.
- Optimize on full data: ROC, pseudo-ROC.
The training and testing process
- 80% of the data are for training: finding a classifier (model/feature selection).
- 20% are for testing: checking performance.
- After obtaining a classifier, optimize: choose the mixture threshold that maximizes performance on the full dataset.
Sequential procedures for model reduction
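The 80/20 division can be sketched as follows, assuming a simple random split (the talk does not say how the split was drawn); `train_test_split` here is a hypothetical helper, not a library call.

```python
import random

def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

jobs = list(range(1000))  # stand-ins for job records
train, test = train_test_split(jobs)
```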
Classifiers
Here we choose two classification models:
- logistic regression
- decision trees
They can both handle:
- Missing data
- Candidate classifying variables that are either continuous or categorical
- Categorical variables with many categories
Decision trees
- Classification rules are formed by the paths from the root to the leaves.
- No assumptions are made regarding the distribution of predictors.
- Relatively unstable.
Steps:
- The tree is built by recursive splitting of nodes, until a maximal tree is generated.
- Pruning: simplification of the tree by cutting nodes off, which prevents overfitting.
- Selection of the optimal pruned tree: one that fits without overfitting.
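A single step of the recursive splitting can be sketched as below. It assumes Gini impurity as the splitting criterion and, as a simplification, considers only "one category versus the rest" splits of a categorical predictor; the function names and toy data are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_one_vs_rest_split(values, labels):
    """Find the category whose 'equals / does not equal' split has the lowest weighted Gini."""
    n = len(labels)
    best = None
    for category in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v == category]
        right = [y for v, y in zip(values, labels) if v != category]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (category, score)
    return best

# Hypothetical categorical predictor and job classes.
values = ["a", "a", "a", "b", "b", "c", "c", "c"]
labels = ["short", "short", "short", "long", "long", "short", "long", "long"]
best_category, best_score = best_one_vs_rest_split(values, labels)
```

A full tree repeats this search inside each resulting node until no further split is possible, and is then pruned.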
Logistic regression
- Regression used to predict the outcome of a binary variable (like short or long).
- Given X, the outcome Y is distributed Bernoulli, with conditional mean E(Y|X).
- The connection between E(Y|X) and X can be described by the logistic function $E(Y \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$, which has an S shape.
- In general, the logistic function is $f(z) = \frac{1}{1 + e^{-z}}$.
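The S shape is easy to see numerically. A minimal sketch, with hypothetical coefficients `beta0` and `beta1` for a single predictor:

```python
import math

def logistic(z):
    """General logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_short(x, beta0, beta1):
    """E(Y | X = x) under a one-predictor logistic regression (hypothetical coefficients)."""
    return logistic(beta0 + beta1 * x)

# Values rise slowly from 0, pass through 0.5 at z = 0, and level off near 1.
curve = [round(logistic(z), 3) for z in (-4, -2, 0, 2, 4)]
```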
Performance measures
We use the ROC curve. It combines both types of errors:
- Sensitivity (true positive rate): probability of a short classification when the runtime is short.
- Specificity (true negative rate): probability of a long classification when the runtime is long.
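Sensitivity and specificity at a given threshold, and hence the points of a ROC curve, can be computed as sketched below. The scores and labels are made up, and picking the threshold that maximizes sensitivity + specificity is one common convention assumed here for illustration; it is not necessarily the criterion used in the talk.

```python
def sensitivity_specificity(scores, is_short, threshold):
    """Classify as short when score >= threshold; return (sensitivity, specificity)."""
    tp = sum(1 for s, y in zip(scores, is_short) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, is_short) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, is_short) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, is_short) if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def best_threshold(scores, is_short, thresholds):
    """Pick the threshold with the largest sensitivity + specificity."""
    return max(thresholds, key=lambda t: sum(sensitivity_specificity(scores, is_short, t)))

# Hypothetical predicted probabilities of "short" and the true classes.
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
is_short = [True, True, True, False, True, False, True, False, False, False]
thresholds = [0.25, 0.5, 0.75]
roc_points = [(t, *sensitivity_specificity(scores, is_short, t)) for t in thresholds]
chosen = best_threshold(scores, is_short, thresholds)
```

Each entry of `roc_points` is a (threshold, sensitivity, specificity) triple; plotting sensitivity against 1 - specificity over many thresholds gives the ROC curve.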
Performance optimization
For the CART procedure, variables A1, A2, A3 and B4 were selected to be in the classifier.
For performance optimization, we use a pseudo-ROC curve:
[Figure: pseudo-ROC curve; the blue circle marks the optimal tradeoff between sensitivity and specificity, obtained for a mixture probability threshold of 0.45.]
For the logistic regression, most variables were selected to be in the classifier.
For performance optimization, we compare ROC curves obtained for different thresholds, and choose threshold 0.4:
Validation results
Total misclassification rates:
- CART: 9%.
- Logistic regression: 17%.
Summary:
- Runtime can be effectively classified using the available information.
- Further evaluation of our method is required using different data sets from different installations and times.
Example 2: Detection of 2nd order epistasis on multi-trait complexes
Joint work with: Pavel Goldstein and Prof. Avraham Korol, University of Haifa
Searching for epistasis
Goal: search for epistatic effects (interactions between genomic loci) on expression traits.
Epistasis
[Figure: two panels, "no epistasis" and "epistasis"; labels: QTL1, QTL2, allele]
However, more than one gene may be affecting the trait, and then the epistatic effect is of potential interest. The Y-axis here is gene expression and the X-axis is the genotype at QTL1. The markers have only two levels, A or H, as is the case for a recombinant inbred line (RIL) population. The first plot represents the case of no epistasis; conceptually, it is similar to a 2-way analysis of variance. In the second plot an epistatic effect is involved.
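For a balanced two-locus design with genotypes A and H, the interaction can be summarized by a simple contrast of the four genotype-combination means, as in a 2-way analysis of variance. The sketch below uses made-up expression values; the function name and data are illustrative assumptions.

```python
from statistics import fmean

def interaction_contrast(expr, locus1, locus2):
    """Interaction contrast of cell means: (m_HH - m_HA) - (m_AH - m_AA).

    A value near zero means the effect of locus 2 does not depend on locus 1
    (no epistasis); a value far from zero suggests epistasis.
    """
    cells = {}
    for y, g1, g2 in zip(expr, locus1, locus2):
        cells.setdefault((g1, g2), []).append(y)
    m = {k: fmean(v) for k, v in cells.items()}
    return (m[("H", "H")] - m[("H", "A")]) - (m[("A", "H")] - m[("A", "A")])

# Two replicates per genotype combination (hypothetical values).
locus1 = ["A", "A", "A", "A", "H", "H", "H", "H"]
locus2 = ["A", "A", "H", "H", "A", "A", "H", "H"]
additive = [9.5, 10.5, 10.5, 11.5, 11.5, 12.5, 12.5, 13.5]
epistatic = [9.5, 10.5, 10.5, 11.5, 11.5, 12.5, 15.5, 16.5]

no_epistasis = interaction_contrast(additive, locus1, locus2)
with_epistasis = interaction_contrast(epistatic, locus1, locus2)
```

In the additive data the lines for the two QTL2 alleles are parallel, so the contrast is zero; in the second data set the H/H combination is shifted and the contrast is not.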
Despite the growing interest in searching for epistatic interactions, there is no consensus as to the best strategy for their detection.
Suggested approach:
- QTL analysis: combine gene expression and mapping data.
- Use multi-trait complexes rather than single traits (trait = gene expression of a particular gene).
- Screen for potential epistatic regions in a hierarchical manner.
- Control the overall FDR (False Discovery Rate).
Multi-trait complexes
Number of tests for interactions on single traits: number of genes (~7200) * number of loci pairs (~120,000) = a lot!
A dimension reduction stage can be of help!
Suggestion: consider correlated traits as multi-trait complexes. This has been shown to increase QTL detection power, mapping resolution and estimation accuracy (Korol et al., 2001).
Clustering traits (genes)
Weighted Gene Co-Expression Network Analysis (WGCNA) (Zhang and Horvath, 2005):
- Weighted correlation network.
- Top-down hierarchical clustering.
- Dynamic Tree Cut algorithm: a branch cutting method for detecting gene modules, depending on their shape.
- Meta-genes are built by taking the first principal component of the genes in every cluster.
WGCNA, proposed by Zhang and Horvath, is used for clustering gene expression. First, top-down hierarchical clustering is applied, using weighted inter-gene distances. Then a branch cutting method that is sensitive to branch shape is used to detect gene modules. Finally, meta-genes are defined as the first principal component of the genes in every cluster.
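The last step, summarizing a cluster by its first principal component, can be sketched with plain power iteration. This is a minimal standard-library illustration on a made-up cluster of three correlated genes; it is not the WGCNA implementation, and the function name is hypothetical.

```python
import math

def first_principal_component_scores(rows, n_iter=200):
    """rows: one list of expression values per sample (samples x genes).

    Returns each sample's score on the first principal component.
    """
    n, g = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(g)]
    centered = [[r[j] - means[j] for j in range(g)] for r in rows]
    v = [1.0 / math.sqrt(g)] * g  # starting direction for power iteration
    for _ in range(n_iter):
        # One power-iteration step on X^T X: s = X v, then v = X^T s, normalized.
        s = [sum(c[j] * v[j] for j in range(g)) for c in centered]
        v = [sum(centered[i][j] * s[i] for i in range(n)) for j in range(g)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return [sum(c[j] * v[j] for j in range(g)) for c in centered]

# Five samples, three strongly correlated genes (hypothetical values).
cluster = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, 6.0, 9.0],
    [4.0, 8.1, 12.0],
    [5.0, 10.0, 15.0],
]
meta_gene = first_principal_component_scores(cluster)
```

The resulting vector has one value per sample and plays the role of a single trait in the subsequent tests.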
Testing for epistasis: Natural and Orthogonal Interactions (NOIA) model (Alvarez-Castro and Carlborg, 2007)
For trait t, loci-pair l (loci A and B) and replicate i:
[Equation annotations: gene expression; indicator of genotype combinations for two loci; genotypes; design matrix; vector of genetic effects]
We propose to test the epistasis hypothesis by fitting the NOIA model proposed by Alvarez-Castro and Carlborg, modified for second-order epistasis in RIL populations, which are homozygous. The model allows orthogonal estimation of genetic effects. For loci A and B, trait t, loci-pair l and replicate i, the vector of gene expressions can be represented as the product of the phenotypes and the indicators of the corresponding genotype combinations, plus an error term. In turn, the phenotypes may be represented as the product of the genetic effects and a design matrix that guarantees orthogonality of the effects.
The test for epistasis is done hierarchically
[Figure: marker map showing framework markers and secondary markers]
As mentioned, neighboring markers on the genotype map contain very similar information. Based on this property we separated all markers into "framework" markers (marked as bold dots), which are relatively distant loci, and "secondary" markers (small vertical lines) related to the corresponding framework markers. Long vertical lines denote the borders of "framework" marker areas. Thus our markers have a hierarchical structure. We propose a two-stage approach for identifying QTL epistasis. The proposed algorithm starts with an initial construction of multi-trait complexes (or meta-genes) by WGCNA clustering of the microarray gene expression data. Then, epistasis is tested for among all combinations of such complexes and loci-pairs: starting with an initial "rough" search for pairs among framework markers, which is followed by a higher resolution search only within the identified regions. If we find an epistatic effect between markers m1 and m2, we continue the search among all pairs of these markers together with their "secondary" markers (colored in yellow).
False Discovery Rate (FDR) in hierarchical testing
Yekutieli (2008) offers a procedure to control the FDR for the full tree of tests.
Since the number of tests involved is enormous, we should control false positives. For this purpose we used the False Discovery Rate criterion proposed by Benjamini and Hochberg. In our case it is defined as the expected proportion of erroneously identified epistatic effects among all identified ones. Yekutieli (2008) suggested a hierarchical procedure to control the FDR across the tree of hypotheses. In our case all hypotheses can be arranged in a 2-level structure. In the first level are the hypotheses for all combinations of multi-trait complexes and pairs of sparse "framework" markers. In the second level are the hypotheses for all combinations selected in the first level, this time using the "secondary" markers related to the corresponding framework markers. We are interested in full-tree FDR control, that is, over all epistasis discoveries in the whole tree. The rejection threshold q should be chosen such that the full-tree FDR is controlled at level 0.1.
Hierarchical FDR control
A universal upper bound is derived for the full-tree FDR (Yekutieli, 2008):
An upper bound for $\delta^*$ may be estimated using:
where $R_t^{P_i=0}$ and $R_t^{P_i=1}$ are the numbers of discoveries in t, given that $H_i$ is a true null hypothesis in t and a false null hypothesis, respectively.
Searching algorithm
STAGE 1: Construct multi-trait complexes (using WGCNA clustering).
STAGE 2: Hierarchical search
- Step 1: Screen for combinations of loci-pair and multi-trait complex with potential for epistasis (NOIA model).
- Step 2: Test using higher resolution loci only for the selected regions (NOIA model).
Data
- A sample of 210 individuals from an Arabidopsis thaliana population
- Genotypic map consists of 579 markers
- Transcript levels were quantified using Affymetrix whole-genome microarrays
- Total of 22,810 gene expressions from all five chromosomes (non-expressed genes filtered out)
We implemented the algorithm on Arabidopsis data of 210 RILs. Around 23,000 gene expressions were produced from all five chromosomes.
Two-stage hierarchical testing for epistasis
STAGE 1: Identified 314 gene clusters (WGCNA).
STAGE 2:
- 47 sparse "framework" markers that are within 10 cM of each other.
- 10-12 "secondary" markers related to each "framework" marker.
- First step: 1081 marker pairs x 314 meta-genes = 339,434 tests; 11 regions are identified.
- Second step: 1141 epistatic effects are identified.
Then we applied our algorithm: 314 gene clusters were identified (WGCNA). For the first stage of hierarchical testing, 47 sparse "framework" markers that are within 10 cM of each other were used, and 10-12 "secondary" markers were placed in each framework area. So we tested around 340,000 epistatic hypotheses.
Epistatic regions
Simulation study
Simulation study (cont'd)
Preprocessing
- The Variance Stabilization Normalization
- Gene expression filtering: 7244 genes out of 22810
- Markers preprocessing
The Variance Stabilization Normalization (VSN) uses a generalized log transformation. After filtering out non-expressed genes, 7244 genes remained out of 22810. We also filtered out bad or non-informative markers.
Computational advantage
- Using the two-stage algorithm on meta-genes, 341,107 hypotheses were tested.
- Naive analysis: 121,278 loci pairs for each of 7244 traits, namely 878,537,832 tests, would have been performed.
- Reduction of the number of tests by a factor of 2575.
If, instead, all possible combinations of marker pairs and raw traits were tested in one stage, about 900,000,000 tests would have been performed.
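The counts quoted above can be checked with a few lines of arithmetic. All inputs are numbers taken from the slides; only the variable names are introduced here.

```python
from math import comb

n_framework = 47          # sparse "framework" markers
n_meta_genes = 314        # gene clusters (meta-genes)
total_two_stage = 341_107 # hypotheses tested by the two-stage algorithm
naive_pairs = 121_278     # loci pairs in the naive analysis
n_traits = 7_244          # single traits after filtering

framework_pairs = comb(n_framework, 2)           # pairs of framework markers
stage1_tests = framework_pairs * n_meta_genes    # first-step tests
second_step_tests = total_two_stage - stage1_tests
naive_tests = naive_pairs * n_traits
reduction = naive_tests / total_two_stage
```

The difference between the two-stage total and the first-step count gives the number of second-step tests, 1,673, and the ratio of naive to two-stage tests is about 2575.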
Peak Detection
Point-wise statistics
[Figure: point-wise statistics; panels: wild-type, mutant]
Define a scan statistic
Peak Detection
[Figure: point-wise statistics and moving-sum statistics]
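A moving-sum scan statistic can be sketched as follows: sum the point-wise statistics over a sliding window and take the maximum. The window length and the toy series are assumptions for illustration.

```python
def moving_sums(values, window):
    """Sums over every run of `window` consecutive point-wise statistics."""
    current = sum(values[:window])
    sums = [current]
    for i in range(window, len(values)):
        # Slide the window by one position: add the new value, drop the oldest.
        current += values[i] - values[i - window]
        sums.append(current)
    return sums

def scan_statistic(values, window):
    """Maximum moving sum and the start index of the window attaining it."""
    sums = moving_sums(values, window)
    start = max(range(len(sums)), key=sums.__getitem__)
    return sums[start], start

# Hypothetical point-wise statistics with a peak around positions 5 to 7.
pointwise = [0, 1, 0, -1, 0, 3, 4, 5, 0, -1, 1, 0]
peak_value, peak_start = scan_statistic(pointwise, window=3)
```

Summing over a window smooths isolated spikes, so a run of moderately high point-wise values stands out more clearly than it does point by point.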
Summary: data science
Data science is an emerging field/profession that incorporates knowledge and expertise from several disciplines.
It combines both big data technologies and sophisticated methods for complicated data analysis.
Data analysis aims to answer various questions with case-specific challenges, and should therefore be carefully tailored to the type of problem and data.
References
- Reiner-Benaim, A., Shmueli, E. and Grabarnick, A. (submitted) A statistical learning approach for runtime prediction in Intel's data center.
- Goldstein, P., Korol, A. B. and Reiner-Benaim, A. (2014) Two-stage genome-wide search for epistasis with implementation to Recombinant Inbred Lines (RIL) populations. PLOS ONE, 9(12).
- Reiner-Benaim, A. (2015) Scan statistic tail probability assessment based on process covariance and window size. Methodology and Computing in Applied Probability, in press.
- Reiner-Benaim, A., Davis, R. W. and Juneau, K. (2014) Scan statistics analysis for detection of introns in time-course tiling array data. Statistical Applications in Genetics and Molecular Biology, 13(2), 173-90.
Thank you