The big-data analytics challenge – combining statistical and algorithmic perspectives
Anat Reiner-Benaim, Department of Statistics, University of Haifa
IDC, May 14, 2015

Transcript of presentation, IDC, May 14, 2015


Outline
- Data science: definition? Who needs it?
- The elements of data science
- Analysis: modeling, software
- Examples: scheduling (prediction of runtime); genetics (detection of rare events)

What is data science?
From Wikipedia: "Data science is the study of the generalizable extraction of knowledge from data."

More from Wikipedia:
- Data science builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing and high-performance computing.
- Goal: extracting meaning from data and creating data products.
- Not restricted to big data, although the fact that data is scaling up makes big data an important aspect of data science.

Data science: who needs it?
Anyone who has (big) data, e.g.:
- Cellular industry: phones, apps, advertisers
- Internet: search engines, social media, marketing, advertisers
- Computer networks and server systems
- Cyber security
- Credit cards
- Banks
- Health care providers
- Life science: genome, proteome
- TV and related
- Weather forecasting

The elements of data science

Big data technologies:
- Store, preprocess: NoSQL databases (e.g. Cassandra); DFS (distributed file system) frameworks (e.g. Hadoop, Spark, GraphLab).
- Dump to an SQL database (e.g. MySQL, SAS SQL).

Big data analytics:
- Analyze: apply sophisticated methods, i.e. statistical modeling and machine-learning algorithms.

Data analysis: first, define the problem
- How can I decide that an item in a manufacturing process is faulty?
- What is the difference between the new machine and the old one?
- What are the factors that affect system load?
- How can I predict the memory/runtime of a program?
- How can I predict that a customer will churn?
- What is the chance that the phone/web user will click my advertisement?
- What is the chance that the current ATM user is committing fraud?
- What is the chance of snow this week?

Modeling
Possible goals:
- Predicting, classifying: (logistic) regression, LDA, QDA, Naïve Bayes, neural networks, CART, random forests, SVM, KNN.
- Clustering: hierarchical, K-means, mixture models, HMM, PCA.
- Anomaly detection, peak detection: scan statistics, outlier-detection methods.
- A/B testing (actually two-sample comparison): parametric tests (normal, t, chi-square, ANOVA); non-parametric tests (signed-rank, rank-sum, Kruskal-Wallis).
- Identifying trends and cycles: regression, time series.
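The two-sample comparison listed under A/B testing can be illustrated with a small permutation test, which avoids the normality assumption of the parametric tests. This is a minimal sketch on synthetic data (not from the talk):

```python
import random

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sample permutation test for a difference in means.

    Returns a two-sided p-value: the fraction of label shufflings whose
    absolute mean difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a = pooled[:len(a)]
        perm_b = pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one correction

# Synthetic A/B example: variant B is shifted upward.
a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7]
b = [10.9, 11.2, 10.8, 11.0, 11.1, 10.7, 11.3, 10.9]
p = permutation_test(a, b)
```

With the two groups fully separated as here, the p-value is near the smallest attainable value for this sample size.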

Choosing models
- Type of variables: continuous, ordinal, categorical.
- Statistical assumptions: normality, equal variance, independence.
- Missing data.
- Stability.

Learning tools
- Bootstrap: repeatedly fit the model on resampled data.
- Bagging (bootstrap aggregation): combine bootstrap samples to prevent instability.
- Boosting: combine a set of weak learners to create a single strong learner.
- Regularization: solve over-fitting by restriction (e.g. limit regression to a linear or low-degree polynomial).
- Utility/cost function: evaluate performance, compare models.
These are typically iterative procedures, combined with the modeling procedures, that help optimize the model and evaluate its performance.
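The bootstrap item above (repeatedly refitting on resampled data) can be sketched in a few lines; here it estimates the standard error of a median on synthetic data:

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.median, n_boot=2000, seed=1):
    """Bootstrap standard error: recompute the statistic on samples
    drawn with replacement from the observed data, then take the
    standard deviation of the replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(stat(sample))
    return statistics.stdev(reps)

# Synthetic measurements; the median's standard error has no simple
# closed form, which is exactly where the bootstrap helps.
data = [2.3, 1.9, 3.1, 2.8, 2.2, 4.0, 2.5, 3.3, 2.0, 2.7]
se = bootstrap_se(data)
```

The same resampling loop underlies bagging: instead of summarizing the replicates, one averages the models fitted to them.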

More to consider: controlling statistical error due to large-scale analysis
Multiple statistical tests lead to inflated statistical error. Control the FDR?
FDR = the expected proportion of false findings (e.g. features) among all findings.
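The FDR is commonly controlled with the Benjamini–Hochberg step-up procedure; a minimal sketch (the p-values are made up for illustration):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Sorts the m p-values, finds the largest k such that
    p_(k) <= (k/m) * q, and rejects the k smallest p-values.
    Controls the FDR at level q for independent tests.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k = rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

# Ten hypothetical p-values from a multiple-testing screen.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
flags = benjamini_hochberg(pvals, q=0.05)
```

Note that 0.039 is below 0.05 yet is not rejected: the step-up threshold for the third-smallest p-value is (3/10)·0.05 = 0.015.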

The R software
- Open-source programming language and environment for statistical computing.
- Widely used among statisticians for developing statistical software (packages) and for data analysis.
- Increasingly popular among all data professionals.

Advantages:
- Contains the most up-to-date statistical models and machine-learning algorithms.
- Methods are based on research, compiled and documented.
- Contains Hadoop functions (package rhdfs).
- Very convenient for plain programming, scripting, simulations, visualization.
- Friendly interfaces (e.g. RStudio). See the R project site.

Examples
- Runtime prediction (manufacturing, scheduling)
- Anomaly/peak detection (fraud, electronics, genetics)
- Diagnostics (biotech, healthcare)
- Epistasis detection (genetics)

Example 1: Classification of job runtime at Intel
Joint work with: Anna Grabarnick, University of Haifa; Edi Shmueli, Intel

Job processing
[Diagram: users submit jobs; the job scheduler decides which server and which queue each job goes to]

Job schedulers
- Algorithms aimed at efficiently queuing and distributing jobs among servers, thereby improving system utilization.
- Popular scheduling algorithms (e.g. backfilling) use information on how long the jobs are expected to run.
- In serial job systems, scheduling performance can be improved by merely separating the short jobs from the long ones and assigning them to different queues. This reduces the likelihood that short jobs will be delayed behind long ones, and thus improves overall system throughput.

Job processing, revisited
[Diagram: the scheduler classifies each incoming job as short or long before assigning it to a server]

The problem
Main purpose: classify jobs into short and long durations.
Questions:
- How can the classes be defined?
- How can the jobs be classified?

Available data
Two traces obtained from one of Intel's data centers:
- ~1 million jobs executed during a period of 10 consecutive days, used for training.
- ~755,000 jobs executed during a period of 7 consecutive days, used for model validation.
Aside from runtime information, 9 categorical variables were available.

Analysis steps
1. Exploratory visualization of the data.
2. Class construction and characterization.
3. Classification: choose a classification model; optimize the model; validate the model.

[Exploratory plots of the categorical variables A1, A2, A3 and B4]

Runtime distribution
[Histograms of runtime in seconds: all observations, and observations with wall time < 15,000 sec]

Runtime: log transformation
[Histogram of the log-transformed runtimes]

Constructing classes by the mixture model

Mixture distribution parameter estimation
[Slides 27-33: estimation of the mixture parameters, including the "short" component, and the resulting estimates; the equations and figures are not preserved in the transcript]
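The estimation slides are not preserved, but the underlying idea can be sketched: fit a two-component normal mixture to the log-runtimes by EM and treat the lower-mean component as the "short" class. This is an illustrative reimplementation on synthetic data, not the talk's actual estimates:

```python
import math
import random

def em_two_gaussians(x, iters=200):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (weights, means, sds); the component with the smaller
    mean plays the role of the "short" class."""
    xs = sorted(x)
    half = len(xs) // 2
    # Crude initialization from the lower and upper halves of the data.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    sd = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability that each point came from component 0.
        resp = []
        for xi in x:
            d = [w[k] / sd[k] * math.exp(-0.5 * ((xi - mu[k]) / sd[k]) ** 2)
                 for k in (0, 1)]
            resp.append(d[0] / (d[0] + d[1]))
        # M-step: re-estimate weight, mean and sd of each component.
        for k, rk in ((0, resp), (1, [1.0 - r for r in resp])):
            nk = sum(rk)
            w[k] = nk / len(x)
            mu[k] = sum(r * xi for r, xi in zip(rk, x)) / nk
            var = sum(r * (xi - mu[k]) ** 2 for r, xi in zip(rk, x)) / nk
            sd[k] = math.sqrt(max(var, 1e-6))
    return w, mu, sd

# Synthetic log-runtimes: a "short" mode near 2 and a "long" mode near 6.
rng = random.Random(7)
data = [rng.gauss(2.0, 0.5) for _ in range(300)] + \
       [rng.gauss(6.0, 0.8) for _ in range(200)]
weights, means, sds = em_two_gaussians(data)
```

The fitted posterior probability of the "short" component is exactly the mixture probability later thresholded (e.g. at 0.45) when classifying jobs.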

Building a classifier: the learning algorithm
- Fit a model on training data: model/feature selection.
- Evaluate the model on testing data.
- Summarize model performance: ROC, misclassification rates, fit (F test, SSE).
- Compare models; validate on a validation set.
- Optimize on the full data: ROC, pseudo-ROC.

The training and testing process
- 80% of the data is for training: finding a classifier (model/feature selection).
- 20% is for testing: checking performance.
- After obtaining a classifier, optimize: choose the mixture threshold that maximizes performance on the full dataset.
- Sequential procedures are used for model reduction.

Classifiers
Here we choose two classification models:
- logistic regression
- decision trees
They can both handle:
- missing data;
- candidate classifying variables that are either continuous or categorical;
- categorical variables with many categories.

Decision trees
- Classification rules are formed by the paths from the root to the leaves.
- No assumptions are made regarding the distribution of the predictors.
- Relatively unstable.
Steps:
1. The tree is built by recursive splitting of nodes, until a maximal tree is generated.
2. Pruning: simplification of the tree by cutting nodes off, which prevents overfitting.
3. Selection of the optimal pruned tree: the one that fits well without overfitting.
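The recursive-splitting step at the heart of tree building can be illustrated with a single split: exhaustively search for the threshold that minimizes the weighted Gini impurity of the two children. A toy sketch on synthetic data (Gini is one common splitting criterion; the talk does not specify which was used):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """CART-style exhaustive search for the numeric threshold that
    minimizes the weighted Gini impurity of the two child nodes."""
    best_score, best_threshold = float("inf"), None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_score, best_threshold = score, threshold
    return best_threshold

# Synthetic data: one numeric feature; label 0 = "short", 1 = "long".
xs = [1.0, 1.5, 2.0, 3.0, 3.5, 5.0, 5.5, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
t = best_split(xs, ys)
```

A full tree applies this search recursively to each child node until a maximal tree is grown, then prunes.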

Logistic regression
- Regression used to predict the outcome of a binary variable (like short or long).
- Given X, the outcome Y is distributed Bernoulli, with conditional mean E(Y|X).
- The connection between E(Y|X) and X can be described by the logistic function:

E(Y|X) = exp(b0 + b1*X) / (1 + exp(b0 + b1*X)),

which has an S shape. In general, the logistic function is f(z) = 1 / (1 + e^(-z)).
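The logistic function, and a fit of E(Y|X) by simple gradient ascent on the Bernoulli log-likelihood, can be sketched as follows (toy one-feature data; the talk's actual model used many variables and a standard fitting routine):

```python
import math

def logistic(z):
    """The S-shaped logistic function f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit E(Y|X) = logistic(b0 + b1*x) by gradient ascent on the
    Bernoulli log-likelihood (single feature, no regularization)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - logistic(b0 + b1 * x)  # residual on the probability scale
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Toy data: the chance of a "long" job (y = 1) grows with the feature x.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

The fitted probabilities follow the S shape: near 0 for small x, near 1 for large x.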

Performance measures
We use the ROC curve. It combines both types of error:
- Sensitivity (true positive rate): the probability of a "short" classification when the runtime is short.
- Specificity (true negative rate): the probability of a "long" classification when the runtime is long.
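Sensitivity and specificity as defined above can be computed directly from the true and predicted classes; a small sketch with made-up labels:

```python
def sensitivity_specificity(y_true, y_pred, positive="short"):
    """Sensitivity: P(classified short | truly short).
    Specificity: P(classified long | truly long)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical classifications of eight jobs.
y_true = ["short", "short", "short", "long", "long", "long", "long", "short"]
y_pred = ["short", "short", "long", "long", "long", "short", "long", "short"]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Sweeping the classification threshold and plotting the resulting (1 - specificity, sensitivity) pairs traces out the ROC curve.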

Performance optimization
For the CART procedure, variables A1, A2, A3 and B4 were selected into the classifier. For performance optimization, we use a pseudo-ROC curve: the optimal tradeoff between sensitivity and specificity (marked by a blue circle) is obtained for a mixture probability threshold of 0.45.

For the logistic regression, most variables were selected into the classifier. For performance optimization, we compare ROC curves obtained for different thresholds, and choose a threshold of 0.4.

Validation results
Total misclassification rates:
- CART: 9%.
- Logistic regression: 17%.

Summary:
- Runtime can be effectively classified using the available information.
- Further evaluation of our method is required, using data sets from different installations and times.

Example 2: Detection of 2nd-order epistasis on multi-trait complexes
Joint work with: Pavel Goldstein and Prof. Avraham Korol, University of Haifa

Searching for epistasis
Goal: search for epistatic effects (interactions between genomic loci) on expression traits.

Epistasis
[Interaction plots: gene expression (Y-axis) vs. genotype of QTL1 (X-axis), with separate lines for the alleles of QTL2; one panel without epistasis, one with epistasis]

More than one gene may affect the trait, in which case an epistatic effect is of potential interest. The Y-axis shows gene expression and the X-axis the genotypes of QTL1. The markers have only two levels, A or H, as is the case in a recombinant inbred line (RIL) population. The first plot represents the case of no epistasis; conceptually it is similar to a two-way analysis of variance. In the second plot an epistatic effect is involved.

Despite the growing interest in searching for epistatic interactions, there is no consensus as to the best strategy for their detection.

Suggested approach:
- QTL analysis: combine gene expression and mapping data.
- Use multi-trait complexes rather than single traits (trait = gene expression of a particular gene).
- Screen for potential epistatic regions in a hierarchical manner.
- Control the overall FDR (False Discovery Rate).

Multi-trait complexes
Number of tests for interactions on single traits: number of genes (~7,200) × number of loci pairs (~120,000) = a lot!
A dimension-reduction stage can be of help.
Suggestion: considering correlated traits as multi-trait complexes has been shown to increase QTL detection power, mapping resolution and estimation accuracy (Korol et al., 2001).

Clustering traits (genes): Weighted Gene Co-Expression Network Analysis (WGCNA) (Zhang and Horvath, 2005)
- Weighted correlation network.
- Top-down hierarchical clustering.
- Dynamic Tree Cut algorithm: a branch-cutting method for detecting gene modules, depending on their shape.
- Meta-genes are built by taking the first principal component of the genes in every cluster.

WGCNA, proposed by Zhang and Horvath, is used here for gene-expression clustering. First, top-down hierarchical clustering is applied, using weighted inter-gene distances. Then the shape-sensitive branch-cutting method is used to detect gene modules. Finally, meta-genes are defined as the first principal component of the genes in every cluster.
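The meta-gene construction (first principal component of a cluster's genes) can be sketched with power iteration on the sample-by-sample cross-product matrix; a toy example with three co-expressed genes (illustrative only, not the WGCNA implementation):

```python
import math

def metagene(expr):
    """Meta-gene of a cluster: first principal component of the
    genes-by-samples expression matrix, computed by power iteration
    on the samples' cross-product matrix."""
    n_genes, n_samp = len(expr), len(expr[0])
    # Center each gene across samples.
    centered = []
    for row in expr:
        m = sum(row) / n_samp
        centered.append([v - m for v in row])
    # Sample-by-sample cross-product matrix C = X^T X.
    c = [[sum(centered[g][i] * centered[g][j] for g in range(n_genes))
          for j in range(n_samp)] for i in range(n_samp)]
    # Power iteration for the leading eigenvector (the meta-gene profile).
    v = [1.0] + [0.0] * (n_samp - 1)
    for _ in range(200):
        w = [sum(c[i][j] * v[j] for j in range(n_samp)) for i in range(n_samp)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy cluster: three co-expressed genes measured in five samples.
expr = [[1.0, 2.0, 3.0, 4.0, 5.0],
        [1.1, 2.1, 2.9, 4.2, 5.1],
        [0.9, 1.8, 3.1, 3.9, 4.8]]
mg = metagene(expr)
```

The returned profile (unit norm, defined up to sign) follows the shared expression trend of the cluster, so downstream tests use one meta-gene instead of many correlated genes.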

Testing for epistasis: the Natural and Orthogonal Interactions (NOIA) model (Alvarez-Castro and Carlborg, 2007)
For trait t, loci pair l (loci A and B) and replicate i, the model expresses the gene expression through an indicator of the genotype combinations of the two loci, a design matrix, and a vector of genetic effects.

We propose to test the epistasis hypothesis by fitting the NOIA model of Alvarez-Castro and Carlborg, modified for second-order epistasis in RIL populations, which are homozygous. The model allows orthogonal estimation of the genetic effects. For loci A and B, the vector of gene expressions for trait t, loci pair l and replicate i is represented as the product of the genotype-combination indicators with the corresponding genotypic values, plus an error term. In turn, the genotypic values are represented as the design matrix, which guarantees orthogonality of the effects, multiplied by the vector of genetic effects.

The test for epistasis is done hierarchically: framework markers and secondary markers.

Neighboring markers on the genotype map contain very similar information. Based on this property, we separated all markers into "framework" markers (relatively distant loci, marked as bold dots) and "secondary" markers (small vertical lines) related to their corresponding framework markers; long vertical lines denote the borders of the framework-marker areas. The markers thus have a hierarchical structure, and we propose a two-stage approach for identifying QTL epistasis. The algorithm starts with an initial construction of multi-trait complexes (meta-genes) by WGCNA clustering of the microarray gene-expression data. Then epistasis is tested among all combinations of such complexes and loci pairs: starting with an initial "rough" search over pairs of framework markers, followed by a higher-resolution search only within the identified regions. If an epistatic effect is found between markers m1 and m2, the search continues between all pairs of their "secondary" markers.

False Discovery Rate (FDR) in hierarchical testing
Yekutieli (2008) offers a procedure to control the FDR over the full tree of tests.

Since the number of tests involved is enormous, we must control false positives. For this purpose we use the False Discovery Rate criterion proposed by Benjamini and Hochberg; in our case it is defined as the expected proportion of erroneously identified epistatic effects among all identified ones. Yekutieli (2008) suggested a hierarchical procedure to control the FDR across a tree of hypotheses. In our case all hypotheses can be arranged in a two-level structure: at the first level are the hypotheses for all combinations of multi-trait complexes and pairs of sparse "framework" markers; at the second level are the hypotheses for all combinations selected at the first level, this time using the "secondary" markers related to the corresponding framework markers. We are interested in full-tree FDR control, covering all epistasis discoveries in the whole tree. The rejection threshold q should be chosen such that the full-tree FDR is controlled at the desired level, 0.1.

Hierarchical FDR control
A universal upper bound is derived for the full-tree FDR (Yekutieli, 2008). An upper bound for * may in turn be estimated using R_{t|P_i=0} and R_{t|P_i=1}, the number of discoveries in subtree t given that H_i is a true null hypothesis and a false null hypothesis, respectively. [The formulas themselves are not preserved in the transcript.]

Searching algorithm
STAGE 1: Construct multi-trait complexes (using WGCNA clustering).
STAGE 2: Hierarchical search:
- Step 1: Screen for combinations of loci pair and multi-trait complex with potential for epistasis (NOIA model).
- Step 2: Test using higher-resolution loci only within the selected regions (NOIA model).
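The two-stage search can be mimicked in miniature: screen coarse "framework" regions first, then run the fine "secondary" tests only inside the regions that pass. Here the NOIA fits are replaced by precomputed p-values and a plain per-level threshold stands in for the full hierarchical FDR procedure (all names and numbers are made up):

```python
def two_stage_screen(framework_pvals, secondary_pvals, q=0.05):
    """Toy two-stage hierarchical screen.

    framework_pvals: {region: p-value} for the coarse first pass.
    secondary_pvals: {region: {marker_pair: p-value}} for the fine pass.
    Regions passing a simple level-q threshold at stage 1 are expanded,
    and only their secondary tests are run at stage 2. (The real
    procedure applies an FDR-controlling step at each level and bounds
    the full-tree FDR following Yekutieli, 2008.)
    """
    hits = {}
    tests_done = len(framework_pvals)
    for region, p in framework_pvals.items():
        if p <= q:  # stage 1: screen framework regions
            fine = secondary_pvals.get(region, {})
            tests_done += len(fine)
            # stage 2: high-resolution tests inside the selected region only
            hits[region] = [pair for pair, pp in fine.items() if pp <= q]
    return hits, tests_done

framework = {"r1": 0.001, "r2": 0.4, "r3": 0.03, "r4": 0.9}
secondary = {"r1": {("m1", "m2"): 0.002, ("m1", "m3"): 0.2},
             "r3": {("m7", "m8"): 0.01},
             "r2": {("m4", "m5"): 0.001}}  # never reached: r2 fails stage 1
hits, n_tests = two_stage_screen(framework, secondary)
```

The saving is visible even in this toy: only 7 tests run instead of the 8 that exhaustive testing of every secondary pair would require, and the gap grows enormously at genome scale.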

Data
- A sample of 210 individuals from an Arabidopsis thaliana population.
- The genotypic map consists of 579 markers.
- Transcript levels were quantified using Affymetrix whole-genome microarrays.
- A total of 22,810 gene expressions from all five chromosomes (non-expressed genes filtered out).

We implemented the algorithm on Arabidopsis data of 210 RILs; around 23,000 gene expressions were measured across all five chromosomes.

Two-stage hierarchical testing for epistasis
STAGE 1: Identified 314 gene clusters (WGCNA).
STAGE 2:
- 47 sparse "framework" markers that are within 10 cM of each other.
- 10-12 "secondary" markers related to each "framework" marker.
- First step: 1,081 marker pairs × 314 meta-genes = 339,434 tests; 11 regions are identified.
- Second step: 1,141 epistatic effects are identified.

We then applied our algorithm: 314 gene clusters were identified (WGCNA); for the first stage of the hierarchical testing, 47 sparse "framework" markers within 10 cM of each other were used, with 10-12 "secondary" markers placed in each framework area. In total, around 440,000 epistatic hypotheses were tested.

Epistatic regions
[Figure: the identified epistatic regions]

Simulation study
[Figures: simulation study results]

Preprocessing
- Variance stabilization normalization (VSN), which uses the generalized log transformation.
- Gene-expression filtering: 7,244 genes retained out of 22,810 (non-expressed genes removed).
- Marker preprocessing: bad and non-informative markers were filtered out.

Computational advantage
- Using the two-stage algorithm on meta-genes, 341,107 hypotheses were tested.
- A naive analysis, testing all possible combinations in one stage (121,278 loci pairs for each of 7,244 traits), would have required 878,537,832 tests.
- This is a reduction in the number of tests by a factor of about 2,575.

Peak detection
[Figures: point-wise statistics for wild-type and mutant samples]

Define a scan statistic
[Formula not preserved in the transcript]

Peak detection (contd)
[Figures: point-wise statistics and moving-sum statistics]
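The moving-sum idea can be sketched directly: slide a fixed window over the point-wise statistics and report each window's sum; the peak is the window with the largest sum (synthetic numbers for illustration):

```python
def moving_sums(stats, window):
    """Moving-sum scan statistic: the sum of the point-wise
    statistics inside each window of the given size."""
    return [sum(stats[i:i + window]) for i in range(len(stats) - window + 1)]

def peak_window(stats, window):
    """Starting index of the window that maximizes the moving sum."""
    sums = moving_sums(stats, window)
    return max(range(len(sums)), key=sums.__getitem__)

# Point-wise statistics with an elevated stretch (a "peak") at positions 5-8.
point_stats = [0.1, 0.3, 0.2, 0.1, 0.4, 2.1, 2.5, 2.2, 1.9, 0.3, 0.2, 0.1]
start = peak_window(point_stats, window=4)
```

Declaring the peak significant requires the tail probability of the maximal moving sum under the null, which is what the scan-statistic theory in the references provides.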

Summary: data science

Data science is an emerging field/profession that incorporates knowledge and expertise from several disciplines.

It combines big data technologies with sophisticated methods for the analysis of complicated data.

Data analysis aims to answer various questions with case-specific challenges, and should therefore be carefully tailored to the type of problem and data.

References
- Reiner-Benaim, A., Shmueli, E. and Grabarnick, A. (submitted). A statistical learning approach for runtime prediction in Intel's data center.
- Goldstein, P., Korol, A. B. and Reiner-Benaim, A. (2014). Two-stage genome-wide search for epistasis with implementation to Recombinant Inbred Lines (RIL) populations. PLOS ONE, 9(12).
- Reiner-Benaim, A. (2015). Scan statistic tail probability assessment based on process covariance and window size. Methodology and Computing in Applied Probability, in press.
- Reiner-Benaim, A., Davis, R. W. and Juneau, K. (2014). Scan statistics analysis for detection of introns in time-course tiling array data. Statistical Applications in Genetics and Molecular Biology, 13(2), 173-190.

Thank you