Nimbus

An open-source library to implement random forests in genomic contexts. Oscar González-Recio, Juanjo Bazán, Selma Forni
Date posted: 21-Oct-2014

description

Nimbus is a Ruby gem implementing the random forest algorithm for genomic selection contexts. http://www.nimbusgem.org Presentation from the EAAP 2011 conference in Stavanger, Norway.

Transcript of Nimbus

Page 1: Nimbus

An open-source library to implement random forests in genomic contexts

Oscar González-Recio, Juanjo Bazán, Selma Forni

Page 2: Nimbus

The Problem

Page 3: Nimbus

The Problem

Massive amounts of information from high-throughput genotyping platforms.

Page 4: Nimbus

The Problem

We need to extract knowledge from large, noisy, redundant, incomplete and fuzzy data.

Page 5: Nimbus

The Problem

A massive amount of information consumes the attention of its recipients. We need to allocate that attention efficiently.

Page 6: Nimbus

Why Random Forest?

Page 7: Nimbus

Why Random Forest?

Using machine learning techniques, we can extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.

Page 8: Nimbus

Why Random Forest?

Random Forest has desirable statistical properties.

Page 9: Nimbus

Why Random Forest?

Random Forest scales well computationally.

Page 10: Nimbus

Why Random Forest?

Random Forest performs extremely well in a variety of complex domains (Breiman, 2001; Gonzalez-Recio & Forni, 2011).

Page 11: Nimbus

The Algorithm

Page 12: Nimbus

The Algorithm

Ensemble methods:

- Combination of different methods (usually simple models).
- They have very good predictive ability because they exploit the additivity of the models' performances.

Based on Classification And Regression Trees (CART).

Uses randomization and bagging.

Performs feature subset selection.

Convenient for classification problems.

Fast computation.

Simple interpretation of results for human minds.

Previous work in genome-wide prediction (Gonzalez-Recio and Forni, 2011).

Page 13: Nimbus

The Algorithm

Perform bootstrap on data: Ψ* = (y, X)

Build a CART ( f_i(y, X) = h_t(x) ) using only an mtry proportion of the SNPs at each node.

Repeat M times to reduce residuals by a factor of M.

Average the estimates: c_0 = μ ; c_i = 1/M
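The steps on this slide can be sketched in Ruby roughly as follows. This is a minimal illustration, not the Nimbus source: the method names (fit_base_learner, train_forest), the 0/1/2 genotype coding, and the one-split base learner standing in for a full CART are all assumptions made for the example. The coefficients are read as an intercept c_0 = μ plus a plain average (c_i = 1/M) of learners fitted to mean-centered phenotypes.

```ruby
# Illustrative only: names and structure are hypothetical, not Nimbus internals.

# Base learner standing in for a CART. It inspects a random subset of SNPs
# (an mtry proportion), keeps the SNP whose genotype groups give the lowest
# squared error, and predicts the group mean for each genotype.
def fit_base_learner(y, x, mtry, rng)
  n_snps = x.first.size
  k = [(mtry * n_snps).ceil, 1].max
  candidates = (0...n_snps).to_a.sample(k, random: rng)

  best = candidates.map do |snp|
    groups = Hash.new { |h, g| h[g] = [] }
    x.each_with_index { |row, i| groups[row[snp]] << y[i] }
    means = groups.transform_values { |v| v.sum / v.size.to_f }
    sse = x.each_with_index.sum { |row, i| (y[i] - means[row[snp]])**2 }
    [sse, snp, means]
  end.min_by(&:first)

  _, snp, means = best
  # Genotypes absent from the bootstrap sample fall back to 0.0,
  # which is the mean of the centered phenotypes.
  ->(row) { means.fetch(row[snp], 0.0) }
end

# Bagging: fit M learners, each on its own bootstrap sample, then average.
def train_forest(y, x, trees: 50, mtry: 0.5, seed: 1)
  rng = Random.new(seed)
  mu = y.sum / y.size.to_f
  centered = y.map { |v| v - mu }

  learners = Array.new(trees) do
    idx = Array.new(y.size) { rng.rand(y.size) } # sample rows with replacement
    fit_base_learner(centered.values_at(*idx), x.values_at(*idx), mtry, rng)
  end

  # c_0 = mu and c_i = 1/M: intercept plus the average of the learners.
  ->(row) { mu + learners.sum { |h| h.call(row) } / trees.to_f }
end

# Toy data: the phenotype is driven by the first SNP.
x = [[0, 1, 2], [1, 0, 2], [2, 2, 0], [0, 0, 1], [1, 2, 1], [2, 1, 0]]
y = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

forest = train_forest(y, x, trees: 100, mtry: 1.0, seed: 42)
puts forest.call([0, 1, 1])
puts forest.call([2, 1, 1])
```

Fixing the seed makes the bootstrap samples reproducible, which is convenient when comparing settings such as the number of trees or the mtry proportion.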

Page 14: Nimbus

The Algorithm

Classification and regression trees:

Let Ψ = (y, X) be a set of data, with

y = vector of phenotypes (response variables)

X = (x_1, x_2) = matrix of features

y_1    x_{11}    x_{12}
...    ...       ...
y_i    x_{i1}    x_{i2}
...    ...       ...
y_n    x_{n1}    x_{n2}
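To make the tree-building step concrete, here is a small Ruby sketch of how a single regression-tree node could choose its split on data laid out as (y, X). It is an assumption-laden illustration rather than the Nimbus code: the helper names (sse, best_split), the 0/1/2 genotype coding, and the squared-error criterion are chosen for the example.

```ruby
# Illustrative only: helper names are hypothetical, not part of Nimbus.

# Sum of squared deviations from the mean; 0.0 for an empty node.
def sse(values)
  return 0.0 if values.empty?
  mean = values.sum / values.size.to_f
  values.sum { |v| (v - mean)**2 }
end

# Try every feature and every threshold, and keep the split that
# minimizes the total squared error of the two child nodes.
def best_split(y, x)
  best = nil
  x.first.each_index do |snp|
    # Drop the largest observed value so both children are non-empty.
    thresholds = x.map { |row| row[snp] }.uniq.sort[0...-1]
    thresholds.each do |t|
      left, right = y.each_index.partition { |i| x[i][snp] <= t }
      loss = sse(y.values_at(*left)) + sse(y.values_at(*right))
      best = { snp: snp, threshold: t, loss: loss } if best.nil? || loss < best[:loss]
    end
  end
  best
end

# Toy data: five individuals, two SNPs; the first SNP drives the phenotype.
x = [[0, 2], [0, 1], [1, 0], [2, 2], [2, 0]]
y = [1.0, 1.2, 3.0, 3.1, 2.9]

split = best_split(y, x)
puts "split on SNP #{split[:snp]} at genotype <= #{split[:threshold]}"
```

A full tree would apply the same search recursively to each child node until a stopping rule, such as a minimum node size, is met.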

Page 15: Nimbus

The Algorithm

Page 16: Nimbus

Nimbus library

Page 17: Nimbus

Nimbus library

Written in Ruby

www.ruby-lang.org

Open source programming language

Syntax focused on simplicity

Natural to read and easy to write

Page 18: Nimbus

Nimbus library

How to install:

> gem install nimbus

Prerequisites:

Ruby and RubyGems (Ruby's default package manager) installed on the system

Page 19: Nimbus

Nimbus library

How to run:

> nimbus

Configuration:

Via config.yml file

Page 20: Nimbus

Nimbus library

config.yml file:

Page 21: Nimbus

Nimbus library

Input files:

training

testing

Page 22: Nimbus

Nimbus library

Features, use cases:

Training of a prediction forest

- Using a training set of individuals
- Nimbus creates a reusable forest
- Nimbus calculates SNP importances
- Generalization errors are computed for every tree in the forest

Genomic prediction of a testing sample

- Using a custom/reused forest, specified via config.yml
- Using a newly trained forest

Page 23: Nimbus

Outputs

Random Forest file

In standard YAML format

Page 24: Nimbus

Outputs

Predictions for the training sample

Page 25: Nimbus

Outputs

Predictions for the testing sample

Page 26: Nimbus

Outputs

SNP importances

Page 27: Nimbus

More info:

Nimbus website: www.nimbusgem.org

Source code: www.github.com/xuanxu/nimbus

Report bugs/request features: www.github.com/xuanxu/nimbus/issues

Page 28: Nimbus

Thank you!

Page 29: Nimbus

Questions?