R Integration with Hadoop on Ubuntu

Post on 17-Oct-2015


  • R-Hadoop Integration on Ubuntu: This manual is a guide to integrating R and Hadoop on Ubuntu 12.04.

    We assume that the following are already up and running before you start the R and Hadoop integration:

    Ubuntu 12.04

    Hadoop 1.x +

    If you do not have Hadoop preinstalled on your Ubuntu machine, please follow the Single-node-cluster-(pseudo-distributed-mode-cluster.pdf guide in your LMS under Module-7 to set up the environment for R integration with Hadoop.

    Once the Hadoop installation is done, make sure that all the Hadoop daemons are running (for example, by listing the Java processes with jps).
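    The check above can be sketched as a small shell helper. It assumes a Hadoop 1.x pseudo-distributed setup, where the usual daemons are NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker; the function name missing_daemons is made up for this illustration.

```shell
#!/bin/sh
# Daemons normally present on a Hadoop 1.x single-node (pseudo-distributed) setup.
EXPECTED_DAEMONS="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"

# Hypothetical helper: reads jps-style output ("<pid> <name>") on stdin
# and prints the name of every expected daemon that is not listed.
missing_daemons() {
    running=$(awk '{print $2}' | tr '\n' ' ')
    for daemon in $EXPECTED_DAEMONS; do
        # Match whole names only, so NameNode is not satisfied by SecondaryNameNode.
        case " $running " in
            *" $daemon "*) ;;
            *) echo "$daemon" ;;
        esac
    done
}

# Only query the JVM process list when jps is actually on the PATH.
if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

    If nothing is printed, all five daemons were found; any name that is printed is a daemon that still needs to be started.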

    Note: R integration with Hadoop runs into problems with OpenJDK (java-openjdk). To avoid them, oracle-java6 must be installed on the machine.

    To install oracle-java6, follow these steps:

    Run the command:

    sudo apt-get update

  • Select Yes to accept the license agreement when the installer prompts for it.

  • Edit the .bashrc file:

    # Set Hadoop-related environment variables

    export CONF=/home/user/hadoop-1.2.0/conf

    # Set JAVA_HOME

    export JAVA_HOME=/usr/lib/jvm/java-6-oracle

    # Add Hadoop bin/ directory to PATH

    export PATH=$PATH:/home/user/hadoop-1.2.0/bin

    Note: Replace the paths above with the exact locations on your system.

    Make sure JAVA_HOME points to the correct Java installation.
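    After reloading .bashrc (for example with source ~/.bashrc), the settings can be sanity-checked along these lines. This is only a sketch: the helper name is_java_home is invented here, and it only tests that bin/java exists under the directory, not which Java version it is.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when directory $1 contains an executable bin/java.
is_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if is_java_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME ok: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no bin/java: ${JAVA_HOME:-<unset>}"
fi

# The hadoop launcher should be reachable through PATH after the edit.
if command -v hadoop >/dev/null 2>&1; then
    echo "hadoop found at: $(command -v hadoop)"
else
    echo "hadoop is not on PATH"
fi
```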

  • Installing RHadoop

    RHadoop consists mainly of the following three R packages: rmr2, rhdfs and rhbase.

    The rmr2 package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file operations in R, and rhbase provides HBase connectivity from R.

    Step #1: Update sources.list.

    sudo gedit /etc/apt/sources.list

    Add the following line:

    deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu/ precise/
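    As an alternative to editing the file in gedit, the line can be appended from the shell. The sketch below uses a helper named append_once, which is invented for this illustration; it adds the line only when an identical one is not already present, so running it twice does not duplicate the entry.

```shell
#!/bin/sh
CRAN_LINE="deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu/ precise/"

# Hypothetical helper: appends line $2 to file $1 unless the exact line exists.
append_once() {
    file=$1
    line=$2
    if ! grep -qxF "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Writing to the real file needs root, so this script would be run with sudo:
#   append_once /etc/apt/sources.list "$CRAN_LINE"
```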

  • Step #2: Update the package lists.

    sudo apt-get update

    Step #3: Install r-base package.

    sudo apt-get install r-base

  • Check the installed version of R (for example, with R --version).
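    A minimal sketch of that check, assuming a release build of R whose first line of R --version output has the form "R version X.Y.Z (date) ..."; the function name r_version is made up here.

```shell
#!/bin/sh
# Hypothetical helper: extracts "X.Y.Z" from the first line of `R --version`
# output supplied on stdin; prints nothing if the line has another shape.
r_version() {
    sed -n '1s/^R version \([0-9][0-9.]*\).*/\1/p'
}

# Only run the check when R is installed and on the PATH.
if command -v R >/dev/null 2>&1; then
    R --version | r_version
fi
```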

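  • Download the packages listed in the steps below from: http://cran.cnr.berkeley.edu/

    The installation requires the tar.gz archive of each package to be downloaded.

    If the downloaded files are in Downloads, change to that directory before running the commands.

    The archives can be unpacked with tar, but this is optional: R CMD INSTALL also accepts a tar.gz archive directly.

  • Then run the R CMD INSTALL command with sudo privileges for each of the following packages, in the order listed so that dependencies are installed before rmr2: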

    Rcpp package

    RJSONIO package

    digest package

    functional package

    stringr package

    plyr package

    bitops package

    reshape2 package

    rmr2 package
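    The install sequence above can be sketched as a dry run that prints one command per downloaded archive. The guide does not give the archive versions, so the sketch matches any file named <package>_*.tar.gz; the function name install_cmds and the Downloads location are assumptions.

```shell
#!/bin/sh
# Packages in the order the guide lists them; rmr2 comes last.
PKGS="Rcpp RJSONIO digest functional stringr plyr bitops reshape2 rmr2"

# Hypothetical helper: prints an install command for every matching archive
# found in directory $1. Nothing is installed by this function itself.
install_cmds() {
    dir=$1
    for pkg in $PKGS; do
        for tarball in "$dir/${pkg}"_*.tar.gz; do
            # An unmatched glob stays literal, so check the file really exists.
            if [ -f "$tarball" ]; then
                echo "sudo R CMD INSTALL '$tarball'"
            fi
        done
    done
}

install_cmds "${HOME:-.}/Downloads"
```

    Once the printed commands look right, they can be run one by one or piped to sh.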

  • Before installing the rJava package, update R's Java configuration so that it uses the Oracle JDK:

    sudo JAVA_HOME=/usr/lib/jvm/java-6-oracle/jre R CMD javareconf

  • rJava package

    sudo R CMD INSTALL rJava_0.9-3.tar.gz

    rhdfs package (set HADOOP_CMD to the hadoop binary of your own installation)

    sudo HADOOP_CMD=/home/istvan/hadoop/bin/hadoop R CMD INSTALL rhdfs


  • Make sure that all of the packages above are installed.

    Getting started with RHadoop

    In principle, an RHadoop MapReduce job is similar to R's lapply function, which applies a function over a list or vector.

    Without the mapreduce function, we could write simple R code to double all the numbers from 1 to 100:

    > ints = 1:100
    > doubleInts = sapply(ints, function(x) 2*x)
    > head(doubleInts)
    [1] 2 4 6 8 10 12

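    With the RHadoop rmr2 package, we can use the mapreduce function to implement the same calculation (see the doubleInts.R script):

  • # Point R at the local Hadoop installation (adjust the paths to your system)
    Sys.setenv(HADOOP_HOME="/home/vikas/hadoop")
    Sys.setenv(HADOOP_CMD="/home/vikas/hadoop/bin/hadoop")
    library(rmr2)
    library(rhdfs)
    # Write the input vector to HDFS
    ints = to.dfs(1:100)
    # Map-only job: pair each value with its double
    calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
    # Read the result back from HDFS
    from.dfs(calc)
    $val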