ITU Assignment


description

This file contains the Big Data course CS954 4 Big Data With SAS assignment questions. This is for course work. It covers R, SAS, Hadoop, Spark, Storm, Solr, and other Big Data techniques.

Transcript of ITU Assignment

CS 850 4 Big Data with SAS Assignment #2

1. Apache Spark: What is Scala programming? Why is it so important in Big Data? What is an RDD (Resilient Distributed Dataset) in the context of Apache Spark? Write a standalone application for linear regression. Repeat the steps in question 1 on the given training data set (lpsa.data). Explain in detail how you solved this problem, step by step, with screenshots. Also explain the challenges and pain points you experienced in running this exercise. Give a business case where you would use this type of analysis in the context of Apache Spark.

Refer to the following links for installation instructions, installation files, and the data file needed to run this example:

Instructions and Intro: https://www.dropbox.com/s/guu6pb80l13er52/ApacheSparkITU.pdf?dl=0
Installation and Data Files: https://www.dropbox.com/s/4aq0cpg9zgibk2v/SparkLabs.zip?dl=0

Template to run SCALA program in Spark Shell:

$ source ./setup.sh
$ spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m -i <Scalafile.scala>

Scala program to run regression in the Spark shell:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
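The parsing above splits each line of lpsa.data on a comma (label vs. feature string), then splits the features on spaces. The same logic in a plain-Python sketch, using a made-up line in that format:

```python
# Each lpsa.data line is "label,f1 f2 ... f8"; split on ',' then on spaces.
def parse_point(line):
    label_part, features_part = line.split(",")
    return float(label_part), [float(x) for x in features_part.split()]

# A made-up line in the same shape as lpsa.data (1 label, 8 features)
label, features = parse_point("-0.43,-1.63 -2.0 -1.86 -1.02 -0.52 -0.86 -1.04 -0.86")
```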

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)
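LinearRegressionWithSGD fits the weights by gradient descent on the squared error. A toy plain-Python sketch of that idea on a one-feature problem; the data, learning rate, and absence of an intercept are illustrative assumptions, not MLlib's defaults:

```python
# Toy gradient descent for least-squares y ~ w * x (one weight, no intercept),
# mirroring what LinearRegressionWithSGD does at cluster scale.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1)]   # (x, y), y roughly 2x

w = 0.0
lr = 0.02
for _ in range(100):   # numIterations = 100, as in the Scala code above
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad
```

After 100 iterations w converges to the least-squares slope for this data (about 2.01).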

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)
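The evaluation step pairs each label with its prediction and averages the squared differences. The same computation in a plain-Python sketch with made-up (label, prediction) pairs:

```python
# Training MSE from (label, prediction) pairs -- values are illustrative.
values_and_preds = [(2.5, 2.0), (0.5, 1.0), (3.0, 2.5)]

mse = sum((v - p) ** 2 for v, p in values_and_preds) / len(values_and_preds)
print("training Mean Squared Error =", mse)   # 0.25 here
```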

// Save and load model
model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")

2. Apache Kafka: Run the following exercise in Apache Kafka. Explain: What is Kafka? What is it used for? What are its pros and cons? Use case: Explain step by step how you ran the exercise and the pain points and challenges you faced. Also explain the output and give business context on how you would use it in a real-life business, what the Truck Event exercise is doing, and why we use Kafka for this purpose.

Refer to the following linked file for installation and other details:
https://www.dropbox.com/s/7qd6pr9xixaoaug/Installation_Exercises_NetappD4.pptx?dl=0

Make sure you have installed Kafka following the steps in the above link. Then run the code step by step to get the results.

Make sure Kafka is running:
$ jps

Start the Truck Event topic:
$ sh kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic truckevent

List all active Kafka topics:
$ sh kafka-topics.sh --list --zookeeper localhost:2181

Open another terminal:
$ sudo mkdir /opt/TruckEvents
$ cd /opt/TruckEvents
$ wget http://hortonassets.s3.amazonaws.com/mda/Tutorials-master.zip
$ unzip Tutorials-master.zip
$ cd Tutorials-master
$ wget http://apache.arvixe.com/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
$ tar xvf apache-maven-3.2.5-bin.tar.gz
$ sudo mv apache-maven-3.2.5 /usr/local/
$ export PATH=/usr/local/apache-maven-3.2.5/bin:$PATH
$ mvn -version
$ mvn clean package
$ cd target
$ java -cp Tutorial-1.0-SNAPSHOT.jar com.hortonworks.tutorials.tutorial1.TruckEventsProducer localhost:9092 localhost:2181 &

Listen from the consumer:
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic truckevent --from-beginning
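Conceptually, the truckevent topic is an append-only log: the producer appends events at one end while each consumer reads forward from its own offset (--from-beginning means offset 0). A toy plain-Python sketch of that model, with hypothetical event strings and no real Kafka client:

```python
# Toy model of a Kafka topic partition: an append-only log plus consumer offsets.
# Event strings are hypothetical; a real client speaks the Kafka protocol.
class TopicPartition:
    def __init__(self):
        self.log = []                  # the append-only event log

    def produce(self, event):
        self.log.append(event)         # producers only ever append

    def consume(self, offset):
        return self.log[offset:]       # consumers read forward from an offset

truckevent = TopicPartition()
truckevent.produce("truck_1|normal")
truckevent.produce("truck_2|overspeed")

events = truckevent.consume(0)         # --from-beginning: start at offset 0
```

Because the log is never mutated in place, many consumers can replay the same truck events independently, which is why Kafka suits this streaming exercise.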

2. SAS: What are PROC and DATA steps in SAS? Give examples of some of the most-used SAS PROCs and explain their use in a business context. Run the following SAS code for multiple linear regression to get multicollinearity and influence statistics (from the SAS manual) in the SAS virtual machine, and explain the results in detail in a statistical context.

options linesize=80;
data fitness;
   input age weight oxy runtime rstpulse runpulse maxpulse;
   cards;
44 89.47 44.609 11.37 62 178 182
40 75.07 45.313 10.07 62 185 185
44 85.84 54.297  8.65 45 156 168
42 68.15 59.571  8.17 40 166 172
38 89.02 49.874  9.22 55 178 180
47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180
43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176
38 81.87 60.055  8.63 48 170 186
44 73.03 50.541 10.13 45 168 168
45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176
47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170
49 81.42 49.156  8.95 44 180 185
51 69.63 40.836 10.95 57 168 172
51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164
49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176
54 79.38 46.080 11.17 62 156 165
52 76.32 45.441  9.63 48 164 166
50 70.87 54.625  8.92 48 146 155
51 67.25 45.118 11.08 48 172 172
54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188
57 59.08 50.545  9.93 49 148 155
49 76.32 48.673  9.40 56 186 188
48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
title 'SAS Fitness data';
title2 'Example of multicollinearity and influence diagnostics';
proc reg data=fitness simple corr;
   FULL: model oxy=runtime age weight runpulse maxpulse rstpulse
         / stb collin tol vif corrb influence;
   plot student.*predicted.;
run;

3. What is cluster analysis? Explain in detail the different types of clustering methods, their business use cases, and their pros and cons. Run a cluster analysis in SAS using the following code and explain the results:

/* file: mammalsteeth.sas
   Example of cluster analysis taken from Example 4 of the SAS
   documentation for PROC CLUSTER */
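Among the diagnostics requested above, the vif option reports the variance inflation factor, VIF = 1/(1 - R²), where R² comes from regressing one predictor on all the others. A small plain-Python sketch of the formula; the R² values are hypothetical, not taken from the fitness data:

```python
# Variance inflation factor: VIF = 1 / (1 - R^2), where R^2 is from
# regressing one predictor on the remaining predictors.
def vif(r_squared):
    return 1.0 / (1.0 - r_squared)

low = vif(0.20)    # about 1.25: little collinearity
high = vif(0.95)   # about 20: strong collinearity (VIF > 10 is a common flag)
```

In this data set runpulse and maxpulse are strongly correlated, so you should expect their VIFs to be the large ones in the PROC REG output.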

options nocenter nodate pageno=1 linesize=132;
title  h=1 j=l 'File: cluster.mammalsteeth.sas';
title2 h=1 j=l 'Cluster Analysis of Mammals'' teeth data';
data teeth;
   input mammal $ 1-16 @21 (v1-v8) (1.);
   label v1='Top incisors'
         v2='Bottom incisors'
         v3='Top canines'
         v4='Bottom canines'
         v5='Top premolars'
         v6='Bottom premolars'
         v7='Top molars'
         v8='Bottom molars';
   cards;
BROWN BAT           23113333
MOLE                32103333
SILVER HAIR BAT     23112333
PIGMY BAT           23112233
HOUSE BAT           23111233
RED BAT             13112233
PIKA                21002233
RABBIT              21003233
BEAVER              11002133
GROUNDHOG           11002133
GRAY SQUIRREL       11001133
HOUSE MOUSE         11000033
PORCUPINE           11001133
WOLF                33114423
BEAR                33114423
RACCOON             33114432
MARTEN              33114412
WEASEL              33113312
WOLVERINE           33114412
BADGER              33113312
RIVER OTTER         33114312
SEA OTTER           32113312
JAGUAR              33113211
COUGAR              33113211
FUR SEAL            32114411
SEA LION            32114411
GREY SEAL           32113322
ELEPHANT SEAL       21114411
REINDEER            04103333
ELK                 04103333
DEER                04003333
MOOSE               04003333
;

/* principal components analysis of teeth;
   here we score the principal components and output them to data set teeth2 */
proc princomp data=teeth out=teeth2;
   var v1-v8;
run;

/* average linkage cluster analysis;
   a dendrogram (tree diagram) is also output */
proc cluster data=teeth2 method=average outtree=ttree ccc pseudo rsquare;
   var v1-v8;
   id mammal;
run;
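method=average means that at each step the two clusters with the smallest average pairwise distance between their members are merged. A compact plain-Python sketch of average-linkage agglomeration on made-up 1-D values (the real PROC CLUSTER run works on the eight tooth counts):

```python
# Average-linkage agglomerative clustering on made-up 1-D values.
def avg_dist(a, b):
    # mean pairwise distance between the members of clusters a and b
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

def average_linkage(points, n_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # merge the pair of clusters with the smallest average distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

clusters = average_linkage([1.0, 1.2, 5.0, 5.2, 9.0], 3)
# -> [[1.0, 1.2], [5.0, 5.2], [9.0]]
```

The merge history is what PROC TREE later draws as a dendrogram.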

/* --- PROC TREE prints the tree diagram;
   here we also output a data set, called ttree2,
   that contains four clusters --- */
proc tree data=ttree out=ttree2 nclusters=4;
   id mammal;
run;

/* --- the next set of statements sort the data sets by the variable mammal
   and then merge the tree data set (with the cluster scores)
   with the teeth data set (with the principal components) --- */
proc sort data=teeth2; by mammal; run;
proc sort data=ttree2; by mammal; run;
data teeth3;
   merge teeth2 ttree2;
   by mammal;
run;

/* --- stuff for plotting --- */
symbol1 c=black f=, v='1';
symbol2 c=black f=, v='2';
symbol3 c=black f=, v='3';
symbol4 c=black f=, v='4';

proc gplot;
   plot prin2*prin1=cluster;
run;

proc sort; by cluster; run;

proc print;
   by cluster;
   var mammal prin1 prin2;
run;

4. R programming, Text Processing: Problem statement: Budweiser wants to analyze the responses posted by people on Twitter to its Super Bowl commercial. It is a humongous task for them to go through all the tweets. Complete the following objectives.

Objective 1: Group tweets into 10 categories (based on content) using K-Means clustering.
Objective 2: Using K-Means clustering, find tweets which reference the words "Clydesdale" and "Budweiser".
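K-Means assigns each tweet vector to its nearest centroid, then recomputes each centroid as the mean of its members, repeating until stable. A toy plain-Python sketch on made-up tweets turned into bag-of-words vectors (an actual solution would work in R on the supplied Tweets.csv, with k = 10 rather than 2):

```python
# Toy K-Means over bag-of-words vectors, stdlib only (tweets are invented).
tweets = [
    "budweiser clydesdale ad",
    "superbowl halftime show",
    "love the budweiser clydesdale",
    "great superbowl game",
]

vocab = sorted({w for t in tweets for w in t.split()})

def vectorize(text):
    words = text.split()
    return [words.count(w) for w in vocab]   # word-count vector over the vocabulary

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vecs, k, iters=10):
    centroids = [list(v) for v in vecs[:k]]  # deterministic init: first k points
    labels = [0] * len(vecs)
    for _ in range(iters):
        # assignment step: nearest centroid for each vector
        labels = [min(range(k), key=lambda c: dist2(v, centroids[c])) for v in vecs]
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(vecs, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

labels = kmeans([vectorize(t) for t in tweets], 2)
# tweets 0 and 2 (Budweiser/Clydesdale) land together, as do 1 and 3
```

Objective 2 then reduces to inspecting the cluster whose members contain "Clydesdale" and "Budweiser".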

Data for this task is uploaded at the following location:
https://www.dropbox.com/s/6uevdeygb92vzvw/Tweets.csv?dl=0

5. Descriptive Questions on NoSQL and Big Data Science
a. Please explain the difference between NoSQL and RDBMS.
b. What is the CAP theorem? What are ACID and BASE in terms of databases?
c. What is Cassandra? Explain the Cassandra data model.
d. What is polyglot persistence in terms of NoSQL and databases?
e. What is MongoDB? Where would you use MongoDB? Give an example of MongoDB.
f. What is a graph database? Explain a use case.
g. What is R programming? Give background and a use case.
h. What is machine learning? Give examples of machine learning algorithms. What are supervised and unsupervised algorithms? Give examples of each.

6. Group Assignment Q1: R-Hadoop
Use the Big Data Science Virtual Machine at the following link:
http://goo.gl/gAqdf4

R packages needed to install and set up RHadoop on CentOS:
https://goo.gl/PDN3Jx

Set up R-Hadoop using the instructions in the attached file:
https://www.dropbox.com/s/kh2aa8fejbdrdhx/Team%20Assignment%20R%20%E2%80%93Hadoop%20Installation%20Instructions.docx?dl=0

Once it is installed and tested as instructed in the above document, please run the following to do linear regression using rhadoop and rmapreduce.

Run the following code and explain in detail what you understood from this exercise. Explain the process, the meaning of the packages you installed, why we use Hadoop with R, and how you ran the following code in R-Hadoop, along with the challenges and pain points you experienced in executing it.

Regression in R without Hadoop MapReduce

# Defining data variables
X = matrix(rnorm(2000), ncol = 10)
y = as.matrix(rnorm(200))

# Bundling data variables into a dataframe
train_data
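For reference, the coefficients that an lm-style regression estimates solve the normal equations (XᵀX)β = Xᵀy. A plain-Python sketch on a tiny exact dataset (not the random matrices from the R snippet above), solving the 2x2 system by elimination:

```python
# Ordinary least squares via the normal equations (X'X) b = X'y,
# on a tiny exact dataset generated by y = 2 + 3x (toy values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]

X = [[1.0, x] for x in xs]             # design matrix: intercept column + x

n = len(X)
xtx = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(2)]
       for i in range(2)]              # X'X, a 2x2 matrix
xty = [sum(X[r][i] * ys[r] for r in range(n)) for i in range(2)]   # X'y

# Solve the 2x2 system by Gaussian elimination
m = xtx[1][0] / xtx[0][0]
beta1 = (xty[1] - m * xty[0]) / (xtx[1][1] - m * xtx[0][1])
beta0 = (xty[0] - xtx[0][1] * beta1) / xtx[0][0]
# beta0 == 2.0, beta1 == 3.0: the generating intercept and slope
```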