A Step to programming with Apache Spark



Rahul Kumar, Trainee - Software Consultant, Knoldus Software LLP

Building Spark :

1. Pre-built Spark: http://mirror.fibergrid.in/apache/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz

2. Source code: http://mirror.fibergrid.in/apache/spark/spark-1.6.2/spark-1.6.2.tgz

To build from source, go to the SPARK_HOME directory and execute: mvn -Dhadoop.version=1.2.1 -Phadoop-1 -DskipTests clean package

To start Spark, go to SPARK_HOME/bin and execute ./spark-shell

The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.

Spark is not a modified version of Hadoop because it has its own cluster management.

Spark uses Hadoop in two ways: one is storage and the second is processing. Since Spark has its own cluster-management computation, it uses Hadoop for storage purposes only.

Spark Features :

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
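To make the driver-program idea concrete, here is a minimal sketch of a self-contained Spark 1.6 application; the application name, master URL, and sample data are placeholder values chosen for this illustration, not part of the original slides.

import org.apache.spark.{SparkConf, SparkContext}

object DemoDriver {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkContext, which coordinates the
    // executor processes running on the cluster (here, a local 2-thread master).
    val conf = new SparkConf().setAppName("DemoDriver").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Work is expressed as RDD operations and scheduled by the SparkContext.
    val counts = sc.parallelize(Seq("a", "b", "a")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)

    sc.stop()
  }
}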

A Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark.

It is an immutable distributed collection of objects.

RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system.

e.g.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
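Because RDDs can also hold user-defined classes (see the point above), a spark-shell snippet like the following works as well; the Employee case class and its values are made up purely for this illustration.

// Hypothetical case class, defined only for this example
case class Employee(name: String, age: Int)

val employees = Seq(Employee("Asha", 30), Employee("Ravi", 25))
val empRDD = sc.parallelize(employees)        // RDD[Employee]

// RDD operations work on the user-defined type just as they do on primitives
empRDD.filter(_.age > 28).collect()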

RDD :

[Diagram: RDD (Spark) compared with HDFS (Hadoop)]

RDDs support two types of operations:

Transformations, which create a new dataset from an existing one, and

Actions, which return a value to the driver program after running a computation on the dataset.

For example,

map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.

reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.

All transformations in Spark are lazy, in that they do not compute their results right away; the results are computed only when an action requires a value to be returned to the driver program.
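To make the transformation/action distinction and the lazy evaluation concrete, here is a small sketch in the spirit of the examples above; it assumes a data.txt file exists in the working directory.

val lines = sc.textFile("data.txt")                     // assumption: data.txt is present
val lineLengths = lines.map(line => line.length)        // transformation: lazy, nothing computed yet
val totalLength = lineLengths.reduce((a, b) => a + b)   // action: triggers the computation and returns a value to the driver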

RDD :

A DataFrame is equivalent to a relational table in Spark SQL.

DataFrame :

Steps to create DataFrame :

Create SparkContext object :

val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")

val sc = new SparkContext(conf)

Create SQLContext object :

val sqlContext = new SQLContext(sc)

Read Data From Files :

val df = sqlContext.read.json("src/main/scala/emp.json")

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.

DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is merely a Resilient Distributed Dataset, more of a black box of data that cannot be optimized in the same way, because the operations that can be performed against it are not as constrained.

However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
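A short sketch of these conversions, assuming the df DataFrame created from emp.json earlier and a spark-shell session where sc and sqlContext already exist:

import sqlContext.implicits._        // brings toDF into scope in Spark 1.6

// DataFrame -> RDD: each row becomes an org.apache.spark.sql.Row
val rowRDD = df.rdd

// RDD -> DataFrame: works when the RDD has a tabular shape, e.g. tuples or case classes
val pairs = sc.parallelize(Seq(("dev", 3), ("qa", 2)))
val pairsDF = pairs.toDF("dept", "headcount")
pairsDF.show()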

DataFrame and RDD :

DataFrame Transformations :

def orderBy(sortExprs: Column*): DataFrame

def select(cols: Column*): DataFrame

def filter(conditionExpr: String): DataFrame

def groupBy(cols: Column*): GroupedData

DataFrame Actions :

def show(): Unit

def collect(): Array[Row]

def collectAsList(): List[Row]

def count(): Long

def head(): Row

def head(n: Int): Array[Row]
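As a small illustration of a few of these methods; the column names (name, salary) are assumptions about the emp.json file used earlier, not something stated in the slides.

// Transformations build new DataFrames; show(), count() and head() are actions
df.select(df("name"), df("salary")).show()
df.filter("salary > 50000").orderBy(df("salary")).show()
df.groupBy(df("name")).count().show()

val rows: Long = df.count()     // number of rows
val first = df.head()           // first Row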

Hive is a data warehouse infrastructure tool to process structured data in Hadoop.

It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

It stores the schema in a database and the processed data in HDFS.

It provides an SQL-type language for querying, called HiveQL or HQL.

It is designed for OLAP.

Hive :

Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext.

Using HiveContext, you can create and find tables in the HiveMetaStore and write queries on it using HiveQL.

Users who do not have an existing Hive deployment can still create a HiveContext.

When not configured by hive-site.xml, the context automatically creates a metastore called metastore_db and a folder called warehouse in the current directory.
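A minimal sketch of using HiveContext without an existing Hive deployment; the employee table and its columns are made up for this example.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// With no hive-site.xml on the classpath, Spark creates metastore_db and a
// warehouse folder in the current working directory.
hiveContext.sql("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING)")
hiveContext.sql("SHOW TABLES").show()
hiveContext.sql("SELECT * FROM employee").show()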

Spark-Hive :

Spark SQL supports queries written using HiveQL.

It is a SQL-like language that produces queries that are converted to Spark jobs.

HiveQL is more mature and supports more complex queries than Spark SQL.

Spark-Hive : (continued)

To construct a Spark SQL query,

1) first create a SQLContext instance,

val sqlContext = new SQLContext(sc)

2) submit the queries by calling the sql method on the SQLContext instance.

val res = sqlContext.sql("select * from employee")

To construct a HiveQL query,

1) first create a HiveContext instance,

val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

2) submit the queries by calling the sql method on the HiveContext instance.

val res = hiveContext.sql("select * from employee")
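The employee table in these queries has to exist already (for example in the Hive metastore). A common alternative in Spark 1.6, sketched here under the assumption that the emp.json file from earlier is available, is to register a DataFrame as a temporary table and query it the same way:

val df = sqlContext.read.json("src/main/scala/emp.json")

// Makes the DataFrame visible to SQL queries under the name "employee"
df.registerTempTable("employee")

val res = sqlContext.sql("select * from employee")
res.show()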

References

http://spark.apache.org/docs/latest/sql-programming-guide.html

http://www.tutorialspoint.com/spark_sql/spark_introduction.htm

https://cwiki.apache.org