Apache Spark part of Eindhoven Java Meetup

Apache Spark: An Engine for Large-Scale Data Processing

Introducing myself…

• Mylène Reiners

• Architect @ Atos

• Focus: innovation

Sketching the context

• Big Data

• New insights

• Analytics

• Data discovery

Sketching the context

• Hadoop

• Storing and managing data

Apache Spark

• Speed

• General purpose

Short demo in Scala (shell)

• Simple data analysis

• Read "README.md"

• Count the number of lines

Role of SparkContext (sc)

RDD

• Resilient Distributed Dataset

• Creation

• Transformations

• Actions

RDD

• Lazy

• Recomputed
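The two properties above can be illustrated without Spark. The following is a minimal sketch in plain Java, assuming a toy `LazyDataset` class: the class, its methods, and the sample lines are hypothetical and are not part of the Spark API. Transformations only extend a recipe, an action runs the recipe from the source, and every action recomputes unless the result is cached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Toy model of a lazy, recomputed dataset. This is NOT the Spark API;
// all class and method names here are hypothetical.
class LazyDataset<T> {
    // The "lineage": a recipe that rebuilds the elements from the source.
    private final Supplier<List<T>> recipe;

    private LazyDataset(Supplier<List<T>> recipe) {
        this.recipe = recipe;
    }

    // Creation: remember where the data comes from, but read nothing yet.
    static <T> LazyDataset<T> fromSource(Supplier<List<T>> source) {
        return new LazyDataset<>(source);
    }

    // Transformation: returns a new dataset with a longer recipe, does no work.
    LazyDataset<T> filter(Predicate<T> keep) {
        Supplier<List<T>> parent = recipe;
        return new LazyDataset<>(() -> {
            List<T> out = new ArrayList<>();
            for (T item : parent.get()) {
                if (keep.test(item)) {
                    out.add(item);
                }
            }
            return out;
        });
    }

    // Action: runs the whole recipe, starting at the source, every time.
    long count() {
        return recipe.get().size();
    }

    // Keeps the result of the first computation so later actions reuse it.
    LazyDataset<T> cache() {
        Supplier<List<T>> parent = recipe;
        List<List<T>> holder = new ArrayList<>();
        return new LazyDataset<>(() -> {
            if (holder.isEmpty()) {
                holder.add(parent.get());
            }
            return holder.get(0);
        });
    }
}

class LazyDatasetDemo {
    // Counts how often the source is actually read.
    static final AtomicInteger sourceReads = new AtomicInteger();

    static List<String> readLines() {
        sourceReads.incrementAndGet();
        return Arrays.asList("# Apache Spark", "", "Spark is a fast engine");
    }

    public static void main(String[] args) {
        LazyDataset<String> nonBlank = LazyDataset
            .<String>fromSource(LazyDatasetDemo::readLines)
            .filter(line -> !line.isEmpty());
        System.out.println("reads after transformations: " + sourceReads.get());
        System.out.println("non-blank lines: " + nonBlank.count());
        nonBlank.count();
        System.out.println("reads after two actions: " + sourceReads.get());
    }
}
```

In this sketch the source is read once per action, which mirrors why an RDD that is used several times is usually worth caching.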

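Java example (accumulator)

    import java.util.Arrays;
    import org.apache.spark.Accumulator;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    // sc is a JavaSparkContext (Spark 1.x Java API)
    JavaRDD<String> rdd = sc.textFile(args[1]);
    // Counts blank lines as a side effect of the transformation below
    final Accumulator<Integer> blankLines = sc.accumulator(0);
    JavaRDD<String> callSigns = rdd.flatMap(
        new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) {
                if (line.equals("")) {
                    blankLines.add(1);
                }
                return Arrays.asList(line.split(" "));
            }
        });
    // The action triggers the computation; read blankLines.value() only after it
    callSigns.saveAsTextFile("output.txt");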

Apache Spark stack

Spark SQL

• Interface for working with (semi-)structured data

Hive example

    // Import Spark SQL
    import org.apache.spark.sql.hive.HiveContext;
    // Or, if you can't have the Hive dependencies
    import org.apache.spark.sql.SQLContext;
    // Import SchemaRDD and Row
    import org.apache.spark.sql.SchemaRDD;
    import org.apache.spark.sql.Row;

    (...)

    JavaSparkContext ctx = new JavaSparkContext(...);
    SQLContext hiveCtx = new HiveContext(ctx);

Hive example (cont'd)

    SchemaRDD input = hiveCtx.jsonFile(inputFile);
    // Register the input schema RDD
    input.registerTempTable("tweets");
    // Select tweets based on the retweetCount
    SchemaRDD topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10");

Spark Streaming

• Acting on data as soon as it arrives

• DStreams

Example

    // Create a StreamingContext with a 1-second batch size from a SparkConf
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    // Create a DStream from all the input on port 7777
    JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
    // Filter our DStream for lines with "error"
    JavaDStream<String> errorLines = lines.filter(new Function<String, Boolean>() {
        public Boolean call(String line) {
            return line.contains("error");
        }
    });
    // Print out the lines with errors
    errorLines.print();

Example (cont'd)

    // Start our streaming context and wait for it to "finish"
    jssc.start();
    // Wait for the job to finish
    jssc.awaitTermination();
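A DStream is processed as a sequence of small batches, one per batch interval (1 second in the example above). The following plain-Java sketch illustrates that micro-batch idea only; `MicroBatchSketch`, `Event`, and the sample data are hypothetical and do not use Spark. It assigns each line to a batch by arrival time and then applies the same "error" filter to every batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of micro-batching. NOT the Spark Streaming API;
// all names here are hypothetical.
class MicroBatchSketch {
    // A line of text together with its arrival time in milliseconds.
    static final class Event {
        final long arrivalMillis;
        final String line;

        Event(long arrivalMillis, String line) {
            this.arrivalMillis = arrivalMillis;
            this.line = line;
        }
    }

    // Groups events into consecutive batches of batchMillis, keyed by batch index.
    // Intervals without any event get no entry in this simplified model.
    static Map<Long, List<String>> toBatches(List<Event> events, long batchMillis) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            long batchIndex = e.arrivalMillis / batchMillis;
            batches.computeIfAbsent(batchIndex, k -> new ArrayList<>()).add(e.line);
        }
        return batches;
    }

    // Applies the same filter to every batch, keeping the batch structure.
    static Map<Long, List<String>> errorLines(Map<Long, List<String>> batches) {
        Map<Long, List<String>> result = new TreeMap<>();
        for (Map.Entry<Long, List<String>> entry : batches.entrySet()) {
            List<String> kept = new ArrayList<>();
            for (String line : entry.getValue()) {
                if (line.contains("error")) {
                    kept.add(line);
                }
            }
            result.put(entry.getKey(), kept);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(100, "ok"));
        events.add(new Event(900, "error: disk"));
        events.add(new Event(1500, "error: net"));
        events.add(new Event(2100, "fine"));

        // 1-second batches, as in the streaming example.
        Map<Long, List<String>> errors = errorLines(toBatches(events, 1000));
        for (Map.Entry<Long, List<String>> entry : errors.entrySet()) {
            System.out.println("batch " + entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

The sketch handles a finite list of events; a real stream never ends, which is why the streaming context has to be started and awaited explicitly.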

GraphX

• Graph processing

Example

    // Load the edges as a graph
    val graph = GraphLoader.edgeListFile(sc, "followers.txt")
    // Run PageRank
    val ranks = graph.pageRank(0.0001).vertices
    // Join the ranks with the usernames
    val users = sc.textFile("users.txt").map { line =>
      val fields = line.split(",")
      (fields(0).toLong, fields(1))
    }

Example (cont'd)

    val ranksByUsername = users.join(ranks).map {
      case (id, (username, rank)) => (username, rank)
    }
    // Print the result
    println(ranksByUsername.collect().mkString("\n"))
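To show what a PageRank run with a tolerance computes, here is a minimal single-machine sketch in plain Java. It is a textbook, normalized variant with a damping factor, not GraphX's implementation, and GraphX's rank values are scaled differently. `PageRankSketch`, its parameters, and the tiny follower graph are assumptions made for illustration.

```java
import java.util.Arrays;

// Textbook PageRank by repeated iteration. NOT the GraphX implementation;
// the class name, signature, and sample graph are hypothetical.
class PageRankSketch {
    // edges[i] = {source, destination}; vertices are numbered 0..n-1.
    // Iterates until the largest per-vertex change drops below tolerance.
    static double[] pageRank(int n, int[][] edges, double damping, double tolerance) {
        int[] outDegree = new int[n];
        for (int[] e : edges) {
            outDegree[e[0]]++;
        }

        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        while (true) {
            // Rank held by vertices without outgoing edges is spread evenly.
            double danglingMass = 0.0;
            for (int v = 0; v < n; v++) {
                if (outDegree[v] == 0) {
                    danglingMass += rank[v];
                }
            }

            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n + damping * danglingMass / n);
            // Each vertex passes its rank on in equal shares along its out-edges.
            for (int[] e : edges) {
                next[e[1]] += damping * rank[e[0]] / outDegree[e[0]];
            }

            double maxChange = 0.0;
            for (int v = 0; v < n; v++) {
                maxChange = Math.max(maxChange, Math.abs(next[v] - rank[v]));
            }
            rank = next;
            if (maxChange < tolerance) {
                return rank;
            }
        }
    }

    public static void main(String[] args) {
        // Users 1 and 2 follow user 0; user 0 follows user 1.
        int[][] followers = {{1, 0}, {2, 0}, {0, 1}};
        double[] ranks = pageRank(3, followers, 0.85, 0.0001);
        for (int v = 0; v < ranks.length; v++) {
            System.out.println("user " + v + ": " + ranks[v]);
        }
    }
}
```

The tolerance plays the same role as the `0.0001` argument in the Scala example: a smaller value means more iterations and more stable ranks.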

MLlib

• Machine learning

Thank you
