Apache Spark part of Eindhoven Java Meetup

Apache Spark AN ENGINE FOR LARGE-SCALE DATA PROCESSING

Page 1: Apache Spark part of Eindhoven Java Meetup

Apache Spark AN ENGINE FOR LARGE-SCALE DATA PROCESSING

Page 2

Introducing myself…

• Mylène Reiners

• Architect @ Atos

• Focus: innovation

Page 3

Sketching the context

• Big Data

• New insights

• Analytics

• Data discovery

Page 4

Sketching the context

• Hadoop

• Storing and managing data

Page 5

Apache Spark

• Speed

• General purpose

Page 6

Short demo in Scala (shell)

• Simple data analysis

• Read “README.md”

• Count the number of lines

Page 7

Role of SparkContext (sc)

Page 8

RDD

• Resilient Distributed Dataset

• Creation

• Transformations

• Actions

Page 9

RDD

• Lazy: transformations are not computed until an action is called

• Recomputed: by default an RDD is recomputed for every action, unless it is persisted
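
The two RDD properties on this slide can be tried out without a cluster. The following is a minimal sketch in plain Java, not Spark code: a Supplier of a java.util.stream.Stream stands in for an RDD's lineage, the stream's map step plays the role of a transformation, and sum/max play the role of actions. The class LazyRecomputeSketch, the method lineLengths and the counter mapCalls are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Stream;

class LazyRecomputeSketch {
    // Counts how often the transformation function actually runs.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // Hypothetical stand-in for an RDD's lineage: a recipe that rebuilds
    // the pipeline from the source data every time it is asked for.
    static Supplier<Stream<Integer>> lineLengths(List<String> lines) {
        return () -> lines.stream().map(line -> {
            mapCalls.incrementAndGet();
            return line.length();
        });
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "bb", "", "cccc");

        // "Transformation": only the recipe is defined, nothing runs yet.
        Supplier<Stream<Integer>> lengths = lineLengths(lines);
        System.out.println("map calls after defining: " + mapCalls.get());      // 0

        // First "action": the pipeline runs once over all four lines.
        int total = lengths.get().mapToInt(Integer::intValue).sum();
        System.out.println("total = " + total + ", map calls = " + mapCalls.get());     // 7, 4

        // Second "action": the whole pipeline is recomputed from the source.
        int longest = lengths.get().mapToInt(Integer::intValue).max().orElse(0);
        System.out.println("longest = " + longest + ", map calls = " + mapCalls.get()); // 4, 8
    }
}
```

In Spark itself, persisting the RDD (for example with cache()) is the usual way to avoid the second computation.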

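Page 10

Java example (accumulator)

// Spark 1.x Java API
import java.util.Arrays;
import org.apache.spark.Accumulator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

(...)

JavaRDD<String> rdd = sc.textFile(args[1]);
// Accumulator that counts the blank lines seen by the workers
final Accumulator<Integer> blankLines = sc.accumulator(0);
JavaRDD<String> callSigns = rdd.flatMap(
    new FlatMapFunction<String, String>() {
        public Iterable<String> call(String line) {
            if (line.equals("")) {
                blankLines.add(1);
            }
            return Arrays.asList(line.split(" "));
        }
    });
// saveAsTextFile is an action; the accumulator is only filled once it has run
callSigns.saveAsTextFile("output.txt");
System.out.println("Blank lines: " + blankLines.value());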

Page 11

Apache Spark stack

Page 12

Spark SQL

• Interface for working with (semi-)structured data

Page 13

Hive example

// Import Spark SQL
import org.apache.spark.sql.hive.HiveContext;
// Or if you can't have the Hive dependencies
import org.apache.spark.sql.SQLContext;
// Import the SchemaRDD and Row classes
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.Row;

(...)

JavaSparkContext ctx = new JavaSparkContext(...);
SQLContext hiveCtx = new HiveContext(ctx);

Page 14

Hive example (cont’d)

SchemaRDD input = hiveCtx.jsonFile(inputFile);
// Register the input schema RDD
input.registerTempTable("tweets");
// Select the 10 most retweeted tweets (DESC puts the highest retweetCount first)
SchemaRDD topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount DESC LIMIT 10");

Page 15

Spark Streaming

• Acting on data as soon as it arrives

• DStreams
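
A DStream (discretized stream) is processed as a sequence of small batches, one per batch interval, and the same operation is applied to every batch. The following is a minimal sketch of that idea in plain Java, not Spark Streaming code. The class MicroBatchSketch, its Event type and the methods toBatches and errorsPerBatch are hypothetical names chosen for this illustration, and the events carry made-up arrival times.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class MicroBatchSketch {
    // Hypothetical event: arrival time in milliseconds plus one line of text.
    static final class Event {
        final long timeMillis;
        final String line;

        Event(long timeMillis, String line) {
            this.timeMillis = timeMillis;
            this.line = line;
        }
    }

    // Discretize the stream: each event goes into the batch that covers
    // its arrival time (batch index = arrival time / batch interval).
    static Map<Long, List<String>> toBatches(List<Event> events, long batchMillis) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            batches.computeIfAbsent(e.timeMillis / batchMillis, k -> new ArrayList<>())
                   .add(e.line);
        }
        return batches;
    }

    // Apply the same filter to every batch independently.
    static Map<Long, List<String>> errorsPerBatch(Map<Long, List<String>> batches) {
        Map<Long, List<String>> result = new TreeMap<>();
        batches.forEach((batch, lines) -> result.put(batch,
                lines.stream().filter(l -> l.contains("error")).collect(Collectors.toList())));
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(100, "ok"),
                new Event(900, "error: disk"),
                new Event(1200, "error: net"),
                new Event(2500, "ok"));
        // 1-second batches, as in a streaming context with a 1-second batch size
        System.out.println(errorsPerBatch(toBatches(events, 1000)));
        // prints {0=[error: disk], 1=[error: net], 2=[]}
    }
}
```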

Page 16

Example

import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

(...)

// Create a StreamingContext with a 1-second batch size from a SparkConf
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
// Create a DStream from all the input on port 7777
JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
// Filter our DStream for lines with "error"
JavaDStream<String> errorLines = lines.filter(new Function<String, Boolean>() {
    public Boolean call(String line) {
        return line.contains("error");
    }
});
// Print out the lines with errors
errorLines.print();

Page 17

Example

// Start our streaming context and wait for it to "finish"
jssc.start();
// Wait for the job to finish
jssc.awaitTermination();

Page 18

GraphX

• Graph processing (graph-parallel computation)

Page 19

Example

import org.apache.spark.graphx.GraphLoader

// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "followers.txt")
// Run PageRank until it converges (tolerance 0.0001)
val ranks = graph.pageRank(0.0001).vertices
// Load the (id, username) pairs to join with the ranks
val users = sc.textFile("users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}

Page 20

Example

val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("\n"))
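
The pageRank(0.0001) call in the example keeps iterating until the ranks change by less than the given tolerance. As a rough single-machine sketch of that idea in plain Java (not GraphX's implementation, which is distributed and may scale the ranks differently), the following assumes a hypothetical class PageRankSketch, vertices numbered from 0, a damping factor of 0.85 and a small made-up edge list.

```java
import java.util.Arrays;

class PageRankSketch {
    // edges[i] = {source, destination}; vertices are numbered 0..n-1.
    // Iterates until the total change in rank drops below tol,
    // or until maxIter iterations have run.
    static double[] pageRank(int n, int[][] edges, double damping, double tol, int maxIter) {
        int[] outDegree = new int[n];
        for (int[] e : edges) {
            outDegree[e[0]]++;
        }

        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < maxIter; iter++) {
            // Rank held by vertices without outgoing edges is spread evenly.
            double dangling = 0.0;
            for (int v = 0; v < n; v++) {
                if (outDegree[v] == 0) {
                    dangling += rank[v];
                }
            }

            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n + damping * dangling / n);

            // Every vertex passes its rank on, split evenly over its out-edges.
            for (int[] e : edges) {
                next[e[1]] += damping * rank[e[0]] / outDegree[e[0]];
            }

            double delta = 0.0;
            for (int v = 0; v < n; v++) {
                delta += Math.abs(next[v] - rank[v]);
            }
            rank = next;
            if (delta < tol) {
                break;
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        // Made-up follower graph: 0 -> 2, 1 -> 2, 2 -> 0
        int[][] edges = {{0, 2}, {1, 2}, {2, 0}};
        double[] ranks = pageRank(3, edges, 0.85, 1e-6, 100);
        System.out.println(Arrays.toString(ranks));
    }
}
```

In this small graph vertex 2 is followed by both other vertices and ends up with the highest rank, while vertex 1, which nobody follows, ends up with the lowest.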

Page 21

MLlib

• Machine learning

Page 22

Thank you