Developing applications in the era of "Big Data" with Scala and Spark - Mario Cartia - Codemotion Milan...

Post on 16-Apr-2017


Developing applications in the era of "Big Data" with Scala and Spark

Mario Cartia

MILAN 25-26 NOVEMBER 2016

$ whoami
Mario Cartia
Chief System Engineer


Big Data: Risk

Big Data: Opportunity

Jonas Bonér

The Reactive Manifesto (2013)

The Reactive Manifesto

- Responsive
  o The system responds in a timely manner if at all possible
- Resilient
  o The system stays responsive in the face of failure
- Message Driven
  o Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling, isolation and location transparency
- Elastic
  o The system stays responsive under varying workload, reacting to changes in the input rate by increasing or decreasing the allocated resources

Bytecode

Interoperability

History

- The design of Scala started in 2001 at EPFL, Switzerland, by Martin Odersky
- First internal use in 2003 to teach the "Functional and Logic Programming" course
- Public announcement of Scala 1.0 in 2004
- Scala 2.0 was released in March 2006
- In May 2011 Odersky and Bonér launched Typesafe Inc. to provide commercial support and education for Scala (renamed Lightbend in February 2016)
- The latest version of Scala at the time of this talk is 2.12.0, released on 3 November 2016

Features

- Object Oriented
  o You can construct elegant class hierarchies for maximum code reuse and extensibility
- Functional
  o You can implement object behavior using higher-order functions

Features

- Object Oriented
  o In contrast to Java, all values in Scala are objects (including primitive types and functions)
  o Multiple inheritance using traits (mixin-based composition)
  o Statically typed
  o …
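The mixin-based composition listed above can be sketched as follows. This is a minimal illustration rather than code from the talk: the names `Greeter`, `Logger` and `Service` are hypothetical.

```scala
// Hypothetical traits used only to illustrate mixin composition.
trait Greeter {
  def name: String
  def greet: String = s"Hello, $name"
}

trait Logger {
  // Prefix a message with a tag.
  def log(msg: String): String = s"[log] $msg"
}

// A class can mix in several traits, which gives a form of multiple inheritance.
class Service(val name: String) extends Greeter with Logger

val service = new Service("Scala")
val greeting = service.log(service.greet)

// Numeric values behave like objects too: + is an ordinary method call.
val three = (1).+(2)
```

Both traits carry concrete method bodies, which is what distinguishes them from Java interfaces as they existed before default methods.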

Features

- Functional
  o Every function is a value
  o Lambda expressions
  o Immutable objects
  o Higher-order functions
  o Case classes with support for pattern matching to model algebraic types
  o …
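A short sketch may make the functional features above more concrete. The `Shape` hierarchy and the helper names are hypothetical, chosen only for illustration.

```scala
// Hypothetical algebraic data type: a shape is either a circle or a rectangle.
sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Rectangle(width: Double, height: Double) extends Shape

// Pattern matching deconstructs each case class.
def area(shape: Shape): Double = shape match {
  case Circle(r)       => math.Pi * r * r
  case Rectangle(w, h) => w * h
}

// Every function is a value: a lambda can be bound to a name...
val double: Int => Int = x => x * 2

// ...and passed to a higher-order function such as map.
val doubled = List(1, 2, 3).map(double)

// Immutable objects: prepending returns a new list and leaves doubled unchanged.
val extended = 0 :: doubled

val areas = List[Shape](Rectangle(2.0, 3.0), Rectangle(1.0, 1.0)).map(area)
```

Because `Shape` is sealed, the compiler can warn when a `match` forgets one of the cases.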

Features

- Other
  o Type inference
  o Infix notation
  o Parallel and concurrent programming
  o Actor model (Akka)
  o …
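Type inference and infix notation are easiest to see in a few lines of code. The sketch below uses only the standard library; the `Money` class is a hypothetical example.

```scala
// Type inference: no type annotations are needed here.
val numbers = List(1, 2, 3, 4)          // inferred as List[Int]
val label = "sum"                       // inferred as String

// Infix notation: a method taking one argument can be called
// without the dot and parentheses.
val hasThree = numbers contains 3       // same as numbers.contains(3)
val range = 1 to 4                      // same as 1.to(4)

// Hypothetical class showing that user-defined methods work the same way,
// including symbolic names such as +.
case class Money(cents: Long) {
  def +(other: Money): Money = Money(cents + other.cents)
}
val total = Money(150) + Money(250)
```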

Akka

- A free and open-source toolkit and runtime that simplifies the construction of concurrent and distributed applications on the JVM
- Supports multiple programming models for concurrency, but emphasizes actor-based concurrency, with inspiration drawn from Erlang
- Language bindings exist for both Java and Scala
- Akka is written in Scala and, as of Scala 2.10, the original scala.actors library is deprecated in favor of Akka's actor implementation
- Concurrency is message-based and asynchronous

O’REILLY 2016 European Software Development Salary Survey

Top Adopters

Useful Tools

- scala
- scalac
- scaladoc
- scalap

Each is similar to its Java counterpart (java, javac, javadoc, javap).

Useful Tools

- scala
  o With no arguments specified, a Scala shell (REPL) starts and reads commands interactively

    $ scala
    Welcome to Scala version 2.12.0
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala> val i = 2
    i: Int = 2

    scala>

Useful Tools

- Scala Build Tool (sbt) is an open source build tool for Scala projects, similar to Maven or Ant, with the following characteristics:
  o build descriptions written in Scala using a DSL
  o dependency management using Ivy (supports Maven-format repositories)
  o support for mixed Java/Scala projects
  o …
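As a rough idea of what such a build description looks like, here is a minimal `build.sbt` sketch. The project name and version numbers are assumptions made for illustration and do not come from the talk.

```scala
// build.sbt: a hypothetical build definition written in the sbt DSL.
name := "big-data-example"        // project name (assumption)
version := "0.1.0"

// Spark 2.0.x artifacts are published for Scala 2.11, so that line is used here.
scalaVersion := "2.11.8"

// Dependencies are resolved through Ivy from Maven-format repositories.
// %% appends the Scala binary version (_2.11) to the artifact name.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2" % "provided"
```

The `provided` scope assumes the cluster supplies Spark at run time; for a purely local experiment it can be dropped.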

Hello, World!

Hello, World! (REPL)
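The code on these two slides did not survive the transcript. A minimal sketch of what a Scala "Hello, World!" typically looks like follows; it is not a copy of the speaker's code, and the `greeting` helper is added only for illustration.

```scala
// A standalone program: the entry point is the main method of an object.
object HelloWorld {
  def greeting: String = "Hello, World!"

  def main(args: Array[String]): Unit = println(greeting)
}
```

Saved as `HelloWorld.scala`, this can be compiled with `scalac HelloWorld.scala` and run with `scala HelloWorld`. In the REPL, typing `println("Hello, World!")` at the `scala>` prompt is enough.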

Learning Resources


History

- Originally developed in 2009 at the University of California, Berkeley's AMPLab
- In 2013 its creators founded Databricks, a company that provides services and support for Spark
- First stable release (1.0) in May 2014

Features

- Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
- Provides programmers with an API centered on a data structure called the resilient distributed dataset (RDD)
- According to the Spark project, programs can run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk

Modules

- Spark SQL
  o Lets you query structured data inside Spark programs, using either SQL or an easy-to-use DataFrame API
  o Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs

Modules

- Spark MLlib
  o Contains many algorithms and utilities, including:
    • Classification
    • Regression
    • Clustering
    • Recommendation
    • Distributed linear algebra
    • Statistics
    • …

Modules

- Spark Streaming
  o Brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs
  o Recovers both lost work and operator state (e.g. sliding windows) out of the box, without any extra code on your part
  o Lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state
  o Can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ, or from custom data sources

Modules

- Spark GraphX
  o A collection of APIs for graphs and graph-parallel computation
  o Provides a variety of graph algorithms, such as:
    • PageRank
    • Connected components
    • Label propagation
    • SVD++
    • Strongly connected components
    • Triangle count

How it works?

- Spark features an advanced Directed Acyclic Graph (DAG) engine supporting acyclic data flow
- Each Spark job creates a DAG of task stages to be performed on the cluster
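Spark builds this DAG lazily: transformations only describe the work, and nothing runs until an action asks for a result. As a loose local analogy (plain Scala, not Spark code), a standard-library `Iterator` behaves in a similar way. The counter `evaluated` is a hypothetical helper used to make the laziness visible.

```scala
// Counts how many elements were actually processed.
var evaluated = 0

// "Transformations" on an iterator are lazy: nothing runs yet.
val pipeline = Iterator(1, 2, 3, 4, 5)
  .map { x => evaluated += 1; x * 10 }
  .filter(_ > 20)

val beforeAction = evaluated          // no element has been processed so far

// The "action" pulls data through the whole pipeline.
val result = pipeline.toList
val afterAction = evaluated           // every element has now been processed
```

The analogy stops at laziness: an iterator runs on one machine and keeps no lineage, whereas Spark splits the recorded DAG into stages and distributes their tasks across the cluster.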


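How it works?

    // sc is the SparkContext created by the driver program
    val textFile = sc.textFile("hdfs://...")       // RDD of lines
    val counts = textFile
      .flatMap(line => line.split(" "))            // transformation: lines to words
      .map(word => (word, 1))                      // transformation: word to (word, 1)
      .reduceByKey(_ + _)                          // transformation: sum counts per word

    counts.saveAsTextFile("hdfs://...")            // action: triggers the job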

Driver Program

RDD

SparkContext

Transformations

Action
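To try the same pipeline without a cluster, the word count can be approximated with plain Scala collections. This is a local analogy rather than Spark code: the input lines are made up, and `groupBy` followed by a sum stands in for `reduceByKey(_ + _)`.

```scala
// Plain Scala collections stand in for the RDD; no cluster is involved.
val lines = List("to be or not to be", "to do")   // hypothetical input

val counts: Map[String, Int] = lines
  .flatMap(line => line.split(" "))               // one element per word
  .map(word => (word, 1))                         // pair each word with 1
  .groupBy(_._1)                                  // group the pairs by word
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the 1s per word
```

Unlike the RDD version, this runs eagerly in a single JVM.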

How it works? (Spark UI)

Running modes

- Standalone
  o Spark provides a simple standalone deploy mode, convenient for testing
- YARN
  o Send jobs to a Hadoop cluster
- Mesos
  o Send jobs to the Apache Mesos distributed kernel
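Each running mode is selected through the master URL given to Spark. The sketch below models the three modes as a small algebraic type and derives the URL for each. The type names, the helper `masterUrl` and the host names are hypothetical; the ports are the usual defaults and may differ on a given cluster.

```scala
// Hypothetical model of the cluster managers listed above.
sealed trait ClusterManager
case class Standalone(host: String, port: Int = 7077) extends ClusterManager
case object Yarn extends ClusterManager
case class Mesos(host: String, port: Int = 5050) extends ClusterManager

// Builds the master URL that Spark expects for each cluster manager.
def masterUrl(manager: ClusterManager): String = manager match {
  case Standalone(host, port) => s"spark://$host:$port"
  case Yarn                   => "yarn"
  case Mesos(host, port)      => s"mesos://$host:$port"
}

val urls = List(Standalone("master.example"), Yarn, Mesos("mesos.example")).map(masterUrl)
```

The resulting string is what would typically be passed to `spark-submit --master` or to `SparkConf.setMaster`.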

Learning Resources


Codemotion Training courses

Hands-on learning paths, also available online

> WEB APP SECURITY

> WEB DEVELOPMENT

> IOT

> UX & UI

> BIG DATA

> MOBILE DEVELOPMENT

> LEGAL SOFTWARE DISCIPLINE

> FRONTEND DEVELOPMENT

Bootcamp "Developing Big Data Applications with Scala and Spark"

Where: Milan

When: 2 December 2016

Info: Codemotion desk

Next event!

Email: training@codemotion.it


Question Time!

Thanks!


Follow me!
https://twitter.com/mariocartia
https://it.linkedin.com/in/mariocartia

Email: mario@big-data.ninja

All pictures belong to their respective authors