Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing




  • Spark After Dark: Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations

    Barcelona Spark Meetup

    Oct 20th, 2015

    Chris Fregly, Principal Data Solutions Engineer

    IBM Spark Technology Center

    ** We're Hiring!! Nice People Only, Please. **

  • Who Am I?

    spark.tc | Power of data. Simplicity of design. Speed of innovation. | IBM Spark

    Streaming Data Engineer, Netflix Open Source Committer

    Data Solutions Engineer

    Apache Contributor

    Principal Data Solutions Engineer, IBM Spark Technology Center

    Meetup Organizer, Advanced Apache Spark Meetup

    Book Author, Advanced (2016)

  • Advanced Apache Spark Meetup

    Total Spark Experts: ~1350+ in 3 mos!

    #4 most active Spark Meetup in the world!

    Main Goals

    Dig deep into the Spark & extended-Spark codebase

    Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.

    Surface and share the patterns and idioms of these well-designed, distributed, big data components

  • What is Spark?

    Core

    Spark Streaming: real-time

    Spark SQL: structured data

    MLlib: machine learning

    GraphX: graph analytics

    BlinkDB: approx queries

  • Spark Deployments In Production

  • Tools of the Talk

    Redis

    Docker

    Cassandra

    MLlib, GraphX

    Parquet, JSON

    Apache Zeppelin

    Spark Streaming, Kafka

    Spark SQL, DataFrames

    Spark JDBC/ODBC Hive ThriftServer

    ElasticSearch, Logstash, Kibana (ELK)

  • SMACK Stack!

    Spark (Data Processing)

    Mesos (Cluster Manager)

    Akka (Actors)

    Cassandra (NoSQL)

    Kafka (Streaming)

  • Themes of this Talk

    Parallelism

    Performance

    Streaming

    Approximations

    Similarity Measures

    Recommendations

  • Goals of Spark After Dark

    Generate high-quality recommendations

    Demonstrate Spark high-level libraries

    Spark Streaming -> Kafka, Approximations

    Spark SQL -> DataFrames, Cassandra

    GraphX -> PageRank, Shortest Path

    MLlib -> Matrix Factorization, Word2Vec

  • Popular Dating Sites

  • Parallelism

  • My First Experience With Parallelism

    Brady Bunch circa 1980

    Season 5, Episode 18: Two Petes in a Pod

  • Parallel Algorithm: O(log n)

  • Non-Parallel Algorithm: O(n)
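
To make the O(n) versus O(log n) contrast concrete, here is a minimal Python sketch. It is not from the talk, and the function names are made up for illustration. It counts the dependent steps needed to sum a non-empty list one element at a time, and the rounds needed when every pair inside a round could be added by a separate worker.

```python
def sequential_sum(values):
    """Add values one after another: n - 1 dependent steps."""
    total, steps = values[0], 0
    for value in values[1:]:
        total += value
        steps += 1
    return total, steps


def tree_sum(values):
    """Add neighbouring pairs round by round.

    The additions inside one round are independent of each other, so with
    enough workers each round costs one step and about log2(n) rounds suffice.
    """
    level, rounds = list(values), 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element is carried into the next round
            paired.append(level[-1])
        level, rounds = paired, rounds + 1
    return level[0], rounds


numbers = list(range(1, 17))    # 16 values
print(sequential_sum(numbers))  # (136, 15)
print(tree_sum(numbers))        # (136, 4)
```

The sketch only counts rounds; it does not actually run the pairs on separate threads or machines.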

  • Spark is Parallel!

  • Performance

  • Spark Beats Hadoop @ 100 TB GraySort

    On-disk only

    28,000 partitions

    No in-memory caching

  • Improved Shuffle and Network Layer

    Sort-based shuffle

    Minimize OS resources

    Switched to async Netty

    Keep CPUs hot

    Reuse byte buffers to minimize GC

    Use epoll for I/O to stay in kernel space

  • Project Tungsten: CPU and Memory

    More JVM bytecode generation, JIT optimization

    CPU-cache-aware data structs and algos

    Custom memory management

    Serializers

    Performance

    New HashMap

  • DataFrames and Catalyst Optimizer

    https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

    Please Use DataFrames!

    JVM bytecode generation

  • Columnar Storage Format

    Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
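
The min-max idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Parquet's on-disk layout: `write_chunks` and `scan_range` are hypothetical helpers, and each "chunk" is just a dict holding its values plus the min and max recorded at write time.

```python
def write_chunks(values, chunk_size):
    """Split values into chunks and record min/max statistics for each one."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        block = values[start:start + chunk_size]
        chunks.append({"min": min(block), "max": max(block), "values": block})
    return chunks


def scan_range(chunks, low, high):
    """Return the values in [low, high] and how many chunks had to be read."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max"] < low or chunk["min"] > high:
            continue  # statistics prove nothing can match: skip the chunk
        chunks_read += 1
        hits.extend(v for v in chunk["values"] if low <= v <= high)
    return hits, chunks_read


sorted_chunks = write_chunks(list(range(1, 13)), chunk_size=4)
shuffled_chunks = write_chunks([1, 9, 5, 12, 2, 6, 10, 7, 3, 11, 4, 8], chunk_size=4)
print(scan_range(sorted_chunks, 6, 7))    # ([6, 7], 1)
print(scan_range(shuffled_chunks, 6, 7))  # ([6, 7], 3)
```

With sorted data the chunk ranges do not overlap, so two of the three chunks are skipped. With shuffled data every chunk's range covers the query, which is why the heuristic pays off mainly on sorted data.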

  • Parquet File Format

    Based on Google Dremel

    Implemented by Twitter and Cloudera

    Columnar storage format

    Optimized for fast columnar aggregations

    Tight compression

    Supports pushdowns

    Nested, self-describing, evolving schema

  • Types of Compression

    Run Length Encoding: Repeated data

    Dictionary Encoding: Fixed set of values

    Delta, Prefix Encoding: Sorted data
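
As a rough illustration of the three encodings named on this slide, here is a minimal Python sketch. The function names are hypothetical and the encodings are simplified; real columnar formats combine them with bit-packing and other tricks that are not shown here.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of repeated values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value by its index in a small dictionary of distinct values."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def delta_encode(sorted_values):
    """Keep the first value, then only the difference to the previous value."""
    if not sorted_values:
        return []
    deltas = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + deltas


print(run_length_encode(["es", "es", "es", "fr", "fr", "es"]))
# [('es', 3), ('fr', 2), ('es', 1)]
print(dictionary_encode(["es", "fr", "es", "de"]))
# (['de', 'es', 'fr'], [1, 2, 1, 0])
print(delta_encode([1000, 1003, 1004, 1010]))
# [1000, 3, 1, 6]
```

Run length encoding wins when values repeat back to back, dictionary encoding when there are few distinct values, and delta encoding when sorted neighbours are close together.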

  • Types of Query Optimizations

    Column, Partition Pruning

    Row, Predicate Pushdown

    SELECT b FROM table WHERE a IN (a2, a3)
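
The query on this slide can be traced through a toy columnar table in plain Python. Everything here is an assumption made for illustration: the `TABLE` layout, the `day=...` partition names, the extra column `c`, and the `select_where_in` helper. A real engine pushes the filter down into the storage reader; this sketch only mimics the effect.

```python
# Toy table stored column by column and split into partitions.
TABLE = {
    "day=1": {"a": ["a1", "a2"], "b": [10, 20], "c": ["x", "y"]},
    "day=2": {"a": ["a3", "a4"], "b": [30, 40], "c": ["z", "w"]},
}


def select_where_in(table, select, where, allowed, partitions=None):
    """Evaluate SELECT <select> FROM table WHERE <where> IN (<allowed>)."""
    columns_read, result = set(), []
    for name, columns in table.items():
        if partitions is not None and name not in partitions:
            continue  # partition pruning: the whole partition is skipped
        filter_column = columns[where]   # column pruning: only the two
        output_column = columns[select]  # referenced columns are touched
        columns_read.update([where, select])
        for key, value in zip(filter_column, output_column):
            if key in allowed:  # predicate applied while scanning
                result.append(value)
    return result, columns_read


rows, touched = select_where_in(TABLE, "b", "a", {"a2", "a3"})
print(rows)             # [20, 30]
print(sorted(touched))  # ['a', 'b']

rows, _ = select_where_in(TABLE, "b", "a", {"a2", "a3"}, partitions={"day=2"})
print(rows)             # [30]
```

Column `c` is never read, and restricting the scan to one partition avoids opening the other at all.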

  • Streaming

  • Direct Kafka Streaming

    KafkaRDD

    No single Receiver, no Write Ahead Log (WAL)

    Workers pull from Kafka in parallel

    Each KafkaRDD partition stores relevant offsets

    Upon Worker Node failure, rebuild from offsets

    Optimizes happy path by avoiding the WAL

    At least once delivery guarantee
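
The offset-based recovery described above can be modelled with a toy in-memory log. This is a sketch in plain Python, not the Spark or Kafka API: the `OffsetRange` dataclass here is a stand-in loosely modelled on the idea of a per-partition offset range, and `LOG`, `plan_batch`, and `compute_partition` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


# Toy stand-in for a Kafka topic: one append-only list per partition.
LOG = {
    ("events", 0): ["a0", "a1", "a2", "a3"],
    ("events", 1): ["b0", "b1", "b2"],
}


def plan_batch(log, processed):
    """Driver side: one offset range per Kafka partition, no receiver involved."""
    return [
        OffsetRange(topic, partition, processed.get((topic, partition), 0), len(messages))
        for (topic, partition), messages in sorted(log.items())
    ]


def compute_partition(log, offsets):
    """Worker side: pull exactly the messages named by the offset range."""
    messages = log[(offsets.topic, offsets.partition)]
    return messages[offsets.from_offset:offsets.until_offset]


batch = plan_batch(LOG, processed={("events", 0): 1})
first_attempt = [compute_partition(LOG, r) for r in batch]
# Simulated worker failure: the same ranges are simply read again from the log.
replayed = [compute_partition(LOG, r) for r in batch]
print(first_attempt)              # [['a1', 'a2', 'a3'], ['b0', 'b1', 'b2']]
print(replayed == first_attempt)  # True
```

Because the log itself is the durable copy, nothing has to be written to a WAL on the happy path. A replayed range can emit records that were already output once, which is where the at-least-once guarantee comes from.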

  • Approximations

  • Count Min Sketch

    Approximate counters

    Better than HashMap

    Low, fixed memory

    Known error bounds

    Large num of counters

    From Twitter's Algebird

    Streaming example in Spark codebase
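
For readers who have not seen the data structure, here is a from-scratch Count-Min Sketch in Python. It is a minimal sketch, not Algebird's implementation: the class, its default sizes, and the use of salted SHA-256 as the row hash are all assumptions made for illustration. In the textbook analysis a width near e/eps and a depth near ln(1/delta) keep the overcount below eps times the total count with probability 1 - delta.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, simulated by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for word in ["spark", "kafka", "spark", "mllib", "spark"]:
    sketch.add(word)
print(sketch.estimate("spark"))  # at least 3, never below the true count
print(sketch.estimate("flink"))  # 0 unless every row happens to collide
```

Memory stays at width x depth counters no matter how many distinct items are added, which is the advantage over a HashMap with one entry per key.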

  • HyperLogLog

    Approximate cardinality

    Approx count distinct!

    From Twitter's Algebird!

    Low memory

    1.5KB @ 2% error, 10^9 elements!

    Streaming example in Spark codebase

    RDD: countApproxDistinctByKey()
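
The following is a from-scratch HyperLogLog in Python, included only to show the mechanics. It is not Algebird's or Spark's implementation; the class name, the SHA-256 hash, and the default of 2^11 registers are assumptions. With 2048 registers the textbook standard error is about 1.04 / sqrt(2048), roughly 2.3%, and 2048 six-bit registers would occupy 1.5KB (a Python list of ints uses far more).

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        self.p = p                 # number of bits used to pick a register
        self.m = 1 << p            # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        index = h >> (64 - self.p)                   # first p bits
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Position of the leftmost 1-bit in `rest`, counting from 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i % 50_000}")   # 50,000 distinct ids, each seen twice
print(round(hll.count()))           # close to 50,000, typically within a few percent
```

Duplicates never change a register, so seeing each id twice gives the same estimate as seeing it once.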

  • Monte Carlo Simulations

    From Manhattan Project (A-bomb)

    Simulate movement of neutrons

    Law of Large Numbers (LLN)

    Average of results of many trials

    Converge on expected value

    SparkPi example in Spark codebase

    Pi ~ (# red dots / # total dots) * 4
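
The dot-counting formula can be tried without a cluster. This is a single-process Python sketch of the same idea, not the SparkPi source; `estimate_pi` and its fixed seed are assumptions made so the run is repeatable. Points are drawn uniformly from the square [-1, 1] x [-1, 1], and the "red dots" are those that land inside the unit circle.

```python
import random


def estimate_pi(num_samples, seed=42):
    """Estimate pi as 4 * (points inside the unit circle) / (all points)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


# Law of Large Numbers: more trials tend to land closer to the expected value.
for n in (100, 10_000, 200_000):
    print(n, estimate_pi(n))
```

Each trial is independent of the others, which is why this kind of simulation splits cleanly across many workers.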

  • Recomme