Processing of big data with Apache Spark - JUG (jug.mk/presentations/javaskop18/spark.pdf)


Processing of big data with Apache Spark

JavaSkop ’18

Aleksandar Donevski

AGENDA
• What is Apache Spark?
• Spark vs Hadoop MapReduce
• Application Requirements
• Example Architecture
• Application Challenges

WHAT IS APACHE SPARK?
• Engine for large-scale data processing
• Open source
• APIs for Java, Scala, Python, and R
• Runs standalone or on YARN, Kubernetes, or Mesos
• Accesses HDFS, HBase, Cassandra, S3, and other data sources

SPARK ARCHITECTURE

RESILIENT DISTRIBUTED DATASET (RDD)
• Characteristics:
    • Immutable, distributed, partitioned, and resilient
• API:
    1. Transformations:
        ‒ map(), filter(), distinct(), union(), subtract(), etc.
    2. Actions:
        ‒ reduce(), collect(), count(), first(), take(), etc.

RDD OPERATIONS
• Transformations are executed on the workers
• Actions may transfer data from the workers to the driver
    • collect() sends all partitions to the single driver
• Persistence:
    • persist() and cache()
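The split between transformations and actions can be illustrated without a Spark cluster. The sketch below is an analogy only: it uses java.util.stream from the JDK, not the Spark RDD API, on the assumption that the lazy intermediate operations of a stream behave like RDD transformations and the terminal operation behaves like an action. The class name, the counter, and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Analogy only: java.util.stream, not the Spark RDD API.
class LazyPipelineSketch {
    // Counts how often the map step actually runs.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // "Transformations": describe the pipeline; nothing is computed here.
    static Stream<Integer> buildPipeline(List<Integer> data) {
        return data.stream()
                .map(x -> { mapCalls.incrementAndGet(); return x * 10; })
                .filter(x -> x > 10)
                .distinct();
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 2, 3, 4, 4, 5);
        Stream<Integer> pipeline = buildPipeline(data);
        System.out.println("map calls before the action: " + mapCalls.get());

        // "Action": the terminal operation triggers the whole computation.
        int sum = pipeline.reduce(0, Integer::sum);
        System.out.println("sum = " + sum);
        System.out.println("map calls after the action: " + mapCalls.get());
    }
}
```

Run as a standalone program, this prints 0 map calls before the action, a sum of 140, and 7 map calls afterwards. In Spark the same pattern means that no work is scheduled on the workers until an action such as reduce() or count() is called.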

SPARK VS MAPREDUCE
• Spark
    • Supports near-real-time streaming as well as batch
    • Processes data in memory
    • Handles structures that cannot be decomposed into key-value pairs
• MapReduce
    • Batch mode, not real-time
    • Writes intermediate results to disk after the map phase
    • Works on key-value pairs

APPLICATION REQUIREMENTS
• Verify that the application is thread-safe
• Use synchronized blocks appropriately
• Avoid duplicating objects
• Prefer arrays of objects and primitive types over standard collection classes
• Avoid unneeded data in objects
• Always remember that the application is executed in parallel!
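The thread-safety requirement matters because an executor runs several tasks as threads inside one JVM, so any state they share must be protected. The following is a minimal sketch using only the JDK, with hypothetical class and method names: it accumulates a sum from a parallel stream, once through a small synchronized block and once through a LongAdder. It stands in for shared state touched by concurrent tasks and is not Spark code.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

class ParallelCounterSketch {
    private final Object lock = new Object();
    private long total = 0;                          // plain field, guarded by 'lock'
    private final LongAdder adder = new LongAdder(); // thread-safe by design

    // Adds a value inside a synchronized block that is kept as small as possible.
    void addGuarded(long value) {
        synchronized (lock) {
            total += value;
        }
    }

    // Reads the guarded field under the same lock.
    long guardedTotal() {
        synchronized (lock) {
            return total;
        }
    }

    // Sums 1..n from many threads, using both strategies side by side.
    static long[] sumInParallel(int n) {
        ParallelCounterSketch c = new ParallelCounterSketch();
        IntStream.rangeClosed(1, n).parallel().forEach(i -> {
            c.addGuarded(i);
            c.adder.add(i);
        });
        return new long[] { c.guardedTotal(), c.adder.sum() };
    }

    public static void main(String[] args) {
        long[] totals = sumInParallel(100_000);
        System.out.println("synchronized total: " + totals[0]);
        System.out.println("LongAdder total:    " + totals[1]);
    }
}
```

Both totals come out as 5000050000. Replacing addGuarded() with an unguarded `total += value` would typically lose updates under parallel execution, which is the kind of bug the slide warns about.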

APPLICATION PIPELINE
• Define the application pipeline using the SparkContext object
• “Encapsulate” the common data needed across all pipeline steps
• Prepare the common data and broadcast it to the workers as needed

EXAMPLE ARCHITECTURE
[Diagram]
• Driver: List of Files, Balance, Process Metrics
• Worker 1 … Worker N: Load Content, Process & Enhance, Persist Content

DEPENDENCY ISSUES
• Example:
    • Log4j 1.x: used by Spark
    • Log4j 2.x: used by the application (Async Loggers)
    • The application fails at startup
• Resolution:
    • Shade the dependencies
    • Provide the dependencies through the “--jars” option and set the “spark.{driver,executor}.userClassPathFirst=true” properties
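The second resolution can also be written as configuration. The fragment below is a sketch of a spark-defaults.conf file, assuming the application's dependencies are packaged in a single jar at a placeholder path. spark.jars is the property form of the “--jars” option, and the Spark documentation marks both userClassPathFirst properties as experimental, so they are worth testing against the Spark version in use.

```properties
# spark-defaults.conf fragment (illustrative; the jar path is a placeholder)
spark.jars                          /path/to/app-dependencies.jar
spark.driver.userClassPathFirst     true
spark.executor.userClassPathFirst   true
```

The same properties can be passed to spark-submit as “--conf key=value” pairs instead of being placed in the file.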

MEMORY ISSUES
• Example:
    • Each executor uses 1 GB of RAM by default
    • Executors fail with an OOM error, so the application fails
• Resolution:
    • Verify the memory available in the cluster
    • Monitor and measure memory usage
    • Tune per application
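One simple way to measure memory usage is to ask the JVM itself. The sketch below uses only java.lang.Runtime and hypothetical names: it reports the heap currently in use and the maximum heap the JVM may grow to. In a Spark executor that maximum follows from the spark.executor.memory setting (the 1 GB default mentioned above); calling such a helper from application code or logging its result is an assumption about how one might monitor it, not something the slides prescribe.

```java
class HeapUsageSketch {
    // Heap currently in use by this JVM, in bytes.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Upper bound the JVM will try to use for the heap (controlled by -Xmx).
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("used heap: " + usedHeapBytes() / mb + " MB");
        System.out.println("max heap:  " + maxHeapBytes() / mb + " MB");
    }
}
```

Sampling these numbers before and after the expensive steps of a pipeline gives a rough basis for choosing an executor memory size per application.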

PERFORMANCE ISSUES
• Example:
    • The application takes too long to run on a simple data set
    • The last task takes too long to finish
• Resolution:
    • Verify the partitioning
    • Even out the processing time across tasks
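A single slow final task often points to skewed partitions. The sketch below, in plain Java with hypothetical names and data, assigns keys to partitions by non-negative hash modulo the partition count (the idea behind hash partitioning) and reports how far the largest partition is from the average. It illustrates what “verify the partitioning” can mean and is not a Spark API call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PartitionSkewSketch {
    // Assigns every key to a partition by non-negative hash modulo the
    // partition count and returns how many records land in each partition.
    static int[] partitionSizes(List<String> keys, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (String key : keys) {
            sizes[Math.floorMod(key.hashCode(), numPartitions)]++;
        }
        return sizes;
    }

    // Largest partition divided by the mean partition size;
    // 1.0 means perfectly balanced, larger values mean more skew.
    static double skewRatio(int[] sizes) {
        int max = Arrays.stream(sizes).max().orElse(0);
        double mean = Arrays.stream(sizes).average().orElse(0.0);
        return mean == 0.0 ? 0.0 : max / mean;
    }

    public static void main(String[] args) {
        // Hypothetical data: one "hot" key dominates the input.
        List<String> keys = new ArrayList<>(Collections.nCopies(97, "hot"));
        keys.addAll(Arrays.asList("a", "b", "c"));
        int[] sizes = partitionSizes(keys, 4);
        System.out.println("partition sizes: " + Arrays.toString(sizes));
        System.out.println("skew ratio: " + skewRatio(sizes));
    }
}
```

With this input at least 97 of the 100 records fall into one partition, so the task that processes it finishes long after the others. Choosing a different key or repartitioning the data are common ways to bring the ratio closer to 1.0.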

APPLICATION ISSUES
• Example:
    • The default Java serialization is used
    • Serialization takes too long
• Resolution:
    • Verify the object data and data structures used
    • Use Kryo serialization
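The overhead of default Java serialization is easy to observe with the JDK alone. The sketch below serializes a small hypothetical record type with ObjectOutputStream and compares the result with the 12 bytes its two fields actually need. It does not use Kryo, which is a separate library; in Spark, Kryo is enabled by setting spark.serializer to org.apache.spark.serializer.KryoSerializer.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationSizeSketch {
    // Hypothetical record type, used only for illustration.
    static class Reading implements Serializable {
        private static final long serialVersionUID = 1L;
        final int sensorId;
        final long timestamp;

        Reading(int sensorId, long timestamp) {
            this.sensorId = sensorId;
            this.timestamp = timestamp;
        }
    }

    // Size in bytes when written with default Java serialization.
    static int javaSerializedSize(Reading r) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Size in bytes of the raw field values alone (4-byte int + 8-byte long).
    static int rawFieldSize(Reading r) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(r.sensorId);
            out.writeLong(r.timestamp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Reading r = new Reading(7, 1_500_000_000_000L);
        System.out.println("Java serialization: " + javaSerializedSize(r) + " bytes");
        System.out.println("raw fields:         " + rawFieldSize(r) + " bytes");
    }
}
```

The difference comes from the stream header and class description that Java serialization writes alongside the data. Keeping objects small and flat reduces that cost regardless of the serializer chosen.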

API ISSUES
• Releases every three to four months
• Lots of under-the-hood changes
• Verify whether the application is affected
