Spark Streaming Basics and Scalable Time-Series Data Processing - Spark Meetup December...

Spark Streaming Basics and Scalable Time-Series Data Processing, 草薙 昭彦 (Akihiko Kusanagi), MapR Technologies, December 9, 2015. © 2015 MapR Technologies

Transcript of Spark Streaming Basics and Scalable Time-Series Data Processing - Spark Meetup December...

  • 2015 MapR Technologies 1

    Spark Streaming Basics and Scalable Time-Series Data Processing

    MapR Technologies, December 9, 2015

  • 2015 MapR Technologies 2

    Apache Spark Streaming ? Apache Spark Streaming

    (@nagix)

  • 2015 MapR Technologies 3

    Spark Streaming?

    [Figure labels: Web, put, time-stamped data, data for real-time monitoring]

  • 2015 MapR Technologies 4

    What is time series data? Stuff with timestamps:

    Sensor data, log files, phones

    Credit card transactions, web user behaviour

    Social media, log files

    Geodata

    Sensors

  • 2015 MapR Technologies 5

    Why Spark Streaming?

    What if you want to analyze data as it arrives?

    For example, time series data: sensors, clicks, logs, stats

  • 2015 MapR Technologies 6

    Batch Processing

    It's 6:01 and 72 degrees
    It's 6:02 and 75 degrees
    It's 6:03 and 77 degrees
    It's 6:04 and 85 degrees
    It's 6:05 and 90 degrees
    It's 6:06 and 85 degrees
    It's 6:07 and 77 degrees
    It's 6:08 and 75 degrees

    It was hot at 6:05 yesterday!

    Batch processing may be too late for some events

  • 2015 MapR Technologies 7

    Event Processing (Streaming)

    It's 6:05 and 90 degrees

    Someone should open a window!

    It's becoming important to process events as they arrive

  • 2015 MapR Technologies 8

    Spark Streaming

    Spark API

  • 2015 MapR Technologies 9

    Stream Processing Architecture

    [Figure labels: Sources/Apps; Data Ingest (Topics, MapR-FS); Stream Processing (Streaming); Data Storage (MapR-DB, MapR-FS); Apps; also HDFS and HBase]

  • 2015 MapR Technologies 10

    Input data sources: files (HDFS), network (TCP), and Twitter, Kafka, Flume, ZeroMQ, Akka Actor

    Transformation

  • 2015 MapR Technologies 11

    Spark Streaming

    The data stream is divided into batches of X seconds. A DStream is a sequence of RDDs, one RDD per batch.

    [Figure: DStream shown as RDD @ time 1 (time 0 to 1), RDD @ time 2 (time 1 to 2), RDD @ time 3 (time 2 to 3)]
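
The batching idea on this slide can be sketched without Spark. The following is a minimal plain-Python illustration, not Spark's implementation: the function name `split_into_batches` and the sample events are assumptions made for the example, and each inner list merely stands in for the RDD of one batch interval.

```python
from collections import defaultdict

def split_into_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Returns a list of (batch_start, [values]) sorted by batch_start. Each inner
    list plays the role of one RDD in the DStream. Intervals without events are
    simply omitted in this toy version.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Floor division assigns the event to the interval that contains it.
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].append(value)
    return sorted(batches.items())

# Hypothetical events: (seconds since start, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
dstream = split_into_batches(events, batch_seconds=1)
for batch_start, values in dstream:
    print(f"batch time {batch_start}-{batch_start + 1}: {values}")
```

With a one-second interval the five events fall into three batches, mirroring the "RDD @ time 1, 2, 3" layout of the figure.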

  • 2015 MapR Technologies 12

    Resilient Distributed Datasets (RDD)

    In Spark, RDDs are read-only.

  • 2015 MapR Technologies 13

  • 2015 MapR Technologies 14

    RDD

    textFile = sc.textFile("SomeFile.txt")

  • 2015 MapR Technologies 15

    [Figure: RDD → Transformations → RDD]

    linesRDD = sc.textFile("LogFile.txt")

    linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)

  • 2015 MapR Technologies 16

    [Figure: RDD → Transformations → RDD → Action → Value]

    linesRDD = sc.textFile("LogFile.txt")

    linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)

    linesWithErrorRDD.count()   # 6

    linesWithErrorRDD.first()   # Error line
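
The difference between a transformation (which only describes work) and an action (which triggers it and returns a value) can be imitated with Python's lazy built-in `filter`. This is only an analogy under stated assumptions: the sample log lines and the `evaluated` bookkeeping list are invented for the illustration, and no Spark API is involved.

```python
# Hypothetical log lines standing in for the contents of LogFile.txt
log_lines = [
    "INFO start",
    "ERROR disk full",
    "WARN slow response",
    "ERROR timeout",
]

evaluated = []  # records which lines the predicate has actually inspected

def has_error(line):
    evaluated.append(line)
    return "ERROR" in line

# "Transformation": building the lazy filter does no work yet.
lines_with_error = filter(has_error, log_lines)
evaluated_before_action = len(evaluated)

# "Action": materializing the result forces the filter to run.
error_lines = list(lines_with_error)
error_count = len(error_lines)   # analogous to count()
first_error = error_lines[0]     # analogous to first()

print(evaluated_before_action, error_count, first_error)
```

Nothing is inspected until the result is materialized, which is the same reason a chain of RDD transformations costs nothing until an action such as `count()` or `first()` is called.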

  • 2015 MapR Technologies 17

    DStream processing

    Transformations (map, reduceByValue, count, ...) are applied to a DStream. Each transformation is applied to the RDDs in the DStream and produces a new DStream.

    [Figure: RDD @ time 1, RDD @ time 2, RDD @ time 3 → transform → RDD @ time 1, RDD @ time 2, RDD @ time 3]

  • 2015 MapR Technologies 18

    Transformations on DStreams

    Same as for RDDs: map, filter, union, reduce, join, ...

    Streaming-specific: updateStateByKey(function), countByValueAndWindow, ...
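
To make the stateful case concrete, here is a small plain-Python sketch of the idea behind `updateStateByKey`: state is carried from one batch to the next and updated per key. The helper `update_state_by_key`, the update function `running_count`, and the sample batches are all hypothetical; only the shape of the update function (new values for the key, previous state) follows Spark's operation.

```python
def update_state_by_key(state, batch, update_func):
    """Return the new state after folding one batch of (key, value) pairs into it.

    update_func(new_values, previous_state) receives the batch's values for a
    key plus the previous state for that key (None if the key is new).
    """
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = dict(state)
    # Visit every known key, including keys that are absent from this batch.
    for key in set(state) | set(grouped):
        new_state[key] = update_func(grouped.get(key, []), state.get(key))
    return new_state

def running_count(new_values, previous):
    """Add this batch's values to the running total for the key."""
    return (previous or 0) + sum(new_values)

# Hypothetical batches of (key, 1) pairs, one list per batch interval
batches = [
    [("sensors", 1), ("clicks", 1)],
    [("clicks", 1), ("clicks", 1)],
    [("logs", 1)],
]

state = {}
for batch in batches:
    state = update_state_by_key(state, batch, running_count)

print(sorted(state.items()))
```

The point of the design is that the per-batch computation stays stateless while the accumulated result lives in `state`, which is what distinguishes these operations from the RDD-style ones listed first.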

  • 2015 MapR Technologies 19

    Spark Streaming

    [Figure: Spark Streaming divides the input into batches and Spark processes each batch; the DStream is shown as RDD @ time 1 (time 0 to 1), RDD @ time 2 (time 1 to 2), RDD @ time 3 (time 2 to 3)]

  • 2015 MapR Technologies 20

    Output operations (writing out the results of transformations):

    saveAsHadoopFiles: save to HDFS
    saveAsHadoopDataset: save to HBase
    saveAsTextFiles
    foreach: applied to each RDD batch
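
As a rough illustration of a per-batch output operation, the sketch below writes every batch to its own text file named `prefix-TIME.suffix`, which is the naming pattern `saveAsTextFiles` uses. It is a simplified stand-in under stated assumptions: the function `save_as_text_files` and the sample batches are hypothetical, and a single local file is written per batch, whereas Spark writes a directory of part files per batch.

```python
import os
import tempfile

def save_as_text_files(batches, prefix, suffix=""):
    """Write each (batch_time_ms, records) batch to '<prefix>-<time>[.<suffix>]'."""
    paths = []
    for batch_time_ms, records in batches:
        name = f"{prefix}-{batch_time_ms}"
        if suffix:
            name += f".{suffix}"
        with open(name, "w", encoding="utf-8") as out:
            for record in records:
                out.write(f"{record}\n")
        paths.append(name)
    return paths

# Hypothetical batches keyed by batch time in milliseconds
out_dir = tempfile.mkdtemp()
batches = [(1000, ["a", "b"]), (2000, ["c"])]
paths = save_as_text_files(batches, os.path.join(out_dir, "alerts"), "txt")
print([os.path.basename(p) for p in paths])
```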

  • 2015 MapR Technologies 21

  • 2015 MapR Technologies 22

    [Figure labels: read, Spark, Spark Streaming]

  • 2015 MapR Technologies 23

    Parsing CSV data into Sensor objects

    case class Sensor(resid: String, date: String, time: String,
      hz: Double, disp: Double, flo: Double, sedPPM: Double,
      psi: Double, chlPPM: Double)

    // Parse one comma-separated line into a Sensor
    def parseSensor(str: String): Sensor = {
      val p = str.split(",")
      Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
        p(6).toDouble, p(7).toDouble, p(8).toDouble)
    }
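
For readers who want to try the parsing step outside Spark, here is a plain-Python rendering of the same record layout. It is a sketch, not the talk's code: the field-count check is an addition, and the sample line is a hypothetical record invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Sensor:
    resid: str
    date: str
    time: str
    hz: float
    disp: float
    flo: float
    sedPPM: float
    psi: float
    chlPPM: float

def parse_sensor(line: str) -> Sensor:
    """Parse one comma-separated line into a Sensor, rejecting malformed rows."""
    p = line.strip().split(",")
    if len(p) != 9:
        raise ValueError(f"expected 9 fields, got {len(p)}: {line!r}")
    # The first three fields stay strings; the remaining six are numeric.
    return Sensor(p[0], p[1], p[2], *(float(x) for x in p[3:9]))

# Hypothetical sample line in the nine-field layout of the case class
sample = "COHUTTA,3/10/14,1:01,10.37,1.73,1154,0.79,84,1.23"
sensor = parse_sensor(sample)
print(sensor.resid, sensor.hz, sensor.psi)
```

Failing fast on a wrong field count keeps one bad line from surfacing later as an index error in the middle of a batch.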

  • 2015 MapR Technologies 24

    Table schema (column families: data, alerts, stats)

    Row key               | data:hz | data:psi | alerts:psi | stats:hz_avg | stats:psi_min
    COHUTTA_3/10/14_1:01  | 10.37   | 84       | 0          |              |
    COHUTTA_3/10/14       |         |          |            | 10           | 0
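
The two sample row keys above suggest a composite key: sensor id and date for the daily stats row, with the time appended for each raw reading. The sketch below shows one way such keys could be built and related. The function names and the underscore separator before the time are assumptions read off the sample rows, not a statement of the actual schema code.

```python
def data_row_key(resid, date, time, sep="_"):
    """Row key for one raw reading: sensor id, date, and time."""
    return f"{resid}_{date}{sep}{time}"

def stats_row_key(resid, date):
    """Row key for the per-sensor, per-day statistics row."""
    return f"{resid}_{date}"

def stats_key_from_data_key(row_key, sep="_"):
    """Drop the trailing time component to find the matching stats row key."""
    # rsplit with maxsplit=1 splits only on the last separator.
    return row_key.rsplit(sep, 1)[0]

key = data_row_key("COHUTTA", "3/10/14", "1:01")
print(key, "->", stats_key_from_data_key(key))

# HBase keeps rows sorted by key, so keys sharing a sensor/date prefix are adjacent.
keys = sorted([
    data_row_key("COHUTTA", "3/10/14", "1:02"),
    data_row_key("ANDOUILLE", "3/10/14", "1:01"),
    data_row_key("COHUTTA", "3/10/14", "1:01"),
])
print(keys)
```

Because rows are stored in key order, putting the sensor id first keeps all readings of one sensor and day contiguous, which makes range scans for that sensor cheap.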

  • 2015 MapR Technologies 25

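    Basic steps of a Spark Streaming application:

    1. Initialize a Spark StreamingContext
    2. Apply operations to DStreams
       1. Transformations: create new DStreams
       2. Output operations
    3. Start receiving and processing data: streamingContext.start()
    4. Wait for processing to stop: streamingContext.awaitTermination()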

  • 2015 MapR Technologies 26

    DStream

    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val linesDStream = ssc.textFileStream("/mapr/stream")

    [Figure: linesDStream is a DStream, a sequence of RDDs with one RDD per batch (batch time 0-1, batch time 1-2, ...)]

  • 2015 MapR Technologies 27

    DStream

    val linesDStream = ssc.textFileStream("directory path")
    val sensorDStream = linesDStream.map(parseSensor)

    [Figure: for each batch (batch time 0-1, batch time 1-2, ...), map turns the RDD of linesDStream into the RDD of sensorDStream]

  • 2015 MapR Technologies 28

    DStream

    // For each RDD in the DStream
    sensorDStream.foreachRDD { rdd =>
      // Filter the sensor data for low psi
      val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
      . . .
    }
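
The per-batch alert filter can be tried in isolation with plain Python. In this sketch each inner list stands in for the RDD of one batch. The threshold of 5.0 comes from the slide, while `find_alerts` and the sample readings are hypothetical.

```python
LOW_PSI_THRESHOLD = 5.0  # same threshold as the filter on the slide

def find_alerts(batch):
    """Return the readings in one batch whose psi is below the threshold."""
    return [reading for reading in batch if reading["psi"] < LOW_PSI_THRESHOLD]

# Hypothetical readings: each inner list stands for the RDD of one batch.
batches = [
    [{"resid": "COHUTTA", "psi": 84.0}, {"resid": "COHUTTA", "psi": 0.0}],
    [{"resid": "COHUTTA", "psi": 4.9}, {"resid": "COHUTTA", "psi": 5.0}],
]

# The loop plays the role of foreachRDD: the same function runs on every batch.
alerts_per_batch = [find_alerts(batch) for batch in batches]
for index, alerts in enumerate(alerts_per_batch):
    print(f"batch {index}: {len(alerts)} alert(s)")
```

Note that the comparison is strict, so a reading of exactly 5.0 psi does not raise an alert.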

  • 2015 MapR Technologies 29

    DataFrame SQL

    // For each RDD in the DStream
    sensorDStream.foreachRDD { rdd =>
      . . .
      alertRDD.toDF().registerTempTable("alert")
      // Join the alert data with the pump and maintenance data
      val alertViewDF = sqlContext.sql(
        "select s.resid, s.psi, p.pumpType from alert s join pump p on s.resid = p.resid join maint m on p.resid=m.resid")
      . . .
    }

  • 2015 MapR Technologies 30

    HBase

    // For each RDD in the DStream
    sensorDStream.foreachRDD { rdd =>
      . . .
      // Convert each record to a Put and write it to HBase
      rdd.map(Sensor.convertToPutAlert)
        .saveAsHadoopDataset(jobConfig)
    }

  • 2015 MapR Technologies 31

    HBase

    rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)

    [Figure: for each batch (batch time 0-1, batch time 1-2, ...), map converts the sensor RDD into Put objects, which are saved to HBase]

  • 2015 MapR Technologies 32

    sensorDStream.foreachRDD { rdd =>
      . . .
    }
    // Start the computation
    ssc.start()
    // Wait for the computation to terminate
    ssc.awaitTermination()

  • 2015 MapR Technologies 33

    HBase

    [Figure: Spark reads from and writes to HBase]

  • 2015 MapR Technologies 34

    HBase Read and Write

    // Read: newAPIHadoopRDD scans HBase and returns an RDD of (row key, Result) pairs
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    // Write: saveAsHadoopDataset writes (key, Put) pairs to HBase
    keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)

  • 2015 MapR Technologies 35

    HBase

    // Load an RDD of (rowkey, Result) tuples from the HBase table
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    // Keep only the Result from each tuple
    val resultRDD = hBaseRDD.map(tuple => tuple._2)
    // Transform into an RDD of (RowKey, ColumnValue) pairs
    val keyValueRDD = resultRDD.map(result =>
      (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))
    // Group by rowkey and compute statistics for each group
    val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
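
The group-and-summarize step can be sketched in plain Python as follows. This is an analogy to `groupByKey().mapValues(...)` with a statistics counter, not Spark code: `stats_by_key`, the chosen statistics, and the sample rows are assumptions, and the row key is split on a space as in the code on this slide.

```python
from collections import defaultdict
from statistics import mean

def stats_by_key(rows):
    """Group (row_key, value) pairs by the key prefix and summarize each group."""
    grouped = defaultdict(list)
    for row_key, value in rows:
        # Drop the time part of the row key, as split(" ")(0) does on the slide.
        key = row_key.split(" ")[0]
        grouped[key].append(value)
    return {
        key: {"count": len(values), "mean": mean(values),
              "min": min(values), "max": max(values)}
        for key, values in grouped.items()
    }

# Hypothetical (row key, psi) pairs
rows = [
    ("COHUTTA_3/10/14 1:01", 84.0),
    ("COHUTTA_3/10/14 1:02", 0.0),
    ("ANDOUILLE_3/10/14 1:01", 90.0),
]

key_stats = stats_by_key(rows)
for key, stats in sorted(key_stats.items()):
    print(key, stats)
```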

  • 2015 MapR Technologies 36

    HBase

    // Set up the job configuration for writing data to HBase
    val jobConfig: JobConf = new JobConf(conf, this.getClass)
    jobConfig.setOutputFormat(classOf[TableOutputFormat])
    jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
    // Convert the stats to Puts and write them to HBase
    keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)

  • 2015 MapR Technologies 37

    https://www.mapr.com/blog/spark-streaming-hbase

  • 2015 MapR Technologies 38

  • 2015 MapR Technologies 39

    MapR Converged Data Platform

    NEW: MapR Streams (Kafka API)

  • 2015 MapR Technologies 40

    Q & A

    MapR: @mapr_japan, maprjapan, maprtech, mapr-technologies