Spark Streaming Basics and Scalable Time-Series Data Processing - Spark Meetup December...
mapr-technologies-japan -
Transcript of Spark Streaming Basics and Scalable Time-Series Data Processing - Spark Meetup December...
-
2015 MapR Technologies 1
Spark Streaming
MapR Technologies, December 9, 2015
-
2015 MapR Technologies 2
Apache Spark Streaming ? Apache Spark Streaming
(@nagix)
-
2015 MapR Technologies 3
Spark Streaming ?
[Diagram: Web and other sources put time-stamped data; the data is used for real-time monitoring]
-
2015 MapR Technologies 4
What is time series data? Stuff with timestamps:
Sensor data
Log files
Phones
Credit card transactions
Web user behaviour
Social media
Geodata
-
2015 MapR Technologies 5
Why Spark Streaming?
What if you want to analyze data as it arrives?
For example, time series data: sensors, clicks, logs, stats.
-
2015 MapR Technologies 6
Batch Processing
It's 6:01 and 72 degrees
It's 6:02 and 75 degrees
It's 6:03 and 77 degrees
It's 6:04 and 85 degrees
It's 6:05 and 90 degrees
It's 6:06 and 85 degrees
It's 6:07 and 77 degrees
It's 6:08 and 75 degrees
It was hot at 6:05 yesterday!
Batch processing may be too late for some events.
-
2015 MapR Technologies 7
Event Processing
It's 6:05 and 90 degrees
Someone should open a window!
Streaming: it's becoming important to process events as they arrive.
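The contrast between the two slides can be sketched in plain Python (no Spark involved). The readings are the ones on the slides; the function names and the 90-degree threshold are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Reading = Tuple[str, int]  # (time of day, temperature in degrees)

READINGS: List[Reading] = [
    ("6:01", 72), ("6:02", 75), ("6:03", 77), ("6:04", 85),
    ("6:05", 90), ("6:06", 85), ("6:07", 77), ("6:08", 75),
]


def batch_report(readings: List[Reading]) -> Reading:
    """Batch style: needs the complete data set before it can answer."""
    return max(readings, key=lambda r: r[1])


def event_alerts(readings: Iterable[Reading], threshold: int = 90) -> Iterator[str]:
    """Event style: reacts to each reading as soon as it arrives."""
    for time, degrees in readings:
        if degrees >= threshold:  # threshold is a hypothetical alert level
            yield f"It's {time} and {degrees} degrees: someone should open a window!"


if __name__ == "__main__":
    print(batch_report(READINGS))
    for alert in event_alerts(iter(READINGS)):
        print(alert)
```

The batch function can only run once all readings are collected, while the generator emits its alert at the moment the 6:05 reading passes through it.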
-
2015 MapR Technologies 8
Spark Streaming
Spark API
-
2015 MapR Technologies 9
Stream Processing Architecture
[Architecture diagram. Labels: Streaming Sources/Apps; Data Ingest (MapR-FS, Topics); Stream Processing; Data Storage (MapR-DB, MapR-FS); Apps; HDFS, HBase]
-
2015 MapR Technologies 10
Input sources: HDFS (files), TCP (sockets)
Twitter, Kafka, Flume, ZeroMQ, Akka Actor
Transformation
-
2015 MapR Technologies 11
Spark Streaming
The data stream is divided into batches of X seconds. A DStream is a sequence of RDDs, one RDD per batch.
[Diagram: a DStream made of RDD @ time 1 (time 0 to 1), RDD @ time 2 (time 1 to 2), RDD @ time 3 (time 2 to 3)]
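As a rough illustration of the batching idea on this slide, the following plain-Python sketch groups timestamped events into consecutive fixed-length batches. It is not Spark code; the function name `micro_batches` and the event format are assumptions, and timestamps are assumed to be non-negative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]], interval: float) -> List[List[str]]:
    """Group (timestamp, value) events into consecutive batches of `interval` seconds."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, value in events:
        # Batch 0 covers [0, interval), batch 1 covers [interval, 2 * interval), ...
        buckets[int(timestamp // interval)].append(value)
    if not buckets:
        return []
    # Emit one (possibly empty) batch per interval, in time order.
    return [buckets.get(i, []) for i in range(max(buckets) + 1)]


if __name__ == "__main__":
    events = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (3.1, "d")]
    for i, batch in enumerate(micro_batches(events, 1.0)):
        print(f"batch time {i}-{i + 1}: {batch}")
```

Each inner list plays the role of one RDD in the DStream; an interval with no events still yields an (empty) batch.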
-
2015 MapR Technologies 12
Resilient Distributed Datasets (RDD)
RDDs are Spark's read-only (immutable) distributed collections.
-
2015 MapR Technologies 14
RDD
Creating an RDD from a file:
textFile = sc.textFile("SomeFile.txt")
-
2015 MapR Technologies 15
RDD
[Diagram: RDD -> Transformations -> RDD]
linesRDD = sc.textFile("LogFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
-
2015 MapR Technologies 16
RDD
[Diagram: RDD -> Transformations -> RDD -> Action -> Value]
linesRDD = sc.textFile("LogFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)
linesWithErrorRDD.count()
6
linesWithErrorRDD.first()
# Error line
-
2015 MapR Technologies 17
DStream
A transformation on a DStream (for example map, reduceByKey, count) is applied to each RDD in the DStream and produces a new DStream.
[Diagram: transform applied to RDD @ time 1, RDD @ time 2 and RDD @ time 3 of one DStream, producing the corresponding RDDs of a new DStream]
-
2015 MapR Technologies 18
Transformations on DStreams
Same as for RDDs: map, filter, union, reduce, join, ...
Stream-specific: updateStateByKey(function), countByValueAndWindow, ...
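What sets updateStateByKey apart from the RDD-style operations is that state is carried from one batch to the next. The following plain-Python sketch mimics that behaviour under simplifying assumptions: integer values, no key removal, and hypothetical function names. It is not the Spark implementation.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def update_state_by_key(
    batches: Iterable[List[Tuple[str, int]]],
    update: Callable[[List[int], Optional[int]], int],
) -> List[Dict[str, int]]:
    """Return the state after each batch, carrying per-key state across batches."""
    state: Dict[str, int] = {}
    snapshots: List[Dict[str, int]] = []
    for batch in batches:
        new_values: Dict[str, List[int]] = {}
        for key, value in batch:
            new_values.setdefault(key, []).append(value)
        # Call the update function for every known key, even when the
        # current batch has no new values for it.
        for key in set(state) | set(new_values):
            state[key] = update(new_values.get(key, []), state.get(key))
        snapshots.append(dict(state))
    return snapshots


def running_sum(new_values: List[int], previous: Optional[int]) -> int:
    """Update function: add this batch's values to the previous total."""
    return (previous or 0) + sum(new_values)


if __name__ == "__main__":
    batches = [[("a", 1), ("b", 1)], [("a", 2)]]
    print(update_state_by_key(batches, running_sum))
```

The update function receives the new values for a key plus the previous state, which matches the shape of the function passed to updateStateByKey.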
-
2015 MapR Technologies 19
Spark Streaming
Spark processes each batch and outputs the results in batches.
[Diagram: a DStream made of RDD @ time 1 (time 0 to 1), RDD @ time 2 (time 1 to 2), RDD @ time 3 (time 2 to 3), processed by Spark batch by batch]
-
2015 MapR Technologies 20
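Output operations on the transformed data:
saveAsHadoopFiles: save to HDFS
saveAsHadoopDataset: save to HBase
saveAsTextFiles: save as text files
foreachRDD: run arbitrary processing on the RDD of each batch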
-
2015 MapR Technologies 21
-
2015 MapR Technologies 22
[Diagram. Labels: read, Spark, Spark Streaming]
-
2015 MapR Technologies 23
Converting CSV lines to Sensor objects
// Schema of one sensor reading
case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
  flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

// Parse one comma-separated line into a Sensor
def parseSensor(str: String): Sensor = {
  val p = str.split(",")
  Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
    p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
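For readers following the Python snippets earlier in the deck, here is a rough Python counterpart of the Sensor parser. The field names come from the slide; the field-count check and the sample line (with made-up values) are additions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sensor:
    resid: str
    date: str
    time: str
    hz: float
    disp: float
    flo: float
    sedPPM: float
    psi: float
    chlPPM: float


def parse_sensor(line: str) -> Sensor:
    """Parse one comma-separated line into a Sensor."""
    p = line.split(",")
    if len(p) != 9:
        # Not on the slide: fail loudly on malformed lines.
        raise ValueError(f"expected 9 fields, got {len(p)}: {line!r}")
    return Sensor(p[0], p[1], p[2], *(float(x) for x in p[3:9]))


# Hypothetical sample line; the values are invented for this example.
SAMPLE_LINE = "COHUTTA,3/10/14,1:01,10.37,1.73,881.0,1.56,84.0,1.94"

if __name__ == "__main__":
    print(parse_sensor(SAMPLE_LINE))
```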
-
2015 MapR Technologies 24
Table schema (column families: data, alerts, stats)
Row key                 data:hz   data:psi   alerts:psi   stats:hz_avg   stats:psi_min
COHUTTA_3/10/14_1:01    10.37     84         0
COHUTTA_3/10/14                                           10             0
-
2015 MapR Technologies 25
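Basic steps of a Spark Streaming program:
1. Create a Spark StreamingContext.
2. Apply operations to a DStream:
   1. Transformations, which produce a new DStream
   2. Output operations
3. Start receiving and processing data with streamingContext.start().
4. Wait for processing to finish with streamingContext.awaitTermination().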
-
2015 MapR Technologies 26
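Creating a DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Batch interval of 2 seconds
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Monitor a directory and read new files as lines of text
val linesDStream = ssc.textFileStream("/mapr/stream")
[Diagram: linesDStream as a sequence of RDDs, one per batch (batch time 0-1, batch time 1-2, ...)]
A DStream is a sequence of RDDs representing the data stream.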
-
2015 MapR Technologies 27
Processing a DStream
val linesDStream = ssc.textFileStream("directory path")
val sensorDStream = linesDStream.map(parseSensor)
[Diagram: map is applied to the RDD of each batch in linesDStream (batch time 0-1, batch time 1-2, ...) and produces the corresponding RDD in sensorDStream]
-
2015 MapR Technologies 28
Processing a DStream
// Process each RDD in the DStream
sensorDStream.foreachRDD { rdd =>
  // Filter the sensor data for low psi
  val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
  . . .
}
-
2015 MapR Technologies 29
DataFrame and SQL operations
// Process each RDD in the DStream
sensorDStream.foreachRDD { rdd =>
  . . .
  alertRDD.toDF().registerTempTable("alert")
  // Join the alert data with the pump and maintenance data
  val alertViewDF = sqlContext.sql(
    "select s.resid, s.psi, p.pumpType from alert s join pump p on s.resid = p.resid join maint m on p.resid=m.resid")
  . . .
}
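To see what the join in this query returns, the same SQL can be tried against an in-memory SQLite database from Python. This is only a sketch: SQLite stands in for Spark SQL, and the table contents, the `eventDate` column and the LAGNAPPE and pump-type values are invented for illustration. Only the query text comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table alert (resid text, psi real);
    create table pump  (resid text, pumpType text);
    create table maint (resid text, eventDate text);
    -- Hypothetical rows: two alerts, but only COHUTTA has a maintenance record.
    insert into alert values ('COHUTTA', 0.0), ('LAGNAPPE', 3.5);
    insert into pump  values ('COHUTTA', 'HYDROPUMP'), ('LAGNAPPE', 'PLUNGER');
    insert into maint values ('COHUTTA', '3/1/14');
    """
)

# Query text as on the slide.
QUERY = (
    "select s.resid, s.psi, p.pumpType from alert s "
    "join pump p on s.resid = p.resid "
    "join maint m on p.resid=m.resid"
)

rows = conn.execute(QUERY).fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```

Because both joins are inner joins, an alert appears in the result only when its resid has matching rows in both pump and maint.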
-
2015 MapR Technologies 30
Saving to HBase
// Process each RDD in the DStream
sensorDStream.foreachRDD { rdd =>
  . . .
  // Convert to Put objects and write them to HBase
  rdd.map(Sensor.convertToPutAlert)
    .saveAsHadoopDataset(jobConfig)
}
-
2015 MapR Technologies 31
Saving to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)
[Diagram: for each batch (batch time 0-1, batch time 1-2, ...), map converts the sensor RDD to Put objects, which are then saved to HBase]
-
2015 MapR Technologies 32
sensorDStream.foreachRDD { rdd =>
  . . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
-
2015 MapR Technologies 33
HBase
[Diagram: Spark reads from and writes to HBase]
-
2015 MapR Technologies 34
HBase Read and Write
Read: newAPIHadoopRDD scans the HBase table and returns an RDD of (row key, Result) pairs.
val hBaseRDD = sc.newAPIHadoopRDD(
  conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
Write: saveAsHadoopDataset writes an RDD of (key, Put) pairs to HBase.
keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
-
2015 MapR Technologies 35
Reading from HBase
// Load an RDD of (rowkey, Result) tuples from the HBase table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
// Keep only the Result of each tuple
val resultRDD = hBaseRDD.map(tuple => tuple._2)
// Transform into an RDD of (RowKey, ColumnValue) tuples
val keyValueRDD = resultRDD.map(
  result => (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))
// Group by rowkey and compute statistics over the values
val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
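The group-and-summarize step at the end of this slide can be sketched in plain Python. This is not the Spark or HBase API: `key_stats` and `Stats` are hypothetical names, `Stats` covers only a few of the figures a StatCounter tracks, and the row-key format with a space before the time is an assumption chosen so that the slide's split(" ")(0) has something to split on.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Stats:
    count: int
    mean: float
    min: float
    max: float


def key_stats(pairs: Iterable[Tuple[str, float]]) -> Dict[str, Stats]:
    """Group values by the row-key prefix and summarize each group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for row_key, value in pairs:
        # Keep the part of the row key before the first space,
        # as split(" ")(0) does on the slide.
        grouped[row_key.split(" ")[0]].append(value)
    return {k: Stats(len(v), mean(v), min(v), max(v)) for k, v in grouped.items()}


if __name__ == "__main__":
    pairs = [
        ("COHUTTA_3/10/14 1:01", 80.0),
        ("COHUTTA_3/10/14 1:02", 90.0),
        ("ANDOUILLE_3/10/14 1:01", 70.0),  # hypothetical second sensor
    ]
    for key, stats in key_stats(pairs).items():
        print(key, stats)
```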
-
2015 MapR Technologies 36
Writing to HBase
// Set up the HBase job configuration for writing data
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
// Convert the stats to Put objects and write them to HBase
keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
-
2015 MapR Technologies 37
https://www.mapr.com/blog/spark-streaming-hbase
-
2015 MapR Technologies 38
-
2015 MapR Technologies 39
MapR Converged Data Platform
NEW: MapR Streams, with the Kafka API
-
2015 MapR Technologies 40
Q & A @mapr_japan maprjapan
MapR
maprtech
mapr-technologies