PyData: Introduction to Amazon Kinesis
-
Amazon Kinesis
-
@ToshiakiEnami
AWS Amazon Kinesis Amazon DynamoDB
-
AWS
[Figure labels: clients submitting data, process submissions, store batches (S3), process hourly w/ Hadoop, ETL job, data warehouse]
Keep everything
-
Ingest
Client/Sensor -> Ingest -> Processing -> Storage -> Analytics + Visualization + Reporting
-
Ingest Layer
[Figure: an ingest layer, Kafka or Kinesis, sits in front of the processing layers]
-
Kinesis
-
Amazon Kinesis
Kinesis1AZ
-
POS
-
Kinesis
Putting data in: HTTPS Post, AWS SDK, LOG4J, Flume, Fluentd, Mobile SDK & Cognito
Getting data out: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
-
Kinesis
[Figure: data sources send records to the AWS endpoint. The Kinesis Stream is made up of Shard 1, Shard 2, ... Shard N and spans three Availability Zones. Consumers: App.1 [Aggregate & De-Duplicate], App.2 [Metric Extraction], App.3 [Real-time Dashboard], App.4 [Machine Learning]. Destinations: S3, DynamoDB, Redshift.]
A Stream consists of one or more Shards.
Per Shard, writes: up to 1 MB/sec and 1,000 TPS. Reads: up to 2 MB/sec and 5 TPS.
Data Records are kept for 24 hours and stored across multiple AZs.
-
Kinesis pricing
Shard: $0.0195 per shard per hour
Put: $0.043 per 1,000,000 Puts
About $14 per shard per month (0.0195 x 24 hours x 30 days)
Get, EC2
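The arithmetic behind the monthly figure can be sketched as follows. This assumes the Put price on the slide is per one million Puts and a 30-day month; the function name and the example write rate are illustrative, not from the slides.

```python
# Prices quoted on the slide (USD).
SHARD_HOUR_PRICE = 0.0195        # per shard per hour
PUT_PRICE_PER_MILLION = 0.043    # per 1,000,000 Puts (assumed unit)


def monthly_cost(shards, puts_per_second, days=30):
    """Return (shard_cost, put_cost, total) in USD for one month."""
    hours = 24 * days
    shard_cost = shards * SHARD_HOUR_PRICE * hours
    total_puts = puts_per_second * 3600 * hours
    put_cost = total_puts / 1000000.0 * PUT_PRICE_PER_MILLION
    return shard_cost, put_cost, shard_cost + put_cost


if __name__ == "__main__":
    shard_cost, put_cost, total = monthly_cost(shards=1, puts_per_second=100)
    print("shards: $%.2f  puts: $%.2f  total: $%.2f" % (shard_cost, put_cost, total))
```

With one shard and no writes, the shard charge alone comes to roughly $14, which matches the figure on the slide.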
-
PutRecord API: http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html
Available in the AWS SDKs for Java, JavaScript, Python, Ruby, PHP, and .NET
In boto: put_record
http://docs.pythonboto.org/en/latest/ref/kinesis.html#module-boto.kinesis.layer1
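A minimal sketch of writing one record with boto's put_record. The stream name, the event fields, and the `send_event` helper are hypothetical. `FakeConnection` is only a stand-in so the sketch runs without AWS credentials; a real connection would come from `boto.kinesis.connect_to_region`.

```python
import json


def send_event(conn, stream_name, event, key_field="device_id"):
    """Serialize one event as JSON and write it with PutRecord.

    `conn` is expected to behave like boto.kinesis.layer1.KinesisConnection.
    """
    data = json.dumps(event)
    partition_key = str(event[key_field])
    # boto's put_record takes (stream_name, data, partition_key).
    return conn.put_record(stream_name, data, partition_key)


class FakeConnection(object):
    """Stand-in for a real connection so the sketch runs without AWS access."""

    def __init__(self):
        self.sent = []

    def put_record(self, stream_name, data, partition_key):
        self.sent.append((stream_name, data, partition_key))
        # The response shape mirrors the PutRecord API.
        return {"ShardId": "shardId-000000000000",
                "SequenceNumber": str(len(self.sent))}


if __name__ == "__main__":
    # With real credentials, replace FakeConnection() with:
    #   import boto.kinesis
    #   conn = boto.kinesis.connect_to_region("ap-northeast-1")
    conn = FakeConnection()
    print(send_event(conn, "example-stream", {"device_id": 42, "temp": 21.5}))
```

Using a device or user ID as the partition key keeps all records from one source in the same shard, and therefore in order.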
-
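Distributing Data Records across Shards
The partition key of each Data Record is hashed with MD5, and the resulting value decides which Shard receives the record.
[Figure: the MD5 value range runs from 0 to 2^128. Shard-0 covers 0 to 2^127; Shard-1 covers 2^127 to 2^128. MD5(partition key) falls into one of the two ranges.]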
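The mapping from partition key to shard can be sketched like this. The two ranges follow the figure; the exact inclusive bounds and the function names are assumptions made for illustration.

```python
import hashlib

# Two shards splitting the 128-bit hash key space in half, as in the figure.
SHARD_RANGES = [
    ("Shard-0", 0, 2 ** 127 - 1),
    ("Shard-1", 2 ** 127, 2 ** 128 - 1),
]


def hash_key(partition_key):
    """MD5 of the partition key, read as an unsigned 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16)


def shard_for_hash(value, ranges=SHARD_RANGES):
    """Return the name of the shard whose range contains the hash value."""
    for name, start, end in ranges:
        if start <= value <= end:
            return name
    raise ValueError("hash key outside the 128-bit range: %r" % value)


def shard_for_key(partition_key):
    return shard_for_hash(hash_key(partition_key))


if __name__ == "__main__":
    for key in ("user-1", "user-2", "user-3"):
        print(key, shard_for_key(key))
```

The same key always maps to the same shard, so a skewed key distribution produces a hot shard.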
-
Shard
A Kinesis Stream keeps Data Records for 24 hours.
[Figure: records in one shard, ordered by sequence number: SeqNo(14), SeqNo(17), SeqNo(25), SeqNo(26), SeqNo(32)]
-
Web
-
Fluentd Plugin: Web
Plugin on GitHub: https://github.com/awslabs/aws-fluent-plugin-kinesis
Log4J: for Java applications that use Log4J
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/kinesis-pig-publisher.html
log4j.properties:
# KINESIS appender
log4j.logger.KinesisLogger=INFO, KINESIS
log4j.additivity.KinesisLogger=false
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender
log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
log4j.appender.KINESIS.layout.ConversionPattern=%m
-
MQTT Broker and Kinesis-MQTT Bridge
(MQTT) -> MQTT Broker -> MQTT-Kinesis Bridge -> Kinesis
MQTT-Kinesis Bridge on GitHub: https://github.com/awslabs/mqtt-kinesis-bridge
[Figure: MQTT Broker and Kinesis-MQTT Bridge running in an Auto Scaling group]
-
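Cognito + Mobile SDK -> Kinesis
[Figure: Amazon Cognito ID flow. End users run an app with the SDK. The app logs in to an identity provider (OAuth/OpenID) and receives an access token. It sends the access token, pool ID, and role ARNs to the Cognito identity pool and receives a Cognito ID and temporary credentials. It then calls Put Record on Kinesis in the AWS account. The access policy on the identity pool distinguishes unauthenticated identities from authenticated identities.]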
-
Call the GetShardIterator API to get a position in a Shard, then call the GetRecords API to read from that position
http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html
http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html
Available in the AWS SDKs for Java, JavaScript, Python, Ruby, PHP, and .NET
In boto: get_shard_iterator, get_records
http://docs.pythonboto.org/en/latest/ref/kinesis.html#module-boto.kinesis.layer1
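A minimal sketch of the read loop built from these two calls. The call shapes follow boto's get_shard_iterator and get_records. `read_shard`, its limits, and `FakeConnection` are hypothetical helpers so the loop can run without AWS access.

```python
def read_shard(conn, stream_name, shard_id, iterator_type="TRIM_HORIZON",
               max_calls=10, limit=100):
    """Collect records from one shard using GetShardIterator + GetRecords."""
    response = conn.get_shard_iterator(stream_name, shard_id, iterator_type)
    iterator = response["ShardIterator"]
    records = []
    for _ in range(max_calls):
        if iterator is None:      # the shard was closed and fully read
            break
        result = conn.get_records(iterator, limit=limit)
        records.extend(result["Records"])
        iterator = result.get("NextShardIterator")
        if not result["Records"]:
            break                 # caught up; a real consumer would sleep and retry
    return records


class FakeConnection(object):
    """Stand-in for boto's KinesisConnection, backed by a plain list."""

    def __init__(self, records):
        self.records = records

    def get_shard_iterator(self, stream_name, shard_id, shard_iterator_type):
        start = 0 if shard_iterator_type == "TRIM_HORIZON" else len(self.records)
        return {"ShardIterator": str(start)}

    def get_records(self, shard_iterator, limit=None):
        start = int(shard_iterator)
        end = len(self.records) if limit is None else start + limit
        batch = self.records[start:end]
        return {"Records": batch, "NextShardIterator": str(start + len(batch))}


if __name__ == "__main__":
    stored = [{"SequenceNumber": str(n), "PartitionKey": "k", "Data": "x"}
              for n in (14, 17, 25)]
    conn = FakeConnection(stored)
    print(len(read_shard(conn, "example-stream", "shardId-000000000000", limit=2)))
```

Each GetRecords response carries the iterator for the next call, so the loop never needs to track sequence numbers itself.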
-
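GetShardIterator
The ShardIteratorType passed to the GetShardIterator API decides where reading starts.
ShardIteratorType:
AT_SEQUENCE_NUMBER (start at the record with the specified sequence number)
AFTER_SEQUENCE_NUMBER (start right after the record with the specified sequence number)
TRIM_HORIZON (start at the oldest record still in the Shard)
LATEST (start just after the most recent record in the Shard)
[Figure: positions returned by GetShardIterator along a shard, relative to "Seq: xxx": TRIM_HORIZON at the oldest end, AT_SEQUENCE_NUMBER and AFTER_SEQUENCE_NUMBER at and after the given sequence number, LATEST at the newest end]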
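The four iterator types can be illustrated on a plain list of sequence numbers. This is a local simulation, not a call to the service; the sequence numbers are the ones used in the shard figure earlier in the deck, and the function names are made up for the example.

```python
import bisect

SEQ_NOS = [14, 17, 25, 26, 32]   # sequence numbers in one shard, oldest first


def start_index(seq_nos, iterator_type, sequence_number=None):
    """Index of the first record a new iterator of the given type returns."""
    if iterator_type == "TRIM_HORIZON":
        return 0
    if iterator_type == "LATEST":
        return len(seq_nos)       # only records added later are returned
    if iterator_type == "AT_SEQUENCE_NUMBER":
        return bisect.bisect_left(seq_nos, sequence_number)
    if iterator_type == "AFTER_SEQUENCE_NUMBER":
        return bisect.bisect_right(seq_nos, sequence_number)
    raise ValueError("unknown ShardIteratorType: %s" % iterator_type)


def records_from(seq_nos, iterator_type, sequence_number=None):
    return seq_nos[start_index(seq_nos, iterator_type, sequence_number):]


if __name__ == "__main__":
    for kind in ("TRIM_HORIZON", "AT_SEQUENCE_NUMBER",
                 "AFTER_SEQUENCE_NUMBER", "LATEST"):
        print(kind, records_from(SEQ_NOS, kind, 25))
```

AFTER_SEQUENCE_NUMBER is the natural choice when resuming from a stored checkpoint, since the checkpointed record has already been processed.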
-
Kinesis Client Library (KCL)
Client library for fault-tolerant, at-least-once, continuous processing
[Figure: Shard 1 to Shard n of a Kinesis stream are processed by KCL Worker 1 to KCL Worker n, which run on EC2 instances]
Shards and Workers, Auto Scaling, at-least-once
-
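Kinesis Client Library
[Figure: Stream (Shard-0, Shard-1) -> Kinesis app (KCL) on Instance A -> DynamoDB. Data Records (12345), (24680), (98765). DynamoDB table (Key, Attribute): Shard-0: Instance A, 12345; Shard-1: Instance A, 98765]
1. The Kinesis Client Library reads Data Records from the Shard
2. It stores the sequence number and the ID in DynamoDB
3. State is kept per Shard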
-
Kinesis Client Library
[Figure: Stream (Shard-0, Shard-1) read by Kinesis apps (KCL) on Instance A and Instance B. Data Records (12345), (24680), (98765). DynamoDB table (Key, Attribute): Shard-0: Instance A, 12345; Shard-1: Instance B, 98765]
-
Kinesis Client Library
[Figure: Stream (Shard-0, Shard-1) read by Kinesis apps (KCL) on Instance A and Instance B. Data Records (12345), (24680), (98765). DynamoDB table (Key, Attribute): Shard-0: Instance A -> Instance B, 12345; Shard-1: Instance B, 98765]
When Instance A stops, Instance B takes over its Shard using the state kept in DynamoDB.
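The checkpoint-and-takeover behaviour in these figures can be sketched with an in-memory table. This is a simplified model, not the KCL's actual implementation: `LeaseTable` and `process` are hypothetical names, and the real KCL also renews leases and checkpoints at intervals rather than after every record.

```python
class LeaseTable(object):
    """In-memory stand-in for the DynamoDB table a KCL application uses."""

    def __init__(self):
        self.rows = {}   # shard_id -> {"owner": ..., "checkpoint": ...}

    def take(self, shard_id, worker):
        """Assign the shard to `worker` and return the stored checkpoint."""
        row = self.rows.setdefault(shard_id, {"owner": None, "checkpoint": None})
        row["owner"] = worker
        return row["checkpoint"]

    def checkpoint(self, shard_id, worker, sequence_number):
        row = self.rows[shard_id]
        if row["owner"] != worker:
            raise RuntimeError("%s no longer owns %s" % (worker, shard_id))
        row["checkpoint"] = sequence_number


def process(table, shard_id, worker, records):
    """Process the records after the stored checkpoint, checkpointing each."""
    resume_after = table.take(shard_id, worker)
    done = []
    for seq in records:
        if resume_after is not None and seq <= resume_after:
            continue              # already handled by the previous owner
        done.append(seq)
        table.checkpoint(shard_id, worker, seq)
    return done


if __name__ == "__main__":
    table = LeaseTable()
    print(process(table, "Shard-0", "Instance A", [12345]))
    # Instance A stops; Instance B takes over and resumes after 12345.
    print(process(table, "Shard-0", "Instance B", [12345, 24680]))
    print(table.rows)
```

Because progress lives in the table rather than in the worker, any instance can resume a shard. Records processed after the last checkpoint may be seen again, which is where at-least-once comes from.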
-
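Kinesis Client Library
[Figure: Stream with Shard-0 and a new Shard-1, read by the Kinesis app (KCL) on Instance A. Data Records (12345), (24680), (98765). DynamoDB table (Key, Attribute): Shard-0: Instance A, 12345; Shard-1: Instance A, 98765 (New)]
When Shard-1 is added, a Shard-1 entry is added to DynamoDB, and Shard-1's Data Record (98765) is tracked there.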
-
Kinesis
[Figure: two KCL applications on Instance A read the same Kinesis stream, records (12345), (98765), (24680), and each keeps its own table in DynamoDB, labelled Archive Table and Calc Table. One table: Shard-0: Instance A, 12345; Shard-1: Instance A, 98765. The other table: Shard-0: Instance A, 24680; Shard-1: Instance A, 98765]
-
Kinesis Client Library (KCL) for Python
KCL for Python runs on top of the MultiLangDaemon of KCL for Java, which drives the Python process.
The MultiLangDaemon and the Python process exchange data over STDIN/STDOUT.
-
Kinesis Client Library (KCL) for Python
[Figure: the MultiLangDaemon in KCL (Java) runs one worker thread per shard (Shard-0, Shard-1). Each worker thread talks to its own Python logic process over STDIN/STDOUT]
-
KCL for Python
#!/usr/bin/env python
from amazon_kclpy import kcl
import json, base64

class RecordProcessor(kcl.RecordProcessorBase):
    def initialize(self, shard_id):
        # Called once before any records arrive for this shard.
        pass

    def process_records(self, records, checkpointer):
        # Called with each batch of records read from the shard.
        pass

    def shutdown(self, checkpointer, reason):
        # Called when processing of this shard ends.
        pass

if __name__ == "__main__":
    kclprocess = kcl.KCLProcess(RecordProcessor())
    kclprocess.run()
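A sketch of how the three methods might be filled in. It assumes the record format of the amazon_kclpy 1.x interface shown in this deck, where each record is a dict with a base64-encoded "data" field plus "partitionKey" and "sequenceNumber", and it assumes the payloads are JSON. The `JsonRecordProcessor` class is hypothetical. The import fallback only exists so the sketch runs where amazon_kclpy is not installed.

```python
import base64
import json

try:
    from amazon_kclpy import kcl
    Base = kcl.RecordProcessorBase
except (ImportError, AttributeError):   # run without amazon_kclpy installed
    kcl = None
    Base = object


class JsonRecordProcessor(Base):
    def initialize(self, shard_id):
        self.shard_id = shard_id
        self.seen = []

    def process_records(self, records, checkpointer):
        last_seq = None
        for record in records:
            # The payload arrives base64-encoded; here it is assumed to be JSON.
            payload = base64.b64decode(record["data"]).decode("utf-8")
            self.seen.append(json.loads(payload))
            last_seq = record["sequenceNumber"]
        if last_seq is not None:
            # Record progress once the whole batch has been handled.
            checkpointer.checkpoint(last_seq)

    def shutdown(self, checkpointer, reason):
        if reason == "TERMINATE":
            # The shard is finished; checkpoint so it is not processed again.
            checkpointer.checkpoint()
```

The processor is started the same way as the skeleton above, by passing an instance to kcl.KCLProcess and calling run(). Checkpointing once per batch trades a little reprocessing after a failure for fewer writes to DynamoDB.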
-
KCL for Python
KCL for Python: https://github.com/awslabs/amazon-kinesis-client-python/blob/master/amazon_kclpy/kcl.py
KCL for Java: https://github.com/awslabs/amazon-kinesis-client/tree/master/src/main/java/com/amazonaws/services/kinesis/multilang
-
Multi Language Protocol
Action          Parameter
initialize      "shardId" : "string"
processRecords  [{ "data" : "base64encoded_string", "partitionKey" : "partition key", "sequenceNumber" : "sequence number" }] // a list of records
checkpoint      "checkpoint" : "sequence number", "error" : "NameOfException"
shutdown        "reason" : "TERMINATE|ZOMBIE"
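A sketch of handling these messages on the Python side. It assumes each message is one line of JSON whose "action" field names the action and whose record list sits under a "records" field. That envelope is an assumption about the wire format, not something the table spells out, and both function names are hypothetical.

```python
import base64
import json


def handle_message(line):
    """Turn one daemon-to-processor message into (action, payload)."""
    message = json.loads(line)
    action = message["action"]
    if action == "initialize":
        return action, message["shardId"]
    if action == "processRecords":
        decoded = [(r["sequenceNumber"], r["partitionKey"],
                    base64.b64decode(r["data"]))
                   for r in message["records"]]
        return action, decoded
    if action == "shutdown":
        return action, message["reason"]
    raise ValueError("unexpected action: %s" % action)


def checkpoint_message(sequence_number):
    """Build the processor-to-daemon request that records progress."""
    return json.dumps({"action": "checkpoint", "checkpoint": sequence_number})


if __name__ == "__main__":
    line = json.dumps({"action": "initialize", "shardId": "shardId-000000000000"})
    print(handle_message(line))
    print(checkpoint_message("14"))
```

In normal use amazon_kclpy performs this parsing itself; the sketch only shows what travels over STDIN/STDOUT.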
-
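KCL for Python: configuration properties
failoverTimeMillis: time after which a Worker that has not renewed its lease is treated as failed and its Shards move to another Worker. Shorter values use more DynamoDB PIOPS.
maxRecords: maximum number of records fetched in one call
idleTimeBetweenReadsInMillis: idle time between reads from the stream
callProcessRecordsEvenForEmptyRecordList: whether to call process_records even when no records were returned. True or False.
parentShardPollIntervalMillis: interval for checking whether the parent Shard has been fully processed. Shorter values use more DynamoDB PIOPS.
cleanupLeasesUponShardCompletion: whether to clean up leases once a shard has been fully processed
taskBackoffTimeMillis: backoff time when a KCL task fails
metricsBufferTimeMillis: how long metrics are buffered before the CloudWatch API is called
metricsMaxQueueSize: how many metrics are buffered before the CloudWatch API is called
validateSequenceNumberBeforeCheckpointing: whether sequence numbers are validated before checkpointing
maxActiveThreads: maximum number of active threads in the MultiLangDaemon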
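As an illustration of how such settings end up in the properties file passed to the MultiLangDaemon, the sketch below renders a few of them as "key = value" lines. The values are arbitrary examples rather than recommendations, and `render_properties` is a hypothetical helper.

```python
# Example values only; tune them for the application.
SETTINGS = [
    ("failoverTimeMillis", 10000),
    ("maxRecords", 1000),
    ("idleTimeBetweenReadsInMillis", 1000),
    ("callProcessRecordsEvenForEmptyRecordList", False),
]


def render_properties(settings):
    """Render (key, value) pairs in Java .properties syntax."""
    lines = []
    for key, value in settings:
        if isinstance(value, bool):
            value = "true" if value else "false"   # Java-style booleans
        lines.append("%s = %s" % (key, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_properties(SETTINGS))
```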
-
KCL for Python
[ec2-user@ip-172-31-17-43 samples]$ amazon_kclpy_helper.py --print_command -j /usr/bin/java -p /home/ec2-user/amazon-kinesis-client-python/samples/sample.properties
/usr/bin/java -cp /usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/amazon-kinesis-client-1.2.0.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/jackson-annotations-2.1.1.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/commons-codec-1.3.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/commons-logging-1.1.1.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/joda-time-2.4.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/jackson-databind-2.1.1.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/jackson-core-2.1.1.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/aws-java-sdk-1.7.13.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/httpclient-4.2.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/httpcore-4.2.jar:/home/ec2-user/amazon-kinesis-client-python/samples com.amazonaws.services.kinesis.multilang.MultiLangDaemon sample.properties
KCL
-
Kinesis
[Figure: applications A and B read the same Kinesis stream: SeqNo(14), SeqNo(17), SeqNo(25), SeqNo(26), SeqNo(32)]
-
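Kinesis application patterns
Simple ETL: ingest with Kinesis, then load into S3, DynamoDB, or Redshift
ETL/MapReduce: ingest with Kinesis, then process with Hadoop, Spark, or Storm
Filter: filter the Kinesis stream ahead of further processing or MapReduce
AWS Lambda: process the stream with AWS Lambda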
-
[Figure: a KCL application feeding a Dashboard, Redshift, and DynamoDB]
-
Simple ETL: Kinesis to DynamoDB, Redshift, or S3
Connector Library: https://github.com/awslabs/amazon-kinesis-connectors
[Figure: Kinesis -> Kinesis Connector -> S3 and Redshift]
Kinesis Connector stages: Transformer, Filter, Buffer, Emitter
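The four stages can be sketched as a small pipeline. The Connector Library itself is Java; the Python below only illustrates the idea, and the `Pipeline` class, its parameters, and the sample records are hypothetical.

```python
import json


class Pipeline(object):
    """Transformer -> Filter -> Buffer -> Emitter, one record at a time."""

    def __init__(self, transform, keep, emit, buffer_size=3):
        self.transform = transform
        self.keep = keep
        self.emit = emit
        self.buffer_size = buffer_size
        self.buffer = []

    def consume(self, raw_record):
        item = self.transform(raw_record)      # Transformer: raw -> model object
        if not self.keep(item):                # Filter: drop unwanted records
            return
        self.buffer.append(item)               # Buffer: collect a batch
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.emit(list(self.buffer))       # Emitter: write the batch out
            self.buffer = []


if __name__ == "__main__":
    batches = []
    pipeline = Pipeline(json.loads, lambda e: e["ok"], batches.append,
                        buffer_size=2)
    for raw in ('{"ok": true, "n": 1}', '{"ok": false, "n": 2}',
                '{"ok": true, "n": 3}'):
        pipeline.consume(raw)
    pipeline.flush()
    print(batches)
```

Buffering before emitting matters for destinations such as S3 and Redshift, where a few large writes are far cheaper than many small ones.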
-
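ETL/MapReduce (1): Hadoop, Spark
Kinesis data can be processed as ETL or MapReduce jobs with Hadoop tools such as Hive and Pig.
Kinesis Stream, S3, DynamoDB, and HDFS can be handled as Hive tables and JOINed.
Run periodically against Kinesis with Data Pipeline or crontab.
Kinesis is supported from EMR AMI 3.0.4.
[Figure: Kinesis -> EMR Cluster -> S3, scheduled by Data Pipeline. Data Pipeline runs Hive to move data from Kinesis to S3]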
-
ETL/MapReduce (2): Apache Storm
A Kinesis Spout for Apache Storm passes records on to Bolts.
https://github.com/awslabs/kinesis-storm-spout
[Figure: Data Sources -> Kinesis -> Storm Spout -> Storm Bolts]
-
Filter: Kinesis -> Filter -> Kinesis, ahead of MapReduce or other processing
[Figure: Data Sources -> Kinesis -> Kinesis Apps (Filter Layer) -> Kinesis -> Kinesis Apps (Process Layer)]
-
Apache Spark, Apache Storm
[Figure: Data Sources -> Kinesis -> Jubatus -> Dashboard]
-
AWS Lambda: Lambda Function
[Figure: Data Sources -> Kinesis -> AWS Lambda -> Redshift, S3]
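A sketch of what a Lambda function for a Kinesis stream might look like. It is written in Python to match the rest of this deck, which is an assumption about the runtime rather than something the slide states. The event layout follows the Kinesis event that Lambda delivers, with each record's payload base64-encoded under "kinesis" and "data"; the JSON payload and the returned structure are made up for the example.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record in the batch and return the parsed items."""
    items = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # Payloads are assumed to be JSON documents.
        payload = base64.b64decode(kinesis["data"]).decode("utf-8")
        items.append({
            "partition_key": kinesis["partitionKey"],
            "sequence_number": kinesis["sequenceNumber"],
            "body": json.loads(payload),
        })
    # A real function would write `items` to S3 or Redshift here.
    return items


if __name__ == "__main__":
    sample = {"Records": [{"kinesis": {
        "partitionKey": "device-1",
        "sequenceNumber": "14",
        "data": base64.b64encode(b'{"temp": 21.5}').decode("ascii"),
    }}]}
    print(handler(sample, None))
```

With Lambda there are no workers or checkpoints to manage; the service polls the shards and invokes the function with each batch.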
-
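Kinesis + EC2
[Figure: a client (iPhone) connects over HTTP/WS to EC2, which calls Put Record on Kinesis. Another EC2 tier calls Get Records and serves the results over HTTP/WS. Jubatus is part of the setup]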
-
IoT
AWS