PyData: An Introduction to Amazon Kinesis

An Introduction to Amazon Kinesis and Where to Use It. Amazon Data Services Japan K.K., Partner Solutions Architect, 榎並 晃

Transcript of PyData: An Introduction to Amazon Kinesis

  • Amazon Kinesis

  • @ToshiakiEnami

    AWS Amazon Kinesis Amazon DynamoDB

  • AWS

    S3

    Process Submissions

    Store Batches

    Process Hourly w/ Hadoop

    Clients Submitting Data

    Data Warehouse

    100ETL Job

    100

    , keep everything

  • Ingest

    Client/Sensor

    Ingest, Processing, Storage, Analytics + Visualization + Reporting

  • Ingest Layer

    Processing

    Kafka

    Or Kinesis

    Processing

    Kinesis

  • Kinesis

  • Amazon Kinesis

    Kinesis1AZ

  • POS

  • Kinesis

    Kinesis Client Library + Connector Library

    HTTPS Post

    AWS SDK

    LOG4J

    Flume

    Fluentd

    Get* APIs

    Apache Storm

    Amazon Elastic MapReduce

    MobileSDK & Cognito

  • Kinesis

    Data Sources

    App.4

    [Machine Learning]

    App.1

    [Aggregate & De-Duplicate]

    Data Sources

    Data Sources

    Data Sources

    App.2

    [Metric Extraction]

    S3

    DynamoDB

    Redshift

    App.3

    [Real-time Dashboard]

    Data Sources

    Availability Zone

    Shard 1, Shard 2, ... Shard N

    Availability Zone

    Availability Zone

    Kinesis

    AWS Endpoint

    Stream: made up of one or more Shards

    Shard: writes up to 1 MB/sec and 1000 TPS; reads up to 2 MB/sec and 5 TPS

    Data Record: retained for 24 hours and stored across multiple AZs

    Stream

  • Kinesis &

    $0.0195/shard/hour

    Put: $0.043 per 1,000,000 Puts

    $14 Get EC2

  • PutRecord API http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecord.html

    AWS SDK for Java, Javascript, Python, Ruby, PHP, .Net

    boto: put_record

    http://docs.pythonboto.org/en/latest/ref/kinesis.html#module-boto.kinesis.layer1
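    A minimal sketch of a producer built on a boto-style put_record(stream_name, data, partition_key) call. The helper put_event, the stream name, the event fields and the FakeKinesis stand-in are all assumptions made for illustration; with real credentials the connection would presumably come from boto.kinesis.connect_to_region(...) instead of the fake.

```python
import json


def put_event(conn, stream_name, event, key_field="device_id"):
    """Serialize one event as JSON and put it on the stream.

    `conn` is any object with a boto-style
    put_record(stream_name, data, partition_key) method.
    `key_field` (hypothetical) names the event field used as partition key.
    """
    data = json.dumps(event, sort_keys=True)
    partition_key = str(event[key_field])
    return conn.put_record(stream_name, data, partition_key)


class FakeKinesis(object):
    """Offline stand-in for the connection (demo only, not part of boto)."""

    def __init__(self):
        self.records = []

    def put_record(self, stream_name, data, partition_key):
        self.records.append((stream_name, data, partition_key))
        # The real API answers with the shard and sequence number assigned.
        return {"ShardId": "shardId-000000000000",
                "SequenceNumber": str(len(self.records))}


fake = FakeKinesis()
response = put_event(fake, "example-stream",
                     {"device_id": "pos-001", "amount": 1200})
print(response["ShardId"], response["SequenceNumber"])
```

    Using a stable field such as a device ID as the partition key keeps all records from one source on the same Shard, which preserves their order.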

  • DataRecord

    Shard Shard

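    The MD5 hash of the partition key determines the Shard

    Hash key range: 0 to 2^128

    Shard-1: 2^127 to 2^128

    MD5(partition key)

    Shard-0: 0 to 2^127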
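    The mapping from partition key to Shard can be sketched as follows. This assumes the hash key space is split evenly between Shards, as in the two-Shard picture on this slide; a real stream reports each Shard's hash key range itself, so even_shard_ranges is only an illustrative helper.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def hash_key(partition_key):
    """Return the MD5 hash of the partition key as an integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16)


def even_shard_ranges(shard_count):
    """Split 0 .. 2**128 - 1 into equal contiguous ranges (assumption)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append(("Shard-%d" % i, start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Find the shard whose hash key range contains the key's hash."""
    value = hash_key(partition_key)
    for name, start, end in ranges:
        if start <= value <= end:
            return name
    raise ValueError("hash value outside every range")


ranges = even_shard_ranges(2)
for key in ("pos-001", "pos-002", "pos-003"):
    print(key, "->", shard_for(key, ranges))
```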

  • shard

    KinesisStream 24

    SeqNo(14)

    SeqNo(17)

    SeqNo(25)

    SeqNo(26)

    SeqNo(32)

  • Web

  • Fluentd Plugin Web

    GitHub Plugin: https://github.com/awslabs/aws-fluent-plugin-kinesis

    Log4J JavaLog4J

    http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/kinesis-pig-publisher.html

    Web

    # KINESIS appender
    log4j.logger.KinesisLogger=INFO, KINESIS
    log4j.additivity.KinesisLogger=false
    log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender
    log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
    log4j.appender.KINESIS.layout.ConversionPattern=%m

    log4j.properties

  • MQTT Broker Kinesis-MQTT Bridge

    (MQTT) → MQTT Broker → MQTT-Kinesis Bridge → Kinesis

    GitHub: MQTT-Kinesis Bridge

    https://github.com/awslabs/mqtt-kinesis-bridge

    MQTT Broker Kinesis-MQTT Bridge

    Auto scaling Group

  • Cognito, Mobile SDK, Kinesis

    App w/SDK

    End Users

    Login OAUTH/OpenID Access Token

    Cognito ID, Temp

    Credentials

    Access Token Pool ID

    Role ARNs

    Put Record

    Identity pool

    Identity Providers

    Access Policy identitypool Unauthenticated

    Identities

    authenticated identities AWS

    Account

    Amazon Cognito - ID

  • GetShardIterator API: get a position in the Shard; GetRecords API: read Data Records from that position

    http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html

    AWS SDK for Java, Javascript, Python, Ruby, PHP, .Net

    boto: get_shard_iterator, get_records

    http://docs.pythonboto.org/en/latest/ref/kinesis.html#module-boto.kinesis.layer1
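    A rough sketch of the read side: obtain a shard iterator, then call get_records repeatedly, following NextShardIterator. The function read_shard, its max_calls bound and the FakeKinesis stand-in are assumptions for illustration. Against a live stream the connection would presumably come from boto, the loop would pause between calls to stay under the per-Shard read limit, and an open Shard keeps returning a next iterator, which is why the loop is bounded here.

```python
def read_shard(conn, stream_name, shard_id,
               iterator_type="TRIM_HORIZON", max_calls=10):
    """Read records from one shard using boto-style calls on `conn`."""
    result = conn.get_shard_iterator(stream_name, shard_id, iterator_type)
    iterator = result["ShardIterator"]
    records = []
    for _ in range(max_calls):
        if iterator is None:
            break
        page = conn.get_records(iterator, limit=100)
        records.extend(page["Records"])
        iterator = page.get("NextShardIterator")
    return records


class FakeKinesis(object):
    """Offline stand-in that serves a fixed list of records (demo only)."""

    def __init__(self, records, page_size=2):
        self._records = records
        self._page_size = page_size

    def get_shard_iterator(self, stream_name, shard_id, shard_iterator_type):
        return {"ShardIterator": "0"}

    def get_records(self, shard_iterator, limit=None):
        start = int(shard_iterator)
        size = min(self._page_size, limit) if limit else self._page_size
        end = start + size
        # The fake signals "nothing left" with None so the demo terminates.
        next_iterator = str(end) if end < len(self._records) else None
        return {"Records": self._records[start:end],
                "NextShardIterator": next_iterator}


stored = [{"SequenceNumber": str(n), "Data": "payload-%d" % n}
          for n in (14, 17, 25, 26, 32)]
fetched = read_shard(FakeKinesis(stored), "example-stream",
                     "shardId-000000000000")
print([r["SequenceNumber"] for r in fetched])
```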

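  • GetShardIterator API: ShardIteratorType

    ShardIteratorType

    AT_SEQUENCE_NUMBER (start reading at the specified sequence number)
    AFTER_SEQUENCE_NUMBER (start reading right after the specified sequence number)
    TRIM_HORIZON (start reading at the oldest data record in the Shard)
    LATEST (start reading just after the most recent data record)

    Seq: xxx

    LATEST

    AT_SEQUENCE_NUMBER, AFTER_SEQUENCE_NUMBER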

    TRIM_HORIZON

    GetShardIterator
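    To make the four ShardIteratorType values concrete, here is a small offline model that treats a Shard as a sorted list of sequence numbers (the SeqNo values are borrowed from the earlier slide). The function records_from is a hypothetical helper, not part of any SDK; it only illustrates where reading would begin.

```python
def records_from(seq_numbers, iterator_type, starting_seq=None):
    """Return the sequence numbers a reader would see, in order.

    `seq_numbers` is the ascending list of records still in the shard.
    """
    if iterator_type == "TRIM_HORIZON":
        return list(seq_numbers)          # oldest record onward
    if iterator_type == "LATEST":
        return []                         # only records put after this point
    if iterator_type in ("AT_SEQUENCE_NUMBER", "AFTER_SEQUENCE_NUMBER"):
        if starting_seq is None:
            raise ValueError("starting_seq is required for " + iterator_type)
        index = seq_numbers.index(starting_seq)
        if iterator_type == "AFTER_SEQUENCE_NUMBER":
            index += 1                    # skip the named record itself
        return list(seq_numbers[index:])
    raise ValueError("unknown iterator type: " + iterator_type)


shard = [14, 17, 25, 26, 32]
print(records_from(shard, "AT_SEQUENCE_NUMBER", 25))
print(records_from(shard, "AFTER_SEQUENCE_NUMBER", 25))
print(records_from(shard, "TRIM_HORIZON"))
print(records_from(shard, "LATEST"))
```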

  • Shard 1

    Shard 2

    Shard 3

    Shard n

    Shard 4

    KCL Worker 1

    KCL Worker 2

    EC2 Instance

    KCL Worker 3

    KCL Worker 4

    EC2 Instance

    KCL Worker n

    EC2 Instance

    Kinesis

    Kinesis Client Library (KCL): client library for fault-tolerant, at-least-once, continuous processing

    ShardWorker Worker Worker worker AutoScaling At least once

  • Kinesis Client Library

    Stream

    Shard-0

    Shard-1

    Kinesis

    (KCL)

    Instance A 12345

    Instance A 98765

    Data Record(12345)

    Data Record(24680)

    Data Record(98765)

    DynamoDB

    Instance A

    1. Kinesis Client Library, Shard, Data Record

    2. ID, DynamoDB

    3. Shard

    Key, Attribute

  • Kinesis Client Library

    Stream

    Shard-0

    Shard-1

    Kinesis

    (KCL)

    Instance A 12345

    Instance B 98765

    Data Record(12345)

    Data Record(24680)

    Data Record(98765)

    DynamoDB

    Instance A

    Kinesis

    (KCL)

    Instance B

    1.

    Key, Attribute

  • Kinesis Client Library

    Stream

    Shard-0

    Shard-1

    Kinesis

    (KCL)

    Instance A, Instance B

    12345

    Instance B 98765

    Data Record(12345)

    Data Record(24680)

    Data Record(98765)

    DynamoDB

    Instance A

    Kinesis

    (KCL)

    Instance B

    Instance A, Instance B, DynamoDB

    Key, Attribute

  • Kinesis Client Library

    Stream

    Shard-0

    Kinesis

    (KCL)

    Shard

    Shard-0 Instance A 12345

    Shard-1 Instance A 98765

    Data Record(12345)

    Data Record(24680)

    DynamoDB

    Instance A

    Shard-1, Shard-1, DynamoDB

    Shard-1: Data Record(98765)

    New

    Key, Attribute
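    The checkpoint table on these slides (Shard, owning instance, last processed sequence number) can be modeled in a few lines. This is a toy model under stated assumptions, not KCL's actual lease algorithm: the table is a plain dict instead of DynamoDB, and the function names checkpoint, reassign and pending are invented for illustration.

```python
def checkpoint(table, shard_id, worker, seq):
    """Record progress, but only if `worker` still owns the shard."""
    lease = table[shard_id]
    if lease["owner"] != worker:
        raise RuntimeError("%s no longer owns %s" % (worker, shard_id))
    lease["checkpoint"] = seq


def reassign(table, from_worker, to_worker, shard_ids=None):
    """Move shards (all, or only `shard_ids`) from one worker to another."""
    moved = []
    for shard_id in sorted(table):
        lease = table[shard_id]
        wanted = shard_ids is None or shard_id in shard_ids
        if lease["owner"] == from_worker and wanted:
            lease["owner"] = to_worker
            moved.append(shard_id)
    return moved


def pending(seq_numbers, lease):
    """Records a new owner still has to process after the checkpoint."""
    return [seq for seq in seq_numbers if seq > lease["checkpoint"]]


# Instance A starts out owning both shards, as on the first slide.
table = {
    "Shard-0": {"owner": "Instance A", "checkpoint": 12345},
    "Shard-1": {"owner": "Instance A", "checkpoint": 98765},
}

# Instance B joins and takes over Shard-1.
moved_first = reassign(table, "Instance A", "Instance B", ["Shard-1"])

# Instance A fails; Instance B picks up its remaining shard.
moved_after_failure = reassign(table, "Instance A", "Instance B")

# Instance B resumes Shard-0 after the stored checkpoint.
todo = pending([12345, 24680], table["Shard-0"])
checkpoint(table, "Shard-0", "Instance B", todo[-1])
print(moved_first, moved_after_failure, todo, table["Shard-0"])
```

    Because the new owner restarts from the last stored checkpoint, a record processed but not yet checkpointed by the failed instance is processed again, which is what at-least-once delivery means in practice.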

  • Kinesis

    (12345)

    (98765)

    (24680)

    (12345)

    (98765)

    (24680)

    (KCL)

    DynamoDB

    Instance A

    Shard

    Shard-0 Instance A 12345

    Shard-1 Instance A 98765

    (KCL)

    Instance A

    Shard

    Shard-0 Instance A 24680

    Shard-1 Instance A 98765

    Archive Table

    Calc Table

  • Kinesis Client Library (KCL) for Python

    KCL for Python uses the MultiLangDaemon of KCL for Java to run the Python processing logic

    MultiLangDaemon

    STDIN/STDOUT

  • Kinesis Client Library (KCL) for Python

    KCL for Python, KCL for Java, MultiLangDaemon

    Python

    MultiLangDaemon

    STDIN/STDOUT

    KCL(Java)

    Shard-0

    Shard-1 Worker Thread

    Worker Thread Python Logic Process

    Python Logic Process

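  • KCL for Python

    #!/usr/bin/env python
    from amazon_kclpy import kcl
    import json, base64

    class RecordProcessor(kcl.RecordProcessorBase):
        def initialize(self, shard_id):
            # called once before any records of shard_id are delivered
            pass

        def process_records(self, records, checkpointer):
            # called with each batch of records read from the shard
            pass

        def shutdown(self, checkpointer, reason):
            # called when the shard ends (TERMINATE) or the lease is lost (ZOMBIE)
            pass

    if __name__ == "__main__":
        kclprocess = kcl.KCLProcess(RecordProcessor())
        kclprocess.run()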
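    As a sketch of what the three callbacks might contain, the class below decodes each record's base64 data as JSON, keeps a running total per partition key, and checkpoints the last sequence number of the batch. It deliberately does not import amazon_kclpy so that it runs offline; in a real application it would presumably subclass kcl.RecordProcessorBase as in the skeleton on this slide. The "amount" field, JsonRecordProcessor and FakeCheckpointer are assumptions, and checkpoint retry handling is omitted.

```python
import base64
import json


class JsonRecordProcessor(object):
    """Same callback names as the slide's skeleton (standalone for demo)."""

    def __init__(self):
        self.shard_id = None
        self.totals = {}

    def initialize(self, shard_id):
        self.shard_id = shard_id

    def process_records(self, records, checkpointer):
        last_seq = None
        for record in records:
            # Record data arrives base64-encoded.
            raw = base64.b64decode(record["data"]).decode("utf-8")
            payload = json.loads(raw)
            key = record["partitionKey"]
            self.totals[key] = self.totals.get(key, 0) + payload["amount"]
            last_seq = record["sequenceNumber"]
        if last_seq is not None:
            checkpointer.checkpoint(last_seq)

    def shutdown(self, checkpointer, reason):
        # Checkpoint on TERMINATE only; on ZOMBIE the lease is already lost.
        if reason == "TERMINATE":
            checkpointer.checkpoint()


class FakeCheckpointer(object):
    """Records checkpoint calls instead of talking to the daemon."""

    def __init__(self):
        self.calls = []

    def checkpoint(self, sequence_number=None):
        self.calls.append(sequence_number)


def encode(amount):
    return base64.b64encode(json.dumps({"amount": amount}).encode("utf-8")).decode("ascii")


processor = JsonRecordProcessor()
checkpointer = FakeCheckpointer()
processor.initialize("Shard-0")
processor.process_records([
    {"data": encode(1200), "partitionKey": "pos-001", "sequenceNumber": "12345"},
    {"data": encode(300), "partitionKey": "pos-001", "sequenceNumber": "24680"},
], checkpointer)
processor.shutdown(checkpointer, "TERMINATE")
print(processor.totals, checkpointer.calls)
```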

  • KCL for Python

    https://github.com/awslabs/amazon-kinesis-client-python/blob/master/amazon_kclpy/kcl.py

    https://github.com/awslabs/amazon-kinesis-client/tree/master/src/main/java/com/amazonaws/services/kinesis/multilang

    KCL for Python

    KCL for Java

  • Multi Language Protocol

    Action          Parameter
    initialize      "shardId" : "string"
    processRecords  [{ "data" : "base64encoded_string", "partitionKey" : "partition key", "sequenceNumber" : "sequence number" }] // a list of records
    checkpoint      "checkpoint" : "sequence number", "error" : "NameOfException"
    shutdown        "reason" : "TERMINATE|ZOMBIE"
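    The actions in this table travel as one JSON message per line over STDIN/STDOUT. A minimal dispatcher might look like the sketch below. Two things here are assumptions not shown on the slide: that each message carries its action name in an "action" field, and that the child acknowledges each message with a status reply of the form {"action": "status", "responseFor": ...}. The checkpoint exchange is left out, and handle_line and Recorder are hypothetical names.

```python
import json


def handle_line(line, processor):
    """Handle one JSON message from the daemon and return the reply line."""
    message = json.loads(line)
    action = message["action"]
    if action == "initialize":
        processor.initialize(message["shardId"])
    elif action == "processRecords":
        processor.process_records(message["records"])
    elif action == "shutdown":
        processor.shutdown(message["reason"])
    else:
        raise ValueError("unknown action: %r" % action)
    # Assumed acknowledgement format (not shown on the slide).
    return json.dumps({"action": "status", "responseFor": action})


class Recorder(object):
    """Demo processor that just remembers what it was told."""

    def __init__(self):
        self.shard_id = None
        self.seen = []
        self.reason = None

    def initialize(self, shard_id):
        self.shard_id = shard_id

    def process_records(self, records):
        self.seen.extend(r["sequenceNumber"] for r in records)

    def shutdown(self, reason):
        self.reason = reason


lines = [
    '{"action": "initialize", "shardId": "shardId-000000000000"}',
    '{"action": "processRecords", "records": [{"data": "aGVsbG8=", '
    '"partitionKey": "pos-001", "sequenceNumber": "100"}]}',
    '{"action": "shutdown", "reason": "TERMINATE"}',
]
recorder = Recorder()
replies = [handle_line(line, recorder) for line in lines]
for reply in replies:
    print(reply)
```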

  • KCL for Python

    failoverTimeMillis Worker, Worker, DynamoDB PIOPS

    maxRecords 1

    idleTimeBetweenReadsInMillis

    callProcessRecordsEvenForEmptyRecordList

    True or False

    parentShardPollIntervalMillis Shard, DynamoDB PIOPS

    cleanupLeasesUponShardCompletion shard

    taskBackoffTimeMillis KCL

    metricsBufferTimeMillis CloudWatch API

    metricsMaxQueueSize CloudWatch API

    validateSequenceNumberBeforeCheckpointing

    Checkpointing

    maxActiveThreads MultiLangDaemon

  • KCL for Python

    [ec2-user@ip-172-31-17-43 samples]$ amazon_kclpy_helper.py --print_command -j /usr/bin/java -p /home/ec2-user/amazon-kinesis-client-python/samples/sample.properties

    /usr/bin/java -cp /usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/amazon-kinesis-client-1.2.0.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/jackson-annotations-2.1.1.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/commons-codec-1.3.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/commons-logging-1.1.1.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/joda-time-2.4.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/jackson-databind-2.1.1.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/jackson-core-2.1.1.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/aws-java-sdk-1.7.13.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/httpclient-4.2.jar:/usr/lib/python2.6/site-packages/amazon_kclpy-1.0.0-py2.6.egg/amazon_kclpy/jars/httpcore-4.2.jar:/home/ec2-user/amazon-kinesis-client-python/samples com.amazonaws.services.kinesis.multilang.MultiLangDaemon sample.properties

    KCL

  • Kinesis

    A

    B

    SeqNo(14)

    SeqNo(17)

    SeqNo(25)

    SeqNo(26)

    SeqNo(32)

  • Kinesis

    Simple ETL: Kinesis, Ingest, S3, DynamoDB, Redshift

    ETL/MapReduce: Kinesis, Ingest, Hadoop, Spark, Storm, ETL

    Filter: Kinesis, Filtering/MapReduce

    AWS Lambda AWS Lambda

  • KCL

    Dashboard

    Redshift

    DynamoDB

  • Simple ETL: DynamoDB, Redshift, S3, Kinesis

    Connector Library: https://github.com/awslabs/amazon-kinesis-connectors

    Redshift

    S3

    Redshift

    S3

    Transformer, Filter, Buffer, Emitter

    Kinesis Connector
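    The four connector stages named on this slide (Transformer, Filter, Buffer, Emitter) can be illustrated with a short pipeline. The Connector Library itself is a Java library; the Python class below is only a conceptual sketch with invented names (Pipeline, keep, buffer_limit), and the emitter simply appends batches to a list where a real one would write to S3, DynamoDB or Redshift.

```python
import json


class Pipeline(object):
    """Transform each record, filter it, buffer it, emit full batches."""

    def __init__(self, transform, keep, buffer_limit, emit):
        self.transform = transform        # raw record -> item
        self.keep = keep                  # item -> True to keep it
        self.buffer_limit = buffer_limit  # flush once this many are buffered
        self.emit = emit                  # list of items -> destination
        self.buffer = []

    def process(self, raw):
        item = self.transform(raw)
        if not self.keep(item):
            return
        self.buffer.append(item)
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        if self.buffer:
            self.emit(list(self.buffer))
            self.buffer = []


batches = []
pipeline = Pipeline(
    transform=json.loads,
    keep=lambda item: item["amount"] > 0,
    buffer_limit=2,
    emit=batches.append,
)
for raw in ['{"sku": "a", "amount": 100}',
            '{"sku": "b", "amount": 0}',
            '{"sku": "c", "amount": 250}',
            '{"sku": "d", "amount": 30}']:
    pipeline.process(raw)
pipeline.flush()  # emit whatever is still buffered at the end
print([[item["sku"] for item in batch] for batch in batches])
```

    Buffering before emitting matters for destinations such as S3 or Redshift, where a few large writes are usually cheaper than many small ones.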

  • ETL/MapReduce 1: Hadoop, Spark; Kinesis, Hive, Pig, Hadoop, ETL, MapReduce

    Kinesis Stream, S3, DynamoDB, HDFS, Hive Table

    JOIN; Data Pipeline / Crontab, Kinesis

    EMR AMI 3.0.4, Kinesis

    EMR Cluster S3

    Data Pipeline

    Data Pipeline, Hive, Kinesis, S3

    Kinesis

  • ETL/MapReduce 2: Apache Storm; Bolt, Kinesis, Apache Storm Spout

    https://github.com/awslabs/kinesis-storm-spout

    Data Sources

    Data Sources

    Data Sources

    Storm Spout

    Storm Bolt

    Storm Bolt

    Storm Bolt

  • Filter: Kinesis, Filter, MapReduce, Kinesis, Kinesis

    Data Sources

    Data Sources

    Data Sources

    Kinesis App

    Kinesis App

    Kinesis App

    Kinesis App

    Filter Layer, Process Layer

  • Apache Spark, Apache Storm

    Data Sources

    Data Sources

    Data Sources

    Jubatus

    Dashboard

    Jubatus

  • AWS Lambda Lambda Function

    Data Sources

    Data Sources

    Data Sources

    AWS Lambda

    Redshift

    S3
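    A Lambda function that consumes a Kinesis stream receives batches of records whose payload is base64-encoded under Records[].kinesis.data. The handler below is a hedged sketch: writing it in Python is a choice made here, not something the slide states, and the "amount" field and the summary it returns are invented for illustration. A real function would write its result on to S3 or Redshift instead of returning it.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record in the batch and summarize the amounts."""
    amounts = []
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        amounts.append(json.loads(raw)["amount"])
    return {"count": len(amounts), "total": sum(amounts)}


def make_record(amount):
    """Build one record shaped like a Kinesis event entry (demo helper)."""
    data = base64.b64encode(json.dumps({"amount": amount}).encode("utf-8"))
    return {"kinesis": {"partitionKey": "pos-001",
                        "data": data.decode("ascii")}}


sample_event = {"Records": [make_record(1200), make_record(300)]}
summary = handler(sample_event, None)
print(summary)
```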

  • Kinesis, EC2

    Jubatus

    (iPhone)

    HTTP/WS

    Put Record

    HTTP/WS Get Records

  • IoT

    AWS