AWS DataLake© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. 2-H1-1-10 /...

Post on 17-Aug-2020

4 views 0 download

Transcript of AWS DataLake© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. 2-H1-1-10 /...

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

2-H1-1-10 / 2-H1-1-19

AWS DataLake

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

お手元のサミットガイドブックの表紙に記載している 『QRコード』 からご回答ください。もれなく素敵なAWSオリジナルグッズをプレゼントします。

本セッションのFeedbackをお願いします

プレゼントの引き換えは、パミール3F展示会場内アンケート確認エリア・受付エリアのいずれかにお越し下さい。

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

名前:上原誠 (うえはらまこと)

所属:ソリューションアーキテクト

担当:メディア系、アドテクのお客様

最近よく触ってるサービス: AWS Glue

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

• / ML

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

1.

2. RDB

3.

4. AWS

5. AWS

6.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

1.

2. RDB

3.

4. AWS

5. AWS

6.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

2

/ ML

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

• データ活用の際には、さまざまなツールが必要となる

• ツールやモデルによって必要なリソースも異なる

• データの取得と保存、加工整形処理を行うためのツール

• 可視化- BIツール / 基礎集計のための SQL

• ディープラーニングフレームワークや Hadoop クラスタ、

Notebook などの分析ツール

• CPU / GPU / メモリ / IO

• 複数インスタンス / クラスタ

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

• データ活用は基本的に試行錯誤を伴う

• データの量や中身の変化への継続的な対応

• ビジネスの状況変化に伴う、さまざまな指標の定期的な見直し

• 機械学習モデルの構築・改善サイクル

• この試行錯誤のサイクルをいかに高速に積み重ねるかが、

良い結果を導きだすために重要

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

1.

2. RDB

3.

4. AWS

5. AWS

6.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

RDB

Databases

Logs

RDBBI

Report

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

RDB

• スキーマが定義されている

• データの型も強制できる

• アクセスするツールが簡単、エコシステムが安定

• トランザクション

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

RDB

Databases

Logs

BI

Report

1:

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

1:

Databases

Logs

RDBBI

Report

Logs

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

1:

Databases

Logs

RDBBI

Report

Events

Media

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

2:

Databases

Logs

RDB

BI

Report

Lab

Realtime

Machine

Learning

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

RDB

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

1.

2. RDB

3.

4. AWS

5. AWS

6.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Hadoop Hadoop

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Data Lake

Data Lake

………

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

• ストレージと計算処理の分離➢ それぞれ独立してスケールできるので最適化しやすい

• Single Source of Truth (SSOT)

➢ データレイクにあるものを正とすれば良い

• 様々な input/output 手法に対応➢ in/out が独立、ETL も独立できるので、後からの拡張がスムーズ

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

( )

RDBMS

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

1.

2. RDB

3.

4. AWS

5. AWS

6.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

これまで:

1. ディスクの容量に上限がある

2. データはサマリーだけ、もしくは期間限定で保存

3. 処理できる内容は固定的

On AWS:

1. 安価・上限無しのストレージ

2. オリジナルデータを全て残す

3. 処理対象・処理内容はビジネスに合わせて変わる

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon S3

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Glacier

AWS Amazon S3

Amazon S3EC2 RDS

Storage Gateway

EBS

Redshift

CloudFront

Elastic Transcoder

Amazon Lex

Amazon PollyAmazon

Rekognition

Amazon Machine

Learning

AWS IoT

QuickSight

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

AWS Amazon S3

S3

• IAM

• CloudTrail S3

• S3

• ( CSE, SSE )

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Fluentd

Logstash

Flume

OSS

OSS on EC2

MQTT HTTPS

AWS IoTAmazon Kinesis

S3

AWS Snowball

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

OSS

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon Kinesis

Kinesis DataStreams

ストリームデータを処理・分析するための

データを格納

Kinesis DataFirehose

ストリームデータをS3, Redshift,

Amazon ES, Splunk に簡単にロード

Kinesis DataAnalytics

ストリーミングデータを標準的な SQL クエリで

簡単に分析

ストリームデータを収集・処理・配信するためのフルマネージドサービス群

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Kinesis Data Firehose AWS Athena

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon S3

Snowball

Fluentd

Logstash

FlumeAWS IoT

Amazon KinesisAmazon

Elasticsearch

Service

AWS DMSOSS

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon RedshiftAmazon EMR

DWH

S3

Amazon Athena

S3

AWS Glue

ETL

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon Redshift

Leader node

Compute nodes

SQL Client / BI Tools

JDBC / ODBC Driver

• MPP(Massively Parallel Processing)

• 2PB

• JDBC/ODBC BI

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Leader node

Compute nodes

SQL Client / BI Tools

JDBC / ODBC Driver

• Redshift S3

• Redshift JOIN

• S3

Spectrum

Redshift

• Redshift S3

Amazon Redshift Spectrum

Redshift S3

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon EMR

Hadoop/Spark

AWS

• Big Data

• Spot

Hadoop

Amazon EMR

Spot

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

EMRFS: S3 HDFS

と指定するだけで と同様に にアクセス

• 計算資源とストレージを分離できる• コスト面でもメリット大

• クラスタの削除が可能• クラスタを消してもデータをロストしない

• 複数クラスタ間でデータ共有が簡単

• データは耐久性の高い に配置

EMR

EMR

Amazon

S3

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon Athena

• サーバーレス検索サービス

• S3 のデータに対して直接クエリできる

• Presto ベースで標準 SQL が実行可能

• 実行したクエリの容量ぶんの従量課金

• スキャンされたデータ1TBあたり5$

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon Athena

• クエリエディタにSQLを記述しクエリを実行

51

S3

• 過去のクエリ結果は History からダウンロード可能

• クエリ結果は S3 に自動保存

• コンソールに結果が表示

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

AWS Glue

• ETL +

• GUI ETL PySpark

Scala

• Athena, Redshift Spectrum,

EMR Spark, Hive

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

• GUI

• CSV→Parquet

Glue : GUI ETL

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

ジョブ=ETL処理を実行する単位

• PySpark, Scala で記述

• Extract(抽出)や、Load(取り込み)は抽象化されているため、主にTransform(変換)を既述する

■サンプルスクリプト集

https://github.com/awslabs/aws-glue-samples(※同じサイトにFAQもあり、こちらも必読)

Glue : ETL

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

• 作成したコードを読み込んで実行➢ IAM ロールで権限を設定

• ジョブの実行開始方法➢ API コール(手動)

➢ トリガー(スケジュール実行可能)

• リトライ制限の指定や、パラメータを渡すことが可能

• 実行ログは CloudWatch Logs に出力

Glue :

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

ETL

Glue data

catalog

Uses Amazo Athena

Redshift Spectrum

EMR (Hive/Spark)

Uses

Uses

Uses

Amazon S3

AWS Glue

AWS

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon S3

Amazon EMR AWS Glue

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon QuickSight

BI

• $9/

SPICE

• Redshift, RDS, S3, Athena, Salesforce,

• AWS

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

EC2+BIツール• 多彩なパートナーソリューション・OSSをEC2上で活用

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon S3

EC2QuickSight

RDS

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon

Glacier

Amazon

S3Amazon

DynamoDB

Amazon RDS/

AuroraAmazon

CloudSearchAmazon

Elasticsearch

Amazon

QuickSightAmazon

Elasticsearch

Amazon

Kinesis

Analytics

Amazon

EMR

Amazon

Redshift

Amazon

Machine

Learning

Amazon

Athena

AWS IoTAmazon

Kinesis Snowball

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

1.

2. RDB

3.

4. AWS

5. AWS

6.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

primeNumber

DMP (Data Management Platform)

https://speakerdeck.com/hiro_koba_jp/aws-etlji-ri-aws-gluehuo-yong-shi-li-at-primenumber

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

1.

2. RDB

3.

4. AWS

5. AWS

6.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

• AWS

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Thank you !