Building and Analyzing a Data Lake on AWS - 김민성 (AWS...

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

김민성, Solutions Architect

August 29, 2017

Building and Analyzing a Data Lake on AWS

How to Ask Questions During the Talk

Your own questions are displayed. Answers shared with everyone appear in black, and answers visible only to the person who asked appear in red.

Main Topics of This Session

• Big data

• Definition and requirements of a Data Lake

• Data Lake on AWS

Why Big Data?

Big Data Challenges

Source: Forrsights Strategy Spotlight: Business Intelligence And Big Data, Q4 2012. Base: 634 business intelligence users and planners

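[Chart: average data volume per company, by company size. Labels: Unstructured 50 TB; Semi-structured 2 TB; Structured 12 TB; 12%. Values (small and medium businesses / large enterprises): 9 TB / 75 TB; 0.6 TB / 5 TB; 4 TB / 50 TB.]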

Types of Big Data

Insight and the Opportunity for Change

Big Data Analytics Tools

• Amazon Kinesis
• Amazon Glacier
• S3
• DynamoDB
• RDS
• EMR
• Amazon Redshift
• Data Pipeline
• Amazon Kinesis Streams app
• Lambda
• Amazon ML
• SQS
• ElastiCache
• DynamoDB Streams
• Amazon Kinesis Analytics
• Amazon Elasticsearch Service

Data Architecture

Hadoop Cluster

SQL Database

Data Warehouse Appliance

Monolithic Big Data Analytics Architecture

[Diagram: a Hadoop Master Node and three nodes, each bundling CPU, Memory, and HDFS Storage]

Multiple layers of functionality all on a single cluster

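How Do We Overcome This? (EMR Clusters on an S3 Data Lake)

[Diagram: EMR, EMR Cluster, S3]

1. Store the data in S3
2. Choose the Hadoop cluster, the number of nodes, the node type, and Hadoop tools such as Hive, Pig, and HBase
3. Launch the cluster through the EMR console, CLI, SDK, or API
4. Retrieve the results from S3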
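Steps 2 and 3 above can be sketched as a request for the EMR API. The following is a minimal illustration, not the presenter's code: it only builds the keyword arguments in the shape that boto3's `run_job_flow` accepts, and does not call AWS. The function name `build_cluster_request`, the cluster name, and the bucket `example-bucket` are hypothetical; the release label and instance type are borrowed from the CLI example later in this deck.

```python
def build_cluster_request(name, log_uri, node_count=3, node_type="m3.xlarge",
                          tools=("Hive", "Pig", "HBase")):
    """Return keyword arguments shaped for boto3's EMR run_job_flow call."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.4.0",          # release used elsewhere in the deck
        "LogUri": log_uri,                    # hypothetical S3 location for logs
        # Step 2: pick the Hadoop tools to install on the cluster.
        "Applications": [{"Name": tool} for tool in tools],
        # Step 2: pick the node type and the number of nodes.
        "Instances": {
            "MasterInstanceType": node_type,
            "SlaveInstanceType": node_type,
            "InstanceCount": node_count,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default role names created by `aws emr create-default-roles`.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("datalake-demo", "s3://example-bucket/emr-logs/")
# Step 3: with AWS credentials configured, the cluster could be started with
#   boto3.client("emr").run_job_flow(**request)
```

Because the data stays in S3 (step 1), the same request can be issued again with a different `node_count` or tool list without moving any data.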

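• Easily resize the cluster
• Create another cluster that uses the same data

Scaling the Cluster

• Set instance capacity
• Configure scaling based on YARN resource usage
• Choose automatic scale-out or scale-in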
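Scaling driven by YARN resource usage can be expressed as an EMR automatic scaling policy. The sketch below is an assumption-laden illustration: the helper names, the 15% and 75% thresholds, and the five-minute periods are made up for the example. The dictionary layout follows the `AutoScalingPolicy` structure used by the EMR API, and `YARNMemoryAvailablePercentage` is a CloudWatch metric that EMR publishes.

```python
def yarn_scaling_rule(name, comparison, threshold_percent, adjustment):
    """One scaling rule triggered by the share of YARN memory still available."""
    return {
        "Name": name,
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": adjustment,   # +N adds nodes, -N removes nodes
                "CoolDown": 300,                   # seconds between scaling activities
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": comparison,
                "EvaluationPeriods": 1,
                "MetricName": "YARNMemoryAvailablePercentage",
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Statistic": "AVERAGE",
                "Threshold": threshold_percent,
                "Unit": "PERCENT",
            }
        },
    }


def build_autoscaling_policy(min_nodes, max_nodes):
    """Policy that grows when YARN memory runs low and shrinks when it is idle."""
    if not 0 < min_nodes <= max_nodes:
        raise ValueError("expected 0 < min_nodes <= max_nodes")
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            yarn_scaling_rule("scale-out", "LESS_THAN", 15.0, 1),
            yarn_scaling_rule("scale-in", "GREATER_THAN", 75.0, -1),
        ],
    }


policy = build_autoscaling_policy(2, 10)
# Could be attached to an instance group with
#   boto3.client("emr").put_auto_scaling_policy(
#       ClusterId=..., InstanceGroupId=..., AutoScalingPolicy=policy)
```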

• Reduce costs by using Spot nodes

Using EMR Instance Fleets

• Choose up to five different instance types
• Automatically switch to On-Demand when Spot Instances are interrupted
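An instance fleet with several instance types and an On-Demand fallback might be described as follows. This is a sketch under stated assumptions: the function name, the fleet name, and the 20-minute timeout are hypothetical. The keys follow the EMR `InstanceFleets` structure, where `TimeoutAction` set to `SWITCH_TO_ON_DEMAND` makes EMR provision On-Demand capacity if the Spot capacity has not been fulfilled within the timeout.

```python
MAX_INSTANCE_TYPES = 5  # limit per fleet, as stated on the slide


def build_core_fleet(instance_types, spot_units, timeout_minutes=20):
    """Describe a core fleet that prefers Spot and falls back to On-Demand."""
    if not 1 <= len(instance_types) <= MAX_INSTANCE_TYPES:
        raise ValueError("a fleet takes between 1 and 5 instance types")
    return {
        "Name": "core-fleet",                 # hypothetical fleet name
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": spot_units,
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": 1,
                # Bid up to the On-Demand price for each type.
                "BidPriceAsPercentageOfOnDemandPrice": 100.0,
            }
            for instance_type in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


fleet = build_core_fleet(["m3.xlarge", "m4.xlarge", "r3.xlarge"], spot_units=4)
# Would be passed as Instances={"InstanceFleets": [fleet, ...]} to run_job_flow.
```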

• Terminate the cluster when all jobs are finished (so billing stops!)
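One way to get this behavior is a transient cluster: submit the work as steps and tell EMR not to keep the cluster alive once they finish. The sketch below assumes Hive scripts stored in S3; the helper names and script paths are hypothetical, while `command-runner.jar` with `hive-script` and the `KeepJobFlowAliveWhenNoSteps` flag are part of the EMR API.

```python
def hive_step(name, script_uri):
    """One EMR step that runs a Hive script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args", "-f", script_uri],
        },
    }


def transient_settings(script_uris):
    """Settings to merge into a run_job_flow request so the cluster ends with its steps."""
    if not script_uris:
        raise ValueError("a transient cluster needs at least one step")
    return {
        "Steps": [hive_step(f"step-{i + 1}", uri) for i, uri in enumerate(script_uris)],
        # False means: shut the cluster down once no steps remain.
        "Instances": {"KeepJobFlowAliveWhenNoSteps": False},
    }


settings = transient_settings(["s3://example-bucket/scripts/daily_report.q"])
```

The `Instances` entry here carries only the auto-termination flag; in a real request it would be merged with the instance type and count settings.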

Metadata Management

hivemetadata.json:

    [
      {
        "Classification": "hive-site",
        "Properties": {
          "javax.jdo.option.ConnectionURL": "jdbc:mysql://RDS-endpoint:3306/hive?createDatabaseIfNotExist=true",
          "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
          "javax.jdo.option.ConnectionUserName": "username",
          "javax.jdo.option.ConnectionPassword": "password"
        }
      }
    ]

    aws emr create-cluster --release-label emr-5.4.0 --instance-type m3.xlarge \
        --instance-count 2 --applications Name=Hive \
        --configurations file://hivemetadata.json --use-default-roles
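The configuration file above can also be generated rather than hand-edited, which helps when the RDS endpoint differs per environment. This is a minimal sketch: the function name is hypothetical, and `RDS-endpoint`, `username`, and `password` are the same placeholders the slide uses, not real values.

```python
import json


def hive_metastore_config(endpoint, user, password, database="hive", port=3306):
    """Build the hive-site classification that points Hive at an external metastore."""
    url = f"jdbc:mysql://{endpoint}:{port}/{database}?createDatabaseIfNotExist=true"
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": url,
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


config = hive_metastore_config("RDS-endpoint", "username", "password")
config_json = json.dumps(config, indent=2)
# Writing config_json to hivemetadata.json yields the file passed to
# `aws emr create-cluster --configurations`.
```

Keeping the metastore in RDS means table definitions survive after a cluster is terminated, so a new cluster can query the same S3 data immediately.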

Using Bootstrap Actions

Use S3 as the script repository
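A bootstrap action is essentially a pointer to a script in S3 that every node runs at startup. The sketch below shows the entry shape that the EMR `BootstrapActions` parameter expects; the function name, the script path, and the arguments are hypothetical.

```python
def bootstrap_action(name, script_uri, args=()):
    """Describe a bootstrap action whose script lives in S3."""
    if not script_uri.startswith("s3://"):
        raise ValueError("this sketch expects the script to be stored in S3")
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_uri, "Args": list(args)},
    }


actions = [
    bootstrap_action(
        "install-libs",
        "s3://example-bucket/bootstrap/install_libs.sh",
        args=["--with-numpy"],
    )
]
# Would be passed as BootstrapActions=actions to run_job_flow.
```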

Case Study: Samsung Big Data Analytics

S3: stores the source data and serves as the bus between analytics systems

Unlocking Data (Data Lake)

Data Lake: Unlocking Data

Most companies and organizations are launching innovation initiatives to unlock their data.

The data already exists, but it goes unused or sits isolated and locked away.


Characteristics and Benefits of a Data Lake

1. All data in one place

“Why is the data distributed in many locations? Where is the single source of truth?”

Store and analyze all of your data, from all of your sources, in one centralized location.


2. Rapid data ingestion and storage

“How can I collect data quickly from various sources and store it efficiently?”

Quickly ingest data without needing to force it into a pre-defined schema.


3. Separate data storage from processing

“How can I scale up with the volume of data being generated?”

Separating your storage and compute allows you to scale each component as required.


4. Analyze without imposing structure up front (Schema on Read)

“Is there a way I can apply multiple analytics and processing frameworks to the same data?”

A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
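Schema on read can be illustrated with a few lines of plain Python. In this sketch the raw events are stored exactly as they arrived, and each analysis applies its own schema when it reads them. The sample events, field names, and the `read_with_schema` helper are all invented for the illustration.

```python
import json

# Raw events kept as-is; note that the second one has no "price" field.
RAW_EVENTS = [
    '{"user": "u1", "action": "click", "ts": 1503990000, "price": "12.5"}',
    '{"user": "u2", "action": "view", "ts": 1503990060}',
]

# Two consumers read the same data with different schemas.
ACTIVITY_SCHEMA = {"user": str, "action": str, "ts": int}
PRICE_SCHEMA = {"user": str, "price": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines, keeping only the schema's fields and casting them."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the whole read.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


activity_rows = read_with_schema(RAW_EVENTS, ACTIVITY_SCHEMA)
price_rows = read_with_schema(RAW_EVENTS, PRICE_SCHEMA)
```

Nothing had to be decided about `price` when the events were written; the type is chosen only by the consumer that needs it, which is what lets multiple frameworks share the same stored data.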

AWS Approach to Data Lake

S3 as a Data Lake

S3 Data Lake

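Fixed Cluster Data Lake

• Limited to the single tool contained in the cluster (e.g., Hadoop, a data warehouse, or Cassandra), even though use cases and ecosystem tools change rapidly
• Adding nodes just to add storage capacity drives costs up sharply
• Replicating data to guard against node loss is expensive
• Expanding local storage capacity is complex
• Adding new storage and bringing it online requires a long data migration period

AWS S3 Data Lake

• Built on S3 storage, which supports diverse data objects instead of a fixed cluster, and decoupled from compute resources
• Offers the flexibility to use any tool in the data ecosystem
• A future-proof architecture that easily supports new use cases and new tools
• Lets you plug in and use today's best-of-breed products (Plug and Play)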

S3 as a Data Lake

Durable
• Designed for 11 9s of durability

Available
• Designed for 99.99% availability

High performance
• Multipart upload
• Range GET

Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments

Integrated
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB

Easy to use
• Simple REST API
• AWS SDKs
• Event notification
• Lifecycle Management
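As an example of the Lifecycle Management feature listed above, older raw data can be moved to Glacier automatically. The sketch below builds a rule in the shape accepted by the S3 lifecycle configuration API; the helper name, the `raw/logs/` prefix, and the 90-day and 365-day periods are assumptions made for the illustration.

```python
def archive_rule(prefix, days_to_glacier, days_to_expire=None):
    """Lifecycle rule that archives objects under a prefix to Glacier."""
    rule_name = prefix.strip("/").replace("/", "-") or "all"
    rule = {
        "ID": f"archive-{rule_name}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days_to_glacier, "StorageClass": "GLACIER"}],
    }
    if days_to_expire is not None:
        if days_to_expire <= days_to_glacier:
            raise ValueError("expiration must come after the Glacier transition")
        rule["Expiration"] = {"Days": days_to_expire}
    return rule


lifecycle = {"Rules": [archive_rule("raw/logs/", 90, days_to_expire=365)]}
# Could be applied with
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=lifecycle)
```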

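If You Are Just Getting Started with Big Data Analytics…

[Architecture diagram. Labels: Data Lake; Amazon EMR; Web/Mobile Application; log data; Logstash; Crawling; Amazon Kinesis; real-time analysis; data transformation; source data collection; real-time prediction; Amazon ML; Amazon EMR; Amazon ElastiCache; Amazon DynamoDB; Amazon Elasticsearch; Amazon ML; Amazon Athena; analytics tools for different purposes]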

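“Collect and transform vast amounts of source data in real time,”

“analyze it in real time,”

“predict in real time,”

“and expand analytics capabilities with a variety of tools suited to each analytical purpose.”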
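For the real-time collection part, events are typically sent to Amazon Kinesis in batches. The sketch below prepares records in the shape that the Kinesis `PutRecords` API accepts and splits them into batches of at most 500, the per-request record limit. It does not call AWS, it ignores the separate limit on total request size, and the event fields, helper names, and stream name are hypothetical.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def to_kinesis_records(events, key_field="user"):
    """Serialize events and choose a partition key for each one."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event[key_field]),
        }
        for event in events
    ]


def batches(records, size=MAX_RECORDS_PER_CALL):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


events = [{"user": f"u{i % 10}", "action": "click", "seq": i} for i in range(1200)]
records = to_kinesis_records(events)
batch_sizes = [len(batch) for batch in batches(records)]
# Each batch could then be sent with
#   boto3.client("kinesis").put_records(StreamName="example-stream", Records=batch)
```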

Case Study: Real-Time Behavioral Analysis of Log Data

Building the S3 Data Lake

• Amazon Redshift: Data Warehouse
• Amazon Elastic MapReduce: Semi-structured
• Amazon Simple Storage Service: Data Storage
• Amazon Glacier: Archive
• Amazon DynamoDB: NoSQL
• Amazon Machine Learning: Predictive Models
• Amazon Kinesis: Streaming
• Other Apps
• Internet, S3 Endpoint

S3 Innovations

S3 as a Data Lake

Thank you.

We will now answer your questions.

The slides and the recording are available at http://bit.ly/awskr-webinar

Please leave your feedback to help us improve future seminars!

http://bit.ly/awskr-webinar
