(BDT317) Building A Data Lake On AWS

Ian Meyers, Principal Solution Architect, AWS

October 2015

BDT317

Building Your Data Lake on AWS

Benefits of the Enterprise Data Warehouse

• Self documenting schema

• Enforced data types

• Ubiquitous and common security model

• Simple tools to access, robust ecosystem

• Transactionality

STORAGECOMPUTE

But customers have additional requirements…

The Rise of “Big Data”

Enterprise

data warehouse

Amazon

STORAGECOMPUTE

COMPUTECOMPUTE

COMPUTE

COMPUTECOMPUTE

COMPUTE

Benefits of Separation of Compute & Storage

• All your data, without paying for unused cores

• Independent cost attribution per dataset

• Use the right tool for a job, at the right time

• Increased durability without operations

• Common model for data, without enforcing access

method

Comparison of a Data Lake to an Enterprise Data Warehouse

Complementary to EDW (not replacement) Data lake can be source for EDW

Schema on read (no predefined schemas) Schema on write (predefined schemas)

Structured/semi-structured/Unstructured data Structured data only

Fast ingestion of new data/content Time consuming to introduce new content

Data Science + Prediction/Advanced Analytics + BI use

casesBI use cases only (no prediction/advanced analytics)

Data at low level of detail/granularity Data at summary/aggregated level of detail

Loosely defined SLAs Tight SLAs (production schedules)

Flexibility in tools (open source/tools for advanced

analytics)Limited flexibility in tools (SQL only)

Enterprise DWEMR S3

EMR S3

The New Problem

Enterprise

data warehouse

Which system has my data?

How can I do machine

learning against the DW?

I built this in Hive, can we get

it into the Finance reports?

These sources are giving

different results…

But I implemented the

algorithm in Anaconda…

Dive Into The Data Lake

≠Enterprise

data warehouseEMR S3

Dive Into The Data Lake

Enterprise

data warehouseEMR S3

Load Cleansed Data

Export Computed Aggregates

Ingest any data

Data cleansing

Data catalogue

Trend analysis

Machine learning

Structured analysis

Common access tools

Efficient aggregation

Structured business rules

Components of a Data Lake

Data Storage

• High durability

• Stores raw data from input sources

• Support for any type of data

• Low cost

Streaming

• Streaming ingest of feed data

• Provides the ability to consume any

dataset as a stream

• Facilitates low latency analytics

Storage & Streams

Catalogue & Search

Entitlements

API & UI

Storage & Streams

Catalogue & Search

Entitlements

API & UICatalogue

• Metadata lake

• Used for summary statistics and data

Classification management

Search

• Simplified access model for data

discovery

Storage & Streams

Catalogue & Search

Entitlements

API & UIEntitlements system

• Encryption

• Authentication

• Authorisation

• Chargeback

• Quotas

• Data masking

• Regional restrictions

Storage & Streams

Catalogue & Search

Entitlements

API & UIAPI & User Interface

• Exposes the data lake to customers

• Programmatically query catalogue

• Expose search API

• Ensures that entitlements are respected

STORAGE

High durability

Stores raw data from input sources

Support for any type of data

Low cost

Storage & Streams

Catalogue & Search

Entitlements

API & UI

Amazon Simple Storage ServiceHighly scalable object storage for the Internet

1 byte to 5 TB in size

Designed for 99.999999999% durability, 99.99%

availability

Regional service, no single points of failure

Server side encryption

Compute Storage

AWS Global Infrastructure

Database

App Services

Deployment & Administration

Networking

Analytics

Storage Lifecycle Integration

S3 – Standard S3 – Infrequent Access Amazon Glacier

Data Storage Format

• Not all data formats are created equally

• Unstructured vs. semi-structured vs. structured

• Store a copy of raw input

• Data standardisation as a workflow following ingest

• Use a format that supports your data, rather than force

your data into a format

• Consider how data will change over time

• Apply common compression

Consider Different Types of Data

Unstructured• Store native file format (logs, dump files, whatever)

• Compress with a streaming codec (LZO, Snappy)

Semi-structured - JSON, XML files, etc.• Consider evolution ability of the data schema (Avro)

• Store the schema for the data as a file attribute (metadata/tag)

Structured• Lots of data is CSV!

• Columnar storage (Orc, Parquet)

Where to Store Data

• Amazon S3 storage uses a flat keyspace

• Separate data storage by business unit, application, type, and time

• Natural data partitioning is very useful

• Paths should be self documenting and intuitive

• Changing prefix structure in future is hard/costly

Metadata

Services

CRUD API

Query API

Analytics API

Systems of

Reference

Return

URLsURLs as deeplinks to

applications, file

exchanges via S3

(RESTful file services)

or manifests for Big

Data Analytics / HPC.

Integration Layer

System to system via Amazon SNS/Amazon SQS

System to user via mobile push

Amazon Simple Workflow for high level system integration / orchestration

http://en.wikipedia.org/wiki/Resource-oriented_architecture

s3://${system}/${application}/${YYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied}

Resource Oriented Architecture

STREAMING

Streaming ingest of feed data

Provides the ability to consume any

dataset as a stream

Facilitates low latency analytics

Storage & Streams

Catalogue & Search

Entitlements

API & UI

Why Do Streams Matter?

• Latency between event & action

• Most BI systems target event to action latency of 1 hour

• Streaming analytics would expect event to action latency

< 2 seconds

• Stream orientation simplifies architecture, but can

increase operational complexity

• Increase in complexity needs to be justified by business

value of reduced latency

Amazon KinesisManaged service for real time big data processing

Create streams to produce & consume data

Elastically add and remove shards for performance

Use Amazon Kinesis Worker Library to process data

Integration with S3, Amazon Redshift, and DynamoDB

Compute Storage

Database

App Services

Networking

Analytics

Data Sources

Sources

Data Sources

[Archive/Ingestion]

[Sliding Window Analysis]

[Data Loading]

[Event Processing Systems]

DynamoDB

Amazon Redshift

Data Sources

Availability

Shard 1

Shard 2

Shard N

Availability

ZoneAvailability

Amazon Kinesis Architecture

Streaming Storage Integration

Object store

Amazon S3

Streaming store

Amazon

Kinesis

Analytics applicationsRead & write file dataRead & write to streams

(BDT317) Building A Data Lake On AWS

Technology

Transcript of (BDT317) Building A Data Lake On AWS

Building Static Websites on AWS

Building Big Data Applications on AWS

Building Web Scale Applications with AWS

AWS Summit Berlin 2013 - Building web scale applications with AWS

AWS January 2016 Webinar Series - Best Practices for Building IoT Backends with AWS IoT & AWS Lambda

Building an AWS Perimeter

Building a data lake on Amazon Web Services (AWS)

Ebook Building Data Lake AWS - Morris & Opazo · Amazon Kinesis, which enables you to ingest data in real-time, AWS Import/ Export, a service ... processing. With AWS, you don’t

Building Bulletproof Infrastructure on AWS

Momentum Building in the AWS Band

Building a HIPAA Compliant Data Lake on AWS with...architected a Data Lake on AWS, adhering to best practices recommendations, such as creating selfdocumenting and intuitive data paths

Building Serverless APIs on AWS

Aws building fault_tolerant_applications

Building Your Data Lake on AWS · Associate-level Certification AWS Certified Big Data - Specialty • Enterprise solutions architects • Data scientists • Big Data solutions architects

AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry

What's New with AWS Lake Formation

Building Scalable Databases on AWS - AWS Summit 2012 - NYC

Ebook Construyendo Data Lake AWS - Morris & Opazo · Tu Partner de Amazon Web Services (AWS) EBOOK: Construyendo un Data Lake en AWS Somos una empresa de servicios especializada en

Building Web-Scale Applications with AWS - MV Dironamvdirona.com/jrh/TalksAndPapers/Building Web-Scale Applications... · Building Web-Scale Applications with AWS ... Amazon Elastic

Building an AWS Hybrid Cloud