IronSource Atom - Redshift - Lessons Learned


All content is the property and proprietary interest of CloudZone. The removal of any proprietary notices, including attribution information, is strictly prohibited.


Big Data Month 2016 – Up Next…

[Calendar: upcoming sessions on 14.11, 15.11, 22.11 (two sessions), 28.11 and 30.11]


Master AWS Redshift - Agenda

13:00 – 13:20  Intro to Amazon Redshift by ironSource
13:20 – 15:00  LAB I – Using Amazon Redshift
15:00 – 15:15  Break
15:15 – 17:25  LAB II – Table Layout and Schema Design with Amazon Redshift
17:25 – 17:30  Your next steps on AWS by CloudZone

Shimon Tolts, General Manager, Data Solutions

Atom Data Pipeline: Processing 200B events with Node.js and Docker on AWS

About ironSource: Hypergrowth

800M people reached each month
4,200 apps installed every minute with the ironSource platform
200B registered & analyzed data events every month

[Chart: registered & analyzed data events per month, growing from near zero in Jun 2015 to roughly 200B in May 2016]

Our Business Challenge

We needed a way to manage this data: Collect, Process, Store.

Collection

● Multi-region layer - latency-based routing
● Low latency from client to Atom servers
● High availability - AWS regions do fail!
● Storing raw data + headers upon receiving

Data Enrichment

● Enrich data before storing in your Data Lake and/or Warehouse
  ○ IP to country
  ○ Currency conversion
  ○ Decrypt data
  ○ User Agent parsing - OS, browser, device...
● Any custom logic you would like - fully extensible

Data Targets

● Near real-time data insertion - 1 minute!
● Stream data to Google Cloud Storage and/or AWS S3
● Smart insertion of data into AWS Redshift (see the COPY sketch below)
  ○ Set the number of parallel COPYs
  ○ Configure priority per table
● BigQuery - streaming data using batch file imports (saves 20% in cost)
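As a rough illustration of what the "smart insertion" step issues against Redshift, here is a minimal COPY sketch, assuming gzip-compressed, newline-delimited JSON batch files on S3; the table name, bucket path, and IAM role are placeholders, not ironSource's actual configuration:

-- Hypothetical batch load of JSON files from S3 into Redshift
COPY atom_events
FROM 's3://example-atom-batches/events/2016/05/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto'   -- map JSON keys to column names automatically
GZIP                    -- batch files are assumed to be gzip-compressed
TIMEFORMAT 'auto'
MAXERROR 10;            -- tolerate a few bad rows, then inspect STL_LOAD_ERRORS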

Micro-Services Architecture

● Everything is a service
● Decoupling
● Distributed systems
● Separate lifecycle
● Communication using RESTful APIs / queues / streams

Docker

● Linux containers
● Save provisioning time
● Infrastructure as code
● Dev-Test-Production - identical container
● Ship easily

Cloud Infrastructure

● Pay as you go (grow)
● SaaS services
● Auto-scaling groups
● DynamoDB
● RDS (*SQL)
● Redshift data warehouse

Continuous Integration

● From commit to production
● Jenkins commit hook
● Git branching model
● AWS dynamic slaves
● Unit tests
● Docker builds
● Updating live environment

[Architecture diagram]

STARTING POINT

● Xplenty - Hadoop service - ~40 min query
● One big cluster - 96 xlarge nodes
● No WLM configuration
● CSV copy
● No reserved nodes
● A different ETL process implemented by every department

SOLUTION:

● Using 8xl nodes where needed
● Redshift cluster per department
● "Hot and cold" clusters - SSD: fast and furious, HDD: slow but cheap
● WLM configuration
● Reserved nodes
● JSON copy
● One pipeline to rule them all - ironBeast - currently supporting over 50B events per month, inserting data into more than 10 Redshift clusters

WORKLOAD MANAGEMENT (WLM)
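The slide itself only shows the WLM setup; as a hedged sketch (the 'etl' query group and queue layout are assumptions, not the actual ironSource configuration), this is roughly how a session can be routed to a user-defined WLM queue and how the active queue configuration can be inspected from SQL:

-- Inspect the user-defined WLM queues currently in effect
SELECT service_class, num_query_tasks, query_working_mem, name
FROM stv_wlm_service_class_config
WHERE service_class > 4;

-- Route this session's statements to the WLM queue matched by the
-- (hypothetical) 'etl' query group, then switch back to the default queue
SET query_group TO 'etl';
-- ... heavy COPY / transformation statements run here ...
RESET query_group;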

THINGS WE LEARNED ALONG THE WAY

● https://github.com/awslabs/amazon-redshift-utils (AdminViews)
● User permissions do not apply to new tables created in a schema (see the default-privileges sketch below)
● Vacuum, vacuum, vacuum
● Avoid parallel inserts (especially on 8xl nodes) - if you copy to multiple tables, it is better to implement a COPY queue
● STL_LOAD_ERRORS - money on the floor
● A columnar datastore does not mean you can use as many columns as you want - it is better to split into multiple tables
● Encode your columns - 'analyze compression'
● Instances that query Redshift should use MTU 1500 - link
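A few of these lessons map directly to SQL; the sketches below are hedged illustrations, with the schema, table, and group names ('analytics', 'events', 'analysts') used as placeholders. The default-privileges statement is only available on newer Redshift releases:

-- Grants do not cover tables created later: set default privileges instead
ALTER DEFAULT PRIVILEGES IN SCHEMA analytics
    GRANT SELECT ON TABLES TO GROUP analysts;

-- Vacuum, vacuum, vacuum (and refresh planner statistics while at it)
VACUUM analytics.events;
ANALYZE analytics.events;

-- STL_LOAD_ERRORS - "money on the floor": check it after every COPY
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;

-- Encode your columns: let Redshift recommend compression encodings
ANALYZE COMPRESSION analytics.events;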

Redshift use cases

10 Million Free Monthly Events

Thank you!

ironsrc.com/atom

shimont@ironsrc.com @shimontolts