AWS Summit Barcelona - Data Analysis on AWS

Post on 11-May-2015


AWS Summit 2013 Barcelona Oct 24 – Barcelona, Spain

Carlos Conde

Sr. Mgr. Solutions Architecture

DATA ANALYSIS ON AWS

GENERATE STORE ANALYZE SHARE

THE COST OF DATA GENERATION IS FALLING

THE MORE DATA YOU COLLECT, THE MORE VALUE YOU CAN DERIVE FROM IT

GENERATE STORE ANALYZE SHARE

Lower cost, higher throughput

Highly constrained

[Chart: DATA VOLUME, generated data vs. data available for analysis]

Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

GENERATE STORE ANALYZE SHARE


ACCELERATE

+ ELASTIC AND HIGHLY SCALABLE

+ NO UPFRONT CAPITAL EXPENSE

+ ONLY PAY FOR WHAT YOU USE

+ AVAILABLE ON-DEMAND

= REMOVE CONSTRAINTS

GENERATE STORE ANALYZE SHARE


AWS Import/Export

AWS Direct Connect

Generated and stored in AWS

Inbound data transfer is free

Multipart upload to S3

Physical media

AWS Direct Connect

Regional replication of AMIs and snapshots
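Multipart upload sends a large object to S3 as independent parts. A minimal sketch of the part-planning step follows. It assumes S3's documented limits (5 MiB minimum part size except for the last part, at most 10,000 parts per upload). `plan_parts` is a hypothetical helper, not an AWS API, and no network call is made.

```python
import math

MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(object_size, part_size=MIN_PART_SIZE):
    """Return (part_number, offset, length) tuples covering object_size bytes."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")
    # Grow the part size if the object would otherwise need too many parts.
    part_size = max(part_size, math.ceil(object_size / MAX_PARTS))
    parts = []
    offset = 0
    part_number = 1  # S3 part numbers start at 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts


if __name__ == "__main__":
    # Plan a 12 MiB upload with the default part size.
    for part in plan_parts(12 * 1024 * 1024):
        print(part)
```

Each planned part could then be uploaded in parallel and retried on its own, which is the main reason to use multipart upload for large files.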

GENERATE STORE ANALYZE SHARE

Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2

AMAZON S3: SIMPLE STORAGE SERVICE

AMAZON DYNAMODB: HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE

DURABLE & AVAILABLE: CONSISTENT, DISK-ONLY WRITES (SSD)

LOW LATENCY: AVERAGE READS < 5 MS, WRITES < 10 MS

NO ADMINISTRATION

500,000 WRITES PER SECOND DURING SUPER BOWL

AMAZON REDSHIFT: FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS

DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was…

AMAZON REDSHIFT

A Whole Lot Simpler

A Lot Cheaper

A Lot Faster

AMAZON REDSHIFT

RUNS ON OPTIMIZED HARDWARE

HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rate

HS1.XL: 16 GB RAM, 2 Cores, 2 TB compressed customer storage

30 MINUTES DOWN TO 12 SECONDS

AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG

Extra Large Node (HS1.XL): Single Node (2 TB), Cluster 2-32 Nodes (4 TB – 64 TB)

Eight Extra Large Node (HS1.8XL): Cluster 2-100 Nodes (32 TB – 1.6 PB)
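The cluster ranges above follow from the per-node storage figures. A small sketch of that arithmetic, using only the numbers quoted on the slides (`cluster_capacity_tb` is a hypothetical helper name):

```python
# Per-node compressed storage, in TB, as quoted on the slides.
NODE_STORAGE_TB = {"HS1.XL": 2, "HS1.8XL": 16}


def cluster_capacity_tb(node_type, node_count):
    """Total compressed storage of a cluster, in TB."""
    return NODE_STORAGE_TB[node_type] * node_count


if __name__ == "__main__":
    # Smallest and largest cluster sizes listed for each node type.
    for node_type, counts in (("HS1.XL", (2, 32)), ("HS1.8XL", (2, 100))):
        low, high = (cluster_capacity_tb(node_type, n) for n in counts)
        print(f"{node_type}: {low} TB to {high} TB")
```

The largest HS1.8XL cluster comes out at 1,600 TB, which is the 1.6 PB quoted on the slide.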

CREATE A DATA WAREHOUSE IN MINUTES

JDBC/ODBC

                     Price Per Hour for     Effective Hourly   Effective Annual
                     HS1.XL Single Node     Price Per TB       Price Per TB
On-Demand            $0.850                 $0.425             $3,723
1 Year Reservation   $0.500                 $0.250             $2,190
3 Year Reservation   $0.228                 $0.114             $999
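The effective prices can be derived from the node price: an HS1.XL node holds 2 TB, and a year has 8,760 hours. A short sketch of the calculation, with `effective_prices` as a hypothetical helper name:

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours
NODE_STORAGE_TB = 2         # HS1.XL compressed storage per node, from the slides


def effective_prices(price_per_node_hour):
    """Return (hourly price per TB, annual price per TB) for one HS1.XL node."""
    hourly_per_tb = price_per_node_hour / NODE_STORAGE_TB
    annual_per_tb = hourly_per_tb * HOURS_PER_YEAR
    return hourly_per_tb, annual_per_tb


if __name__ == "__main__":
    plans = {
        "On-Demand": 0.850,
        "1 Year Reservation": 0.500,
        "3 Year Reservation": 0.228,
    }
    for name, price in plans.items():
        hourly, annual = effective_prices(price)
        print(f"{name}: ${hourly:.3f} per TB-hour, ${annual:,.0f} per TB-year")
```

Rounded to whole dollars, the annual figures match the table.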

DATA WAREHOUSING DONE THE AWS WAY

No upfront costs, pay as you go

Really fast performance at a really low price

Open and flexible with support for popular tools

Easy to provision and scale up massively

USAGE SCENARIOS

[Diagram: S3, EMR, and Redshift (reporting and BI)]

[Diagram: DynamoDB (OLTP, web apps) and Redshift (reporting and BI)]

[Diagram: RDBMS (OLTP, ERP) and Redshift (reporting & BI)]

[Diagram: RDBMS (OLTP, ERP) + Redshift (reporting & BI)]

Social Point Analytics in AWS Marc Canaleta (CTO)

@mcanaleta AWS Summit Barcelona 2013

Social games developer for mobile and Facebook

Founded in 2008, offices in Barcelona (22@), 170 people.

Top #20 mobile grossing games worldwide

Top #3 Facebook developer

Social games: interaction between friends, virality

Freemium model: playing is free, some items are paid

Midcore segment

Leader in Breeding & Collecting strategy games

Top 20 grossing in the iOS App Store worldwide

Recently launched on Android, featured on Google Play

6M DAU on Facebook

No hardware to maintain or plan: this increases the speed of the business

Flexible: pay per use

Eases scalability: Auto Scaling

Eases high availability: multiple availability zones

Managed components: load balancers, databases, …

Analytics driven. Analytics are needed by almost all of our teams:

Engineers: real-time analytics, monitoring, problem detection

Product: decision making, A/B testing, game balancing, …

Marketing: campaign optimization

Finance: business tracking

[Architecture diagram: Flash, iOS, and Android clients; backend servers (Symfony 2); analytics queues (Redis); logfiles storage (AWS S3); analytics database (AWS Redshift)]

REDIS

The backend writes events to Redis lists

Why Redis? Cost and performance: 10K events/second/server

Problem: it is an in-memory database, so the queues must be drained constantly

Scaling and HA: N servers, distributed randomly

[Diagram: BACKEND writing to several REDIS servers]
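The write path described on this slide can be sketched as follows. This is an illustration under stated assumptions, not Social Point's code: `FakeRedis` is an in-memory stand-in that models only list push and pop, the queue key `analytics:events` and the JSON event encoding are hypothetical, and the real backend is a Symfony 2 application rather than Python.

```python
import json
import random
from collections import deque


class FakeRedis:
    """In-memory stand-in for one Redis server; only list push/pop are modeled."""

    def __init__(self):
        self.lists = {}

    def rpush(self, key, value):
        # Append to the tail of the list, creating it on first use.
        self.lists.setdefault(key, deque()).append(value)
        return len(self.lists[key])

    def lpop(self, key):
        # Pop from the head of the list; None when the list is empty or missing.
        queue = self.lists.get(key)
        return queue.popleft() if queue else None


def log_event(servers, event, queue_key="analytics:events", rng=random):
    """Serialize an event and append it to the queue on a randomly chosen server."""
    server = rng.choice(servers)
    server.rpush(queue_key, json.dumps(event))
    return server


if __name__ == "__main__":
    servers = [FakeRedis() for _ in range(3)]
    for user_id in range(10):
        log_event(servers, {"type": "login", "user_id": user_id})
    print([len(s.lists.get("analytics:events", [])) for s in servers])
```

Picking a server at random spreads the write load without any coordination, at the cost of losing global event ordering across servers.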

Python processes consume the queues constantly and:

Compute real-time metrics

Store event logfiles to upload them to S3

Enqueue the URL of the S3 object in SQS

[Diagram: the Consumer runs LPOP event against the Redis Queue, INCR counter against Redis Real Time, write event to the Event Log File, put object to Amazon S3, and enqueue S3 object URL to Amazon SQS]

EVENT GENERATION

DATA LOADING
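The consumer steps in the diagram can be sketched as a single drain pass. Everything here is a stand-in chosen for illustration: a `deque` plays the Redis queue, a `Counter` plays the real-time counters, a `dict` plays the S3 bucket, and a `list` plays the SQS queue. The `type` event field, the key naming, and the batch size are assumptions.

```python
import json
from collections import Counter, deque


def consume(queue, counters, bucket, sqs, batch_size=1000, key_prefix="events/"):
    """Drain `queue`, update counters, and flush events in logfile-sized batches."""
    batch = []
    uploaded = 0

    def flush():
        nonlocal uploaded
        if not batch:
            return
        key = f"{key_prefix}{uploaded:06d}.log"
        bucket[key] = "\n".join(batch)   # stands in for the S3 "put object"
        sqs.append(key)                  # stands in for "enqueue S3 object URL"
        uploaded += 1
        batch.clear()

    while queue:
        raw = queue.popleft()            # stands in for Redis LPOP
        event = json.loads(raw)
        counters[event["type"]] += 1     # stands in for Redis INCR
        batch.append(raw)                # "write event" to the current logfile
        if len(batch) >= batch_size:
            flush()
    flush()                              # upload the final partial logfile
    return uploaded


if __name__ == "__main__":
    queue = deque(json.dumps({"type": t}) for t in ("login", "purchase", "login"))
    counters, bucket, sqs = Counter(), {}, []
    print(consume(queue, counters, bucket, sqs, batch_size=2), counters, sqs)
```

A production worker would run this loop continuously rather than once, since the slides note that the in-memory queues must be drained constantly.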

Python is well suited to developing workers and processing data

Redis: structures such as counters, sets, and sorted sets for real-time metrics

S3: virtually infinite space, scalable, highly available

SQS: reliability and availability, at a higher price than Redis


[Diagram: Amazon S3 and Amazon SQS, Importer, TSV, Redshift]

The importers read URLs from SQS

They download the logfiles from S3

They convert them to TSV

They bulk-import them into Redshift (N logfiles at a time)

EVENT PROCESSING
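The "convert to TSV" step of the importer might look like the sketch below. The slides do not say how events are encoded in the logfiles, so newline-delimited JSON and the three-column schema are assumptions made for illustration. The download from S3 and the bulk load into Redshift are left out.

```python
import csv
import io
import json

COLUMNS = ("user_id", "type", "amount")   # hypothetical event schema


def logfile_to_tsv(logfile_text, columns=COLUMNS):
    """Convert newline-delimited JSON events into tab-separated rows."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in logfile_text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        event = json.loads(line)
        # Missing fields become empty columns so every row has the same width.
        writer.writerow([event.get(col, "") for col in columns])
    return out.getvalue()


if __name__ == "__main__":
    sample = "\n".join([
        json.dumps({"user_id": 1, "type": "purchase", "amount": 4.99}),
        json.dumps({"user_id": 2, "type": "login"}),
    ])
    print(logfile_to_tsv(sample), end="")
```

Field values that themselves contain tabs or newlines would need an escaping convention that the load step also understands; this sketch does not handle that case.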

Lets us be flexible -> schema changes without downtime

Very scalable (with write downtime)

Low deployment risk: offline system, backups

Minimal maintenance: vacuums, disk space

Good SQL support, unlike other columnar databases

Daily transformations and calculations implemented in SQL

Example: UPDATE USER SET total_revenues = (SELECT SUM(amount) FROM transaction t WHERE t.user_id = user.user_id);
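The correlated-subquery update above can be tried locally. The sketch below runs the same pattern against an in-memory SQLite database instead of Redshift. The tables are renamed to `users` and `transactions` because `transaction` is a reserved word in SQLite, the sample rows are invented, and the `COALESCE` wrapper is an addition that is not on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, total_revenues REAL);
    CREATE TABLE transactions (user_id INTEGER, amount REAL);
    INSERT INTO users (user_id) VALUES (1), (2), (3);
    INSERT INTO transactions VALUES (1, 4.99), (1, 9.99), (2, 0.99);
""")

# Correlated subquery: for each user, sum that user's transactions.
# COALESCE turns the NULL produced for users without transactions into 0.
conn.execute("""
    UPDATE users
    SET total_revenues = COALESCE(
        (SELECT SUM(amount) FROM transactions t WHERE t.user_id = users.user_id),
        0)
""")

# Map each user_id to its recomputed total.
totals = dict(conn.execute("SELECT user_id, total_revenues FROM users"))
print(totals)
```

Without the `COALESCE`, user 3 would end up with a NULL total rather than zero, which is what the statement on the slide would do as written.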

Why not Hadoop?

Much more complex and slower; for now, SQL operations meet all our requirements

Would you like to work in the video game industry?

We are looking for talent. Talent attracts talent.

www.socialpoint.es/jobs

THANK YOU!

GENERATE STORE ANALYZE SHARE

Amazon EC2

Amazon Elastic MapReduce

AMAZON ELASTIC MAPREDUCE: HADOOP AS A SERVICE

• A FRAMEWORK

• SPLITS DATA INTO PIECES

• LETS PROCESSING OCCUR

• GATHERS THE RESULTS
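The three steps in the bullets (split, process, gather) can be illustrated with a word count. This is a single-process sketch of the idea, not Hadoop or the Elastic MapReduce API; all function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def split(lines, pieces):
    """Split the input into roughly equal pieces."""
    return [lines[i::pieces] for i in range(pieces)]


def map_piece(piece):
    """Process one piece independently: count the words in it."""
    return Counter(word for line in piece for word in line.lower().split())


def gather(partials):
    """Combine the per-piece results into one."""
    return reduce(lambda a, b: a + b, partials, Counter())


def word_count(lines, pieces=3):
    """Run the split, process, and gather steps over a list of text lines."""
    return gather(map_piece(p) for p in split(lines, pieces))


if __name__ == "__main__":
    print(word_count(["generate store", "analyze share", "store analyze store"]))
```

In a real cluster each piece would be processed on a different node, which is what allows the job to scale by adding nodes.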

[Diagram sequence: Corporate Data Center and Elastic Data Center]

• Application data and logs for analysis pushed to S3

• Amazon Elastic MapReduce name node to control analysis

• Hadoop cluster started by Elastic MapReduce

• Adding many hundreds or thousands of nodes

• Disposed of when job completes

• Results of analysis pulled back into your systems

GENERATE STORE ANALYZE SHARE

Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2

PUBLIC DATA SETS http://aws.amazon.com/publicdatasets

GENERATE STORE ANALYZE SHARE


FROM DATA TO ACTIONABLE INFORMATION