Autoscaling Spark on AWS EC2 - 11th Spark London meetup

Autoscaling Spark for Fun and Profit
Rafal Kwasny
11th Spark London Meetup
2015-11-26

How many of you use Spark in production?

Who am I?
- DevOps
- Built a few platforms in my life, mostly adtech and in-game analytics for Sony PlayStation
- Currently advising investment banks
- CTO, Entropy Investments

How do you run Spark?
Who runs on AWS?

Who uses EMR?

So, how do you use autoscaling on AWS?

Overview
- Typical architecture for AWS
- How autoscaling works
- Scripts to make your life easier

Typical architecture for AWS

Generate some data

Typical architecture for AWS

Store it in S3

- Single source of data
- Very good durability & availability
- Offloading storage complexity to AWS

Typical architecture for AWS

or store it in a message queue

Typical architecture for AWS

Use your favourite tool for ETL

Typical architecture for AWS

Ship it back to S3

Parquet
- Columnar store
- Standard supported by Spark, Hive, Presto, Impala
- Optimised for: column-based aggregations
- Not optimised for: `select *` type queries, INSERTs/UPDATEs
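A minimal PySpark sketch of that trade-off (the bucket, paths, and column names are hypothetical): a columnar aggregation only reads the columns it touches, while `select *` forces Spark to materialise every column.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Ship an ETL result back to S3 as Parquet (hypothetical s3a:// paths)
events = spark.read.json("s3a://my-bucket/raw/events/")
events.write.mode("overwrite").parquet("s3a://my-bucket/parquet/events/")

# Fast: this aggregation only reads the `country` and `revenue` columns
parquet = spark.read.parquet("s3a://my-bucket/parquet/events/")
parquet.groupBy("country").sum("revenue").show()

# Slower: `select *` has to read and decode every column
parquet.select("*").show()
```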

Typical architecture for AWS

Or send it somewhere

Typical architecture for AWS

- EMR
- spark-ec2
- build cluster from scratch

When on EC2 you have two main options:
- spark-ec2 scripts
- EMR (Elastic MapReduce)

"Map-reduce is about quickly writing very inefficient code and then running it at massive scale" (c) Someone

Problem
EC2 is a pay-for-what-you-use model.

You just have to decide how many resources you want to use before starting a cluster.

Problem
Most common problems while running on EC2:

- Scaling up: "My team needs a new cluster, how big should it be?"
- Scaling down: "Did I shut down the DEV cluster before leaving the office on Friday evening?"

How to automate scaling?

Types of scaling
- Vertical scaling: let's get a bigger box
  - Change instance type
  - Change EBS parameters
- Horizontal scaling: just add more nodes

Autoscaling
- Automatic resizing based on demand
- Define minimum/maximum instance count
- Define when scaling should occur
- Use metrics
- Run your jobs and don't worry about infrastructure
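The talk doesn't show code for this, but "define when scaling should occur, use metrics" can be sketched with boto3 (the group name, thresholds, and metric choice are all hypothetical; the group itself is created later):

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Scale out by 2 workers when the policy fires (hypothetical ASG name)
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="spark-workers",
    PolicyName="scale-out-on-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=300,
)

# Fire the policy when average CPU stays above 70% for 10 minutes
cloudwatch.put_metric_alarm(
    AlarmName="spark-workers-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "spark-workers"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```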

Architecture with autoscaling

- No HDFS
- No state on workers
- Using RAM/local SSDs for caching
- Only saving output into S3
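A minimal PySpark sketch of that stateless-worker pattern (paths are hypothetical): intermediate data is cached in executor RAM, spilling to local disk, and only the final results are persisted to S3, so workers can be terminated at any time.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stateless-workers").getOrCreate()

# Intermediate data lives in executor RAM, spilling to local SSD if needed
raw = spark.read.json("s3a://my-bucket/raw/events/")
cleaned = raw.filter(raw["status"] == "ok").persist(StorageLevel.MEMORY_AND_DISK)

# Reuse the cached DataFrame for several outputs, then ship results to S3;
# nothing of value is left on the workers when they are terminated
cleaned.groupBy("user_id").count().write.mode("overwrite").parquet("s3a://my-bucket/out/counts/")
cleaned.groupBy("country").count().write.mode("overwrite").parquet("s3a://my-bucket/out/by_country/")
```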

Fault recovery

Autoscaling components
- AMI: machine image with Spark installed
- Launch configuration, which defines:
  - AMI
  - instance type
  - instance storage
  - public IP
  - security groups

Autoscaling components
- Autoscaling group, which defines:
  - launch configuration
  - availability zones
  - VPC details
  - min/max servers
  - when to scale
  - metrics/health checks
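The same two components expressed as a boto3 sketch (the AMI ID, key pair, security group, subnets, and sizes are all hypothetical placeholders):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Launch configuration: AMI + instance type + storage/networking details
autoscaling.create_launch_configuration(
    LaunchConfigurationName="spark-worker-lc",
    ImageId="ami-00000000",          # AMI with Spark baked in (e.g. built via Packer)
    InstanceType="r3.xlarge",
    KeyName="my-keypair",
    SecurityGroups=["sg-00000000"],
    AssociatePublicIpAddress=True,
)

# Autoscaling group: where to run, how many servers, and health checking
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="spark-workers",
    LaunchConfigurationName="spark-worker-lc",
    MinSize=2,
    MaxSize=20,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-00000000,subnet-11111111",
    HealthCheckType="EC2",
    HealthCheckGracePeriod=300,
)
```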

Putting it all together

Then you can run your job
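Putting it together might look roughly like this (a hedged sketch reusing the hypothetical group and paths from above; in practice you would also wait for the new instances to register with the cluster): grow the group before a heavy job, run the job, then shrink back so you stop paying.

```python
import boto3
from pyspark.sql import SparkSession

autoscaling = boto3.client("autoscaling")

# Grow the worker pool before a heavy job...
autoscaling.set_desired_capacity(
    AutoScalingGroupName="spark-workers", DesiredCapacity=10
)

# ...run the job against the cluster...
spark = SparkSession.builder.appName("nightly-etl").getOrCreate()
spark.read.json("s3a://my-bucket/raw/events/") \
    .groupBy("country").count() \
    .write.mode("overwrite").parquet("s3a://my-bucket/out/nightly/")

# ...and shrink back afterwards so you stop paying for idle workers
autoscaling.set_desired_capacity(
    AutoScalingGroupName="spark-workers", DesiredCapacity=2
)
```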

Complicated?
AWS provides a lot of services.

spark-cloud
Better scripts to start Spark clusters on EC2

Alpha version

https://github.com/entropyltd/spark-cloud

What's inside spark-cloud?
Building AMIs through Packer

Packer is a tool for creating machine and container images for multiple platforms from a single source configuration.

Supports AWS, DigitalOcean, Docker, OpenStack, Parallels, QEMU, VirtualBox, VMware

Current functionality
- Start cluster
- Shut down cluster

But more to come :)

Spot instances

- On-Demand: $1.400
- Spot: $0.15
- 89% cheaper
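To check what the discount looks like at any given moment, a small boto3 sketch (the instance type and region are hypothetical):

```python
from datetime import datetime, timedelta

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Recent spot prices for a hypothetical worker instance type
history = ec2.describe_spot_price_history(
    InstanceTypes=["r3.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=1),
)
for price in history["SpotPriceHistory"]:
    print(price["AvailabilityZone"], price["SpotPrice"])
```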

Summary
- Spark and EC2 is a very common combination
- Because it makes your life easier
- And cheaper
- The spark-cloud script will help you
- You can just worry about writing good Spark code!

Thank You
rafal@entropy.be

Amazon S3 Tips
Don't use s3n://; use s3a:// with Hadoop 2.6:
- Parallel rename, especially important for committing output
- Supports IAM authentication
- No "xyz_$folder$" files
- Input seek
- Multipart upload (no 5GB limit)
- Error recovery and retry

More info https://issues.apache.org/jira/browse/HADOOP-10400
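Pointing Spark at s3a can be sketched like this (a minimal example; with an IAM instance profile attached to the machines no keys are needed at all, and the bucket path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-demo")
    # With IAM roles, s3a picks up credentials automatically; only set
    # these keys explicitly if the instances have no instance profile
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Read via the s3a:// scheme rather than the older s3n://
df = spark.read.parquet("s3a://my-bucket/parquet/events/")
df.show()
```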

Why not EMR?
- Why pay for EMR? It costs more than a spot instance
- Vendor lock-in and proprietary libraries
- netlib-java