
Transcript of Running Spark and MapReduce together in Production

Page 1: Running Spark and MapReduce together in Production

RUNNING SPARK AND MAPREDUCE TOGETHER IN PRODUCTION

David Chaiken, CTO of Altiscale

[email protected]

#HadoopSherpa

Page 2

AGENDA

• Why run MapReduce and Spark together in production?

• What about H2O, Impala, and other memory-intensive frameworks?

• Batch + Interactive = Challenges

• Specific issues and solutions

• Ongoing Challenges: Keeping Things Running

• Perspective: Hadoop as a Service versus DIY*

* do it yourself

Page 3

ALTISCALE PERSPECTIVE: INFRASTRUCTURE NERDS

• Experienced Technical Yahoos
  • Raymie Stata, CEO. Former Yahoo! CTO, advocate of the Apache Software Foundation
  • David Chaiken, CTO. Former Yahoo! Chief Architect
  • Charles Wimmer, Head of Operations. Former Yahoo! SRE

• Hadoop as a Service, built and managed by Big Data, SaaS, and enterprise software veterans
  • Yahoo!, Google, LinkedIn, VMWare, Oracle, ...

Page 4

SOLVED: COST-EFFECTIVE DATA SCIENCE AT SCALE

But how do you make it easier for data scientists?

Two bad options:

1. Use Hadoop directly, using unfamiliar and unproductive command-line tools and APIs

2. Use Hadoop indirectly via a back-and-forth with data engineers who translate needs into Hadoop programs

Page 5

COMMON HADOOP WORKFLOW

[Diagram: the data scientist's workflow. Source data (CSV) is cleansed and flattened (via Hive), then explored and modeled, and the results move into production and serving.]

Page 6

ENTER SPARK... AND IMPALA AND H2O

• Interactive, iterative analysis

• Quick turns

• Memory heavy

Page 7

DOES THIS MEAN THAT MAPREDUCE DOESN’T MATTER ANYMORE?

HA! (Don’t believe the hype.)

Page 8

IT MATTERS SO MUCH THAT YOU WANT BOTH ON ONE CLUSTER.

[Diagram: the big data modeling workflow. Source data (CSV) is cleansed and flattened (via Hive), explored and modeled, and moved into production and serving.]

Page 9

THE CHALLENGE...

“Why is my Spark job not starting?”

“Why is my Spark job consuming so many resources?”

Resource conflicts!

Page 10

SPECIFIC ISSUES AND SOLUTIONS

Page 11

INTERACTIVE: INCREASE CONTAINER SIZE

Challenge: Memory-intensive systems take as much local DRAM as is available.

Solutions:
• Spark and H2O: Increase the YARN container memory size
• Impala: Box it in using operating system containers
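For Spark on YARN, the container memory knobs live in spark-defaults.conf (or on the spark-submit command line). A minimal sketch; the values below are illustrative, not recommendations, and the overhead property name is the Spark 1.x form:

```properties
# spark-defaults.conf -- illustrative values only
# Heap for each executor JVM
spark.executor.memory               4g
# Extra off-heap headroom YARN grants on top of the heap
# (Spark 1.x name; later releases use spark.executor.memoryOverhead)
spark.yarn.executor.memoryOverhead  1024
# Driver heap when running in yarn-cluster mode
spark.driver.memory                 2g
```

The executor heap plus the overhead together must fit inside the YARN maximum container size, or the application will not be scheduled.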

Page 12

HIVE + INTERACTIVE: WATCH OUT FOR LARGE CONTAINER SIZE

• Caution: Larger YARN container settings for interactive jobs may not be right for batch systems like Hive

• Container size needs to combine vcores and memory:
  yarn.scheduler.maximum-allocation-vcores
  yarn.nodemanager.resource.cpu-vcores
  ...
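Those vcore and memory limits are set in yarn-site.xml. A sketch with illustrative values (tune to your node hardware):

```xml
<!-- yarn-site.xml: illustrative values only -->
<configuration>
  <!-- Total memory and vcores each NodeManager offers to containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>49152</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>12</value>
  </property>
  <!-- Largest single container the scheduler will grant -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>16384</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>8</value>
  </property>
</configuration>
```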

Page 13

HIVE + INTERACTIVE: WATCH OUT FOR FRAGMENTATION

• Caution: Scheduling interactive systems alongside batch systems like Hive may result in fragmentation

• Interactive systems may require all-or-nothing scheduling

• Batch jobs with many small tasks may starve interactive jobs

Page 14

HIVE + INTERACTIVE: WATCH OUT FOR FRAGMENTATION

Solutions:

• Reserve interactive nodes before starting batch jobs

• Reduce interactive container size (if the algorithm permits)

• Node labels (YARN-2492) and gang scheduling (YARN-624)

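Once node labels (YARN-2492) are available in your YARN release, reserving nodes for interactive work looks roughly like this. A sketch; the `interactive` label and queue names are hypothetical:

```xml
<!-- yarn-site.xml: enable node labels (illustrative) -->
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>

<!-- capacity-scheduler.xml: let a hypothetical "interactive"
     queue run on nodes carrying the "interactive" label -->
<property>
  <name>yarn.scheduler.capacity.root.interactive.accessible-node-labels</name>
  <value>interactive</value>
</property>
```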

Page 15

ONGOING CHALLENGES

Keeping things running...

Page 16

CHALLENGE: SECURITY

• Challenge: User management is not uniform
  • MapReduce: collaboration requires getting groups right
  • Hive: proxyuser settings have to be right for HiveServer2
  • Spark: application owner versus connected users
  • Impala: “I just gotta be me!”
  • As usual, watch out for cluster administrator accounts!
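The HiveServer2 proxyuser settings live in Hadoop’s core-site.xml. A sketch, assuming HiveServer2 runs as the `hive` service user; the host and group values are illustrative:

```xml
<!-- core-site.xml: allow the hive service user to impersonate
     end users connecting through HiveServer2 (illustrative values) -->
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>hiveserver2.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>analysts,engineers</value>
</property>
```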

• Challenge: Port and protocol management
  • Best security practice: open specific ports for specific protocols
  • Spark: “I just gotta be free!”
  • Spark improved between versions 1.0.2 and 1.1.0, but it is still confusing
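As of Spark 1.1, several of the ports Spark opens can be pinned in configuration, so firewalls can allow specific ports instead of wide ranges. A sketch; the port numbers are illustrative:

```properties
# spark-defaults.conf -- pin Spark's listening ports so the
# firewall can open them explicitly (values illustrative)
spark.driver.port        7078
spark.blockManager.port  7079
spark.ui.port            4040
```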

Page 17

CHALLENGE: WEB SERVING

• How to provide interactive services to business users?

• Concerns: security, variable resources, latency, availability

• Keep serving infrastructure separate from Hadoop

Page 18

CHALLENGE: RESOURCE ATTRIBUTION (BILLING)

• Accounting for long-running Spark, H2O, Impala clusters?

• Is reserving resources the same as using the resources?

• Trade-off: availability/response time vs. oversubscription.

Page 19

CHALLENGE: STABILITY VERSUS AGILITY

• Never-ending story: latest hotness versus SLAs*

• New system stability curve. Example:
  • SPARK-1476: 2GB limit in Spark for blocks

• Interoperation issues. Examples:
  • IMPALA-1416: Queries fail with metastore exception after upgrade and compute stats
  • HIVE-8627: Compute stats on a table from Impala caused the table to be corrupted

• Many issues come down to YARN container size and JVM heap size configuration

* service level agreements
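The container-size-versus-heap-size relationship behind that last bullet can be sketched as a rule of thumb: the JVM heap must leave headroom inside the YARN container for off-heap memory, or YARN will kill the container. The 20% headroom figure below is a common convention, not a Hadoop requirement:

```python
def jvm_heap_for_container(container_mb, headroom_fraction=0.2):
    """Suggest a JVM -Xmx value (in MB) for a given YARN container size.

    YARN kills containers whose total process memory exceeds the
    container allocation, so the heap must leave headroom for
    off-heap usage (thread stacks, metaspace, native buffers).
    The 20% default headroom is a rule of thumb, not a standard.
    """
    if container_mb <= 0:
        raise ValueError("container size must be positive")
    return int(container_mb * (1 - headroom_fraction))

# A 2048 MB container leaves room for roughly a 1638 MB heap,
# e.g. mapreduce.map.memory.mb=2048 with -Xmx1638m in
# mapreduce.map.java.opts.
print(jvm_heap_for_container(2048))
```

Setting the heap equal to the container size is the classic mistake: the job runs until off-heap allocations push it over the limit, and YARN terminates it mid-task.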

Page 20

PERSPECTIVE: HADOOP AS A SERVICE VERSUS DIY (DO IT YOURSELF)

• Data Scientists and Data Engineers: use the right tools for the right job

• Data Scientists and Data Engineers: don’t spend your time on cluster maintenance

• Hadoop as a Service: have your cake and eat it, too
  • Benefit from the experiences of other customers
  • One size does not fit all, but one configuration schema does
  • Leave the maintenance to us infrastructure nerds

Page 21

QUESTIONS? COMMENTS?