[150824]symposium v4

Hadoop MapReduce

How to Survive Out-of-Memory Errors

Member: Yoonseung Choi Soyeong Park

Faculty Mentor: Prof. Harry XuStudent Mentor: Khanh Nguyen

The International Summer Undergraduate Research Fellowship

Outline

• Introduction• What is MapReduce?• How does MapReduce work?• Limitations of MapReduce• What are our goals?• Operation test• Conclusions

“There was 5 exabytes of information created between the dawn of civilization through 2003,

But that much information is now created every 2 days, and the pace is increasing...”

- Eric Schmidt, The Former Google CEO

Data scientists want to analyze these large data sets

But single machines have limitations in processing these data sets

How can we handle that?

Furthermore, data sets are now growing very rapidly

We don’t wantto understand

parallelization,fault tolerance,

data distribution,and load balancing!

Distributed processing

Therefore, we purposeThe ‘MapReduce’

parallelizationfault tolerance

data distributionload balancing

MapReduce is a programming model for processing large data sets

Many real world tasks are expressible in this model

The model is easy to use, even for programmers without experience with parallel and distributed systems

[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”.

* https://en.wikipedia.org/wiki/Apache_Hadoop

MapReduce Layer

HDFS Layer

What is MapReduce?Mapper takes an input

and produces a set of intermediate key/value pairs

Reducer merges together these intermediate values associated with the same intermediate key

[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. p.12

How does MapReduce work?The cat sees the dog, and the dog sees the cat.

The cat sees the dog

And the dog sees the cat

cat, 1dog, 1sees, 1the, 2

and, 1

- Wordcount program - A sentence is split into two map tasks

Map Phase

ReducePhase

Limitations of MapReduce

There are many reasons for poor performanceAnd even experts sometimes can’t figure them out

What are our goals?• Research Out-of-Memory Error(OOM) cases • Document OOM cases • Implement and simulate StackOverflow OOM cases• Develop solutions for such OOM cases

… all done!!

Two Categories

1. Inappropriate Configuration Configuration which causes poor performance

2. Large Intermediate Results Temporary data structure grows too large

[3] Lijie Xu, “An Empirical study on real-world OOM cases in MapReduce jobs, Chinese Academy of Sciences.

Operation test environments1. Standalone & Pseudo-distributed mode

- ‘14 MacBook Pro, 2.8 GHz Intel Core i5 8GB 1600 MHz DDR3, 500GB HDD

- ‘12 MacBook Air 1.4, GHz Intel Core i5 4GB 1600 MHz DDR3, 256GB HDD

2. Fully-distributed mode- Raspberry Pi 2 Model B (3 nodes)

A quad-core ARM Cortex-A7 CPU (1Ghz Overclock)1GB 500MHz SDRAM, 64GB HDD, 100Mbps

Ethernet

Split size variation [Single node]

* ‘14 MacBook Pro 2.8, GHz Intel Core i5, 8GB 1600 MHz DDR3, 500GB SSD

Input: StackOverflow’s users profiles (1GB)

16 32 64 128 2560

200173.3

47.326.7 24.3

86.364.7

78.759

16 32 64 128 2560

200169.7

64.7 48.737.7

77.7 55 39

[ Distributed grep (no Reducer) ][ Standard deviation of users’ age ]

Standalone Pseudo-distributed(2Mapper 2Reducer)

Pseudo-distributed(4Mapper 4Reducer)

Split size variation [Single node]

* ‘12 MacBook Air 1.4, GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD

Input: StackOverflow’s Comments (8.5GB)

16 32 64 128 256250

1450 1577.7

425411

1586.3

634454.3

540.3397.7

Standard deviation of comment’s text length

Count Min and Max value

Standalone Pseudo-distributed(2Mapper 2Reducer)

Pseudo-distributed(4Mapper 4Reducer)

16 32 64 128 256250

398 389.3

610.7 612418.7

609 488

362.7 254.3

[ ](MB)(MB)

(sec) (sec)

Split size variation [Fully-distributed] Input: StackOverflow’s users profiles (1GB)

32 64 128 2560

375396

442548

313 296350

* Raspberry Pi 2 Model B (3 nodes) A quad-core ARM Cortex-A7 CPU (1Ghz Overclock) 1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet

16 32 64 128 2560

462.7 428.7476.7

561.7 604

333.3 303345 339.7

[ Distributed grep (no Reducer) ][ average users’ age based on countries ]

6 Mapper 12 Mapper

(MB)(MB)

(sec) (sec)

io.sort.mb variation [Single node]

* ‘12 MacBook Air 1.4, GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD

Input: StackOverflow’s Comments (8.5GB)Test program: Standard deviation of comment’s text length

20 40 80 160 320600

900872

827 814 798 803.7

661 638.7 632 629.7 629.7633.7 641 635.7 629.3 629

Stand alone Pseudo distributed; 2M2R Pseudo distributed; 4M2R

I am working well with small datasets like 200-500MB. But for datasets above 1GB, I am getting an error like this:

* http://stackoverflow.com/questions/23042829/getting-java-heap-space-error-while-running-a-mapreduce-code-for-large-dataset

2. Large Intermediate Results

Problem Investigation

SplitedInput files

Task 1Task 2Task 3Task 4Task 5

[K, V]

The Mapper

Intermediatekey/value pairs

almost1 GB

Problem Investigation

[K, V]

The Reducer

Intermediatekey/value pairs

almost1 GB

I just have 1GB heap

space!

almost1 GB

Java heap can’t contain intermediate data structure

Configuration was:1.3GB Input, 256MB Split size, 1024MB Java Heap Space

Error: Java heap space

Summary of Solutions• Modify the configuration parameters

• Alter the program’s algorithm: Some alternative solution was suggested from the site -> Succeed with original version failed Configuration ( 256MB Split size & 1024MB Java heap size )

Java Heap size 1024MB 2048MB

Split size128 MB Successful Successful256 MB Failed Successful

Conclusions• How to solve the poor performance

1. Adjust ‘split size’ & ‘sort space’ - the more size, the less time to spend2. Adjust the number of Mapper - Utilize all CPU Cores - Larger number of mapper not always right

• If intermediate data structure is too large, - Modify the configuration parameter or - Alter the program’s algorithm

References[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. [Online]. Available: http://static.googleusercontent.com/media/research.google.com/ko//archive/mapreduce-osdi04.pdf

[2] 한기용 , Do it! 직접 해보는 하둡 프로그래밍 . Seoul: EasysPublishing, 2013.

[3] Lijie Xu, “An Empirical study on real-world OOM cases in MapReduce jobs, Chinese Academy of Sciences.

[4] Donald Miner and Adam Shook, MapReduce Design Patterns. O’Reilly Media. Inc, 2012.

Thank YouAnd if you want to know more technical information,please enter our GitHub repository. Our project is Open Source.https://github.com/I-SURF-Hadoop/MapReduce

appendixHow does MapReduce really work?

How does MapReduce work?

[ Map Phase ]cat, 1

dog, 1

sees, 1

the, 2

Combining & Sorting

The cat sees the dog, and the dog sees the cat.

the, 1

cat, 1

sees, 1

the, 1

dog, 1

MapReduce library first splits the input into M pieces.

A map worker processes these pieces using a user-defined Map function. Intermediate key/value pairs will be produced by this function.

The cat sees the dog

How does MapReduce work?The cat sees the dog, and the dog sees the cat.

sees, 2the, 4

cat, 2dog, 2

and, 1[ Reduce Phase ] When a reduce worker has read all intermediate data, it sorts them by the intermediate keys.

The reduce worker iterates the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the values to the user’s Reduce function.

and, 1

Shuffling

Two independent reducer

[150824]symposium v4

Engineering

Transcript of [150824]symposium v4

Benetton v4

Symposium on Innovation & Technologyhketa.org.hk/image/data/Symposium/2017 Summary v4.pdfSymposium on Innovation & Technology - Oct 13, 2017 in Theatre 1 of Hong Kong Convention &

v4-19-Release & v4-20-Release

Dolgozat v4

Ad layout V4 - redhillaerodrome.co.uk layout V4.pdf · AERODROME LAYOUT. Title: Ad layout V4 Created Date: 20181130172511Z ...

The USFT OBD-V4 USFleetTracking OBD-V4 Specs

LEED v4 and Beyond - ChamberMastercloud.chambermaster.com/userfiles/UserFiles/chambers/9079/File/... · LEED v4 LEED v4 . LEED v4: Scope of ... Scope of Market Sectors . 12 Building

14. Alpen Adria Jugendturnier Florett weiblich - OEFV...COJOCARI Maria MDA MOLDAWIEN 1 D1 V4 D V3 D2 V4 V4 4 20 14 0.500 6 4. FABIANEK Marie AUT STLFC 2 V4 V4 D2 V4 D2 V4 V4 5 24 8

jknperak.moh.gov.myjknperak.moh.gov.my/v4/images/stories/berita/jkn-med/symposium mental... · pendidikan psikiatri Taiping . WELCOME MESSAGE It is with great pleasure that on behalf

ECHOLINK TORNADO V4 LITE ECHOLINK TORNADO V4 SUPER

150824- Hydraulic Specification Rev Acompleteconstructionsaust.com.au/.../Hydraulic-Specification-Rev-A.pdf · hydraulic services specification - for tender montassori els - block

APRILIA TUONO V4 1100 - SoyMotero.net Aprilia... · tuono v4 1100 rr tuono v4 1100 factory. aprilia tuono v4 1100 . la aprilia de tus sueÑos. ahora 15.199 ...

150824 schwartz russiacontribchina web

PROS 2013 SMS Audit Results v4 - acsf.aeroacsf.aero/assets/2014-Symposium-Presentations/ARGUS-2013-Safet… · ARGUS International, Inc. (ARGUS) is the industry leader in prov iding

150824 La Verdad- El Puesto de Control Aduanero Funciona Con 26 Policías Menos

150824 het lukt ons leeuwarden

150824 catalogue B-Line 2015 - cooperfrance.com · Catalogue Général B-Line SERIES. Venez découvrir Eaton aujourd’hui. Powering business worldwide Eaton est une société de

Symposium Flyer v4 - csanz.edu.au€¦ · Title: Symposium Flyer v4 Created Date: 7/15/2014 2:57:10 PM

UNIVERSITY OF DELHIexam.du.ac.in/pdf/08082017/08082017_LLB_LC-I_IVth_Term_RESULT_… · university of delhi ... 150824 abhishek kaushik 310 ... 150960 ashish kumar shah 294 150961

LEED v4 - controlservices.comcontrolservices.com/documents/LEED v4 User Guide_Final_0.pdf · LEED v4 User Guide Access the FREE Introduction to LEED v4 Webinar Series Purchase the