Introduction to Map-Reduce

Vincent Leroy

Sources

•  Apache Hadoop
•  Yahoo! Developer Network
•  Hortonworks
•  Cloudera
•  Practical Problem Solving with Hadoop and Pig

«Big Data»

•  Google, 2008
   – 20 PB/day
   – 180 GB/job (variable)
•  Web index
   – 50B pages
   – 15 PB
•  Large Hadron Collider (LHC) @ CERN: produces 15 PB/year

Capacity of a (large) server

•  RAM: 256 GB
•  Hard drive capacity: 24 TB
•  Hard drive throughput: 100 MB/s

Solution: Parallelism

•  1 server
   – 8 disks
   – Read the Web: 230 days
•  Hadoop Cluster @ Yahoo
   – 4000 servers
   – 8 disks/server
   – Read the Web in parallel: 1h20
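The orders of magnitude above can be checked with a small calculation. This is only a sketch: it assumes the 15 PB Web index and the 100 MB/s disk throughput from the earlier slides, decimal units, and perfectly parallel reads; the class and method names are made up for illustration. Under these assumptions it gives about 217 days for one server and about 78 minutes for the cluster, in line with the rounded figures on the slide.

```java
class ReadTheWeb {
    // Time to scan `bytes` when every disk of every server streams in parallel.
    static double scanSeconds(double bytes, int servers, int disksPerServer,
                              double diskBytesPerSecond) {
        return bytes / (servers * (double) disksPerServer * diskBytesPerSecond);
    }

    public static void main(String[] args) {
        double web = 15e15;   // 15 PB Web index (decimal units, assumption)
        double disk = 100e6;  // 100 MB/s per disk
        double oneServer = scanSeconds(web, 1, 8, disk);
        double cluster = scanSeconds(web, 4000, 8, disk);
        System.out.printf("1 server: %.0f days%n", oneServer / 86400);
        System.out.printf("4000 servers: %.0f minutes%n", cluster / 60);
    }
}
```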

Google datacenter

Pitfalls in parallelism

•  Synchronization
   – Mutex, semaphores…
•  Difficulties
   – Deadlocks
   – Optimization
   – Costly (experts)
   – Not reusable

Programming models

•  Shared memory (multicores)
•  Message passing (MPI)

Fault tolerance

•  A server fails every few months
•  1000 servers…
   – MTBF (mean time between failures) < 1 day
•  A big job may take several days
   – There will be failures, this is normal
   – Computations should finish within a reasonable time
     → You cannot start over in case of failures
•  Checkpointing, replication
   – Hard to implement correctly
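A rough way to see why the cluster MTBF drops below one day: if failures are independent, the expected time between failures somewhere in the cluster is the per-server MTBF divided by the number of servers. The sketch below assumes "every few months" means about 90 days, which is an illustrative value and not a figure from the slides.

```java
class ClusterMtbf {
    // Expected time between failures anywhere in the cluster, assuming
    // independent servers that all have the same MTBF.
    static double clusterMtbfHours(double serverMtbfDays, int servers) {
        return serverMtbfDays * 24.0 / servers;
    }

    public static void main(String[] args) {
        // 90 days per server is an assumed stand-in for "every few months".
        double hours = clusterMtbfHours(90, 1000);
        System.out.printf("1000 servers: one failure every %.2f hours%n", hours);
    }
}
```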

Big Data Platform

•  Let everyone write programs for massive data sets
   – Encapsulate parallelism
      •  Programming model
      •  Deployment
   – Encapsulate fault tolerance
      •  Detect and handle failures

→ Code once (experts), benefit to all

MAP-REDUCE MODEL

What are Map and Reduce?

•  2 simple functions inspired by functional programming
   – Transformation: map
        map(f, [x1, …, xn]) = [f(x1), …, f(xn)]
        Ex: map(*2, [1,2,3]) = [(* 2 1), (* 2 2), (* 2 3)] = [2,4,6]
   – Aggregation: reduce
        reduce(f, [x1, …, xn]) = f(x1, f(x2, f(x3, … f(xn-1, xn))))
        Ex: reduce(+, [2,4,6]) = (+ 2 (+ 4 6)) = 12

What are Map and Reduce?

•  Generic
   – Take a function as a parameter
•  Can be instantiated and combined to solve many different problems
   – map(toUpperCase, [“hello”, “data”]) = [“HELLO”, “DATA”]
   – reduce(max, [87, 12, 91]) = 91
•  The developer provides the function applied
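The four examples above can be reproduced with Java streams, as a minimal sketch. The class and method names are illustrative. Note that `Stream.reduce` folds from the left and expects an associative function, whereas the definition above nests to the right; for `+` and `max` both give the same result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapReduceBasics {
    // map(f, xs): apply f to every element independently.
    static List<Integer> doubleAll(List<Integer> xs) {
        return xs.stream().map(x -> x * 2).collect(Collectors.toList());
    }

    static List<String> upperAll(List<String> xs) {
        return xs.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    // reduce(f, xs): fold all elements into one value with a binary function.
    static int sum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }

    static int max(List<Integer> xs) {
        return xs.stream().reduce(Integer.MIN_VALUE, Integer::max);
    }

    public static void main(String[] args) {
        System.out.println(doubleAll(Arrays.asList(1, 2, 3)));
        System.out.println(sum(Arrays.asList(2, 4, 6)));
        System.out.println(upperAll(Arrays.asList("hello", "data")));
        System.out.println(max(Arrays.asList(87, 12, 91)));
    }
}
```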

Data as key/value pairs

•  MapReduce does not manipulate atomic pieces of data
   – Everything is a (Key, Value) pair
   – Key and value can be of any type
      •  Ex: (Hello, 17)
         – Key = Hello, type text
         – Value = 17, type int
•  When initial data is not key/value, interpret it as key/value
   – Input text file becomes [(#line, line_content) …]

Map-Reduce on Key-Value pairs

•  Map and Reduce adjusted to Key-Value pairs
   – In map, f is applied independently on every key/value pair
        f(key, value) → list(key, value)
   – In reduce, f is applied to all values associated with the same key
        f(key, list(value)) → list(key, value)
   – The types of keys and values taken as input do not have to be the same as the output

Example: Counting frequency of words

•  Input: A file of 2 lines
   – 1, "a b c aa b c"
   – 2, "a bb cc a cc b"
•  Output
   – a, 3
   – b, 3
   – c, 2
   – aa, 1
   – bb, 1
   – cc, 2

Word frequency: Mapper

•  Map processes a portion (line) of text
   – Split words
   – For each word, count one occurrence
   – Key not used in this example (line number)

•  map(Int lineNumber, Text line, Output output) {
       foreach word in line.split(space) {
           output.write(word, 1)
       }
   }

Word frequency: Reducer

•  For each key, reduce processes all the corresponding values
   – Add number of occurrences

•  reduce(String word, List<Int> occurrences, Output output) {
       int count = 0
       foreach int occ in occurrences {
           count += occ
       }
       output.write(word, count)
   }

Execution flow

Input:
   1, "a b c aa b c"
   2, "a bb cc a cc b"

Map output:
   line 1 → a,1  b,1  c,1  aa,1  b,1  c,1
   line 2 → a,1  bb,1  cc,1  a,1  cc,1  b,1

Reduce input (grouped by key) → Reduce output:
   a,  [1,1,1] → a, 3
   b,  [1,1,1] → b, 3
   c,  [1,1]   → c, 2
   aa, [1]     → aa, 1
   bb, [1]     → bb, 1
   cc, [1,1]   → cc, 2
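The flow above can be simulated in memory on a single machine, which may help when reasoning about what each phase receives. This is a sketch in plain Java without Hadoop: the class name and the `group` helper (a stand-in for shuffle and sort) are illustrative assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class WordCountFlow {
    // Map: emit (word, 1) for each word of one line; the line number is unused.
    static List<Map.Entry<String, Integer>> map(int lineNumber, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Stand-in for shuffle and sort: group all values by key, keys kept sorted.
    static SortedMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: add up the occurrences of one word.
    static int reduce(String word, List<Integer> occurrences) {
        int count = 0;
        for (int occ : occurrences) {
            count += occ;
        }
        return count;
    }

    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        int lineNumber = 1;
        for (String line : lines) {
            mapped.addAll(map(lineNumber++, line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        group(mapped).forEach((word, ones) -> result.put(word, reduce(word, ones)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b c aa b c", "a bb cc a cc b")));
    }
}
```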

How to build a Web index?

•  Initial data: (URL, web_page_content)
•  Goal: build inverted index

Grenoble
   https://fr.wikipedia.org/wiki/Grenoble
   http://www.grenoble.fr/
   http://www.grenoble-tourisme.com/
   http://wikitravel.org/en/Grenoble

UNIL
   http://www.unil.ch/
   https://fr.wikipedia.org/wiki/Universit%C3%A9_de_Lausanne
   https://twitter.com/unil
   http://www.formation-continue-unil-epfl.ch/

How to build a Web index?

•  map(URL pageURL, Text pageContent, Output output) {
       foreach word in pageContent.parse() {
           output.write(word, pageURL)
       }
   }

How to build a Web index?

•  reduce(Text word, List<URL> webPages, Output output) {
       postingList = initPostingList()
       foreach url in webPages {
           postingList.add(url)
       }
       output.write(word, postingList)
   }
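The map and reduce pseudocode for the inverted index can be tried out on a few pages in memory. The sketch below is plain Java, not Hadoop code. The page contents are made up, the tokenizer (lowercase, split on non-word characters) is an assumption standing in for `pageContent.parse()`, and the posting list is a sorted set, so a URL appears once per word even if the word occurs several times on the page.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    static SortedMap<String, SortedSet<String>> build(Map<String, String> pages) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            // map: emit (word, pageURL) for each word of the page
            for (String word : page.getValue().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // group by key + reduce: add the URL to the word's posting list
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(page.getKey());
            }
        }
        return index;
    }

    // Three URLs from the slide with invented page contents.
    static SortedMap<String, SortedSet<String>> demo() {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://www.grenoble.fr/", "Grenoble city hall");
        pages.put("http://wikitravel.org/en/Grenoble", "Travel guide to Grenoble");
        pages.put("http://www.unil.ch/", "UNIL university of Lausanne");
        return build(pages);
    }

    public static void main(String[] args) {
        demo().forEach((word, urls) -> System.out.println(word + " -> " + urls));
    }
}
```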

APACHE HADOOP: MAPREDUCE FRAMEWORK

Objective of Hadoop MapReduce

•  Provide a simple and generic programming model: map and reduce
•  Deploy execution automatically
•  Provide fault tolerance
•  Scale to thousands of machines
•  Performance is important but not the priority
   – What's important is that jobs finish within reasonable time
   – If it's too slow, add servers! Kill It With Iron (KIWI principle)

Architecture

•  From a monolithic architecture to composable layers

Execution steps

Shuffle & Sort: group by key and transfer to reducer

Shuffle & Sort

•  Barrier in the execution
   – All map tasks must complete before starting reduce
•  Partitioner to assign keys to servers executing reduce
   – Ex: hash(key) % nbServers
   – Deal with load balancing
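A minimal sketch of the `hash(key) % nbServers` rule in Java. One detail is easy to miss: `hashCode()` can be negative and Java's `%` then returns a negative remainder, so the sign bit is masked before taking the modulo. The class and method names are illustrative.

```java
class HashPartitionSketch {
    // Returns a reducer index in [0, nbServers) for any key.
    static int partition(Object key, int nbServers) {
        // Mask the sign bit: a plain hashCode() % nbServers could be negative.
        return (key.hashCode() & Integer.MAX_VALUE) % nbServers;
    }

    public static void main(String[] args) {
        int nbServers = 4;
        for (Object key : new Object[] {"a", "bb", "cc", -7}) {
            System.out.println(key + " -> reducer " + partition(key, nbServers));
        }
    }
}
```

The same key always lands on the same reducer, which is what makes grouping by key possible. Load balancing still depends on how evenly the keys and their value counts are distributed.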

Combiner

•  Potential problem of a map function: many key/value pairs in the output
   – Materialized to disk, sent to the reducer over the network
   – Costly step of the execution
•  Add an operator: Combiner
   – Mini-reducer executed on the data produced by map on a single machine to start aggregating it
•  Combiner may be used by Hadoop (optional)
   – The correctness of the program should not depend on it

Combiner

Map       Key   Value
Input     MKI   MVI
Output    MK0   MV0

Reduce    Key   Value
Input     RKI   RVI
Output    RK0   RV0

Combiner

Map       Key   Value
Input     MKI   MVI
Output    MK0   MV0

Combine   Key   Value
Input     CKI   CVI
Output    CK0   CV0

Reduce    Key   Value
Input     RKI   RVI
Output    RK0   RV0


Combiner

Input:
   1, "a b c aa b c"
   2, "a bb cc a cc b"

Map output:
   line 1 → a,1  b,1  c,1  aa,1  b,1  c,1
   line 2 → a,1  bb,1  cc,1  a,1  cc,1  b,1

Combiner output:
   line 1 → a,1  b,2  c,2  aa,1
   line 2 → a,2  bb,1  cc,2  b,1

Reduce input (grouped by key) → Reduce output:
   a,  [1,2] → a, 3
   b,  [2,1] → b, 3
   c,  [2]   → c, 2
   aa, [1]   → aa, 1
   bb, [1]   → bb, 1
   cc, [2]   → cc, 2

Combiner

•  Same API as reduce(key, List<value>)
   – Not the same contract! For one key, you get SOME values
•  Often the same aggregation as reduce
   – E.g. WordCount
•  Different when using global properties
   – E.g. Keep words present at least 5 times
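The "at least 5 times" case can be illustrated with a small plain-Java sketch (no Hadoop; names and counts are hypothetical). The combiner only pre-sums the values it happens to see, while the threshold, a global property, is applied in the reducer, which sees all values for the key. Reusing the reducer as a combiner would wrongly drop partial counts below the threshold.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class CombinerContract {
    // Combiner: partial sum over SOME values of a key; it must never filter.
    static int combine(List<Integer> someValues) {
        int sum = 0;
        for (int v : someValues) {
            sum += v;
        }
        return sum;
    }

    // Reducer: sees ALL values of the key, so the global threshold belongs here.
    static Optional<Integer> reduce(List<Integer> allValues, int minCount) {
        int sum = combine(allValues);
        return sum >= minCount ? Optional.of(sum) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical word seen 3 times on one machine and 2 times on another.
        int fromMachineA = combine(Arrays.asList(1, 1, 1));
        int fromMachineB = combine(Arrays.asList(1, 1));
        System.out.println(reduce(Arrays.asList(fromMachineA, fromMachineB), 5));
        // Misusing reduce as a combiner on machine A alone would drop the word.
        System.out.println(reduce(Arrays.asList(1, 1, 1), 5));
    }
}
```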

Hadoop MapReduce as a developer

•  Provide the functions performed by Map and Reduce (Java, C++)
   – Application dependent
•  Define the data types (keys/values)
   – If not standard (Text, IntWritable…)
   – Functions for serialization
•  That's all.

Imports

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;

Do not use the old mapred API!

Mapper

// input key type, input value type, output key type, output value type
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every whitespace-separated word of the line
        for (String word : value.toString().split("\\s+")) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

Reducer

// input key type, input value type, output key type, output value type
public class WordCountReducer extends Reducer<Text, IntWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all the occurrences received for this word
        long sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}

Main

public class WordCountMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options; what remains is <input path> <output path>
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountMain.class);

        // Types emitted by the mapper, then types emitted by the reducer
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Writable example

public class StringAndInt implements WritableComparable<StringAndInt> {

    private IntWritable iw = new IntWritable();
    private Text t = new Text();

    // Hadoop instantiates Writables through the no-argument constructor
    public StringAndInt() {}

    public StringAndInt(String s, int i) {
        this.iw.set(i);
        this.t.set(s);
    }

    // Fields are written and read back in the same order
    @Override
    public void write(DataOutput out) throws IOException {
        this.iw.write(out);
        this.t.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.iw.readFields(in);
        this.t.readFields(in);
    }

    // Order by text first, then by integer
    @Override
    public int compareTo(StringAndInt o) {
        int c1 = this.t.compareTo(o.t);
        if (c1 != 0) {
            return c1;
        } else {
            return this.iw.compareTo(o.iw);
        }
    }
}
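The write/readFields contract can be exercised without a cluster. The sketch below is a plain-Java analogue, not the Hadoop class: it uses only `java.io` and a hypothetical `WritableRoundTrip` type with the same two methods, and serializes a value to a byte array and back. The point it illustrates is that `readFields` must consume the fields in exactly the order `write` produced them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableRoundTrip {
    String s = "";
    int i;

    WritableRoundTrip() {}

    WritableRoundTrip(String s, int i) {
        this.s = s;
        this.i = i;
    }

    // Same order in both methods: the int first, then the string.
    void write(DataOutput out) throws IOException {
        out.writeInt(i);
        out.writeUTF(s);
    }

    void readFields(DataInput in) throws IOException {
        i = in.readInt();
        s = in.readUTF();
    }

    // Serialize to bytes and rebuild a fresh object from them.
    static WritableRoundTrip roundTrip(WritableRoundTrip value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            value.write(new DataOutputStream(bytes));
            WritableRoundTrip copy = new WritableRoundTrip();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WritableRoundTrip copy = roundTrip(new WritableRoundTrip("Hello", 17));
        System.out.println(copy.s + ", " + copy.i);
    }
}
```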

Terminology

•  MapReduce program = job
•  Jobs are submitted to the JobTracker
•  A job is divided into several tasks
   – A Map is a task
   – A Reduce is a task
•  Tasks are monitored by TaskTrackers
   – A slow task is called a straggler

Job execution

•  $ hadoop jar wordcount.jar org.myorg.WordCount inputPath(HDFS) outputPath(HDFS)
•  Check parameters
   – Is there an output directory?
   – Does it already exist?
   – Is there an input directory?
•  Compute splits
•  The job (MapReduce code), its configuration and splits are copied with a high replication
•  An object to follow the progress of the tasks is created by the JobTracker
•  For each split, create a Map
•  Create default number of reducers

TaskTracker

•  The TaskTracker sends a periodic signal to the JobTracker
   – Shows that the node still functions
   – Tells whether the TaskTracker is ready to accept a new task
•  A TaskTracker is responsible for a node
   – Fixed number of slots for map tasks
   – Fixed number of slots for reduce tasks
   – Tasks can be from different jobs
•  Each task runs on its own JVM
   – Prevents a task crash from crashing the TaskTracker as well

Job Progress

•  A Map task reports on its progress, i.e. amount of the split processed
•  For a reduce task, 3 states
   – copy
   – sort
   – reduce
•  Report sent to the TaskTracker
•  Every 5 seconds, report forwarded to the JobTracker
•  User can see the JobTracker state through Web interface

Progress

End of Job

•  Output of each reducer written to a file
•  JobTracker notifies the client and writes a report for the job

14/10/28 11:54:25 INFO mapreduce.Job: Job job_1413131666506_0070 completed successfully
    Job Counters
        Launched map tasks=392
        Launched reduce tasks=88
        Data-local map tasks=392
        [...]
    Map-Reduce Framework
        Map input records=622976332
        Map output records=622952022
        Reduce input groups=54858244
        Reduce input records=622952022
        Reduce output records=546559709
        [...]

Server failure during a job

•  Bug in a task
   – task JVM crashes → TaskTracker JVM notified
   – task removed from its slot
•  Task becomes unresponsive
   – timeout after 10 minutes
   – task removed from its slot
•  Each task may be re-run up to N times (default 7) in case of crashes

HDFS: DISTRIBUTED FILE SYSTEM

Random vs Sequential disk access

•  Example
   – DB 100M users
   – 100 B/user
   – Alter 1% records
•  Random access
   – Seek, read, write: 30 ms
   – 1M users → 8h20
•  Sequential access
   – Read ALL, Write ALL
   – 2 x 10 GB @ 100 MB/s → 3 minutes

→ It is often faster to read all and write all sequentially
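The two estimates above follow from simple arithmetic, sketched here with the slide's numbers. The class and method names are illustrative, and the model is deliberately naive: a fixed cost per random access and a constant sequential throughput.

```java
class DiskAccessCost {
    // Random access: one seek + read + write per altered record.
    static double randomSeconds(long alteredRecords, double secondsPerRecord) {
        return alteredRecords * secondsPerRecord;
    }

    // Sequential access: read the whole database once and write it once.
    static double sequentialSeconds(double totalBytes, double bytesPerSecond) {
        return 2.0 * totalBytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        long users = 100_000_000L;          // 100M users
        double dbBytes = users * 100.0;     // 100 B/user = 10 GB
        long altered = users / 100;         // 1% of the records = 1M
        double random = randomSeconds(altered, 0.030);
        double sequential = sequentialSeconds(dbBytes, 100e6);
        System.out.printf("random: %.2f hours%n", random / 3600);
        System.out.printf("sequential: %.2f minutes%n", sequential / 60);
    }
}
```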

Distributed File System (HDFS)

•  Goal
   – Fault tolerance (redundancy)
   – Performance (parallel access)
•  Large files
   – Sequential reads
   – Sequential writes
•  “In place” data processing
   – Data is stored on the machines that process it
      •  Better usage of machines (no dedicated filer)
      •  Fewer network bottlenecks (better performance)

HDFS model

•  Data organized in files and directories
   → mimics a standard file system
•  Files divided in blocks (default: 64 MB) spread on servers
•  HDFS reports the data layout to the Map-Reduce framework
   → If possible, process data on the machines where it is already stored

Fault tolerance

•  File blocks replicated (default: 3) to tolerate failures
•  Placement according to different parameters
   – Power supply
   – Network equipment
   – Diverse servers to increase the probability of having a “close” copy
•  Checksum of data to detect corrupted blocks (also available in modern file systems)
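To get a feel for the block size and replication defaults quoted in these slides (64 MB, 3 replicas), the sketch below counts how many blocks a file occupies and how many block replicas the cluster stores. The file sizes are hypothetical examples and the names are illustrative.

```java
class BlockLayout {
    // Number of blocks needed for a file: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Total number of block replicas stored across the DataNodes.
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long block = 64 * mb;  // default block size from the slides
        System.out.println("1 GB file: " + blockCount(1024 * mb, block) + " blocks, "
                + replicaCount(1024 * mb, block, 3) + " replicas");
        System.out.println("100 MB file: " + blockCount(100 * mb, block) + " blocks");
    }
}
```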

Master/Worker architecture

•  A master, the NameNode
   – Manages the space of file names
   – Manages access rights
   – Supervises operations on files, blocks…
   – Supervises the health of the file system (failures, load balance…)
•  Many (1000s) workers, the DataNodes
   – Store the data (blocks)
   – Perform read and write operations
   – Perform copies (replication, ordered by the NameNode)

NameNode

•  Stores the metadata of each file and block (inode)
   – File name, directory, blocks associated, position of these blocks, number of replicas…
•  Keeps all in main memory (RAM)
   – Limiting factor = number of files
   – 60M objects in 16 GB
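The "60M objects in 16 GB" figure implies a memory budget per file or block object, which the short sketch below works out. It assumes decimal gigabytes and that the whole 16 GB is available for metadata; both are simplifications, and the names are illustrative.

```java
class NameNodeMemory {
    // Average RAM available per metadata object (file or block).
    static double bytesPerObject(double ramBytes, double objects) {
        return ramBytes / objects;
    }

    public static void main(String[] args) {
        double perObject = bytesPerObject(16e9, 60e6);
        System.out.printf("about %.0f bytes per object%n", perObject);
    }
}
```

This is why many small files are a problem for HDFS: each one consumes NameNode memory regardless of how little data it holds.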

DataNode

•  Manages and monitors the state of blocks stored on the host file system (often Linux)
•  Directly accessed by the clients
   → data never transits through the NameNode
•  Sends heartbeats to the NameNode to show that the server has not failed
•  Reports to the NameNode if blocks are corrupted

Writing a file

•  The client sends a query to the NameNode to create a new file
•  The NameNode checks
   – Client authorizations
   – File system conflicts (existing file…)
•  NameNode chooses DataNodes to store file and replicas
   – DataNodes “pipelined”
•  Blocks are allocated on these DataNodes
•  Stream of data sent to the first DataNode of the pipeline
•  Each DataNode forwards the data received to the next DataNode in the pipeline

Reading a file

•  Client sends a request to the NameNode to read a file
•  NameNode checks the file exists and builds a list of DataNodes containing the first blocks
•  For each block, NameNode sends the address of the DataNodes hosting them
   – List ordered wrt. proximity to the client
•  Client connects to the closest DataNode containing the 1st block of the file
•  Block read ends:
   – Close connection to the DataNode
   – New connection to the DataNode containing the next block
•  When all blocks are read:
   – Query the NameNode to retrieve the following blocks

HDFS Structure

[Figure: diagram of the HDFS structure]

HDFS commands (directories)

•  Create directory dir
      $ hadoop dfs -mkdir /dir
•  List HDFS content
      $ hadoop dfs -ls
•  Remove directory dir
      $ hadoop dfs -rmr /dir

HDFS commands (files)

•  Copy local file toto.txt to HDFS dir/
      $ hadoop dfs -put toto.txt dir/toto.txt
•  Copy HDFS file to local disk
      $ hadoop dfs -get dir/toto.txt ./
•  Read file /dir/toto.txt
      $ hadoop dfs -cat /dir/toto.txt
•  Remove file /dir/toto.txt
      $ hadoop dfs -rm /dir/toto.txt