Cloudera CCD


Page 1: Cloudera CCD

Cloudera CCD-333
Cloudera Certified Developer for Apache Hadoop
Version: 5.6

QUESTION NO: 1
What is a SequenceFile?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Answer: D
Explanation: A SequenceFile is a flat file consisting of binary key/value pairs. There are three different SequenceFile formats:
- Uncompressed key/value records.
- Record-compressed key/value records - only the values are compressed.
- Block-compressed key/value records - both keys and values are collected in blocks separately and compressed. The size of the block is configurable.
Reference: http://wiki.apache.org/hadoop/SequenceFile

QUESTION NO: 2
Given a directory of files with the following structure: line number, tab character, string.
Example:
1. abialkjfjkaoasdfjksdlkjhqweroij
2. kadf jhuwqounahagtnbvaswslmnbfgy
3. kjfteiomndscxeqalkzhtopedkfslkj
You want to send each line as one record to your Mapper. Which InputFormat would you use to complete the line: setInputFormat(________.class);
A. BDBInputFormat
B. KeyValueTextInputFormat
C. SequenceFileInputFormat
D. SequenceFileAsTextInputFormat
Answer: C
Explanation: Note: The output format for your first MR job should be SequenceFileOutputFormat - this will store the key/value output from the reducer in a binary format that can then be read back in, in your second MR job, using SequenceFileInputFormat.
Reference: http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop (see answer 1 and then see comment #1 for it)
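The setInputFormat(________.class) fragment in QUESTION NO: 2 is a JobConf call. As a minimal sketch (not part of the original questions), this is roughly how such an old-API driver might look; the class names, paths, and the TextInputFormat placeholder are illustrative assumptions, with the blank to be filled by whichever InputFormat the question calls for.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputFormatDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(InputFormatDriver.class);
        conf.setJobName("input-format-demo");

        // The blank from QUESTION NO: 2 - substitute the InputFormat the question calls for.
        conf.setInputFormat(TextInputFormat.class);
        // Writing SequenceFile output lets a follow-up job read it back with
        // SequenceFileInputFormat, as the explanation above describes.
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        // With the default identity mapper/reducer, TextInputFormat records pass through
        // as (LongWritable offset, Text line) pairs.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}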

Page 2: Cloudera CCD

QUESTION NO: 3
In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?
A. Increase the parameter that controls minimum split size in the job configuration.
B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
C. Set the number of mappers equal to the number of input files you want to process.
D. Write a custom FileInputFormat and override the method isSplittable to always return false.
Answer: D
Explanation: Note:
* // Do not allow splitting.
protected boolean isSplittable(JobContext context, Path filename) {
    return false;
}
* InputSplits: An InputSplit describes a unit of work that comprises a single map task in a MapReduce program. A MapReduce program applied to a data set, collectively referred to as a Job, is made up of several (possibly several hundred) tasks. Map tasks may involve reading a whole file; they often involve reading only part of a file. By default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS). You can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job. By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallelism. Even more importantly, since the various blocks that make up the file may be spread across several different nodes in the cluster, it allows tasks to be scheduled on each of these different nodes; the individual blocks are thus all processed locally, instead of needing to be transferred from one node to another. Of course, while log files can be processed in this piece-wise fashion, some file formats are not amenable to chunked processing. By writing a custom InputFormat, you can control how the file is broken up (or is not broken up) into splits.
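As a companion sketch (not from the original explanation), this is roughly what such a non-splittable input format could look like with the new API; note that the hook in org.apache.hadoop.mapreduce.lib.input.FileInputFormat is spelled isSplitable, with a single "t", and the class name here is hypothetical.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split: one input file -> one split -> one map task
    }
}

A new-API driver would then register it with job.setInputFormatClass(NonSplittableTextInputFormat.class).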

Page 3: Cloudera CCD

QUESTION NO: 4
Which of the following best describes the workings of TextInputFormat?
A. Input file splits may cross line breaks. A line that crosses file splits is ignored.
B. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
C. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
D. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
Answer: D
Explanation: As the Map operation is parallelized, the input file set is first split into several pieces called FileSplits. If an individual file is so large that it will affect seek time, it will be split into several splits. The splitting does not know anything about the input file's internal logical structure; for example, line-oriented text files are split on arbitrary byte boundaries. Then a new map task is created per FileSplit.
When an individual map task starts, it will open a new output writer per configured reduce task. It will then proceed to read its FileSplit using the RecordReader it gets from the specified InputFormat. InputFormat parses the input and generates key-value pairs. InputFormat must also handle records that may be split on the FileSplit boundary. For example, TextInputFormat will read the last line of the FileSplit past the split boundary and, when reading other than the first FileSplit, TextInputFormat ignores the content up to the first newline.
Reference: How Map and Reduce operations are actually carried out, http://wiki.apache.org/hadoop/HadoopMapReduce (Map, second paragraph)

QUESTION NO: 5

Page 4: Cloudera CCD

Which of the following statements most accurately describes the relationship between MapReduce and Pig?
A. Pig provides additional capabilities that allow certain types of data manipulation not possible with MapReduce.
B. Pig provides no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
C. Pig programs rely on MapReduce but are extensible, allowing developers to do special-purpose processing not provided by MapReduce.
D. Pig provides the additional capability of allowing you to control the flow of multiple MapReduce jobs.
Answer: D
Explanation: In addition to providing many relational and data flow operators, Pig Latin provides ways for you to control how your jobs execute on MapReduce. It allows you to set values that control your environment and to control details of MapReduce such as how your data is partitioned.
Reference: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html (topic: controlling execution)

QUESTION NO: 6
You need to import a portion of a relational database every day as files to HDFS, and generate Java classes to interact with your imported data. Which of the following tools should you use to accomplish this?
A. Pig
B. Hue
C. Hive
D. Flume
E. Sqoop
F. Oozie
G. fuse-dfs
Answer: E
Explanation: Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities:
- Imports individual tables or entire databases to files in HDFS
- Generates Java classes to allow you to interact with your imported data

Page 5: Cloudera CCD

- Provides the ability to import from SQL databases straight into your Hive data warehouse
Note: Data Movement Between Hadoop and Relational Databases
Data can be moved between Hadoop and a relational database as a bulk data transfer, or relational tables can be accessed from within a MapReduce map function.
Note:
* Cloudera's Distribution for Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports individual tables or entire databases into HDFS files. The tool also generates Java classes that support interaction with the imported data. Sqoop supports all relational databases over JDBC, and Quest Software provides a connector (i.e., OraOop) that has been optimized for access to data residing in Oracle databases.
Reference: http://log.medcl.net/item/2011/08/hadoop-and-mapreduce-big-data-analytics-gartner/ (Data Movement between Hadoop and relational databases, second paragraph)

QUESTION NO: 7
You have an employee who is a Data Analyst and is very comfortable with SQL. He would like to run ad-hoc analysis on data in your HDFS cluster. Which of the following is a data warehousing software built on top of Apache Hadoop that defines a simple SQL-like query language well-suited for this kind of user?
A. Pig
B. Hue
C. Hive
D. Sqoop
E. Oozie
F. Flume
G. Hadoop Streaming
Answer: C
Explanation: Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDFs),

Page 6: Cloudera CCD

aggregations (UDAFs), and table functions (UDTFs).
Reference: https://cwiki.apache.org/Hive/ (Apache Hive, first sentence and second paragraph)

QUESTION NO: 8
Workflows expressed in Oozie can contain:
A. Iterative repetition of MapReduce jobs until a desired answer or state is reached.
B. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.
C. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.
D. Sequences of MapReduce and Pig. These sequences can be combined with other actions including forks, decision points, and path joins.
Answer: D
Reference: http://incubator.apache.org/oozie/docs/3.1.3/docs/WorkflowFunctionalSpec.html (workflow definition, first sentence)

QUESTION NO: 9
You need a distributed, scalable, data store that allows you random, realtime read/write access to hundreds of terabytes of data. Which of the following would you use?
A. Hue
B. Pig
C. Hive
D. Oozie
E. HBase
F. Flume
G. Sqoop
Answer: E
Explanation: Use Apache HBase when you need random, realtime read/write access to your Big Data.
Note: This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided

Page 7: Cloudera CCD

by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features:
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables.
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy to use Java API for client access.
- Block cache and Bloom Filters for real-time queries.
- Query predicate push down via server side Filters.
- Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
- Extensible jruby-based (JIRB) shell.
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.
Reference: http://hbase.apache.org/ (When would I use HBase? First sentence)

QUESTION NO: 10
Which of the following utilities allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
Answer: D
Explanation: Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Reference: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop Streaming, second sentence)

QUESTION NO: 11
What is the preferred way to pass a small number of configuration parameters to a mapper or reducer?
A. As key-value pairs in the JobConf object.
B. As a custom input key-value pair passed to each mapper or reducer.
C. Using a plain text file via the DistributedCache, which each mapper or reducer reads.

Page 8: Cloudera CCD

D. Through a static variable in the MapReduce driver class (i.e., the class that submits the MapReduce job).
Answer: A
Explanation: In Hadoop, it is sometimes difficult to pass arguments to mappers and reducers. If the number of arguments is huge (e.g., big arrays), DistributedCache might be a good choice. However, here, we're discussing small arguments, usually a handful of configuration parameters.
In fact, the way to configure these parameters is simple. When you initialize the JobConf object to launch a MapReduce job, you can set the parameter by using the "set" method like:

JobConf job = (JobConf) getConf();
job.set("NumberOfDocuments", args[0]);

Here, "NumberOfDocuments" is the name of the parameter and its value is read from args[0], a command-line argument.
Reference: Passing Parameters and Arguments to Mapper and Reducer in Hadoop

QUESTION NO: 12
Given a Mapper, Reducer, and Driver class packaged into a jar, which is the correct way of submitting the job to the cluster?
A. jar MyJar.jar
B. jar MyJar.jar MyDriverClass inputdir outputdir
C. hadoop jar MyJar.jar MyDriverClass inputdir outputdir
D. hadoop jar class MyJar.jar MyDriverClass inputdir outputdir
Answer: C
Explanation: Example: Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
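As a hedged companion to the QUESTION NO: 11 snippet above, the sketch below shows how the mapper side might read the "NumberOfDocuments" parameter back out of the JobConf in its configure() method; the mapper's key/value types and the field name are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ParameterAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    private int numberOfDocuments;

    @Override
    public void configure(JobConf job) {
        // getInt falls back to the default (here 0) if the parameter was never set.
        numberOfDocuments = job.getInt("NumberOfDocuments", 0);
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        // Use numberOfDocuments however the job logic requires; here it is just emitted.
        output.collect(new Text(value), new LongWritable(numberOfDocuments));
    }
}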

Page 9: Cloudera CCD

QUESTION NO: 13
What is the difference between a failed task attempt and a killed task attempt?
A. A failed task attempt is a task attempt that threw an unhandled exception. A killed task attempt is one that was terminated by the JobTracker.
B. A failed task attempt is a task attempt that did not generate any key-value pairs. A killed task attempt is a task attempt that threw an exception, and thus was killed by the execution framework.
C. A failed task attempt is a task attempt that completed, but with an unexpected status value. A killed task attempt is a duplicate copy of a task attempt that was started as part of speculative execution.
D. A failed task attempt is a task attempt that threw a RuntimeException (i.e., the task fails). A killed task attempt is a task attempt that threw any other type of exception (e.g., IOException); the execution framework catches these exceptions and reports them as killed.
Answer: C
Explanation: Note:
* Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed. Failed tasks are tasks that error out.
* There are a few reasons Hadoop can kill tasks on its own:
a) The task does not report progress during the timeout (default is 10 minutes).
b) The FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler).
c) Speculative execution causes the results of the task not to be needed since it has completed elsewhere.
Reference: Difference failed tasks vs killed tasks

QUESTION NO: 14
Custom programmer-defined counters in MapReduce are:
A. Lightweight devices for bookkeeping within MapReduce programs.
B. Lightweight devices for ensuring the correctness of a MapReduce program. Mappers increment counters, and reducers decrement counters. If at the end of the program the counters read zero, then you are sure that the job completed correctly.
C. Lightweight devices for synchronization within MapReduce programs. You can use counters to coordinate execution between a mapper and a reducer.
Answer: A
Explanation: Counters are a useful channel for gathering statistics about the job: for quality control, or for application-level statistics. They are also useful for problem diagnosis. Hadoop maintains some built-in counters for every job, which report various metrics for your job. Hadoop MapReduce also allows the user to define a set of user-defined counters that can be incremented (or decremented by specifying a negative value as the parameter) by the driver, mapper, or the reducer.
Reference: Iterative MapReduce and Counters, Introduction to Iterative MapReduce and Counters, http://hadooptutorial.wikispaces.com/Iterative+MapReduce+and+Counters (counters, second paragraph)
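To make the counter mechanics concrete, here is a hedged old-API sketch (not part of the original explanation) of a mapper incrementing a programmer-defined counter; the "Quality"/"EMPTY_LINES" group and counter names are arbitrary examples.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        if (value.toString().trim().isEmpty()) {
            // Increment a programmer-defined counter; group and name are arbitrary strings.
            reporter.incrCounter("Quality", "EMPTY_LINES", 1);
            return;
        }
        output.collect(new Text(value), new LongWritable(1));
    }
}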

Page 10: Cloudera CCD

QUESTION NO: 15
Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory.
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.
Answer: A
Explanation: Note:
* Join algorithms in MapReduce:
A) Reduce-side join
B) Map-side join
C) In-memory join (striped variant, memcached variant)
* Which join to use?
In-memory join > map-side join > reduce-side join
* Limitations of each?
In-memory join: memory
Map-side join: sort order and partitioning
Reduce-side join: general purpose

QUESTION NO: 16
To process input key-value pairs, your mapper needs to load a 512 MB data file in memory. What is the best way to accomplish this?
A. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
B. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
C. Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.
D. Serialize the data file, insert it in the JobConf object, and read the data into memory in the configure method of the mapper.

Page 11: Cloudera CCD

Answer: B
Explanation: Hadoop has a distributed cache mechanism to make files that may be needed by Map/Reduce jobs available locally.
Use Case
Let's understand our use case a bit more in detail so that we can follow the code snippets. We have a key-value file that we need to use in our Map jobs. For simplicity, let's say we need to replace all keywords that we encounter during parsing with some other value.
So what we need is:
- A key-value file (let's use a Properties file)
- The Mapper code that uses it
Write the Mapper code that uses it:

public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {

    Properties cache;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());

        if (localCacheFiles != null) {
            // expecting only a single file here
            for (int i = 0; i < localCacheFiles.length; i++) {
                Path localCacheFile = localCacheFiles[i];
                cache = new Properties();
                cache.load(new FileReader(localCacheFile.toString()));

Page 12: Cloudera CCD

            }
        } else {
            // do your error handling here
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // use the cache here
        // if value contains some attribute, cache.get(<value>)
        // do some action or replace with something else
    }

}

Note:
* Distribute application-specific large, read-only files efficiently.
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars, etc.) needed by applications.
Applications specify the files, via URLs (hdfs:// or http://), to be cached via the JobConf. The DistributedCache assumes that the files specified via hdfs:// URLs are already present on the FileSystem at the path specified by the URL.
Reference: Using Hadoop Distributed Cache
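The mapper above only reads the cached file; as a hedged sketch under the same assumptions, the driver-side registration might look like the following, where the HDFS path and class names are hypothetical.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "distributed-cache-demo");
        job.setJarByClass(DistributedCacheDriver.class);
        job.setMapperClass(DistributedCacheMapper.class);

        // The file must already exist in HDFS; it is copied to each task's local disk
        // and picked up in the mapper's setup() via getLocalCacheFiles().
        DistributedCache.addCacheFile(new URI("/cache/keywords.properties"),
                                      job.getConfiguration());

        // ... set input/output formats and paths, then submit:
        // job.waitForCompletion(true);
    }
}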

Page 13: Cloudera CCD

QUESTION NO: 17
What types of algorithms are difficult to express in MapReduce?
A. Algorithms that require global, shared state.
B. Large-scale graph algorithms that require one-step link traversal.
C. Relational operations on large amounts of structured and semi-structured data.
D. Text analysis algorithms on large collections of unstructured text (e.g., Web crawls).
E. Algorithms that require applying the same mathematical function to large numbers of individual binary records.
Answer: A
Explanation: See 3) below.
Limitations of MapReduce - where not to use MapReduce
While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found where MapReduce is not suited, and some papers that address the limitations of MapReduce.
1. Computation depends on previously computed values
If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series, where each value is the summation of the previous two values, i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single machine, then it is better to do it as a single reduce(map(data)) operation rather than going through the entire map-reduce process.
2. Full-text indexing or ad hoc searching
The index generated in the Map step is one-dimensional, and the Reduce step must not generate a large amount of data or there will be a serious performance degradation. For example, CouchDB's MapReduce may not be a good fit for full-text indexing or ad hoc searching. This is a problem better suited for a tool such as Lucene.
3. Algorithms depend on shared global state
Solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing).
Reference: Limitations of MapReduce - where not to use MapReduce

QUESTION NO: 18
MapReduce is well-suited for all of the following applications EXCEPT? (Choose one):

Page 14: Cloudera CCD

A. Text mining on a large collection of unstructured documents.
B. Analysis of large amounts of Web logs (queries, clicks, etc.).
C. Online transaction processing (OLTP) for an e-commerce Website.
D. Graph mining on a large social network (e.g., Facebook friends network).
Answer: C
Explanation: Hadoop Map/Reduce is designed for batch-oriented workloads. MapReduce is well suited for data warehousing (OLAP), but not for OLTP.

QUESTION NO: 19
Your cluster's HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is TextInputFormat. How many Mappers will run?
A. 64
B. 100
C. 200
D. 640
Answer: C
Explanation: Each file would be split into two because the block size (64 MB) is less than the file size (100 MB), so 200 mappers would be running.
Note: If you're not compressing the files, then Hadoop will process your large files (say 10 GB) with a number of mappers related to the block size of the file. Say your block size is 64 MB; then you will have ~160 mappers processing this 10 GB file (160*64 ~= 10G). Depending on how CPU-intensive your mapper logic is, this might be an acceptable block size, but if you find that your mappers are executing in sub-minute times, then you might want to increase the work done by each mapper (by increasing the block size to 128, 256, or 512 MB - the actual size depends on how you intend to process the data).
Reference: http://stackoverflow.com/questions/11014493/hadoop-mapreduce-appropriate-input-files-size (first answer, second paragraph)

QUESTION NO: 20
Does the MapReduce programming model provide a way for reducers to communicate with each other?
A. Yes, all reducers can communicate with each other by passing information through the JobConf

Page 15: Cloudera CCD

object.
B. Yes, reducers can communicate with each other by dispatching intermediate key-value pairs that get shuffled to another reducer.
C. Yes, reducers running on the same machine can communicate with each other through shared memory, but not reducers on different machines.
D. No, each reducer runs independently and in isolation.
Answer: D
Explanation: The MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html (see question no. 9)

QUESTION NO: 21
Which of the following best describes the map method input and output?
A. It accepts a single key-value pair as input and can emit only one key-value pair as output.
B. It accepts a list of key-value pairs as input but can emit only one key-value pair as output.
C. It accepts a single key-value pair as input and emits a single key and list of corresponding values as output.
D. It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.
Answer: D
Explanation: public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object
Maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
Reference: org.apache.hadoop.mapreduce, Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
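As an illustration of QUESTION NO: 21, answer D (a sketch, not taken from the exam material), the following new-API mapper emits zero pairs for blank lines and one pair per token otherwise:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenEmittingMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty()) {
            return;                     // zero output pairs for this input pair
        }
        for (String token : line.split("\\s+")) {
            word.set(token);
            context.write(word, ONE);   // one output pair per token: many per input
        }
    }
}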

Page 16: Cloudera CCD

QUESTION NO: 22
You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:
[Figure not reproduced: the five OutputCollector.collect() calls referenced by the question.]
How many times will the Reducer's reduce method be invoked?
A. 0
B. 1
C. 3
D. 5
E. 6
Answer: C
Explanation: Note:
org.apache.hadoop.mapred, Interface OutputCollector<K,V>
Collects the <key, value> pairs output by Mappers and Reducers. OutputCollector is the generalization of the facility provided by the Map-Reduce framework to collect data output by either the Mapper or the Reducer, i.e., intermediate outputs or the output of the job.

QUESTION NO: 23
In a MapReduce job with 500 map tasks, how many map task attempts will there be?
A. At least 500.
B. Exactly 500.
C. At most 500.
D. Between 500 and 1000.
E. It depends on the number of reducers in the job.
Answer: A
Explanation: From the Cloudera training course:
A task attempt is a particular instance of an attempt to execute a task.
- There will be at least as many task attempts as there are tasks.
- If a task attempt fails, another will be started by the JobTracker.
- Speculative execution can also result in more task attempts than completed tasks.

QUESTION NO: 24
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or a number of machines are performing poorly and starts more copies of a map or reduce task. All the tasks run simultaneously and the tasks that finish first are used. This is called:
A. Combiner
B. IdentityMapper
C. IdentityReducer
D. Default Partitioner
E. Speculative Execution
Answer: E

Page 17: Cloudera CCD

Explanation: Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Reference: Apache Hadoop, Module 4: MapReduce

QUESTION NO: 25
Which of the following best describes the lifecycle of a Mapper?
A. The TaskTracker spawns a new Mapper to process each key-value pair.
B. The JobTracker spawns a new Mapper to process all records in a single file.
C. The TaskTracker spawns a new Mapper to process all records in a single input split.
D. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
Answer: A
Explanation: For each map instance that runs, the TaskTracker creates a new instance of your mapper.
Note:

Page 18: Cloudera CCD

* The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat. The mapper may perform a number of extraction and transformation functions on the Key/Value pair before ultimately outputting none, one, or many Key/Value pairs of the same or different Key/Value type.
* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class. This class defines an 'identity' map function by default - every input Key/Value pair obtained from the InputFormat is written out.
Examining the run() method, we can see the lifecycle of the mapper:

/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

setup(Context) - Perform any setup for the mapper. The default implementation is a no-op method.
map(Key, Value, Context) - Perform a map operation on the given Key/Value pair. The default implementation calls Context.write(Key, Value).
cleanup(Context) - Perform any cleanup for the mapper. The default implementation is a no-op method.
Reference: Hadoop/MapReduce/Mapper

QUESTION NO: 26
Your client application submits a MapReduce job to your Hadoop cluster. The Hadoop framework looks for an available slot to schedule the MapReduce operations on which of the following Hadoop computing daemons?
A. DataNode
B. NameNode
C. JobTracker

Page 19: Cloudera CCD

D. TaskTracker
E. Secondary NameNode
Answer: C
Explanation: JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process run on any Hadoop cluster. JobTracker runs in its own JVM process. In a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. JobTracker in Hadoop performs the following actions (from the Hadoop Wiki):
- Client applications submit jobs to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?

QUESTION NO: 27
Which MapReduce daemon runs on each slave node and participates in job execution?
A. TaskTracker
B. JobTracker
C. NameNode
D. Secondary NameNode
Answer: A
Explanation: A single instance of a TaskTracker is run on each slave node. The TaskTracker is run as a separate JVM process.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is

Page 20: Cloudera CCD

the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node? http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html (see answer to question no. 5)

QUESTION NO: 28
What is the standard configuration of slave nodes in a Hadoop cluster?
A. Each slave node runs a JobTracker and a DataNode daemon.
B. Each slave node runs a TaskTracker and a DataNode daemon.
C. Each slave node either runs a TaskTracker or a DataNode daemon, but not both.
D. Each slave node runs a DataNode daemon, but only a fraction of the slave nodes run TaskTrackers.
E. Each slave node runs a TaskTracker, but only a fraction of the slave nodes run DataNode daemons.
Answer: B
Explanation: A single instance of a TaskTracker is run on each slave node. The TaskTracker is run as a separate JVM process. A single instance of a DataNode daemon is run on each slave node. The DataNode daemon is run as a separate JVM process. One or multiple task instances are run on each slave node. Each task instance is run as a separate JVM process. The number of task instances can be controlled by configuration. Typically a high-end machine is configured to run more task instances.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?

QUESTION NO: 29
Which happens if the NameNode crashes?
A. HDFS becomes unavailable until the NameNode is restored.
B. The Secondary NameNode seamlessly takes over and there is no service interruption.
C. HDFS becomes unavailable to new MapReduce jobs, but running jobs will continue until completion.
D. HDFS becomes temporarily unavailable until an administrator starts redirecting client requests to the Secondary NameNode.
Answer: A

Page 21: Cloudera CCD

Explanation: The NameNode is a Single Point of Failure for the HDFS cluster. When the NameNode goes down, the file system goes offline.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?

QUESTION NO: 30
You are running a job that will process a single InputSplit on a cluster which has no other jobs currently running. Each node has an equal number of open Map slots. On which node will Hadoop first attempt to run the Map task?
A. The node with the most memory
B. The node with the lowest system load
C. The node on which this InputSplit is stored
D. The node with the most free local disk space
Answer: C
Explanation: The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

QUESTION NO: 31
How does the NameNode detect that a DataNode has failed?
A. The NameNode does not need to know that a DataNode has failed.
B. When the NameNode fails to receive periodic heartbeats from the DataNode, it considers the DataNode as failed.
C. The NameNode periodically pings the DataNode. If the DataNode does not respond, the NameNode considers the DataNode as failed.
D. When HDFS starts up, the NameNode tries to communicate with the DataNode and considers the DataNode as failed if it does not respond.
Answer: B

Page 22: Cloudera CCD

Explanation: The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since blocks will be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes and the data never passes through the NameNode.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How NameNode handles DataNode failures?

QUESTION NO: 32
The NameNode uses RAM for the following purpose:
A. To store the contents of files in HDFS.
B. To store filenames, lists of blocks and other meta information.
C. To store the edits log that keeps track of changes in HDFS.
D. To manage distributed read and write locks on files in HDFS.
Answer: B
Explanation: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. There is only one NameNode process run on any Hadoop cluster. The NameNode runs in its own JVM process. In a typical production cluster it runs on a separate machine. The NameNode is a Single Point of Failure for the HDFS cluster. When the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?

Page 23: Cloudera CCD

QUESTION NO: 33
In the reducer, the MapReduce API provides you with an iterator over Writable values. Calling the next() method:
A. Returns a reference to a different Writable object each time.
B. Returns a reference to a Writable object from an object pool.
C. Returns a reference to the same Writable object each time, but populated with different data.
D. Returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object.
E. Returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise.
Answer: C
Explanation: Calling Iterator.next() will always return the SAME EXACT instance of IntWritable, with the contents of that instance replaced with the next value.
Reference: Manipulating iterator in MapReduce

QUESTION NO: 34
What is a Writable?
A. Writable is an interface that all keys and values in MapReduce must implement. Classes implementing this interface must implement methods for serializing and deserializing themselves.
B. Writable is an abstract class that all keys and values in MapReduce must extend. Classes extending this abstract base class must implement methods for serializing and deserializing themselves.
C. Writable is an interface that all keys, but not values, in MapReduce must implement. Classes implementing this interface must implement methods for serializing and deserializing themselves.
D. Writable is an abstract class that all keys, but not values, in MapReduce must extend. Classes extending this abstract base class must implement methods for serializing and deserializing themselves.
Answer: A
Explanation: public interface Writable
A serializable object which implements a simple, efficient serialization protocol, based on DataInput and DataOutput.

Page 24: Cloudera CCD

Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.
Reference: org.apache.hadoop.io, Interface Writable
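To make QUESTION NO: 34 concrete, here is a hedged sketch of a hypothetical custom Writable that serializes itself with write() and deserializes with readFields(); the field layout is purely illustrative.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageViewWritable implements Writable {

    private long timestamp;
    private int viewCount;

    public PageViewWritable() { }               // Writables need a no-arg constructor

    public PageViewWritable(long timestamp, int viewCount) {
        this.timestamp = timestamp;
        this.viewCount = viewCount;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize
        out.writeLong(timestamp);
        out.writeInt(viewCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize
        timestamp = in.readLong();
        viewCount = in.readInt();
    }

    // Conventional static factory mirroring the explanation above.
    public static PageViewWritable read(DataInput in) throws IOException {
        PageViewWritable w = new PageViewWritable();
        w.readFields(in);
        return w;
    }
}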

Page 25: Cloudera CCD

Reference:MapReduce TutorialQUESTION NO: 36All keys used for intermediate output from mappers must do which of the following:A. Override isSplitableB. Implement WritableComparableC. Be a subclass of Filelnput-FormatD. Use a comparator for speedy sortingE. Be compressed using a splittable compression algorithm.Answer: BExplanation: The MapReduce framework operates exclusively on <key, value> pairs, that is, theframework views the input to the job as a set of <key, value> pairs and produces a set of <key,value> pairs as the output of the job, conceivably of different types.The key and value classes have to be serializable by the framework and hence need to implementthe Writable interface. Additionally, the key classes have to implement the WritableComparableinterface to facilitate sorting by the framework.Cloudera CCD-333 Exam"A Composite Solution With Just One Click" - Certification Guaranteed 27Reference:MapReduce TutorialQUESTION NO: 37You have the following key value pairs as output from your Map task:(The, 1)(Fox, 1)(Runs, 1)(Faster, 1)(Than, 1)(The, 1)(Dog, 1)How many keys will be passed to the reducer?A. OneB. TwoC. ThreeD. FourE. FiveF. SixAnswer: FExplanation: Only one key value pair will be passed from thetwo(The, 1) key value pairs.QUESTION NO: 38You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm usesTextInputFormat and the IdentityReducer: the mapper applies a regular expression over input

Page 26: Cloudera CCD

values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.
A. There is no difference in output between the two settings.
B. With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS.
C. With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.
D. With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.
Answer: D
Explanation:
* It is legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the FileSystem.
* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
Note: Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.
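A hedged driver sketch for the map-only case discussed above: setting the number of reducers to zero writes each mapper's output directly to HDFS, one file per map task. The RegexMatchMapper class and the paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only-regex");
        job.setJarByClass(MapOnlyDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(RegexMatchMapper.class);   // hypothetical mapper class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setNumReduceTasks(0);  // zero reducers: map outputs are written directly, unsorted

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}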

Page 27: Cloudera CCD

QUESTION NO: 39
For each intermediate key, each reducer task can emit:
A. One final key-value pair per key; no restrictions on the type.
B. One final key-value pair per value associated with the key; no restrictions on the type.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
E. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
Answer: A
Explanation: Reducer reduces a set of intermediate values which share a key to a smaller set of values.
Reference: Hadoop Map-Reduce Tutorial

QUESTION NO: 40
For each input key-value pair, mappers can emit:
A. One intermediate key-value pair, of a different type.
B. One intermediate key-value pair, but of the same type.
C. As many intermediate key-value pairs as desired, but they cannot be of the same type as the input key-value pair.
D. As many intermediate key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
E. As many intermediate key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
Answer: E
Explanation: Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
Reference: Hadoop Map-Reduce Tutorial

QUESTION NO: 41
During the standard sort and shuffle phase of MapReduce, keys and values are passed to reducers. Which of the following is true?
A. Keys are presented to a reducer in sorted order; values for a given key are not sorted.

Page 28: Cloudera CCD

B. Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.
Answer: D
Explanation:

QUESTION NO: 42
What is the behavior of the default partitioner?
A. The default partitioner assigns key-value pairs to reducers based on an internal random number generator.
B. The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each reducer in turn. This ensures an even partition of the key space.
C. The default partitioner computes the hash of the key. Hash values between specific ranges are associated with different buckets, and each bucket is assigned to a specific reducer.
D. The default partitioner computes the hash of the key and divides that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair.
E. The default partitioner computes the hash of the value and takes the mod of that value with the number of reducers. The result determines the reducer assigned to process the key-value pair.
Answer: D
Explanation: The default partitioner computes a hash value for the key and assigns the partition based on this result. The default Partitioner implementation is called HashPartitioner. It uses the hashCode() method of the key objects modulo the total number of partitions to determine which partition to send a given (key, value) pair to. In Hadoop, the default partitioner is HashPartitioner, which hashes a record's key to determine which partition (and thus which reducer) the record belongs in. The number of partitions is then equal to the number of reduce tasks for the job.
Reference: Getting Started With (Customized) Partitioning
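Answer D can be read directly off the default partitioner's logic. Below is a hedged sketch of a partitioner written in the style of HashPartitioner; the bitmask clears the sign bit so negative hash codes still map onto a valid partition.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashStylePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // hash of the key, modulo the number of reducers
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A driver could register it with job.setPartitionerClass(HashStylePartitioner.class), although leaving the default HashPartitioner in place gives the same behavior.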

Page 29: Cloudera CCD

QUESTION NO: 43
Which statement best describes the data path of intermediate key-value pairs (i.e., output of the mappers)?
A. Intermediate key-value pairs are written to HDFS. Reducers read the intermediate data from HDFS.
B. Intermediate key-value pairs are written to HDFS. Reducers copy the intermediate data to the local disks of the machines running the reduce tasks.
C. Intermediate key-value pairs are written to the local disks of the machines running the map tasks, and then copied to the machines running the reduce tasks.
D. Intermediate key-value pairs are written to the local disks of the machines running the map tasks, and are then copied to HDFS. Reducers read the intermediate data from HDFS.
Answer: C
Explanation: The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
Note:
* Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account the processing of the data transfer done by the reduce process; therefore the reduce progress starts showing up as soon as any intermediate key-value pair for a mapper is available to be transferred to a reducer. Though the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.
* Reducer is input the grouped output of a Mapper. In this phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.
* Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.

Page 30: Cloudera CCD

* All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
Reference: Questions & Answers for Hadoop MapReduce developers, Where is the Mapper output (intermediate key-value data) stored?

QUESTION NO: 44
You've written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which of the following interfaces is most likely to reduce the amount of intermediate data transferred across the network?
A. Writable
B. WritableComparable
C. InputFormat
D. OutputFormat
E. Combiner
F. Partitioner
Answer: E
Explanation: Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
Reference: Map/Reduce Tutorial, http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html (Mapper, 9th paragraph)

QUESTION NO: 45
If you run the word count MapReduce program with m mappers and r reducers, how many output files will you get at the end of the job? And how many key-value pairs will there be in each file? Assume k is the number of unique words in the input files.
A. There will be r files, each with exactly k/r key-value pairs.
B. There will be r files, each with approximately k/m key-value pairs.
C. There will be r files, each with approximately k/r key-value pairs.
D. There will be m files, each with exactly k/m key-value pairs.
E. There will be m files, each with approximately k/m key-value pairs.
Answer: A
Explanation:

QUESTION NO: 46
You have a large dataset of key-value pairs, where the keys are strings and the values are integers. For each unique key, you want to identify the largest integer. In writing a MapReduce program to accomplish this, can you take advantage of a combiner?
A. No, a combiner would not be useful in this case.
B. Yes.
C. Yes, but the number of unique keys must be known in advance.
D. Yes, as long as all the keys fit into memory on each node.
E. Yes, as long as all the integer values that share the same key fit into memory on each node.
Answer: B
Explanation: Taking the maximum is both commutative and associative, so the reducer logic can also be applied locally as a combiner without changing the final result.
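A minimal sketch of such a reducer, assuming Text keys and IntWritable values as in the question; the class name MaxIntReducer is an illustrative placeholder. Because max is commutative and associative, the same class could be registered with both job.setCombinerClass and job.setReducerClass.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Emits the largest value seen for each key; safe to use as both combiner and reducer.
public class MaxIntReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int max = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      max = Math.max(max, value.get());
    }
    context.write(key, new IntWritable(max));
  }
}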

QUESTION NO: 47
What happens in a MapReduce job when you set the number of reducers to zero?
A. No reducer executes, but the mappers generate no output.
B. No reducer executes, and the output of each mapper is written to a separate file in HDFS.
C. No reducer executes, but the outputs of all the mappers are gathered together and written to a single file in HDFS.
D. Setting the number of reducers to zero is invalid, and an exception is thrown.
Answer: B
Explanation:
*It is legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the FileSystem.
*Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
QUESTION NO: 48
You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?
A. Mapper <Text, IntWritable, Text, IntWritable>
B. Reducer <Text, Text, IntWritable, IntWritable>
C. Reducer <Text, IntWritable, Text, IntWritable>
D. Combiner <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>
Answer: C
Explanation: Hadoop does not provide a separate Combiner interface. The combiner class is passed to setCombinerClass and must itself be a Reducer whose input key/value types match the map output types and whose output types match the reduce input types, in this case Reducer<Text, IntWritable, Text, IntWritable>.
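A sketch of the declaration this implies; the class name WordCountCombiner is an illustrative placeholder, and the summing body simply mirrors the usual word-count reducer.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner is written as a Reducer: input types (Text, IntWritable) match the map
// output, and output types (Text, IntWritable) match what the real reducer expects.
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();   // partial, map-side sum for this key
    }
    context.write(key, new IntWritable(sum));
  }
}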

QUESTION NO: 49
Combiners increase the efficiency of a MapReduce program because:
A. They provide a mechanism for different mappers to communicate with each other, thereby reducing synchronization overhead.
B. They provide an optimization and reduce the total number of computations that are needed to execute an algorithm by a factor of n, where n is the number of reducers.
C. They aggregate intermediate map output locally on each individual machine and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.
D. They aggregate intermediate map output from a small number of nearby (i.e., rack-local) machines and therefore reduce the amount of data that needs to be shuffled across the network to the reducers.
Answer: C
Explanation: Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on each mapper's output. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "What are combiners? When should I use a combiner in my MapReduce Job?", http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html (question no. 12)
QUESTION NO: 50
In a large MapReduce job with m mappers and r reducers, how many distinct copy operations will there be in the sort/shuffle phase?
A. m
B. r
C. m + r (i.e., m plus r)
D. m x r (i.e., m multiplied by r)
E. m^r (i.e., m to the power of r)
Answer: D
Explanation: A MapReduce job with m mappers and r reducers involves up to m*r distinct copy operations, since each mapper may have intermediate output going to every reducer.

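As a worked example of the m*r bound above: a job with 10 map tasks and 4 reduce tasks can require up to 10 x 4 = 40 distinct copy operations during the shuffle, because each of the 10 map outputs may contain a partition destined for each of the 4 reducers.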
QUESTION NO: 51
When is the reduce method first called in a MapReduce job?
A. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only or reduce-only jobs.
B. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.
C. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
Answer: C
Explanation: In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer-defined reduce method is called only after all the mappers have finished.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "When is the reducers are started in a MapReduce job?", http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html (question no. 17)
QUESTION NO: 52
What happens in a MapReduce job when you set the number of reducers to one?
A. A single reducer gathers and processes all the output from all the mappers. The output is written in as many separate files as there are mappers.
B. A single reducer gathers and processes all the output from all the mappers. The output is written to a single file in HDFS.
C. Setting the number of reducers to one creates a processing bottleneck, and since the number of reducers as specified by the programmer is used as a reference value only, the MapReduce runtime provides a default setting for the number of reducers.
D. Setting the number of reducers to one is invalid, and an exception is thrown.
Answer: B
Explanation: With a single reduce task, every mapper's intermediate output is partitioned to that one reducer, which sorts and processes it and writes a single output file (part-00000, or part-r-00000 with the newer API) in the job's output directory. One reducer is a valid setting, although it can become a bottleneck for large jobs.

QUESTION NO: 53
In the standard word count MapReduce algorithm, why might using a combiner reduce the overall job running time?
A. Because combiners perform local aggregation of word counts, thereby allowing the mappers to process input data faster.
B. Because combiners perform local aggregation of word counts, thereby reducing the number of mappers that need to run.
C. Because combiners perform local aggregation of word counts, and then transfer that data to reducers without writing the intermediate data to disk.
D. Because combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers.
Answer: D
Explanation:
*Simply speaking, a combiner can be considered a "mini reducer" that may be applied several times during the map phase before the new (hopefully reduced) set of key/value pairs is sent to the reducer(s). This is why a combiner must implement the Reducer interface (or extend the Reducer class as of Hadoop 0.20).
*Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on each mapper's output. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "What are combiners? When should I use a combiner in my MapReduce Job?"

QUESTION NO: 54
Which two of the following are valid statements? (Choose two)
A. HDFS is optimized for storing a large number of files smaller than the HDFS block size.
B. HDFS has the characteristic of supporting a "write once, read many" data access model.
C. HDFS is a distributed file system that replaces ext3 or ext4 on Linux nodes in a Hadoop cluster.
D. HDFS is a distributed file system that runs on top of native OS filesystems and is well suited to storage of very large data sets.
Answer: B, D
Explanation:
B: HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.
D:
*Hadoop Distributed File System: a distributed file system that provides high-throughput access to application data.
*HDFS is designed to support very large files.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers
QUESTION NO: 55
You need to create a GUI application to help your company's sales people add and edit customer information. Would HDFS be appropriate for this customer information file?
A. Yes, because HDFS is optimized for random access writes.
B. Yes, because HDFS is optimized for fast retrieval of relatively small amounts of data.
C. No, because HDFS can only be accessed by MapReduce applications.
D. No, because HDFS is optimized for write-once, streaming access for relatively large files.
Answer: D
Explanation: HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "What is HDFS? How it is different from traditional file systems?"
QUESTION NO: 56
Which of the following describes how a client reads a file from HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.
Answer: A
Explanation: Client communication with HDFS happens through the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives; it holds the block map in memory and does not need to query the DataNodes at read time. Client applications then talk directly to a DataNode once the NameNode has provided the location of the data.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "How the Client communicates with HDFS?"
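As a sketch of the client side of this read path using the standard FileSystem API (the path /user/example/data.txt is an illustrative placeholder): open() obtains the block locations from the NameNode, and the returned stream then pulls the bytes directly from the DataNodes that hold each block.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/example/data.txt");   // illustrative path
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}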

QUESTION NO: 57
You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. Since this will produce proportionally more intermediate data than input data, which resources could you expect to be likely bottlenecks?
A. Processor and RAM
B. Processor and disk I/O
C. Disk I/O and network I/O
D. Processor and network I/O
Answer: B
Explanation:
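A minimal sketch of the mapper described in the question; the class name CharFrequencyMapper is an illustrative placeholder. With TextInputFormat the input key is the byte offset (LongWritable) and the value is the line (Text), and every character of the line becomes its own intermediate key-value pair, which is why the intermediate data outgrows the input.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Splits each input line into individual characters and emits (character, 1).
public class CharFrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text character = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    for (int i = 0; i < line.length(); i++) {
      character.set(String.valueOf(line.charAt(i)));
      context.write(character, ONE);
    }
  }
}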

QUESTION NO: 58
Which of the following statements best describes how a large (100 GB) file is stored in HDFS?
A. The file is divided into variable-size blocks, which are stored on multiple datanodes. Each block is replicated three times by default.
B. The file is replicated three times by default. Each copy of the file is stored on a separate datanode.
C. The master copy of the file is stored on a single datanode. The replica copies are divided into fixed-size blocks, which are stored on multiple datanodes.
D. The file is divided into fixed-size blocks, which are stored on multiple datanodes. Each block is replicated three times by default. Multiple blocks from the same file might reside on the same datanode.
E. The file is divided into fixed-size blocks, which are stored on multiple datanodes. Each block is replicated three times by default. HDFS guarantees that different blocks from the same file are never on the same datanode.
Answer: D
Explanation: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy: in the default configuration there are three copies of each data block, with two copies stored on datanodes in the same rack and the third copy on a different rack. This placement policy applies to the replicas of a single block; HDFS makes no guarantee that different blocks of the same file end up on different datanodes.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "How the HDFS Blocks are replicated?"
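One way to observe this layout for an existing file is the FileSystem#getFileBlockLocations call, sketched below; the path is an illustrative placeholder, and the fsck tool (hadoop fsck) offers a similar view from the command line.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/example/bigfile.dat"));   // illustrative path
    // One BlockLocation per block of the file; getHosts() lists the datanodes
    // that hold the replicas of that block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", hosts " + Arrays.toString(block.getHosts()));
    }
  }
}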

QUESTION NO: 59
Your cluster has 10 DataNodes, each with a single 1 TB hard drive. You utilize all your disk capacity for HDFS, reserving none for MapReduce. You implement default replication settings. What is the storage capacity of your Hadoop cluster (assuming no compression)?
A. about 3 TB
B. about 5 TB
C. about 10 TB
D. about 11 TB
Answer: A
Explanation: In the default configuration each data block is stored as three copies (two on datanodes in the same rack and the third on a different rack), so the usable capacity is roughly the raw capacity divided by the replication factor: 10 x 1 TB / 3 is about 3.3 TB, i.e., about 3 TB.
Note: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, "How the HDFS Blocks are replicated?"
QUESTION NO: 60
You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?
A. They would see no content until the whole file is written and closed.
B. They would see the content of the file through the last completed block.
C. They would see the current state of the file, up to the last bit written by the command.
D. They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.
Answer: A
Explanation:
Note:
*put
Usage: hadoop fs -put <localsrc> ... <dst>
Copy single src, or multiple srcs, from the local file system to the destination filesystem. Also reads input from stdin and writes to the destination filesystem.