2/24/13
Short Apache Hadoop API
Overview
Adam Kawa, Data Engineer @ Spotify
Image Source http://developer.yahoo.com/hadoop/tutorial/module4.html

InputFormat Responsibilities
Divide input data into logical input splits
Data in HDFS is divided into blocks, but processed as input splits
An InputSplit may contain any number of blocks (usually 1)
Each Mapper processes one input split
Creates RecordReaders to extract <key, value> pairs

InputFormat Class
public abstract class InputFormat<K, V> {
public abstract
List<InputSplit> getSplits(JobContext context) throws ...;
public abstract
RecordReader<K,V> createRecordReader(InputSplit split,
TaskAttemptContext context) throws ...;
}

Most Common InputFormats
TextInputFormat
Each \n-terminated line is a value
The byte offset of that line is a key
Why not a line number?
KeyValueTextInputFormat
Key and value are separated by a separator (tab by default)
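A minimal plain-Java sketch of how TextInputFormat-style keys come about, assuming UTF-8 text and '\n' line endings. The class and method names are hypothetical and nothing here uses the Hadoop API. It also hints at the answer to "why not a line number?": a reader that starts in the middle of a file knows its byte offset, but not how many lines precede it.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: pairs each line with its byte offset.
class ByteOffsetKeysSketch {
    static Map<Long, String> toRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);                                   // key = offset of the line start
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;  // +1 for the '\n' terminator
        }
        return records;
    }

    public static void main(String[] args) {
        // Keys are 0, 11 and 18 for this input
        toRecords("On the top\nof the\nCrumpetty Tree\n")
            .forEach((key, value) -> System.out.println(key + "\t" + value));
    }
}
```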

Binary InputFormats
SequenceFileInputFormat
SequenceFiles are flat files consisting of binary <key, value> pairs
AvroInputFormat
Avro supports rich data structures (not necessarily <key, value> pairs) serialized to files or messages
Compact, fast, language-independent, self-describing, dynamic

Some Other InputFormats
NLineInputFormat
The input should not be too big, since splits are calculated in a single thread (NLineInputFormat#getSplitsForFile)
CombineFileInputFormat
An abstract class, but not so difficult to extend
SeparatorInputFormat
How-to: http://blog.rguha.net/?p=293

Some Other InputFormats
MultipleInputs
Supports multiple input paths with a different InputFormat and Mapper for each path
MultipleInputs.addInputPath(job,
firstPath, FirstInputFormat.class, FirstMapper.class);
MultipleInputs.addInputPath(job,
secondPath, SecondInputFormat.class, SecondMapper.class);
InputFormat Class (Partial) Hierarchy

InputFormat Interesting Facts
Ideally, the InputSplit size is equal to the HDFS block size
Or an InputSplit contains multiple collocated HDFS blocks
InputFormat may prevent splitting a file
A whole file is then processed by a single mapper (e.g. gzip)
protected boolean FileInputFormat#isSplitable(JobContext context, Path filename);

InputFormat Interesting Facts
A Mapper knows the file/offset/size of the split that it processes
MapContext#getInputSplit()
Useful for later debugging on a local machine

InputFormat Interesting Facts
PathFilter (included in InputFormat) specifies which files to include in or exclude from the input data
PathFilter hiddenFileFilter = new PathFilter(){
public boolean accept(Path p){
String name = p.getName();
return !name.startsWith("_") && !name.startsWith(".");
}
};

RecordReader
Extracts <key, value> pairs from the corresponding InputSplit
Examples:
LineRecordReader
KeyValueLineRecordReader
SequenceFileRecordReader

RecordReader Logic
Must handle the common situation when InputSplit and HDFS block boundaries do not match
Image source: Hadoop: The Definitive Guide by Tom White

RecordReader Logic
Example solution, based on LineRecordReader
Skips* everything from its block until the first '\n'
Reads from the second block until it sees '\n'
*except for the very first block (an offset equal to 0)
Image source: Hadoop: The Definitive Guide by Tom White
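The skip-and-read-past rule can be sketched in plain Java over an in-memory byte array. This is only an illustration in the spirit of LineRecordReader, not its actual code: the class and method names are hypothetical, and a split is assumed to be the byte range [start, end). A reader whose split does not start at offset 0 always skips its first line, and every reader keeps going while a line starts at or before its end, so each line is emitted exactly once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: which lines belong to the split [start, end) of a file?
class SplitLineReaderSketch {
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Skip the (possibly partial) first line; the previous split's reader emits it.
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line that starts at or before 'end' is read completely, even past the split end.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    // Index of the first byte after the next '\n', or data.length if there is none.
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // Three splits of 10, 10 and 6 bytes; boundaries fall in the middle of lines.
        System.out.println(readSplit(data, 0, 10));
        System.out.println(readSplit(data, 10, 20));
        System.out.println(readSplit(data, 20, 26));
    }
}
```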

Keys And Values
Keys must implement the WritableComparable interface
Since they are sorted before passing to the Reducers
Values must implement “at least” Writable interface
WritableComparables Hierarchy
Image source: Hadoop: The Definitive Guide by Tom White

Writable And WritableComparable
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
public interface Comparable<T> {
public int compareTo(T o);
}

Example: SongWritable
class SongWritable implements Writable {
String title;
int year;
byte[] content;
…
public void write(DataOutput out) throws ... {
out.writeUTF(title);
out.writeInt(year);
out.writeInt(content.length);
out.write(content);
}
}
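The slide shows only write(). A possible matching readFields() has to read the fields back in exactly the same order. The sketch below is an assumption about how that could look; to stay self-contained it uses only java.io and a hypothetical class SongRecordSketch that does not implement Hadoop's Writable, although both method signatures follow that interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for SongWritable, using only java.io.
class SongRecordSketch {
    String title;
    int year;
    byte[] content;

    void write(DataOutput out) throws IOException {
        out.writeUTF(title);
        out.writeInt(year);
        out.writeInt(content.length);   // length prefix, so the reader knows how much to read
        out.write(content);
    }

    void readFields(DataInput in) throws IOException {
        title = in.readUTF();           // same field order as in write()
        year = in.readInt();
        content = new byte[in.readInt()];
        in.readFully(content);
    }

    // Serializes a song to bytes and deserializes it into a fresh instance.
    static SongRecordSketch roundTrip(SongRecordSketch song) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            song.write(new DataOutputStream(bytes));
            SongRecordSketch copy = new SongRecordSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```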

Mapper
Takes input in the form of a <key, value> pair
Emits a set of intermediate <key, value> pairs
Stores them locally and later passes to the Reducers
But earlier: partition + sort + spill + merge

Mapper Methods
protected void setup(Context context) throws ... {}
protected void cleanup(Context context) throws ... {}
protected void map(KEYIN key, VALUEIN value, Context context) ... {
context.write((KEYOUT) key, (VALUEOUT) value);
}
public void run(Context context) throws ... {
setup(context);
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);
}

MapContext Object
Allows the user's map code to communicate with the MapReduce system
public InputSplit getInputSplit();
public TaskAttemptID getTaskAttemptID();
public void setStatus(String msg);
public boolean nextKeyValue() throws ...;
public KEYIN getCurrentKey() throws ...;
public VALUEIN getCurrentValue() throws ...;
public void write(KEYOUT key, VALUEOUT value) throws ...;
public Counter getCounter(String groupName, String counterName);

Examples Of Mappers
Implement highly specialized Mappers and reuse/chain them when possible
IdentityMapper
InverseMapper
RegexMapper
TokenCounterMapper

TokenCounterMapper
public class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}

General Advice
Reuse Writable instances instead of creating a new one for each record
The Apache Commons StringUtils class seems to be the most efficient for String tokenization

Chain Of Mappers
Use multiple Mapper classes within a single Map task
The output of the first Mapper becomes the input of the second, and so on until the last Mapper
The output of the last Mapper will be written to the task's output
Encourages implementation of reusable and highly specialized Mappers
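The chaining idea can be illustrated without Hadoop: each stage turns one record into zero or more records, and the output of one stage feeds the next. This is a simplified sketch under assumptions. The names are hypothetical, records are plain Strings rather than <key, value> pairs, and each stage's output is collected in a list for readability, whereas a real chain hands records from one Mapper to the next within the task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical sketch of a mapper chain over String records.
class MapperChainSketch {
    @SafeVarargs
    static List<String> runChain(List<String> input, Function<String, List<String>>... stages) {
        List<String> current = input;
        for (Function<String, List<String>> stage : stages) {
            List<String> next = new ArrayList<>();
            for (String record : current) {
                next.addAll(stage.apply(record));   // a stage may emit 0..n records per input
            }
            current = next;                         // output of this stage is input of the next
        }
        return current;
    }

    // Two small, specialized, reusable stages.
    static List<String> tokenize(String line) {
        return Arrays.asList(line.split(" "));
    }

    static List<String> upperCase(String word) {
        return Collections.singletonList(word.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(runChain(Arrays.asList("fast car", "sos"),
                MapperChainSketch::tokenize, MapperChainSketch::upperCase));
    }
}
```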

Example Chain Of Mappers
JobConf mapAConf = new JobConf(false);
...
ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
Text.class, Text.class, true, mapAConf);
JobConf mapBConf = new JobConf(false);
...
ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
LongWritable.class, Text.class, false, mapBConf);
FileInputFormat.setInputPaths(conf, inDir);
FileOutputFormat.setOutputPath(conf, outDir);
JobClient jc = new JobClient(conf);
RunningJob job = jc.submitJob(conf);

Partitioner
Specifies which Reducer a given <key, value> pair is sent to
Aim for an even distribution of the intermediate data
Skewed data may overload a single reducer and make the whole job run longer
public abstract class Partitioner<KEY, VALUE> {
public abstract
int getPartition(KEY key, VALUE value, int numPartitions);
}

HashPartitioner
The default choice for general-purpose use cases
public int getPartition(K key, V value, int numReduceTasks) {
return
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
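Why the bit mask instead of Math.abs? hashCode() may be negative, and Math.abs(Integer.MIN_VALUE) is still negative, so a naive version could return an invalid partition number. The sketch below contrasts the two; it takes the hash as a plain int, and the class and method names are hypothetical.

```java
// Hypothetical sketch comparing the masked formula from the slide with a naive one.
class HashPartitionSketch {
    // Same arithmetic as on the slide: clear the sign bit, then take the remainder.
    static int getPartition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Naive alternative: breaks for Integer.MIN_VALUE, whose absolute value overflows.
    static int naivePartition(int hashCode, int numReduceTasks) {
        return Math.abs(hashCode) % numReduceTasks;
    }

    public static void main(String[] args) {
        int[] hashes = {"Disturbia".hashCode(), -7, Integer.MIN_VALUE};
        for (int hash : hashes) {
            System.out.println(hash + ": masked=" + getPartition(hash, 3)
                    + ", naive=" + naivePartition(hash, 3));
        }
    }
}
```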

TotalOrderPartitioner
A partitioner that aims for a total order of the output across all reducers

TotalOrderPartitioner
Before the job runs, the input data is sampled to provide a fairly even distribution of keys across reducers

TotalOrderPartitioner
Three samplers
InputSampler.RandomSampler<K,V>
Samples from random points in the input
InputSampler.IntervalSampler<K,V>
Samples from s splits at regular intervals
InputSampler.SplitSampler<K,V>
Samples the first n records from s splits
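The idea behind sampling can be sketched as follows: sort the sampled keys, pick numPartitions - 1 split points at even intervals, and send each key to the range it falls into. This is a simplified illustration with String keys and hypothetical names, not the TotalOrderPartitioner implementation; it assumes the sample yields distinct split points.

```java
import java.util.Arrays;

// Hypothetical sketch of range partitioning driven by a key sample.
class TotalOrderSketch {
    // Picks numPartitions - 1 split points from the sorted sample.
    static String[] splitPoints(String[] sample, int numPartitions) {
        String[] sorted = sample.clone();
        Arrays.sort(sorted);
        String[] points = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return points;
    }

    // Keys below points[0] go to partition 0, keys in [points[0], points[1]) to 1, and so on.
    static int getPartition(String key, String[] points) {
        int pos = Arrays.binarySearch(points, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is not a split point
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] points = splitPoints(new String[] {"d", "a", "c", "b", "f", "e"}, 3);
        System.out.println(Arrays.toString(points));
        for (String key : new String[] {"a", "c", "d", "e", "z"}) {
            System.out.println(key + " -> partition " + getPartition(key, points));
        }
    }
}
```

Because the partition number never decreases as the key grows, concatenating the sorted outputs of reducers 0, 1, 2, ... gives a totally ordered result.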

Reducer
Gets list(<key, list(value)>)
Keys are sorted, but values for a given key are not sorted
Emits a set of output <key, value> pairs

Reducer Run Method
public void run(Context context) throws … {
setup(context);
while (context.nextKey()) {
reduce(context.getCurrentKey(),
context.getValues(), context);
}
cleanup(context);
}

Chain Of Mappers After A Reducer
The ChainReducer class allows chaining multiple Mapper classes after a Reducer within the Reducer task
Combined with ChainMapper, one could get [MAP+ / REDUCE MAP*]
ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
Text.class, Text.class, true, reduceConf);
ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
LongWritable.class, Text.class, false, null);
ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
LongWritable.class, LongWritable.class, true, null);
OutputFormat Class Hierarchy
Image source: Hadoop: The Definitive Guide by Tom White

MultipleOutputs
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, LongWritable.class, Text.class);
MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, LongWritable.class, Text.class);
public void reduce(WritableComparable key, Iterable<Writable> values, Context context) throws ... {
    ...
    // write(namedOutput, key, value)
    mos.write("text", key, new Text("Hello"));
    // write(namedOutput, key, value, baseOutputPath)
    mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
    mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");
    // write(key, value, baseOutputPath)
    mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
}

Other Useful Features
Combiner
Skipping bad records
Compression
Profiling
Isolation Runner

Job Class Methods
public void setInputFormatClass(..);
public void setOutputFormatClass(..);
public void setMapperClass(..);
public void setCombinerClass(..);
public void setReducerClass(...);
public void setPartitionerClass(..);
public void setMapOutputKeyClass(..);
public void setMapOutputValueClass(..);
public void setOutputKeyClass(..);
public void setOutputValueClass(..);
public void setSortComparatorClass(..);
public void setGroupingComparatorClass(..);
public void setNumReduceTasks(int tasks);
public void setJobName(String name);
public float mapProgress();
public float reduceProgress();
public boolean isComplete();
public boolean isSuccessful();
public void killJob();
public void submit();
public boolean waitForCompletion(..);

ToolRunner
Supports parsing of generic options, which lets the user specify configuration options on the command line
hadoop jar examples.jar SongCount
-D mapreduce.job.reduces=10
-D artist.gender=FEMALE
-files dictionary.dat
-libjars math.jar,spotify.jar
songs counts

Side Data Distribution
public class MyMapper<K, V> extends Mapper<K, V, V, K> {
    String gender = null;
    File dictionary = null;
    protected void setup(Context context) throws … {
        Configuration conf = context.getConfiguration();
        // Set on the command line with -D artist.gender=...; "MALE" is the default
        gender = conf.get("artist.gender", "MALE");
        // Shipped with -files dictionary.dat and read from the task's working directory
        dictionary = new File("dictionary.dat");
    }
}

public class WordCount extends Configured implements Tool {
    // ToolRunner strips the generic options and passes only the remaining arguments
    public int run(String[] otherArgs) throws Exception {
        if (otherArgs.length != 2) {
            System.err.printf("Usage: %s [options] <input> <output>%n", getClass().getSimpleName());
            return -1;
        }
        Job job = new Job(getConf());
        FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
    public static void main(String[] allArgs) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new WordCount(), allArgs);
        System.exit(exitCode);
    }
}

MRUnit
Built on top of JUnit
Provides a mock InputSplit, Context and other classes
Can test
The Mapper class,
The Reducer class,
The full MapReduce job
The pipeline of MapReduce jobs

MRUnit Example
public class IdentityMapTest extends TestCase {
private MapDriver<Text, Text, Text, Text> driver;
@Before
public void setUp() {
driver = new MapDriver<Text, Text, Text, Text>(new MyMapper<Text, Text, Text, Text>());
}
@Test
public void testMyMapper() {
driver
.withInput(new Text("foo"), new Text("bar"))
.withOutput(new Text("oof"), new Text("rab"))
.runTest();
}
}

Example: Secondary Sort
The reduce(key, Iterator<value>) method gets an iterator over values
These values are not sorted for a given key
Sometimes we want to get them sorted
Useful to find minimum or maximum value quickly

Secondary Sort Is Tricky
A couple of custom classes are needed
WritableComparable
Partitioner
SortComparator (optional, but recommended)
GroupingComparator

Composite Key
Leverages “traditional” sorting mechanism of intermediate keys
Intermediate key becomes composite of the “natural” key and the value
(Disturbia, 1) → (Disturbia#1, 1)
(SOS, 4) → (SOS#4, 4)
(Disturbia, 7) → (Disturbia#7, 7)
(Fast car, 2) → (Fast car#2, 2)
(Fast car, 6) → (Fast car#6, 6)
(Disturbia, 4) → (Disturbia#4, 4)
(Fast car, 2) → (Fast car#2, 2)

Custom Partitioner
HashPartitioner uses a hash of the whole key
The same titles may go to different reducers (because they are combined with ts in the key)
Use a custom partitioner that partitions only on the first part of the key
int getPartition(TitleWithTs key, LongWritable value, int num) {
return hashPartitioner.getPartition(key.title, value, num);
}
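A plain-Java sketch of partitioning on the natural key only. TitleWithTs here is a hypothetical two-field class standing in for the composite key, and the hash formula is borrowed from HashPartitioner; no Hadoop classes are used.

```java
// Hypothetical sketch: only the title decides the partition, ts is ignored.
class NaturalKeyPartitionSketch {
    // Composite key: natural key (title) plus the value portion (ts).
    static final class TitleWithTs {
        final String title;
        final long ts;

        TitleWithTs(String title, long ts) {
            this.title = title;
            this.ts = ts;
        }
    }

    static int getPartition(TitleWithTs key, int numPartitions) {
        // Hash the natural key only, so all records of one title meet in one reducer.
        return (key.title.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(getPartition(new TitleWithTs("Disturbia", 1), 4));
        System.out.println(getPartition(new TitleWithTs("Disturbia", 7), 4)); // same partition as above
    }
}
```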

Ordering Of Keys
Keys need to be ordered before they are passed to the reducer
Order by the natural key and, for the same natural key, by the value portion of the key
Implement the ordering in the WritableComparable or use a Comparator class
job.setSortComparatorClass(SongWithTsComparator.class);

Data Passed To The Reducer
By default, each unique key triggers a separate call to the reduce() method
(Disturbia#1, 1) → reduce method is invoked
(Disturbia#4, 4) → reduce method is invoked
(Disturbia#7, 7) → reduce method is invoked
(Fast car#2, 2) → reduce method is invoked
(Fast car#2, 2)
(Fast car#6, 6) → reduce method is invoked
(SOS#4, 4) → reduce method is invoked

Data Passed To The Reducer
The grouping comparator (set with Job#setGroupingComparatorClass) determines which keys and values are passed in a single call to the reduce method
Just look at the natural key when grouping
(Disturbia#1, 1) → reduce method is invoked
(Disturbia#4, 4)
(Disturbia#7, 7)
(Fast car#2, 2) → reduce method is invoked
(Fast car#2, 2)
(Fast car#6, 6)
(SOS#4, 4) → reduce method is invoked
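The interplay of the sort comparator and the grouping comparator can be simulated in plain Java. The sketch below is an illustration under assumptions, not Hadoop code: the names are hypothetical, the composite key is a simple two-field class, and each inner list stands for the values seen by one reduce() call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation of secondary sort: sort on (title, ts), group on title only.
class SecondarySortSketch {
    static final class TitleWithTs {
        final String title;
        final long ts;

        TitleWithTs(String title, long ts) {
            this.title = title;
            this.ts = ts;
        }
    }

    // Sort comparator: natural key first, then the value portion of the key.
    static final Comparator<TitleWithTs> SORT =
            Comparator.comparing((TitleWithTs k) -> k.title).thenComparingLong(k -> k.ts);

    // Grouping comparator: looks at the natural key only.
    static final Comparator<TitleWithTs> GROUP =
            Comparator.comparing((TitleWithTs k) -> k.title);

    // Returns one list of values per simulated reduce() call.
    static List<List<Long>> reduceCalls(List<TitleWithTs> keys) {
        List<TitleWithTs> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<Long>> calls = new ArrayList<>();
        TitleWithTs previous = null;
        for (TitleWithTs key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                calls.add(new ArrayList<>());          // a new group starts a new reduce() call
            }
            calls.get(calls.size() - 1).add(key.ts);
            previous = key;
        }
        return calls;
    }

    // The records used on the slides.
    static List<TitleWithTs> sample() {
        return Arrays.asList(
                new TitleWithTs("Disturbia", 1), new TitleWithTs("SOS", 4),
                new TitleWithTs("Disturbia", 7), new TitleWithTs("Fast car", 2),
                new TitleWithTs("Fast car", 6), new TitleWithTs("Disturbia", 4),
                new TitleWithTs("Fast car", 2));
    }

    public static void main(String[] args) {
        System.out.println(reduceCalls(sample())); // prints [[1, 4, 7], [2, 2, 6], [4]]
    }
}
```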

Question
How to calculate a median from a set of numbers using Java MapReduce?

Question: A Possible Answer
Implement TotalSort, but
Each Reducer produces an additional file containing a pair
<minimum_value, number_of_values>
After the job ends, a single-threaded application
Reads these files to build the index
Calculates which value in which file is the median
Finds this value in this file
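The lookup step of the single-threaded application might look like the sketch below. It is only one possible reading of the answer: the names are hypothetical, the per-file counts are assumed to be listed in partition order (which the minimum values can be used to establish), and for an even number of values it locates the lower of the two middle values.

```java
// Hypothetical sketch: find which file holds the median, and at which position.
class MedianIndexSketch {
    // Returns {fileIndex, positionWithinFile} of the value with the given 0-based rank.
    static long[] locate(long[] countsPerFile, long rank) {
        long skipped = 0;
        for (int file = 0; file < countsPerFile.length; file++) {
            if (rank < skipped + countsPerFile[file]) {
                return new long[] {file, rank - skipped};
            }
            skipped += countsPerFile[file];   // all values in this file rank below the target
        }
        throw new IllegalArgumentException("rank out of range: " + rank);
    }

    // Locates the (lower) median across all files.
    static long[] locateMedian(long[] countsPerFile) {
        long total = 0;
        for (long count : countsPerFile) {
            total += count;
        }
        return locate(countsPerFile, (total - 1) / 2);
    }

    public static void main(String[] args) {
        // Three sorted output files holding 3, 4 and 2 values: the median has rank 4.
        long[] position = locateMedian(new long[] {3, 4, 2});
        System.out.println("file " + position[0] + ", position " + position[1]); // file 1, position 1
    }
}
```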