Apache PIG - User Defined Functions
Apache Pig UDFs
Extending Pig to solve complex tasks
UDF = User Defined Functions
Your speaker today:
Christoph Bauer
java developer 10+ years
one of the founders
Helping our clients to use and understand their (big) data
working in "BigData" since 2010
Why use PIG
● ad-hoc way for creating and executing map/reduce jobs
● simple, high-level language
● more natural for analysts than map/reduce
Done.
http://leesfishandphotos.blogspot.de
Oh, wait...
UDFs to the rescue
Writing user defined functions (UDF)
+ easy to use
+ easy to code
+ keep the power of PIG
+ you can write them in java, python, ...
Do whatever you want
● image feature extraction
● geo computations
● data cleaning
● retrieve web pages
● natural language processing
...
● much more...
User Defined Functions
● EvalFunc<T>
    public T exec(Tuple input)
● FilterFunc
    public Boolean exec(Tuple input)
● Aggregate Functions
    public interface Algebraic {
        public String getInitial();
        public String getIntermed();
        public String getFinal();
    }
● Load/Store Functions
    public Tuple getNext()
    public void putNext(Tuple input);
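The three Algebraic methods name the classes that implement the initial, intermediate, and final stage of an aggregation. As a rough illustration of why an aggregate is split this way, here is a plain-Java sketch of an average computed in three stages. It does not use the Pig API; the class AlgebraicAverageSketch and its members are hypothetical names chosen for this example.

```java
import java.util.List;

/** Plain-Java illustration of the three Algebraic stages (hypothetical, not Pig API). */
final class AlgebraicAverageSketch {

    /** Partial result carried between stages: running sum and count. */
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    /** Initial stage: turn one raw value into a partial result. */
    static Partial initial(double value) {
        return new Partial(value, 1);
    }

    /** Intermediate stage: merge partial results; may be applied repeatedly. */
    static Partial intermed(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new Partial(sum, count);
    }

    /** Final stage: merge the remaining partials and produce the average. */
    static double finalStage(List<Partial> partials) {
        Partial total = intermed(partials);
        return total.count == 0 ? Double.NaN : total.sum / total.count;
    }
}
```

Because only a sum and a count travel between stages, partial results stay small no matter how many values are aggregated.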
What? Why?
[Figure: a record with the cells companyName, companyAddress (one version spelled "companyAdress") and Net Worth; Net Worth appears in many versions]
2010 | companyName | current Address | historical Net Worth
2011 | companyName | current Address | historical Net Worth
2012 | companyName | current Address | historical Net Worth
Example
r1, { q1:[(t1, "v1") , (t4, "v2")], q2:[(t2, "v3"),(t7, "v4")] }
...apply UDF
r1, t1, q1:"v1", q2:"v4"
r1, t3, q1:"v1", q2:"v4"
r1, t5, q1:"v2", q2:"v4"
SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST (q2)
LATEST
public class LATEST extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
    }
}
LATEST (contd.)
public Tuple exec(Tuple input) throws IOException {
    int numTuples = input.size();
    // one output field per input field
    Tuple result = tupleFactory.newTuple(numTuples);
    for (int i = 0; i < numTuples; i++) {
        switch (input.getType(i)) {
        case DataType.BAG:
            // a bag holds all versions of one cell; keep only the newest
            DataBag bag = (DataBag) input.get(i);
            Object val = extractLatestValueFromBag(bag);
            if (val != null) {
                result.set(i, val);
            }
            break;
        case DataType.MAP:
            // ... MAPs need different handling
        default:
            // warn ...
        }
    }
    return result;
}
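The slide calls extractLatestValueFromBag without showing it. A minimal sketch of what such a helper might do, written in plain Java without the Pig API: the Cell class stands in for a (timestamp, value) tuple inside a bag, and all names here are assumptions made for illustration.

```java
import java.util.List;

/** Plain-Java stand-in for picking the newest version of a cell (hypothetical). */
final class LatestValueSketch {

    /** One versioned cell: a timestamp plus the value written at that time. */
    static final class Cell {
        final long ts;
        final Object value;

        Cell(long ts, Object value) {
            this.ts = ts;
            this.value = value;
        }
    }

    /** Returns the value with the greatest timestamp, or null if there is none. */
    static Object extractLatestValue(List<Cell> cells) {
        if (cells == null) {
            return null;
        }
        Cell latest = null;
        for (Cell c : cells) {
            if (latest == null || c.ts > latest.ts) {
                latest = c;
            }
        }
        return latest == null ? null : latest.value;
    }
}
```

With q2 from the example, [(t2, "v3"), (t7, "v4")], the sketch yields "v4", which matches the q2 column of the expected output.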
SNAPSHOT
public class SNAPSHOTS extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        List<Tuple> listOfTuples = new ArrayList<Tuple>();
        DateTime dtCur = new DateTime(start);
        // one millisecond past 'end', so that 'end' itself is included
        DateTime dtEnd = new DateTime(end).plus(1L);
        while (dtCur.isBefore(dtEnd)) {
            // snapshot() expects the timestamp as a long
            listOfTuples.add(snapshot(input, dtCur.getMillis()));
            dtCur = dtCur.plus(period);
        }
        DataBag bag = factory.newDefaultBag(listOfTuples);
        return bag;
    }
SNAPSHOT (contd.)
    protected Tuple snapshot(Tuple input, long ts) throws ... {
        int numTuples = input.size();
        // one extra leading field holds the snapshot timestamp
        Tuple result = tupleFactory.newTuple(numTuples + 1);
        result.set(0, ts);
        for (int i = 0; i < numTuples; i++) {
            switch (input.getType(i)) {
            case DataType.BAG:
                DataBag bag = (DataBag) input.get(i);
                Object val = extractTSValueFromBag(bag, ts);
                result.set(i + 1, val);
                break;
            case DataType.MAP:
                // handle MAPs
            default:
            }
        }
        return result;
    }
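To make the snapshot semantics concrete, here is a plain-Java sketch that reproduces the q1 column of the example: for each snapshot time it returns the value that was current at that time. It does not use the Pig API or the author's extractTSValueFromBag; the class SnapshotSketch, the TreeMap representation of a cell's versions, and the integer timestamps are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of point-in-time snapshots over versioned values (hypothetical). */
final class SnapshotSketch {

    /** Value visible at ts: the version with the greatest timestamp <= ts, or null. */
    static String valueAsOf(TreeMap<Long, String> versions, long ts) {
        Map.Entry<Long, String> entry = versions.floorEntry(ts);
        return entry == null ? null : entry.getValue();
    }

    /** Snapshot times start, start + period, ... up to and including end. */
    static List<Long> snapshotTimes(long start, long end, long period) {
        if (period <= 0) {
            throw new IllegalArgumentException("period must be positive");
        }
        List<Long> times = new ArrayList<Long>();
        for (long t = start; t <= end; t += period) {
            times.add(t);
        }
        return times;
    }

    /** One "ts, q1:value" line per snapshot time. */
    static List<String> snapshots(TreeMap<Long, String> q1, long start, long end, long period) {
        List<String> rows = new ArrayList<String>();
        for (long t : snapshotTimes(start, end, period)) {
            rows.add(t + ", q1:" + valueAsOf(q1, t));
        }
        return rows;
    }
}
```

With q1 = {1: "v1", 4: "v2"}, a start of 1, an end of 5, and a period of 2, the snapshot times are 1, 3, and 5, and the values are "v1", "v1", and "v2", as in the example rows for t1, t3, and t5.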
PigLatin
REGISTER 'my-udf.jar';
DEFINE LATEST myudf.Latest();
DEFINE SNAPSHOT myudf.Snapshot ('2000-01-01 2013-01-01 1y');
A = LOAD 'inputTable' AS (id, q1, q2);
B = FOREACH A GENERATE id, SNAPSHOT(q1) AS SN, LATEST(q2) AS CUR;
C = FOREACH B GENERATE id, FLATTEN(SN), FLATTEN(CUR);
STORE C INTO 'output.csv';
Passing parameters to UDFs
DEFINE SNAPSHOT cool.udf.Snapshot ('2000-01-01 2013-01-01 1y');
...
// Pig passes the DEFINE arguments to the constructor as Strings
public SNAPSHOTS(String start, String end, String increment) {
    super();
    this.start = Long.parseLong(start);
    this.end = Long.parseLong(end);
    this.increment = Long.parseLong(increment);
}
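The DEFINE line passes a single space-separated string, while the constructor shown takes three strings and parses them as longs. The slides do not show how the two fit together. One possible way to handle the single-string form is sketched below in plain Java with java.time; the class SnapshotParamsSketch and the "start end period" format are assumptions, not the author's code.

```java
import java.time.LocalDate;
import java.time.Period;
import java.util.Locale;

/** Parses a spec such as "2000-01-01 2013-01-01 1y" (hypothetical format and class). */
final class SnapshotParamsSketch {
    final LocalDate start;
    final LocalDate end;
    final Period period;

    SnapshotParamsSketch(String spec) {
        String[] parts = spec.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 'start end period', got: " + spec);
        }
        // ISO dates such as 2000-01-01
        this.start = LocalDate.parse(parts[0]);
        this.end = LocalDate.parse(parts[1]);
        // "1y" becomes the ISO-8601 period "P1Y"
        this.period = Period.parse("P" + parts[2].toUpperCase(Locale.ROOT));
    }
}
```

Parsing in the constructor means a malformed spec fails when the UDF is instantiated, before any tuples are processed.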
I didn't talk about
● UDFs run as a single instance in every mapper, reducer, ...; use instance variables for locally shared objects
● Watch your heap when using Lucene Indexes, or implementing the Algebraic interface
● do implement public Schema outputSchema(Schema input)
● report progress when doing time consuming stuff
● Performance?
SNAPSHOT (contd.)
@Override
public Schema outputSchema(Schema input) {
    List<FieldSchema> out = new ArrayList<FieldSchema>();
    // first field of every snapshot tuple is its timestamp
    out.add(new FieldSchema("snapshot", DataType.LONG));
    // the remaining fields keep the alias and type of the input fields
    for (FieldSchema fieldSchema : input.getFields()) {
        String alias = fieldSchema.alias;
        byte type = fieldSchema.type;
        out.add(new FieldSchema(alias, type));
    }
    Schema bagSchema = new Schema(out);
    try {
        return new Schema(new FieldSchema(
                getSchemaName("snapshots", input), bagSchema, DataType.BAG));
    } catch (FrontendException e) {
        // schema could not be built; fall through and return null
    }
    return null;
}
Reality check
● These UDFs are in production
● Producing reports with up to 60GB
● Data is stored in HBase
Wrapping it up
We at Oberbaum Concept developed a bunch of PIG Functions handling versioned data in HBase.
● Rewrote HBaseStorage
● UDFs for Snapshots, Latest
Right now we are trying to push our changes back into PIG.
Questions?