Apache PIG - User Defined Functions

Apache Pig UDFs
Extending Pig to solve complex tasks

UDF = User Defined Functions

Your speaker today:

Christoph Bauer

Java developer for 10+ years

one of the founders

Helping our clients to use and understand their (big) data

working in "BigData" since 2010

Why use PIG

● ad-hoc way of creating and executing map/reduce jobs
● simple, high-level language
● more natural for analysts than map/reduce

Done.

http://leesfishandphotos.blogspot.de

Oh, wait...

UDFs to the rescue

Writing user defined functions (UDF)
+ easy to use
+ easy to code
+ keep the power of PIG
+ you can write them in Java, Python, ...

Do whatever you want

● image feature extraction
● geo computations
● data cleaning
● retrieve web pages
● natural language processing
● much more...

User Defined Functions

● EvalFunc<T>
    public T exec(Tuple input)

● FilterFunc
    public Boolean exec(Tuple input)

● Aggregate Functions
    public interface Algebraic {
        public String getInitial();
        public String getIntermed();
        public String getFinal();
    }

● Load/Store Functions
    public Tuple getNext()
    public void putNext(Tuple input)
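
The three Algebraic methods each return the name of a class that handles one stage of an aggregation. The following is a minimal plain-Java sketch of that staging idea only, using a count as the example. It does not use Pig's classes, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the three Algebraic stages (hypothetical names, no Pig classes).
final class StagedCount {
    // Initial stage: turns one chunk of records into a partial count (map side).
    static long initial(List<String> chunk) {
        return chunk.size();
    }

    // Intermediate stage: merges partial counts; may run zero or more times (combiner).
    static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage: merges the remaining partial counts into the result (reduce side).
    static long finalStage(List<Long> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        long a = initial(Arrays.asList("r1", "r2", "r3"));
        long b = initial(Arrays.asList("r4", "r5"));
        long combined = intermediate(Arrays.asList(a, b));
        System.out.println(finalStage(Arrays.asList(combined)));
    }
}
```

Because each stage only passes on a small partial result, the full group never has to be held in memory at once.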

What? Why?

[Diagram: versioned company data with the fields companyName, companyAdress / companyAddress, and repeated Net Worth values]

2010 | companyName | current Address | historical Net Worth



Example

r1, { q1:[(t1, "v1"), (t4, "v2")], q2:[(t2, "v3"), (t7, "v4")] }

...apply UDF

r1, t1, q1:"v1", q2:"v4"

r1, t3, q1:"v1", q2:"v4"

r1, t5, q1:"v2", q2:"v4"

SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST (q2)
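
The example rows can be reproduced with a small plain-Java sketch. It assumes a snapshot at time t takes the newest version written at or before t, and that LATEST takes the newest version overall. VersionedColumn is a hypothetical stand-in for one versioned column, not part of Pig or of the UDFs shown in this talk.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for one versioned column: timestamp -> value (hypothetical helper).
final class VersionedColumn {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    VersionedColumn put(long timestamp, String value) {
        versions.put(timestamp, value);
        return this;
    }

    // Value that was valid at the given time: the newest version written at or before it.
    String valueAt(long timestamp) {
        Map.Entry<Long, String> e = versions.floorEntry(timestamp);
        return e == null ? null : e.getValue();
    }

    // Newest version overall, regardless of the snapshot time.
    String latest() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    public static void main(String[] args) {
        VersionedColumn q1 = new VersionedColumn().put(1, "v1").put(4, "v2");
        VersionedColumn q2 = new VersionedColumn().put(2, "v3").put(7, "v4");
        // Snapshots at t1, t3, t5 (t1 <= t < t6, step 2) combined with the latest q2.
        for (long t = 1; t < 6; t += 2) {
            System.out.println("r1, t" + t + ", q1:" + q1.valueAt(t) + ", q2:" + q2.latest());
        }
    }
}
```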

LATEST

public class LATEST extends EvalFunc<Tuple> {

    public Tuple exec(Tuple input) throws IOException {
        // ...
    }
}

LATEST (contd.)

public Tuple exec(Tuple input) throws IOException {
    int numTuples = input.size();
    Tuple result = tupleFactory.newTuple(numTuples);
    for (int i = 0; i < numTuples; i++) {
        switch (input.getType(i)) {
        case DataType.BAG:
            DataBag bag = (DataBag) input.get(i);
            Object val = extractLatestValueFromBag(bag);
            if (val != null) {
                result.set(i, val);
            }
            break;
        case DataType.MAP:
            // ... MAPs need different handling
        default:
            // warn ...
        }
    }
    return result;
}


SNAPSHOT

public class SNAPSHOTS extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        List<Tuple> listOfTuples = new ArrayList<Tuple>();

        DateTime dtCur = new DateTime(start);
        DateTime dtEnd = new DateTime(end).plus(1L); // makes the end inclusive
        while (dtCur.isBefore(dtEnd)) {
            // snapshot() takes the timestamp as a long
            listOfTuples.add(snapshot(input, dtCur.getMillis()));

            dtCur = dtCur.plus(period);
        }
        DataBag bag = factory.newDefaultBag(listOfTuples);
        return bag;
    }

SNAPSHOT (contd.)

protected Tuple snapshot(Tuple input, long ts) throws... {
    int numTuples = input.size();
    Tuple result = tupleFactory.newTuple(numTuples + 1);
    result.set(0, ts);

    for (int i = 0; i < numTuples; i++) {
        switch (input.getType(i)) {
        case DataType.BAG:
            DataBag bag = (DataBag) input.get(i);
            Object val = extractTSValueFromBag(bag, ts);
            result.set(i + 1, val);
            break;
        case DataType.MAP:
            // handle MAPs
        default:
        }
    }
    return result;
}
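
The stepping logic of the exec loop can be tried out in isolation. This is a minimal sketch with plain long timestamps instead of DateTime objects. SnapshotTimes and its method are hypothetical names, and the check for a positive period is an addition that the slides do not show.

```java
import java.util.ArrayList;
import java.util.List;

// Mirrors the while loop in exec with plain long timestamps (hypothetical helper).
final class SnapshotTimes {
    // Returns start, start + period, ... up to and including end.
    static List<Long> between(long start, long end, long period) {
        if (period <= 0) {
            // Without this guard a zero or negative period would loop forever.
            throw new IllegalArgumentException("period must be positive");
        }
        List<Long> times = new ArrayList<>();
        for (long t = start; t <= end; t += period) {
            times.add(t);
        }
        return times;
    }

    public static void main(String[] args) {
        System.out.println(between(1, 5, 2));
    }
}
```

Each timestamp in the returned list corresponds to one call of snapshot(input, ts) and therefore to one tuple in the resulting bag.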


PigLatin

REGISTER 'my-udf.jar';
DEFINE LATEST myudf.Latest();
DEFINE SNAPSHOT myudf.Snapshot('2000-01-01 2013-01-01 1y');
A = LOAD 'inputTable' AS (id, q1, q2);
B = FOREACH A GENERATE id, SNAPSHOT(q1) AS SN, LATEST(q2) AS CUR;
C = FOREACH B GENERATE id, FLATTEN(SN), FLATTEN(CUR);
STORE C INTO 'output.csv';
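
What the two FLATTEN calls do to one row of B can be sketched in plain Java. Flattening a bag yields one output row per tuple in the bag, and flattening a tuple moves its fields to the top level. The lists below are stand-ins for Pig's bag and tuple types, and FlattenSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of FLATTEN on one row of B (hypothetical names, no Pig classes).
final class FlattenSketch {
    // sn stands in for the bag of snapshot tuples, cur for the single LATEST tuple.
    static List<List<Object>> flatten(Object id, List<List<Object>> sn, List<Object> cur) {
        List<List<Object>> rows = new ArrayList<>();
        for (List<Object> snapshot : sn) {   // bag: one output row per contained tuple
            List<Object> row = new ArrayList<>();
            row.add(id);
            row.addAll(snapshot);            // tuple: fields are lifted to the top level
            row.addAll(cur);
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<Object>> sn = Arrays.asList(
                Arrays.<Object>asList("t1", "v1"),
                Arrays.<Object>asList("t3", "v1"),
                Arrays.<Object>asList("t5", "v2"));
        List<Object> cur = Arrays.<Object>asList("v4");
        for (List<Object> row : flatten("r1", sn, cur)) {
            System.out.println(row);
        }
    }
}
```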


Passing parameters to UDFs

DEFINE SNAPSHOT cool.udf.Snapshot ('2000-01-01 2013-01-01 1y');

...
public SNAPSHOTS(String start, String end, String increment) {
    super();
    this.start = Long.parseLong(start);
    this.end = Long.parseLong(end);
    this.increment = Long.parseLong(increment);
}
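
Constructor arguments given in DEFINE reach the UDF as strings, so the UDF has to parse them itself. The DEFINE line above passes one string with two dates and a period, while the constructor shown parses three numeric strings. As a sketch of how the single-string form could be handled, the hypothetical class below uses java.time. The spec format, the UTC time zone, and all names are assumptions.

```java
import java.time.LocalDate;
import java.time.Period;
import java.time.ZoneOffset;
import java.util.Locale;

// Hypothetical parser for a spec such as "2000-01-01 2013-01-01 1y" (not part of Pig).
final class SnapshotSpec {
    final long startMillis;
    final long endMillis;
    final Period increment;

    SnapshotSpec(String spec) {
        String[] parts = spec.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 'start end increment', got: " + spec);
        }
        this.startMillis = toMillis(parts[0]);
        this.endMillis = toMillis(parts[1]);
        // "1y" becomes the ISO-8601 period "P1Y"; "6m" would become "P6M" (six months).
        this.increment = Period.parse("P" + parts[2].toUpperCase(Locale.ROOT));
    }

    // Interprets an ISO date as midnight UTC; the time zone is an assumption.
    private static long toMillis(String isoDate) {
        return LocalDate.parse(isoDate).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        SnapshotSpec s = new SnapshotSpec("2000-01-01 2013-01-01 1y");
        System.out.println(s.startMillis + " " + s.endMillis + " " + s.increment);
    }
}
```

A Period is used for the increment because a step such as one year is not a fixed number of milliseconds.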

I didn't talk about

● UDFs run as a single instance in every mapper, reducer, ...; use instance variables for objects shared between calls

● Watch your heap when using Lucene Indexes, or implementing the Algebraic interface

● do implement
    public Schema outputSchema(Schema input)

● report progress when doing time consuming stuff

● Performance?
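
The first point, keeping shared objects in instance variables, can be sketched as follows. CleaningFunction is a hypothetical plain-Java stand-in for a UDF object whose exec method is called once per record. The initCount field exists only to make the reuse visible.

```java
import java.util.regex.Pattern;

// Plain-Java stand-in for a UDF instance that is called once per record (hypothetical).
final class CleaningFunction {
    private Pattern whitespace;   // built once per instance, then reused by every call
    int initCount = 0;            // counts how often the shared object was built

    String exec(String input) {
        if (whitespace == null) { // lazy initialisation on the first call
            whitespace = Pattern.compile("\\s+");
            initCount++;
        }
        // Collapse runs of whitespace into single spaces.
        return whitespace.matcher(input.trim()).replaceAll(" ");
    }

    public static void main(String[] args) {
        CleaningFunction f = new CleaningFunction();
        System.out.println(f.exec("  Net   Worth "));
        System.out.println(f.exec("company \t Name"));
        System.out.println("initialised " + f.initCount + " time(s)");
    }
}
```

Building the shared object lazily in exec rather than in the constructor is a design choice here, so that instances which never process a record do not pay for the setup.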

SNAPSHOT (contd.)

@Override
public Schema outputSchema(Schema input) {
    List<FieldSchema> out = new ArrayList<FieldSchema>();
    out.add(new FieldSchema("snapshot", DataType.LONG));

    for (FieldSchema fieldSchema : input.getFields()) {
        String alias = fieldSchema.alias;
        byte type = fieldSchema.type;
        out.add(new FieldSchema(alias, type));
    }
    Schema bagSchema = new Schema(out);
    try {
        return new Schema(new FieldSchema(
                getSchemaName("snapshots", input), bagSchema, DataType.BAG));
    } catch (FrontendException e) {
        // exception is ignored; the method falls through and returns null
    }
    return null;
}

Reality check

● These UDFs are in production
● Producing reports with up to 60GB
● Data is stored in HBase

Wrapping it up

We at Oberbaum Concept developed a bunch of PIG functions for handling versioned data in HBase.
● Rewrote HBaseStorage
● UDFs for Snapshots, Latest

Right now we are trying to push our changes back into PIG.

Questions?

Thank you!

Christoph Bauer

www.xing.com/profile/Christoph_Bauer62
