Cascading talk in Etsy (
![Page 1: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/1.jpg)
How AdMobius uses Cascading in AdTech Stack
Jyotirmoy Sundi, Sr. Data Engineer at Lotame
(AdMobius was acquired by Lotame in March 2014)
![Page 2: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/2.jpg)
What does AdMobius do
AdMobius is a Mobile Audience Management Platform (MAMP). It helps advertisers identify mobile audiences by demographics and interest through standard, custom, and private segments, and reach them at scale.
![Page 3: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/3.jpg)
Target effectively across all platforms in multiple devices
Laptop
Mobile
iPod
iPad
Wearables
![Page 4: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/4.jpg)
Topics
Device graph building and scoring device links
Cascading Taps for Hive, MySQL, HBase
Modularized Testing
Optimal Config Setups
Running in YARN
Conclusion
![Page 5: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/5.jpg)
AdMobius Stack
Cascading | Hive | HBase | Giraph
Hadoop | (Experimental Spark)
Rackspace
YARN | MR1
Custom Workflows
![Page 6: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/6.jpg)
Why Cascading
Easy custom aggregators.
• In the existing MR framework it was very difficult to write a series of complex aggregations and run them at scale before verifying their correctness. You can do this in Hive with UDFs or UDAFs, but we found it much easier in Cascading.
Easy for Java developers to understand.
• Visualize and write complicated workflows through the concepts of pipes, taps, and tuples.
![Page 7: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/7.jpg)
Workflow for audience profile scoring
![Page 8: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/8.jpg)
Driven
https://driven.cascading.io/index.html#/apps/D818DDDA9D9DC182E3228EDF8B9B05C2
![Page 9: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/9.jpg)
![Page 10: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/10.jpg)
Audience Profiling
Cascading is used to do
complex aggregations
create the device multi-dimensional vectors
device pair scoring based on the vectors
rule engine based filters
Size
Total number of mobile devices ~2.7B
~500M devices in Giraph computation.
![Page 11: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/11.jpg)
Example: Parallel aggregation of values across multiple fields.
![Page 12: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/12.jpg)
Aggregations
No need to know group modes like in UDAF
Buffer
use for more complex grouping operations
output multiple tuples per group
Aggregator (simple aggregations, prebuilt aggregators like SumBy, CountBy)
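As a rough illustration of what a grouped sum and count produce per group, here is a minimal plain-Java sketch. It does not use the Cascading API; the class name `GroupedSumCount`, the `Stats` holder, and the `{key, value}` row layout are all hypothetical and chosen only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration only (no Cascading dependency); all names are hypothetical.
final class GroupedSumCount {
    /** Running sum and count for one group. */
    static final class Stats {
        double sum;
        long count;
    }

    /**
     * Groups rows by key and accumulates a sum and a count in a single pass.
     * Each row is {key, value}; the value is parsed as a double.
     */
    static Map<String, Stats> aggregate(String[][] rows) {
        Map<String, Stats> groups = new LinkedHashMap<>();
        for (String[] row : rows) {
            Stats s = groups.computeIfAbsent(row[0], k -> new Stats());
            s.sum += Double.parseDouble(row[1]);
            s.count++;
        }
        return groups;
    }
}
```

Both aggregates are updated in the same pass over the group, which is the point of running several simple aggregations side by side instead of one job per aggregate.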
![Page 13: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/13.jpg)
public class MinGraphScoring extends BaseOperation implements Buffer {
    @Override
    public void operate(FlowProcess flowProcess, BufferCall bufferCall) {
        Iterator<TupleEntry> arguments = bufferCall.getArgumentsIterator();
        Graph g = new Graph();
        // Collect every tuple of the current group into the graph
        while (arguments.hasNext()) {
            TupleEntry tpe = arguments.next();
            // "field1" holds the serialized node bytes (use Kryo serialization)
            ByteBuffer b = ByteBuffer.wrap((byte[]) tpe.getObject("field1"));
            g.put(b);
        }
        Node[] nodes = g.nodes;
        // For each pair of nodes i, j (assumed here to mean unordered pairs), emit one scored tuple
        for (int i = 0; i < nodes.length; i++) {
            for (int j = i + 1; j < nodes.length; j++) {
                double minmaxscore = scoring(g, i, j);
                Tuple t1 = new Tuple(nodes[i].id, nodes[j].id, minmaxscore);
                bufferCall.getOutputCollector().add(t1);
            }
        }
    }
}
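To make the per-group pairwise scoring concrete, the following plain-Java sketch scores every unordered pair of vectors in one group. The talk does not show the actual `scoring()` function, so the rule used here (sum of element-wise minima divided by sum of element-wise maxima) is purely an assumption for illustration, as are the class and method names. Vectors are assumed to have equal length.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration; the scoring rule is an assumption, not the scoring() used in the talk.
final class PairScoringSketch {
    /** Hypothetical score: sum of element-wise minima over sum of element-wise maxima. */
    static double score(double[] a, double[] b) {
        double minSum = 0.0;
        double maxSum = 0.0;
        for (int k = 0; k < a.length; k++) {
            minSum += Math.min(a[k], b[k]);
            maxSum += Math.max(a[k], b[k]);
        }
        return maxSum == 0.0 ? 0.0 : minSum / maxSum;
    }

    /** Emits {i, j, score} for every unordered pair in one group. */
    static List<double[]> scoreGroup(double[][] vectors) {
        List<double[]> out = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) {
            for (int j = i + 1; j < vectors.length; j++) {
                out.add(new double[] { i, j, score(vectors[i], vectors[j]) });
            }
        }
        return out;
    }
}
```

A group of n devices yields n(n-1)/2 output tuples, which is why this kind of logic needs an operation that can emit multiple tuples per group.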
![Page 14: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/14.jpg)
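public class PotentialMatchAggregator extends BaseOperation<PotentialMatchAggregator.IDList> implements Aggregator<PotentialMatchAggregator.IDList> {
    // Called once at the start of each group: create the per-group context
    @Override
    public void start(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
        IDList idList = new IDList();
        aggregatorCall.setContext(idList);
    }
    // Called once per tuple in the group: update the context
    @Override
    public void aggregate(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
        TupleEntry arguments = aggregatorCall.getArguments();
        IDList idList = aggregatorCall.getContext();
        // amid and match are read from arguments (extraction not shown on the slide)
        idList.updateDev(amid, match);
    }
    // Called once at the end of each group: emit the result from the context
    @Override
    public void complete(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
        IDList idList = aggregatorCall.getContext();
        …...
    }
}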
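The start / aggregate / complete lifecycle can be sketched in plain Java as follows. This is not the Cascading interface itself; `LifecycleSketch`, `Context`, and `runGroup` are hypothetical names, and `Context` only plays the role that `IDList` plays above.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the three-phase lifecycle; all names are hypothetical.
final class LifecycleSketch {
    /** Per-group context that accumulates values between calls. */
    static final class Context {
        final List<String> ids = new ArrayList<>();
    }

    /** Called once per group: create a fresh context. */
    static Context start() {
        return new Context();
    }

    /** Called once per tuple: fold the value into the context. */
    static void aggregate(Context context, String id) {
        context.ids.add(id);
    }

    /** Called once per group: turn the context into the output value. */
    static String complete(Context context) {
        return String.join(",", context.ids);
    }

    /** Drives the three callbacks over one group's values, as a framework would. */
    static String runGroup(List<String> values) {
        Context context = start();
        for (String value : values) {
            aggregate(context, value);
        }
        return complete(context);
    }
}
```

Keeping all state in the context object, rather than in fields of the operation, is what lets the same operation instance be reused safely across groups.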
![Page 15: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/15.jpg)
Joins
CoGroup: use when neither pipe fits into memory
HashJoin: use when one of the pipes fits into memory
Pipe jointermsPipe = new HashJoin(termsPipe, new Fields("term_token"),dictionary, new Fields("word"), new Fields("app","term_token","score","d_count","index","word"), new InnerJoin());
CustomJoins and BloomJoin
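The reason a hash join needs one side to fit in memory can be seen in this plain-Java sketch of the build-and-probe idea. It is not Cascading code; the row layouts (`{app, term_token}` and `{word, index}`) are simplified from the field names above, and the sketch assumes each word appears at most once in the dictionary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of an in-memory inner hash join; names and row layouts are hypothetical.
final class HashJoinSketch {
    /**
     * Inner-joins terms {app, term_token} with dictionary {word, index}
     * on term_token == word. Assumes dictionary words are unique.
     */
    static List<String> join(List<String[]> terms, List<String[]> dictionary) {
        // Build phase: load the small side fully into memory.
        Map<String, String> indexByWord = new HashMap<>();
        for (String[] entry : dictionary) {
            indexByWord.put(entry[0], entry[1]);
        }
        // Probe phase: stream the large side and look up each key.
        List<String> out = new ArrayList<>();
        for (String[] term : terms) {
            String index = indexByWord.get(term[1]);
            if (index != null) {
                out.add(term[0] + "\t" + term[1] + "\t" + index);
            }
        }
        return out;
    }
}
```

Only the dictionary is held in memory; the terms are read once and never buffered. When neither side is small enough for the build phase, a sort-based grouping join is the fallback.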
![Page 16: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/16.jpg)
Custom Src/Sink Taps
Cascading has good support for reading from and writing to different kinds of data sources. Slight tuning or changes might be required, but most of the code already exists.
Hive (with different file formats), HBase, MySQL
http://www.cascading.org/extensions/
Set proper config parameters when reading from a source tap. For example, when reading from an HBase tap:
String tableName = "device_ids";
String[] familyNames = new String[] { "id:type1", "id:type2", "id:type3", /* ... */ "id:typen" };
Scan scan = new Scan();
scan.setCacheBlocks(false);
scan.setCaching(10000);
scan.setBatch(10000);
![Page 17: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/17.jpg)
Hive Src Taps
ExampleWorkflow.java
Tap dmTap = new HiveTableTap(HiveTableTap.SchemeType.SEQUENCE_FILE, admoFPbase, admoFPBasePartitions, dmFullFilter);
HiveTableTap.java
public class HiveTableTap extends GlobHfs {
static Scheme getScheme(SchemeType st) {
if(st.equals(SchemeType.SEQUENCE_FILE))
return new AdmobiusWritableSequenceFile(new Fields("value"), BytesWritable.class);
else if(st.equals(SchemeType.TEXT_TSV))
return new TextDelimited();
else
return null;
}
…..
}
![Page 18: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/18.jpg)
Hive Sink Taps
ExampleWorkflow.java
Tap srcDstIdsSinkTap = new Hfs(new AdmobiusWritableSequenceFile(new Fields("value"), (Class<? extends Writable>) Text.class),"/tmp/srcDstIdsSinkTap" , SinkMode.REPLACE);
conf.setOutputFormat( SequenceFileOutputFormat.class );
valueValue = (Writable) (new Text(tupleEntry.getObject( 0 ).toString().getBytes()));
![Page 19: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/19.jpg)
Hive table
CREATE TABLE CASCADING_HIVE_INTER
(
admo_id string,
segments string
)
PARTITIONED BY ( batch_id STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
![Page 20: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/20.jpg)
Good Practices
Use Checkpointing optimally.
Use subassemblies instead of rewriting logic. For further control, pass additional parameters to subassemblies.
Use compression and SequenceFile() in sink taps to chain multiple Cascading workflows.
Use Failure Traps to filter faulty records.
Avoid creating workflows that are too small or too long. Chain them in Oozie or a similar workflow management engine. Example: workflows with 10-20 MR jobs are good.
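The idea behind a failure trap, diverting records that fail an operation instead of failing the whole job, can be sketched in plain Java. This is only an illustration of the concept, not the Cascading trap mechanism; `TrapSketch` and its fields are hypothetical, and parsing a number stands in for whatever operation might throw.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the trap idea; names are hypothetical.
final class TrapSketch {
    /** Records that were processed successfully. */
    final List<Double> parsed = new ArrayList<>();
    /** Faulty records, set aside for later inspection. */
    final List<String> trapped = new ArrayList<>();

    /** Parses each record; a record that fails goes to the trap and processing continues. */
    void process(List<String> records) {
        for (String record : records) {
            try {
                parsed.add(Double.parseDouble(record));
            } catch (NumberFormatException e) {
                trapped.add(record);
            }
        }
    }
}
```

The trapped records stay available in their original form, so they can be counted or replayed after the fault is understood.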
![Page 21: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/21.jpg)
Some Properties for Optimal Performance
![Page 22: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/22.jpg)
Problems with improper configuration
1. Compression parameters: if these are not set, jobs run slowly and can sometimes take double the time. Set the correct compression type based on the cluster configuration.
2. mapred.reduce.tasks: it must be set manually depending on the size of your job. Keeping it too low slows down the reduce phase.
3. Small file issue: the input splits read by mappers can be too small, eventually bringing up more mappers than required.
4. Any custom configuration parameters: set them here and use getProperty to access them anywhere in the data workflow.
properties.setProperty("min_cutoff_score", "0.7");
FlowConnector flowConnector = new HadoopFlowConnector(properties);
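Reading such a custom parameter back might look like the following plain-Java sketch, which uses only `java.util.Properties`. The helper name `passes`, the fallback default of `"0.0"`, and the choice of `>=` for the comparison are assumptions made for this example.

```java
import java.util.Properties;

// Plain-Java illustration using java.util.Properties; the helper and its default are hypothetical.
final class CutoffSketch {
    /**
     * Returns true when the score meets the configured cutoff.
     * Falls back to 0.0 when "min_cutoff_score" is not set (an assumption for this sketch).
     */
    static boolean passes(Properties properties, double score) {
        double cutoff = Double.parseDouble(properties.getProperty("min_cutoff_score", "0.0"));
        return score >= cutoff;
    }
}
```

Because property values are strings, the cutoff has to be parsed wherever it is read; parsing it once at setup and reusing the result avoids repeating that work per record.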
![Page 23: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/23.jpg)
Running in YARN
YARN deployment is smooth with Cascading 2.5.
Make sure the config properties are set as per YARN, as they are different from MR1.
While running in workflow engines like Oozie, make sure that
• mapred.job.classpath.files and mapred.cache.files are set with all dependency files in colon-separated format.
![Page 24: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/24.jpg)
Cascading DSLs in other languages
Scalding (Scala)
PyCascading (Python)
cascading.jruby (JRuby)
Cascalog (Clojure)
![Page 25: Cascading talk in Etsy (](https://reader033.fdocuments.net/reader033/viewer/2022061202/547b85c35806b5ef3f8b4658/html5/thumbnails/25.jpg)
Thank you for your time Q & A