Python in an Evolving Enterprise System (PyData SV 2013)
![Page 1: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/1.jpg)
![Page 2: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/2.jpg)
PYTHON IN AN EVOLVING ENTERPRISE SYSTEM: EVALUATING INTEGRATION SOLUTIONS WITH HADOOP
DAVE HIMROD
STEVE KANNAN
ANGELICA PANDO
![Page 3: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/3.jpg)
Building today’s most powerful, open, and customizable advertising technology platform.
![Page 4: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/4.jpg)
Ad is served in <100 milliseconds
300x250
AUCTION REQUEST
AD RESPONSE BID: $2.50
ADVERTISER 1
BID: $3.25
ADVERTISER 2
BID: $4.10
ADVERTISER 3
APPNEXUS OPTIMIZATION
WINNING BID
![Page 5: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/5.jpg)
Evolution of AppNexus
PEOPLE 350 430 20
AD REQUESTS 45B 39B 100M FROM
MYSQL, HADOOP/HBASE, AEROSPIKE, NETEZZA, VERTICA
5000+ SERVERS
38+ TB OF DATA EVERY DAY
99.99% UPTIME
![Page 6: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/6.jpg)
Evolution of AppNexus
ENGINEERING HQ IN NYC
ENG OFFICES IN PORTLAND & SF
![Page 7: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/7.jpg)
Data-Driven Decisioning (D3)
DATA PIPELINE
D3 PROCESSING
BIDDERS
![Page 8: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/8.jpg)
Python at AppNexus
Python enables us to scale our team and rapidly iterate and prototype technologies.
![Page 9: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/9.jpg)
Hadoop at AppNexus
1 PB CLUSTER
862 NODES ACROSS SEVERAL CLUSTERS
40 BILLION LOG RECORDS DAILY
5.6 BILLION LOG RECORDS/HOUR AT PEAK
Hadoop enables us to do aggregations for reporting and other data pipeline jobs.
![Page 10: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/10.jpg)
Data modeling today
[Diagram: tasks, logs, Hadoop, Vertica, cache, data services, and Data-Driven Decisioning. Big data: TBs/hour. Medium data: GBs/hour.]
![Page 11: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/11.jpg)
To enable the next generation of data modeling, we need to leverage our Hadoop cluster
![Page 12: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/12.jpg)
What are we trying to do
Access the data on Hadoop
Continue to use Python to model
→ No consensus on the best solution
So we conducted our own research to evaluate integration options
![Page 13: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/13.jpg)
The budget problem
We have thousands of bidders buying billions of ads per hour in real-time auctions.
We need to create a model that can manipulate how our bidders spend their budgets and purchase ads.
![Page 14: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/14.jpg)
Data modeling today
[Diagram: tasks, logs, Hadoop, Vertica, cache, data services, and Data-Driven Decisioning. Big data: TBs/hour. Medium data: GBs/hour.]
![Page 15: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/15.jpg)
Test problem: Budget aggregation
SCENARIO: Each auction creates a row in a log.
timestamp, auction_id, object_type, object_id, method, value
We need to aggregate and model to update bidders.
![Page 16: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/16.jpg)
Method: Budget aggregation
STEP 1: De-duplicate records where
KEY: object_type, object_id, method, auction_id
STEP 2: Aggregate value where
KEY: object_type, object_id, method
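The two steps can be sketched in plain Python on a small in-memory sample. This is only a reference illustration of the method, not the code used in the evaluation; the sample rows, the value types, and the choice to keep the first record of each duplicate set are assumptions.

```python
from collections import defaultdict


def aggregate_budget(rows):
    """De-duplicate log rows, then sum values per (object_type, object_id, method).

    Each row is (timestamp, auction_id, object_type, object_id, method, value).
    """
    seen = set()                  # Step 1 keys already encountered
    totals = defaultdict(float)   # Step 2 running sums
    for timestamp, auction_id, object_type, object_id, method, value in rows:
        dedup_key = (object_type, object_id, method, auction_id)
        if dedup_key in seen:
            continue              # Assumption: keep the first record, drop later copies
        seen.add(dedup_key)
        totals[(object_type, object_id, method)] += float(value)
    return dict(totals)


# Hypothetical sample data; field values are made up for illustration.
rows = [
    (1362596400, "a1", "campaign", "42", "spend", 2.50),
    (1362596401, "a1", "campaign", "42", "spend", 2.50),  # duplicate of the first row
    (1362596402, "a2", "campaign", "42", "spend", 1.25),
    (1362596403, "a3", "advertiser", "7", "spend", 4.00),
]
totals = aggregate_budget(rows)
```

At 300 GB the set of seen keys would not fit in memory on one machine, which is why the implementations that follow rely on MapReduce sorting and grouping instead.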
![Page 17: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/17.jpg)
HARDWARE
• 300 GB of log data
• 5 nodes running Scientific Linux 6.3 (Carbon)
• Intel Xeon CPU @ 2.13 GHz, 4 cores
• 2 TB Disk
• CDH4
• 45 map, 35 reduce tasks at a time
![Page 18: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/18.jpg)
Research: Potential solutions
1. Native Java
2. Streaming ‒ no framework
3. mrjob
4. Happy / Jython / PyCascading
5. Pig + Jython UDF
6. Pydoop
7. Disco
8. Hadoopy / dumbo
9. Hipy
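Notes on options 6 to 9:
› Pydoop: prohibitive installation
› Disco: not Hadoop-based (we are evaluating Hadoop)
› Hadoopy / dumbo: similar to mrjob
› Hipy: effectively an ORM for Hive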
![Page 19: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/19.jpg)
Research: Criteria
1. Usability
2. Performance
3. Versatility / Flexibility
![Page 20: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/20.jpg)
Research: Native Java
Benchmark for comparison, using new Hadoop Java API
BudgetAgg.java Mapper class
BudgetAgg.java Reducer class
![Page 21: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/21.jpg)
Research: Native Java
USABILITY:
› Not straightforward for analysts to implement, launch, or tweak
PERFORMANCE:
› Fastest implementation
› Can further enhance by overriding comparators for grouping and sorting
![Page 22: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/22.jpg)
Research: Native Java
VERSATILITY / FLEXIBILITY:
› Ability to customize pretty much everything
› Custom Partitioner, Comparator, Grouping Comparator in our implementation
› Can use complex objects as keys or values
![Page 23: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/23.jpg)
Research: Streaming
Supplies an executable to Hadoop that reads from stdin and writes to stdout
mapper.py reducer.py
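The slide's mapper.py and reducer.py are not reproduced in this transcript. The following is a minimal sketch of what such a pair might look like, written as two functions so it can be run locally; the comma-separated input format and the function names are assumptions. The reducer relies on its input arriving sorted by the four key fields, so duplicates are adjacent and de-duplication and aggregation happen in one pass.

```python
def map_line(line):
    """Mapper: move the key fields to the front, tab-separated (assumes CSV input)."""
    timestamp, auction_id, object_type, object_id, method, value = (
        line.rstrip("\n").split(",")
    )
    return "\t".join([object_type, object_id, method, auction_id, value])


def reduce_sorted(lines):
    """Reducer: read mapper output sorted by key, yield one total per group."""
    current_group = None   # (object_type, object_id, method) being summed
    last_auction = None    # last auction_id seen within the current group
    total = 0.0
    for line in lines:
        object_type, object_id, method, auction_id, value = (
            line.rstrip("\n").split("\t")
        )
        group = (object_type, object_id, method)
        if group != current_group:
            if current_group is not None:
                yield "\t".join(current_group) + "\t" + str(total)
            current_group, last_auction, total = group, None, 0.0
        if auction_id == last_auction:
            continue       # Same key as the previous line: a duplicate, skip it
        last_auction = auction_id
        total += float(value)
    if current_group is not None:
        yield "\t".join(current_group) + "\t" + str(total)


# Local simulation of map -> sort -> reduce on hypothetical sample lines.
sample = [
    "1362596400,a1,campaign,42,spend,2.50",
    "1362596401,a1,campaign,42,spend,2.50",
    "1362596402,a2,campaign,42,spend,1.25",
    "1362596403,a3,advertiser,7,spend,4.00",
]
output = list(reduce_sorted(sorted(map_line(line) for line in sample)))
```

In an actual Streaming job each function would live in its own script, reading lines from stdin and printing results to stdout, with Hadoop performing the sort between the two.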
![Page 24: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/24.jpg)
Research: Streaming
USABILITY:
› Key/value detection has to be done by the user
› Still, straightforward for relatively simple jobs
hadoop jar /usr/lib/hadoop-0.23.0-mr1-cdh4b1/contrib/streaming/hadoop-*streaming*.jar \
-D stream.num.map.output.key.fields=4 \
-D num.key.fields.for.partition=3 \
-D mapred.reduce.tasks=35 \
-file mapper.py \
-mapper mapper.py \
-file reducer.py \
-reducer reducer_nongroup.py \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-input /logs/log_budget/v002/2013/03/06/19/ \
-output bidder_logs/streaming_output
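The key-field settings in the command above can be illustrated with a small simulation. With four key fields and three partition fields, records are sorted on all four fields but routed to a reducer using only the first three, so every record for a given (object_type, object_id, method) reaches the same reducer. The sketch below is not Hadoop's implementation: the function names are hypothetical and `zlib.crc32` merely stands in for the partitioner's hash.

```python
import zlib

NUM_KEY_FIELDS = 4        # mirrors stream.num.map.output.key.fields=4
NUM_PARTITION_FIELDS = 3  # mirrors num.key.fields.for.partition=3
NUM_REDUCERS = 35         # mirrors mapred.reduce.tasks=35


def split_key_value(line):
    """Treat the first NUM_KEY_FIELDS tab-separated fields as the key."""
    fields = line.split("\t")
    return tuple(fields[:NUM_KEY_FIELDS]), tuple(fields[NUM_KEY_FIELDS:])


def partition_for(key):
    """Pick a reducer from the first NUM_PARTITION_FIELDS fields of the key.

    crc32 is an illustrative stand-in; only determinism matters here.
    """
    prefix = "\t".join(key[:NUM_PARTITION_FIELDS]).encode("utf-8")
    return zlib.crc32(prefix) % NUM_REDUCERS


# Two hypothetical mapper output lines that differ only in auction_id.
key_a, value_a = split_key_value("campaign\t42\tspend\ta1\t2.50")
key_b, value_b = split_key_value("campaign\t42\tspend\ta2\t1.25")
same_reducer = partition_for(key_a) == partition_for(key_b)
```

Because auction_id is part of the sort key but not the partition key, duplicates of one auction arrive adjacent to each other at a single reducer.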
![Page 25: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/25.jpg)
Research: Streaming
PERFORMANCE:
› ~50% slower than Java
VERSATILITY / FLEXIBILITY:
› Inputs in reducer are iterated line-by-line
› Straightforward to get de-duplication and aggregation to work in a single step
![Page 26: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/26.jpg)
Research: mrjob
Open-source Python framework that wraps Hadoop Streaming
USABILITY:
› “Simplified Java”
› Great docs, actively developed
python budget_agg.py -r hadoop --hadoop-bin /usr/bin/hadoop \
--jobconf stream.num.map.output.key.fields=4 \
--jobconf num.key.fields.for.partition=3 \
--jobconf mapred.reduce.tasks=35 \
--partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-o hdfs:///user/apando/budget_logs/mrjob_output \
hdfs:///logs/log_budget/v002/2013/03/06/19/
![Page 27: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/27.jpg)
Research: mrjob
PERFORMANCE:
› Not much slower than Streaming if only using RawValueProtocol
![Page 28: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/28.jpg)
Research: mrjob
PERFORMANCE:
› Involving objects or multiple steps slows it down a lot
VERSATILITY / FLEXIBILITY:
› Can define Input / Internal / Output protocols
![Page 29: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/29.jpg)
Research: Happy / Jython
HAPPY:
› Full access to Java MapReduce API
› Happy project is deprecated
› Depends on Hadoop 0.17
JYTHON:
› Doesn’t work easily out of the box
› Relies on deprecated Jython compiler in Jython 2.2
› Limited to Jython implementation of Python
› NumPy/SciPy and Pandas unavailable
![Page 30: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/30.jpg)
Research: PyCascading
Python wrapper around the Cascading framework for data processing workflows.
Uses Jython as a high-level language for defining workflows.
![Page 31: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/31.jpg)
Research: PyCascading
USABILITY:
› Relatively new project
› Cascading API is simple and intuitive
› Job Planner abstracts details of MapReduce
PERFORMANCE:
› Abstraction makes performance tuning challenging
› Does not support Combiner operation
› Dev time was fast, runtime was slow
![Page 32: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/32.jpg)
Research: PyCascading
VERSATILITY / FLEXIBILITY:
› Allows Jython UDFs
› Rich set of built-in functions: GroupBy, Join, Merge
![Page 33: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/33.jpg)
Research: Pig
Provides a high-level language for data analysis which is compiled into a sequence of MapReduce operations.
USABILITY:
![Page 34: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/34.jpg)
Research: Pig
USABILITY:
› Powerful debugging and optimization tools (e.g. explain, illustrate)
› Automatically optimizes MapReduce operations:
  › Applies Combiner operations where applicable
  › Reorders and conflates data flow for efficiency
![Page 35: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/35.jpg)
Research: Pig
PERFORMANCE:
› Pig compiler produces performant code
› Complex operations might require manual optimization
› Budget aggregation requires implementing a User Defined Function (UDF) in Jython to eliminate an unnecessary MapReduce step
![Page 36: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/36.jpg)
Research: Pig
VERSATILITY / FLEXIBILITY: USING PIG + JYTHON UDF
› PigLatin is expressive and can capture most use cases
› Define custom data operations in Jython called UDFs
› UDFs can implement custom loaders, partitioners, and other advanced features
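The UDF used in the evaluation is not shown in this transcript. As a rough sketch of the idea, a Jython UDF could receive the bag of (auction_id, value) pairs for one group and return the de-duplicated sum, so that both steps run inside a single grouping. The function name, bag layout, and schema string are assumptions; the `outputSchema` decorator is supplied by Pig when a script is registered as a Jython UDF, and the fallback below exists only so the sketch runs outside Pig.

```python
try:
    outputSchema  # Provided by Pig when the script is registered as a Jython UDF
except NameError:
    def outputSchema(schema):
        """No-op stand-in so the function can be exercised outside Pig."""
        def decorator(func):
            return func
        return decorator


@outputSchema("total:double")
def dedup_sum(bag):
    """Sum values in a bag of (auction_id, value) tuples, counting each auction once."""
    seen = set()
    total = 0.0
    for auction_id, value in bag:
        if auction_id in seen:
            continue  # Assumption: later copies of an auction are duplicates
        seen.add(auction_id)
        total += float(value)
    return total


# Hypothetical bag for one (object_type, object_id, method) group.
result = dedup_sum([("a1", 2.5), ("a1", 2.5), ("a2", 1.25)])
```

The code sticks to plain loops and avoids newer syntax so that it stays compatible with older Jython releases.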
![Page 37: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/37.jpg)
Research: Summary
[Bar chart: Running Time / Lines of Code for Implementations. Series: Running Time (minutes) and Lines of Code. Implementations compared: Java, Streaming, MRJob, PyCascading, Pig. Axis range: 0 to 300.]
![Page 38: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/38.jpg)
Research: Recommendations
• Pig and PyCascading enable complex pipelines to be expressed simply
• Pig is more mature and the most viable option for ad-hoc analysis
![Page 39: Python in an Evolving Enterprise System (PyData SV 2013)](https://reader035.fdocuments.net/reader035/viewer/2022081403/554a0780b4c905507a8b55c7/html5/thumbnails/39.jpg)
QUESTIONS