Large-Scale Data Processing with Hadoop and PHP (IPC11 2011-10-11)
Transcript of the talk by David Zülke
LARGE-SCALE DATA PROCESSING WITH HADOOP AND PHP
David Zuelke
David Zülke
http://en.wikipedia.org/wiki/File:München_Panorama.JPG
Founder
Lead Developer
THE BIG DATA CHALLENGE
Distributed And Parallel Computing
we want to process data
how much data exactly?
SOME NUMBERS
• New data per day:
• 200 GB (March 2008)
• 2 TB (April 2009)
• 4 TB (October 2009)
• 12 TB (March 2010)
• Data processed per month: 400 PB (in 2007!)
• Average job size: 180 GB
what if you have that much data?
what if you have just 1% of that amount?
“No Problemo”, you say?
reading 180 GB sequentially off a disk will take ~45 minutes
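A quick sanity check, assuming roughly 65 to 70 MB/s of sustained sequential throughput for a single disk of that era: 180 GB ≈ 184,320 MB, and 184,320 MB / 68 MB/s ≈ 2,700 seconds, which is about 45 minutes.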
and you only have 16 to 64 GB of RAM per computer
so you can't process everything at once
general rule of modern computers:
data can be processed much faster than it can be read
solution: parallelize your I/O
but now you need to coordinate what you’re doing
and that’s hard
what if a node dies?
is data lost? will other nodes in the grid have to re-start?
how do you coordinate this?
ENTER: OUR HERO
Introducing MapReduce
in the olden days, the workload was distributed across a grid
and the data was shipped around between nodes
or even stored centrally on something like a SAN
which was fine for small amounts of information
but today, on the web, we have big data
I/O bottleneck
along came a Google publication in 2004
MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html
now the data is distributed
computing happens on the nodes where the data already is
processes are isolated and don’t communicate (share-nothing)
BASIC PRINCIPLE: MAPPER
• A Mapper reads records and emits <key, value> pairs
• Example: Apache access.log
• Each line is a record
• Extract client IP address and number of bytes transferred
• Emit IP address as key, number of bytes as value
• For hourly rotating logs, the job can be split across 24 nodes*
* In practice, it’s a lot smarter than that
BASIC PRINCIPLE: REDUCER
• A Reducer is given a key and all values for this specific key
• Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers
• Example: Apache access.log
• The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)
• We simply sum up the bytes to get the total traffic per IP!
EXAMPLE OF MAPPED INPUT
IP Bytes
212.122.174.13 18271
212.122.174.13 191726
212.122.174.13 198
74.119.8.111 91272
74.119.8.111 8371
212.122.174.13 43
REDUCER WILL RECEIVE THIS
IP              Bytes
212.122.174.13  18271
212.122.174.13  191726
212.122.174.13  198
212.122.174.13  43
74.119.8.111    91272
74.119.8.111    8371
AFTER REDUCTION
IP Bytes
212.122.174.13 210238
74.119.8.111 99643
PSEUDOCODE
function map($line_number, $line_text) {
    // parse one access.log line and emit <IP, bytes>
    $parts = parse_apache_log($line_text);
    emit($parts['ip'], $parts['bytes']);
}

function reduce($key, $values) {
    // $values holds all byte counts for one IP; sum them up
    $bytes = array_sum($values);
    emit($key, $bytes);
}
Input (access.log):
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

Reduced output:
212.122.174.13 210238
74.119.8.111 99643
A YELLOW ELEPHANT
Introducing Apache Hadoop
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.
Doug Cutting
Hadoop is a MapReduce framework
it allows us to focus on writing Mappers, Reducers etc.
and it works extremely well
how well exactly?
HADOOP AT FACEBOOK (I)
• Predominantly used in combination with Hive (~95%)
• 8400 cores with ~12.5 PB of total storage
• 8 cores, 12 TB storage and 32 GB RAM per node
• 1x Gigabit Ethernet for each server in a rack
• 4x Gigabit Ethernet from rack switch to core
http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop
Hadoop is aware of racks and locality of nodes
HADOOP AT FACEBOOK (II)
• Daily stats:
• 25 TB logged by Scribe
• 135 TB of compressed data scanned
• 7500+ Hive jobs
• ~80k compute hours
• New data per day:
• I/08: 200 GB
• II/09: 2 TB (compressed)
• III/09: 4 TB (compressed)
• I/10: 12 TB (compressed)
http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop
HADOOP AT YAHOO!
• Over 25,000 computers with over 100,000 CPUs
• Biggest Cluster:
• 4000 Nodes
• 2x4 CPU cores each
• 16 GB RAM each
• Over 40% of jobs run using Pig
http://wiki.apache.org/hadoop/PoweredBy
OTHER NOTABLE USERS
• Twitter (storage, logging, analysis. Heavy users of Pig)
• Rackspace (log analysis; data pumped into Lucene/Solr)
• LinkedIn (friend suggestions)
• Last.fm (charts, log analysis, A/B testing)
• The New York Times (converted 4 TB of scans using EC2)
JOB PROCESSING
How Hadoop Works
Just like I already described! It’s MapReduce! \o/
BASIC RULES
• Uses Input Formats to split up your data into single records
• You can optimize using combiners to reduce locally on a node
• Only possible in some cases, e.g. for max(), but not avg() (see the quick example after this list)
• You can control partitioning of map output yourself
• Rarely useful; the default partitioner (key hash) is enough
• And a million other things that really don’t matter right now ;)
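A quick illustration of why combining works for max() but not for avg() (the numbers are made up purely for this example): take the values 2, 4 and 9, split across two nodes as {2, 4} and {9}. max(max(2, 4), max(9)) = max(4, 9) = 9, which is the true maximum. But avg(avg(2, 4), avg(9)) = avg(3, 9) = 6, while the true average of 2, 4 and 9 is 5; a combiner would instead have to emit (sum, count) pairs so the Reducer can compute 15 / 3 = 5.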
HDFS
Hadoop Distributed File System
HDFS
• Stores data in blocks (default block size: 64 MB; see the quick numbers after this list)
• Designed for very large data sets
• Designed for streaming rather than random reads
• Write-once, read-many (although appending is possible)
• Capable of compression and other cool things
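To put the default block size into perspective with the 180 GB job from earlier (rough numbers): 180 GB ≈ 184,320 MB, which at 64 MB per block is about 2,880 blocks, so roughly 2,880 map tasks can read the data in parallel instead of one disk streaming it for ~45 minutes.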
HDFS CONCEPTS
• Large blocks minimize the number of seeks and maximize throughput
• Blocks are stored redundantly (3 replicas as default)
• Aware of infrastructure characteristics (nodes, racks, ...)
• Datanodes hold blocks
• Namenode holds the metadata
Critical component of an HDFS cluster: the Namenode is a single point of failure (SPOF), so high availability (HA) needs special attention
there’s just one little problem
you need to write Java code
however, there is hope...
STREAMING
Hadoop Won’t Force Us To Use Java
Hadoop Streaming can use any script as Mapper or Reducer
many configuration options (parsers, formats, combining, …)
it works using STDIN and STDOUT
Mappers are streamed the records (usually by line: <line>\n)
and emit key/value pairs: <key>\t<value>\n
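As a concrete illustration, here is a minimal sketch of what a plain PHP streaming mapper for the access.log example might look like; the file name mapper.php and the simplistic field parsing are assumptions for this example, not something shown in the talk:

#!/usr/bin/env php
<?php
// mapper.php: reads Apache access log lines from STDIN and
// emits one "<ip>\t<bytes>\n" pair per request to STDOUT
while (($line = fgets(STDIN)) !== false) {
    $fields = explode(' ', trim($line));
    if (count($fields) < 10) {
        continue; // skip lines that don't look like access log records
    }
    $ip    = $fields[0];   // client IP is the first field
    $bytes = end($fields); // response size is the last field
    if (is_numeric($bytes)) {
        echo $ip, "\t", $bytes, "\n";
    }
}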
Reducers are streamed key/value pairs:
<keyA>\t<value1>\n
<keyA>\t<value2>\n
<keyA>\t<value3>\n
<keyB>\t<value4>\n
Caution: no separate Reducer processes per key (but keys are sorted)
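This is exactly why a streaming Reducer has to detect key changes itself. A minimal sketch in plain PHP, again purely illustrative and not taken from the talk:

#!/usr/bin/env php
<?php
// reducer.php: receives "<ip>\t<bytes>\n" lines sorted by key on STDIN;
// since there is no separate process per key, we track key changes ourselves
$currentIp  = null;
$totalBytes = 0;
while (($line = fgets(STDIN)) !== false) {
    if (strpos($line, "\t") === false) {
        continue; // ignore anything that isn't a key/value pair
    }
    list($ip, $bytes) = explode("\t", trim($line), 2);
    if ($ip !== $currentIp) {
        if ($currentIp !== null) {
            echo $currentIp, "\t", $totalBytes, "\n"; // flush the previous key
        }
        $currentIp  = $ip;
        $totalBytes = 0;
    }
    $totalBytes += (int) $bytes;
}
if ($currentIp !== null) {
    echo $currentIp, "\t", $totalBytes, "\n"; // flush the last key
}

Hadoop Streaming would then run the two scripts roughly like this (the jar location varies by version and distribution, and all paths here are only examples):
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar -input /logs/access_log -output /traffic-per-ip -mapper mapper.php -reducer reducer.php -file mapper.php -file reducer.php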
STREAMING WITH PHP
Introducing HadooPHP
HADOOPHP
• A little framework to help with writing mapred jobs in PHP
• Takes care of input splitting, can do basic decoding et cetera
• Automatically detects and handles Hadoop settings such as key length or field separators
• Packages jobs as one .phar archive to ease deployment
• Also creates a ready-to-rock shell script to invoke the job
written by
http://github.com/dzuelke/hadoophp
DEMO
Hadoop Streaming & PHP in Action
The End
RESOURCES
• http://www.cloudera.com/developers/learn-hadoop/
• Tom White: Hadoop. The Definitive Guide. O’Reilly, 2009
• http://www.cloudera.com/hadoop/
• Cloudera Distribution for Hadoop is easy to install and has all the stuff included: Hadoop, Hive, Flume, Sqoop, Oozie, …
Questions?
THANK YOU!
This was http://join.in/3861 by @dzuelke.
Contact me or hire us: