LASER Foundation - Breitman Part 2
laser.inf.ethz.ch/2013/material/breitman/Breitman_Part_2.pdf
Part 2
Karin Breitman, Brazil R&D Center
Data Collection
Raw data storage
ETL RDBMS
BI
Genesis - Google
Hadoop
• Distributed system for data storage and processing (open source under the Apache license).
Hadoop
• Storage & compute in one framework
• Open source project of the Apache Software Foundation
• Written in Java

Two core components:
• HDFS: storage in the Hadoop Distributed File System
• MapReduce: compute via the MapReduce distributed processing platform
That said…

| RDBMS | Hadoop |
| --- | --- |
| Schema on write | Schema on read |
| Reads are fast | Writes are fast |
| Adjusting required | Ingested as is |
| Structured | Loosely structured |
| Good for: OLAP, ACID transactions, operational data store | Good for: data discovery, unstructured data, massive storage |
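The schema-on-write vs. schema-on-read contrast above can be sketched in a few lines of plain Python. This is an illustrative toy, not any Hadoop API; the function and record names are hypothetical.

```python
# Hypothetical sketch contrasting schema-on-write (RDBMS) with
# schema-on-read (Hadoop). All names here are illustrative.

raw_lines = [
    "2013-09-02,alice,login",
    "2013-09-02,bob,search,hadoop",   # extra field: rejected on write, tolerated on read
]

# Schema-on-write: validate before storing; non-conforming rows fail the load.
def load_into_table(line):
    fields = line.split(",")
    if len(fields) != 3:
        raise ValueError("row does not match table schema: " + line)
    return tuple(fields)

# Schema-on-read: store bytes as-is, interpret only at query time.
def parse_at_query_time(line):
    fields = line.split(",")
    return {"date": fields[0], "user": fields[1], "action": fields[2],
            "extra": fields[3:]}      # surplus fields kept, not rejected

stored = list(raw_lines)              # ingestion never fails
parsed = [parse_at_query_time(l) for l in stored]
print(parsed[1]["extra"])             # → ['hadoop']
```

The second line would abort an RDBMS bulk load, but the schema-on-read path ingests it untouched and lets the query decide what the extra field means.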
And…
• Hadoop is a paradigm shift in the way we think about and manage data
• Traditional solutions were not designed with growth in mind
• Big Data accelerates this problem dramatically

| Category | Traditional RDBMS | Hadoop |
| --- | --- | --- |
| Scalability | Resource constrained; re-architecture needed to grow; ~10 TB | Linear expansion; seamless addition & subtraction of nodes; ~5 PB |
| Fault tolerance | Afterthought; many critical points of failure | Designed in; tasks are automatically restarted |
| Problem space | Transactional, OLTP; inability to incorporate new sources | Batch, OLAP; no bounds |
More importantly
• Structural changes to an RDBMS (e.g. adding a new column) are really, really hard!
HDFS Concepts
• Performs best with a ‘modest’ number of large files
  – Millions, rather than billions, of files
  – Each file typically 100 MB or more
• Files in HDFS are ‘write once’
  – No random writes to files are allowed
  – Append support is available
• HDFS is optimized for large, streaming reads of files, rather than random reads
HDFS
• Hadoop Distributed File System
  – Data is organized into files & directories
  – Files are divided into blocks, typically 64–128 MB each, and distributed across cluster nodes
  – Block placement is known at runtime by MapReduce, so computation can be co-located with data
  – Blocks are replicated (default is 3 copies) to handle failure
  – Checksums are used to ensure data integrity
• Replication is the one and only strategy for error handling, recovery and fault tolerance
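The block-splitting and replication described above can be sketched as a small simulation. The block size and replication factor mirror the defaults mentioned in the slides; the round-robin placement policy is a deliberate simplification (real HDFS placement is rack-aware), and all names are illustrative.

```python
# Toy model of HDFS-style block splitting and replica placement.
# Round-robin placement is a simplification; real HDFS is rack-aware.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, as in the slide's typical range
REPLICATION = 3                  # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file of file_size bytes occupies (ceiling division)."""
    return (file_size + block_size - 1) // block_size

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.
    Assumes len(nodes) >= replication."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)        # a 300 MB file
print(blocks)                                        # → 3
print(place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"]))
```

A 300 MB file becomes three blocks (two full 128 MB blocks plus a 44 MB tail), each held by three distinct DataNodes.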
Hadoop Architecture - HDFS
• Block-level storage
• N-node replication
• NameNode for
  – File system index (EditLog)
  – Access coordination
• DataNode for
  – Data block management
  – Job execution (MapReduce)
• Automated fault tolerance
NameNode
• Provides a centralized repository for the namespace
  – An index of which files are stored in which blocks
• Responds to client requests (MapReduce jobs) by coordinating the distribution of tasks
Hadoop treats all nodes as DataNodes, meaning that they can store data, but designates at least one node to be the NameNode.

The Hadoop file system is classified as a “distributed” file system because it manages storage across a network of machines: files are distributed across several nodes, in the same or different racks or clusters.

For each Hadoop file, the NameNode decides on which disk each copy of each file block will reside, and keeps track of all that information in tables stored on its local disks.
When a node fails, the NameNode identifies all the file blocks that have been affected; retrieves copies of these blocks from other healthy nodes; finds new nodes to store another copy of them; stores those copies there; and updates this information in its tables.
When an application needs to read a file, it first connects to the NameNode to get the addresses of the disk blocks where the file blocks reside; the application can then read those blocks directly, without going through the NameNode again.

A common concern about the Hadoop Distributed File System is that the NameNode can become a single point of failure.
File System Browser
MapReduce
MapReduce Framework
• Map step
  – Input records are parsed into intermediate key/value pairs
  – Multiple maps per node
    • 10 TB at 128 MB/block => ~82K map tasks
• Reduce step
  – Each reducer handles all like keys
  – 3 steps:
    • Shuffle: all like keys are retrieved from each mapper
    • Sort: intermediate keys are sorted prior to reduce
    • Reduce: values are processed
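The map, shuffle/sort, and reduce steps above can be simulated in a few lines of plain Python for the classic word-count job. No Hadoop APIs are involved; this only shows the data flow.

```python
# Pure-Python simulation of the MapReduce phases for word count.
from itertools import groupby
from operator import itemgetter

def map_step(record):
    # Map: parse one input record into intermediate (key, value) pairs.
    return [(word, 1) for word in record.split()]

def shuffle_sort(pairs):
    # Shuffle & sort: bring all pairs together, ordered by intermediate key.
    return sorted(pairs, key=itemgetter(0))

def reduce_step(sorted_pairs):
    # Reduce: each key is processed with the full list of its values.
    return {key: sum(v for _, v in group)
            for key, group in groupby(sorted_pairs, key=itemgetter(0))}

records = ["big data big cluster", "data lake"]
intermediate = [pair for r in records for pair in map_step(r)]
print(reduce_step(shuffle_sort(intermediate)))
# → {'big': 2, 'cluster': 1, 'data': 2, 'lake': 1}
```

In real Hadoop the same three phases run in parallel across many nodes, with the framework, not user code, performing the shuffle and sort between mappers and reducers.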
Map Reduce
MapReduce Programming with Java
• Very low-level access to the Hadoop APIs
  – Ultimately not the best/easiest way for a Data Scientist to interact
• Components:
  – Mapper: class & method (map) called by the framework to process (parse) the source data line by line
  – Reducer: class & method (reduce) called by the framework to process (combine) the output of the mappers and build the final output
  – Job: runtime context for Hadoop
Reduce Task
• After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
• This list is given to a Reducer
  – There may be a single Reducer, or multiple Reducers
  – This is specified as part of the job configuration (see later)
  – All values associated with a particular intermediate key are guaranteed to go to the same Reducer
  – The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
  – This step is known as the ‘shuffle and sort’
• The Reducer outputs zero or more final key/value pairs
  – These are written to HDFS
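The sorted-key-order guarantee above is what makes a Hadoop Streaming-style reducer so simple: because all values for a key arrive contiguously, the reducer only needs a running total and emits a final pair on every key change. A sketch (input/output formats are the usual tab-separated streaming convention; the sample data is made up):

```python
# Streaming-style reducer: consumes "key\tvalue" lines already in sorted key
# order (the framework's shuffle and sort), emits one "key\ttotal" per key.
import io

def streaming_reduce(lines, out):
    current_key, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                out.write(f"{current_key}\t{total}\n")   # final key/value pair
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        out.write(f"{current_key}\t{total}\n")           # flush the last key

# In a real streaming job this would read sys.stdin and write sys.stdout.
buf = io.StringIO()
streaming_reduce(["big\t1\n", "big\t1\n", "data\t1\n"], buf)
print(buf.getvalue(), end="")
# → big	2
# → data	1
```

Without the sorted-order guarantee, the reducer would instead have to hold counts for every key in memory at once.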
Bestiary
• Storage: HDFS
• Compute: MapReduce (Java, Python, streaming)
• Packages: Hive, HBase, Pig
• Analytics: Mahout, R
Additional Slides
Hive
• Pseudo-database on top of HDFS
• Stores data on HDFS (/user/hive/warehouse)
• Each table has a directory with files underneath
• Files are delimited files, Sequence Files, map parts, or reduce parts
• Has a command-line interface and a Thrift server
• Stores metadata in Derby (default), MySQL, or Postgres
• Best place for syntax: http://hive.apache.org and view the manual
• Ability to create UDFs
Pig
• Provides a mechanism for using MapReduce without programming in Java
  – Utilizes HDFS & MapReduce
• Allows for a more intuitive means to specify data flows
  – High-level, sequential, data-flow language
  – Pig Latin, flow expression
  – Python integration
• Comfortable for researchers who are familiar with Perl & Python
• Pig is easier to learn & execute, but more limited in scope of functionality than Java
Mahout
• Important stuff first: the most common pronunciation is “Ma-h-out” – rhymes with ‘trout’
• Machine learning library that runs on HDFS
• 4 primary use cases:
  – Recommendation mining – people who like X also like Y
  – Clustering – topic-based association
  – Classification – assign new docs to existing categories
  – Frequent itemset mining – which things will appear together