CSci 5707, Fall 2013
MapReduce vs. Parallel DBMS
Hamid Safizadeh, Otelia Buffington
University of Minnesota
MapReduce Idea
Mapping
map (k1, v1) → list (k2, v2)
Reducing
reduce (k2, list(v2)) → list (v2)
Pseudo-code for counting the number of occurrences of each word in a large collection of documents
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI’04
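The pseudo-code on this slide did not survive in the transcript. As a stand-in, here is a minimal single-process Python sketch of the word-count idea, following the map and reduce signatures given above. The names map_fn, reduce_fn, and run_word_count and the sample documents are assumptions made for illustration; a real MapReduce framework would perform the grouping step itself and spread the work across many machines.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in contents.split()]


def reduce_fn(word, counts):
    """reduce(k2, list(v2)) -> list(v2): sum the counts for one word."""
    return [sum(counts)]


def run_word_count(documents):
    # Group intermediate values by key; a real framework does this step.
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, one in map_fn(name, text):
            grouped[word].append(one)
    # Call reduce once per distinct word.
    return {word: reduce_fn(word, counts)[0] for word, counts in grouped.items()}


# Hypothetical input: document name -> document contents.
docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog and the fox"}
word_counts = run_word_count(docs)
```

The user only writes map_fn and reduce_fn; everything in run_word_count stands in for what the framework would supply.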
MapReduce Example
Calculation of the number of occurrences of each word
http://aimotion.blogspot.com/2010/08/mapreduce-with-mongodb-and-python.html
MapReduce Architecture
Execution overview
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI’04
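The execution overview figure is missing from the transcript. The sketch below simulates, in one Python process, the flow described in the cited paper: the input is divided into splits, each split is handled by a map task, intermediate keys are partitioned into R regions with hash(key) mod R, and each reduce task sorts its region and writes one output. The function names, the use of zlib.crc32 as the hash, and the sample splits are assumptions for illustration; the master, the worker machines, and the distributed file system are not modeled.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # hash(key) mod R; crc32 keeps the result stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, map_fn, reduce_fn, num_reducers):
    # Map phase: one map task per input split, writing into R regions.
    regions = [defaultdict(list) for _ in range(num_reducers)]
    for split_id, split in enumerate(splits):
        for key, value in map_fn(split_id, split):
            regions[partition(key, num_reducers)][key].append(value)
    # Reduce phase: each reduce task sorts its keys and emits one output "file".
    outputs = []
    for region in regions:
        outputs.append([(key, reduce_fn(key, region[key])) for key in sorted(region)])
    return outputs


def count_map(split_id, text):
    return [(word, 1) for word in text.split()]


def count_reduce(key, values):
    return sum(values)


# Hypothetical job: two input splits (M = 2) and two reduce tasks (R = 2).
output_files = run_job(["a b a", "b c"], count_map, count_reduce, num_reducers=2)
```

Because all values for a given key land in the same region, each key appears in exactly one of the R outputs.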
MapReduce or Parallel DBMS
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., and Stonebraker, M., “A comparison of approaches to large-scale data analysis”, ACM SIGMOD International Conference, 2009 (http://database.cs.brown.edu/projects/mapreduce-vs-dbms)
Dean, J., and Ghemawat, S., “MapReduce: A flexible data processing tool”, Communications of the ACM, Vol. 53, 2010 (DOI: 10.1145/1629175.1629198)
MapReduce Design Properties
Heterogeneous Systems: Processing and combining data from a wide variety of storage systems (such as relational databases, file systems, etc.)
Fault Tolerance: Providing fine-grain fault tolerance for large jobs (a failure in the middle of a multi-hour execution does not require restarting the job from scratch)
Complex Functions: Simple Map and Reduce functions have straightforward SQL equivalents; MapReduce offers a better framework for some complicated tasks
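To make the "straightforward SQL equivalents" point concrete, the following sketch computes the same grouped sum twice: once as a SQL GROUP BY query using Python's built-in sqlite3 module, and once as a map function and a reduce function. The table name visits, its columns, and the sample rows are hypothetical and are not taken from the slides or the cited article.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (customer, amount) records.
rows = [("alice", 10), ("bob", 5), ("alice", 7)]

# SQL version: a grouped aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute("SELECT customer, SUM(amount) FROM visits GROUP BY customer")
)
conn.close()


# MapReduce version: map emits the grouping key, reduce applies the aggregate.
def map_fn(record):
    customer, amount = record
    return [(customer, amount)]


def reduce_fn(customer, amounts):
    return sum(amounts)


grouped = defaultdict(list)
for record in rows:
    for key, value in map_fn(record):
        grouped[key].append(value)
mr_result = {key: reduce_fn(key, values) for key, values in grouped.items()}
```

Here the GROUP BY column plays the role of the intermediate key and SUM plays the role of the reduce function.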
Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Vol. 53, 2010
MapReduce Design Properties
Performance
  Loading data: Startup overhead for MapReduce
  Reading data: Full scan over large data files
  Merging results: A MapReduce as the next consumer
Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Vol. 53, 2010
Cost
  Hardware: Network workstations
  Software: Open source (Hadoop)
  Communication: Network system
Companies Using Hadoop
Facebook, Yahoo!, Google, Amazon, Twitter