HADOOP

INTRODUCTION TO BIG DATA & HADOOP
Harinderjit Kaur, M.Tech (CSE), PIT KAPURHALA

Transcript of HADOOP

  1. Harinderjit Kaur, M.Tech (CSE), PIT KAPURHALA
  2. What is the need for Big Data technology when we already have robust, high-performing relational database management systems?
  3. RDBMS: data is stored in a structured format (primary keys, rows, columns, tuples, foreign keys). It was designed for transactional data analysis; later, data warehouses were used for offline data. With the massive use of the Internet and social networking (Facebook, LinkedIn), data became less structured.
  4. What is Big Data? Big Data is similar to small data, but bigger: datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, and analytics.
  5. 3 Vs of Big Data: Volume (data quantity), Velocity (data speed), Variety (data types).
  6. Hadoop History: In 2003 Doug Cutting was creating Nutch, an open-source web crawler and indexer. Crawling and indexing at web scale were difficult, a massive storage and processing problem. Google published the GFS paper in 2003 and the MapReduce paper in 2004. Based on Google's papers, Doug redesigned Nutch and delivered the result in 2006 as Hadoop.
  7. What is Hadoop? A framework of tools. Open source, maintained by Apache under the Apache License. Supports running applications on Big Data, addressing the Big Data challenges: Volume, Velocity, Variety.
  8. What is Hadoop? Hadoop is a software framework for distributed processing of large datasets across large clusters of computers. Large datasets: terabytes or petabytes of data. Large clusters: hundreds or thousands of nodes. Hadoop is an open-source implementation of Google's MapReduce, is based on a simple programming model called MapReduce, and is written in Java.
  9. Hadoop makes it easier to store, process, and analyze a lot of data on commodity hardware!
  10. Apache Hadoop. Developer(s): Apache Software Foundation; Initial release: December 10, 2011; Stable release: 2.6.0 / November 18, 2014; Development status: Active; Written in: Java.
  11. Operating system: Cross-platform; Type: Distributed file system; License: Apache License 2.0; Website: hadoop.apache.org
  12. Characteristics of Hadoop. Scalable: a cluster can be expanded by adding new servers or resources without having to move, reformat, or change the dependent analytic workflows or applications. Cost effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
  13. Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analysis than any one system can provide. Fault tolerant: when you lose a node, the system redirects work to another node holding a replica of the data and continues processing without missing a beat.
  14. Hadoop Master/Slave Architecture: Hadoop is designed as a master-slave, shared-nothing architecture, with a single master node and many slave nodes.
  15. Hadoop Components. HDFS (storage): self-healing, high-bandwidth clustered storage. MapReduce (processing): fault-tolerant distributed processing.
  16. HDFS Basics: HDFS (Hadoop Distributed File System) is a file system written in Java. It sits on top of a native file system and provides redundant storage for massive amounts of data.
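Because HDFS exposes a Java API, a small client program shows that, from an application's point of view, it behaves like an ordinary file system. A minimal sketch, assuming fs.defaultFS is configured on the classpath to point at the cluster's NameNode; the path /user/demo/hello.txt and the class name HdfsHello are illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath;
    // on a real cluster this points at the NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");   // hypothetical path

    // Write a small file into HDFS (overwrite if it already exists).
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello, hdfs");
    }

    // Read it back.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}
```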
  17. Main Properties of HDFS. Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data. Replication: each data block is replicated many times (default is 3). Failure: failure is the norm rather than the exception. Fault Tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS; the NameNode is constantly checking the DataNodes.
  18. Hadoop Distributed File System (HDFS). Centralized NameNode: maintains metadata information about files. Many DataNodes (1000s): store the actual data; files are divided into blocks; each block is replicated N times (default N = 3). (Diagram: a file F divided into five 64 MB blocks.)
  19. HDFS Data: data is split into blocks and stored on multiple nodes in the cluster. Each block is usually 64 MB or 128 MB. Each block is replicated multiple times, with replicas stored on different DataNodes. HDFS is intended for large files (100 MB+).
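The block size, replication factor, and placement of blocks on DataNodes can all be observed through the HDFS Java API. A hedged sketch, assuming a file /user/demo/big-input.txt has already been loaded into HDFS (the path and class name are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical large file already loaded into HDFS.
    Path file = new Path("/user/demo/big-input.txt");
    FileStatus status = fs.getFileStatus(file);

    System.out.println("Block size:  " + status.getBlockSize());   // e.g. 64 MB or 128 MB
    System.out.println("Replication: " + status.getReplication()); // default 3

    // Each BlockLocation describes one block and the DataNodes holding its replicas.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}
```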
  20. Two Kinds of Nodes: Master Nodes and Slave Nodes.
  21. Master Nodes. NameNode: only 1 per cluster; metadata server and database. JobTracker: only 1 per cluster; job scheduler.
  22. Slave Nodes. DataNodes: 1-4000 per cluster; block data storage. TaskTrackers: 1-4000 per cluster; task execution.
  23. NameNode: a single NameNode stores all metadata: filenames, the locations of each block on the DataNodes, owner, group, etc. All information is maintained in RAM for fast lookup, so the file system metadata size is limited to the amount of available RAM on the NameNode.
  24. DataNode: DataNodes store file contents. Different blocks of the same file are stored on different DataNodes, and the same block is stored on three (or more) DataNodes for redundancy.
  25. MapReduce
  26. MapReduce: a programming model used by Google. Input: a set of key/value pairs. The user supplies two functions: map(k, v) -> list(k1, v1) and reduce(k1, list(v1)) -> v2. Map: processes a key/value pair to generate intermediate key/value pairs. Reduce: merges all intermediate values associated with the same key.
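The two signatures can be illustrated without Hadoop at all, using plain Java collections. The sketch below counts words: map emits an intermediate (word, 1) pair per word, a grouping step stands in for the framework's shuffle, and reduce sums the values for each key. The class and method names are made up for the illustration; the real Hadoop version follows on the WordCount slide.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceModel {

  // map(k, v) -> list(k1, v1): for one input record (a line of text),
  // emit an intermediate (word, 1) pair for every word. The input key
  // (line number) is ignored here, as is common in word count.
  static List<SimpleEntry<String, Integer>> map(long lineNo, String line) {
    return Arrays.stream(line.trim().split("\\s+"))
        .map(word -> new SimpleEntry<>(word, 1))
        .collect(Collectors.toList());
  }

  // reduce(k1, list(v1)) -> v2: merge all intermediate values for one key.
  static int reduce(String word, List<Integer> counts) {
    return counts.stream().mapToInt(Integer::intValue).sum();
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("hadoop stores data", "hadoop processes data");

    // The framework's "shuffle" step: group intermediate values by key.
    Map<String, List<Integer>> grouped = lines.stream()
        .flatMap(line -> map(0L, line).stream())
        .collect(Collectors.groupingBy(SimpleEntry::getKey,
            Collectors.mapping(SimpleEntry::getValue, Collectors.toList())));

    // Call reduce once per key.
    grouped.forEach((word, counts) ->
        System.out.println(word + "\t" + reduce(word, counts)));
  }
}
```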
  27. MapReduce. (Diagram: a MapReduce job is submitted by a client computer to the JobTracker on the master node; the JobTracker hands task instances to TaskTrackers running on the slave nodes.)
  28. Properties of the MapReduce Engine: the JobTracker is the master node (it runs alongside the NameNode). It receives the user's job and decides how many tasks will run (the number of mappers). Example: a file with 5 blocks results in 5 map tasks, spread across the nodes (Node 1, Node 2, Node 3).
  29. Properties of the MapReduce Engine (Contd): the TaskTracker is the slave node (it runs on each DataNode). It receives tasks from the JobTracker, runs each task to completion (either a map or a reduce task), and stays in constant communication with the JobTracker, reporting progress. (Diagram: in this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks.)
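A brief note on how those task counts come about in code: the number of map tasks is not set directly; it falls out of the number of input splits (by default, roughly one split per HDFS block). The number of reduce tasks, by contrast, is chosen explicitly by the job. A minimal sketch, assuming the Hadoop 2.x org.apache.hadoop.mapreduce.Job API (the slides describe the classic JobTracker/TaskTracker engine, but the job-side setting is the same idea):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaskCounts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "example job");

    // Map task count is derived by the framework from the input splits
    // (one split per HDFS block by default), so there is no setter for it.
    // The reduce task count is chosen by the job:
    job.setNumReduceTasks(3);   // matches the 3 reduce tasks in the slide's example
  }
}
```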
  30. How Map and Reduce Work Together: Map returns information; Reduce accepts information. Reduce applies a user-defined function to reduce the amount of data.
  31. MapReduce Example: WordCount
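The slide names the classic WordCount example; a sketch of it in Java is given below, closely following the standard Apache Hadoop tutorial WordCount for the 2.x org.apache.hadoop.mapreduce API. The mapper emits (word, 1) for every token of the input, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each line of input, emit (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum all counts emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure and submit the job; args[0] is the HDFS input path,
  // args[1] the (not yet existing) HDFS output path.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as the combiner is a common design choice in this example: partial sums are computed on the map side before the shuffle, which cuts down the amount of data sent over the network to the reducers.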
  32. Lifecycle of a MapReduce Job: write a Map function and a Reduce function, then run the program as a MapReduce job.
  33. Hadoop Workflow (you and the Hadoop cluster): 1. Load data into HDFS. 2. Develop code locally. 3. Submit the MapReduce job. 3a. Go back to Step 2. 4. Retrieve data from HDFS.
  34. Thank you