BIG DATA TRAINING
Ranga Vadlamudi March 2014
What is Big Data?
• Volume: Large Amounts of Data at rest
• Velocity: milliseconds to seconds to respond
• Variety: Data in many forms (Structured, Unstructured, Multimedia, Text, etc.)
• Veracity: Data in doubt
• 30 billion pieces of content a month
• 1 Peta byte of content every day
• 2 Billion videos watched everyday
• 3 Billion people will be online
• Sharing 8 zettabytes of data
CAP THEOREM (Consistency, Availability, Partition tolerance)
Big Data Solutions
Big Data
Real Time Querying
Batch Querying
Mining & Analytics
Machine Learning
Storage
Technology
Background • Underlying technology invented by Google • Google BigTable & Google File System • Doug Cutting created Nutch, and Hadoop was spun off at Yahoo
• Yahoo played a key role in developing Hadoop for enterprise applications
Hadoop • Is a framework • Built on commodity hardware • Implements a computational paradigm called MapReduce
• Provides a distributed file system called HDFS to store data
• Node failures are automatically handled
Data Becomes the Bottleneck
• Getting data to processors is expensive • Typical disk data transfer rate: 75MB/sec • 100GB data transfer: ~22 minutes • A new approach is needed
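The 22-minute figure follows directly from the slide's numbers; a quick sketch of the arithmetic (assuming decimal units, 1GB = 1000MB, as is typical for disk specs):

```python
# Rough disk-transfer arithmetic behind the slide's figures.
# Assumes 1 GB = 1000 MB (decimal units, as disk vendors quote).
transfer_rate_mb_per_sec = 75            # typical disk transfer rate from the slide
data_size_mb = 100 * 1000                # 100 GB

seconds = data_size_mb / transfer_rate_mb_per_sec
minutes = seconds / 60
print(f"{minutes:.1f} minutes")          # roughly 22 minutes, matching the slide
```

At cluster scale this per-disk ceiling is why Hadoop moves the computation to the data instead of streaming data to a central processor.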
Hadoop Solves • Problems where you have a lot of data • A mixture of complex and structured data • Speeds up computations by distribution • Mantra: take the computation to the data, don't bring the data to the computation
Hadoop Distributions
Hadoop Architecture • Master/slave philosophy • Designed to run on a large number of machines • Machines don't share memory or disk
• Rack them up and run Hadoop on each machine
Hadoop Architecture • Data is divided and spread across servers • Hadoop keeps track of where the data is • Hadoop replicates data into multiple copies to avoid a single point of failure
• MapReduce is a programming model to process large sets of data in parallel
• Map the operation out to all servers • Shuffle the results • Reduce the results back into one result set
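The three steps above can be sketched with word count, the canonical MapReduce example. This is a single-process illustration of the model, not Hadoop's distributed implementation:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    return [(word, 1) for word in document.split()]

def shuffle_phase(mapped):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into one result
    return {word: sum(counts) for word, counts in groups.items()}

# Each document stands in for one input split on one server
documents = ["big data big ideas", "big clusters"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
result = reduce_phase(shuffle_phase(mapped))
print(result)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

In Hadoop the map calls run in parallel on the servers holding each split, and the shuffle moves grouped keys over the network to the reducers.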
Hadoop Components
HDFS (Hadoop Distributed File System)
HDFS • Distributed file system • Highly fault tolerant • An HDFS instance can span many servers • Handles large datasets, from terabytes to petabytes • Moving computation is cheaper than moving data
• Large block sizes (128MB for example)
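A quick sketch of the block arithmetic, using the slide's 128MB example block size and HDFS's default replication factor of 3 (the 1GB file size is a hypothetical, and this is illustrative arithmetic, not a Hadoop API):

```python
import math

# How HDFS splits a file into blocks (illustrative arithmetic only).
block_size_mb = 128        # example block size from the slide
replication_factor = 3     # HDFS default replication factor

file_size_mb = 1000        # a hypothetical 1 GB file
num_blocks = math.ceil(file_size_mb / block_size_mb)
total_replicas = num_blocks * replication_factor

print(num_blocks)          # 8 blocks (the last one only partially full)
print(total_replicas)      # 24 block replicas spread across the cluster
```

Large blocks keep the number of blocks per file small, which reduces seek overhead and the metadata the NameNode must track.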
HDFS Layout
Cloudera Manager
• Management software for the Hadoop ecosystem
• Helps install, manage, and maintain a cluster • Resource consumption tracking • Proactive health checks • Alerting • Config changes
Cloudera Capabilities
Demos: Cloudera, Cassandra, MongoDB
Questions?