Big datatraining ranga_1

BIG DATA TRAINING

Ranga Vadlamudi March 2014

What is Big Data

•  Volume: Large Amounts of Data at rest

•  Velocity: milliseconds to seconds to respond

•  Variety: Data in many forms (Structured, Unstructured, Multimedia, Text etc.)

•  Veracity: Data in doubt

•  30 billion pieces of content a month

•  1 petabyte of content every day

•  2 billion videos watched every day

•  3 billion people will be online

•  Sharing 8 zettabytes of data

CAP THEOREM (Consistency, Availability, Partition Tolerance)

Big Data Solutions

Big Data

Real Time Querying

Batch Querying

Mining & Analytics

Machine Learning

Storage

Technology

Background

•  Underlying technology invented by Google

•  Google Bigtable & Google File System

•  Doug Cutting created Nutch, and Hadoop was spun off at Yahoo

•  Yahoo played a key role in developing Hadoop for enterprise applications

Hadoop

•  Is a framework

•  Built on commodity hardware

•  Implements a computational paradigm called MapReduce

•  Provides a distributed file system called HDFS to store data

•  Node failures are automatically handled

Data Becomes the Bottleneck

•  Getting data to processors is expensive

•  Typical disk data transfer rate: 75 MB/sec

•  100 GB data transfer: approx. 22 minutes

•  A new approach is needed
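The 22-minute figure above follows from simple arithmetic. The sketch below reproduces it, assuming decimal units (1 GB = 1000 MB) and no seek or network overhead. The helper name transfer_minutes and the 100-disk scenario are illustrative assumptions, not figures from the slides.

```python
def transfer_minutes(size_gb: float, rate_mb_per_sec: float, disks: int = 1) -> float:
    """Idealized minutes needed to read size_gb spread evenly over `disks` disks."""
    size_mb = size_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    seconds = size_mb / (rate_mb_per_sec * disks)
    return seconds / 60


# Figures from the slide: 100 GB read from one disk at 75 MB/sec.
single = transfer_minutes(100, 75)

# Hypothetical scenario: the same data split across 100 disks read in parallel.
parallel = transfer_minutes(100, 75, disks=100)

print(f"one disk:  {single:.1f} minutes")
print(f"100 disks: {parallel * 60:.1f} seconds")
```

Reading in parallel from many disks is what motivates spreading the data across machines instead of pulling it through a single disk.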

Hadoop Solves

•  Problems where you have a lot of data

•  Mixture of complex and structured data

•  Speeds up computations by distributing them

•  Mantra: take the computation to the data, don't bring the data to the computation

Hadoop Distributions

Hadoop Architecture

•  Master/slave philosophy

•  Designed to run on a large number of machines

•  Machines don't share memory or disk

•  Rack them up and run Hadoop on each machine

Hadoop Architecture

•  Data is divided and spread across servers

•  Hadoop keeps track of where the data is

•  Hadoop replicates data into multiple copies to avoid a single point of failure

•  MapReduce is a programming model to process large sets of data in parallel

•  Map the operation out to all servers

•  Shuffle the results

•  Reduce the results back into one result set
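The map, shuffle and reduce steps can be sketched with a word count. This is a single-process simulation in plain Python, not the Hadoop API. The function names and the sample input are hypothetical, and each input string stands in for the data held on one server.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input: one string per server's share of the data.
splits = ["big data is big", "hadoop stores big data"]

mapped = chain.from_iterable(map_phase(line) for line in splits)
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

In a real cluster the map calls run on the machines that hold each split, and the shuffle moves the grouped pairs over the network to the reducers.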

Hadoop Components

HDFS (Hadoop Distributed File System)

HDFS

•  Distributed file system

•  Highly fault tolerant

•  An HDFS instance can span many servers

•  Handles large datasets, from terabytes to petabytes

•  Moving computation is cheaper than moving data

•  Large block sizes (128 MB, for example)
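To make the block size concrete, the sketch below estimates how many blocks a file occupies and how much raw disk its replicas consume. The 128 MB block size comes from the slide. The replication factor of 3, the 1000 MB file size and the helper name hdfs_footprint are assumptions made for illustration.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, raw MB stored across all replicas).

    Assumption: replication=3 is not stated in the slides; it is used here
    only as a typical value. The last block holds only the leftover data,
    so raw storage is file size times replication.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


# Hypothetical 1000 MB file.
blocks, raw_mb = hdfs_footprint(1000)
print(f"{blocks} blocks, {raw_mb} MB of raw storage")
```

Large blocks keep the number of blocks per file small, which reduces the bookkeeping needed to track where the data lives.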

HDFS Layout

Cloudera Manager

•  Management software to manage the Hadoop ecosystem

•  Helps install, manage and maintain a cluster

•  Resource consumption tracking

•  Proactive health checks

•  Alerting

•  Config changes

Cloudera Capabilities

Demo

•  Cloudera

•  Cassandra

•  MongoDB

Questions?
