All About Big Data
By Sai Venkatesh Attaluri
Head – BD & Big Data Analytics
Netxcell Limited
Big data is a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools. The
challenges include capture, curation, storage, search, sharing, analysis,
and visualization. The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of data,
allowing correlations to be found to "spot business trends, determine
quality of research, prevent diseases, link legal citations, combat crime,
and determine real-time roadway traffic conditions." (Wikipedia)
“Any fool can make things bigger, more complex, and more violent. It takes
a touch of genius, and a lot of courage, to move in the opposite direction.” -
Albert Einstein
Big Data - Definition
Simplifying the Definition
• Big data refers to data that is too big to fit on a single
server, too unstructured to fit into a row-and-column
database, or too continuously flowing to fit into a
static data warehouse. - Thomas H. Davenport
• Put another way, big data is the realization of greater
business intelligence by storing, processing, and
analyzing data that was previously ignored due to the
limitations of traditional data management
technologies.
About Big Data
Every second of every day, businesses generate more data. Researchers
at IDC estimate that by the end of 2013, the amount of stored data will
exceed 4 zettabytes, or 4 billion terabytes.
All of that big data represents a big opportunity for organizations.
Big data is a term applied to data sets whose size is beyond the ability of
commonly used software tools to capture, manage, and process the data
within a tolerable elapsed time.
In simplest terms, "Big Data" refers to the tools, processes, and
procedures that allow an organization to create, manipulate, and manage
extremely large data sets — measured in terabytes, petabytes, or even
zettabytes.
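The unit scales mentioned above, and the IDC estimate of 4 zettabytes equalling 4 billion terabytes, can be checked with simple arithmetic (decimal SI units assumed):

```python
# Storage unit scales (decimal SI), illustrating the sizes mentioned above.
TB = 10**12  # terabyte
PB = 10**15  # petabyte
ZB = 10**21  # zettabyte

print(ZB // TB)      # terabytes per zettabyte -> 1000000000 (one billion)
print(4 * ZB // TB)  # 4 ZB in TB -> 4000000000, matching "4 billion terabytes"
```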
How Does Big Data Differ from
Traditional Transactional Systems?
Traditional Transaction Systems:
• TTS are designed and implemented to track information whose format
and use are known ahead of time.
• Data that resides within the fixed confines of a record or file is known
as structured data. Structured data, even in large volumes, can be
entered, stored, queried, and analyzed in a simple and straightforward
manner; this type of data is best served by a Traditional Transaction
Database.
• TTS do not support unstructured data.

Big Data:
• Big Data systems are deployed when the questions to be asked and the
data formats to be examined aren't known ahead of time.
• Data that comes from a variety of sources, such as emails, text
documents, videos, photos, audio files, and social media posts, is
referred to as unstructured data.
TTS Vs Big Data
Traditional Transaction Systems:
• Companies whose data workloads are constant and predictable will be
better served by a traditional database.
• In cases where organizations rely on time-sensitive data analysis, a
traditional database is the better fit. That's because shorter
time-to-insight isn't about analyzing large unstructured datasets; it's
about analyzing smaller data sets in real or near-real time, which is
what traditional databases are well equipped to do.

Big Data:
• Companies challenged by increasing data demands will want to take
advantage of Big Data's scalable infrastructure. Scalability allows
servers to be added on demand to accommodate growing workloads.
• Big Data is designed for large distributed data processing that
addresses every file in the database, and that type of processing takes
time. For tasks where fast performance isn't critical, such as running
end-of-day reports to review daily transactions, scanning historical
data, and performing analytics where a slower time-to-insight is
acceptable, Big Data is ideal.
TTS vs Big Data (Continued)
Unfortunately, extracting valuable information from big data isn't as
easy as it sounds. Big data amplifies any existing problems in your
infrastructure, processes, or even the data itself.
It is also misrepresented by the media, making it difficult for
organizations to determine whether investing in Big Data will bring the
expected results, improve efficiency, and lead to better products and
services.
Misconceptions of Big Data
The Promise of Big Data
Companies recognize that big data contains valuable information,
enabling them to:
• Obtain actionable insights (product performance)
• Deepen customer relationships (understanding customer behavior)
• Prevent threats and fraud
• Identify new revenue opportunities
80-90% of the data produced today is unstructured.
Evolution of big data
Big Data
Volume
Variety
Velocity
Veracity
The 4 V's
To make the most of the information in their systems, companies
must successfully deal with the 4 V's that distinguish big data:
1. Variety
2. Volume
3. Velocity and
4. Veracity.
The first three (variety, volume, and velocity) define big data:
when you have a large volume of data coming in from a wide
variety of applications and formats, and it's moving and changing at
a rapid velocity, that's when you know you have big data.
Definition of the V's
Volume
– Big Data tools and services are designed to manage extremely large and
growing sources of data that require capabilities beyond those found in
traditional database engines. Ex: Extremely large volumes of data
Variety
– Big Data tools manage an extensive variety of data as well. This means
having the capability to manage structured data, very much like the
capabilities offered by a database engine. They go beyond supporting
structured data to working with non-structured data, such as
documents, spreadsheets, presentation decks and the like, as well as log
data coming from operating systems, database engines, application
frameworks, retail point-of-sale systems, mobile communications systems,
and more. Ex: Structured data, unstructured data, images, documents, etc.
Definition of the V's
Velocity
– Ability to gather, analyze and report on rapidly changing sets of data. In
some cases, this means having the capability to manage data that changes
so rapidly that the updated data cannot be saved to traditional disk drives
before it is changed again.
Simple Term: Quickly Moving Data
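A common way to cope with quickly moving data, as described above, is to keep only a fixed-size window of the most recent readings and compute statistics as the stream arrives. A minimal sketch (toy data, not tied to any particular streaming product):

```python
from collections import deque

# Keep only the last 3 readings; older values fall off automatically.
window = deque(maxlen=3)

def observe(value):
    """Ingest one stream value and return the current rolling average."""
    window.append(value)
    return sum(window) / len(window)

stream = [10, 20, 30, 40]           # hypothetical sensor readings
averages = [observe(v) for v in stream]
print(averages)  # [10.0, 15.0, 20.0, 30.0]
```

Because the window is bounded, memory use stays constant no matter how fast or long the stream runs.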
Veracity
– Veracity is a measure of the accuracy and trustworthiness of your data.
Veracity is a goal, one that the variety, volume, and velocity of big data
make harder to achieve.
Simple Term: Trust and integrity
Definition of the V's
• 2.5 quintillion bytes of data are generated every day!
– A quintillion is 10^18
• Data comes from many quarters.
– Social media sites
– Sensors
– Digital photos
– Business transactions
– Location-based data
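The daily figure above can be converted into a per-second rate to give a feel for the velocity involved:

```python
# 2.5 quintillion bytes generated per day, as stated above.
quintillion = 10**18
bytes_per_day = 2.5 * quintillion
bytes_per_second = bytes_per_day / 86_400  # seconds in a day

print(f"{bytes_per_second:.2e} bytes/second")  # roughly 2.89e13 bytes every second
```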
Lots of Data
Style of Data    | Source of Data | Industry Affected  | Function Affected
Large Volume     | Online         | Financial Services | Marketing
Unstructured     | Video          | Health Care        | Supply Chain
Continuous Flow  | Sensor         | Manufacturing      | Human Resources
Multiple Formats | Genomic        | Travel / Transport | Finance
• Aspects of the way in which users want to interact with their data…
– Totality: Users have an increased desire to process and analyze
all available data
– Exploration: Users apply analytic approaches where the schema
is defined in response to the nature of the query
– Frequency: Users have a desire to increase the rate of analysis
in order to generate more accurate and timely business
intelligence
– Dependency: Users need to balance investment in existing
technologies and skills with the adoption of new techniques
• So in a Nutshell, Big Data is about better analytics
The Need of Big Data
Term | Time Frame | Specific Meaning
Decision Support | 1970-1985 | Use of data analysis to support decision making
Executive Support | 1980-1990 | Focus on data analysis for decisions by senior executives
Online Analytical Processing (OLAP) | 1990-2000 | Software for analyzing multidimensional data tables
Business Intelligence | 1989-2005 | Tools to support data-driven decisions, with emphasis on reporting
Analytics | 2006-2010 | Focus on statistical and mathematical analysis for decisions
Big Data | 2010-present and next 10 years | Focus on very large, unstructured, fast-moving data
Terminology for Using and Analyzing Data
Your company can take advantage of the opportunities available in big data only
when you have processes and solutions that can handle all 4 V's.
Many of the previous attempts to address the need to gather information from
the rapidly growing, rapidly changing and broad types of data have been based
upon the use of special-purpose, complex and highly expensive computing
systems. Today's Big Data Solutions are built upon a different foundation.
Rather than trying to use a single very powerful, dedicated database system,
clusters of inexpensive, powerful, industry-standard (x86) systems are
harnessed to attack these very large problems.
The clustered approach uses commodity systems, storage, and memory. It also
adds the benefit of being more reliable. The failure of any single system in the
cluster will not stop processing.
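The reliability claim above rests on replication: each data block lives on more than one node, so any single failure leaves a surviving copy. A toy sketch of the idea (plain Python, not an actual cluster; node and block names are made up):

```python
import random

REPLICATION = 2  # each block is stored on 2 distinct nodes
nodes = {f"node{i}": set() for i in range(4)}
blocks = ["block-a", "block-b", "block-c", "block-d"]

# Place every block on REPLICATION distinct nodes.
for block in blocks:
    for node in random.sample(list(nodes), REPLICATION):
        nodes[node].add(block)

# Simulate one node failing: every block is still held by a surviving replica.
failed = "node0"
survivors = {n: held for n, held in nodes.items() if n != failed}
still_available = set().union(*survivors.values())
print(sorted(still_available))  # ['block-a', 'block-b', 'block-c', 'block-d']
```

Since the two replicas of each block are always on different nodes, losing any one node can never lose a block, which is why processing continues.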
Technology Shift
Gartner's Visualization on Big Data
• Problems:
– Although there is a massive spike in available data, the percentage of
the data that an enterprise can understand is on the decline
– The data that the enterprise is trying to understand is saturated with
both useful signals and lots of noise
Big Data – Conundrum
Benefits of Big Data
Big Data Platform Manifesto
High Level Architecture of Recognizer
[Architecture diagram: a Big Data platform on Hadoop exposing APIs to third
parties and the enterprise; data sources (OBD, IVR, PCA, Greybox, historical
data, and others) feed a recommendation engine, business intelligence, churn
prediction, and predictive analysis.]
Medium Level Architecture of Recognizer
[Architecture diagram]
What is Hadoop?
• Hadoop is a free software framework developed by the Apache
Software Foundation to support the distributed processing of data.
Hadoop was initially developed in the Java™ language, but today many
other languages can be used to script it. Hadoop serves as the core
platform for structuring Big Data and helps in performing data analytics.
• This distributed processing framework is designed to harness together
the power of many computers, each with its own processing and storage,
and to provide the capability to quickly process large, distributed data
sets.
Hadoop Distributed File System (HDFS)
• The Hadoop Distributed File System is designed to support large data sets made up of rapidly changing structured and non-structured data.
MapReduce
• MapReduce is a tool designed to allow analysts and developers to rapidly sift through massive amounts of data to examine only those data items that match a specified set of criteria.
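The MapReduce pattern just described can be illustrated with a single-process toy (plain Python, not Hadoop's Java API): map emits (key, value) pairs, a shuffle step groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "data moves fast"]  # hypothetical input
grouped = defaultdict(list)
for line in lines:                       # map + shuffle
    for word, count in map_phase(line):
        grouped[word].append(count)

print(reduce_phase(grouped))
# {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

In a real cluster, map tasks run in parallel on the nodes holding the data, and the shuffle moves intermediate pairs between machines; the logic is the same.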
Introduction to Hadoop
Hadoop components and related technologies include: Sqoop, Flume,
ZooKeeper, Oozie, Pig, Mahout, R connectors, Hive, MapReduce, HDFS,
HBase, MongoDB, Cloudera, Hortonworks, Kafka, YARN, Cassandra,
VMware Player, SQL, NoSQL, MetaStore, Scala, Query Compiler,
Hadoop Cluster, Execution Engine, and Ambari.
Hadoop Architecture & Components
Apache Hadoop Architecture
• Hadoop uses a master/slave architecture in which the NameNode is the master and the DataNodes are the slaves.
Apache Sqoop
• Apache Sqoop is a command-line tool for transferring data between relational databases and Hadoop. Sqoop, similar to other ETL tools, uses schema metadata to infer data types and ensure type-safe data handling when the data moves from the source to Hadoop.
Apache HBase
• Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS). HBase is designed to support high table-update rates and to scale out horizontally in distributed compute clusters. Its focus on scale enables it to support very large database tables.
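The column-oriented key/value layout HBase uses can be sketched in plain Python (this is a toy illustration of the data model, not HBase's API; the row key and column names are made up): each row key maps to "family:qualifier" columns, and rows need not share the same columns.

```python
table = {}  # row key -> {"family:qualifier": value}

def put(row_key, column, value):
    """Store one cell under a row key and column-family:qualifier name."""
    table.setdefault(row_key, {})[column] = value

def get(row_key, column):
    """Fetch one cell, or None if the row or column is absent."""
    return table.get(row_key, {}).get(column)

put("user#1001", "info:name", "Asha")
put("user#1001", "stats:logins", 42)

print(get("user#1001", "info:name"))   # Asha
print(get("user#9999", "info:name"))   # None (sparse rows are fine)
```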
Apache Zookeeper
• Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
Let Us See Hadoop Components
Apache Hive
• Hive is an open-source data warehousing system used to analyze large datasets stored in Hadoop files. It has three key functions: summarization of data, query, and analysis.
HDFS
• The Hadoop Distributed File System (HDFS) is a distributed file system that shares some of the features of other distributed file systems. It is used for storing and retrieving unstructured data.
MapReduce
• The MapReduce is a core component of Hadoop, and is responsible for processing jobs in distributed mode.
Pig
• Apache Pig is a platform for analyzing large datasets; it includes a high-level language for expressing data analysis programs. Pig is one of the components of the Hadoop ecosystem.
Let Us See Hadoop Components – Contd..
NoSQL (Not Only SQL database)
• NoSQL database, also called Not Only SQL, is an approach to data management and database design that's useful for very large sets of distributed data. NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data or data that's stored remotely on multiple virtual servers in the cloud.
MongoDB
• The MongoDB database management system is designed for running modern applications that rely on structured and unstructured data and that support rapidly changing data.
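The document-store idea behind MongoDB and other NoSQL systems can be sketched in plain Python (a toy, not MongoDB's real API; the sample documents are invented): documents in the same collection need not share a schema, and queries match on whatever fields a document happens to have.

```python
collection = [
    {"_id": 1, "name": "Ravi", "email": "ravi@example.com"},  # hypothetical data
    {"_id": 2, "name": "Meena", "tags": ["video", "audio"]},  # different fields: no fixed schema
]

def find(collection, **criteria):
    """Return documents whose fields match all the given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, name="Meena"))
# [{'_id': 2, 'name': 'Meena', 'tags': ['video', 'audio']}]
```

Contrast this with a relational table, where adding a `tags` field would require altering the schema for every row.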
Apache Cassandra
• Apache Cassandra is a free, open-source, distributed storage system for managing large amounts of structured data. It differs from traditional relational database management systems in some significant ways. Cassandra is designed to scale to a very large size across many commodity servers, with no single point of failure, and provides a simple schema-optional data model designed to allow maximum power and performance at scale.
Apache Hadoop YARN (Yet Another Resource Negotiator)
• Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology. YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software Foundation's open source distributed processing framework.
Let Us See Hadoop Components – Contd..
Oozie
• Oozie is a workflow scheduler system to manage Hadoop jobs. It is a server-based Workflow Engine specialized in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs. Oozie is implemented as a Java Web-Application that runs in a Java Servlet-Container.
Apache Ambari
• The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Flume
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.
Cloudera Impala
• Cloudera Impala is a query engine that runs on Apache Hadoop. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.
Let Us See Hadoop Components – Contd..
Apache Spark
• Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers. Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores such as Apache Hive.
Scala (Scalable Language)
• Scala (Scalable Language) is a software programming language that mixes object-oriented methods with functional programming capabilities that support a more concise style of programming than other general-purpose languages like Java, reducing the amount of code developers have to write.
Apache Kafka
• Apache Kafka is a distributed publish-subscribe messaging system designed to replace traditional message brokers. Originally created and developed by LinkedIn, then open sourced in 2011, Kafka is currently developed by the Apache Software Foundation to exploit new data infrastructures made possible by massively parallel commodity clusters.
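The publish-subscribe pattern Kafka implements can be sketched as an in-process toy (plain Python, not Kafka's client API; topic and message contents are made up): producers append to a named topic's ordered log, and every subscriber to that topic receives each message.

```python
from collections import defaultdict

topics = defaultdict(list)       # topic name -> ordered log of messages
subscribers = defaultdict(list)  # topic name -> consumer callbacks

def subscribe(topic, callback):
    """Register a consumer callback for a topic."""
    subscribers[topic].append(callback)

def publish(topic, message):
    """Append to the topic's log, then fan the message out to consumers."""
    topics[topic].append(message)
    for callback in subscribers[topic]:
        callback(message)

received = []
subscribe("clicks", received.append)
publish("clicks", {"user": 7, "page": "/home"})

print(received)  # [{'user': 7, 'page': '/home'}]
```

Real Kafka adds the parts that matter at scale, and that this toy omits: partitioned logs spread across brokers, durable storage, and consumer offsets so readers can replay history.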
Jaspersoft
• Jaspersoft provides the most flexible, cost-effective, and widely-deployed business intelligence software in the world, enabling better decision making through highly interactive Web-based reports, dashboards, and analysis.
Let Us See Hadoop Components – Contd..
Hadoop Cluster
• A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.
Distributed File System
• A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server.
Catastrophic Failure
• Catastrophic failure is a complete, sudden, often unexpected breakdown in a machine, electronic system, computer or network. Such a breakdown may occur as a result of a hardware event such as a disk drive crash, memory chip failure or surge on the power line. Catastrophic failure can also be caused by software conflicts or malware. Sometimes a single component in a critical location fails, resulting in downtime for the entire system.
Python
• Python is an interpreted, object-oriented programming language, similar to Perl, that has gained popularity because of its clear syntax and readability. Python is said to be relatively easy to learn and portable, meaning its statements can be interpreted in a number of operating systems.
Let Us See Hadoop Components – Contd..
The 'R' Environment
• R is an integrated suite of software facilities for data analysis
and graphics. Among other things it has:
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• a set of statistical methodologies and models,
• graphical facilities for data analysis and display either directly at the computer or on hardcopy, and
• a well developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.
An Introduction to 'R'
Thank you