Big Data Presentation


Transcript of Big Data Presentation

  • Big Data

    Dr. Manish Pokharel

    June 08, 2013


  • Contents

    Introduction

    Data Volume

    Few facts of Big Data

    Analysis

    Challenges in Big Data

    Handling Big Data

    Research Areas in Big Data


  • Preamble: The Evolution of Data

    1. In the past, the most difficult problem for businesses was how to store all the data.

    2. The challenge now is no longer to store large amounts of information, but to understand and analyze this data.

    3. By harnessing this data through sophisticated analytics, and by presenting the key metrics in an efficient, easily discernible fashion, we are afforded unprecedented understanding and insight into our data.

  • The Evolution of Data

    Unlocking the true value of this massive amount of information will require new systems for centralizing, aggregating, analyzing, and visualizing these enormous data sets. In particular, analyzing and understanding petabytes of structured and unstructured data poses the following unique challenges:

    1. Scalability

    2. Robustness

    3. Diversity

    4. Analytics

    5. Visualization of the Data

  • Introduction

    We are awash in a flood of data today.

    We have entered an era of Big Data.

    Handling more than 30 PB (30 × 1,125,899,906,842,624 bytes) in a day has become a common phenomenon at most international companies nowadays.

    In the USA alone, more than 848 PB of data was produced by the government.

    So we cannot run away from or ignore the presence of this huge data.

    We have to think in a different way to handle such huge data.

  • Data Volume


  • Continue

    As we know, we need data so that we can convert it into information and make good decisions based upon it.

    In a broad range of application areas, data is being collected at unprecedented scale.

    Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself.

  • Continue

    Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences.

    Big Data is an entity that is very large in size, must be interpreted very fast, and comes in various types of structures, and it is not easily processed by traditional database management tools.

    It refers to data sets whose size is beyond the capabilities of current database technology.

  • Continue

    Big Data is the massive data that comes from different sources and is characterized by three Vs: Volume, Velocity and Variety.

    Volume

    Velocity

    Variety

  • Continue

    Variety

    Up to 85 percent of an organization's data is unstructured (not numeric), but it still must be folded into quantitative analysis and decision making.

    Example: Text, video, audio and other unstructured data require different architectures and technologies for analysis.

    Velocity

    Initiatives such as the use of RFID tags and smart metering are driving an ever greater need to deal with the torrent of data in near real time. This, coupled with the need and drive to be more agile and deliver insight quicker, is putting tremendous pressure on organizations to build the necessary infrastructure and skill base to react quickly enough.

  • Continue

    Variability

    In addition to the speed at which data comes your way, the data flows can be highly variable, with daily, seasonal and event-triggered peak loads that can be challenging to manage.

    Complexity

    Difficulties dealing with data increase with the expanding universe of data sources and are compounded by the need to link, match and transform data across business entities and systems. Organizations need to understand relationships, such as complex hierarchies and data linkages, among all data. A minimal sketch of this linking step follows below.
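    To make the linking and matching concrete, here is a hedged sketch in Python; the record names, the suffix list and the 0.8 similarity cutoff are illustrative assumptions, not from the slides.

        # Pair records that refer to the same entity across two systems.
        from difflib import SequenceMatcher

        def normalize(name):
            # The "transform" step: strip case and common legal suffixes.
            name = name.lower().rstrip(".")
            for suffix in (" corporation", " corp", " llc"):
                name = name.removesuffix(suffix)
            return name

        def similar(a, b):
            return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

        crm     = ["Acme Corp.", "Globex Corporation"]
        billing = ["ACME Corporation", "Initech LLC"]
        links = [(a, b) for a in crm for b in billing if similar(a, b) > 0.8]
        print(links)  # [('Acme Corp.', 'ACME Corporation')]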

  • Continue

    Big Data can also be considered a phenomenon that describes large volumes of high-velocity, highly complex and variable data.

    Big Data technologies are a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis.

    There are three main characteristics of Big Data: the data itself, the analytics of the data, and the presentation of the results of the analytics.

    Big data is a relative term describing a situation where the volume, velocity and variety of data exceed an organization's storage or compute capacity for accurate and timely decision making.

  • Continue

    Big data has special characteristics, so it requires special technologies to capture, extract, integrate, analyze, and interpret it.

    Extracting meaning from big data is not impossible, but the fact is that it is not easy.

    Since big data is never at rest and its size is increasing very fast, an ultra-high-speed messaging technology is required to capture and continuously monitor streaming data in real time.

    The heterogeneous nature of incoming data, its increasing volume, the need for quick interpretation, and security are the prime challenges of big data.

  • Continue

    While the potential benefits of Big Data are real and significant, and some initial successes have already been achieved, there remain many technical challenges that must be addressed to fully realize this potential.

    The sheer size of the data, of course, is a major challenge, and is the one that is most easily recognized.

    Industry analysis companies like to point out that there are challenges not just in Volume, but also in Variety and Velocity, and that companies should not focus on just the first of these.

  • Continue

    By Variety, they usually mean heterogeneity of data types, representation, and semantic interpretation.

    By Velocity, they mean both the rate at which data arrives and the time in which it must be acted upon.

    While these three are important, this short list fails to include additional important requirements such as privacy and usability.

  • Source: IDC's Digital Universe Study, sponsored by EMC, December 2012

  • Few facts on Big Data!!

    From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020).

    By 2020, the digital universe will roughly double every two years.

    The investment in managing, containing, studying, and storing the bits in the digital universe will grow by only 40% between 2012 and 2020.

    As a result, the investment per gigabyte during that same period will drop from $2.00 to $0.20.
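    The last two figures are mutually consistent. If we assume the doubling-every-two-years rate already holds from 2012 (our assumption; the slide only states the rate "by 2020"), the universe grows by $2^{8/2} = 16\times$ over 2012-2020 while the investment grows by only $1.4\times$, so

    \[ \$2.00 \times \frac{1.4}{16} \approx \$0.18 \approx \$0.20 . \]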

  • Continue

    Between 2012 and 2020, emerging markets' share of the expanding digital universe will grow from 36% to 62%.

    A majority of the information in the digital universe, 68% in 2012, is created and consumed by consumers watching digital TV, interacting with social media, sending camera phone images and videos between devices and around the Internet, and so on.

    Yet enterprises have liability or responsibility for nearly 80% of the information in the digital universe.

  • Continue

    It is estimated that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed.

    By 2020, nearly 40% of the information in the digital universe will be "touched" by cloud computing providers, meaning that a byte will be stored or processed in a cloud somewhere on its journey from originator to disposal.

  • Continue

    The proportion of data in the digital universe that requires protection is growing faster than the digital universe itself, from less than a third in 2010 to more than 40% in 2020.

    The amount of information individuals create themselves (writing documents, taking pictures, downloading music, etc.) is far less than the amount of information being created about them in the digital universe.

  • Big Data Analysis

    The analysis of Big Data involves multiple distinct phases, as shown in the figure on the next slide, each of which introduces challenges.

    Many people unfortunately focus just on the analysis/modeling phase: while that phase is crucial, it is of little use without the other phases of the data analysis pipeline.

    Even in the analysis phase, which has received much attention, there are poorly understood complexities in the context of multi-tenanted clusters where several users' programs run concurrently.

    Many significant challenges extend beyond the analysis phase.

  • The Big Data Analysis Pipelines


  • Continue

    Data Acquisition and Recording

    Big Data does not arise out of a vacuum: it is recorded from some data-generating source.

    Much of this data is of no interest, and it can be filtered and compressed by orders of magnitude.

    One challenge is to define these filters in such a way that they do not discard useful information.

    The second challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured.
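    As a hedged sketch of such an acquisition-time filter, consider a sensor stream where we keep a reading only when it changes meaningfully; the threshold, heartbeat interval and sample values are illustrative assumptions, not from the slides.

        def deadband_filter(readings, threshold=0.5, heartbeat=1000):
            # Keep a reading if it moved more than `threshold` since the last
            # kept reading, but always keep every `heartbeat`-th sample so a
            # flat signal still shows it is alive: we compress aggressively
            # without discarding the useful information.
            kept, last = [], None
            for i, value in enumerate(readings):
                if last is None or abs(value - last) > threshold or i % heartbeat == 0:
                    kept.append((i, value))  # the index doubles as minimal metadata
                    last = value
            return kept

        samples = [20.0, 20.1, 20.05, 23.9, 24.0, 24.02, 19.5]
        print(deadband_filter(samples))
        # [(0, 20.0), (3, 23.9), (6, 19.5)] -- large reduction on smooth data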

  • Continue

    Information Extraction and Cleaning

    The information collected will not be in a format ready for analysis.

    We require an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis.
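    A minimal sketch of such an extraction step, assuming web-server-style log lines (the log format and field names are illustrative, not from the slides):

        import re

        LOG = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

        def extract(lines):
            # Pull structured records out of semi-structured log lines; lines
            # that do not match are skipped here, though a real pipeline would
            # route them to a cleaning/repair step instead.
            for line in lines:
                m = LOG.match(line)
                if m:
                    yield m.groupdict()

        raw = ["2013-06-08T10:00:01 ERROR disk full on node7",
               "not a log line at all"]
        print(list(extract(raw)))
        # [{'ts': '2013-06-08T10:00:01', 'level': 'ERROR', 'msg': 'disk full on node7'}]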

  • Continue

    Data Integration, Aggregation, and Representation

    Given the heterogeneity of the flood of data, it is not enough merely to record it and throw it into a repository.

    Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data.

    For effective large-scale analysis, all of this has to happen in a completely automated manner.

  • Continue

    Query Processing, Data Modeling, and Analysis

    Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis on small samples.

    Big Data is often noisy, dynamic, heterogeneous, inter-related and untrustworthy.

    Interpretation

    Having the ability to analyze Big Data is of limited value if users cannot understand the analysis.

    Ultimately, a decision-maker, provided with the results of an analysis, has to interpret those results.

  • Challenges in Big Data Analysis

    Heterogeneity and Incompleteness

    Scale

    Timeliness

    Privacy

    Human Collaboration


  • Managing Big Data

    The classic architecture's potential bottleneck is the database server when faced with peak workloads.

    A database server has restrictions in scalability and cost, which are two important goals of big data processing.

    Big Data architecture has the following three key aspects:

    Distributed file system

    Non-structural and semi-structured data storage

    Cloud platform

  • Handling Big Data

    Algorithms

    Clustering (a minimal sketch follows below)

    Association Learning

    Parameter Estimation

    Recommendation Engine

    Classification

    Similarity Matching

    Neural Network

    Genetic Algorithms, etc.
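    To ground the first algorithm on the list, here is a minimal clustering sketch using scikit-learn's KMeans; the toy two-dimensional data and the choice of two clusters are illustrative assumptions.

        import numpy as np
        from sklearn.cluster import KMeans

        # Four points forming two obvious groups.
        points = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])
        model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
        print(model.labels_)           # cluster id per point, e.g. [0 0 1 1]
        print(model.cluster_centers_)  # the two learned centroids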

  • Common Aspects

    Analytics / Machine Learning

    Learning insights from data

    Big Data

    Handling massive data volume

    Can be combined or used separately

  • Approach of Solving (Processing) Big Data!

    The existing database approach is not appropriate! So, we can use the following approaches:

    Map Reduce

    Cloud Computing

  • Big Data in E-Government System

    The government provides services to the citizen.

    Nowadays, most of the services are to be provided in real time or on the fly, such as disaster management, traffic control, crime control, etc.

    For that, the government needs to make quick decisions based upon various data from various sources in various formats.

    Government should strive to understand the Art of the Possible enabled by advances in techniques and technologies to manage and exploit Big Data.

    Hence, the government has to be smart enough to handle a huge volume of data, at high velocity, for a variety of data.

    Government has to explore the possibility of breaking the problems into smaller sub-problems. [i.e. Divide and Conquer]

    Assign these sub-problems to different workers and manage the entire problem to be solved. [Map Reduce]

  • Map Reduce

    Map Reduce is a framework, popularized by Google, that processes a set of individual problems in parallel.

    Map Reduce is a programming model that allows easy development of scalable parallel applications to process big data on large clusters of commodity machines.

    It is simple, but it provides good scalability and fault tolerance for massive data processing.

    The philosophy of Map Reduce is based upon Divide and Conquer: solve the big problem by decomposing it into small problems.

  • Continue

    Mapping and Reducing are the two main functions of Map Reduce.

    The Mapping function takes the problem as input, breaks it into many manageable small problems as (key, value) pairs, and assigns them to different computers.

    The function is executed on each computer in parallel, producing a list of [Key1, list(Value1)] pairs, whereas the Reducing function collects the processed small problems and combines them in a defined format.

    The Reducing function is executed at the end and produces [list(Value2)].

    The features of simplicity, flexibility, fault tolerance and high scalability have made Map Reduce very successful in managing big data.
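    A minimal single-machine simulation of these two functions, using the classic word-count example (the word-count task is a standard illustration, not taken from the slides):

        from collections import defaultdict

        # Map phase: each input record becomes a list of (key, value) pairs.
        def map_fn(document):
            return [(word, 1) for word in document.split()]

        # Shuffle: group values by key -- the [Key1, list(Value1)] of the slide.
        def shuffle(pairs):
            groups = defaultdict(list)
            for key, value in pairs:
                groups[key].append(value)
            return groups

        # Reduce phase: combine each key's values into the final result.
        def reduce_fn(key, values):
            return (key, sum(values))

        docs = ["big data is big", "data is data"]
        mapped = [pair for doc in docs for pair in map_fn(doc)]
        result = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())
        print(result)  # {'big': 2, 'data': 3, 'is': 2}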

  • Map Reduce in Connected Government


  • Map Reduce in Connected Government

    [Diagram: services offered by Ministries A through Z are Mapped into lists of services, shuffled and rearranged within a government cluster, and then Reduced into consolidated services for the connected government.]

  • Cloud Computing

    Cloud computing is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources, based upon service level agreements (SLAs) established through negotiation between the service provider and the service user.


  • Few Research Topics in Big Data

    Security in Big Data

    Data Acquisition in Big Data

    Data Visualization in Big Data

    Managing data effectively in Big Data

    Performance level in Big Data

  • Conclusion

    Big Data has become a phenomenon in the ICT world.

    We cannot run away from the presence of Big Data.

    There are still many research areas in Big Data.

  • Thank You Very Much!!!


  • Few more slides if you need!


  • Apache Hadoop

    Apache Hadoop was developed to overcome the previously mentioned deficiencies of prior storage and analytics architectures (e.g. SANs, sharding, parallel databases, etc.).

    The Apache Hadoop software library is a framework that allows for distributed processing of large datasets across clusters of computers on commodity hardware.

    This solution is designed for flexibility and scalability, with an architecture that scales to thousands of servers and petabytes of data.

    The library detects and handles failures at the application layer, delivering a high-availability service on commodity hardware.

  • Hadoop

    Hadoop is a platform which enables you to store and analyze large volumes of data.

    Hadoop is batch oriented (high throughput, but high latency) and strongly consistent (all readers see the same data).

    Hadoop is best utilized for:

    Large scale batch analytics

    Unstructured or semi-structured data

    Flat files

    Hadoop comprises two major subsystems (a minimal usage sketch follows below):

    HDFS (File System)

    Map Reduce
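    With Hadoop Streaming, the Map Reduce subsystem can run plain scripts that read stdin and write stdout. Below is a hedged sketch of a streaming-style word-count mapper and reducer in Python; the file layout and launch command are illustrative assumptions, not taken from the slides.

        import sys

        # mapper.py: emit "word<TAB>1" for every word on stdin.
        def mapper(stdin=sys.stdin, stdout=sys.stdout):
            for line in stdin:
                for word in line.split():
                    stdout.write(f"{word}\t1\n")

        # reducer.py: sum counts per word. Hadoop Streaming hands the reducer
        # the mapper output sorted by key, so equal words arrive consecutively.
        def reducer(stdin=sys.stdin, stdout=sys.stdout):
            current, total = None, 0
            for line in stdin:
                word, count = line.rsplit("\t", 1)
                if word != current:
                    if current is not None:
                        stdout.write(f"{current}\t{total}\n")
                    current, total = word, 0
                total += int(count)
            if current is not None:
                stdout.write(f"{current}\t{total}\n")

        # Launched roughly as (paths and jar name are illustrative):
        #   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
        #       -mapper mapper.py -reducer reducer.py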

  • Thank you very much!!!
