Big Data Presentation


Transcript of Big Data Presentation

  • Big Data

    Dr. Manish Pokharel

    June 08, 2013


  • Contents

    Introduction

    Data Volume

    Few facts of Big Data

    Analysis

    Challenges in Big Data

    Handling Big Data

    Research Areas in Big Data


  • Preamble: The Evolution of Data

    1. In the past, the most difficult problem for businesses was how to store all the data.

    2. The challenge now is no longer to store large amounts of information, but to understand and analyze this data.

    3. By harnessing this data through sophisticated analytics, and by presenting the key metrics in an efficient, easily discernible fashion, we are afforded unprecedented understanding and insight into our data.

  • The Evolution of Data

    Unlocking the true value of this massive amount of information will require new systems for centralizing, aggregating, analyzing, and visualizing these enormous data sets. In particular, analyzing and understanding petabytes of structured and unstructured data poses the following unique challenges:

    1. Scalability

    2. Robustness

    3. Diversity

    4. Analytics

    5. Visualization of the Data

  • Introduction

    We are awash in a flood of data today.

    We have entered an era of Big Data.

    Handling more than 30 PB (30 × 1,125,899,906,842,624 bytes) in a day has become a common phenomenon at most international companies nowadays.

    In the USA alone, more than 848 PB of data was produced by the government.

    So we cannot run away from or ignore the presence of this huge data.

    We have to think in a different way to handle such huge data.

  • Data Volume


  • Continue

    As we know, we need data so that we can convert it into information and make good decisions based upon it.

    In a broad range of application areas, data is being collected at unprecedented scale.

    Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself.

  • Continue

    Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences.

    Big Data is an entity that is very large in size, must be interpreted very fast, and comes in various types of structures, and it is not easily processed by traditional database management tools.

    It refers to data sets whose size is beyond the capabilities of current database technology.

  • Continue

    Big Data is the massive data that comes from different sources and is characterized by three Vs: Volume, Velocity and Variety.

    Volume

    Velocity

    Variety

  • Continue

    Variety

    Up to 85 percent of an organization's data is unstructured (not numeric), but it still must be folded into quantitative analysis and decision making.

    Example: Text, video, audio and other unstructured data require different architectures and technologies for analysis.

    Velocity

    Initiatives such as the use of RFID tags and smart metering are driving an ever greater need to deal with the torrent of data in near real time. This, coupled with the need and drive to be more agile and deliver insight quicker, is putting tremendous pressure on organizations to build the necessary infrastructure and skill base to react quickly enough.

  • Continue

    Variability

    In addition to the speed at which data comes your way, the data flows can be highly variable, with daily, seasonal and event-triggered peak loads that can be challenging to manage.

    Complexity

    Difficulties dealing with data increase with the expanding universe of data sources and are compounded by the need to link, match and transform data across business entities and systems. Organizations need to understand relationships, such as complex hierarchies and data linkages, among all data. A minimal sketch of this linking step follows below.
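    To make the linking and matching concrete, here is a hedged sketch in Python; the record names, the suffix list and the 0.8 similarity cutoff are illustrative assumptions, not from the slides.

        # Pair records that refer to the same entity across two systems.
        from difflib import SequenceMatcher

        def normalize(name):
            # The "transform" step: strip case and common legal suffixes.
            name = name.lower().rstrip(".")
            for suffix in (" corporation", " corp", " llc"):
                name = name.removesuffix(suffix)
            return name

        def similar(a, b):
            return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

        crm     = ["Acme Corp.", "Globex Corporation"]
        billing = ["ACME Corporation", "Initech LLC"]
        links = [(a, b) for a in crm for b in billing if similar(a, b) > 0.8]
        print(links)  # [('Acme Corp.', 'ACME Corporation')]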

  • Continue

    Big Data can also be considered a phenomenon that describes large volumes of high-velocity, highly complex and variable data.

    Big Data technologies are a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis.

    There are three main characteristics of Big Data: the data itself, the analytics of the data, and the presentation of the results of the analytics.

    Big data is a relative term describing a situation where the volume, velocity and variety of data exceed an organization's storage or compute capacity for accurate and timely decision making.

  • Continue

    Big data has special characteristics, so it requires special technologies to capture, extract, integrate, analyze, and interpret it.

    Extracting meaning from big data is not impossible, but the fact is that it is not easy.

    Since big data is never at rest and its size is increasing very fast, an ultra-high-speed messaging technology is required to capture and continuously monitor streaming data in real time.

    The heterogeneous nature of incoming data, its increasing volume, the need for quick interpretation, and security are the prime challenges of big data.

  • Continue

    While the potential benefits of Big Data are real and significant, and some initial successes have already been achieved, there remain many technical challenges that must be addressed to fully realize this potential.

    The sheer size of the data, of course, is a major challenge, and is the one that is most easily recognized.

    Industry analysis companies like to point out that there are challenges not just in Volume, but also in Variety and Velocity, and that companies should not focus on just the first of these.

  • Continue

    By Variety, they usually mean heterogeneity of data types, representation, and semantic interpretation.

    By Velocity, they mean both the rate at which data arrives and the time in which it must be acted upon.

    While these three are important, this short list fails to include additional important requirements such as privacy and usability.

  • Source: IDC's Digital Universe Study, sponsored by EMC, December 2012

  • Few facts on Big Data!!

    From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020).

    By 2020, the digital universe will roughly double every two years.

    The investment in managing, containing, studying, and storing the bits in the digital universe will grow by only 40% between 2012 and 2020.

    As a result, the investment per gigabyte during that same period will drop from $2.00 to $0.20.
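    The last two figures are mutually consistent. If we assume the doubling-every-two-years rate already holds from 2012 (our assumption; the slide only states the rate "by 2020"), the universe grows by $2^{8/2} = 16\times$ over 2012-2020 while the investment grows by only $1.4\times$, so

    \[ \$2.00 \times \frac{1.4}{16} \approx \$0.18 \approx \$0.20 . \]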

  • Continue

    Between 2012 and 2020, emerging markets' share of the expanding digital universe will grow from 36% to 62%.

    A majority of the information in the digital universe, 68% in 2012, is created and consumed by consumers watching digital TV, interacting with social media, sending camera phone images and videos between devices and around the Internet, and so on.

    Yet enterprises have liability or responsibility for nearly 80% of the information in the digital universe.

  • Continue

    It is estimated that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed.

    By 2020, nearly 40% of the information in the digital universe will be "touched" by cloud computing providers, meaning that a byte will be stored or processed in a cloud somewhere on its journey from originator to disposal.

  • Continue

    The proportion of data in the digital universe that requires protection is growing faster than the digital universe itself, from less than a third in 2010 to more than 40% in 2020.

    The amount of information individuals create themselves (writing documents, taking pictures, downloading music, etc.) is far less than the amount of information being created about them in the digital universe.

  • Big Data Analysis

    The analysis of Big Data involves multiple distinct phases, as shown in the figure on the next slide, each of which introduces challenges.

    Many people unfortunately focus just on the analysis/modeling phase: while that phase is crucial, it is of little use without the other phases of the data analysis pipeline.

    Even in the analysis phase, which has received much attention, there are poorly understood complexities in the context of multi-tenanted clusters where several users' programs run concurrently.

    Many significant challenges extend beyond the analysis phase.

  • The Big Data Analysis Pipelines


  • Continue

    Data Acquisition and Recording

    Big Data does not arise out of a vacuum: it is recorded from some data-generating source.

    Much of this data is of no interest, and it can be filtered and compressed by orders of magnitude.

    One challenge is to define these filters in such a way that they do not discard useful information.

    The second challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured.
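    As a hedged sketch of such an acquisition-time filter, consider a sensor stream where we keep a reading only when it changes meaningfully; the threshold, heartbeat interval and sample values are illustrative assumptions, not from the slides.

        def deadband_filter(readings, threshold=0.5, heartbeat=1000):
            # Keep a reading if it moved more than `threshold` since the last
            # kept reading, but always keep every `heartbeat`-th sample so a
            # flat signal still shows it is alive: we compress aggressively
            # without discarding the useful information.
            kept, last = [], None
            for i, value in enumerate(readings):
                if last is None or abs(value - last) > threshold or i % heartbeat == 0:
                    kept.append((i, value))  # the index doubles as minimal metadata
                    last = value
            return kept

        samples = [20.0, 20.1, 20.05, 23.9, 24.0, 24.02, 19.5]
        print(deadband_filter(samples))
        # [(0, 20.0), (3, 23.9), (6, 19.5)] -- large reduction on smooth data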

  • Continue

    Information Extraction and Cleaning

    The information collected will not be in a format ready for analysis.

    We require an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis.
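    A minimal sketch of such an extraction step, assuming web-server-style log lines (the log format and field names are illustrative, not from the slides):

        import re

        LOG = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

        def extract(lines):
            # Pull structured records out of semi-structured log lines; lines
            # that do not match are skipped here, though a real pipeline would
            # route them to a cleaning/repair step instead.
            for line in lines:
                m = LOG.match(line)
                if m:
                    yield m.groupdict()

        raw = ["2013-06-08T10:00:01 ERROR disk full on node7",
               "not a log line at all"]
        print(list(extract(raw)))
        # [{'ts': '2013-06-08T10:00:01', 'level': 'ERROR', 'msg': 'disk full on node7'}]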

  • Continue

    Data Integration, Aggregation, and Representation

    Given the heterogeneity of the flood of data, it is not enough merely to record it and throw it into a repository.

    Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data.

    For effective large-scale analysis, all of this has to happen in a completely automated manner.

  • Continue

    Query Processing, Data Modeling, and Analysis

    Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis on small samples.

    Big Data is often noisy, dynamic, heterogeneous, inter-related and untrustworthy.

    Interpretation

    Having the ability to analyze Big Data is of limited value if users cannot understand the analysis.

    Ultimately, a decision-maker, provided with the results of an analysis, has to interpret those results.

  • Challenges in Big Data Analysis

    Heterogeneity and Incompleteness

    Scale

    Timeliness

    Privacy

    Human Collaboration


  • Managing Big Data

    The classic architecture's potential bottleneck is the database server when faced with peak workloads.

    A database server has restrictions in scalability and cost, which are two important goals of big data processing.

    Big Data architecture has the following three key aspects:

    Distributed file system

    Non-structural and semi-structured data storage

    Cloud platform

  • Handling Big Data

    Algorithms

    Clustering (a minimal sketch follows below)

    Association Learning

    Parameter Estimation

    Recommendation Engine

    Classification

    Similarity Matching

    Neural Network

    Genetic Algorithms, etc.
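    To ground the first algorithm on the list, here is a minimal clustering sketch using scikit-learn's KMeans; the toy two-dimensional data and the choice of two clusters are illustrative assumptions.

        import numpy as np
        from sklearn.cluster import KMeans

        # Four points forming two obvious groups.
        points = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])
        model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
        print(model.labels_)           # cluster id per point, e.g. [0 0 1 1]
        print(model.cluster_centers_)  # the two learned centroids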

  • Common Aspects

    Analytics / Machine Learning

    Learning insights from data

    Big Data

    Handling massive data volume

    Can be combined or used separately

  • Approach of Solving (Processing) Big Data!

    The existing database approach is not appropriate! So, we can use the following approaches:

    Map Reduce

    Cloud Computing

  • Big Data in E-Government System

    The government provides services to the citizen.

    Nowadays, most of the services are to be provided in real time or on the fly, such as disaster management, traffic control, crime control, etc.

    For that, the government needs to make quick decisions based upon various data from various sources in various formats.

    Government should strive to understand the Art of the Possible enabled by advances in techniques and technologies to manage and exploit Big Data.

    Hence, the government has to be smart enough to handle a huge volume of data, at high velocity, for a variety of data.

    Government has to explore the possibility of breaking the problems into smaller sub-problems. [i.e. Divide and Conquer]

    Assign these sub-problems to different workers and manage the entire problem to be solved. [Map Reduce]

  • Map Reduce

    Map Reduce is a framework, popularized by Google, that processes a set of individual problems in parallel.

    Map Reduce is a programming model that allows easy development of scalable parallel applications to process big data on large clusters of commodity machines.

    It is simple, but it provides good scalability and fault tolerance for massive data processing.

    The philosophy of Map Reduce is based upon Divide and Conquer: solve the big problem by decomposing it into small problems.

  • Continue

    Mapping and Reducing are the two main functions of Map Reduce.

    The Mapping function takes the problem as input, breaks it into many manageable small problems as (key, value) pairs, and assigns them to different computers.

    The function is executed on each computer in parallel, producing a list of [Key1, list(Value1)] pairs, whereas the Reducing function collects the processed small problems and combines them in a defined format.

    The Reducing function is executed at the end and produces [list(Value2)].

    The features of simplicity, flexibility, fault tolerance and high scalability have made Map Reduce very successful in managing big data.
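    A minimal single-machine simulation of these two functions, using the classic word-count example (the word-count task is a standard illustration, not taken from the slides):

        from collections import defaultdict

        # Map phase: each input record becomes a list of (key, value) pairs.
        def map_fn(document):
            return [(word, 1) for word in document.split()]

        # Shuffle: group values by key -- the [Key1, list(Value1)] of the slide.
        def shuffle(pairs):
            groups = defaultdict(list)
            for key, value in pairs:
                groups[key].append(value)
            return groups

        # Reduce phase: combine each key's values into the final result.
        def reduce_fn(key, values):
            return (key, sum(values))

        docs = ["big data is big", "data is data"]
        mapped = [pair for doc in docs for pair in map_fn(doc)]
        result = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())
        print(result)  # {'big': 2, 'data': 3, 'is': 2}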

  • Map Reduce in Connected Government


  • Map Reduce in Connected Government

    [Diagram: services offered by Ministries A through Z are Mapped into lists of services, shuffled and rearranged within a government cluster, and then Reduced into consolidated services for the connected government.]

  • Cloud Computing

    Cloud computing is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources, based upon service level agreements (SLAs) established through negotiation between the service provider and the service user.


  • Few Research Topics in Big Data

    Security in Big Data

    Data Acquisition in Big Data

    Data Visualization in Big Data

    Managing data effectively in Big Data

    Performance level in Big Data

  • Conclusion

    Big Data has become a phenomenon in the ICT world.

    We cannot run away from the presence of Big Data.

    There are still many research areas in Big Data.

  • Thank You Very Much!!!


  • Few more slides if you need!


  • Apache Hadoop

    Apache Hadoop was developed to overcome the previously mentioned deficiencies of prior storage and analytics architectures (e.g. SANs, sharding, parallel databases, etc.).

    The Apache Hadoop software library is a framework that allows for distributed processing of large datasets across clusters of computers on commodity hardware.

    This solution is designed for flexibility and scalability, with an architecture that scales to thousands of servers and petabytes of data.

    The library detects and handles failures at the application layer, delivering a high-availability service on commodity hardware.

  • Hadoop

    Hadoop is a platform which enables you to store and analyze large volumes of data.

    Hadoop is batch oriented (high throughput, but high latency) and strongly consistent (all readers see the same data).

    Hadoop is best utilized for:

    Large scale batch analytics

    Unstructured or semi-structured data

    Flat files

    Hadoop comprises two major subsystems (a minimal usage sketch follows below):

    HDFS (File System)

    Map Reduce
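    With Hadoop Streaming, the Map Reduce subsystem can run plain scripts that read stdin and write stdout. Below is a hedged sketch of a streaming-style word-count mapper and reducer in Python; the file layout and launch command are illustrative assumptions, not taken from the slides.

        import sys

        # mapper.py: emit "word<TAB>1" for every word on stdin.
        def mapper(stdin=sys.stdin, stdout=sys.stdout):
            for line in stdin:
                for word in line.split():
                    stdout.write(f"{word}\t1\n")

        # reducer.py: sum counts per word. Hadoop Streaming hands the reducer
        # the mapper output sorted by key, so equal words arrive consecutively.
        def reducer(stdin=sys.stdin, stdout=sys.stdout):
            current, total = None, 0
            for line in stdin:
                word, count = line.rsplit("\t", 1)
                if word != current:
                    if current is not None:
                        stdout.write(f"{current}\t{total}\n")
                    current, total = word, 0
                total += int(count)
            if current is not None:
                stdout.write(f"{current}\t{total}\n")

        # Launched roughly as (paths and jar name are illustrative):
        #   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
        #       -mapper mapper.py -reducer reducer.py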

  • Thank you very much!!!
