Big Data in Clouds
Cloud based platforms and tools for Big Data
applications
Data Centric Computing
UvA workshop 3 November 2015
Yuri Demchenko
Data Centric Computing Workshop, UvA 2015 / Cloudscape for Big Data
Outline
• Big Data and Data Centric Computing
– Need for new paradigm?
• Cloud Computing as a platform of choice for Big
Data applications
• Big Data Stack and cloud advantages
• Cloud platforms for Big Data
• Questions and discussion
Acknowledgement
Slides 3 and 14-18 are courtesy of David Bernstein
Cloud Computing is the New Pervasive Ubiquitous
Intelligence & Communications Platform
[Figure: word cloud of domains touched by cloud: Education, Entertainment, Infrastructure, Relationships, Day-to-Day Life, Transportation, Commerce, Communications]
Slide courtesy of David Bernstein
Public Clouds are All Over the Place
Cloud Centers are All Over the Place, from datacentermap.com November 2015
http://www.datacentermap.com/cloud.html
Cloud Computing as a key IT technology
transformation factor
• Cloud Computing powers modern business and enables new technologies that require elastic, on-demand computing resources
– Mobile applications
– Big Data applications
– Internet of Things
– Changing the telecom market landscape
• Demand from other technologies accelerates Cloud Computing development
– Big Data and data-intensive applications, research infrastructures, mobile technologies, IoT
• Cloud Computing increases business agility and speeds up the development of new services and technologies
– However, it requires transformation of IT platforms/solutions and re-factoring/re-design of applications
Source: VMware Business white paper
[Figure: Cloud enables IT agility, which drives business agility and corporate performance]
Gartner's 2014 Hype Cycle
Source: Gartner (August 2014)
[ref] Gartner's 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business. http://www.gartner.com/newsroom/id/2819918
Cloud Computing:
Approaching a maturity stage
Cisco Cloud Index 2013-2018: Cloud Service
Delivery Models Market Share
Cloud workloads processed by different cloud service delivery models by 2018:
• Software-as-a-Service (SaaS): 59 percent, up from 41 percent in 2013
• Infrastructure-as-a-Service (IaaS): 28 percent, down from 44 percent in 2013
• Platform-as-a-Service (PaaS): 13 percent, down from 15 percent in 2013
[ref] Cisco Global Cloud Index: Forecast and Methodology, 2013–2018 [online]
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
Data Center Virtualization and Cloud Computing Growth by 2018
• 78% of workloads will be processed by cloud data centers
• 22% will be processed by traditional data centers
Big Data is Driving Cloud Usage
Data is becoming the new raw material of business: an economic input almost on a par with capital and labour. "Every day I wake up and ask, 'how can I flow data better, manage data better, analyse data better?'" says Rollin Ford, the CIO of Wal-Mart.
Source: Data, Data Everywhere, The Economist, February 25, 2010

"There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing."
Eric Schmidt, Google CEO, Techonomy Conference, August 4, 2010
NIST Big Data Working Group (NBD-WG) and
ISO/IEC JTC1 Study Group on Big Data (SGBD)
• NIST Big Data Working Group (NBD-WG) is leading the development of the Big Data Technology Roadmap - http://bigdatawg.nist.gov/home.php
– Builds on the experience of developing the Cloud Computing standards, which are fully accepted by industry
• Set of documents delivered September 2014 (to be published as NIST Special Publication
documents) - http://bigdatawg.nist.gov/V1_output_docs.php
Volume 1: NIST Big Data Definitions
Volume 2: NIST Big Data Taxonomies
Volume 3: NIST Big Data Use Case & Requirements
Volume 4: NIST Big Data Security and Privacy Requirements
Volume 5: NIST Big Data Architectures White Paper Survey
Volume 6: NIST Big Data Reference Architecture
Volume 7: NIST Big Data Technology Roadmap
• NBD-WG defined 3 main components of the new technology:
– Big Data Paradigm
– Big Data Science, and Data Scientist as a new profession
– Big Data Architecture
The Big Data Paradigm consists of the distribution of data systems across horizontally-coupled independent resources to achieve the scalability needed for the efficient processing of extensive datasets.
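This horizontal-scaling idea can be sketched in a few lines of Python: partition the dataset across independent workers, process each partition in parallel, and merge the partial results. The sharding scheme and worker pool below are illustrative stand-ins for a real cluster framework.

```python
from concurrent.futures import ThreadPoolExecutor

def shard(data, n):
    """Split a dataset into n independent partitions (horizontal coupling)."""
    return [data[i::n] for i in range(n)]

def partial_sum(partition):
    # Each partition is processed by an independent resource (scale-out).
    return sum(partition)

data = list(range(1000))
partitions = shard(data, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))
total = sum(partials)  # merge step combines the partial results
```

Real frameworks such as Hadoop and Spark apply the same shard-process-merge pattern, with the scheduler placing work where the data lives.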
NIST Big Data Reference Architecture
Main components of the Big Data ecosystem:
• Data Provider
• Big Data Applications Provider
• Big Data Framework Provider
• Data Consumer
• Service Orchestrator
Big Data Lifecycle and Applications Provider activities:
• Collection
• Preparation
• Analysis and Analytics
• Visualization
• Access
The Big Data Ecosystem includes all components involved in Big Data production, processing, delivery, and consumption
[Figure: NIST Big Data Reference Architecture. A System Orchestrator coordinates the flow from Data Provider through the Big Data Application Provider (Collection, Preparation, Analytics, Visualization, Access) to the Data Consumer. The Big Data Framework Provider supplies Processing Frameworks (analytic tools, etc.), Platforms (databases, etc.), and Infrastructures of physical and virtual resources (networking, computing, etc.), each horizontally scalable (VM clusters) and vertically scalable. Security & Privacy and Management are cross-cutting concerns; the horizontal axis is the Information Value Chain, the vertical axis the IT Value Chain. Key: service use, Big Data information flow, SW tools and algorithms transfer.]
[ref] Volume 6: NIST Big Data Reference Architecture. http://bigdatawg.nist.gov/V1_output_docs.php
Big Data Architecture Framework (BDAF)
(1) Data Models, Structures, Types
– Data formats, non/relational data, file systems, etc.
(2) Big Data Management
– Big Data Lifecycle (Management) Model
• Big Data transformation/staging
– Provenance, Curation, Archiving
(3) Big Data Analytics and Tools
– Big Data Applications
• Target use, presentation, visualisation
(4) Big Data Infrastructure (BDI)
– Storage, Compute, (High Performance Computing,) Network
– Sensor networks, target/actionable devices
– Big Data operational support
(5) Big Data Security
– Data security at-rest and in-motion, trusted processing environments
Big Data Infrastructure and Analytics Tools
Big Data Infrastructure
• Heterogeneous multi-provider inter-cloud infrastructure
• Data management infrastructure
• Collaborative environment
• Advanced high-performance (programmable) network
• Security infrastructure
• Federated Access and Delivery Infrastructure (FADI)
Big Data Analytics Infrastructure/Tools
• High Performance Computer Clusters (HPCC)
• Big Data storage and databases: SQL and NoSQL
• Analytics/processing: real-time, interactive, batch, streaming
• Big Data analytics tools and applications
Data Lifecycle/Transformation Model
• The data model changes along the data lifecycle or evolution
• Data provenance is the discipline of tracking all data transformations along the lifecycle
• Identifying and linking data
– Persistent data/object identifiers (PID/OID)
– Traceability vs Opacity
– Referral integrity
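A minimal provenance sketch in Python, assuming a toy PID scheme (UUIDs) rather than a real persistent-identifier service: every transformation issues a new PID and appends a record linking input and output, which is what makes the lifecycle traceable.

```python
import uuid

def new_pid():
    # Illustrative persistent identifier; a real system would use a PID service
    return str(uuid.uuid4())

def transform(record, func, step_name):
    """Apply a transformation and append a provenance entry linking old and new PIDs."""
    new = {"pid": new_pid(), "data": func(record["data"])}
    new["provenance"] = record["provenance"] + [
        {"step": step_name, "input_pid": record["pid"], "output_pid": new["pid"]}
    ]
    return new

raw = {"pid": new_pid(), "data": [3, 1, 2], "provenance": []}
cleaned = transform(raw, sorted, "cleanse")
enriched = transform(cleaned, lambda d: [x * 10 for x in d], "enrich")
```

Referral integrity here means every provenance entry's input_pid resolves to an earlier object in the lifecycle.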
Multiple data models and structures
• Data Variety and Variability
• Semantic Interoperability
Data (inter)linking
• PID/OID
• Identification
• Traceability vs Opacity
• Privacy, Opacity
[Figure: data lifecycle stages from Data Source through Data Collection & Registration, Data Filter/Enrich/Classification, Data Analytics/Modeling/Prediction, to Data Delivery/Visualisation and consumer applications, over a Big Data capable Data Storage layer; the data model (1)-(4) changes at each stage, with a feedback loop for data repurposing, analytics re-factoring, and secondary processing]
Big Data Cloud Based Service
[Figure: data-intensive computing system (e.g. Hadoop): an external client sends a query to a parallel query server; parallel compute servers and parallel data servers operate over a parallel file system (e.g., GFS, HDFS) holding the source dataset and derived datasets (d1, d2, d3), ingesting from external data sources and returning a small result]
Examples: search, scene-completion service, log processing
Characteristics: massive data and computation in the cloud; small queries and results
Big Data Stack
Pipeline stages: Data Ingestion → Message/Data Queue → Streaming Analytics / Real-Time Decisions → Data Export → Data Store & Historical Analytics

Data Ingestion
• Use the direct ingestion API to capture the entire stream at wire speed
• Ingestion will transform, normalize, distribute and integrate data into one or more of the Analytic/Decision Engines of choice
Message/Data Queue
• Hook into an existing queue to get a copy/subset of the data; the queue handles partitioning, replication, and ordering of data, and can manage backpressure from slower downstream components
Streaming Analytics
• NON QUERY; TIME WINDOW MODEL
• Counting, Statistics; Rules; Triggers
Real-Time Decisions
• CONTINUOUS QUERY; OLTP MODEL
• Programmatic Request-Response; Stored Procedures; Export for Pipelining
• Use one or more Analytic/Decision engines to accomplish a specific task on the Fast Data window
• Real-time decisions and/or results can be fed back "upstream" to influence the "next step"
Data Export
• Export will transform, normalize, distribute and integrate data into one or more of the Data Warehouse/Storage Engines of choice
Data Store & Historical Analytics
• PERIODIC QUERY; OLAP / DATA WAREHOUSE MODEL
• BI, Reports; Dashboards; User Interactive
• The OLAP Engines of choice append to historical data and store it for later, further Big Data analysis on the entire data set
Big Data Stack Components and their Interaction
• Ingestion: provides the interface to the streaming data sources, including data transformation and normalisation
– Direct ingestion using the data-generating API at "wire speed"
– Message queue: effective for heterogeneous data sources with different speed/velocity
• Streaming analytics and real-time decisions
– Streaming analytics uses a non-query, time-window model
– Real-time analytics uses a continuous-query OLTP model
• Data export, data store and historical analytics
– Transform, normalize, distribute and integrate data into one or more Data Warehouses or storage platforms
– Uses a periodic-query OLAP / data warehouse model
– Data are stored for future Big Data analysis
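The non-query, time-window model of streaming analytics can be illustrated with a tumbling-window counter (a simplified Python sketch; real engines such as Storm or Spark Streaming manage such windows over unbounded streams):

```python
def tumbling_counts(events, window):
    """Count events per fixed (tumbling) time window.

    events: iterable of (timestamp, payload) pairs; window: window length.
    Returns a dict mapping window start time to event count.
    """
    counts = {}
    for ts, _payload in events:
        start = (ts // window) * window  # the window this event falls into
        counts[start] = counts.get(start, 0) + 1
    return counts

events = [(1, "a"), (4, "b"), (11, "c"), (12, "d"), (25, "e")]
per_window = tumbling_counts(events, window=10)
```

No query is ever issued: statistics are emitted per window as data flows past, in contrast with the continuous-query OLTP model used for real-time decisions.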
Important Big Data Technologies
Technologies by Big Data Stack component:
• Data Ingestion and Message/Data Queue: Data Factory, Kafka, Flume, Scribe, Samza, Dataflow, Kinesis, Event Hubs
• Streaming Analytics: Stream Analytics, Storm, Cascading, S4, Crunch, Spark Streaming
• Real-Time Decisions and Data Store & Historical Analytics: HDInsight, DocumentDB, Hadoop, Hive, Spark SQL, Druid, BigQuery, DynamoDB, Vertica, MongoDB, EMR, CouchDB, VoltDB
• Data Export: Sqoop, HDFS

By provider:
• Microsoft Azure: Event Hubs, Data Factory, Stream Analytics, HDInsight, DocumentDB
• Google GCE: Dataflow, BigQuery
• Amazon AWS: Kinesis, EMR, DynamoDB
• Open Source: Samza, Kafka, Flume, Scribe, Storm, Cascading, S4, Crunch, Spark, Hadoop, Hive, Druid, MongoDB, CouchDB, VoltDB
• Proprietary: Vertica, HortonWorks
The Big Data Platform is Cloud!
Requirements:
• Interesting applications are data hungry
• The data grows over time
• The data is immobile: compute comes to the data
• Big Data clusters are the new libraries
Solution:
• Provide high-performance execution over Big Data repositories
– Many spindles, many CPUs; parallel processing
• Enable multiple services to access a repository concurrently
• Enable low-latency scaling of services
• Enable each service to leverage its own software stack
– IaaS, file-system protections where needed
• Enable rapid resource scaling for power/demand
– Scaling-aware storage
• Enable slow resource scaling for growth
Cloud Platform Benefits for Big Data
• Segregated networks isolate traffic
– Cloud construction provides separate networks for each type of traffic
– Big Data applications benefit from the lowest possible latencies for node-to-node synchronization, dynamic cluster resizing, and other scale-out operations
• Cloud deployment on virtual machines, containers, and bare metal
– For a traditional, highly load-variable problem, one might consider using VMs as the deployment vehicle
– Some clouds offer container-based isolation instead of VMs
• Cloud tools for large-scale application deployment, supported by major IDEs
– Zero-touch service provisioning
Cloud-powered Services Development Lifecycle
• Easily creates a test environment close to the real one
• Powered by cloud deployment automation tools
– Enable configuration management, orchestration, and deployment automation
• Continuous development - test - integration
– CloudFormation Template, Configuration Template, Bootstrap Template
• Can be used with Puppet and Chef, two configuration and deployment management systems for clouds
[ref] "Building Powerful Web Applications in the AWS Cloud" by Louis Columbus, http://softwarestrategiesblog.com/2011/03/10/building-powerful-web-applications-in-the-aws-cloud/
[Figure: environment progression Dev → Test/QA → Staging → Prod, driven by Amazon CloudFormation + Chef (or Puppet) for configuration management and orchestration: all in one instance (Dev); 2-tier app with test DB (Test/QA); 2-tier app with production data (Staging); 2-tier app with production data, Multi-AZ and HA (Prod)]
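A template of the kind CloudFormation consumes is just a declarative JSON document. The sketch below builds a skeleton for a two-tier app in Python; the resource properties (instance type, AMI id, DB sizing) are placeholders, not a tested production template.

```python
import json

# Illustrative skeleton of an AWS CloudFormation template for a two-tier app.
# Property values are placeholders and would need real AMI ids and sizing.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Two-tier app: web instance plus database (sketch)",
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {"InstanceType": "t2.micro", "ImageId": "ami-EXAMPLE"},
        },
        "AppDatabase": {
            "Type": "AWS::RDS::DBInstance",
            "Properties": {
                "Engine": "mysql",
                "DBInstanceClass": "db.t2.micro",
                "AllocatedStorage": "20",
            },
        },
    },
}
doc = json.dumps(template, indent=2)  # document handed to CloudFormation
```

The same template can be parameterised per environment (Dev, Test/QA, Staging, Prod), which is what makes the progression in the figure repeatable.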
Data Protection and Privacy in Cloud
Data protection and privacy will remain the main factors restricting cloud growth and will stimulate the following research and developments:
• National and international data protection regulation and legislation, supported by corresponding certification and assessment procedures, services and bodies
• New advanced privacy enhancement technologies (PET) should protect user privacy for stored data and activity in the cloud
– Emerging Big Data technologies will require new PET preventing user (re-)identification based on correlation between multiple sources of data
• The shared security model currently used in the cloud will increase the level of customer-controlled security (and privacy) compliance monitoring
• Cloud access control models will continue adopting federated access control and identity management models
– While based on initially verified user identity, the access control methods should not reveal user identity to cloud service providers and third-party services
• Data encryption and key management
– New encrypted data processing models such as homomorphic encryption
– New key management models
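Homomorphic encryption lets the cloud compute on ciphertexts without seeing the plaintext. As a toy illustration (textbook RSA with tiny primes, never safe in practice), RSA is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the plaintexts.

```python
# Toy parameters for illustration only; real RSA uses 2048+ bit moduli.
p, q = 61, 53
n = p * q                 # modulus (3233)
phi = (p - 1) * (q - 1)   # 3120
e = 17                    # public exponent, coprime with phi
d = pow(e, -1, phi)       # private exponent (modular inverse, Python 3.8+)

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

m1, m2 = 7, 9
# The "cloud" multiplies ciphertexts without ever decrypting them:
c_product = (enc(m1) * enc(m2)) % n
product = dec(c_product)  # equals m1 * m2 = 63
```

Fully homomorphic schemes extend this idea to arbitrary computation, at a much higher cost.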
Trustworthiness of Cloud Computing Platform and Trust Management
A trusted cloud platform includes at least two aspects: platform trustworthiness and trust management
• Trustworthiness is a property of the cloud platform that depends on multiple factors, including design principles, implementation, technical mechanisms and operational methods
– Trustworthiness by design is the key to creating a trusted cloud platform
– Cloud providers pay strong attention to all aspects of their platform to create a trusted environment for user services and data
• Trust management includes both technical mechanisms and solutions for establishing and managing trust relations between the provider's and the customer's or user's administrative and security domains
– Technical trust is rooted in the trusted platform and typically requires manual setup
– Cloud will stimulate further development of dynamic trust establishment and management mechanisms that adapt to the dynamic, on-demand character of cloud services
– The Trusted Computing Platform (TCP) architecture by the Trusted Computing Group (TCG) provides a technical basis for building a distributed trusted environment
• TCP is based on the hardware Trusted Platform Module (TPM), which allows tamper-resistant (hardware-based) secret key storage and operates as a Root of Trust
• TCG website: http://www.trustedcomputinggroup.org/
• Trust management mechanisms are typically hidden from users but require precise design and strict operation from service and application providers
Big Data platforms and Providers
• AWS Big Data services
• Microsoft Azure Analytics Platform System and
HDInsight
• LexisNexis HPCC Systems
Cloud HPC and Big Data Platforms
• HPC on cloud platforms
– Special HPC and GPU VM instances as well as Hadoop/HPC clusters offered by CSPs
• Amazon Big Data services
– Amazon Elastic MapReduce
• Microsoft Analytics Platform System (APS)
– Microsoft HDInsight
• LexisNexis HPCC Systems
– Combining an HPC cluster platform with optimized data processing languages
• Streaming analytics/processing tools
– Apache Kafka (http://kafka.apache.org/)
• High-throughput distributed publish-subscribe messaging system used for real-time stream processing
– Apache Storm (https://storm.incubator.apache.org/)
• Distributed realtime computation system for unbounded streams of data
– Apache Spark (https://spark.apache.org/)
• A general-purpose fast cluster computing system that aims to combine batch and real-time streaming data processing
• Spark provides primitives for in-memory cluster computing that make it well suited to machine learning algorithms
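The core idea behind Kafka-style systems, an append-only topic log with independently tracked consumer offsets, can be sketched in memory (illustrative only; a real broker adds partitioning, replication and persistence):

```python
from collections import defaultdict

class MiniBroker:
    """In-memory sketch of the publish/subscribe log model implemented by
    systems like Kafka: producers append to a topic log; each consumer keeps
    its own offset and reads the log independently of other consumers."""

    def __init__(self):
        self.logs = defaultdict(list)     # topic -> append-only message log
        self.offsets = defaultdict(int)   # (topic, consumer) -> next offset

    def publish(self, topic, message):
        self.logs[topic].append(message)

    def poll(self, topic, consumer):
        """Return all messages this consumer has not yet seen."""
        log = self.logs[topic]
        start = self.offsets[(topic, consumer)]
        self.offsets[(topic, consumer)] = len(log)
        return log[start:]

broker = MiniBroker()
broker.publish("clicks", "event-1")
broker.publish("clicks", "event-2")
first = broker.poll("clicks", "consumer-a")   # both events
second = broker.poll("clicks", "consumer-a")  # nothing new
late = broker.poll("clicks", "consumer-b")    # independent offset: both events
```

Because offsets are per consumer, slow consumers simply fall behind in the log instead of blocking producers, which is how such queues absorb backpressure.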
Cloud and High Performance Computing (HPC)
• Current trend: increasing demand for HPC for Big Data analytics and business intelligence
– Data analytics platforms and services are provided by the majority of big Cloud Service Providers as designated data analytics, Hadoop and HPC clusters
• Problem: Cloud Computing is not designed for HPC but rather for scalability and parallelization
– Tightly coupled HPC workloads do not scale well in a VM/cloud environment
• Restricting factors for HPC on cloud platforms
– High cost of data access/download from the cloud
– Not all clouds offer high-end VM configurations optimized for compute
• Computational challenges of Big Data can be addressed by aggregating compute power and parallelising tasks
– 24 hours on 1 machine can be 1 hour on 24 machines
– These 24 machines require a supporting infrastructure that handles orchestration, division of jobs into tasks, and distribution of tasks across a computer cluster
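The supporting infrastructure named above, splitting a job into tasks and distributing them over workers, can be sketched with a shared work queue (threads stand in for cluster nodes; illustrative only):

```python
import queue
import threading

def run_cluster(tasks, n_workers):
    """Sketch of job orchestration: a coordinator enqueues tasks and a pool
    of workers pulls them from a shared queue (a stand-in for a cluster)."""
    work = queue.Queue()
    for t in tasks:
        work.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return  # no tasks left for this worker
            r = task()  # execute the task
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# 24 independent one-"hour" tasks spread over 24 workers finish in ~1 "hour"
results = run_cluster([lambda i=i: i * i for i in range(24)], n_workers=24)
```

The speedup holds only while tasks are independent; tightly coupled HPC workloads break this assumption, which is the scaling problem noted above.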
Amazon EC2 HPC-oriented AMI instances
• Amazon EC2 offers dedicated computation-optimized VMs and "cluster instances" that deliver better performance to HPC users
– Two Cluster Compute instances provide a very large amount of CPU coupled with increased network performance (10GigE). Instances come in two sizes:
• Nehalem-based "Quadruple Extra Large Instance" (eight cores/node, 23GB of RAM, 1.7TB of local storage)
• Sandy Bridge-based "Eight Extra Large Instance" (16 cores/node, 60.5GB of RAM, 3.4TB of local storage)
– Besides the Amazon cluster instances, users may select from one of several preconfigured clusters
– Additionally, two other specialized instances for HPC clusters:
• Cluster GPU instance that provides two NVidia Tesla Fermi M2050 GPUs with proportionally high CPU and 10GigE network performance
• High-I/O instance that provides two SSD-based volumes, each with 1024GB of storage
– Pricing can vary depending upon on-demand, scheduled, or spot purchase of resources
• Quadruple Extra Large Instance is US$ 1.3/hour (US$ 0.16/core·hour)
• Eight Extra Large Instance is US$ 2.4/hour (US$ 0.15/core·hour)
• Cluster GPU instance is US$ 2.1/hour, and the High I/O instance is US$ 3.1/hour
• For example:
– A small usage case (80 cores, 4GB of RAM per core, and basic storage of 500GB) would cost US$ 24.00/hour (10 Eight Extra Large Instances)
– A larger usage case (256 cores, 4GB of RAM per core, and 1TB of fast global storage) would cost US$ 38.4/hour (16 Eight Extra Large Instances)
– Once created, the instances must be provisioned and configured to work as a cluster by the user
• Total cost depends on compute time, total data storage and data transfer cost
– Amazon does not charge for data transferred into EC2 but has a varying rate schedule for transfer out of the cloud; additionally, there are EC2 storage costs
Example: Univa HPC with RightScale and Amazon AWS
• Univa offers "one-click" HPC computing with their partner RightScale
• First stage/offering: create an internal cloud infrastructure system that provides "raw" virtual machines using VMware, KVM, Xen, or cloud platforms such as OpenStack or VMware vSphere
– The UniCloud product uses an IaaS layer to create and provision a raw machine on the fly
– Once the machine is provisioned (on one of the IaaS platforms/providers), UniCloud does additional software provisioning and configuration to turn the raw instances into nodes that can then be part of an HPC cluster
• Second stage/offering: use an external cloud infrastructure system, such as Amazon EC2 and AWS, to provide the "raw" virtual machines for HPC use
– Additional provisioning and configuration management is done by UniCloud (PaaS), which turns the raw instances into HPC cluster nodes
– This process is automatic, and adding and removing nodes from the HPC cluster can be done on demand
• To launch an external cloud on Amazon EC2, just go to the Amazon Marketplace, choose HPC from the list, and launch your cluster
– Once the cloud is built, you can then install your application
• Univa's UniCloud package provides a "ready-to-run" HPC cloud with prebuilt "kits" that include Univa Grid Engine
– UniCloud can burst from an internal cloud to an external public cloud or from one public cloud to another
• Univa costs are additive to the AWS instance charges and range from US$ 0.02 up to a maximum of US$ 0.08/core·hour
– Univa proposed a Medium AWS instance for the small usage case (80 cores, 4GB of RAM per core, basic storage of 500GB) that would cost US$ 14.40/hour [(AWS @ US$ 0.16/hour + Univa @ US$ 0.02/hour) x 80]
– The larger usage case (256 cores, 4GB of RAM per core, 1TB of fast storage) would cost US$ 58.88/hour
AWS Cloud Big Data Services
The AWS cloud offers the following services and resources for Big Data processing:
• EC2 Virtual Machine (VM) instances for HPC, optimized for computing (with multiple cores) and with extended storage for large data processing
• Amazon Elastic MapReduce (EMR) provides the Hadoop framework on Amazon EC2 and offers a wide range of Hadoop-related tools
• Amazon Kinesis is a managed service for real-time processing of streaming big data (with throughput scaling from megabytes to gigabytes of data per second and from hundreds of thousands of different sources)
• Amazon DynamoDB is a highly scalable NoSQL data store with sub-millisecond response latency
• Amazon Redshift is a fully managed, petabyte-scale data warehouse in the cloud at a cost of less than $1000 per terabyte per year. It uses columnar data storage and can parallelise queries
• Amazon RDS is a scalable relational database service
• Amazon Glacier is archival storage for long-term data storage at lower cost than standard Amazon S3 object storage
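Why columnar storage (as used by Redshift) speeds up warehouse queries can be seen in a small sketch: an aggregate needs to scan only one attribute array instead of every full row. Field names and values below are illustrative.

```python
# Row layout: each record stored whole, as an OLTP system would.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 19.9},
]

# Columnar layout: one contiguous array per attribute.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# A warehouse-style aggregate touches a single column, not every full row.
total = sum(columns["amount"])
```

At petabyte scale the difference is I/O, not CPU: the engine reads only the columns a query references, and same-typed columns also compress far better than mixed rows.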
Microsoft Azure Analytics Platform System (APS)
• The Microsoft Azure cloud provides general IaaS services and rich Platform-as-a-Service (PaaS) services
– Similar to AWS, Microsoft Azure offers special VM instances with advanced computational and memory capabilities
• The Analytics Platform System (APS) combines the Microsoft SQL Server based Parallel Data Warehouse (PDW) platform with HDInsight, an Apache Hadoop based scalable data analytics platform
– APS includes the PolyBase data querying technology to simplify integration of PDW SQL data and data from Hadoop
• The HDInsight Hadoop-based platform has been co-developed with Hortonworks
– HDInsight provides comprehensive integration and management functionality for multi-workload data processing on the Hadoop platform, including batch, stream, and in-memory processing methods
HDInsight: Microsoft's Big Data Solution
• HDInsight can run both on the Azure cloud and on Windows Server (on premises)
– Data exchange via PolyBase
• Compatible with and supports all products from the Apache Hadoop stack
• Supports all stages of Big Data processing
• PolyBase is a new technology that allows integrating the Microsoft SQL Server based Parallel Data Warehouse (PDW) with Hadoop
• Azure Blob Storage is used to persistently store data
– Data are streamed to Hadoop/HDFS for processing and pushed back to Azure Blob Storage
Data Centric Computing
Workshop UvA 2015Cloudscape for Big Data 31
HDInsight/Hadoop Ecosystem
[Figure: HDInsight/Hadoop ecosystem built on Distributed Storage (HDFS) and Distributed Processing (MapReduce), with Query (Hive) accessed via ODBC. Legend: red = core Hadoop; blue = data processing; purple = Microsoft integration points and enhancements; orange = data movement; green = packages]
[ref] Microsoft Azure Training Kit. https://github.com/Azure-Readiness/MicrosoftAzureTrainingKit
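The Distributed Processing (MapReduce) layer in the ecosystem follows a map → shuffle → reduce pattern, shown here as an in-memory Python stand-in for Hadoop's distributed implementation:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Emit (key, value) pairs; here, one pair per word occurrence.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group values by key; in Hadoop this moves data between nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data in clouds", "big clouds"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
```

In a real cluster the map tasks run on the nodes holding the HDFS blocks, so the shuffle is the only phase that moves data across the network.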
Hortonworks Data Platform Architecture [ref]
• HDP includes the most recent developments of the open source Hadoop suite
• Can run on Linux and on Windows OS
• Can be deployed on premises on a dedicated cluster, or in the cloud as a hosted application
[ref] http://hortonworks.com/hdp/
Hortonworks Data Platform (HDP) http://hortonworks.com/
• HDP delivers a single integrated Hadoop platform for enterprises
– Provides a data platform for multi-workload data processing across an array of processing methods, from batch and interactive to real-time
– Supports key capabilities of an enterprise data platform: governance, security and operations
– YARN and the Hadoop Distributed File System (HDFS) are the core components of HDP
• YARN is treated as a datacenter OS and supports multiple access methods (batch, real-time, streaming, in-memory, and more) on a common data set
– YARN is the architectural center of Hadoop that allows processing data simultaneously in multiple ways
– Allows creating multi-tenant data analytics applications
• HDP runs natively on Linux and Windows OS
– HDP provides the basis for Microsoft's HDInsight Service, meaning complete portability of data is retained on-premises and in the cloud
– Available in integrated hardware from Teradata
• Hortonworks provides a simple starter solution, the Hadoop Sandbox
– The Hortonworks Sandbox is a single-node implementation of Hadoop based on the Hortonworks Data Platform that includes all the typical components found in a Hadoop deployment
LexisNexis HPCC Systems as an integrated Open Source platform for Big Data Analytics
The HPCC Systems data analytics environment is based on a distributed, shared-nothing architecture and contains two cluster types:
• THOR Data Refinery: a massively parallel Extract, Transform, and Load (ETL) engine that can be used for a variety of tasks such as massive joins, merges, sorts, transformations, clustering, and scaling
• ROXIE Data Delivery: a massively parallel, high-throughput, structured query response engine
– Performs high volumes of structured queries and full-text ranked Boolean search
– ROXIE also provides real-time analytics capabilities to address real-time classification, prediction, fraud detection and other problems that normally require stream analytics
Other components of the HPCC environment:
• Enterprise Control Language (ECL): an open source, data-centric declarative programming language used by both THOR and ROXIE for large-scale data management and query processing
• ECL compiler and job server (ECL Server): serves as the code generator and compiler that translates ECL code
• System data store (Dali): used for environment configuration, message queue maintenance, and enforcement of LDAP security restrictions
• Archiving server (Sasha): serves as a companion 'housekeeping' server to Dali
• Distributed File Utility (DFU Server): controls the spraying and despraying operations used to move data onto and out of THOR
• The inter-component communication server (ESP Server): supports multi-protocol communication between services to enable various types of functionality to client applications via multiple protocols
LexisNexis HPCC Systems Architecture
• THOR is used for massive data processing in batch mode for ETL processing
• ROXIE is used for massive query processing and real-time analytics
LexisNexis HPCC Systems Data Analytics Environment
• Unstructured and structured content
– Ingest disparate data sources
– Manage hundreds of TB of data
• Data Refinery, Linking, Fusion
– Parallel processing architecture
– Discover non-obvious links and detect hidden patterns
– Language optimised for data-intensive applications
• Analysis applications
– Entity resolution
– Link analysis
– Clustering analysis
– Complex analysis
• Variety of industry applications
LexisNexis Data Analytics Languages
• Enterprise Control Language (ECL): an open source, data-centric declarative programming language
– The declarative character of the ECL language simplifies coding
– ECL was developed to simplify both data query design and customary data transformation programming
– ECL is explicitly parallel and relies on the platform's parallelism
• SALT (Scalable Automated Linking Technology), LexisNexis' proprietary record linkage technology, automates the data preparation process: profiling, parsing, cleansing, normalisation, and standardisation of data
– SALT uses weighted matching and threshold-based computation; it also enables internal, external and remote linking with external or master datasets
– Harnesses the power of HPCC Systems and ECL
• Knowledge Engineering Language (KEL) is an ongoing development
– KEL is a domain-specific data processing language that uses semantic relations between entities to automate generation of ECL code
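ECL's declarative style (describe the desired result and let the compiler parallelise the execution) can be mimicked with a lazy pipeline in Python. This is an illustrative analogy, not ECL itself:

```python
class Pipeline:
    """Declarative-flavoured sketch: transformations are recorded, not run,
    until an explicit execute(). A real engine (like the ECL compiler for
    THOR/ROXIE) is then free to optimise and parallelise the recorded plan."""

    def __init__(self, source):
        self.source = source
        self.steps = []  # the query plan, built up lazily

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self

    def sort(self, key):
        self.steps.append(("sort", key))
        return self

    def execute(self):
        data = list(self.source)
        for kind, fn in self.steps:
            if kind == "filter":
                data = [x for x in data if fn(x)]
            else:
                data = sorted(data, key=fn)
        return data

people = [("Bob", 25), ("Ann", 30), ("Eve", 12)]
adults = Pipeline(people).filter(lambda p: p[1] >= 18).sort(lambda p: p[0]).execute()
```

Separating the plan from its execution is what lets a declarative language stay "explicitly parallel" without the programmer writing any parallel code.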
Possible research topics on Cloud Computing and Big Data Infrastructure
• Control and management plane mechanisms for heterogeneous intercloud/multicloud infrastructure, and optimal resource provisioning and management
• Federated cloud infrastructures: user and resource provisioning under distributed QoS requirements and policy restrictions
• Cloud-powered software and applications engineering for continuous service/quality improvement by converging development and operations (DevOps)
• Cloud-based infrastructures for Big Data processing that match the 3 Vs of Big Data (Volume, Velocity, Variety), including on-demand provisioning of Big Data services
• Security, access control and trust management in dynamically provisioned cloud-based applications in an intercloud/multi-cloud environment
• Data-centric security models and services for distributed Big Data (scientific) applications
Summary and take-away
• Emerging Big Data technologies drive cloud computing development and its support for HPC applications
– The majority of big CSPs, first of all AWS and Microsoft Azure, include in their offerings HPC-oriented VM images, as well as special HPC and Hadoop clusters
• The AWS cloud provides a number of services for Big Data processing, including Elastic MapReduce, the NoSQL database DynamoDB, the petabyte-scale data warehouse Redshift, and Kinesis for real-time stream processing
• The Microsoft Azure Analytics Platform System (APS) combines the Microsoft SQL Server based Parallel Data Warehouse (PDW) platform with the HDInsight Hadoop platform developed in cooperation with Hortonworks
• LexisNexis HPCC Systems provides an integrated non-Hadoop platform for Big Data analytics
• Cloud and Big Data are complex but rich research areas
– They have a high entry level, but front-end research and engineering problems are in high demand by research infrastructures and industry