Big Data in Clouds
Cloud based platforms and tools for Big Data
applications
Data Centric Computing
UvA workshop 3 November 2015
Yuri Demchenko
Data Centric Computing Workshop, UvA 2015 / Cloudscape for Big Data
Outline
• Big Data and Data Centric Computing
– Need for new paradigm?
• Cloud Computing as a platform of choice for Big
Data applications
• Big Data Stack and cloud advantages
• Cloud platforms for Big Data
• Questions and discussion
Acknowledgement
Slides 3 and 14-18 are courtesy of David Bernstein
Cloud Computing is the New Pervasive Ubiquitous
Intelligence & Communications Platform
[Figure: word cloud of domains touched by cloud: Education, Entertainment, Infrastructure, Relationships, Day-to-Day Life, Transportation, Commerce, Communications]
Slide courtesy of David Bernstein
Public Clouds are All Over the Place
Cloud Centers are All Over the Place, from datacentermap.com November 2015
http://www.datacentermap.com/cloud.html
Cloud Computing as a key IT technology
transformation factor
• Cloud Computing powers modern business and enables new technologies that require elastic, on-demand computing resources
– Mobile applications
– Big Data applications
– Internet of Things
– Changing the telecom market landscape
• Demand from other technologies accelerates Cloud Computing development
– Big Data and data-intensive applications, research infrastructures, mobile technologies, IoT
• Cloud Computing increases business agility and speeds up the development of new services and technologies
– However, it requires transformation of IT platforms/solutions and re-factoring/re-design of applications
Source: VMware Business white paper
[Figure: Cloud enables IT agility, which drives business agility and corporate performance]
Gartner's 2014 Hype Cycle
Source: Gartner (August 2014)
[ref] Gartner's 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business. http://www.gartner.com/newsroom/id/2819918
Cloud Computing:
Approaching a maturity stage
Cisco Cloud Index 2013-2018: Cloud Service
Delivery Models Market Share
Cloud workloads processed by different cloud service delivery models by 2018:
• Software-as-a-Service (SaaS): 59 percent, up from 41 percent in 2013
• Infrastructure-as-a-Service (IaaS): 28 percent, down from 44 percent in 2013
• Platform-as-a-Service (PaaS): 13 percent, down from 15 percent in 2013
[ref] Cisco Global Cloud Index: Forecast and Methodology, 2013–2018 [online]
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
Data Center Virtualization and Cloud Computing Growth by 2018
• 78% of workloads will be processed by cloud data centers
• 22% will be processed by traditional data centers
Big Data is Driving Cloud Usage
Data is becoming the new raw material of business: an economic input almost on a par with capital and labour. "Every day I wake up and ask, 'how can I flow data better, manage data better, analyse data better?'" says Rollin Ford, the CIO of Wal-Mart.
Source: Data, Data Everywhere, The Economist, February 25, 2010

"There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing."
Eric Schmidt, Google CEO, Techonomy Conference, August 4, 2010
NIST Big Data Working Group (NBD-WG) and
ISO/IEC JTC1 Study Group on Big Data (SGBD)
• NIST Big Data Working Group (NBD-WG) is leading the development of the Big Data Technology Roadmap - http://bigdatawg.nist.gov/home.php
– Builds on the experience of developing the Cloud Computing standards, which are fully accepted by industry
• Set of documents delivered September 2014 (to be published as NIST Special Publication
documents) - http://bigdatawg.nist.gov/V1_output_docs.php
Volume 1: NIST Big Data Definitions
Volume 2: NIST Big Data Taxonomies
Volume 3: NIST Big Data Use Case & Requirements
Volume 4: NIST Big Data Security and Privacy Requirements
Volume 5: NIST Big Data Architectures White Paper Survey
Volume 6: NIST Big Data Reference Architecture
Volume 7: NIST Big Data Technology Roadmap
• NBD-WG defined 3 main components of the new technology:
– Big Data Paradigm
– Big Data Science, and Data Scientist as a new profession
– Big Data Architecture
The Big Data Paradigm consists of the distribution of data systems across horizontally-coupled independent resources to achieve the scalability needed for the efficient processing of extensive datasets.
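This horizontal-scaling idea can be sketched in a few lines of Python: partition the dataset across independent workers, process each partition in parallel, and merge the partial results. The sharding scheme and worker pool below are illustrative stand-ins for a real cluster framework.

```python
from concurrent.futures import ThreadPoolExecutor

def shard(data, n):
    """Split a dataset into n independent partitions (horizontal coupling)."""
    return [data[i::n] for i in range(n)]

def partial_sum(partition):
    # Each partition is processed by an independent resource (scale-out).
    return sum(partition)

data = list(range(1000))
partitions = shard(data, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))
total = sum(partials)  # merge step combines the partial results
```

Real frameworks such as Hadoop and Spark apply the same shard-process-merge pattern, with the scheduler placing work where the data lives.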
NIST Big Data Reference Architecture
Main components of the Big Data ecosystem:
• Data Provider
• Big Data Applications Provider
• Big Data Framework Provider
• Data Consumer
• Service Orchestrator
Big Data Lifecycle and Applications Provider activities:
• Collection
• Preparation
• Analysis and Analytics
• Visualization
• Access
The Big Data Ecosystem includes all components involved in Big Data production, processing, delivery, and consumption
[Figure: NIST Big Data Reference Architecture. A System Orchestrator coordinates the flow from Data Provider through the Big Data Application Provider (Collection, Preparation, Analytics, Visualization, Access) to the Data Consumer. The Big Data Framework Provider supplies Processing Frameworks (analytic tools, etc.), Platforms (databases, etc.), and Infrastructures of physical and virtual resources (networking, computing, etc.), each horizontally scalable (VM clusters) and vertically scalable. Security & Privacy and Management are cross-cutting concerns; the horizontal axis is the Information Value Chain, the vertical axis the IT Value Chain. Key: service use, Big Data information flow, SW tools and algorithms transfer.]
[ref] Volume 6: NIST Big Data Reference Architecture. http://bigdatawg.nist.gov/V1_output_docs.php
Big Data Architecture Framework (BDAF)
(1) Data Models, Structures, Types
– Data formats, non/relational data, file systems, etc.
(2) Big Data Management
– Big Data Lifecycle (Management) Model
• Big Data transformation/staging
– Provenance, Curation, Archiving
(3) Big Data Analytics and Tools
– Big Data Applications
• Target use, presentation, visualisation
(4) Big Data Infrastructure (BDI)
– Storage, Compute, (High Performance Computing,) Network
– Sensor networks, target/actionable devices
– Big Data operational support
(5) Big Data Security
– Data security at-rest and in-motion, trusted processing environments
Big Data Infrastructure and Analytics Tools
Big Data Infrastructure
• Heterogeneous multi-provider inter-cloud infrastructure
• Data management infrastructure
• Collaborative environment
• Advanced high-performance (programmable) network
• Security infrastructure
• Federated Access and Delivery Infrastructure (FADI)
Big Data Analytics Infrastructure/Tools
• High Performance Computer Clusters (HPCC)
• Big Data storage and databases: SQL and NoSQL
• Analytics/processing: real-time, interactive, batch, streaming
• Big Data analytics tools and applications
Data Lifecycle/Transformation Model
• The data model changes along the data lifecycle or evolution
• Data provenance is the discipline of tracking all data transformations along the lifecycle
• Identifying and linking data
– Persistent data/object identifiers (PID/OID)
– Traceability vs Opacity
– Referral integrity
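A minimal provenance sketch in Python, assuming a toy PID scheme (UUIDs) rather than a real persistent-identifier service: every transformation issues a new PID and appends a record linking input and output, which is what makes the lifecycle traceable.

```python
import uuid

def new_pid():
    # Illustrative persistent identifier; a real system would use a PID service
    return str(uuid.uuid4())

def transform(record, func, step_name):
    """Apply a transformation and append a provenance entry linking old and new PIDs."""
    new = {"pid": new_pid(), "data": func(record["data"])}
    new["provenance"] = record["provenance"] + [
        {"step": step_name, "input_pid": record["pid"], "output_pid": new["pid"]}
    ]
    return new

raw = {"pid": new_pid(), "data": [3, 1, 2], "provenance": []}
cleaned = transform(raw, sorted, "cleanse")
enriched = transform(cleaned, lambda d: [x * 10 for x in d], "enrich")
```

Referral integrity here means every provenance entry's input_pid resolves to an earlier object in the lifecycle.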
Multiple data models and structures
• Data Variety and Variability
• Semantic Interoperability
Data (inter)linking
• PID/OID
• Identification
• Traceability vs Opacity
• Privacy, Opacity
[Figure: data lifecycle stages from Data Source through Data Collection & Registration, Data Filter/Enrich/Classification, Data Analytics/Modeling/Prediction, to Data Delivery/Visualisation and consumer applications, over a Big Data capable Data Storage layer; the data model (1)-(4) changes at each stage, with a feedback loop for data repurposing, analytics re-factoring, and secondary processing]
Big Data Cloud Based Service
[Figure: data-intensive computing system (e.g. Hadoop): an external client sends a query to a parallel query server; parallel compute servers and parallel data servers operate over a parallel file system (e.g., GFS, HDFS) holding the source dataset and derived datasets (d1, d2, d3), ingesting from external data sources and returning a small result]
Examples: search, scene-completion service, log processing
Characteristics: massive data and computation in the cloud; small queries and results
Big Data Stack
Pipeline stages: Data Ingestion → Message/Data Queue → Streaming Analytics / Real-Time Decisions → Data Export → Data Store & Historical Analytics

Data Ingestion
• Use the direct ingestion API to capture the entire stream at wire speed
• Ingestion will transform, normalize, distribute and integrate data into one or more of the Analytic/Decision Engines of choice
Message/Data Queue
• Hook into an existing queue to get a copy/subset of the data; the queue handles partitioning, replication, and ordering of data, and can manage backpressure from slower downstream components
Streaming Analytics
• NON QUERY; TIME WINDOW MODEL
• Counting, Statistics; Rules; Triggers
Real-Time Decisions
• CONTINUOUS QUERY; OLTP MODEL
• Programmatic Request-Response; Stored Procedures; Export for Pipelining
• Use one or more Analytic/Decision engines to accomplish a specific task on the Fast Data window
• Real-time decisions and/or results can be fed back "upstream" to influence the "next step"
Data Export
• Export will transform, normalize, distribute and integrate data into one or more of the Data Warehouse/Storage Engines of choice
Data Store & Historical Analytics
• PERIODIC QUERY; OLAP / DATA WAREHOUSE MODEL
• BI, Reports; Dashboards; User Interactive
• The OLAP Engines of choice append to historical data and store it for later, further Big Data analysis on the entire data set
Big Data Stack Components and their Interaction
• Ingestion: provides the interface to the streaming data sources, including data transformation and normalisation
– Direct ingestion using the data-generating API at "wire speed"
– Message queue: effective for heterogeneous data sources with different speed/velocity
• Streaming analytics and real-time decisions
– Streaming analytics uses a non-query, time-window model
– Real-time analytics uses a continuous-query OLTP model
• Data export, data store and historical analytics
– Transform, normalize, distribute and integrate data into one or more Data Warehouses or storage platforms
– Uses a periodic-query OLAP / data warehouse model
– Data are stored for future Big Data analysis
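The non-query, time-window model of streaming analytics can be illustrated with a tumbling-window counter (a simplified Python sketch; real engines such as Storm or Spark Streaming manage such windows over unbounded streams):

```python
def tumbling_counts(events, window):
    """Count events per fixed (tumbling) time window.

    events: iterable of (timestamp, payload) pairs; window: window length.
    Returns a dict mapping window start time to event count.
    """
    counts = {}
    for ts, _payload in events:
        start = (ts // window) * window  # the window this event falls into
        counts[start] = counts.get(start, 0) + 1
    return counts

events = [(1, "a"), (4, "b"), (11, "c"), (12, "d"), (25, "e")]
per_window = tumbling_counts(events, window=10)
```

No query is ever issued: statistics are emitted per window as data flows past, in contrast with the continuous-query OLTP model used for real-time decisions.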
Important Big Data Technologies
Technologies by Big Data Stack component:
• Data Ingestion and Message/Data Queue: Data Factory, Kafka, Flume, Scribe, Samza, Dataflow, Kinesis, Event Hubs
• Streaming Analytics: Stream Analytics, Storm, Cascading, S4, Crunch, Spark Streaming
• Real-Time Decisions and Data Store & Historical Analytics: HDInsight, DocumentDB, Hadoop, Hive, Spark SQL, Druid, BigQuery, DynamoDB, Vertica, MongoDB, EMR, CouchDB, VoltDB
• Data Export: Sqoop, HDFS

By provider:
• Microsoft Azure: Event Hubs, Data Factory, Stream Analytics, HDInsight, DocumentDB
• Google GCE: Dataflow, BigQuery
• Amazon AWS: Kinesis, EMR, DynamoDB
• Open Source: Samza, Kafka, Flume, Scribe, Storm, Cascading, S4, Crunch, Spark, Hadoop, Hive, Druid, MongoDB, CouchDB, VoltDB
• Proprietary: Vertica, HortonWorks
The Big Data Platform is Cloud!
Requirements:
• Interesting applications are data hungry
• The data grows over time
• The data is immobile: compute comes to the data
• Big Data clusters are the new libraries
Solution:
• Provide high-performance execution over Big Data repositories
– Many spindles, many CPUs; parallel processing
• Enable multiple services to access a repository concurrently
• Enable low-latency scaling of services
• Enable each service to leverage its own software stack
– IaaS, file-system protections where needed
• Enable rapid resource scaling for power/demand
– Scaling-aware storage
• Enable slow resource scaling for growth
Cloud Platform Benefits for Big Data
• Segregated networks isolate traffic
– Cloud construction provides separate networks for each type of traffic
– Big Data applications benefit from the lowest possible latencies for node-to-node synchronization, dynamic cluster resizing, and other scale-out operations
• Cloud deployment on virtual machines, containers, and bare metal
– For a traditional, highly load-variable problem, one might consider using VMs as the deployment vehicle
– Some clouds offer container-based isolation instead of VMs
• Cloud tools for large-scale application deployment, supported by major IDEs
– Zero-touch service provisioning
Cloud-powered Services Development Lifecycle
• Easily creates a test environment close to the real one
• Powered by cloud deployment automation tools
– Enable configuration management, orchestration, and deployment automation
• Continuous development - test - integration
– CloudFormation Template, Configuration Template, Bootstrap Template
• Can be used with Puppet and Chef, two configuration and deployment management systems for clouds
[ref] "Building Powerful Web Applications in the AWS Cloud" by Louis Columbus, http://softwarestrategiesblog.com/2011/03/10/building-powerful-web-applications-in-the-aws-cloud/
[Figure: environment progression Dev → Test/QA → Staging → Prod, driven by Amazon CloudFormation + Chef (or Puppet) for configuration management and orchestration: all in one instance (Dev); 2-tier app with test DB (Test/QA); 2-tier app with production data (Staging); 2-tier app with production data, Multi-AZ and HA (Prod)]
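A template of the kind CloudFormation consumes is just a declarative JSON document. The sketch below builds a skeleton for a two-tier app in Python; the resource properties (instance type, AMI id, DB sizing) are placeholders, not a tested production template.

```python
import json

# Illustrative skeleton of an AWS CloudFormation template for a two-tier app.
# Property values are placeholders and would need real AMI ids and sizing.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Two-tier app: web instance plus database (sketch)",
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {"InstanceType": "t2.micro", "ImageId": "ami-EXAMPLE"},
        },
        "AppDatabase": {
            "Type": "AWS::RDS::DBInstance",
            "Properties": {
                "Engine": "mysql",
                "DBInstanceClass": "db.t2.micro",
                "AllocatedStorage": "20",
            },
        },
    },
}
doc = json.dumps(template, indent=2)  # document handed to CloudFormation
```

The same template can be parameterised per environment (Dev, Test/QA, Staging, Prod), which is what makes the progression in the figure repeatable.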
Data Protection and Privacy in Cloud
Data protection and privacy will remain the main factors restricting cloud growth and will stimulate the following research and developments:
• National and international data protection regulation and legislation, supported by corresponding certification and assessment procedures, services and bodies
• New advanced privacy enhancement technologies (PET) should protect user privacy for stored data and activity in the cloud
– Emerging Big Data technologies will require new PET preventing user (re-)identification based on correlation between multiple sources of data
• The shared security model currently used in the cloud will increase the level of customer-controlled security (and privacy) compliance monitoring
• Cloud access control models will continue adopting federated access control and identity management models
– While based on initially verified user identity, the access control methods should not reveal user identity to cloud service providers and third-party services
• Data encryption and key management
– New encrypted data processing models such as homomorphic encryption
– New key management models
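Homomorphic encryption lets the cloud compute on ciphertexts without seeing the plaintext. As a toy illustration (textbook RSA with tiny primes, never safe in practice), RSA is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the plaintexts.

```python
# Toy parameters for illustration only; real RSA uses 2048+ bit moduli.
p, q = 61, 53
n = p * q                 # modulus (3233)
phi = (p - 1) * (q - 1)   # 3120
e = 17                    # public exponent, coprime with phi
d = pow(e, -1, phi)       # private exponent (modular inverse, Python 3.8+)

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

m1, m2 = 7, 9
# The "cloud" multiplies ciphertexts without ever decrypting them:
c_product = (enc(m1) * enc(m2)) % n
product = dec(c_product)  # equals m1 * m2 = 63
```

Fully homomorphic schemes extend this idea to arbitrary computation, at a much higher cost.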
Trustworthiness of Cloud Computing Platform and Trust Management
A trusted cloud platform includes at least two aspects: platform trustworthiness and trust management
• Trustworthiness is a property of the cloud platform that depends on multiple factors, including design principles, implementation, technical mechanisms and operational methods
– Trustworthiness by design is the key to creating a trusted cloud platform
– Cloud providers pay strong attention to all aspects of their platform to create a trusted environment for user services and data
• Trust management includes both technical mechanisms and solutions for establishing and managing trust relations between the provider's and the customer's or user's administrative and security domains
– Technical trust is rooted in the trusted platform and typically requires manual setup
– Cloud will stimulate further development of dynamic trust establishment and management mechanisms that adapt to the dynamic, on-demand character of cloud services
– The Trusted Computing Platform (TCP) architecture by the Trusted Computing Group (TCG) provides a technical basis for building a distributed trusted environment
• TCP is based on the hardware Trusted Platform Module (TPM), which allows tamper-resistant (hardware-based) secret key storage and operates as a Root of Trust
• TCG website: http://www.trustedcomputinggroup.org/
• Trust management mechanisms are typically hidden from users but require precise design and strict operation from service and application providers
Big Data platforms and Providers
• AWS Big Data services
• Microsoft Azure Analytics Platform System and
HDInsight
• LexisNexis HPCC Systems
Cloud HPC and Big Data Platforms
• HPC on cloud platforms
– Special HPC and GPU VM instances as well as Hadoop/HPC clusters offered by CSPs
• Amazon Big Data services
– Amazon Elastic MapReduce
• Microsoft Analytics Platform System (APS)
– Microsoft HDInsight
• LexisNexis HPCC Systems
– Combining an HPC cluster platform with optimized data processing languages
• Streaming analytics/processing tools
– Apache Kafka (http://kafka.apache.org/)
• High-throughput distributed publish-subscribe messaging system used for real-time stream processing
– Apache Storm (https://storm.incubator.apache.org/)
• Distributed realtime computation system for unbounded streams of data
– Apache Spark (https://spark.apache.org/)
• A general-purpose fast cluster computing system that aims to combine batch and real-time streaming data processing
• Spark provides primitives for in-memory cluster computing that make it well suited to machine learning algorithms
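The core idea behind Kafka-style systems, an append-only topic log with independently tracked consumer offsets, can be sketched in memory (illustrative only; a real broker adds partitioning, replication and persistence):

```python
from collections import defaultdict

class MiniBroker:
    """In-memory sketch of the publish/subscribe log model implemented by
    systems like Kafka: producers append to a topic log; each consumer keeps
    its own offset and reads the log independently of other consumers."""

    def __init__(self):
        self.logs = defaultdict(list)     # topic -> append-only message log
        self.offsets = defaultdict(int)   # (topic, consumer) -> next offset

    def publish(self, topic, message):
        self.logs[topic].append(message)

    def poll(self, topic, consumer):
        """Return all messages this consumer has not yet seen."""
        log = self.logs[topic]
        start = self.offsets[(topic, consumer)]
        self.offsets[(topic, consumer)] = len(log)
        return log[start:]

broker = MiniBroker()
broker.publish("clicks", "event-1")
broker.publish("clicks", "event-2")
first = broker.poll("clicks", "consumer-a")   # both events
second = broker.poll("clicks", "consumer-a")  # nothing new
late = broker.poll("clicks", "consumer-b")    # independent offset: both events
```

Because offsets are per consumer, slow consumers simply fall behind in the log instead of blocking producers, which is how such queues absorb backpressure.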
Cloud and High Performance Computing (HPC)
• Current trend: increasing demand for HPC for Big Data analytics and business intelligence
– Data analytics platforms and services are provided by the majority of big Cloud Service Providers as designated data analytics, Hadoop and HPC clusters
• Problem: Cloud Computing is not designed for HPC but rather for scalability and parallelization
– Tightly coupled HPC workloads do not scale well in a VM/cloud environment
• Restricting factors for HPC on cloud platforms
– High cost of data access/download from the cloud
– Not all clouds offer high-end VM configurations optimized for compute
• Computational challenges of Big Data can be addressed by aggregating compute power and parallelising tasks
– 24 hours on 1 machine can be 1 hour on 24 machines
– These 24 machines require a supporting infrastructure that handles orchestration, division of jobs into tasks, and distribution of tasks across a computer cluster
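The supporting infrastructure named above, splitting a job into tasks and distributing them over workers, can be sketched with a shared work queue (threads stand in for cluster nodes; illustrative only):

```python
import queue
import threading

def run_cluster(tasks, n_workers):
    """Sketch of job orchestration: a coordinator enqueues tasks and a pool
    of workers pulls them from a shared queue (a stand-in for a cluster)."""
    work = queue.Queue()
    for t in tasks:
        work.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return  # no tasks left for this worker
            r = task()  # execute the task
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# 24 independent one-"hour" tasks spread over 24 workers finish in ~1 "hour"
results = run_cluster([lambda i=i: i * i for i in range(24)], n_workers=24)
```

The speedup holds only while tasks are independent; tightly coupled HPC workloads break this assumption, which is the scaling problem noted above.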
Amazon EC2 HPC-oriented AMI instances
• Amazon EC2 offers dedicated computation-optimized VMs and "cluster instances" that deliver better performance to HPC users
– Two Cluster Compute instances provide a very large amount of CPU coupled with increased network performance (10GigE). Instances come in two sizes:
• Nehalem-based "Quadruple Extra Large Instance" (eight cores/node, 23GB of RAM, 1.7TB of local storage)
• Sandy Bridge-based "Eight Extra Large Instance" (16 cores/node, 60.5GB of RAM, 3.4TB of local storage)
– Besides the Amazon cluster instances, users may select from one of several preconfigured clusters
– Additionally, two other specialized instances for HPC clusters:
• Cluster GPU instance that provides two NVidia Tesla Fermi M2050 GPUs with proportionally high CPU and 10GigE network performance
• High-I/O instance that provides two SSD-based volumes, each with 1024GB of storage
– Pricing can vary depending upon on-demand, scheduled, or spot purchase of resources
• Quadruple Extra Large Instance is US$ 1.3/hour (US$ 0.16/core·hour)
• Eight Extra Large Instance is US$ 2.4/hour (US$ 0.15/core·hour)
• Cluster GPU instance is US$ 2.1/hour, and the High I/O instance is US$ 3.1/hour
• For example:
– A small usage case (80 cores, 4GB of RAM per core, and basic storage of 500GB) would cost US$ 24.00/hour (10 Eight Extra Large Instances)
– A larger usage case (256 cores, 4GB of RAM per core, and 1TB of fast global storage) would cost US$ 38.4/hour (16 Eight Extra Large Instances)
– Once created, the instances must be provisioned and configured to work as a cluster by the user
• Total cost depends on compute time, total data storage and data transfer cost
– Amazon does not charge for data transferred into EC2 but has a varying rate schedule for transfer out of the cloud; additionally, there are EC2 storage costs
Example: Univa HPC with RightScale and Amazon AWS
• Univa offers "one-click" HPC computing with their partner RightScale
• First stage/offering: create an internal cloud infrastructure system that provides "raw" virtual machines using VMware, KVM, Xen, or cloud platforms such as OpenStack or VMware vSphere
– The UniCloud product uses an IaaS layer to create and provision a raw machine on the fly
– Once the machine is provisioned (on one of the IaaS platforms/providers), UniCloud does additional software provisioning and configuration to turn the raw instances into nodes that can then be part of an HPC cluster
• Second stage/offering: use an external cloud infrastructure system, such as Amazon EC2 and AWS, to provide the "raw" virtual machines for HPC use
– Additional provisioning and configuration management is done by UniCloud (PaaS), which turns the raw instances into HPC cluster nodes
– This process is automatic, and adding and removing nodes from the HPC cluster can be done on demand
• To launch an external cloud on Amazon EC2, just go to the Amazon Marketplace, choose HPC from the list, and launch your cluster
– Once the cloud is built, you can then install your application
• Univa's UniCloud package provides a "ready-to-run" HPC cloud with prebuilt "kits" that include Univa Grid Engine
– UniCloud can burst from an internal cloud to an external public cloud or from one public cloud to another
• Univa costs are additive to the AWS instance charges and range from US$ 0.02 up to a maximum of US$ 0.08/core·hour
– Univa proposed a Medium AWS instance for the small usage case (80 cores, 4GB of RAM per core, basic storage of 500GB) that would cost US$ 14.40/hour [(AWS @ US$ 0.16/hour + Univa @ US$ 0.02/hour) x 80]
– The larger usage case (256 cores, 4GB of RAM per core, 1TB of fast storage) would cost US$ 58.88/hour
AWS Cloud Big Data Services
The AWS cloud offers the following services and resources for Big Data processing:
• EC2 Virtual Machine (VM) instances for HPC, optimized for computing (with multiple cores) and with extended storage for large data processing
• Amazon Elastic MapReduce (EMR) provides the Hadoop framework on Amazon EC2 and offers a wide range of Hadoop-related tools
• Amazon Kinesis is a managed service for real-time processing of streaming big data (with throughput scaling from megabytes to gigabytes of data per second and from hundreds of thousands of different sources)
• Amazon DynamoDB is a highly scalable NoSQL data store with sub-millisecond response latency
• Amazon Redshift is a fully managed, petabyte-scale data warehouse in the cloud at a cost of less than $1000 per terabyte per year. It uses columnar data storage and can parallelise queries
• Amazon RDS is a scalable relational database service
• Amazon Glacier is archival storage for long-term data storage at lower cost than standard Amazon S3 object storage
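Why columnar storage (as used by Redshift) speeds up warehouse queries can be seen in a small sketch: an aggregate needs to scan only one attribute array instead of every full row. Field names and values below are illustrative.

```python
# Row layout: each record stored whole, as an OLTP system would.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 19.9},
]

# Columnar layout: one contiguous array per attribute.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# A warehouse-style aggregate touches a single column, not every full row.
total = sum(columns["amount"])
```

At petabyte scale the difference is I/O, not CPU: the engine reads only the columns a query references, and same-typed columns also compress far better than mixed rows.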
Microsoft Azure Analytics Platform System (APS)
• The Microsoft Azure cloud provides general IaaS services and rich Platform-as-a-Service (PaaS) services
– Similar to AWS, Microsoft Azure offers special VM instances with advanced computational and memory capabilities
• The Analytics Platform System (APS) combines the Microsoft SQL Server based Parallel Data Warehouse (PDW) platform with HDInsight, an Apache Hadoop based scalable data analytics platform
– APS includes the PolyBase data querying technology to simplify integration of PDW SQL data and data from Hadoop
• The HDInsight Hadoop-based platform has been co-developed with Hortonworks
– HDInsight provides comprehensive integration and management functionality for multi-workload data processing on the Hadoop platform, including batch, stream, and in-memory processing methods
HDInsight: Microsoft's Big Data Solution
• HDInsight can run both on the Azure cloud and on Windows Server (on premises)
– Data exchange via PolyBase
• Compatible with and supports all products from the Apache Hadoop stack
• Supports all stages of Big Data processing
• PolyBase is a new technology that allows integrating the Microsoft SQL Server based Parallel Data Warehouse (PDW) with Hadoop
• Azure Blob Storage is used to persistently store data
– Data are streamed to Hadoop/HDFS for processing and pushed back to Azure Blob Storage
Data Centric Computing
Workshop UvA 2015Cloudscape for Big Data 31
HDInsight/Hadoop Ecosystem
[Figure: HDInsight/Hadoop ecosystem built on Distributed Storage (HDFS) and Distributed Processing (MapReduce), with Query (Hive) accessed via ODBC. Legend: red = core Hadoop; blue = data processing; purple = Microsoft integration points and enhancements; orange = data movement; green = packages]
[ref] Microsoft Azure Training Kit. https://github.com/Azure-Readiness/MicrosoftAzureTrainingKit
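The Distributed Processing (MapReduce) layer in the ecosystem follows a map → shuffle → reduce pattern, shown here as an in-memory Python stand-in for Hadoop's distributed implementation:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Emit (key, value) pairs; here, one pair per word occurrence.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group values by key; in Hadoop this moves data between nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data in clouds", "big clouds"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
```

In a real cluster the map tasks run on the nodes holding the HDFS blocks, so the shuffle is the only phase that moves data across the network.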
Hortonworks Data Platform Architecture [ref]
• HDP includes the most recent developments of the open source Hadoop suite
• Can run on Linux and on Windows OS
• Can be deployed on premises on a dedicated cluster, or in the cloud as a hosted application
[ref] http://hortonworks.com/hdp/
Hortonworks Data Platform (HDP) http://hortonworks.com/
• HDP delivers a single integrated Hadoop platform for enterprises
– Provides a data platform for multi-workload data processing across an array of processing methods, from batch and interactive to real-time
– Supports key capabilities of an enterprise data platform: governance, security and operations
– YARN and the Hadoop Distributed File System (HDFS) are the core components of HDP
• YARN is treated as a datacenter OS and supports multiple access methods (batch, real-time, streaming, in-memory, and more) on a common data set
– YARN is the architectural center of Hadoop that allows processing data simultaneously in multiple ways
– Allows creating multi-tenant data analytics applications
• HDP runs natively on Linux and Windows OS
– HDP provides the basis for Microsoft's HDInsight Service, meaning complete portability of data is retained on-premises and in the cloud
– Available in integrated hardware from Teradata
• Hortonworks provides a simple starter solution, the Hadoop Sandbox
– The Hortonworks Sandbox is a single-node implementation of Hadoop based on the Hortonworks Data Platform that includes all the typical components found in a Hadoop deployment
LexisNexis HPCC Systems as an integrated Open Source platform for Big Data Analytics
The HPCC Systems data analytics environment is based on a distributed, shared-nothing architecture and contains two cluster types:
• THOR Data Refinery: a massively parallel Extract, Transform, and Load (ETL) engine that can be used for a variety of tasks such as massive joins, merges, sorts, transformations, clustering, and scaling
• ROXIE Data Delivery: a massively parallel, high-throughput, structured query response engine
– Performs high volumes of structured queries and full-text ranked Boolean search
– ROXIE also provides real-time analytics capabilities to address real-time classification, prediction, fraud detection and other problems that normally require stream analytics
Other components of the HPCC environment:
• Enterprise Control Language (ECL): an open source, data-centric declarative programming language used by both THOR and ROXIE for large-scale data management and query processing
• ECL compiler and job server (ECL Server): serves as the code generator and compiler that translates ECL code
• System data store (Dali): used for environment configuration, message queue maintenance, and enforcement of LDAP security restrictions
• Archiving server (Sasha): serves as a companion 'housekeeping' server to Dali
• Distributed File Utility (DFU Server): controls the spraying and despraying operations used to move data onto and out of THOR
• The inter-component communication server (ESP Server): supports multi-protocol communication between services to enable various types of functionality to client applications via multiple protocols
LexisNexis HPCC Systems Architecture
• THOR is used for massive data processing in batch mode for ETL processing
• ROXIE is used for massive query processing and real-time analytics
LexisNexis HPCC Systems Data Analytics Environment
• Unstructured and structured content
– Ingest disparate data sources
– Manage hundreds of TB of data
• Data Refinery, Linking, Fusion
– Parallel processing architecture
– Discover non-obvious links and detect hidden patterns
– Language optimised for data-intensive applications
• Analysis applications
– Entity resolution
– Link analysis
– Clustering analysis
– Complex analysis
• Variety of industry applications
LexisNexis Data Analytics Languages
• Enterprise Control Language (ECL): an open source, data-centric declarative programming language
– The declarative character of the ECL language simplifies coding
– ECL was developed to simplify both data query design and customary data transformation programming
– ECL is explicitly parallel and relies on the platform's parallelism
• SALT (Scalable Automated Linking Technology), LexisNexis' proprietary record linkage technology, automates the data preparation process: profiling, parsing, cleansing, normalisation, and standardisation of data
– SALT uses weighted matching and threshold-based computation; it also enables internal, external and remote linking with external or master datasets
– Harnesses the power of HPCC Systems and ECL
• Knowledge Engineering Language (KEL) is an ongoing development
– KEL is a domain-specific data processing language that uses semantic relations between entities to automate generation of ECL code
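ECL's declarative style (describe the desired result and let the compiler parallelise the execution) can be mimicked with a lazy pipeline in Python. This is an illustrative analogy, not ECL itself:

```python
class Pipeline:
    """Declarative-flavoured sketch: transformations are recorded, not run,
    until an explicit execute(). A real engine (like the ECL compiler for
    THOR/ROXIE) is then free to optimise and parallelise the recorded plan."""

    def __init__(self, source):
        self.source = source
        self.steps = []  # the query plan, built up lazily

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self

    def sort(self, key):
        self.steps.append(("sort", key))
        return self

    def execute(self):
        data = list(self.source)
        for kind, fn in self.steps:
            if kind == "filter":
                data = [x for x in data if fn(x)]
            else:
                data = sorted(data, key=fn)
        return data

people = [("Bob", 25), ("Ann", 30), ("Eve", 12)]
adults = Pipeline(people).filter(lambda p: p[1] >= 18).sort(lambda p: p[0]).execute()
```

Separating the plan from its execution is what lets a declarative language stay "explicitly parallel" without the programmer writing any parallel code.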
Possible research topics on Cloud Computing and Big Data Infrastructure
• Control and management plane mechanisms for heterogeneous intercloud/multicloud infrastructure, and optimal resource provisioning and management
• Federated cloud infrastructures: user and resource provisioning under distributed QoS requirements and policy restrictions
• Cloud-powered software and applications engineering for continuous service/quality improvement by converging development and operations (DevOps)
• Cloud-based infrastructures for Big Data processing that match the 3 Vs of Big Data (Volume, Velocity, Variety), including on-demand provisioning of Big Data services
• Security, access control and trust management in dynamically provisioned cloud-based applications in an intercloud/multi-cloud environment
• Data-centric security models and services for distributed Big Data (scientific) applications
Summary and take-away
• Emerging Big Data technologies drive cloud computing development and its support for HPC applications
– The majority of big CSPs, first of all AWS and Microsoft Azure, include in their offerings HPC-oriented VM images, as well as special HPC and Hadoop clusters
• The AWS cloud provides a number of services for Big Data processing, including Elastic MapReduce, the NoSQL database DynamoDB, the petabyte-scale data warehouse Redshift, and Kinesis for real-time stream processing
• The Microsoft Azure Analytics Platform System (APS) combines the Microsoft SQL Server based Parallel Data Warehouse (PDW) platform with the HDInsight Hadoop platform developed in cooperation with Hortonworks
• LexisNexis HPCC Systems provides an integrated non-Hadoop platform for Big Data analytics
• Cloud and Big Data are complex but rich research areas
– They have a high entry level, but front-end research and engineering problems are in high demand by research infrastructures and industry