Microsoft Analytics Platform System

The turnkey modern data warehouse appliance

Stefan Cronjaeger

June 2014

Agenda

• Modern Data Warehouse
  • Big Data
  • Application examples
• Analytic Platform Systems
  • Architecture
  • Hadoop
• Integration of Hadoop and APS
  • APS with external Hadoop Clusters
  • APS with Hadoop in the Cloud
  • APS with integrated Hadoop

Data sources

Data sources: Non-Relational Data

Big Data: Variety, Velocity, Volume … and Analytics

Sensor and machine log

Business apps

Web

Social Media

Technologies to drive Big Data

What to do with the data

• Forecast
• Geo analysis
• Customer interaction
• Churn
• Customer segmentation
• Shopping basket & Recommendation
• Keywords & Sentiment
• Scoring & Outlier

Examples of sentiment analysis: not only marketing

Browse blogs, Twitter, news articles, newsgroups. Extract keywords, pairs of keywords, sentiments. Analyze and correlate.

Campaign supervision
• Political campaigns and keywords
• Marketing campaigns
• Trend analysis

Quality assurance
• Analyse internal technical discussion groups
• Get early warning of possible technical issues

Supply chain for fashion
• Look in fashion blogs and discussion groups
• Forecast demand of specific fashion articles

Structured data: Fraud detection in large amounts of financial data – where to look

Not all digits are equal!

About 130 years ago, Simon Newcomb noticed that more numbers start with the digit 1 than with any other digit. Benford later rediscovered the effect.

The idea:

Look at the numbers (e.g., in a balance sheet), see how their leading digits are usually distributed, and look for deviations.

Application:

• Tax fraud in balance sheets (actually used by auditors)
• Manipulated numbers in scientific publications
• Fraud in elections, election campaign financing, …
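A minimal sketch of the idea, assuming the standard statement of Benford's law (expected share of leading digit d is log10(1 + 1/d)). The function names and the sample data are illustrative assumptions, not taken from the slides; a real audit would use a proper statistical test rather than a single maximum deviation.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """First significant digit of a non-zero number."""
    digits = f"{abs(value):.15e}"  # scientific notation, e.g. '4.200...e+03'
    return int(digits[0])


def leading_digit_shares(values):
    """Observed share of each leading digit among the non-zero values."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


def max_deviation(values):
    """Largest absolute gap between observed and expected digit shares."""
    expected = benford_expected()
    observed = leading_digit_shares(values)
    return max(abs(observed[d] - expected[d]) for d in range(1, 10))


# Illustrative data only: powers of 2 follow Benford's law closely,
# while evenly spread three-digit numbers do not.
natural = [2 ** n for n in range(1, 200)]
uniform = list(range(100, 1000))
print(max_deviation(natural), max_deviation(uniform))
```

A large deviation does not prove manipulation; it only indicates where an auditor might look more closely.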

An application of Benford’s law

Bernhard Rauch, Max Göttsche, Gernot Brähler & Thomas Kronfeld (2014) Deficit versus social statistics: empirical evidence for the effectiveness of Benford’s law, Applied Economics Letters, 21:3, 147-151

Differences in number statistics for EU reporting of Social Data and Deficit data by country

Data sources Non-relational data

About Analytics Platform System

PDW Logical Architecture

Database “host” Servers

Control Host Node

Direct Attached Storage Nodes

Control Node (virtualized)

Compute/Storage Nodes (virtualized)

Client Queries

Virtualization spare

• All servers are virtualization hosts running Windows Server 2012
• Control and compute nodes are virtual; all run SQL Server 2012
• Control node spreads data and workload across compute nodes
• Data loads run in parallel and take advantage of the power of all nodes

Fast Infiniband interconnection
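The way a control node can spread rows across compute nodes and combine partial results can be sketched as follows. This is a simplified illustration, not the appliance's actual algorithm: the node count, the CRC32 hash, and all function names are assumptions made for the example.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_COMPUTE_NODES = 4  # hypothetical appliance size


def node_for(key, num_nodes=NUM_COMPUTE_NODES):
    """Map a distribution key to a compute node with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key, num_nodes=NUM_COMPUTE_NODES):
    """Control-node step: spread rows across nodes by hashing one column."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key], num_nodes)].append(row)
    return nodes


def local_sum(rows, column):
    """Compute-node step: aggregate only the rows stored locally."""
    return sum(row[column] for row in rows)


def parallel_sum(nodes, column):
    """Run the local step on every node in parallel, then combine the partial sums."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda rows: local_sum(rows, column), nodes.values()))
    return sum(partials)


sales = [{"id": i, "amount": i * 10} for i in range(1, 101)]
nodes = distribute(sales, key="id")
print({node: len(rows) for node, rows in sorted(nodes.items())})
print(parallel_sum(nodes, "amount"))
```

Because each node only touches its own rows (shared nothing), adding nodes adds both storage and processing capacity.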

Scalability: Massively Parallel and Shared nothing

Smallest (0TB) To Largest (5PB)

• Start small with a few Terabyte warehouse

• Add capacity up to 5 Petabytes

Just grow by adding scale units. An SMP system would have needed to be completely reconfigured.

• The Base Unit has approximately 75 TB of usable storage capacity, based on 5:1 compression.
• 3 additional Scale Units can fit into 1 rack, for up to 300 TB of usable storage.
• Multiple racks can be configured for more usable storage.
• The 1 TB drives can be replaced with 2 TB or 3 TB drives, for double or triple capacity. However, multiple Scale Units will provide better performance than one Base Unit with larger hard drives. For example, 3 Scale Units with 1 TB drives will perform much better than 1 Base Unit with 3 TB drives.
• Backup Node and Landing Zone (ETL storage) are not included. Customers can order their own backup hardware and install it themselves.
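The capacity figures above can be checked with a little arithmetic. The sketch below only restates the slide's numbers (75 TB usable per unit with 1 TB drives, 2 compute nodes per unit); the function names are made up for the example, and linear scaling with drive size is the slide's own claim.

```python
BASE_UNIT_USABLE_TB = 75    # from the slide: 1 TB drives, 5:1 compression
COMPUTE_NODES_PER_UNIT = 2  # from the hardware list: 2 compute nodes per unit


def usable_capacity_tb(units, drive_size_tb=1):
    """Approximate usable capacity; larger drives scale it linearly."""
    return units * BASE_UNIT_USABLE_TB * drive_size_tb


def compute_nodes(units):
    """Number of compute nodes working on queries in parallel."""
    return units * COMPUTE_NODES_PER_UNIT


# One full rack: Base Unit plus 3 Scale Units with 1 TB drives.
print(usable_capacity_tb(4), compute_nodes(4))

# Same capacity, different performance: 3 units with 1 TB drives
# versus 1 Base Unit with 3 TB drives.
print(usable_capacity_tb(3, 1), compute_nodes(3))
print(usable_capacity_tb(1, 3), compute_nodes(1))
```

Both of the last two configurations hold about 225 TB, but the first spreads the work over 6 compute nodes instead of 2, which is the reason the slide recommends scale units over larger drives.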

2 InfiniBand FDR 36 Port Switches

2 Ethernet Switches 5120-24 G

Control Node DL360p

Failover Node DL360p

Base Unit for 2 nodes: 2 ProLiant DL360p Compute Nodes, Storage Block (D6000), 70 drives

1st Scale Unit for 4 nodes: 2 ProLiant DL360p Compute Nodes, Storage Block (D6000), 70 drives

2nd Scale Unit for 6 nodes: 2 ProLiant DL360p Compute Nodes, Storage Block (D6000), 70 drives

3rd Scale Unit for 8 nodes: 2 ProLiant DL360p Compute Nodes, Storage Block (D6000), 70 drives

For customer use

Software

• Windows Server 2012: Control Node, Mgmt. Node and Compute Nodes run in a virtualized environment
• System Center 2012: single user i/f for management of PDW, OS, BI, custom apps and private cloud
• SQL Server 2012 inside: Visual Studio Data Tools, Power View directly on PDW
• xVelocity: in-memory execution, clustered columnstore
• Big Data Integration: Polybase (T-SQL query to Hadoop, external tables on Hadoop)
• Workload Management: workload classes

A multi-region/workload appliance

Microsoft

Distributed, scalable system on commodity hardware composed of:

• HDFS—distributed file system

• MapReduce—programming model

• Others: HBase, R, Pig, Hive, Flume, Mahout, Avro, Zookeeper

[Diagram labels: HBase (column DB); Hive; Mahout; Oozie; Sqoop; HBase/Cassandra/Couch/MongoDB; Avro; Zookeeper; Pig; Flume; Cascading; R; Ambari; HCatalog]

Hadoop = MapReduce + HDFS
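The MapReduce programming model can be illustrated with the classic word count. This is a single-process sketch of the three phases (map, shuffle, reduce), not Hadoop's API; the function names are chosen for the example only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Chain the three phases over a list of input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big Hadoop data"]))
```

In a real cluster the map and reduce calls run on many nodes against blocks stored in HDFS; the structure of the computation stays the same.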

What is Hadoop?

Control Node, Failover Node, Hadoop Head Node, Hadoop redundant Head Node

PDW region

PDW scale unit

Hadoop region

Hadoop region

For customer use

APS: Parallel Data Warehouse and HDInsight region

Configurable:
• Minimum 1 PDW region
• Additional PDW scale units
• Additional HDI scale units

HDI region overview

• In a nutshell, it’s an HDInsight (HDI) instance running on an appliance.
• HDInsight is the Microsoft-branded Hortonworks distribution.
• An integrated appliance for running a PDW region and an HDI region
• PDW is offered as a stand-alone workload on the appliance.
• HDI is offered only as an add-on to PDW, as a scale unit.
• Based on V2 hardware.
• H/A for the Head Node is provided via Windows Failover Clustering (WFC); Data Node H/A is provided via HDFS/MapReduce mechanisms.
• Security add-ons address security issues that are not covered by standard Hadoop.
• Support for multiple user accounts

Single T-SQL query model for PDW and Hadoop with rich features of T-SQL including joins without ETL

Leverages the power of MPP to enhance query execution performance

Supports Windows Azure HDInsight to enable new hybrid cloud scenarios

Query non-Microsoft Hadoop distributions such as Hortonworks and Cloudera

Query Hadoop data with T-SQL using PolyBase

Bringing the worlds of big data and the data warehouse together for users and IT

SQL Server Parallel Data Warehouse

Cloudera

Hortonworks (Windows, Linux)

Windows Azure HDInsight

PolyBase

Microsoft HDInsight

Select… Result set

Big data insights for any user

Native Microsoft BI integration to create new insights with familiar tools

No IT intervention required

Analyze PDW and Hadoop data in the same view

Allow any users to create new insights with familiar tools

Leverages high adoption of Excel, Power View, Power Pivot, and SSAS

Power Users

Data Scientists

Everyone else using Microsoft BI tools

Differentiation: Freedom of deployment options and hybrid solutions

APS Management Console 1PDW and Appliance

Polybase Use Case Category 1 – Integration with external Hadoop clusters

Listening to SQL customers – ShinSeGae

Investing in an online shopping website (‘Korea’s Amazon’)

o SQL Server 2012 PDW & HDP 1.3/HDP 2.0 on Linux

What they want

1. ‘We want to perform complex data mining on customer purchase data – basket analysis’.

2. ‘We want to understand the social media data (reviews/Twitter) – specifically around our products & stores’.

3. ‘We will use Hadoop to keep all of our data ~ envisioned to be around 480 TB. PDW will be the efficient analysis engine for the hot data’.

4. ‘PDW & Polybase are much faster than Hive’.

5. ‘We’re interested in using data mining cloud services in Azure (hybrid scenarios)’.

Microsoft NDA - Material

Listening to SQL customers – TeleCom

‘Understanding network quality’

o SQL Server 2012 PDW & Cloudera 4.5 on Linux

What they want

1. ‘We collect millions of network records for quality assessment and capacity planning – on a daily basis’.

2. ‘Hadoop will be used for storage and ETL of these network record files’.

3. ‘PDW for more operational analysis, ad-hoc analysis, operational reports’.

4. ‘We are using Polybase along with Oozie-based orchestration for a seamless & automated integration’.

Polybase for integrating with various Hadoop distributions

• Support of Hortonworks’ HDP 1.x & 2.x (Windows Server and Linux)

• Support of Cloudera’s CDH 4.x (on Linux)

Push-down computation w/ AU1 release

• Pushing computation where data resides (Hadoop as query execution & processing aid)

• Transparent for users – no need to learn map/reduce

• Seamless query experience through external tables + simplified & parallelized ETL through T-SQL (CTAS for import & CETAS for export)
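The benefit of pushing computation to where the data resides can be sketched by counting how many rows have to cross the network. This is a toy model with hypothetical function names, not how Polybase decides on or performs push-down.

```python
def scan_hadoop(rows, predicate=None):
    """Stand-in for reading an external table; a predicate here runs on the Hadoop side."""
    return [row for row in rows if predicate is None or predicate(row)]


def query_without_pushdown(rows, predicate):
    """Transfer every row, then filter inside the warehouse."""
    transferred = scan_hadoop(rows)
    result = [row for row in transferred if predicate(row)]
    return result, len(transferred)


def query_with_pushdown(rows, predicate):
    """Filter at the source so that only matching rows are transferred."""
    transferred = scan_hadoop(rows, predicate)
    return transferred, len(transferred)


# Illustrative click data: 1 row in 100 hits the URL of interest.
clicks = [{"url": "www.microsoft.com" if i % 100 == 0 else "other", "id": i}
          for i in range(10_000)]
wanted = lambda row: row["url"] == "www.microsoft.com"

plain_result, plain_moved = query_without_pushdown(clicks, wanted)
pushed_result, pushed_moved = query_with_pushdown(clicks, wanted)
print(plain_moved, pushed_moved)
```

Both strategies return the same rows; the difference is only in how much data moves, which is why push-down stays transparent to the user writing the query.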

Integration with 3rd-party tools and the Microsoft insights/BI layer

• Existing applications simply work

• External tables populated through application layer like regular tables

SQL Server Security Model

• ‘You decide who sees what type of data’

• SQL Server permission model adapted for each Polybase object: external table, data source, and file format

[Diagram labels: Microsoft APS Polybase; APS control & data nodes; Your Apps; PowerPivot & PowerView; Social Apps; Sensor & RFID; Mobile Apps; Web Apps; Polybase/APS query engine; External Table; External Data Source; External File Format]

Solution Architecture – Integration with external Hadoop cluster (1)

Querying Hadoop data:

SELECT user_name FROM ClickStream cs, PDW_User u WHERE cs.user_IP = u.user_IP AND cs.url = 'www.microsoft.com';

Creating external table, data source, file format:

CREATE EXTERNAL DATA SOURCE HDP2.0 WITH (TYPE = Hadoop, LOCATION = 'hdfs://HDP:8020', JOB_TRACKER_LOCATION = 'HDP:50300');

CREATE EXTERNAL FILE FORMAT MyRCFile WITH (FORMAT_TYPE = 'RCFile', SERDE_METHOD = 'LazyBinarySerDe');

CREATE EXTERNAL TABLE Clickstream (url varchar(50), event_date date) WITH (DATA_SOURCE = HDP2.0, LOCATION = '/employees/employee.txt', FILE_FORMAT = MyRCFile);

Persistently exporting & importing:

CREATE EXTERNAL TABLE Web_Sales WITH (LOCATION = '/TPCDS/web_sales/', DATA_SOURCE = HDP2.0, FILE_FORMAT = MyRCFile) AS SELECT u.* FROM PDW_User u;

CREATE TABLE PDW_Sales WITH (DISTRIBUTION = HASH(id)) AS SELECT * FROM Web_Sales;

T-SQL Examples – Integration with external Hadoop cluster (2)

[Diagram labels: Online Shopping Mall SSG.com (renewal); Tracking Log Servers; Message Queues (Kafka, open source); Complex Event Processing (Storm); HDP 1.3 on Linux (5-10 servers); raw/cold data; Polybase queries; 10 GB Ethernet; APS/PDW; EDW; Operational Data Store; Recent/hot data stored in PDW; EIS; OLAP (Tabular); Data Mining; Visualization (Silverlight); Campaign; Recommendation engine & personalized advertising; Analytic information (right customer targeting); BI analyst]

Hadoop data exposed through external Polybase tables:

1. Web log data (160 GB daily): external Polybase tables A, B, C
2. Unstructured/semi-structured text data (board/SNS/internal text, weather, ...): external Polybase tables D, E, F
3. Company emails: external Polybase tables G, H, I

Solution Architecture (Details) – ShinSeGae

[Diagram labels: Cloudera’s CDH 4 on HP (18+ servers); raw/cold data (petabyte of network logs); High-frequency event processing (network logs); Capturing network logs (>300 GB per day): external Polybase tables A, B, C; Oozie workflows; Remote procedure calls via stored procedures to trigger Polybase queries; HCatalog; Usage of Hive’s metadata stores; Infiniband; Polybase queries; APS/PDW; EDW; Operational Data Store; Hot operational PDW data; Network quality analysis; Capacity planning; Visualization (PowerPivot); BI analyst/Planner/Decision-maker]

Solution Architecture (Details) – Telecom

Polybase Use Case Category 2 –Integration with Microsoft Azure

Listening to SQL customers (5) – Government

‘Bridging the gap between cloud & on-prem’

o Current POC - SQL Server 2012 PDW & HDInsight Azure

What they want

1. ‘HDInsight/Hadoop in the cloud to store and massage our raw data (XML files) generated by our web-application’.

2. ‘PDW to keep the data on-prem (legal requirement) and to have an efficient query engine for analysis purposes’.

3. ‘Polybase is a great way of accessing our files in the cloud via simple T-SQL’.

4. ‘With this solution, we can allow web users to quickly ask questions while the heavy, more complex business analysis is accomplished by PDW users’.

[Diagram labels: On-premises or “private cloud”; Microsoft Azure; Azure Storage; Azure HDInsight; Microsoft APS Polybase; APS control & data nodes; Your Apps; Microsoft or 3rd party applications; Public Internet; Azure Express Route]

Polybase as key integrative feature
• Integration with external Hadoop, HDInsight region & Azure Storage

Data aging strategies
• Aging of cold data to Azure Storage
• APS & HDInsight region for hot & warm data

Query hot data & cold aged data
• APS as modern cloud end-point for Azure
• Seamless querying of hot & cold data through APS
• APS as gateway allowing users to query all on-prem data via PowerBI and

T-SQL examples

CREATE EXTERNAL DATA SOURCE WASB WITH (TYPE = Hadoop, LOCATION = 'wasbs://[email protected]');

CREATE EXTERNAL TABLE clickstream_HDInsights (url varchar(50), event_date date) WITH (DATA_SOURCE = WASB, LOCATION = '/input/log1.txt', FILE_FORMAT = MyDelimitedText);

SELECT * FROM clickstream_HDInsights, PDW_Table

Solution Architecture – Hybrid Scenarios

[Diagram labels: Web Application for Tax Filing (e-invoice); Other Web Feeds; Web apps generating tons of smaller XML files (~7 KB each); Azure Blob Storage; HDI on Azure; cheap data store, alternative to Hadoop on-prem solution; HDI tools for data transformation; Transforming to large text files of ~10 GB each (external WASB tables); Public Internet or Azure Express Route; Polybase queries; APS/PDW; EDW; Operational Data Store; PDW/APS for fast query response & data processing of hot data; Microsoft BI stack; IBM Cognos]

Solution Architecture (Details) – Government

Polybase Use Case Category 3 –Unified Appliance with PDW and HDInsight region

Listening to SQL customers (6) – Beverage & Vending Machines

‘What are you drinking? Why is the machine down?’

o POC - SQL Server/APS with PDW & HDI region

What they want

1. ‘We want a complete solution stack – we do not have Hadoop experts in-house and don’t have the money to get them’.

2. ‘We want to store all raw data coming from vending machines into Hadoop’.

3. ‘360-degree view of all our data – structured customer data & unstructured data coming from vending machines’.

4. ‘Predictive maintenance of machines’.

[Diagram labels: Unified Microsoft APS with PDW & HDI region; APS control & data nodes; HDI name & data nodes; Your Apps; PowerPivot & PowerView; Social Apps; Sensor & RFID; Mobile Apps; Web Apps; Distributed & replicated table; External Table]

Unified appliance
• Multi-workload support with PDW and HDInsight region
• HDInsight powered by HDP bits
• No need to deal with multiple support teams (‘better together’)

Seamless & performant query experience through Polybase
• External tables can be used for HDI data
• PDW data nodes connected via high-speed network (Infiniband) to Hadoop data nodes

Simplified management & monitoring
• One consistent monitoring experience through appliance management tools

T-SQL examples

CREATE EXTERNAL DATA SOURCE HDI_R WITH (TYPE = Hadoop, LOCATION = 'hdfs://HTUKIA-C-HHN01:8020', JOB_TRACKER_LOCATION = 'HTUKIA-C-HHN01:50300');

CREATE EXTERNAL TABLE HDI_Region (url varchar(50), event_date date) WITH (DATA_SOURCE = HDI_R, LOCATION = '/input/log1.txt', FILE_FORMAT = MyDelimitedText);

SELECT * FROM HDI_Region, PDW_Table

Solution Architecture – Unified APS appliance

[Diagram labels: msn.com log files; Microsoft servers log files; Analyzing ~3 TB web traffic; APS with PDW & HDI region; Full Rack PDW; PDW region; HDI region; 1 scale unit HDI region; Infiniband; Polybase queries; Secure Gateway & AD Integration; System Center & Admin Console; PowerQuery/PowerView/PowerMap; Hive & PowerQuery via Hive ODBC; Analytical queries via SSDT]

Data scientist group 1: using chaining of Hive queries & PowerQuery via Hive ODBC

Data scientist group 2: using Polybase for existing tooling (T-SQL, BI tools), performing processing of complex analytical queries & consistent management experience

Solution Architecture (Details) – Internal Microsoft Data Scientist

Microsoft Digital Crime Unit

Part of Microsoft LCA (Legal and Corporate Affairs), mandated to help protect the Internet

DCU’s Challenge:

To effectively combat digital crime requires the collection of huge amounts of data from multiple sources.

DCU needs to be able to:

• Process 10s of TBs daily and house PBs of data historically (accessible as needed)
• House 100s of terabytes from multiple sources that are easily queryable
• Use leading edge business intelligence and visualization tools

DCU Big Data Solution

[Diagram labels: Data Sources (sinkholes, passive DNS, files, 3rd party security info, …); Azure; SQL Azure; Hadoop 30 Node Cluster on Windows; 500 TB SAN Storage; MSFT SQL StreamInsight; SSIS; HP EDW Appliance (MSFT PDW); HP Business Decision Appliance (SharePoint, SSRS, SSAS, PowerView, PowerPoint); SSRS; PowerView; Excel with PowerPivot; Embedded BI; Predictive Analytics; DCU Investigators and Analysts; Corporate Security Officers]

Microsoft Digital Crime Unit

[Diagram labels: Hadoop; SSIS; PDW; SSAS; Microsoft BI; Extract, Load, Transform; Drop; Data Source for BI]

Microsoft Digital Crime Unit (currently being implemented)

– Part of Microsoft LCA (Legal and Corporate Affairs), mandated to help protect the Internet

– To effectively combat digital crime requires the collection of huge amounts of data from multiple sources.

• Process 10s of TBs daily and house PBs of data historically (accessible as needed)
• House 100s of terabytes from multiple sources that are easily queryable
• Use leading edge business intelligence and visualization tools

– 30 Node Hadoop on Windows Server

– Control Rack and 10 Node PDW Data Rack

– HP BDA (Business Decision Appliance) upgraded to SQL 2012

– BI Voyage currently implementing PDW and BI portions of the project.

Why 2 Storage Platforms?

Hadoop | Parallel Data Warehouse
Storage capacity in the petabytes | Storage capacity in the 100s of terabytes
Simplified load, just drop unstructured or semi-structured files | ETL process more complex to transform data into reporting-optimized DB structures
No optimization of queries | Structures can be optimized for common query patterns
Queried by IT professionals | Queried by business analysts
Complex and slow to query multiple sources at once | Optimized for fast queries against key data from multiple sources

Hadoop is DCU’s Centralized Data Warehouse. Simple load and high capacity make it optimal for storing huge volumes of data.

PDW is DCU’s Data Mart platform. Easily accessible, intuitive data structures, and blazing fast for querying data.

APS Differentiators

• Part of a product family: from SQL Server standalone to cloud service offerings

• TCO: very low, especially when looking at the whole bundle: ETL (SSIS), PDW, data marts (SQL Server) and analytics (SSAS, SSRS)

• Appliance: Much lower effort for DBAs

• Microsoft product stack integration – SSIS, SSAS, SSRS, PowerPivot, System Center, integration with Cloud services

• Linear Scaling via Shared Nothing

• xVelocity: Column Store and In-Memory execution

• Polybase: Integration with Big Data and Hadoop

• HDInsight integrated: fast Infiniband interconnect, management and security

“Microsoft exhibits one of the best value propositions on the market with a low cost and a highly favorable price/performance ratio”- Gartner, February 2012

• Store data in columnar format for massive compression

• Load data into or out of memory for next-generation performance

• Updateable and clustered for real-time trickle loading
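Why a columnar layout compresses well can be shown with run-length encoding, a generic technique used here purely for illustration. The slides do not describe which encodings xVelocity actually applies, so the function names and the choice of encoding below are assumptions.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse each run of equal neighbouring values into (value, run length)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run length) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]


# Illustrative rows: in the column layout, equal values sit next to each other.
rows = [
    {"country": "DE", "amount": 10},
    {"country": "DE", "amount": 12},
    {"country": "DE", "amount": 9},
    {"country": "US", "amount": 20},
    {"country": "US", "amount": 21},
]
columns = to_columns(rows)
encoded = run_length_encode(columns["country"])
print(encoded)
```

Stored row by row, the country values would be interleaved with amounts and could not be collapsed this way; stored as a column, five values shrink to two runs.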

Up to 100x faster queries

Updatable clustered columnstore vs. table with customary indexing

Up to 15x more compression

Columnstore

Parallel query execution

Query

Results

BI Tools

SSRS / SSAS

SQL Server SMP

Concurrency that fuels rapid adoption

Great performance with mixed workloads

Analytics Platform System

ETL/ELT with SSIS, DQS, MDS

ERP CRM LOB APPS

ETL/ELT with DWLoader

Hadoop / Big Data

PDW

HDInsight

PolyBase

Ad hoc queries

MEC, a global media agency, uses SQL Server PDW with in-memory technology to cut query time—helping marketers unlock the value of their data.

SQL Server Analytics Platform System gives us massively parallel advantages. Whereas it would take up to four hours to run queries scaling across multiple nodes, now it takes just minutes.

Lower energy costs and usage

Reduce tuning efforts while retaining high performance

Simplify management with built in System Center

Reduce the data center footprint

Value through a single flexible appliance solution Why Analytics Platform System when I have SQL Server?

Accelerate time to value and insights with no forklift required for scaling out

Single appliance solution

PDW

HDInsight

PolyBase

Value through a single flexible appliance solution Why Analytics Platform System when I have SQL Server?

Your choice of hardware

Co-engineered with HP, Dell and Quanta best practices

Leading performance with commodity hardware

Pre-configured, built, tuned software and hardware

Integrated support plan with a single Microsoft contact

PDW

HDInsight

PolyBase

CROSSMARK needed faster and more detailed insight into terabytes of information about product supply and demand. They deployed a turnkey business intelligence solution from Microsoft and HP that is based on the Microsoft SQL Server Parallel Data Warehouse.

People can instantly create their own reports with SQL Server Power View and PowerPivot for Excel and … they can build those reports 50 percent to many times faster compared with the previous system.