Modern Data Warehousing

Modern Data Warehousing

Insights on Any Data of Any Size

James Serra, MicrosoftPDW Technology Solution [email protected]

About Me Business Intelligence Consultant, in IT for 28 years Microsoft, PDW Technology Solution Professional (TSP) Owner of Serra Consulting Services, specializing in end-to-

end Business Intelligence and Data Warehouse solutions using the Microsoft BI stack

Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW developer

Been perm, contractor, consultant, business owner Presenter at PASS Business Analytics Conference and PASS

Summit MCSE for SQL Server 2012: Data Platform and BI SME for SQL Server 2012 certs Contributing writer for SQL Server Pro magazine Blog at JamesSerra.com SQL Server MVP Author of book “Reporting with Microsoft SQL Server 2012”

Agenda• Traditional data warehouse & modern data warehouse• APS architecture• Hadoop & PolyBase• Performance and scale• Appliance benefits• Summarize/questions

The traditional data warehouse

4

… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.

– Gartner, “The State of Data Warehousing in 2012”

Data sources

OLTP ERP CRM LOB

ETL

Data warehouse

BI and analytics

Will your current solution handle future needs?

The traditional data warehouse

5

Data sources

OLTP ERP CRM LOB

ETL

Data warehouse

BI and analytics

Increasing data volumes

1

Real-time performance

2

Non-Relational Data

Devices

Web Sensors

Social

New data sources & types

3

Cloud-born data

4

INFRASTRUCTURE

DATA MANAGEMENT & PROCESSING

DATA ENRICHMENT AND FEDERATED QUERY

BI & ANALYTICS

Self-service CollaborationCorporate PredictiveMobile

Extract, transform, load

Single query model Data quality Master data

management

Non-relationalRelational Analytical Streaming Internal & External

Data sources

OLTP ERP CRM LOB

Non-relational data

Devices

Web Sensors

Social

The modern data warehouse

Keep legacy investment

Buy new tier one hardware appliance

Acquire big data solution (Hadoop)

Acquire business intelligence solution

Roadblocks to evolving to a modern data warehouse

Limited

scalability & ability

to handle new data

types

Significant

training & still

siloed

High acquisition/

migration

costs & no

Hadoop

Complex with

low adoption

Solution and issue with that solution

Introducing the Microsoft Analytics Platform SystemYour turnkey modern data warehouse appliance

Next-generation performance at scale

Enterprise-ready big data

Engineered foroptimal value

• Relational and non-relational data in a single appliance

• Enterprise-ready Hadoop

• Integrated querying across Hadoop and APS using T-SQL

• Direct integration with Microsoft BI tools such as Power BI

• Near real-time performance with In-Memory

• Scale-out to accommodate your growing data

• Remove DW bottlenecks with MPP SQL Server

• Concurrency that fuels rapid adoption

• Industry’s lowest DW price/TB

• Value through a single appliance solution

• Value with flexible hardware options using commodity hardware

• Free up space on SAN

Hardware and software engineered togetherThe ease of an appliance

Co-engineered with HP, Dell, and Quanta best practices

Leading performance with commodity hardware

Pre-configured, built, and tuned software and hardware

Integrated support plan with a single Microsoft contact

PDW

HDInsight

PolyBase

Social and web analytics

Live data feeds

Advanced analytics

APS Architecture

Microsoft Analytics Platform System (APS), formally called by its code name “Project Madison”, was released in December 2010 (version 1). PDW is Microsoft’s reworking of the DatAllegro Inc. massive parallel processing (MPP) product started in 2003 and that Microsoft acquired in September 2008. Version 2 of PDW was made available in March, 2013. It was renamed from SQL Server Parallel Data Warehouse (PDW) to Analytics Platform System (APS) in April 2014 (it still includes the PDW region as well as a new HDInsights/Hadoop region).

Polybase was introduced with version 2 of PDW and has new features in PDW v2 AU1 (April 2014).

Case studies: http://www.microsoft.com/casestudies/Case_Study_Search_Results.aspx?Type=1&Keywords=%22Parallel%20Data%20Warehouse%22&LangID=46

http://www.microsoft.com/casestudies/Case_Study_Search_Results.aspx?Type=1&Keywords=%22Parallel%20Data%20Warehouse%22&LangID=46

http://www.microsoft.com/casestudies/Case_Study_Search_Results.aspx?Type=1&Keywords=%22Parallel%20Data%20Warehouse%22&LangID=46

APS Logical Architecture (overview)

“Compute” nodeBalanced storage

SQL


SQL


SQL


SQL

DMS

DMS

DMS

DMS

Compute Node – the “worker bee” of APS• Runs SQL Server 2012 APS• Contains a “slice” of each database

Control Node – the “brains” of the APS• Also runs SQL Server 2012 APS• Holds a “shell” copy of each database

• Metadata, statistics, etc• The “public face” of the appliance

Data Movement Services (DMS)• Part of the “secret sauce” of APS• Moves data around as needed• Enables parallel operations among the

compute nodes (queries, loads, etc)

“Control” nodeSQL

DMS

APS Logical Architecture (querying)


SQL“Control” nodeSQL


SQL


SQL


SQL

DMS

DMS

DMS

DMS

DMS

1) User connects to the appliance (control node) and submits query

2) Control node query processor determines best *parallel* query plan

3) APS distributes sub-queries to each compute node

4) Each compute node executes query on its subset of data

5) Each compute node returns a subset of the response to the control node

6) If necessary, control node does any final aggregation/computation

7) Control node returns results to user

APS Data Layout Options“Compute” node Balanced

storageSQL

Balanced storage

Balanced storage

Balanced storage

“Compute” nodeSQL



DMS

DMS

DMS

DMS

Time DimDate Dim IDCalendar YearCalendar QtrCalendar MoCalendar Day

Store Dim

Store Dim IDStore NameStore MgrStore Size

Product Dim

Prod Dim IDProd CategoryProd Sub CatProd Desc

Customer Dim

Cust Dim IDCust NameCust AddrCust PhoneCust Email

Sales Fact

Date Dim IDStore Dim IDProd Dim IDCust Dim IDQty SoldDollars Sold

TD

PD

SD

CD

TD

PD

SD

CD

TD

PD

SD

CD

TD

PD

SD

CD

Sale

s Fa

ct

Replicated

Table copied to each compute node

Distributed

Table spread across compute nodes based on “hash”

Star Schema

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

FactSales_A

FactSales_B

FactSales_C

FactSales_D

FactSales_E

FactSales_F

FactSales_G

FactSales_H

DATA DISTRIBUTION CREATE TABLE FactSales

(ProductKey INT NOT NULL ,OrderDateKey INT NOT NULL ,DueDateKey INT NOT NULL ,ShipDateKey INT NOT NULL ,ResellerKey INT NOT NULL ,EmployeeKey INT NOT NULL ,PromotionKey INT NOT NULL ,CurrencyKey INT NOT NULL ,SalesTerritoryKey INT NOT NULL ,SalesOrderNumber VARCHAR(20) NOT NULL,

) WITH (

DISTRIBUTION = HASH(ProductKey),

CLUSTERED INDEX(OrderDateKey) ,

PARTITION(OrderDateKey RANGE RIGHT FOR VALUES

( 20010601, 20010901,

) ) );

Control Node

…Compute Node 1

Compute Node 2

Compute Node X

Send Create Table SQL to each compute nodeCreate Table FactSales_ACreate Table FactSales_BCreate Table FactSales_C……Create Table FactSales_H

FactSalesA

FactSalesB

FactSalesC

FactSalesD

FactSalesE

FactSalesF

FactSalesG

FactSalesH

FactSalesA

FactSalesB

FactSalesC

FactSalesD

FactSalesE

FactSalesF

FactSalesG

FactSalesH

FactSalesA

FactSale B

FactSalesC

FactSalesD

FactSalesE

FactSalesF

FactSalesG

FactSalesH

Create table metadata on Control Node

APS – Balanced across servers and within

15

Largest Table 600,000,000,000

Randomly distributed across 40 compute nodes (5 racks)

15,000,000,000

In each server randomly distributed to 8 tables 1,875,000,000

Each partition – 2 years data partitioned by week 18,028,846

As an end user or DBA you think about 1 table: LineItem.You run “select * from LineItem”

APS is an appliance, simple to use!You don’t care or need to know that there are actually 320 tables representing your 1 logical table.

InfinibandInfinibandEthernetEthernet

Control NodeFailover Node

Microsoft Storage Spaces 1

Compute Node 1Compute Node 2







CustomerUse

Base Unit (6U):• Redundant Infiniband• Redundant Ethernet• Mgmt & Control (Active)• Rack Failover Node (Passive)

Base Unit (7U):• 2 HP 1U Servers

• (16 Cores/Ea. Total: 32)• Microsoft Storage Spaces 5U

• 1TB Drives• User Data Capacity: 75TB

Scale Unit (7U):• 2 HP 1U Servers



¼ R

ack

15

TB

(R

aw

)

1/2

Rack

30

TB

(Raw

)

Customer Space (8U)• ETL Servers• Backup Servers• Passive Unit (Additional spares)







Full R

ack

60

TB

(Raw

)


Failover Node









CustomerUse

Extension Base Unit (5U):• Redundant Infiniband• Redundant Ethernet• Rack Failover Node (Passive)

Extension Base Unit (7U):• 2 HP 1U Servers






1¼

R

ack

75

.5TB

(R

aw

)








3 R

ack

18

1.2

TB

(Un

com

pre

ssed)

1 1

/2 R

ack

90

.6TB

(Raw

)2 R

ack

12

0.8

TB

(Raw

)


Failover Node









CustomerUse

Extension Base Unit (5U):• Redundant Infiniband• Redundant Ethernet• Rack Failover Node (Passive)

Extension Base Unit (7U):• 2 HP 1U Servers













HP Configuration

• 2 – 56 compute nodes (32-896 cores)

• 1 – 7 racks

• 1, 2, or 3 TB drives

• 15TB – 1.2PB uncompressed

• 75TB – 6PB User data (5:1)

• Up to 7 spare nodes available across the entire appliance

• Dual Infiband: 56Gbps

Details




Microsoft Analytics Platform SystemYour turnkey modern data warehouse appliance

Megabytes

What is big data and why is it valuable to the business A evolution in the nature and use of data in the enterprise

Data complexity: variety and velocity

Peta

byte

s/Volu

me

Historical analysis

Insight analysis

Predictive analytics

Predictive forecasting

Valu

e t

o t

he b

usi

ness

What is Hadoop?

Microsoft Confidential

19

Distributed, scalable system on commodity HW

Composed of a few parts:

HDFS – Distributed file system

MapReduce – Programming model

Others: HBase, R, Pig, Hive, Flume, Mahout, Avro, Zookeeper

Core Services

OPERATIONAL SERVICES

DATASERVICES

HDFS

SQOOP

FLUME

NFS

LOAD & EXTRACT

WebHDFS

OOZIE

AMBARI

YARN

MAP REDUCE

HIVE &HCATALOGPIG

HBASEFALCON

Hadoop Cluster

compute&

storage . . .

. . .

. .compute

&storage

.

.

Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware

Move HDFS into the warehouse before analysis

HDFS (Hadoop)

ETL

WarehouseHDFS (Hadoop)

Learn new skills

TSQL

Build Integrate ManageMaintainSupport

Complex query and analysis with big data todaySteep learning curve, slow and inefficient

Hadoop ecosystem

“New” data sources

Devices

Web Sensor Social

“New” data sources“New” data sources

Devices

Web Sensor Social

APS delivers enterprise-ready Hadoop with HDInsightManageable, secured and highly available Hadoop integrated into the appliance

High performance tuned within the appliance

End-user authentication with Active Directory

Accessible insights for everyone with Microsoft BI tools

Managed and monitored using System Center

100% Apache Hadoop

SQL ServerParallel DataWarehouse

Microsoft HDInsight

PolyBase

Leverage your existing TSQL skills

Parallel Data Warehouse workload

HDInsight workload

Fabric

Hardware

Applia

nce

A region is a logical container within an appliance

Each workload contains the following boundaries:• Security

• Metering

• Servicing

APS appliance overview

Select… Result set Provides a single T-SQL query model (“semantic layer”) for APS and Hadoop with rich features of T-SQL, including joins without ETL

Uses the power of MPP to enhance query execution performance

Supports Windows Azure HDInsight to enable new hybrid cloud scenarios

Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera

Use existing SQL skillset, no IT intervention

Query Hadoop data with T-SQL using PolyBaseBringing the worlds or big data and the data warehouse together for users and IT

SQL ServerParallel DataWarehouse

Cloudera CHD Linux 4.3

Hortonworks HDP 2.0 (Windows, Linux)

Windows AzureHDInsight

PolyBase

Microsoft HDInsightHDP 1.3

Query re la t i ona l + non re la t i ona l

Others? Federated querying

AU1: Windows Azure storage blob (WASB)

Use cases where PolyBase simplifies using Hadoop dataBringing islands of Hadoop data together

High performance queries against Hadoop data(Predicate pushdown)

Archiving data warehouse data to Hadoop (move)(Hadoop as cold storage)

Exporting relational data to Hadoop (copy)(Hadoop as backup/DR, analysis,

cloud use)Importing Hadoop data into data warehouse (copy)

(Hadoop as staging area)

Big data insights for anyoneNative Microsoft BI integration to create new insights with familiar tools

Tools like Power BI minimize ITintervention for discovering dataT-SQL for DBA and power users to join relational and Hadoop data

Hadoop tools like map-reduce, Hive and Pig for data scientists

Leverages high adoptionof Excel, Power View, Power Pivot, and SSAS

Power Users

Data Scientist

Everyone else using Microsoft BI tools

Performance limitations and scale with a traditional data warehouse

Diminishing scale as requirements grow

Scale up Rowstore

Sub-optimal performance for many data warehouse queries

Data

Page 1 Page 2 Page 3

Querying data by row

C1 C2 C3 C4

R1 R1 R1 R1

R2 R2 R2 R2

R3 R3 R3 R3

R4 R4 R4 R4

R5 R5 R5 R5

R6 R6 R6 R6

Forklift

Forklift

Scale-out Massively Parallel Processing (MPP) parallelizes queries (speed-driven not just capacity-driven)

Multiple nodes with dedicated CPU, memory, storage “shared-nothing”

Incrementally add HW for near-linear scale to multi-PB (no need to delete older data, stage)

Handles query complexity and concurrency at scale

No “forklift” of prior warehouse to increase capacity

Start small with a few terabyte warehouse

Query while you load (250GB/hour per node)

Scaling out relational data to petabytesScale-out technologies in the Analytics Platform System

28

PDW

0TB 6PB

PDW/ HDInsight

PDW/ HDInsight

PDW/ HDInsight

PDW/ HDInsight

PDW/ HDInsight

PDW/ HDInsight

Blazing fast performanceMPP and In-memory columnstore for next-generation performance

• Store data in columnar format for massive compression

• Load data into or out of memory for next-generation performance

• Updateable and clustered for real-time trickle loading

• No secondary indexes required

29

Up to 100x faster queries

Updatable clustered columnstore vs. table with customary indexing

Up to 15xmore compression

Columnstore index representation

C1

C3

C5

C4

C2

C6

Parallel query execution

Query

Results

PDW MPP vs. SQL Server SMP

• 2B row fact sales table

• Nine different queries including

• simple counts• Sum/min/max with

group-bys• Multiple inner joins

with 3-5 dimension tables

• Multiple sub-queries across the big fact table

BI Tools

Reporting and cubes

SQL Server SMP

Concurrency that fuels rapid adoptionGreat performance with mixed workloads

Analytics Platform System

ETL/ELT with SSIS, DQS, MDS

ERP CRM LOB APPS

ETL/ELT with DWLoader

Hadoop / Big Data

PDW

HDInsight

PolyBase

Ad hoc queries

Intra-Day

Near real-time

Fast ad hoc

Columnstore

Polybase

CRTAS

Link Table

Real-Time

ROLAP / MOLAP DirectQuery

SNAC

APS provides the industry’s lowest DW appliance price/TBReshaped hardware specs through software innovation Price per terabyte for leading vendors Significantly lower

price per TB than the closest competitor

Price per TB User-Available Storage (Compressed)

NOTE: Orange line indicates average price per TB.

Thou

sands

Oracle EMC IBM Teradata Microsoft

$30

$25

$20

$15

$10

$5

$0

Lower storage costs with Windows Server 2012 Storage Spaces

Small cost gap between multiple clustered HP DL980's with SAN vs APS 1/4 rack

Virtualized architecture overview

Host 2

Host 1

Host 3

Host 4

Economical disk

storageIB andEthernet

Direct attached SAS

Base UnitCTL

MAD

AD

VMM

Compute 2

Compute 1

• APS engine• DMS Manager• SQL Server 2012 Enterprise Edition (APS build)

Software details

• All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines

• Fabric or workload in Hyper-V Virtual Machines

• Fabric virtual machine, management server (MAD01), and control server (CTL) share one server

• APS agent that runs on all hosts and all virtual machines

• DWConfig and Admin Console

• Windows Storage Spaces and Azure Storage blobs

• Does not require expertise in Hyper-V or Windows

APS High-Availability

X XCompute Host 1

Compute Host 2

XControl Host

Failover Host

Infin

iban

d 1

Ethe

rnet

1

Infin

iban

d 2

Ethe

rnet

2

XXXFAB AD VMM MAD CTL

Compute 2 VM

Compute 1 VMCompute 1 VMIn

finib

and

1

Ethe

rnet

1

• No Single Point-Of-Failure• No need for SQL Server

Clustering

Less DBA Maintenance/Monitoring• No index creation• No deleting/archiving data to save space• Management simplicity (System Center)• No blocking• No logs• No query hints• No wait states• No IO tuning• No query optimization/tuning• No index reorgs/rebuilds• No managing filegroups• No shrinking/expanding databases• No managing physical servers• No patching servers and software

RESULT: DBA spend more of their time as architects and not baby sitters!

The no-compromise modern data warehouse solution Microsoft’s turn-key modern data warehouse appliance Analytics Platform System

Microsoft

• Improved query performance

• Faster data loading• Improved concurrency• Less DBA maintenance• Limited training needed• Use familiar BI tools• Ease of appliance

deployment• Mixed workload

support

• Improved data compression

• Scalability• High availability• PolyBase• Integration with cloud-

born data• HDInsight/Hadoop

integration• Data warehouse

consolidation• Easy support model

Summary of Benefits

Bold = benefits of APS over upgrading to SQL Server 2014

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Questions?

James Serra, MicrosoftPDW Technology Solution ProfessionalEmail me at: [email protected] me at: @JamesSerra Link to me at: www.linkedin.com/in/JamesSerra Visit my blog at: JamesSerra.com

Blog about PDW topics: http://www.jamesserra.com/archive/category/pdw/

Modern Data Warehousing

Data & Analytics

Transcript of Modern Data Warehousing