Modern Data Warehousing

description

The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data. How can you prevent this from happening? Enter the modern data warehouse, which is able to handle and excel with these new trends. It handles all types of data (Hadoop), provides a way to easily interface with all these types of data (PolyBase), and can handle “big data” and provide fast queries. Is there one appliance that can support this modern data warehouse? Yes! It is the Parallel Data Warehouse (PDW) from Microsoft, which is a Massively Parallel Processing (MPP) appliance that has been recently updated (v2 AU1). In this session I will dig into the details of the modern data warehouse and PDW. I will give an overview of the PDW hardware and software architecture, identify what makes PDW different, and demonstrate the increased performance. In addition I will discuss how Hadoop, HDInsight, and PolyBase fit into this new modern data warehouse.

Transcript of Modern Data Warehousing

  • Modern Data Warehousing: Insights on Any Data of Any Size. James Serra, Microsoft PDW Technology Solution Professional. JamesSerra3@gmail.com, JamesSerra.com
  • About Me: Business Intelligence Consultant, in IT for 28 years. Microsoft PDW Technology Solution Professional (TSP). Owner of Serra Consulting Services, specializing in end-to-end Business Intelligence and Data Warehouse solutions using the Microsoft BI stack. Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, and PDW developer. Been perm, contractor, consultant, and business owner. Presenter at PASS Business Analytics Conference and PASS Summit. MCSE for SQL Server 2012: Data Platform and BI. SME for SQL Server 2012 certs. Contributing writer for SQL Server Pro magazine. Blog at JamesSerra.com. SQL Server MVP. Author of the book Reporting with Microsoft SQL Server 2012.
  • Agenda: Traditional data warehouse & modern data warehouse; APS architecture; Hadoop & PolyBase; Performance and scale; Appliance benefits; Summarize/questions
  • Data sources: Will your current solution handle future needs?
  • Data sources: Non-relational data
  • Roadblocks to evolving to a modern data warehouse (each solution and the issue with that solution). Keep legacy investment: limited scalability & ability to handle new data. Buy new tier one hardware appliance: high acquisition/migration costs & no Hadoop. Acquire big data solution (Hadoop): significant training & still siloed. Acquire business intelligence solution: complex with low adoption.
  • Introducing the Microsoft Analytics Platform System: your turnkey modern data warehouse appliance. Relational and non-relational data in a single appliance. Enterprise-ready Hadoop. Integrated querying across Hadoop and APS using T-SQL. Direct integration with Microsoft BI tools such as Power BI. Near real-time performance with In-Memory. Scale-out to accommodate your growing data. Remove DW bottlenecks with MPP SQL Server. Concurrency that fuels rapid adoption. Industry's lowest DW price/TB. Value through a single appliance solution. Value with flexible hardware options using commodity hardware. Free up space on SAN.
  • Hardware and software engineered together: the ease of an appliance. Co-engineered with HP, Dell, and Quanta best practices. Leading performance with commodity hardware. Pre-configured, built, and tuned software and hardware. Integrated support plan with a single Microsoft contact. [Diagram labels: PDW, HDInsight, PolyBase]
  • APS Architecture: Microsoft Analytics Platform System (APS), formerly known by its code name Project Madison, was released in December 2010 (version 1). PDW is Microsoft's reworking of the DatAllegro Inc. massively parallel processing (MPP) product, which was started in 2003 and which Microsoft acquired in September 2008. Version 2 of PDW was made available in March 2013. It was renamed from SQL Server Parallel Data Warehouse (PDW) to Analytics Platform System (APS) in April 2014 (it still includes the PDW region as well as a new HDInsight/Hadoop region). PolyBase was introduced with version 2 of PDW and has new features in PDW v2 AU1 (April 2014). Case studies: http://www.microsoft.com/casestudies/Case_Study_Search_Results.aspx?Type=1&Keywords=%22Parallel%20Data%20Warehouse%22&LangID=46
  • APS Logical Architecture (overview). [Diagram: one control node (SQL, DMS) and four compute nodes, each with SQL, DMS, and balanced storage.] Compute node: the "worker bee" of APS; runs SQL Server 2012 APS; contains a slice of each database. Control node: the "brains" of the APS; also runs SQL Server 2012 APS; holds a "shell" copy of each database (metadata, statistics, etc.); the "public face" of the appliance. Data Movement Services (DMS): part of the "secret sauce" of APS; moves data around as needed; enables parallel operations among the compute nodes (queries, loads, etc.).
  • APS Logical Architecture (querying). [Diagram: one control node (SQL, DMS) and four compute nodes, each with SQL, DMS, and balanced storage.] 1) User connects to the appliance (control node) and submits a query. 2) Control node query processor determines the best *parallel* query plan. 3) APS distributes sub-queries to each compute node. 4) Each compute node executes the query on its subset of data. 5) Each compute node returns a subset of the response to the control node. 6) If necessary, the control node does any final aggregation/computation. 7) Control node returns results to the user.
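The query steps above follow a scatter-gather pattern, which can be sketched over in-memory data. This is a minimal illustration rather than APS code: the four node slices, the sample rows, and the helper names `compute_node_partial` and `control_node_query` are all assumptions made for the example, and threads stand in for separate servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the slice of a sales table
# held by one compute node, as (product_key, dollars_sold) rows.
NODE_SLICES = [
    [(1, 10.0), (2, 5.0)],
    [(1, 2.5), (3, 7.0)],
    [(2, 1.0), (3, 3.0)],
    [(1, 4.0)],
]


def compute_node_partial(rows):
    """Step 4: a compute node aggregates only its own subset of the data."""
    partial = {}
    for product_key, dollars in rows:
        partial[product_key] = partial.get(product_key, 0.0) + dollars
    return partial


def control_node_query(node_slices):
    """Steps 3, 5 and 6: fan out sub-queries, collect partials, merge them."""
    with ThreadPoolExecutor(max_workers=len(node_slices)) as pool:
        partials = list(pool.map(compute_node_partial, node_slices))
    totals = {}
    for partial in partials:
        for product_key, dollars in partial.items():
            totals[product_key] = totals.get(product_key, 0.0) + dollars
    return totals


if __name__ == "__main__":
    # Step 7: the control node returns the combined result to the user.
    print(control_node_query(NODE_SLICES))
```

The final merge on the control node is cheap because each node has already reduced its slice to one row per product, which is the point of pushing the aggregation down to the compute nodes.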
  • APS Data Layout Options. Star schema: Time Dim (Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day); Store Dim (Store Dim ID, Store Name, Store Mgr, Store Size); Product Dim (Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc); Customer Dim (Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email); Sales Fact (Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold). Replicated: table copied to each compute node. Distributed: table spread across compute nodes based on hash. [Diagram: four compute nodes, each with SQL, DMS, and balanced storage; every node holds a copy of the Time, Product, Store, and Customer dimensions (TD, PD, SD, CD) and a portion of the Sales Fact table.]
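The two layout options can be illustrated with a small sketch. It assumes four compute nodes and uses Python's `zlib.crc32` as a stand-in hash; the hash function APS actually uses is not described in the slides, and the function names and sample rows below are hypothetical.

```python
import zlib

NUM_NODES = 4  # assumed node count for the example


def node_for_key(key, num_nodes=NUM_NODES):
    """Pick a compute node for a distribution key (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_index, num_nodes=NUM_NODES):
    """Distributed table: each row lands on exactly one node, chosen by hash."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for_key(row[key_index], num_nodes)].append(row)
    return nodes


def replicate(rows, num_nodes=NUM_NODES):
    """Replicated table: every node gets a full copy."""
    return [list(rows) for _ in range(num_nodes)]


# Hypothetical star-schema rows: (store_dim_id, store_name) and
# (store_dim_id, prod_dim_id, qty_sold).
store_dim = [(1, "North"), (2, "South")]
sales_fact = [(1, 100, 3), (2, 100, 1), (1, 200, 5), (2, 300, 2)]

fact_by_node = distribute(sales_fact, key_index=1)  # hash on prod_dim_id
dim_by_node = replicate(store_dim)

# Each node can join its fact rows to its local copy of the dimension
# without moving any data between nodes.
for node_id in range(NUM_NODES):
    names = dict(dim_by_node[node_id])
    joined = [(names[s], p, q) for s, p, q in fact_by_node[node_id]]
    print(node_id, joined)
```

Replicating small dimension tables trades storage for locality: the large fact table is split once by hash, and every join against a replicated dimension can run on each node independently.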
  • Data distribution. [Diagram: the FactSales table is split into distributions FactSales_A through FactSales_H on every compute node.] The user submits one CREATE TABLE statement:
      CREATE TABLE FactSales (
          ProductKey INT NOT NULL,
          OrderDateKey INT NOT NULL,
          DueDateKey INT NOT NULL,
          ShipDateKey INT NOT NULL,
          ResellerKey INT NOT NULL,
          EmployeeKey INT NOT NULL,
          PromotionKey INT NOT NULL,
          CurrencyKey INT NOT NULL,
          SalesTerritoryKey INT NOT NULL,
          SalesOrderNumber VARCHAR(20) NOT NULL
      )
      WITH (
          DISTRIBUTION = HASH(ProductKey),
          CLUSTERED INDEX(OrderDateKey),
          PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20010601, 20010901))
      );
    The control node sends the Create Table SQL to each compute node (Compute Node 1, Compute Node 2, ... Compute Node X), where it becomes Create Table FactSales_A, Create Table FactSales_B, Create Table FactSales_C, ... Create Table FactSales_H. The table metadata is created on the control node.
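The fan-out from one logical table to per-node physical tables can be sketched as follows. The `_A` through `_H` suffixes come from the slide; the function names are hypothetical, and the exact statements APS sends to each compute node are not shown in the source, so the strings below only mirror the slide's labels.

```python
import string


def distribution_table_names(logical_name, distributions_per_node=8):
    """Return physical table names such as FactSales_A ... FactSales_H."""
    suffixes = string.ascii_uppercase[:distributions_per_node]
    return [f"{logical_name}_{suffix}" for suffix in suffixes]


def fan_out_create_table(logical_name, compute_nodes, distributions_per_node=8):
    """Map each compute node to the per-distribution statements it would run."""
    names = distribution_table_names(logical_name, distributions_per_node)
    return {
        node: [f"Create Table {name}" for name in names]
        for node in compute_nodes
    }


if __name__ == "__main__":
    # Illustrative node labels taken from the slide.
    plan = fan_out_create_table("FactSales", ["Compute Node 1", "Compute Node 2"])
    for node, statements in plan.items():
        print(node, statements)
```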
  • APS: balanced across servers and within. Largest table: 600,000,000,000 rows. Randomly distributed across 40 compute nodes (5 racks): 15,000,000,000 rows per node. In each server, randomly distributed to 8 tables: 1,875,000,000 rows per table. Each partition (2 years of data partitioned by week): 18,028,846 rows. As an end user or DBA you think about 1 table: LineItem. You run select * from LineItem. APS is an appliance, simple to use! You don't care or need to know that there are actually 320 tables representing your 1 logical table.
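The figures on this slide follow from simple division, reproduced below. The sketch assumes a perfectly even spread and 52 weekly partitions per year (104 for two years); the variable names are illustrative.

```python
WEEKS_PER_YEAR = 52  # assumption: two years of weekly partitions = 104

total_rows = 600_000_000_000
compute_nodes = 40
tables_per_node = 8
partitions_per_table = 2 * WEEKS_PER_YEAR

rows_per_node = total_rows // compute_nodes                 # per compute node
rows_per_table = rows_per_node // tables_per_node           # per distribution table
rows_per_partition = rows_per_table // partitions_per_table # per weekly partition
physical_tables = compute_nodes * tables_per_node           # behind 1 logical table

print(f"{rows_per_node:,} rows per node")
print(f"{rows_per_table:,} rows per table")
print(f"{rows_per_partition:,} rows per partition")
print(f"{physical_tables} physical tables")
```

A query that filters on a single week therefore touches roughly 18 million rows on each distribution instead of 600 billion in total, and all 320 distributions can be scanned in parallel.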
  • 1/4 rack: 15TB (raw); 1/2 rack: 30TB (raw); full rack: 60TB (raw). 1 rack: 75.5TB (raw); 3 racks: 181.2TB (uncompressed); 1 1/2 racks: 90.6TB (raw); 2 racks: 120.8TB (raw). 2 to 56 compute nodes (32 to 896 cores). 1 to 7 racks. 1, 2, or 3 TB drives. 15TB to 1.2PB uncompressed. 75TB to 6PB user data (5:1). Up to 7 spare nodes available across the entire