ADF DW walkthrough

Transcript of ADF DW walkthrough

… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.

– Gartner, “The State of Data Warehousing in 2012”

Data sources

1. Increasing data volumes
2. Real-time data
3. New data sources & types (Non-Relational Data)
4. Cloud-born data

[Diagram: ETL into the enterprise data warehouse. Labels: ETL Tool (SSIS, etc.); Extract (Original Data); Transform; Load (Transformed Data); EDW (SQL Server, Teradata, etc.); Data Marts; Data Lake(s); BI Tools; Dashboards; Apps]

[Diagram, next build: adds Ingest (EL) of Original Data alongside the ETL Tool (SSIS, etc.), EDW (SQL Server, Teradata, etc.), Data Marts, Data Lake(s), BI Tools, Dashboards, and Apps]

[Diagram, next build: adds Scale-out Storage & Compute (HDFS, Blob Storage, etc.) with Transform & Load, plus Streaming data, alongside Ingest (EL), the ETL Tool (SSIS, etc.), EDW (SQL Server, Teradata, etc.), Data Marts, Data Lake(s), BI Tools, Dashboards, and Apps]


[Diagram: Information Production in three stages: Connect & Collect, Transform & Enrich, Publish. Labels: Data Sources (Import From); Ingest; Data Hub (Storage & Compute); Move data among Hubs; Move to data mart, etc.; Data Marts; Data Lake(s); BI Tools; Dashboards; Apps]

[Diagram: Information Production (Connect & Collect, Transform & Enrich, Publish) with Data Connectors. Labels: Data Sources (Import From); Data Connector: Import from source to Hub; Data Hub (Storage & Compute); Data Connector: Import/Export among Hubs; Data Connector: Export from Hub to data store; Data Marts; Data Lake(s); BI Tools; Dashboards; Apps]

• Coordination & Scheduling

• Monitoring & Mgmt

• Data Lineage

Example Scenario: Data warehouse sales to Azure pipeline

Raw sales (Custom view on top of DW tables)

Hive processing

Sales by category by day

OrderDate  Company                     Category     QtyOrdered  Unit Price  Sales Order
6/1/2004   Action Bicycle Specialists  Accessories  1716        22.0393     SO71784
6/1/2004   Action Bicycle Specialists  Bikes        2288        864.0452    SO71784
6/1/2004   Action Bicycle Specialists  Clothing     2340        26.8155     SO71784
6/1/2004   Action Bicycle Specialists  Components   598         329.8538    SO71784
6/1/2004   Aerobic Exercise Company    Components   338         133.8744    SO71915
6/1/2004   Action Bicycle Specialists  Accessories  910         25.1057     SO71938
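The "Hive processing" step that turns raw sales into "Sales by category by day" can be pictured as a group-by over the sample rows. The following is a minimal Python stand-in for that step, not the Hive script used in the deck. It assumes the aggregate sums QtyOrdered and QtyOrdered × Unit Price per (OrderDate, Category); the slides do not say which measures are aggregated, and the function and field names are hypothetical.

```python
from collections import defaultdict

# Rows copied from the sample table: (order_date, category, qty, unit_price)
raw_sales = [
    ("6/1/2004", "Accessories", 1716, 22.0393),
    ("6/1/2004", "Bikes", 2288, 864.0452),
    ("6/1/2004", "Clothing", 2340, 26.8155),
    ("6/1/2004", "Components", 598, 329.8538),
    ("6/1/2004", "Components", 338, 133.8744),
    ("6/1/2004", "Accessories", 910, 25.1057),
]


def sales_by_category_by_day(rows):
    """Sum quantity and amount per (order_date, category) key."""
    totals = defaultdict(lambda: {"qty": 0, "amount": 0.0})
    for order_date, category, qty, unit_price in rows:
        bucket = totals[(order_date, category)]
        bucket["qty"] += qty
        # Assumption: amount is quantity times unit price.
        bucket["amount"] += qty * unit_price
    return dict(totals)


daily = sales_by_category_by_day(raw_sales)
for (day, category), total in sorted(daily.items()):
    print(day, category, total["qty"], round(total["amount"], 2))
```

The six raw rows collapse to four output rows, one per category for the single order date in the sample.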

Data Factory Walkthrough

New-AzureDataFactory -Name "HaloTelemetry" -Location "West-US"

New-AzureDataFactory -Name "DW-Demo2" -Location "West-US"

[Diagram, build 1. Labels: On Premises SQL Server; Azure Blob Storage; Azure Data Factory; New User View]

[Diagram, build 2. Adds: AdventureWorksLTDW2014; Azure Data Factory "View Of" New Sales; "View Of" Aggregated Sales]

[Diagram, build 3. Adds: Pipeline with the activity Copy "NewSales" to Blob Storage; Cloud New Sales; New User Activity]

[Diagram, build 4. Adds: Pipeline with an Aggregate activity on HDInsight, from Cloud New Sales to Aggregated Sales; Pipeline: OnPrem SSIS package]
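The walkthrough diagrams chain two activities: a copy from New Sales to Cloud New Sales, then an aggregation into Aggregated Sales. As a rough sketch of how such a chain implies an execution order, the Python below derives the order from dataset dependencies. The dictionary layout and the activity names `CopyNewSalesToBlob` and `AggregateSales` are hypothetical illustrations, not Data Factory's actual pipeline format.

```python
from graphlib import TopologicalSorter

# Each activity maps to (input datasets, output dataset). Dataset names follow
# the slide labels; the structure itself is an assumption for illustration.
activities = {
    "CopyNewSalesToBlob": (["NewSales"], "CloudNewSales"),
    "AggregateSales": (["CloudNewSales"], "AggregatedSales"),
}


def run_order(activities):
    """Order activities so each runs after the producers of its inputs."""
    producer = {out: name for name, (_, out) in activities.items()}
    # Map every activity to the set of activities that produce its inputs.
    graph = {
        name: {producer[d] for d in inputs if d in producer}
        for name, (inputs, _) in activities.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = run_order(activities)
print(order)
```

Datasets with no producing activity, such as `NewSales` here, are treated as external inputs and impose no ordering constraint.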

"availability": { "frequency": "Day", "interval": 6 }

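[Diagram: dataset slices. Hourly: 12-6, 6-12, 12-6]

[Diagram: AggregatesSalesActivity (e.g. Hive) with Dataset2 and Dataset3. Slice labels: Hourly (12-1, 1-2, 2-3); Daily (Monday, Tuesday, Wednesday); Daily (Monday, Tuesday, Wednesday)]

[Diagram: Hive Activity with inputs Sales From DW and other source, and output Daily Sales]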
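The slice labels on these slides (12-6, 6-12, 12-6) read as six-hour windows. A minimal sketch of how an "availability" setting could be expanded into such windows follows, assuming the slice length is the frequency unit multiplied by the interval. The `UNIT` table, the `slices` function, and the `Hour`/6 example values are assumptions for illustration and are not taken from the Data Factory documentation.

```python
from datetime import datetime, timedelta

# Assumed mapping from a "frequency" string to a base unit of time.
UNIT = {"Hour": timedelta(hours=1), "Day": timedelta(days=1)}


def slices(availability, start, end):
    """Yield (slice_start, slice_end) windows that fit inside [start, end)."""
    step = UNIT[availability["frequency"]] * availability["interval"]
    cursor = start
    while cursor + step <= end:
        yield cursor, cursor + step
        cursor += step


# Hypothetical setting chosen to match the six-hour labels on the slide.
six_hourly = {"frequency": "Hour", "interval": 6}
windows = list(slices(six_hourly, datetime(2014, 1, 1), datetime(2014, 1, 2)))
for begin, finish in windows:
    print(begin.strftime("%H:%M"), "-", finish.strftime("%H:%M"))
```

Under the same assumption, `{"frequency": "Day", "interval": 1}` would yield one window per day, matching the Monday/Tuesday/Wednesday labels.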

• Is my data successfully getting produced?

• Is it produced on time?

• Am I alerted quickly of failures?

• What about troubleshooting information?

• Are there any policy warnings or errors?

• Easily move data to my existing data marts for consumption by my existing BI tools

• Azure DB

• SQL Server on premises

• Oracle

• Files

• Azure Blob content

Coordination:

• Rich scheduling

• Complex dependencies

• Incremental rerun
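One way to picture "incremental rerun" together with dependencies: when a slice fails, only that slice and the slices downstream of it need to be produced again. The sketch below is a hypothetical illustration of that idea; the status strings, slice names, and the `slices_to_rerun` helper are assumptions and do not describe the service's actual behavior.

```python
def slices_to_rerun(status, depends_on):
    """Return failed slices plus every slice downstream of a failure."""
    rerun = {name for name, state in status.items() if state == "Failed"}
    changed = True
    while changed:
        changed = False
        for name, upstream in depends_on.items():
            # A slice must be rerun if any slice it reads from is rerun.
            if name not in rerun and rerun & set(upstream):
                rerun.add(name)
                changed = True
    return rerun


# Hypothetical daily slices for two datasets from the walkthrough.
status = {
    "CloudNewSales/Mon": "Ready",
    "CloudNewSales/Tue": "Failed",
    "AggregatedSales/Mon": "Ready",
    "AggregatedSales/Tue": "Ready",
}
depends_on = {
    "AggregatedSales/Mon": ["CloudNewSales/Mon"],
    "AggregatedSales/Tue": ["CloudNewSales/Tue"],
}
to_rerun = slices_to_rerun(status, depends_on)
print(sorted(to_rerun))
```

Monday's slices are left alone because nothing they depend on failed.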

Authoring:

• JSON & PowerShell/C#

Management:

• Lineage

• Data production policies (late data, rerun, latency, etc.)

Hub: Azure Hub (HDInsight + Blob storage)

• Activities: Hive, Pig, C#

• Data Connectors: Blobs, Tables, Azure DB, On Prem SQL Server, Oracle
