Ten Guidelines for a Modern Data Architecture

Rick F. van der Lans
Industry analyst
Email [email protected]
Twitter @rick_vanderlans
www.r20.nl

Copyright © 2019 R20/Consultancy B.V., The Netherlands. All rights reserved. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photographic, or otherwise, without the explicit written permission of the copyright owners.


Rick F. van der Lans

Rick F. van der Lans is a highly‐respected independent analyst, consultant, author, and internationally acclaimed lecturer specializing in data warehousing, business intelligence, big data, and database technology. He is managing director of R20/Consultancy BV. 

He has presented countless seminars, webinars, and keynotes at industry‐leading conferences. Rick helps clients worldwide to design their data warehouse, big data, and business intelligence architectures and solutions and assists them with selecting the right products. He has been influential in introducing the new logical data warehouse architecture worldwide, which helps organizations to develop more agile business intelligence systems.

He is the author of several books on computing, including his new Data Virtualization: Selected Writings and Data Virtualization for Business Intelligence Systems. Some of these books are available in different languages. The popular Introduction to SQL, for example, is available in English, Dutch, Italian, Chinese, and German and is sold worldwide. He has also authored numerous whitepapers for vendors.

In 2018 he was selected as the sixth most influential BI analyst worldwide by onalytica.com.

Ambassador of Axians Business Analytics Laren (formerly Kadenza): This consultancy company specializes in business intelligence, data management, big data, data warehousing, data virtualization, and analytics. In this part‐time role, Rick works closely together with the consultants in many projects. Their joint experiences and insights are shared in seminars, webinars, blogs, and whitepapers.

You can get in touch with Rick van der Lans via:
Email: [email protected]
Website: www.r20.nl
Twitter: @Rick_vanderlans
LinkedIn: http://www.linkedin.com/pub/rick-van-der-lans/9/207/223


Introduction to Data Architectures


What is a Data Architecture?

Wikipedia: A data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.

Examples of data architectures:
• Data warehouse architecture
• Data streaming architecture
• Transactional system


Data Architects versus Solutions Architects

Data Architects:
• focus on how information moves across the system from one application to another
• collaborate with clients to determine the specifications of the project, as well as the business goals that will align with the collected data
• design the data model for the organization: where to store the customer data, how to retrieve the data, who can read the data

Solutions Architects:
• look at the overall technological environment of the company
• meet with their clients and establish their specific technology needs based on their business objectives
• have a more technical point of view: Do we select a cloud solution, or on premises? What will the network look like? How will everything be connected without failures?


The Birth of a New Data Architecture (1)


The Birth of a New Data Architecture (2)


There is no “best data architecture”!


Ten Guidelines for Cloud Data Architectures

1. Use technology designed for the cloud

2. Stay cloud platform independent

3. Centralize transformation specifications

4. Centralize technical and business metadata

5. Implement abstraction

6. Use cases must match

7. Store all data

8. Choose productivity over performance

9. Cloud is networks

10. Apply a holistic design approach


1. Use Technology Designed for the Cloud 


Scale Up versus Scale Out

Scale up (vertical scaling) means adding more resources to one node in a system

Scale out (horizontal scaling) means adding more nodes to a system
• Continuous availability/redundancy
• Cost/performance flexibility
• Contiguous upgrades
• Geographical distribution

[Figure: scale up versus scale out]


Effect of Partitions on Query Response

[Figure: total throughput plotted against the number of partitions/processors, with a bottleneck marked]
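The relationship between partitions and throughput can be illustrated with a toy model. This is only a sketch: the function name, the per-partition rate, and the bottleneck capacity are assumptions chosen for illustration and do not come from the slides. The model simply assumes that throughput grows linearly with the number of partitions until a shared resource caps it.

```python
def total_throughput(partitions: int,
                     per_partition_rate: float,
                     bottleneck_capacity: float) -> float:
    """Toy model (assumption): linear scaling capped by one shared bottleneck."""
    if partitions < 1:
        raise ValueError("need at least one partition")
    # Each partition adds the same rate, but the shared resource
    # (for example a network link or a master node) limits the total.
    return min(partitions * per_partition_rate, bottleneck_capacity)


def throughput_curve(max_partitions: int,
                     per_partition_rate: float,
                     bottleneck_capacity: float) -> list:
    """Return (partitions, throughput) pairs for 1..max_partitions."""
    return [
        (n, total_throughput(n, per_partition_rate, bottleneck_capacity))
        for n in range(1, max_partitions + 1)
    ]
```

With a hypothetical rate of 100 per partition and a bottleneck of 500, adding partitions beyond five no longer increases total throughput in this model.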


Market of Database Servers

[Figure: taxonomy of all database servers]
Groups: SQL database servers, NoSQL database servers, Hadoop and Spark
Categories: Classic SQL, Analytical SQL, Translytical SQL, NewSQL, Streaming SQL, Graph, Key‐value stores, Document stores, Column‐family stores, Search technology, Data virtualization, Cube/multi‐dimensional, HDFS + MapReduce, HDFS + Spark, SQL‐on‐Hadoop Transactions, SQL‐on‐Hadoop Query, Object storage, File system
Orientation labels: query oriented, transaction oriented, general purpose


Application‐based Analytics

[Figure labels: database server, application, operations]


Database‐Based Analytics

[Figure labels: database server, application]


Partial Parallel Analytics

[Figure labels: application, database server with a master and workers 1, 2, and 3]


Full Parallel Analytics

[Figure labels: application, database server with a master and workers 1, 2, and 3]


Optimizing Distributed Joins

Customer table partitions

Order table partitions

Product table partitions
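One common way to optimize a distributed join is to partition the joined tables on the same key, so that matching rows end up in the same partition and can be joined locally. The sketch below illustrates that idea only; whether the slide shows exactly this technique is an assumption, and the names `NUM_PARTITIONS`, `partition_of`, and `co_located_join` are hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical number of nodes


def partition_of(customer_id: int) -> int:
    """Assign a row to a partition by hashing the join key (simple modulo here)."""
    return customer_id % NUM_PARTITIONS


def partition_rows(rows, key):
    """Group rows by the partition their join key maps to."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[key])].append(row)
    return parts


def co_located_join(customers, orders):
    """Join customers and orders partition by partition.

    Both tables are partitioned on customer_id, so every partition
    only needs its own local rows; no rows move between partitions.
    """
    cust_parts = partition_rows(customers, "customer_id")
    order_parts = partition_rows(orders, "customer_id")
    joined = []
    for p in range(NUM_PARTITIONS):
        local_customers = {c["customer_id"]: c for c in cust_parts[p]}
        for order in order_parts[p]:
            customer = local_customers.get(order["customer_id"])
            if customer is not None:
                joined.append({**customer, **order})
    return joined
```

A table that is joined on a different key (such as a product table joined on a product identifier) would not benefit from this partitioning and would need another strategy, for example replicating the smaller table to every node.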


2. Stay Cloud Platform Independent


Cloud Platforms Are Becoming the New Mainframes


Mainframe = Lock In

Proprietary operating systems

Proprietary system management software

Proprietary database servers

Proprietary security systems

Proprietary development environments

Proprietary JCLs

Proprietary …


Cloud Platform = Lock In?

Proprietary operating systems

Proprietary management software

Proprietary database servers
• E.g. Amazon: RDS, Redshift (SQL), S3, …

Proprietary security systems

Proprietary development environments
• E.g. Microsoft Azure: Reporting Services, Analytics services, Data Management Services, …

Proprietary …


Stay Cloud Platform Independent = Design to Migrate
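One possible way to design for migration is to let application code depend on a small, platform-neutral interface instead of on a provider's SDK. The sketch below is an illustration under that assumption, not a prescribed design: `ObjectStore`, `InMemoryStore`, and `migrate` are hypothetical names, and the in-memory class merely stands in for an adapter around a real storage service.

```python
from typing import Dict, Iterable, Protocol


class ObjectStore(Protocol):
    """Hypothetical platform-neutral storage interface."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend; a real adapter would wrap one provider's storage service."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def migrate(keys: Iterable[str], source: ObjectStore, target: ObjectStore) -> int:
    """Copy objects between two backends using only the neutral interface."""
    copied = 0
    for key in keys:
        target.put(key, source.get(key))
        copied += 1
    return copied
```

Because `migrate` and the rest of the application only see `ObjectStore`, moving to another platform means writing one new adapter rather than changing every caller.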


3. Centralize Transformation Specifications


Data From Sources to Dashboards

[Figure labels: source systems, specifications, analytics & reporting]

Data structure specifications

Integration specifications

Transformation specifications

Data security specifications

Data cleansing specifications

Analytical specifications

Visualization specifications

Data privacy specifications


Example: The Classic Data Warehouse

[Figure: source systems → ETL → staging area → ETL → data warehouse → ETL → data marts → analytics & reporting]


Example: Implementing the Specifications

[Figure labels: source systems, ETL, data warehouse, data virtualization, analytics & reporting]

Data warehouse:

Full history

Permanent surrogate keys

No cleansing

No integration

No deletes

Data warehouse becomes “data lakish”


4. Centralize Technical and Business Metadata


Metadata Today

Metadata dispersed across many systems
• In database servers

• In integration tools

• In documentation

• In reporting tools

• In spreadsheets

Most is technical and not business metadata

Not integrated – no clear relationships between metadata elements


Solutions for Centralizing Metadata

Types of solutions
• Homemade metadata integration system
• Professional metadata tool: Collibra, Tibco EBX, …
• Scraping and linking: ASG, …
• Data warehouse automation: Astera, Attunity, TimeXtender, WhereScape, …

Design considerations
• Integration with master data and reference data
• Searchable metadata for business users and ICT professionals
• Programmable interface
• It’s all about discipline


ASG Enterprise Data Management – Data Lineage


Automate Repetitive Tasks


Generating Specifications

[Figure: a repository and a generator, shown with the chain source systems → ETL → staging area → ETL → data warehouse → ETL → data marts → analytics & reporting]


Why Generate Specifications?

Consistent specifications

Consistent reporting results

Improved productivity and maintainability
• Avoids reinventing the wheel

Improved time-to-market for new reports

Improved governance and auditability

Easier to adopt new technologies:
• SQL to SQL, SQL to NoSQL, SQL to Hadoop, ETL to ELT, …

Easier data model migration
• Normalized, star, snowflake, data vault, SuperNova
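The idea of generating specifications from a central repository can be sketched in a few lines. Everything in this example is an assumption made for illustration: the repository layout, the table and column names, and the `generate_view_sql` helper are hypothetical and do not describe any particular data warehouse automation product.

```python
from typing import Dict, List

# Hypothetical repository: each target is described once, as metadata.
REPOSITORY: Dict[str, Dict] = {
    "dim_customer": {
        "source": "staging.customer",
        "columns": {
            "customer_id": "id",
            "customer_name": "TRIM(name)",
            "country_code": "UPPER(country)",
        },
    },
}


def generate_view_sql(target: str, spec: Dict) -> str:
    """Generate a CREATE VIEW statement from one repository entry."""
    select_list = ",\n".join(
        f"    {expression} AS {column}"
        for column, expression in spec["columns"].items()
    )
    return f"CREATE VIEW {target} AS\nSELECT\n{select_list}\nFROM {spec['source']};"


def generate_all(repository: Dict[str, Dict]) -> List[str]:
    """Generate statements for every target, in a stable order."""
    return [generate_view_sql(name, spec) for name, spec in sorted(repository.items())]
```

Because the transformation rules live in one place, switching the output (for example from views to ETL jobs, or to another SQL dialect) would only require a different generator function, not a rewrite of each specification.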


5. Implement Abstraction


Data Virtualization Overview

[Figure: a data virtualization server between data consumers and data sources]
Data consumer labels: production application, website, analytics & reporting, mobile app, internal portal, dashboard
Consumer interface labels: ODBC/SQL, JDBC/SQL, XML/SOAP, REST/JSON, XQuery, MDX/DAX
Source interface labels: JMS, SQL, SQL+, XSLT, Hive, Prop., Excel, JSON, CICS, SOAP
Request labels: SQL statement, JMS message, SQL statement, SOAP message
Data source labels: streaming databases, social media data, private data, applications, unstructured data, SQL databases, Hadoop, NoSQL database, ESB messaging, legacy database, cloud applications


Developing Virtual Tables

[Figure labels: data consumer, data virtualization server, virtual table pointing to source, source]

Virtual table: may contain row selections, column selections, column concatenations, transformations, column and table name changes, groupings, aggregations, data cleansing, …
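A virtual table behaves much like a database view: it stores a definition, not data. The sketch below uses a SQLite view as a stand-in for a virtual table in a data virtualization server, which is an analogy rather than how any specific product works. The table, view, and column names are hypothetical. The view combines a row selection, a column concatenation, column renames, and a simple transformation.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create a source table and a view that plays the role of a virtual table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE src_customer "
        "(id INTEGER, first_name TEXT, last_name TEXT, cntry TEXT)"
    )
    conn.executemany(
        "INSERT INTO src_customer VALUES (?, ?, ?, ?)",
        [(1, "Ann", "Jansen", "nl"), (2, "Bob", "Smith", "us"), (3, "Cas", "de Vries", "nl")],
    )
    conn.execute(
        """
        CREATE VIEW v_dutch_customer AS
        SELECT id AS customer_id,                           -- column rename
               first_name || ' ' || last_name AS full_name, -- column concatenation
               UPPER(cntry) AS country_code                 -- transformation
        FROM src_customer
        WHERE cntry = 'nl'                                  -- row selection
        """
    )
    return conn


def dutch_customers(conn: sqlite3.Connection) -> list:
    """The data consumer queries the virtual table, not the source table."""
    return conn.execute(
        "SELECT customer_id, full_name, country_code "
        "FROM v_dutch_customer ORDER BY customer_id"
    ).fetchall()
```

The consumer never sees the source column names or the filter; if the source changes, only the view definition has to be adapted.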


Layers of Virtual Tables

[Figure: layers of virtual tables inside the data virtualization server: data consumption layer, enterprise data layer, data source layer]


Improved Performance Through Query Pushdown

[Figure: two scenarios in which an application sends a request consisting of eight operations through the data virtualization (DV) server to a data source]
• Six operations processed by the data source, two operations processed by the data virtualization server
• Two operations processed by the data source, six operations processed by the data virtualization server


Push Down Query Processing

(1) Incoming query, received by the data virtualization server:

    SELECT C2, CONCAT(C5, C6)
    FROM Virtual table
    WHERE C1 >= 1000
    AND C4 BETWEEN 10 AND 20

(2) Query executed by the data source:

    SELECT C2, C5, C6
    FROM Table
    WHERE C1 >= 1000
    AND C4 BETWEEN 10 AND 20

(3) Query executed by the data virtualization server on the returned result:

    SELECT C2, CONCAT(C5, C6)
    FROM Result
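The split between the data source and the data virtualization server can be mimicked with SQLite and a few lines of Python. This is a sketch under assumptions: an in-memory SQLite database plays the data source, plain Python plays the data virtualization server, the sample rows are invented, the table is called `t` because `Table` is a reserved word, the comparison on C1 is read as "greater than or equal to", and an ORDER BY is added only to make the result order predictable.

```python
import sqlite3


def make_source() -> sqlite3.Connection:
    """Hypothetical data source with a handful of invented rows."""
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE t (C1 INTEGER, C2 TEXT, C4 INTEGER, C5 TEXT, C6 TEXT)")
    src.executemany(
        "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
        [
            (1500, "a", 15, "x", "y"),  # passes both filters
            (500, "b", 15, "p", "q"),   # fails the C1 filter
            (2000, "c", 30, "m", "n"),  # fails the C4 filter
            (1000, "d", 10, "u", "v"),  # passes both filters (boundary values)
        ],
    )
    return src


def query_with_pushdown(src: sqlite3.Connection):
    """Return (result rows, number of rows transferred from the source)."""
    # Step 2: the filters are pushed down and run inside the data source.
    pushed = src.execute(
        "SELECT C2, C5, C6 FROM t "
        "WHERE C1 >= 1000 AND C4 BETWEEN 10 AND 20 "
        "ORDER BY C2"
    ).fetchall()
    # Step 3: the virtualization layer only does the concatenation.
    result = [(c2, c5 + c6) for c2, c5, c6 in pushed]
    return result, len(pushed)
```

Only the two qualifying rows leave the source here; the remaining work in the virtualization layer is limited to the concatenation.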


Accessing Files

(1) Incoming query, received by the data virtualization server:

    SELECT C2, CONCAT(C5, C6)
    FROM Virtual table
    WHERE C1 >= 1000
    AND C4 BETWEEN 10 AND 20

(2) Query executed by the data source (a file):

    SELECT C1, C2, C4, C5, C6
    FROM File

(3) Query executed by the data virtualization server on the returned result:

    SELECT C2, CONCAT(C5, C6)
    FROM Result
    WHERE C1 >= 1000
    AND C4 BETWEEN 10 AND 20
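When the source is a plain file, it cannot evaluate filters itself, so the whole file content is transferred and the filtering happens in the virtualization layer. The sketch below illustrates this with a small CSV text; the content, the function names, and the reading of the C1 comparison as "greater than or equal to" are assumptions for illustration.

```python
import csv
import io

# Hypothetical file content; a real source would be a file on disk or in object storage.
FILE_CONTENT = """C1,C2,C4,C5,C6
1500,a,15,x,y
500,b,15,p,q
2000,c,30,m,n
1000,d,10,u,v
"""


def read_file_source(text: str) -> list:
    """Step 2: the file can only hand over the requested columns, unfiltered."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (int(row["C1"]), row["C2"], int(row["C4"]), row["C5"], row["C6"])
        for row in reader
    ]


def query_over_file(text: str):
    """Return (result rows, number of rows transferred from the file)."""
    rows = read_file_source(text)
    # Step 3: filtering and concatenation both run in the virtualization layer.
    result = [
        (c2, c5 + c6)
        for c1, c2, c4, c5, c6 in rows
        if c1 >= 1000 and 10 <= c4 <= 20
    ]
    return result, len(rows)
```

All four rows are transferred to produce a two-row result, which is the cost of a source that cannot process any part of the query.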


Data Virtualization and Cloud Integration

[Figure labels: business users, data virtualization, on‐premises data sources, cloud‐based data sources]


6. Use Cases Must Match


Market of Database Servers

[Figure: taxonomy of all database servers]
Groups: SQL database servers, NoSQL database servers, Hadoop and Spark
Categories: Classic SQL, Analytical SQL, Translytical SQL, NewSQL, Streaming SQL, Graph, Key‐value stores, Document stores, Column‐family stores, Search technology, Data virtualization, Cube/multi‐dimensional, HDFS + MapReduce, HDFS + Spark, SQL‐on‐Hadoop Transactions, SQL‐on‐Hadoop Query, Object storage, File system
Orientation labels: query oriented, transaction oriented, general purpose


Specialization of Cars


7. Store All Data


New Requirements for Transactional Systems

Keep track of history
• Simplifies the rest of the data architecture
• Everything becomes versioned
• Don’t throw data away (unless regulations require it)

Log who does what, when, why, and where
• For analytical purposes
• Separate small tables or one large table?

Develop an automatic replication mechanism for all data

Correct data at the door

Design them to support reporting and analytics

Minimize design exceptions!

Responsible for enforcing data quality (possible?)
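Keeping history and logging who changed what can be sketched as an append-only table in which every change becomes a new version and a delete becomes a flagged version. This is a minimal sketch, not a recommended schema: the table `customer_version`, its columns, and the helper functions are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an append-only version table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE customer_version (
               customer_id INTEGER NOT NULL,
               version     INTEGER NOT NULL,
               city        TEXT,
               changed_by  TEXT NOT NULL,
               changed_at  TEXT NOT NULL,
               is_deleted  INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (customer_id, version))"""
    )
    return conn


def record_change(conn, customer_id, city, changed_by, changed_at, is_deleted=False):
    """Insert a new version; existing rows are never updated or deleted."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM customer_version WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    next_version = row[0] + 1
    conn.execute(
        "INSERT INTO customer_version VALUES (?, ?, ?, ?, ?, ?)",
        (customer_id, next_version, city, changed_by, changed_at, int(is_deleted)),
    )
    return next_version


def current_city(conn, customer_id):
    """Read the latest version; a deleted customer yields None."""
    row = conn.execute(
        "SELECT city, is_deleted FROM customer_version "
        "WHERE customer_id = ? ORDER BY version DESC LIMIT 1",
        (customer_id,),
    ).fetchone()
    if row is None or row[1]:
        return None
    return row[0]
```

Operational code reads the latest version, while reporting and analytics can query the full history without a separate change-capture mechanism.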


8. Choose Productivity Over Performance


Example: SnowflakeDB

[Figure: a central database surrounded by virtual warehouses]
• Virtual warehouse for data scientists
• Virtual warehouse for marketing dashboards
• Virtual warehouse for self‐service BI finance
• Virtual warehouse for external parties
• Virtual warehouse for external parties
• Virtual warehouse for testing and dev


The Effect of SnowflakeDB

[Figure labels: SnowflakeDB, source systems, ETL, staging area, data warehouse, data marts, reporting]


Development Steps for Data Science

Defining goals

Data selection

Data understanding

Data enrichment

Data cleansing

Data coding/binning/bucketing

Creating analytical model

Analytics

Interpreting & understanding results


Wide Range of Tools Used by Data Scientists

• Spreadsheets: Excel
• Self‐service BI tools: Alteryx, PowerBI, Spotfire, QlikSense, Tableau, …
• Programming languages: Python, R, Scala, …
• ML automation tools: BigML, DataRobot, SAS Factory Miner, …
• Data science workbenches: Amazon SageMaker, Cloudera Data Science Workbench, …
• Many others


DataRobot


9. Cloud is Networks


My Personal View of the Network

[Figure labels: System A, Network, System B]


Data is Pushed to the Processing


Push the Processing to the Data!


10. Apply a Holistic Design Approach


From a Linear to a Holistic Approach

Data and Solutions Architectures

Data Storage and Processing Technology

Design Principles

Data Security and Privacy


Closing Remarks


A bad design on premises is still a bad design in the cloud

The cloud is not a solution for every problem

What are the business benefits?


www.r20.nl