Copyright © 2019 R20/Consultancy B.V., The Netherlands. All rights reserved. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means,
electronic, mechanical, photographic, or otherwise, without the explicit written permission of the copyright owners.
Ten Guidelines for a Modern Data Architecture
Rick F. van der Lans
Industry analyst
Email [email protected]
Twitter @rick_vanderlans
www.r20.nl
Rick F. van der Lans
Rick F. van der Lans is a highly‐respected independent analyst, consultant, author, and internationally acclaimed lecturer specializing in data warehousing, business intelligence, big data, and database technology. He is managing director of R20/Consultancy BV.
He has presented countless seminars, webinars, and keynotes at industry‐leading conferences. Rick helps clients worldwide to design their data warehouse, big data, and business intelligence architectures and solutions and assists them with selecting the right products. He has been influential in introducing the new logical data warehouse architecture worldwide, which helps organizations to develop more agile business intelligence systems.
He is the author of several books on computing, including his new Data Virtualization: Selected Writings and Data Virtualization for Business Intelligence Systems. Some of these books are available in different languages. His popular Introduction to SQL is available in English, Dutch, Italian, Chinese, and German and is sold worldwide. He has also authored numerous whitepapers for vendors.
In 2018 he was selected as the sixth most influential BI analyst worldwide by onalytica.com.
Ambassador of Axians Business Analytics Laren (formerly Kadenza): This consultancy company specializes in business intelligence, data management, big data, data warehousing, data virtualization, and analytics. In this part‐time role, Rick works closely together with the consultants in many projects. Their joint experiences and insights are shared in seminars, webinars, blogs, and whitepapers.
You can get in touch with Rick van der Lans via:
Email: [email protected]
www.r20.nl
Twitter: @Rick_vanderlans
LinkedIn: http://www.linkedin.com/pub/rick‐van‐der‐lans/9/207/223
Introduction to Data Architectures
What is a Data Architecture?
Wikipedia: A data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.
Examples of data architectures:
• Data warehouse architecture
• Data streaming architecture
• Transactional system
Data Architects versus Solutions Architects
Data Architects:
• focus on how information moves across the system from one application to another
• collaborate with clients to determine the specifications of the project, as well as the business goals that will align with the collected data
• design the data model for the organization: where to store the customer data, how to retrieve the data, who can read the data
Solutions Architects:
• look at the overall technological environment of the company
• meet with their clients and establish their specific technology needs based on their business objectives
• have a more technical point of view: Do we select a cloud solution or on premises? What will the network look like? How will everything be connected without failures?
The Birth of a New Data Architecture (1)
The Birth of a New Data Architecture (2)
There is no “the best data architecture”!
Ten Guidelines for Cloud Data Architectures
1. Use technology designed for the cloud
2. Stay cloud platform independent
3. Centralize transformation specifications
4. Centralize technical and business metadata
5. Implement abstraction
6. Use cases must match
7. Store all data
8. Choose productivity over performance
9. Cloud is networks
10. Apply a holistic design approach
1. Use Technology Designed for the Cloud
Scale Up versus Scale Out
Scale up (vertical scaling) means adding more resources to one node in a system
Scale out (horizontal scaling) means adding more nodes to a system
• Continuous availability/redundancy
• Cost/performance flexibility
• Contiguous upgrades
• Geographical distribution
[Figure: scale up versus scale out]
Effect of Partitions on Query Response
[Figure: total throughput plotted against the number of partitions/processors, with a bottleneck marked]
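The slide shows only the curve and does not name a model. One common way to reason about such a bottleneck is Amdahl's law, which is assumed here purely for illustration: if a fixed fraction of the work cannot be spread over partitions, adding partitions gives diminishing returns. The 10% serial fraction below is a made-up value.

```python
def speedup(partitions: int, serial_fraction: float) -> float:
    """Amdahl's law: speedup when only the non-serial part scales with partitions."""
    if partitions < 1:
        raise ValueError("partitions must be >= 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / partitions)


if __name__ == "__main__":
    # Hypothetical workload in which 10% of the query cannot be parallelized.
    for n in (1, 2, 4, 8, 16, 64):
        print(f"{n:>3} partitions -> speedup {speedup(n, 0.10):.2f}")
```

Under this assumption the speedup can never exceed 1 / serial_fraction, however many partitions are added, which is one possible reading of the bottleneck on the slide.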
Market of Database Servers
[Figure: overview of database server categories; each category is labeled query oriented, transaction oriented, or general purpose]
SQL database servers: Classic SQL, Analytical SQL, NewSQL, Translytical SQL
NoSQL database servers: Graph, Key‐value stores, Document stores, Column‐family stores
Hadoop and Spark: HDFS + MapReduce, HDFS + Spark, SQL‐on‐Hadoop Transactions, SQL‐on‐Hadoop Query
Also shown: Streaming SQL, Search technology, Data virtualization, Object Storage, Cube/multi‐dimensional, File system
Application‐based Analytics
[Figure labels: Database server, Application, operations]
Database‐Based Analytics
[Figure labels: Database server, Application]
Partial Parallel Analytics
[Figure labels: Application, Database server Master, Worker 1, Worker 2, Worker 3]
Full Parallel Analytics
[Figure labels: Application, Database server Master, Worker 1, Worker 2, Worker 3]
Optimizing Distributed Joins
Customer table partitions
Order table partitions
Product table partitions
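The slide lists only the partitioned tables. One widely used optimization, assumed here since the slide does not spell it out, is co-partitioning: if the customer and order tables are hash-partitioned on the same join key, each node can join its own partitions without shipping rows across the network. The table contents and function names below are hypothetical.

```python
def hash_partition(rows, key, n_partitions):
    """Assign each row to a partition based on the hash of its join key."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[key]) % n_partitions].append(row)
    return partitions


def local_join(customers, orders):
    """Join one customer partition with the matching order partition."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {**by_id[o["customer_id"]], **o}
        for o in orders
        if o["customer_id"] in by_id
    ]


# Hypothetical sample data.
customers = [{"customer_id": i, "name": f"cust{i}"} for i in range(6)]
orders = [
    {"order_id": 100 + i, "customer_id": i % 6, "amount": 10 * i}
    for i in range(12)
]

N = 3
customer_parts = hash_partition(customers, "customer_id", N)
order_parts = hash_partition(orders, "customer_id", N)

# Each partition pair is joined independently; no row crosses partitions.
joined = [
    row
    for p in range(N)
    for row in local_join(customer_parts[p], order_parts[p])
]
```

Because both tables use the same partitioning function on the same key, every order finds its customer in the partition with the same number.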
2. Stay Cloud Platform Independent
Cloud Platforms Are Becoming the New Mainframes
Mainframe = Lock In
Proprietary operating systems
Proprietary system management software
Proprietary database servers
Proprietary security systems
Proprietary development environments
Proprietary JCLs
Proprietary …
Cloud Platform = Lock In?
Proprietary operating systems
Proprietary management software
Proprietary database servers
• E.g. Amazon: RDS, RedShift (SQL), S3, …
Proprietary security systems
Proprietary development environments
• E.g. Microsoft Azure: Reporting Services, Analytics services, Data Management Services, …
Proprietary …
Stay Cloud Platform Independent = Design to Migrate
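One way to "design to migrate" is to keep platform-specific calls behind a small interface of your own, so that a move to another platform means writing one new adapter rather than touching every application. The sketch below is an assumption about how that could look; ObjectStore, InMemoryStore, and archive_report are hypothetical names, and the in-memory adapter stands in for a real cloud storage client.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical platform-neutral storage interface."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in adapter; a real one would wrap a specific cloud storage service."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_report(store: ObjectStore, name: str, content: str) -> str:
    """Application code depends only on the interface, not on a vendor API."""
    key = f"reports/{name}.txt"
    store.put(key, content.encode("utf-8"))
    return key
```

Swapping platforms then comes down to providing another class with the same two methods.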
3. Centralize Transformation Specifications
Data From Sources to Dashboards
[Figure: specifications applied to data on its way from source systems to analytics & reporting]
Specifications:
Data structure specifications
Integration specifications
Transformation specifications
Data security specifications
Data cleansing specifications
Analytical specifications
Visualization specifications
Data privacy specifications
Example: The Classic Data Warehouse
[Figure: Source systems → ETL → Staging area → ETL → Data warehouse → ETL → Data marts → Analytics & reporting]
Example: Implementing the Specifications
[Figure labels: Source systems, ETL, Data Warehouse, Data Virtualization, Analytics & reporting]
Data Warehouse:
Full history
Permanent surrogate keys
No cleansing
No integration
No deletes
Data warehouse becomes “data lakish”
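A minimal sketch of what "full history, permanent surrogate keys, no deletes" could look like in practice, using SQLite only because it ships with Python. The table layout, column names, and sample rows are assumptions, not taken from the slides: every change is stored as a new version, and nothing is updated or deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_history (
        customer_sk INTEGER NOT NULL,  -- permanent surrogate key, never reused
        valid_from  TEXT    NOT NULL,  -- ISO date on which this version starts
        name        TEXT,
        city        TEXT,
        PRIMARY KEY (customer_sk, valid_from)
    )
    """
)


def record_version(sk, valid_from, name, city):
    """Insert-only: a change is a new row, existing rows are never touched."""
    conn.execute(
        "INSERT INTO customer_history VALUES (?, ?, ?, ?)",
        (sk, valid_from, name, city),
    )


def as_of(sk, date):
    """Return the version of a customer that was valid on the given date."""
    return conn.execute(
        "SELECT name, city FROM customer_history "
        "WHERE customer_sk = ? AND valid_from <= ? "
        "ORDER BY valid_from DESC LIMIT 1",
        (sk, date),
    ).fetchone()


# Hypothetical data: the customer moves in June.
record_version(1, "2019-01-01", "Acme", "Utrecht")
record_version(1, "2019-06-01", "Acme", "Amsterdam")
```

Cleansing and integration are deliberately absent here; in the architecture on the slide those specifications sit in the data virtualization layer.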
4. Centralize Technical and Business Metadata
Metadata Today
Metadata dispersed across many systems
• In database servers
• In integration tools
• In documentation
• In reporting tools
• In spreadsheets
Most is technical and not business metadata
Not integrated – no clear relationships between metadata elements
Solutions for Centralizing Metadata
Types of solutions
• Homemade metadata integration system
• Professional metadata tool: Collibra, Tibco EBX, …
• Scraping and linking: ASG, …
• Data warehouse automation: Astera, Attunity, TimeXtender, WhereScape, …
Design considerations
• Integration with master data and reference data
• Searchable metadata for business users and ICT professionals
• Programmable interface
• It’s all about discipline
ASG Enterprise Data Management – Data Lineage
Automate Repetitive Tasks
Generating Specifications
[Figure: a Repository and a Generator placed above the classic chain of Source systems, ETL, Staging area, ETL, Data warehouse, ETL, Data marts, Analytics & reporting]
Why Generate Specifications?
Consistent specifications
Consistent reporting results
Improved productivity and maintainability
• Avoids reinventing the wheel
Improved time-to-market for new reports
Improved governance and auditability
Easier to adopt new technologies:
• SQL to SQL, SQL to NoSQL, SQL to Hadoop, ETL to ELT, …
Easier data model migration
• Normalized, star, snowflake, data vault, SuperNova
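To make the idea of generating specifications concrete, here is a small sketch in which a transformation is described once as metadata and the SQL is generated from it. The repository layout (MAPPINGS), the table and column names, and generate_view are all hypothetical; real data warehouse automation tools use far richer repositories.

```python
# Hypothetical repository entry: one target table, its source, and the
# expression that produces each target column.
MAPPINGS = {
    "dim_customer": {
        "source": "crm_customers",
        "columns": {
            "customer_id": "id",
            "customer_name": "UPPER(name)",
            "country": "country_code",
        },
    },
}


def generate_view(target: str, spec: dict) -> str:
    """Generate a CREATE VIEW statement from one repository entry."""
    select_list = ",\n    ".join(
        f"{expr} AS {col}" for col, expr in spec["columns"].items()
    )
    return (
        f"CREATE VIEW {target} AS\n"
        f"SELECT\n    {select_list}\n"
        f"FROM {spec['source']};"
    )


if __name__ == "__main__":
    print(generate_view("dim_customer", MAPPINGS["dim_customer"]))
```

Because the mapping is data rather than hand-written code, a second generator could emit the same logic for another target technology, which is the point of the "easier to adopt new technologies" item above.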
5. Implement Abstraction
Data Virtualization Overview
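[Figure: a Data Virtualization Server sits between data consumers and data sources]
Data consumers: production application, website, analytics & reporting, mobile app, internal portal, dashboard
Interfaces offered to consumers: ODBC/SQL, JDBC/SQL, XML/SOAP, REST/JSON, XQuery, MDX/DAX
Interfaces used towards sources: JMS, SQL, SQL+, XSLT, Hive, Prop., Excel, JSON, CICS, SOAP
Example requests sent to sources: SQL statement, JMS message, SOAP message
Data sources: streaming databases, social media data, private data, applications, unstructured data, SQL databases, Hadoop, NoSQL database, ESB, messaging, legacy database, cloud applications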
Developing Virtual Tables
[Figure: a data consumer queries a virtual table in the Data Virtualization Server; the virtual table points to a source]
Virtual table: may contain row selections, column selections, column concatenations, transformations, column and table name changes, groupings, aggregations, data cleansing, …
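As a rough analogy, a virtual table behaves much like a SQL view: it stores a definition, not data. The sketch below uses a SQLite view to show a row selection, a column concatenation, and column and table name changes in one definition. The table, column names, and rows are made up, and a real data virtualization server would define the virtual table over a remote source rather than a local table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for a source table.
    CREATE TABLE src_customer (
        id INTEGER, first_name TEXT, last_name TEXT, status TEXT
    );
    INSERT INTO src_customer VALUES
        (1, 'Ada',   'Lovelace', 'active'),
        (2, 'Alan',  'Turing',   'inactive'),
        (3, 'Grace', 'Hopper',   'active');

    -- The "virtual table": renamed, filtered, and with concatenated columns.
    CREATE VIEW v_active_customer AS
    SELECT id AS customer_id,
           first_name || ' ' || last_name AS full_name
    FROM src_customer
    WHERE status = 'active';
    """
)

# The data consumer only sees the virtual table.
rows = conn.execute(
    "SELECT customer_id, full_name FROM v_active_customer ORDER BY customer_id"
).fetchall()
```

The consumer never needs to know the source's column names or that inactive customers exist.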
Layers of Virtual Tables
[Figure: layers of virtual tables inside the Data Virtualization Server]
Data consumption layer
Enterprise data layer
Data source layer
Improved Performance Through Query Pushdown
[Figure: two scenarios in which an application sends a request consisting of eight operations to the data virtualization server]
Scenario 1: six operations processed by the data source, two operations processed by the data virtualization server
Scenario 2: two operations processed by the data source, six operations processed by the data virtualization server
Push Down Query Processing
(1) Incoming query, received by the Data Virtualization Server:
SELECT C2, CONCAT(C5, C6)
FROM Virtual_table
WHERE C1 >= 1000
AND C4 BETWEEN 10 AND 20
(2) Query executed by the data source:
SELECT C2, C5, C6
FROM Table
WHERE C1 >= 1000
AND C4 BETWEEN 10 AND 20
(3) Query executed by the Data Virtualization Server on the result:
SELECT C2, CONCAT(C5, C6)
FROM Result
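The three steps on this slide can be mimicked in a few lines. In this sketch an in-memory SQLite database plays the data source and plain Python plays the data virtualization server; the table name t and the rows are made up, and the comparison on C1 is read as "greater than or equal to". The assumption, as the slide appears to imply, is that the source can apply the filters but not the concatenation.

```python
import sqlite3

# The "data source": a SQL database holding the real table.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE t (C1 INTEGER, C2 TEXT, C4 INTEGER, C5 TEXT, C6 TEXT)"
)
source.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (500, "a", 15, "x", "1"),   # filtered out: C1 too small
        (1000, "b", 10, "y", "2"),
        (2000, "c", 25, "z", "3"),  # filtered out: C4 outside 10..20
        (3000, "d", 20, "w", "4"),
    ],
)

# Step 2: the filters are pushed down, so only matching rows leave the source.
result = source.execute(
    "SELECT C2, C5, C6 FROM t "
    "WHERE C1 >= 1000 AND C4 BETWEEN 10 AND 20 ORDER BY C2"
).fetchall()

# Step 3: the data virtualization server finishes the work on the small result.
answer = [(c2, c5 + c6) for c2, c5, c6 in result]
```

Only two of the four rows cross the boundary between source and server; the concatenation runs on those two.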
Accessing Files
(1) Incoming query, received by the Data Virtualization Server:
SELECT C2, CONCAT(C5, C6)
FROM Virtual_table
WHERE C1 >= 1000
AND C4 BETWEEN 10 AND 20
(2) Query executed on the data source (a file):
SELECT C1, C2, C4, C5, C6
FROM File
(3) Query executed by the Data Virtualization Server on the result:
SELECT C2, CONCAT(C5, C6)
FROM Result
WHERE C1 >= 1000
AND C4 BETWEEN 10 AND 20
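With a plain file as the source there is nothing to push down to: the file can only hand over all its rows, and the filtering and concatenation happen in the data virtualization server. A small sketch of that situation, using a made-up CSV file held in a string; the comparison on C1 is read as "greater than or equal to".

```python
import csv
import io

# Hypothetical file content; a real source would be a file on disk or in
# object storage.
FILE_CONTENT = """C1,C2,C4,C5,C6
500,a,15,x,1
1000,b,10,y,2
2000,c,25,z,3
3000,d,20,w,4
"""


def read_all_rows(text):
    """Step 2: the file cannot filter, so every row is transferred."""
    return list(csv.DictReader(io.StringIO(text)))


def run_query(rows):
    """Step 3: filters and concatenation run in the data virtualization server."""
    return [
        (r["C2"], r["C5"] + r["C6"])
        for r in rows
        if int(r["C1"]) >= 1000 and 10 <= int(r["C4"]) <= 20
    ]


rows = read_all_rows(FILE_CONTENT)
answer = run_query(rows)
```

All four rows are transferred to obtain a two-row answer, which is the cost of a source that cannot process queries itself.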
Data Virtualization and Cloud Integration
[Figure: business users access on‐premises data sources and cloud‐based data sources through data virtualization]
6. Use Cases Must Match
Market of Database Servers
[Figure: the overview of database server categories shown under guideline 1, repeated here]
Specialization of Cars
7. Store All Data
New Requirements for Transactional Systems
Keep track of history
• Simplifies rest of the data architecture
• Everything becomes versioned
• Don’t throw data away (unless regulations require it)
Log who does what, when, why, and where
• For analytical purposes
• Separate small tables or one large table?
Develop an automatic replication mechanism for all data
Correct data at the door
Design them to support reporting and analytics
Minimize design exceptions!
Responsible for enforcing data quality (possible?)
Application
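The "log who does what, when, why, and where" requirement maps naturally onto an append-only event record. The sketch below is one possible shape, using a single log rather than separate small tables; AuditEvent, log_event, and the sample values are hypothetical.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One immutable log entry."""

    who: str    # user or service that acted
    what: str   # the action that was performed
    when: str   # ISO 8601 timestamp
    why: str    # business reason, if known
    where: str  # application or channel


audit_log = []  # append-only: entries are never updated or removed


def log_event(who, what, why, where, when=None):
    """Append an event; the timestamp defaults to the current UTC time."""
    event = AuditEvent(
        who=who,
        what=what,
        when=when or datetime.now(timezone.utc).isoformat(),
        why=why,
        where=where,
    )
    audit_log.append(event)
    return event


# Hypothetical usage.
event = log_event(
    "alice", "UPDATE customer 42", "address change", "crm-web",
    when="2019-05-01T10:00:00+00:00",
)
```

Calling asdict(event) yields a flat record that can be written to a table or a stream for later analysis.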
8. Choose Productivity Over Performance
Example: SnowflakeDB
[Figure: one central database shared by separate virtual warehouses]
• Virtual warehouse for data scientists
• Virtual warehouse for marketing dashboards
• Virtual warehouse for self‐service BI finance
• Virtual warehouse for external parties (shown twice)
• Virtual warehouse for testing and dev
The Effect of SnowflakeDB
[Figure: SnowflakeDB shown against the classic chain of source systems, ETL, staging area, data warehouse, data marts, and reporting]
Development Steps for Data Science
Defining goals
Data selection
Data understanding
Data enrichment
Data cleansing
Data coding/binning/bucketing
Creating analytical model
Analytics
Interpreting & understanding results
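As a small illustration of the "data coding/binning/bucketing" step, the sketch below maps a numeric value onto labeled ranges using only the standard library. The age boundaries and labels are made up for the example.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries: a value equal to an edge falls in the
# bucket that starts at that edge.
AGE_EDGES = [18, 35, 65]
AGE_LABELS = ["<18", "18-34", "35-64", "65+"]


def bin_age(age: int) -> str:
    """Return the label of the bucket that contains the given age."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


if __name__ == "__main__":
    for age in (17, 18, 34, 35, 64, 65):
        print(age, "->", bin_age(age))
```

Many of the tools on the next slide offer this as a built-in operation; the point here is only what the step does to the data.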
Wide Range of Tools Used by Data Scientists
Spreadsheets: Excel
Self‐service BI tools: Alteryx, PowerBI, Spotfire, QlikSense, Tableau, …
Programming languages: Python, R, Scala, …
ML automation tools: BigML, DataRobot, SAS Factory Miner, …
Data science workbenches: Amazon SageMaker, Cloudera Data Science Workbench, …
Many others
DataRobot
9. Cloud is Networks
My Personal View of the Network
[Figure: System A and System B connected by a network]
Data is Pushed to the Processing
Push the Processing to the Data!
10. Apply a Holistic Design Approach
From a Linear to a Holistic Approach
Data and Solutions Architectures
Data Storage and Processing Technology
Design Principles
Data Security and Privacy
Closing Remarks
A bad design on premises is still a bad design in the cloud
The cloud is not a solution for every problem
What are the business benefits?
www.r20.nl