Data Regions: Modernizing your company's data ecosystem
Transcript of Data Regions: Modernizing your company's data ecosystem
Copyright © 2015, SAS Institute Inc. All rights reserved.
Data Regions: Modernizing Your Company’s Data Ecosystem
Evan Levy, Vice President, Data Management Programs, SAS
EvanJayLevy
Copyright © 2016, SAS Institute Inc. All rights reserved.
A 20 Year Old Paradigm
The Change Data Perspective

Traditional Assumption
• All data originates from internal systems
• The company runs on OLTP systems
• Users have the BI/DW to address their reporting and analysis needs
• The Data Warehouse is the data source

Today’s Reality
• Users require data from many sources (and the quantity is growing)
• Business Operations rely on OLTP, Data, and Analytics
• Most data is internal; >35% is external
• We have multiple analytical systems: data mining, exploration, sandboxes, etc.
Data Challenges…
“Why is all the data put into the warehouse? Only 3 people need to use the data”
“Can you tell me what data we purchased from outside vendors?”
“Why will it take you 30 days to load data? I can cut and paste it into my server in 4 minutes.”
“We have to standardize business terminology. We’ve learned that data governance is critical.”
“Why do I have to work around the ‘infrastructure’? Shouldn’t it be built for my needs?”
“You send me a file from SalesForce every month, and the layout changes every month. And you don’t tell me.”
“We have data all over (systems, the cloud, external apps, etc.). Why don’t we have a catalog of the sources?”
“Finance wants all data reconciled. I can’t wait. Why do I have to suffer from their requirements?”
Data Characteristics
[Diagram: Data and its five characteristics: Audience, Access, Structure, Domain, Integrity]
Data Characteristics
Audience
The individual user (and their skills and data needs)
• Report users: Reviewing data about known situations
• DW Developers: Uses ETL tools to retrieve and load data
• Analytic Developers: Builds analytical models to manipulate known data
• Data Scientists: Analyzes any available data to identify new trends
• BI Developers: Building reports using structured data
• Business Analyst: Analyzing data to form a new hypothesis
• Application Developers: Develops code to navigate any available data source
Data Characteristics
Access
The methods, interfaces, and tools used to access the data
• A business analyst running a report on DBMS tables
• Custom code navigating a flat file (to retrieve specific values)
• Code calling platform-specific APIs for data access
• A cloud application sending transactions
• An application listening / receiving event streams
• A data scientist playing with data in a sandbox
Data Characteristics
Structure
The structure and organization of the data content
• Structured Data
• Semi-Structured Data
• Unstructured Data
Data Characteristics
Domain
The business context for data usage
• Enterprise
• Business Unit
• Organization
• Project
• Individual
Data Characteristics
Integrity
The format, typing, and accuracy of the data

203.93.245.97 - oracleuser [28/Sep/2000:23:59:07 -0700] "GET /files/search/search.jsp?s=driver&a=10 HTTP/1.0" 200 2374 "http://datawarehouse.oracle.co/contents.htm" "Mozilla/4.7 [en] (WinNT; I)"

Client        John Smith
Username      Oracleuser
Request Date  9/28/2000
Request Time  23:59:07
Status Code   OK
Browser       Netscape

P;ECalibri;M220;SB;L10 P;ECalibri;M220;L11 P;ECalibri;M220;SI;L24 P;ECalibri;M220;SB;L9 P;ECalibri;M220;L10 P;ESegoe UI;M200;L9 P;ESegoe UI;M200;SB;L9 P;ECalibri;M180;L9 F;P0;DG0G8;M300 B;Y12;X5;D0 0 11 4 O;L;D;V0;K47;G100 0.001 F;M495;R1 F;SM24;Y1;X1 C;K"name" F;SM24;X2 C;K"Shares" F;SM24;X3 C;K"Quote/ Price" F;SM24;X4 C;K"cost/ share" F;SM24;X5 C;K"total cost" F;SM24;Y2;X1 C;K"aapl" F;P4;FF2G;SM24;X2 C;K1454.4024 F;SM24;X3 C;K126.85 F;SM24;X4 C;K79.006952 F;P4;FF2G;SM24;X5 C;K114907.9 F;SM24;Y3;X1 C;K"axp" F;P4;FF2G;SM24;X2 C;K1454.4108 F;SM24;X3 C;K79.27 F;SM24;X4 …

name    Shares     Quote/Price   cost/share   total cost
aapl    1,454.40   126.85        79.006952    114,907.90
axp     1,454.41   79.27         84.671889    123,147.71
bmy     3,666.51   63.95         43.25259     158,586.21
brk.b   1,000      143.46        119.3527     119,352.70
celg    1,000      116.44        102.47094    102,470.94
chl     500        71.4          71.4179      35,708.95
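The web log entry on this slide can be turned into typed fields with a few lines of code. The following is a minimal sketch, not the presenter's method: it assumes the entry follows the common "combined" web server log layout, and the names `LOG_PATTERN`, `SAMPLE_LINE`, and `parse_log_line` are hypothetical. The slide's "Client John Smith" and "Browser Netscape" values cannot be derived from the line alone; they would need lookups that are not shown here.

```python
import re
from datetime import datetime
from http import HTTPStatus

# Assumed layout: host ident user [timestamp] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The log entry shown on the slide.
SAMPLE_LINE = (
    '203.93.245.97 - oracleuser [28/Sep/2000:23:59:07 -0700] '
    '"GET /files/search/search.jsp?s=driver&a=10 HTTP/1.0" 200 2374 '
    '"http://datawarehouse.oracle.co/contents.htm" "Mozilla/4.7 [en] (WinNT; I)"'
)


def parse_log_line(line):
    """Turn one raw log line into typed fields; raise ValueError if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("line is not in the expected log format")
    stamp = datetime.strptime(match["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    status = int(match["status"])
    return {
        "host": match["host"],
        "username": match["user"],
        "request_date": stamp.date(),
        "request_time": stamp.time(),
        "status_code": status,
        # HTTPStatus raises ValueError for codes it does not know.
        "status_text": HTTPStatus(status).phrase,
        "user_agent": match["agent"],
    }


record = parse_log_line(SAMPLE_LINE)
print(record["username"], record["request_date"], record["request_time"], record["status_text"])
```

The typed record is what a downstream region can rely on; the raw line is what the source actually delivers.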
The 5 Characteristics of Data
[Diagram: Data and its five characteristics: Audience, Access, Structure, Domain, Integrity]
Challenging the Existing Data Paradigm
Support numerous new data sources
Establish a shared source staging area
Allow “trial & error” analysis for all users
Support Self Service Data (ETL, report, analysis, etc.)
Support different levels of data acceptance
Data Regions
[Diagram: Inbound Data (Internal Applications, Cloud Applications, Data Streams, Files, Services, Messages); regions: Source Onboarding, Source Data Repository, Sandbox, Reporting & BI, Enterprise View, Data Exploration, Advanced Analytics & Modeling]
Data Regions
Addressing an Enterprise Data Need
[Diagram: Inbound Data (Internal Applications, Cloud Applications, Data Streams, Files, Services, Messages); regions: Source Onboarding, Source Data Repository, Sandbox, Reporting & BI, Enterprise View, Data Exploration, Advanced Analytics & Modeling]
• Create an environment that fits user needs (not IT convenience)
• Support data onboarding and distribution as a production need
• Support a diverse set of data usage needs
• Address the complexities of data movement
• Reduce resource/skill overlap across the company
Data Regions
Source Onboarding
Audience: Source Onboarding developers only; receiving for the Source Data Repository
Access: Supports multiple delivery methods: transactions, messages, bulk formats.
Structure: Data layout based on source system. Likely dynamic & volatile
Domain: N/A. This detail is implicit with the data source and the supplier.
Integrity: N/A. Data details are defined by the data supplier.
• Manages the delivery of data from internal & external sources
• Holds data until acceptance is complete; data is then moved to the Source Data Repository
• Centralized support for sophisticated data capture methods (ESP, 3rd party data delivery, API/messaging, etc.)
• Productionalizes source data capture, identification and sharing
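The hold-then-move behavior described on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any product: `Delivery`, `SourceOnboarding`, and the acceptance rule (a non-empty payload) are hypothetical, and a plain dictionary stands in for the Source Data Repository.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    source: str    # hypothetical source name, e.g. "crm_export"
    payload: str   # raw content exactly as the supplier delivered it


class SourceOnboarding:
    """Holds deliveries until they pass acceptance, then hands them to the repository."""

    def __init__(self, repository):
        # Stand-in for the Source Data Repository: dict of source name -> list of payloads.
        self.repository = repository
        self.held = []

    def receive(self, delivery):
        """Take delivery from a supplier and hold it until acceptance."""
        self.held.append(delivery)

    def process(self, is_acceptable):
        """Move accepted deliveries to the repository; keep the rest on hold."""
        still_held = []
        for delivery in self.held:
            if is_acceptable(delivery):
                self.repository.setdefault(delivery.source, []).append(delivery.payload)
            else:
                still_held.append(delivery)
        self.held = still_held


repository = {}
onboarding = SourceOnboarding(repository)
onboarding.receive(Delivery("crm_export", "id,name\n1,Acme\n"))
onboarding.receive(Delivery("vendor_feed", ""))      # an empty delivery
onboarding.process(lambda d: len(d.payload) > 0)     # assumed acceptance rule
print(sorted(repository), [d.source for d in onboarding.held])
```

Keeping the acceptance rule as a parameter reflects the slide's point that data details are defined by the supplier, so the check will differ per source.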
Data Regions
Source Data Repository
• Stores and retains all source data content; reduces enterprise storage requirements
• Establishes centralized registry of available data sources
• Reflects a defined data layout (independent of source changes)
• Alleviates developers’ need to learn data navigation, layout, naming conventions on dozens of source systems
Audience: Data Integration (Developers – DW, Application, Data Scientists, etc.)
Access: Usually file oriented (transaction and other access based on situation)
Structure: Company-centric, documented layout; incl. structured & unstructured
Domain: N/A. Data reflects source
Integrity: Company-centric format; data quality and accuracy not addressed.
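A centralized registry of available data sources could be as simple as the sketch below. It is an assumption-laden illustration: `SourceEntry`, `SourceRegistry`, the field names, and the sample sources and suppliers are all hypothetical. It shows how such a registry would answer a question like "what data did we purchase from outside vendors?".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceEntry:
    name: str
    origin: str      # "internal" or "external"
    supplier: str
    layout: tuple    # documented, company-defined field names


class SourceRegistry:
    """In-memory catalog of data sources, keyed by source name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        if entry.name in self._entries:
            raise ValueError(f"source already registered: {entry.name}")
        self._entries[entry.name] = entry

    def find(self, origin=None):
        """Return entries sorted by name, optionally filtered by origin."""
        entries = sorted(self._entries.values(), key=lambda e: e.name)
        return [e for e in entries if origin is None or e.origin == origin]


registry = SourceRegistry()
registry.register(
    SourceEntry("orders", "internal", "Order Management", ("order_id", "customer_id", "amount"))
)
registry.register(
    SourceEntry("credit_scores", "external", "Example Vendor", ("customer_id", "score"))
)
external_names = [e.name for e in registry.find(origin="external")]
print(external_names)
```

Storing the company-defined layout with each entry is one way to keep consumers independent of source-side changes.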
Data Regions
Data Exploration
• Supports one-off, in-depth business analysis using any data
  ─ Environment is permanent but resource usage is very transient
  ─ Does not support production application access or deployment
• Often a general purpose platform that can support numerous technologies (Big Data, files, RDBMS, advanced analytics, etc.)
• A walled-off, protected data scientist-centric environment
Audience: Data Scientists & Analytics Developers (unable to be supported by sandbox)
Access: All access methods due to the “from scratch” nature of environment
Structure: All data layouts. (Unstructured likely due to focus on new concept development)
Domain: Typically enterprise or line of business level
Integrity: Data transformed/standardized to streamline exploration efforts (often ignored for new or unknown data sources)
Data Regions
Enterprise View
• Contains multiple integrated subject areas (w/ long-term history)
• Content reflects enterprise trusted (and corrected) data
• Includes metadata (terms, definitions, lineage, etc.)
• Supports query processing and data provisioning
  ─ Online end-user queries and reporting
  ─ Data provisioning to analytical and transactional systems
  ─ Content continually updated (where possible)
Audience: All users. Most access will occur via query tools or data manipulation/ETL tools
Access: Usually query-based access (w/ existing tools). Unstructured requires APIs
Structure: Data is usually structured (unstructured requires special tools/extensions)
Domain: Enterprise level. Other domains may use content for provisioning purposes
Integrity: Reflective of enterprise terminology and value standards
Data Regions
Sandbox
• Allows users to extend their analysis with custom data
  ─ Supports structured data and queries using existing tools/technologies
  ─ Focused on supporting additional (external) data
• Environment is temporary; does not support production
  ─ Walled-off environment; reports or data not distributable
• Allows for business-level data discovery and exploration
  ─ Supports one-off user data needs
Audience: Advanced business users. Requires DBMS query and data integration skills
Access: Data is accessible via SQL/table environment.
Structure: Data content is structured and RDBMS oriented (goal is data variety)
Domain: Any/All domains (enterprise to individual)
Integrity: Enterprise data is standardized/corrected. Other data must be addressed by user
Data Regions
Reporting and Business Intelligence
• Supports defined reporting and ad hoc analysis (departmental data marts)
• Supports an application- or tool-centric view of data
  ─ Simplifies tool access and data manipulation, or
  ─ Reflects unique business (organization) view of data details
• Requires additional technical staff resources
  ─ ETL processing for additional sources, aggregates, hierarchies, etc.
  ─ Query and usage support for non-enterprise data
Audience: Business users focused on using standard reports and content
Access: Usually SQL-based access. Some data may be tool-centric (e.g. OLAP cubes)
Structure: Usually structured data, reflecting rows and columns
Domain: Likely to use enterprise data. Additional data may reflect different structure or domain as needed.
Integrity: Enterprise data is standardized/corrected. Other data must be addressed by user
Data Regions
Advanced Analytics & Modeling
• A processing environment that can support advanced analytics
  ─ Typically general purpose processing platforms with inexpensive directly attached storage
  ─ Data is structured and often stored in highly denormalized structures
  ─ Usually driven by a specialized tool or language
• Typically small, high-value user audience
• Production-supported environment. Data & results are distributed
Audience: Highly skilled technical staff (data scientists, developers with advanced analysis skills)
Access: Data accessed via specialized tools using standard and custom access methods.
Structure: Data is usually structured; may process unstructured data into structured content
Domain: Typically enterprise-level data. Business drivers are often specific to organization
Integrity: Data is often cleansed and standardized
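Every region in this deck is described by the same five characteristics, so the region slides can be captured as records and compared side by side. The sketch below is one possible representation, not part of the original material: `RegionProfile`, `REGIONS`, and `compare` are hypothetical names, and the values are abbreviated from three of the region slides.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RegionProfile:
    name: str
    audience: str
    access: str
    structure: str
    domain: str
    integrity: str


# Values abbreviated from the Source Onboarding, Sandbox, and Enterprise View slides.
REGIONS = (
    RegionProfile(
        "Source Onboarding", "Source Onboarding developers",
        "Transactions, messages, bulk formats", "Based on source system", "N/A", "N/A",
    ),
    RegionProfile(
        "Sandbox", "Advanced business users", "SQL/table environment",
        "Structured, RDBMS oriented", "Any/all domains",
        "Enterprise data standardized; other data addressed by user",
    ),
    RegionProfile(
        "Enterprise View", "All users", "Usually query-based", "Usually structured",
        "Enterprise level", "Enterprise terminology and value standards",
    ),
)

# The five characteristics, taken from the dataclass definition itself.
CHARACTERISTICS = tuple(f.name for f in fields(RegionProfile) if f.name != "name")


def compare(characteristic):
    """Return {region name: value} for one of the five characteristics."""
    if characteristic not in CHARACTERISTICS:
        raise ValueError(f"unknown characteristic: {characteristic}")
    return {region.name: getattr(region, characteristic) for region in REGIONS}


for region_name, value in compare("domain").items():
    print(f"{region_name}: {value}")
```

Comparing one characteristic across regions makes the unique audience and domain combinations easier to spot.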
Data Services
[Diagram: Data Regions (Source Onboarding, Source Data Repository, Sandbox, Reporting & BI, Enterprise View, Data Exploration, Advanced Analytics & Modeling) with Data Services: Data Transformation, Data Quality, Data Governance, Metadata]
Getting Started, Moving Forward…
• Evaluate the diversity of audiences and domains
  − Understand the unique combinations – those dictate the complexity of your environment
  − Review the external data that is already in use
• Extend your environment one region at a time
  − Focus on adding (or remediating) regions based on business need
• Sharing data is not a courtesy – it’s a production need
  − Data provisioning and integration is a costly activity; it should be addressed with “economies-of-scale” methods
  − Establishing repositories (with card catalogs) to provide “raw” and “approved” data is a necessity
THANKS!
www.EvanJLevy.com
@EvanJayLevy