The IT Data Lake
Leveraging IT Data to Improve IT and Business Outcomes
November 03, 2017

Michael Sick, Senior Manager, Ernst & Young LLP, Michael.Sick@ey.com, +1 919.523.4447
Sriram Kedhar, Manager, Ernst & Young LLP, Fnu.Sriram.Kedhar@ey.com, +1 610.504.3796
On Data and Digital
It is broadly accepted that data is critical to digital transformation
Data! Data! Data! I can’t make bricks without clay! - Sir Arthur Conan Doyle
- Data is essential to effective decision making
Data is a precious thing and will last longer than the systems themselves - Tim Berners-Lee
- Data is a long lived asset
Takeaway: It has long been recognized that data is key to solid decision making, and it has recently been emphasized that the transition to 'digital' approaches requires even greater attention to data.
Advanced Analytics
Putting Data and Analytics to Work
Business processes being transformed:
► Order to Cash
► Compliance Reporting
► Work Order Management
► Quote, Invoice and Contract Management
► Contract Analysis
► Lease Revenue / NOI Analysis
► Property Performance Analysis
► Preventive Maintenance
► Appraisal Support / Analysis

Analytics capabilities applied:
► Advanced analytics and machine learning: recognition, NLP, deep learning, neural networks
► RPA: interactive, speech / tech enabled
► Robotics: autonomous, industrial, programmable
► Analytics: hypothesis-driven, statistical, classification, autonomous analytics
Advanced analytics are being used to transform a broad set of business processes across many sectors.
What Are Leaders Doing? Survey Says
How do you leverage data & analytics?
Leaders use advanced analytics to drive double-digit growth of 15%+ in revenues and operating margins, as well as improved risk profiles.
► 70% of "leading" organizations use advanced analytics to overhaul business strategies, changing the nature of competitive differentiation.
► 75% of top performers operate a full range of enterprise, departmental, and line-of-business analytics groups that work within a well-aligned framework.
Hint: If you intend to win at your business, the use of advanced analytics is essential.
But What About IT?
Our experience suggests that IT is lagging in digital transformation
► Q&A: Have you taken part in a project using advanced analytics to support the 'business' in their jobs?
► Q&A: Have you taken part in a project using advanced analytics to support IT?
► Q&A: Directly involved or not, does your IT department use data to optimize its software development / SDLC practices? To identify common configuration defects in production deployments?
► Our experience suggests that larger IT organizations at highly digital businesses are starting to leverage IT data to improve their outcomes.
► The result: the IT Data Lake and its associated processes.
The IT Data Lake – Sample Use Cases
There are many use cases for realizing value in the IT Data Lake. Below are several that we have commonly encountered.
► Transaction Tracing – root service analysis (dev->prod)
► Audit Logging – Transaction validation
► Performance Analysis – with drill down to components
► Performance Impact – overlay of performance to user
► Data Quality Dashboard & Alerts – rapid identification
► Predictive Model Application – application of predictive models to incoming data streams; models are often time-series / window based
► Data Quality / Correction – use of models and corrected master / reference data sets to enrich incoming data streams
► Data Storage Optimization – recommended platform
► Data Compute Optimization – recommend platform
► Configuration Graph – build a graph of IT assets / configuration
► Blueprint & Deployment Analysis – find key gaps
► Config. Mgmt. Gap Analysis – what’s missing / wrong
► Incident Analysis – root cause analysis
► Audit Archive – long term transaction validation
► Bad Build Identification – find builds with likely defects
► Performance Analysis – models for performance norms
► Data Correction Models – creation of rule and categorizations based models for the identification (and potential correction) of data quality issues
► Internal Threat Analysis – augment the pure security view of internal threats / behavior
Data Lake contents:
► Raw Data – archived to historical
► Events – from predictive and DQ analysis
► Models – for prediction and DQ
► Master Data – corrected MRD data
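Several of the tracing use cases above reduce to correlating log records by a shared transaction identifier. A minimal Python sketch, assuming log lines carry a `txn_id` field; the log format and field name are hypothetical, not a prescribed standard:

```python
import re
from collections import defaultdict

# Hypothetical log format: "<timestamp> <system> txn_id=<id> <message>"
LOG_PATTERN = re.compile(r"^(\S+) (\S+) txn_id=(\S+) (.*)$")

def trace_transactions(log_lines):
    """Group log records by transaction id so a single business
    transaction can be followed across systems (dev -> prod)."""
    traces = defaultdict(list)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match:
            timestamp, system, txn_id, message = match.groups()
            traces[txn_id].append((timestamp, system, message))
    # Order each trace by timestamp to reconstruct the flow
    return {txn: sorted(events) for txn, events in traces.items()}

logs = [
    "2017-11-03T10:00:01 web txn_id=42 request received",
    "2017-11-03T10:00:02 app txn_id=42 order validated",
    "2017-11-03T10:00:03 db txn_id=42 ERROR insert failed",
]
trace = trace_transactions(logs)["42"]  # the last hop shows where the failure surfaced
```

In a real deployment this grouping would be done by the logging platform itself; the sketch only illustrates why a consistent transaction identifier across systems makes root-cause analysis a simple join.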
IT Data Lake – The Layering of Value
We have found the value of the IT Data Lake to be both iterative over time and additive by type of analysis. Real-time analysis supports historic analysis with a steady flow of data; historic analysis supports predictive analysis with master data and improved data quality.
[Figure: value vs. time (months), showing three additive stages]

1. REAL-TIME ("Trace") – mining logs for extracting insight in the "now"
   ► How do I capture the flow of business information in my vast universe of IT systems?
   ► What is the source of a given failure?
   Audience: Developers, Operational Support, L1-L3 Support

2. HISTORIC ("Trend") – collecting and reporting on historical data
   ► What can I learn from analyzing and reporting on a vast amount of data collected over time?
   ► What types of trends exist in my easily accessible (free) data?
   Audience: + IT Management

3. PREDICTIVE ("Think") – modeling with logs and contextual data
   ► Why do IT projects fail?
   ► What makes users stay on my site?
   ► Where do my employees go when they have a question?
   Audience: + Business owners
The IT Data Lake – Target Reference Architecture
The IT Data Lake contains log data from various systems and master / reference data from core IT systems. It is capable of leveraging IT data to support decisions that optimize IT functions and the business that IT supports.
Metadata tagging for the IT Data Lake
[Figure: data flow through the architecture, numbered 1-7, governed end to end by Data Ingestion and Governance]

► Data Sources: RDBMS, flat files, …
► Data Ingestion: data import with metadata tagging (schema-less write)
► IT Data Lake: native data + tags (HDFS)
► Canonical Data Model / Enterprise Data Dictionary
► Semantic Layer: data mapping (schema on read)
► Calculation Engines (optional)
► End Users / Channels: Kibana, Hive, Arcadia; consumed by Developer, Tester, Manager, Business
Data Flow Steps:
1. Data Sources – systems of origin / systems of record
2. Data Ingestion – schema-less, tool-based automatic data import; attaches metadata tags (attribute names, domains, definitions) from a Canonical Data Model or a standardized data dictionary during ingestion
3. IT Data Lake – data and metadata tags loaded to the Data Lake (schema-less write to HDFS). Data is stored in data tables; tags are stored in metadata stores / tables
4. Canonical Data Model – a standardized data model or data dictionary defining data elements used during consumption across various channels
5. Semantic Layer – identifies and maps data elements from the data lake to output using the metadata tag information
6. Calculation Engines (optional) – any additional calculations or computations needed before data is presented for consumption
7. End Users / Channels – end users or analytics / reporting applications consuming data
We use a metadata tagging methodology to implement our clients' IT data lakes.
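The ingestion-time tagging described above can be sketched as a dictionary-driven lookup. A minimal sketch, assuming a simple in-memory slice of a canonical dictionary; the field names, domains and definitions are illustrative, not a real canonical model:

```python
# Illustrative slice of a canonical data dictionary: maps raw source
# field names to standardized attribute names, domains and definitions.
DATA_DICTIONARY = {
    "cust_nm": {"attribute": "customer_name", "domain": "Party",
                "definition": "Legal name of the customer"},
    "ord_amt": {"attribute": "order_amount", "domain": "Sales",
                "definition": "Total order value in USD"},
}

def tag_record(raw_record):
    """Schema-less ingest: keep the native value untouched, attach
    metadata tags from the dictionary where a field is recognized."""
    tagged = {}
    for field, value in raw_record.items():
        entry = DATA_DICTIONARY.get(field)
        tagged[field] = {"value": value,
                         "tags": entry if entry else {"attribute": None}}
    return tagged

record = tag_record({"cust_nm": "Acme Corp", "ord_amt": 125.50, "misc": "x"})
```

The point of the design is that unrecognized fields still land in the lake (schema-less write); tagging only enriches, never filters, so the semantic layer can map recognized attributes at read time.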
IT Data Lake Reference Implementation
Elasticsearch, Logstash and Kibana (ELK)
Infrastructure Logs
Application Logs
Business Logs
Monitoring Logs
Security Logs
Distributed Logs
Logstash (collector)

Users: Developer, Tester, Security, Manager, Legal
Enterprise Logging Analytics Data Products
Transaction Tracing
► Trace transactions across systems
► View transaction details
► Query transactions with text analytics
► View trends in transactions
Business Identifiers
► Logging statements can be decorated with identifiers that have business meaning
► e.g., Loan Id, Security Id, Bank Id
► Users able to query and aggregate on business identifiers
Configured Reports
► Pre-configured reports can be created
► Per System reports for quick health views
► Purpose specific reports like “Top N Security Concerns”
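The business-identifier decoration above can be made queryable by extracting key=value pairs at ingest time. A sketch assuming a simple `key=value` decoration convention; the identifier names come from the slide, but the message format is a hypothetical example:

```python
import re

# Business identifiers we expect developers to decorate log lines with
ID_PATTERN = re.compile(r"\b(loan_id|security_id|bank_id)=(\S+)")

def extract_identifiers(message):
    """Pull business identifiers out of a decorated log message so
    users can later query and aggregate on them."""
    return dict(ID_PATTERN.findall(message))

msg = "settlement complete loan_id=L-991 bank_id=B-07 latency_ms=12"
ids = extract_identifiers(msg)  # non-business fields are ignored
```

In the ELK stack this extraction is normally done by a Logstash filter rather than custom code; the sketch just shows the convention that makes aggregation by Loan Id or Bank Id possible downstream.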
1. Log data sent to the Logstash collector and ingested in real time
2. Data sent to the core logging system for near-real-time ingestion and query
3. If implemented, deeper analytics performed in batch
4. Deeper analytics can be imported back into the core system
5. A unified query system implemented for time-series-oriented searches
6. A variety of users can be configured to securely query the system
7. A variety of data products can be created / viewed via the system
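The unified, time-series-oriented search described above might be expressed as an Elasticsearch query body like the following. The `@timestamp` and `message` field names follow common Logstash conventions, but the exact fields in any given deployment are an assumption:

```python
import json

def build_log_query(text, start, end, size=100):
    """Build an Elasticsearch query body combining full-text search
    over the log message with a time-range filter, newest first."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "must": [{"match": {"message": text}}],
                "filter": [
                    {"range": {"@timestamp": {"gte": start, "lte": end}}}
                ],
            }
        },
    }

query = build_log_query("payment timeout", "now-1h", "now")
print(json.dumps(query, indent=2))
```

This is the query shape Kibana generates under the hood for its search bar plus time picker; building it programmatically is what lets data products such as "Top N Security Concerns" be pre-configured.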
Enterprise Logging Analytics Platform components:
► Random-access storage and search index <Elasticsearch>
► Discovery / search analytics <Elasticsearch>
► Sequential (file) access storage <Hadoop-HDFS>
► Batch analytics <Hive>
► Batch logging analytics
► Real-time logging analytics reporting <Kibana>
The IT Data Lake – Lessons Learned (1 of 2)
As EY has helped its clients build out IT Data Lakes, we have learned a number of key lessons that we have turned into leading practices for execution.
► Data Rich – ensure your ecosystem is data rich by starting with the near-real-time use case and expanding it to a broad set of applications. This supports historic and predictive use cases by ensuring that key data sets have already been landed.
► Likewise, a focus on the Historic use cases prior to the Predictive use cases will help ensure that key master data sets have been landed and examined.
► Predictive and other advanced analytic data methods can be difficult. By ensuring that key data is landed and well known, the data wrangling “tax” on each predictive use case is reduced.
► Master Data – the master and reference data for IT data is generally less examined and lower quality than master and reference data for business data sets. An initial focus on landing and addressing quality issues of key data sets (Configuration and Asset Management, Project Data, Identity Data …) is critical and worth the investment.
► Value Focus – use case priority should be largely value focused, especially as the IT Data Lake and surrounding program get started. It is critical that IT demonstrate that it can a) reduce operational costs, b) improve operational metrics (uptime, user experience), and c) reduce long-term risk via the analysis of IT data.
► Use Case Portfolio – build, communicate, and validate a multi-year portfolio of use cases and work streams that demonstrates value justifying the overall IT Data Lake effort. (We suggest at least a 10:1 ratio of value to potential expense, since the likely return will be lower: some use cases will underperform, and some potential value will remain unrealized across all use cases.)
The IT Data Lake – Lessons Learned (2 of 2)
► All SDLC Environments – many efforts start by focusing only on production logs / data. We have found this to be a mistake. In the near real-time use case, allowing developers to quickly identify root cause / component issues delivers immediate value. For SDLC related use cases for Historic and Predictive analysis, the data showing trends and root cause of issues is found during the development and testing cycles.
► Invest in Graph – create a graph of IT dependencies and ensure that it can be easily joined to the transaction level logs and generated predictive and alert events. First, nobody in your IT organization has the big picture in how everything fits together. A visually explore-able model provides valuable insights on its own. Second, the graph is critical for allowing Historic and Predictive analysis to understand how components are / are not connected.
► Invest in Data Model – we have found two data models critical to establish for the IT Data Lake. The first is the data model for the core IT Master and Reference Data (MRD), with connectivity to the underlying transactional data (logs). Key domains for this data model include Configuration and Asset Management, Software Development Lifecycle, and Policy (including security policy). The second data model covers the connection between IT data and key business process steps and identifiers. For example, for a manufacturer these might include Vendor Id, Material Codes, User Id, Plant … These will be used to build metrics and views that show the business impact of IT activities.
► Splunk Zen – Splunk is a pervasive tool for log analysis and is the incumbent at many companies. Splunk is a sophisticated tool with many compelling features. That said, the Splunk business model of charging based on the rate of ingestion makes it a difficult sell for broad use in the IT Data Lake. For near real-time, it is unlikely that any enterprise will want to target Splunk for all logs from development to production. The IT Data Lake will not replace the current IT and security functions of Splunk. If Splunk is present, it should be integrated with the IT Data Lake as both a data supplier and a consumer.
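The configuration graph described under "Invest in Graph" can be sketched with a plain adjacency map and a breadth-first walk. The component names below are invented for illustration; a real graph would be built from Configuration and Asset Management data:

```python
from collections import deque

# Hypothetical IT dependency graph: component -> components that depend on it
DEPENDENTS = {
    "oracle-db-01": ["order-service", "billing-service"],
    "order-service": ["web-storefront"],
    "billing-service": ["invoice-batch"],
    "web-storefront": [],
    "invoice-batch": [],
}

def impacted_by(component):
    """Breadth-first walk of the dependency graph to find everything
    downstream of a failing component (root cause -> blast radius)."""
    seen, queue = set(), deque([component])
    while queue:
        node = queue.popleft()
        for dependent in DEPENDENTS.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

blast_radius = impacted_by("oracle-db-01")
```

Joining this traversal result against transaction-level logs and alert events is what lets historic and predictive analysis reason about how components are, or are not, connected.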
Disclaimer
► EY refers to the global organization, and may refer to one or more, of the member firms of Ernst & Young Global Limited, each of which is a separate legal entity. Ernst & Young LLP is a client-serving member firm of Ernst & Young Global Limited operating in the US.
► Views expressed in this presentation are those of the speakers and do not necessarily represent the views of Ernst & Young LLP.
► This presentation is provided solely for the purpose of enhancing knowledge on technology matters.
► These slides are for educational purposes only and are not intended, and should not be relied upon, as technical advice.
Appendix
Addressing the Opportunity
Focus on Data and Analytics
Competing on Analytics: Updated with a New Introduction: The New Science of Winning
Characteristics

Traditional, hypothesis-driven analytics:
− Structured data, not necessarily big
− Hypothesis driven with sample data sets
− Cleaner data = better analysis & results
− 3rd-party demographic, psychographic data
− Getting this right = operational / customer table stakes

Advanced, data-driven analytics:
− Big data and data-driven hypotheses
− Lots of data doesn't necessarily mean it's useful
− Machine learning / can be compute intensive
− 3rd-party social and open data
− Getting this right = differentiation and growth
► Optimization – "What's the best that can happen?"
► Machine learning – "What can we learn from the data?"
► Experimental design – "What happens if we try this?"
► Predictive modeling – "What will happen next?"
► Forecasting / extrapolation – "What if these trends continue?"
► Statistical analysis – "Why is this happening?"
► Alerts – "What actions are needed?"
► Query / drill down – "What exactly is the problem?"
► Ad hoc reports – "How many, how often, where?"
► Standard reports – "What happened?"
[Figure: competitive advantage vs. sophistication of intelligence, rising through Descriptive, Predictive, Prescriptive and Autonomous Analytics]
IT data quality framework
Our IT data quality framework, in combination with the IT data management operating framework, will be leveraged to identify improvement opportunities for data being collected in the IT data lake, as well as to provide a repeatable process for ongoing monitoring of key issues such as root cause analysis, tracking and remediation.
Governance Policies
1. Data Lineage, Metadata, Data Quality assessment
2. Data Violation Repository, Issue Management
3. Ongoing Data Reconciliation and acquisition monitoring
► Help build data reconciliation dashboards
► Assist with integrating reconciliation reporting
► Help provide increased transparency in sourcing and usage of data, from source systems/golden sources to data lake
► Help define and document associated metadata and reference data
► Document lineage and data flows for each CI class
► Establish functional and data model, build DQ Violation Repository and define DQ rules
► Assist in prioritizing issues, performing root cause analysis, facilitating remediation
Output: Confidence Ratings
► Inclusion of all Technology Assets
► Accuracy of Critical Data Attributes
► Timeliness & Usability of Asset Data

Business Objective: To measure and improve the quality of IT data that is of strategic importance for operational efficiency and IT management purposes.

Inputs: Functional Knowledge, IT Inventory Data Attributes, Reference architecture for IT Inventory, IT Risk Requirements & Controls
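A minimal sketch of the rule-based checks behind a DQ Violation Repository and the confidence-rating output described above. The CI fields, rules, and the "share of rules passed" rating are illustrative assumptions, not the framework itself:

```python
# Hypothetical data-quality rules for configuration item (CI) records:
# each rule returns a violation message, or None if the record passes.
DQ_RULES = [
    lambda ci: None if ci.get("owner") else "missing owner",
    lambda ci: (None if ci.get("environment") in {"dev", "test", "prod"}
                else "invalid environment"),
    lambda ci: (None if ci.get("last_seen_days", 999) <= 30
                else "stale: not seen in 30 days"),
]

def assess(ci_record):
    """Run all rules; return the violations (for the DQ Violation
    Repository) and a simple confidence rating: share of rules passed."""
    violations = [v for rule in DQ_RULES if (v := rule(ci_record))]
    confidence = 1 - len(violations) / len(DQ_RULES)
    return violations, round(confidence, 2)

violations, confidence = assess(
    {"ci_id": "srv-1001", "owner": "app-team", "environment": "prod",
     "last_seen_days": 45})
```

Running rules like these on every ingested CI class, and logging each failure to a violation repository, is what makes root cause analysis, prioritization and remediation a repeatable process rather than a one-off cleanup.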
IT Data Management is a cross-functional discipline that facilitates the effective and efficient management, control and protection of IT assets (both hardware and software) across the organization, at all stages of their lifecycle. This is accomplished through process, governance, organizational management, process integration, data governance and supporting technology to drive operational and strategic decision making.
IT Data Lifecycle
Producers of IT data
IT asset management
Procurement
Infrastructure management
Software engineering
Public and private cloud
Managed services
Consumers of IT data
Risk management
Information security
Configuration management
IT financial management
Compliance
Change management
Contract management
Capacity management
Relevant standards and guidance: ISO 19770-1 and -2, COBIT v5, FRB and FFIEC guidance, ISO 27002, ITIL v3, SANS critical controls
IT data management lifecycle
Our deep understanding of the IT data lifecycle will enable us to propose a robust IT data management strategy, data governance plan and data model, forming the basis for the IT data lake initiative.