The IT Data Lake
Leveraging IT Data to Improve IT and Business Outcomes
November 03, 2017

Michael Sick, Senior Manager, Ernst & Young LLP, [email protected], +1 919.523.4447

Sriram Kedhar, Manager, Ernst & Young LLP, [email protected], +1 610.504.3796



Page 2

On Data and Digital

It is broadly accepted that data is critical to digital transformation

Data! Data! Data! I can’t make bricks without clay! - Sir Arthur Conan Doyle

- Data is essential to effective decision making

Data is a precious thing and will last longer than the systems themselves - Tim Berners-Lee

- Data is a long lived asset

Take Away: It has long been recognized that data is key to solid decision making, and it has recently been emphasized that the transition to ‘digital’ approaches requires even greater attention to data.


Page 3

Advanced Analytics

Putting Data and Analytics to Work

[Figure: business processes and the analytics technique labels shown around them]

Business processes: Order to Cash; Compliance Reporting; Work Order Management; Quote, Invoice and Contract Management; Contract Analysis; Lease Revenue / NOI Analysis; Property Performance Analysis; Preventive Maintenance; Appraisal Support / Analysis

Technique labels: Advanced Analytics; Machine Learning; Recognition; NLP; RPA; Interactive; Speech / Tech Enabled; Robotics; Autonomous; Industrial; Programmable; Deep Learning; Neural Networks; Analytics; Hypothesis-driven; Statistical; Classification; Autonomous analytics

Advanced analytics is being used to transform a broad set of business processes across many sectors


Page 4

What are leaders doing? Survey Says

How do you leverage data & analytics?

Leaders use advanced analytics to drive double-digit growth of 15%+ in revenues and operating margins, as well as improved risk profiles.

► 70% of “leading” organizations use advanced analytics to overhaul business strategies, changing the nature of competitive differentiation.

► 75% of top performers operate a full range of enterprise, departmental, and line-of-business analytics groups that work within a well-aligned framework.

Hint: If you intend to win at your business, the use of advanced analytics is essential.


Page 5

But what about IT?

Our experience suggests that IT is lagging in digital transformation

► Q&A: Have you taken part in a project at your job that used advanced analytics to support the ‘business’?

► Q&A: Have you taken part in a project using advanced analytics supporting IT?

► Q&A: Directly involved or not, does your IT department use data to optimize its software development / SDLC practices? To identify common configuration defects in production deployments?

► Our experience suggests that larger IT organizations at highly digital businesses are starting to leverage IT data to improve their outcomes.

► The result: the IT Data Lake and associated processes.


Page 6

The IT Data Lake – Sample Use Cases

There are many use cases that realize value in the IT Data Lake. Below are several that we have found to be common.

► Transaction Tracing – root service analysis (dev->prod)

► Audit Logging – Transaction validation

► Performance Analysis – with drill down to components

► Performance Impact – overlay of performance to user

► Data Quality Dashboard & Alerts – rapid identification

► Predictive Model Application – application of the predictive models to the incoming data streams – models often time-series / window based

► Data Quality / Correction – use of models and corrected master / reference data sets to enrich incoming data streams

► Data Storage Optimization – recommend platform

► Data Compute Optimization – recommend platform

► Configuration Graph – build graph of IT assets / config

► Blueprint & Deployment Analysis – find key gaps

► Config. Mgmt. Gap Analysis – what’s missing / wrong

► Incident Analysis – root cause analysis

► Audit Archive – long term transaction validation

► Bad Build Identification – find builds with likely defects

► Performance Analysis – models for performance norms

► Data Correction Models – creation of rule-based and categorization-based models for the identification (and potential correction) of data quality issues

► Internal Threat Analysis – augment pure security view of internal threats / behavior

Raw Data – archived to historical

Events – from predictive and DQ

Models – for prediction and DQ

Master Data – corrected MRD data
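
The "time-series / window based" models named in the Predictive Model Application use case above can start out very simply. The following is a minimal sketch, assuming a stream of response-time samples; the function name, window size and threshold are hypothetical choices for illustration and do not come from the original deck.

```python
from collections import deque
from statistics import mean, pstdev


def window_alerts(values, window=5, threshold=3.0):
    """Yield (index, value) for points far from the trailing window's mean.

    Hypothetical illustration: `window` and `threshold` are assumed settings.
    """
    recent = deque(maxlen=window)
    for i, v in enumerate(values):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Flag the point when it deviates by more than `threshold` sigmas.
            if sigma > 0 and abs(v - mu) > threshold * sigma:
                yield i, v
        recent.append(v)


# Made-up latency samples in milliseconds; one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 450, 101, 97]
alerts = list(window_alerts(latencies_ms))
```

Each alert produced this way would become one of the "Events" that flow from the predictive and data quality models back into the lake.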


Page 7

IT Data Lake – The Layering of Value

We have found the value of the IT Data Lake to be both iterative over time and additive based on the type of analysis. Real-time supports Historic with a steady flow of data; Historic supports Predictive with master data and improved data quality.

[Figure: value (y-axis) against time in months (x-axis, marked at 3, 6, 9 and 12), with three layered stages]

1. Trace (REAL-TIME) – Mining logs for extracting insight in the “now”

► How do I capture the flow of business information in my vast universe of IT systems?

► What is the source of a given failure?

2. Trend (HISTORIC) – Collecting and reporting on historical data

► What can I learn from analyzing and reporting on a vast amount of data collected over time?

► What types of trends exist in my easily accessible (free) data?

3. Think (PREDICTIVE) – Modeling with logs and contextual data

► Why do IT projects fail?

► What makes users stay on my site?

► Where do my employees go when they have a question?

Audience, growing with each stage: Developers; Operational Support; + IT Management; L1-L3 Support; + Business owners


Page 8

The IT Data Lake – Target Reference Architecture

The IT Data Lake contains log data from various systems and master / reference data from core IT systems. It is capable of leveraging IT data to support decisions that optimize IT functions and the business that IT supports.


Page 9

Metadata tagging for the IT Data Lake

[Figure: metadata tagging data flow, with numbered steps 1 to 7]

Data Ingestion and Governance: Data Sources (RDBMS, Flat files); Data Ingestion (Data Import, Metadata Tagging); IT Data Lake (Native Data + Tags, HDFS)

Semantic Layer (Data Mapping); Canonical Data Model / Enterprise Data Dictionary; Calculation Engines (Optional); End Users / Channels

Other labels: Schemaless write; Schema on read

Consumption Model: Kibana; Hive; Arcadia

Users: Developer; Tester; Manager; Business

Data Flow Steps

1. Data Sources – Systems of origin / systems of record

2. Data Ingestion – Schema-less, tool-based automatic data import; attaches metadata tags (attribute names, domains, definitions) from a Canonical Data Model or a standardized data dictionary during ingestion

3. IT Data Lake – Data and metadata tags loaded to the Data Lake (schema-less write to HDFS). Data is stored in data tables; tags are stored in metadata stores / tables

4. Canonical Data Model – A standardized data model or a data dictionary defining data elements used during consumption across various channels

5. Semantic Layer – Identifies and maps data elements from the data lake to output using the metadata tag information

6. Calculation Engines (Optional) – Any additional calculations or computations needed before data is presented for consumption

7. End Users / Channels – End users or analytics / reporting applications consuming data

We use a metadata tagging methodology to implement our clients’ IT data lakes
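
The tagging flow described in the data flow steps can be sketched in a few lines. This is only an illustration under stated assumptions: the canonical model is a plain dictionary, in-memory lists stand in for the data and metadata tables, and every name below (`CANONICAL_MODEL`, `ingest`, `read_canonical`, the field names) is hypothetical rather than part of any tool mentioned in the deck.

```python
# Hypothetical canonical data model: source field -> canonical attribute.
CANONICAL_MODEL = {
    "srv": {"attribute": "server_name", "domain": "string"},
    "ts": {"attribute": "event_time", "domain": "timestamp"},
    "rt_ms": {"attribute": "response_time_ms", "domain": "integer"},
}

data_store = []      # stands in for data tables in the lake
metadata_store = []  # stands in for the metadata (tag) tables


def ingest(record):
    """Schema-less write: store the record as-is, plus tags for known fields."""
    record_id = len(data_store)
    data_store.append(record)
    tags = {f: CANONICAL_MODEL[f] for f in record if f in CANONICAL_MODEL}
    metadata_store.append(tags)
    return record_id


def read_canonical(record_id):
    """Schema on read: map native fields to canonical names using the tags."""
    record, tags = data_store[record_id], metadata_store[record_id]
    return {tags[f]["attribute"]: v for f, v in record.items() if f in tags}


rid = ingest({"srv": "app01", "ts": "2017-11-03T10:00:00Z", "rt_ms": 120, "x": 1})
view = read_canonical(rid)
```

The untagged field `x` is still stored in its native form; it simply does not appear in the canonical view until the data dictionary gains an entry for it.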


Page 10

IT Data Lake Reference Implementation

Elasticsearch, Logstash and Kibana (ELK)

Log sources: Infrastructure Logs; Application Logs; Business Logs; Monitoring Logs; Security Logs; Distributed Logs

Collector: Logstash

Users: Developer; Tester; Security; Manager; Legal

Enterprise Logging Analytics Data Products

Transaction Tracing

► Trace transactions across systems

► View transaction details

► Query transactions with text analytics

► View trends in transactions
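
Tracing a transaction across systems comes down to grouping log events by a shared identifier and ordering them in time. A minimal sketch, assuming events have already been parsed into dictionaries; the `txn_id`, `system` and `ts` field names and the sample events are hypothetical.

```python
from collections import defaultdict

# Hypothetical log events already parsed into dicts.
events = [
    {"txn_id": "T1", "system": "web", "ts": 1, "msg": "request received"},
    {"txn_id": "T2", "system": "web", "ts": 2, "msg": "request received"},
    {"txn_id": "T1", "system": "billing", "ts": 5, "msg": "charge failed"},
    {"txn_id": "T1", "system": "orders", "ts": 3, "msg": "order created"},
]


def trace(events):
    """Group events by transaction id and order each group by timestamp."""
    by_txn = defaultdict(list)
    for e in events:
        by_txn[e["txn_id"]].append(e)
    return {t: sorted(es, key=lambda e: e["ts"]) for t, es in by_txn.items()}


traces = trace(events)
# The path transaction T1 took through the systems, in time order.
path_t1 = [e["system"] for e in traces["T1"]]
```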

Business Identifiers

► Logging statements can be decorated with identifiers with business meaning

► E.g., Loan Id, Security Id, Bank Id

► Users able to query and aggregate on business identifiers
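
One way to decorate logging statements with business identifiers is to emit structured (JSON) log lines that carry them as fields. The sketch below uses Python's standard `logging` module; the `JsonFormatter` class, the logger name and the identifier names in `BUSINESS_KEYS` are assumptions made for illustration, not something the deck prescribes.

```python
import io
import json
import logging

BUSINESS_KEYS = ("loan_id", "security_id", "bank_id")  # assumed identifier names


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object, including business identifiers."""

    def format(self, record):
        doc = {"level": record.levelname, "message": record.getMessage()}
        for key in BUSINESS_KEYS:
            if hasattr(record, key):
                doc[key] = getattr(record, key)
        return json.dumps(doc)


stream = io.StringIO()  # stands in for a file that a log shipper would read
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("loan_service")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

# `extra` attaches the identifiers to the log record as attributes.
log.info("payment posted", extra={"loan_id": "L-1001", "bank_id": "B-7"})
line = json.loads(stream.getvalue())
```

Because the identifiers arrive as separate fields rather than free text, a downstream index can filter and aggregate on them directly.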

Configured Reports

► Pre-configured reports can be created

► Per-system reports for quick health views

► Purpose-specific reports like “Top N Security Concerns”
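
A "Top N" report is essentially a count of events grouped by one field. A minimal sketch, assuming security events have been parsed into dictionaries; the event data and the `top_n` helper are hypothetical.

```python
from collections import Counter

# Hypothetical, already-parsed security events.
security_events = [
    {"system": "vpn", "concern": "failed login"},
    {"system": "web", "concern": "port scan"},
    {"system": "vpn", "concern": "failed login"},
    {"system": "db", "concern": "privilege escalation"},
    {"system": "web", "concern": "failed login"},
    {"system": "web", "concern": "port scan"},
]


def top_n(events, field, n=3):
    """Return the n most frequent values of `field` with their counts."""
    return Counter(e[field] for e in events).most_common(n)


top_concerns = top_n(security_events, "concern", n=2)
```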

1. Log data sent to Logstash collector and ingested in real-time

2. Data sent to core logging system for near real-time ingestion and query

3. If implemented, deeper analytics performed in batch

4. Deeper analytics can be imported back into the core system

5. A unified query system implemented for time-series oriented searches

6. A variety of Users can be configured to securely query the system

7. A variety of data products can be created/viewed by the system

Enterprise Logging Analytics Platform

Real-Time Logging Analytics: Random Access Storage and Search Index <Elasticsearch>; Discovery / Search Analytics <Elasticsearch>; Reporting <Kibana>

Batch Logging Analytics: Sequential (File) Access Storage <Hadoop-HDFS>; Batch Analytics <Hive>


Page 11

The IT Data Lake – Lessons Learned (1 of 2)

As EY has helped its clients execute on building out IT Data Lakes, we have learned a number of key lessons that we have turned into leading practices for execution.

► Data Rich – ensure your ecosystem of data is data rich by starting with the near real-time use case and expanding it to a broad set of applications. This supports Historic and Predictive use cases by ensuring that key data sets have already been landed.

► Likewise, a focus on the Historic use cases prior to the Predictive use cases will help ensure that key master data sets have been landed and examined.

► Predictive and other advanced analytic data methods can be difficult. By ensuring that key data is landed and well known, the data wrangling “tax” on each predictive use case is reduced.

► Master Data – the master and reference data for IT data is generally less examined and lower quality than master and reference data for business data sets. An initial focus on landing and addressing quality issues of key data sets (Configuration and Asset Management, Project Data, Identity Data …) is critical and worth the investment.

► Value Focus – use case priority should be largely value focused, especially as the IT Data Lake and surrounding program get started. It is critical that IT demonstrate that it can a) reduce operational costs, b) improve operational metrics (uptime, user experience), and c) reduce long-term risk via the analysis of IT data.

► Use Case Portfolio – build, communicate, and validate a multi-year portfolio of use cases and work streams whose demonstrated value justifies the overall IT Data Lake effort. We suggest at least a 10:1 ratio of projected value to potential expense, because the likely return will be lower: some use cases will underperform, and some potential value will remain unrealized across all use cases.
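
The reasoning behind the 10:1 suggestion can be seen with some simple arithmetic. All figures below are invented for illustration only: a portfolio projected at ten times the program cost can realize far less once underperforming use cases and unrealized value are accounted for.

```python
use_cases = [
    # (name, projected value, assumed share of that value actually realized)
    ("Transaction tracing", 4_000_000, 0.5),
    ("Incident analysis", 3_000_000, 0.25),
    ("Bad build identification", 3_000_000, 0.0),  # assumed to deliver nothing
]
program_cost = 1_000_000  # hypothetical total expense

projected = sum(value for _, value, _ in use_cases)
realized = sum(value * share for _, value, share in use_cases)

projected_ratio = projected / program_cost  # what the portfolio promises
realized_ratio = realized / program_cost    # what it delivers under these assumptions
```

Under these made-up assumptions the projected ratio is 10:1 while the realized ratio is 2.75:1, which is why a generous projected margin is a sensible starting point.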


Page 12

The IT Data Lake – Lessons Learned (2 of 2)

As EY has helped its clients execute on building out IT Data Lakes, we have learned a number of key lessons that we have turned into leading practices for execution.

► All SDLC Environments – many efforts start by focusing only on production logs / data. We have found this to be a mistake. In the near real-time use case, allowing developers to quickly identify root cause / component issues delivers immediate value. For SDLC related use cases for Historic and Predictive analysis, the data showing trends and root cause of issues is found during the development and testing cycles.

► Invest in Graph – create a graph of IT dependencies and ensure that it can be easily joined to the transaction-level logs and to the generated predictive and alert events. First, nobody in your IT organization has the big picture of how everything fits together, so a visually explorable model provides valuable insights on its own. Second, the graph is critical for allowing Historic and Predictive analysis to understand how components are / are not connected.
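
To make the value of the dependency graph concrete, here is a minimal sketch that answers "what is affected if this component fails?" with a breadth-first walk. The component names and the `DEPENDS_ON` structure are hypothetical; a real graph would be built from configuration and asset management data.

```python
from collections import deque

# Hypothetical dependency graph: component -> components it depends on.
DEPENDS_ON = {
    "web-portal": ["order-service", "auth-service"],
    "order-service": ["orders-db"],
    "auth-service": ["ldap"],
    "reporting": ["orders-db"],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for comp, deps in depends_on.items():
        for d in deps:
            dependents.setdefault(d, []).append(comp)
    seen, queue = set(), deque([failed])
    while queue:
        for parent in dependents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


impacted = impacted_by("orders-db", DEPENDS_ON)
```

Joining the resulting component set to transaction-level logs is what lets an alert on one component be related to errors seen elsewhere.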

► Invest in Data Model – we have found that two data models should be established for the IT Data Lake. The first is a data model for the core IT Master and Reference Data (MRD), with connectivity to the underlying transactional data (logs). Key domains for this data model include Configuration and Asset Management, Software Development Lifecycle, and Policy (including security policy). The second is a data model for the connection between IT data and key business process steps and identifiers. For a manufacturer, for example, these might include Vendor Id, Material Codes, User Id, Plant … These will be used to build metrics and views that show the business impact of IT activities.

► Splunk Zen – Splunk is a pervasive tool for log analysis and it is incumbent at many companies. Splunk is a sophisticated tool with many compelling features. That said, the Splunk business model for charging based on the rate of ingestion makes it a difficult sell for broad use in the IT Data Lake. For near real-time, it will be unlikely that any enterprise will want to target Splunk for all logs from development to production. The IT Data Lake will not replace the current IT and security functions of Splunk. If Splunk is present, it should be integrated with the IT Data Lake as a data supplier and consumer.


Page 13

Disclaimer

► EY refers to the global organization, and may refer to one or more, of the member firms of Ernst & Young Global Limited, each of which is a separate legal entity. Ernst & Young LLP is a client-serving member firm of Ernst & Young Global Limited operating in the US.

► Views expressed in this presentation are those of the speakers and do not necessarily represent the views of Ernst & Young LLP.

► This presentation is provided solely for the purpose of enhancing knowledge on technology matters.

► These slides are for educational purposes only and are not intended, and should not be relied upon, as technical advice.

Page 14

Appendix

Page 15

Addressing the Opportunity

Focus on Data and Analytics

Competing on Analytics: Updated with a New Introduction: The New Science of Winning

Characteristics

− Structured data, not necessarily big

− Hypothesis driven with sample data sets

− Cleaner data = better analysis & results

− 3rd party demographic, psychographic data

− Getting this right = operational / customer table stakes

− Big data and data driven hypothesis

− Lots of data doesn’t necessarily mean it’s useful

− Machine learning / can be compute intensive

− 3rd party social and open data

− Getting this right = differentiation and growth

Optimization – “What’s the best that can happen?”

Machine Learning – “What can we learn from the data?”

Experimental design – “What happens if we try this?”

Predictive modeling – “What will happen next?”

Forecasting/extrapolation – “What if these trends continue?”

Statistical analysis – “Why is this happening?”

Alerts – “What actions are needed?”

Query/drill down – “What exactly is the problem?”

Ad hoc reports – “How many, how often, where?”

Standard reports – “What happened?”

[Figure axes: Competitive advantage (y-axis) against Sophistication of intelligence (x-axis)]

Tiers, from most to least sophisticated: Autonomous Analytics; Prescriptive Analytics; Predictive Analytics; Descriptive Analytics


Page 16

IT data quality framework

Our IT data quality framework, in combination with the IT data management operating framework, will be leveraged to identify improvement opportunities for data being collected in the IT data lake, and to provide a repeatable process for ongoing monitoring of key issues, including root cause analysis, tracking and remediation.

Governance Policies

1. Data Lineage, Metadata, Data Quality assessment

► Help provide increased transparency in sourcing and usage of data, from source systems/golden sources to data lake

► Help define and document associated metadata and reference data

► Document lineage and data flows for each CI class

2. Data Violation Repository, Issue Management

► Establish functional and data model, build DQ Violation Repository and define DQ rules

► Assist in prioritizing issues, performing root cause analysis, facilitating remediation

3. Ongoing Data Reconciliation and acquisition monitoring

► Help build data reconciliation dashboards

► Assist with integrating reconciliation reporting

Output: Confidence Ratings

► Inclusion of all Technology Assets

► Accuracy of Critical Data Attributes

► Timeliness & Usability of Asset Data

Business Objective: To measure and improve quality of IT data which is of strategic importance for operational efficiency and IT management purposes

Functional Knowledge

IT Inventory Data Attributes

Reference architecture for IT Inventory

IT Risk Requirements & Controls
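
The DQ rules, the violation repository and the confidence ratings in this framework can be illustrated with a small sketch. Everything here is an assumption made for illustration: the asset records, the three rules, the 30-day freshness window and the idea of expressing each rating as a pass rate are hypothetical, not EY's actual framework.

```python
from datetime import date

# Hypothetical configuration-item (CI) records from an IT inventory.
assets = [
    {"ci": "srv-01", "owner": "ops", "os": "RHEL 7", "last_scanned": date(2017, 10, 30)},
    {"ci": "srv-02", "owner": "", "os": "RHEL 7", "last_scanned": date(2017, 10, 29)},
    {"ci": "srv-03", "owner": "dba", "os": "", "last_scanned": date(2017, 6, 1)},
    {"ci": "srv-04", "owner": "ops", "os": "Windows 2012", "last_scanned": date(2017, 11, 1)},
]
AS_OF = date(2017, 11, 3)  # assumed evaluation date

# Each rule: name -> predicate that returns True when the record passes.
RULES = {
    "owner_present": lambda a: bool(a["owner"]),
    "os_present": lambda a: bool(a["os"]),
    "scanned_within_30_days": lambda a: (AS_OF - a["last_scanned"]).days <= 30,
}

# Violation repository: one (CI, rule) entry per failed check.
violations = [(a["ci"], name) for a in assets for name, ok in RULES.items() if not ok(a)]

# Confidence rating per rule: the share of records that pass it.
ratings = {name: sum(ok(a) for a in assets) / len(assets) for name, ok in RULES.items()}
```

The violation list gives issue management something concrete to prioritize, while the ratings summarize accuracy and timeliness at a glance.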


Page 17

IT Data Management is a cross-functional discipline that facilitates the effective and efficient management, control and protection of IT assets (both hardware and software) across the organization, at all stages of their lifecycle. This is accomplished through process, governance, organizational management, process integration, data governance and supporting technology to drive operational and strategic decision making.

IT Data Lifecycle

Producers of IT data

IT asset management

Procurement

Infrastructure management

Software engineering

Public and private cloud

Managed services

Consumers of IT data

Risk management

Information security

Configuration management

IT financial management

Compliance

Change management

Contract management

Capacity management

ISO 19770-1 and -2; COBIT v5; FRB and FFIEC guidance; ISO 27002; ITIL v3; SANS critical controls

IT data management lifecycle

Our deep understanding of the IT data lifecycle will enable us to propose a robust IT data management strategy, data governance plan and data model forming the basis for the IT data lake initiative