A Data Lake is more than Hadoop. Hadoop is more than a Data Lake · 2016-08-30

Post on 29-Mar-2020


#TDPARTNERS16 GEORGIA WORLD CONGRESS CENTER

A Data Lake is more than Hadoop.

Hadoop is more than a Data Lake

Dan Graham

Teradata Director Technical Marketing

What’s the Big Idea?

Big idea #1: “store all data” (whatever “all” means)

Big idea #2: “un-washed, raw data” (NoETL / late-binding)

Big idea #3: “resolve the nagging problem of accessibility and data integration”

DTG

Big idea #4: Data access/integration

Isn’t that in the data warehouse?

What is a Data Lake?

A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale, enabled by low-cost technologies, from which multiple downstream facilities may draw.

Data sources: sensors, email, transactions, machine logs, geolocation, media

Data Lake

Downstream: BI tools, IDW, data marts, analysis, apps, other data lake

Data Lake is a Design Pattern

Data Lake Design Pattern

• Scalability at low cost
• Original raw data fidelity
• Refine data for exploration
• Loosely coupled, late binding
• Serves downstream systems
• Long term storage

Data Warehouse Design Pattern

• Subject oriented
• Data model of the business
• Integrated
• Consolidated
• Consistent data formats
• Nonvolatile persisted data
• Time variant
• High concurrency levels

Design Patterns vis-à-vis Technologies

Data Lake Design Pattern

• Scalability at low cost
• Original raw data fidelity
• Refine data for exploration
• Loosely coupled, late binding
• Serves downstream systems
• Long term storage

Data Lake Technologies: S3, Teradata 1800

Who is this Guy? What’s he Doing?

Data treatments

Capture, refine, explore original raw data and metadata

Data scientists

Programmers

Business users

Batch jobs

Multiple Data Lakes

Sensor data capture, refining

New product design

Market pricing

Hadoop is more than a data lake. A data lake is more than Hadoop.

What the Data Lake is Not

• Not a single central repository for all data
  • Unless you rebuild half the data center
  • 100s of reasons data bypasses the lake

• Not the only system feeding the data warehouse
  • Data goes direct or through ETL servers

• Not an archive
  • Policies, audits, immutability, extreme security, expirations

• Not dashboards and data marts

ETL analysis

data lake

Data Manufacturing

DATA R&D

DATA LAKE DATA PRODUCTS

Data Manufacturing & Hadoop Cluster

DATA R&D

DATA LAKE DATA PRODUCTS

Data Integration: Just Say No to your Inner DBA (and some users)

Levels of data trust

Certified 100%
Trustworthy 80%
Proven 60%
Experimental 40%
Raw/high risk 20%

Data integration investment rises from low (raw/high risk) to high (certified).

Use Cases

Data Integration Optimization

• Reference data look-ups
• Joins for derived data
• Lots of derived data
• Service-level goals to meet
• High velocity data
• Unstructured data
• Low value data
• Cost savings ROI

Dark Data Insights

• Dark data, data exhaust deleted

• New unstructured data

• Expensive, no ROI, unknown value

• Low user demand

• Dark data often contains insights

• Data lake costs are much lower

• Explore, research, discover

• Promote some to production

sensors, email, weblogs, logins, tweets, GPS, mobile

Production

Complex/Iterative Processing

• Extensive CPU usage
  • Iterative processing
  • Non-sequential loops & branches

• Complex algorithms
  • Video content analysis
  • Photo analysis
  • Text analysis
  • Random forests
  • Monte Carlo methods

• Scientific research
  • Weather simulation
  • Electromagnetic modeling
  • Physics, DNA, etc.

Complex processing

Set processing

Managing Shadow IT

• To get their job done, users abscond with data daily
  • Bypass IT, governance, and security
  • Data-mart-under-my-desk

• Dispensing data reliably
  • HELP users get needed data
  • Improve data quality
  • Get some control versus none
  • Add some governance, security, audit

Data Lake

Offloading the Coldest Data

• Offload coldest rows
  • Free up IDW storage

• Temperature = usage
  • Date stamp often irrelevant

• Archive, compliance

• Accessible with QueryGrid

Hot/warm data

Coldest data

ETL

QueryGrid move
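The "temperature = usage" rule can be sketched in a few lines. This is only an illustration under stated assumptions: the `RowStats` record, its `reads_last_90d` counter, and the 25% cutoff are hypothetical and do not come from the slides or from any Teradata tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RowStats:
    """Usage statistics for one row or partition (hypothetical fields)."""
    key: str
    load_date: date       # the "date stamp"; deliberately not used for ranking
    reads_last_90d: int   # assumed usage counter: how often queries touched it


def coldest(rows, fraction=0.25):
    """Return the least-read fraction of rows, ranked by usage only."""
    ranked = sorted(rows, key=lambda r: r.reads_last_90d)
    cutoff = int(len(ranked) * fraction)
    return ranked[:cutoff]


rows = [
    RowStats("2009-contracts", date(2009, 3, 1), 540),   # old but hot
    RowStats("2016-clickstream", date(2016, 6, 1), 2),   # new but cold
    RowStats("2014-orders", date(2014, 1, 15), 75),
    RowStats("2015-returns", date(2015, 9, 9), 30),
]

to_offload = coldest(rows)
print([r.key for r in to_offload])
```

In this made-up data the newest rows are the coldest, which is the point of the slide: ranking by load date would have offloaded the heavily used 2009 rows instead.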

Single Subject Data Analysis

• Analytics
  • Query and reporting
  • Data mining
  • Dashboards

• Single subject star schema
  • 1-2 raw data fact tables
  • Structured + unstructured data
  • Non-cleansed data
  • Non-integrated data
  • Dimension tables

#Version: 1.0
#GMT-Offset: -0800
#Software: MyCorpTopaz Web Cache 2.0.0.2.0;
#Start-Date: 2015-06-21 00:00:18
#Fields: c-ip c-dns c-auth-id date time cs-method cs-uri sc-status-ctrl bytes cs(Cookie) cs(Referrer) time-taken cs(User-Agent)
#date: 2015-07-31; ”buyer”=“Willcox”; order”=“lingerie”; DMS.user; GET /images/bottom.gif 200A17x 350 "BIGipServer_webcache”=“217”; ORA_UCM_AGID=%2fMP%2f8M7%3etSHPV%40%2fS%3f%3fDh3V“; "http://www.myDBl.com/nl.html" 37087 "Mozilla/4.5 [en] (WinNT;)"

Raw data files

store

address

date

type
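The header of the raw log sample on this slide already describes its own layout, so a reader can recover the column names at read time rather than from a predefined table definition. The sketch below is a minimal illustration of that idea; the helper name `read_directives` is hypothetical, and only the `#` header lines from the sample are used, since the data line itself is not clean enough to parse reliably.

```python
def read_directives(lines):
    """Collect leading '#Name: value' header lines into a dict."""
    directives = {}
    for line in lines:
        if not line.startswith("#"):
            break  # header ends at the first non-directive line
        name, _, value = line[1:].partition(":")
        directives[name.strip()] = value.strip()
    return directives


# Header lines copied from the slide's raw log sample.
header = [
    "#Version: 1.0",
    "#GMT-Offset: -0800",
    "#Software: MyCorpTopaz Web Cache 2.0.0.2.0;",
    "#Start-Date: 2015-06-21 00:00:18",
    "#Fields: c-ip c-dns c-auth-id date time cs-method cs-uri "
    "sc-status-ctrl bytes cs(Cookie) cs(Referrer) time-taken cs(User-Agent)",
]

directives = read_directives(header)
fields = directives["Fields"].split()  # column names, discovered at read time
print(len(fields), fields[:4])
```

Splitting on the first colon only keeps values such as the `Start-Date` timestamp intact.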

Big Pictures

Data Lake Architecture

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts

Access, Preparation, Acquisition

Search, Profiling, Tagging, Analytics, Cleansing, Validation, Aggregation, Materialization, Ingest, Conversion, Encryption

Security, Metadata/Lineage, Administration

Distributed Storage

SOURCES: Sensors, email, Social, Telemetry, Mobile, Tabular Data, Machine logs

Data Lake Architecture

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts

Access, Preparation, Acquisition

Streams, Search, Aggregations
Msg. queues, Cleansing, Access
Experiments, Governance, Files

Security, Metadata/Lineage, Administration

Distributed Storage

SOURCES: Sensors, email, Social, Telemetry, Mobile, Tabular Data, Machine logs

Hadoop Data Lake Technologies

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts

Access, Preparation, Acquisition

YARN, Ambari, Navigator, HCatalog, Sentry

HDFS, S3: raw data, derived views

SOURCES: Sensors, email, Social, Telemetry, Mobile, Tabular Data, Machine logs

Data Lake: Teradata 1800

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts

Access, Preparation, Acquisition

Teradata Parallel Data Environment

SOURCES: Sensors, email, Social, Telemetry, Mobile, Tabular Data, Machine logs

Data Lab Studio

QueryGrid

SAS mining, Fuzzy Logix, SPSS, Revolution R

Informatica, DataStage, Oracle DI, SAS DI Studio, Ab Initio, Microsoft

TPT, Data-mover, Listener, REST APIs, Attunity

Informatica, IBM DataStage, Oracle Data Integrator, Talend

Viewpoint, Ecosystem Manager, Unity

Data Lake Definition Summary

• The data lake is a design pattern
  • Requires and uses many technologies

• The data lake is more than Hadoop
  • Amazon S3, Cassandra, Teradata
  • Other tools and technologies

• Hadoop is more than a data lake

• The data lake manages raw data
  • Refined in downstream processes

Data sources

Downstream consumers

Thank You

Questions/Comments

Email: Daniel.Graham@Teradata.Com

Follow Me

Twitter @DanGraham_

Rate This Session #417 with the PARTNERS Mobile App -- rate it a 5 please

Remember To Share Your Virtual Passes

Data Lake Platforms

Data lake definition               Hadoop   Amazon EMR   Cassandra   Teradata 1800
Long term data containers          X        X            X           X
Capture, refine, and explore       X        X            X           X
Raw data at scale                  X        X            X           X
Low cost technologies              X        X            X           X
Feeds downstream uses              X        X            X           X

Options                            Hadoop   Amazon EMR   Cassandra   Teradata 1800
Schema-on-read                     X        X            X           JSON, NVPs
File system                        HDFS     S3           CFS         RDBMS
SQL, Java, Python, Ruby, scripts   X        X            X           X

Search engines: Solr (listed for two of the platforms)

Value Creation via Data Integration

DATA LAKE

• Data integration on demand
• Data value assumed
• Typically schema-on-read

INTEGRATED DATA WAREHOUSE

• Data integration up front
• Data value manufactured
• Typically schema-on-write

SCM, CRM, ERP
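The contrast between schema-on-write and schema-on-read comes down to when the same schema is applied. The following is a minimal sketch, not a description of any product: the event records, the `apply_schema` function, and the two-column schema are all made up for illustration.

```python
import json

# Hypothetical raw sensor events as they arrive; the last one is malformed.
RAW_EVENTS = [
    '{"device": "t-17", "temp_c": 21.5}',
    '{"device": "t-18", "temp_c": "22.1", "firmware": "1.4"}',
    'not json at all',
]


def apply_schema(line):
    """Parse one raw line into the assumed (device, temp_c) schema.

    Raises ValueError, KeyError, or TypeError when the line does not fit.
    """
    record = json.loads(line)
    return {"device": str(record["device"]), "temp_c": float(record["temp_c"])}


# Schema-on-write: integrate up front; only conforming records are stored.
warehouse, rejected = [], []
for line in RAW_EVENTS:
    try:
        warehouse.append(apply_schema(line))
    except (ValueError, KeyError, TypeError):
        rejected.append(line)

# Schema-on-read: store every raw line untouched; interpret at query time.
lake = list(RAW_EVENTS)


def query_lake(raw_lines):
    """Apply the schema while reading, skipping lines that do not fit."""
    results = []
    for line in raw_lines:
        try:
            results.append(apply_schema(line))
        except (ValueError, KeyError, TypeError):
            continue
    return results


print(len(lake), len(warehouse), len(query_lake(lake)))
```

Both paths yield the same two clean records here. The difference is that the lake still holds the malformed line and the unused `firmware` attribute, so a later, different schema can be applied to the original raw data.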

Teradata’s Hadoop Data Lake Products

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts

Access, Preparation, Acquisition

Listener, App Center

HDFS

Viewpoint

SOURCES: Sensors, email, Social, Telemetry, Mobile, Tabular Data, Machine logs