A Data Lake is more than Hadoop. Hadoop is more than a Data Lake · 2016-08-30
![Page 1: A Data Lake is more than Hadoop. Hadoop is more than a Data Lake · 2016-08-30 · • To get their job done, users abscond with data daily •Bypass IT, governance, and security](https://reader034.fdocuments.net/reader034/viewer/2022042111/5e8bdf99ee36a2275513568a/html5/thumbnails/1.jpg)
#TDPARTNERS16 GEORGIA WORLD CONGRESS CENTER
A Data Lake is more than Hadoop. Hadoop is more than a Data Lake
Dan Graham
Teradata Director Technical Marketing
What’s the Big Idea?
Big idea #1 “store all data” (whatever “all” means)
Big idea #2 “un-washed, raw data” (NoETL / late-binding)
Big idea #3 “resolve the nagging problem of accessibility and data integration”
DTG
Big idea #4 Data access/integration
Isn’t that in the data warehouse?
What is a Data Lake?
A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale, enabled by low-cost technologies, from which multiple downstream facilities may draw.
Data sources: Sensors, Email, Transactions, Machine logs, Geolocation, Media
Downstream: BI Tools, IDW, Data Marts, Analysis, Apps, Other Data Lake
Data Lake
Data Lake is a Design Pattern
Data Lake Design Pattern
• Scalability at low cost
• Original raw data fidelity
• Refine data for exploration
• Loosely coupled, late binding
• Serves downstream systems
• Long term storage
Data Warehouse Design Pattern
• Subject oriented
• Data model of the business
• Integrated
• Consolidated
• Consistent data formats
• Nonvolatile persisted data
• Time variant
• High concurrency levels
Design Patterns vis-à-vis Technologies
Data Lake Design Pattern
• Scalability at low cost
• Original raw data fidelity
• Refine data for exploration
• Loosely coupled, late binding
• Serves downstream systems
• Long term storage
Data Lake Technologies: S3, 1800
Who is this Guy? What’s he Doing?
Data treatments
Capture, refine, explore
original raw data and metadata
Data scientists
Programmers
Business users
Batch jobs
Multiple Data Lakes
Sensor data capture, refining
New product design
Market pricing
Hadoop is more than a data lake. A data lake is more than Hadoop.
What the Data Lake is Not
• Not a single central repository for all data
  • Unless you rebuild half the data center
  • 100s of reasons data bypasses the lake
• Not the only system feeding the data warehouse
  • Data goes direct or through ETL servers
• Not an archive
  • Policies, audits, immutability, extreme security, expirations
• Not dashboards and data marts
ETL analysis
data lake
Data Manufacturing
DATA R&D
DATA LAKE DATA PRODUCTS
Data Manufacturing & Hadoop Cluster
DATA R&D
DATA LAKE DATA PRODUCTS
Data Integration: Just Say No to your Inner DBA (and some users)
Levels of data trust and data integration
• Certified 100%
• Trustworthy 80%
• Proven 60%
• Experimental 40%
• Raw/high risk 20%
Investment: Low (raw/high risk) to High (certified)
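One way to read the trust ladder above is as an ordered scale that downstream consumers can filter on. The sketch below is only an illustration under that assumption: the level names and percentages come from the slide, while `TrustLevel`, `usable_for`, and the dataset names in `catalog` are hypothetical.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    """Trust levels from the slide; the value is the slide's percentage."""
    RAW_HIGH_RISK = 20
    EXPERIMENTAL = 40
    PROVEN = 60
    TRUSTWORTHY = 80
    CERTIFIED = 100


def usable_for(dataset_level, required_level):
    """Return True when a dataset meets or exceeds the required trust level."""
    return dataset_level >= required_level


# Hypothetical catalog entries, used only to exercise the comparison.
catalog = {
    "clickstream_raw": TrustLevel.RAW_HIGH_RISK,
    "pricing_experiment": TrustLevel.EXPERIMENTAL,
    "customer_master": TrustLevel.CERTIFIED,
}

# Datasets that a consumer requiring at least TRUSTWORTHY data could use.
reportable = sorted(
    name for name, level in catalog.items()
    if usable_for(level, TrustLevel.TRUSTWORTHY)
)
print(reportable)
```

Using an ordered enum keeps the "how much integration investment has this data received" question to a single comparison, which matches the slide's low-to-high axis.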
Use Cases
Data Integration Optimization
• Reference data look-ups
• Joins for derived data
• Lots of derived data
• Service-level goals to meet
• High velocity data
• Unstructured data
• Low value data
• Cost savings ROI
Dark Data Insights
• Dark data, data exhaust deleted
• New unstructured data
• Expensive, no ROI, unknown value
• Low user demand
• Dark data often contains insights
• Data lake costs are much lower
• Explore, research, discover
• Promote some to production
Sources: sensors, weblogs, logins, tweets, GPS, mobile
Production
Complex/Iterative Processing
• Extensive CPU usage
  • Iterative processing
  • Non-sequential loops & branches
• Complex algorithms
  • Video content analysis
  • Photo analysis
  • Text analysis
  • Random forests
  • Monte Carlo methods
• Scientific research
  • Weather simulation
  • Electromagnetic modeling
  • Physics, DNA, etc.
Complex processing
Set processing
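As a small illustration of the iterative, CPU-bound work the slide lists under Monte Carlo methods, the sketch below estimates pi by random sampling. It is not taken from the deck; `estimate_pi` and its parameters are hypothetical names. The point is only that the work is a long loop of independent trials, unlike a single set-oriented aggregation over stored rows.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi by sampling points in the unit square.

    The fraction of points that land inside the quarter circle of radius 1
    approximates pi / 4.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


estimate = estimate_pi(100_000)
print(estimate)
```

Accuracy improves only slowly with the number of samples, which is why workloads of this kind tend to be dominated by CPU time.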
Managing Shadow IT
• To get their job done, users abscond with data daily
  • Bypass IT, governance, and security
  • Data-mart-under-my-desk
• Dispensing data reliably
  • HELP users get needed data
  • Improve data quality
  • Get some control versus none
  • Add some governance, security, audit
Data Lake
Offloading the Coldest Data
• Offload coldest rows
  • Free up IDW storage
• Temperature = usage
  • Date stamp often irrelevant
• Archive, compliance
• Accessible with QueryGrid
Hot/warm data
Coldest data
ETL
QueryGrid move
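The "temperature = usage" rule can be sketched as ranking rows by how often they are read rather than by their date stamp. The example below is a minimal illustration under that assumption; `RowStats`, `coldest_rows`, the 90-day window, and the sample figures are all hypothetical and do not describe how QueryGrid or any Teradata product selects data.

```python
from dataclasses import dataclass


@dataclass
class RowStats:
    row_id: int
    last_modified: str       # ISO date, kept only to show it is not the sort key
    reads_last_90_days: int  # hypothetical usage counter


def coldest_rows(stats, fraction):
    """Return the least-read fraction of rows, coldest first."""
    ranked = sorted(stats, key=lambda s: s.reads_last_90_days)
    count = int(len(ranked) * fraction)
    return ranked[:count]


# Hypothetical usage statistics: age and temperature do not line up.
stats = [
    RowStats(1, "2010-01-15", 420),  # old but hot
    RowStats(2, "2016-06-01", 0),    # recent but cold
    RowStats(3, "2014-03-09", 3),
    RowStats(4, "2016-07-20", 57),
]

to_offload = coldest_rows(stats, fraction=0.5)
print([s.row_id for s in to_offload])
```

Row 1 is the oldest yet stays in the warehouse because it is read most, while the recent but unread row 2 is the first candidate for offload.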
Single Subject Data Analysis
• Analytics
  • Query and reporting
  • Data mining
  • Dashboards
• Single subject star schema
  • 1-2 raw data fact tables
  • Structured + unstructured data
  • Non cleansed data
  • Non integrated data
  • Dimension tables
#Version: 1.0
#GMT-Offset: -0800
#Software: MyCorpTopaz Web Cache 2.0.0.2.0;
#Start-Date: 2015-06-21 00:00:18
#Fields: c-ip c-dns c-auth-id date time cs-method cs-uri sc-status-ctrl bytes cs(Cookie) cs(Referrer) time-taken cs(User-Agent)
#date: 2015-07-31; ”buyer”=“Willcox”; order”=“lingerie”; DMS.user; GET /images/bottom.gif 200A17x 350 "BIGipServer_webcache”=“217”; ORA_UCM_AGID=%2fMP%2f8M7%3etSHPV%40%2fS%3f%3fDh3V“; "http://www.myDBl.com/nl.html" 37087 "Mozilla/4.5 [en] (WinNT;)"
Raw data files
store
address
date
type
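A raw log file like the one on this slide carries its own column list in the `#Fields:` directive, so field names can be bound when the file is read instead of when it is loaded. The sketch below assumes directive lines of the form `#Name: value`; `read_directives` and `bind_fields` are hypothetical helpers, and the record line is a made-up, whitespace-delimited example rather than the slide's data.

```python
def read_directives(lines):
    """Collect '#Name: value' header lines into a dictionary."""
    directives = {}
    for line in lines:
        if not line.startswith("#"):
            continue
        name, _, value = line[1:].partition(":")
        directives[name.strip()] = value.strip()
    return directives


def bind_fields(field_names, record_line):
    """Pair whitespace-separated values with the field names from the header."""
    return dict(zip(field_names, record_line.split()))


# Header lines copied from the slide's sample file.
header = [
    "#Version: 1.0",
    "#GMT-Offset: -0800",
    "#Start-Date: 2015-06-21 00:00:18",
    "#Fields: c-ip c-dns c-auth-id date time cs-method cs-uri sc-status-ctrl "
    "bytes cs(Cookie) cs(Referrer) time-taken cs(User-Agent)",
]

directives = read_directives(header)
fields = directives["Fields"].split()

# Hypothetical record with one value per field; "-" marks an empty value.
sample = "10.0.0.1 - - 2015-07-31 00:00:18 GET /images/bottom.gif 200 350 - - 37087 -"
record = bind_fields(fields, sample)
print(record["cs-uri"], record["time-taken"])
```

Real records, such as the one on the slide, contain quoted values with embedded spaces and inconsistent quoting, so a production reader would need a more forgiving tokenizer than `split()`.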
Big Pictures
Data Lake Architecture
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
Access: Search, Profiling, Tagging, Analytics
Preparation: Cleansing, Validation, Aggregation, Materialization
Acquisition: Ingest, Conversion, Encryption
Security, Metadata/Lineage, Administration
Distributed Storage
SOURCES: Sensors, Social, Telemetry, Mobile, Tabular Data, Machine logs
Data Lake Architecture
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
Access, Preparation, Acquisition
Streams Search Aggregations
Msg. queues Cleansing Access
Experiments Governance Files
Security, Metadata/Lineage, Administration
Distributed Storage
SOURCES: Sensors, Social, Telemetry, Mobile, Tabular Data, Machine logs
Hadoop Data Lake Technologies
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
Access, Preparation, Acquisition
YARN, Ambari, Navigator, HCatalog, Sentry
HDFS, S3: Raw data, derived views
SOURCES: Sensors, Social, Telemetry, Mobile, Tabular Data, Machine logs
Data Lake: Teradata 1800
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
Access, Preparation, Acquisition
Teradata Parallel Data Environment
Data Lab Studio
QueryGrid
SAS mining, Fuzzy Logix, SPSS, Revolution R
Informatica, DataStage, Oracle DI, SAS DI Studio, Ab Initio, Microsoft
TPT, Data-mover, Listener, REST APIs, Attunity
Informatica, IBM Data Stage, Oracle Data Integrator, Talend
Viewpoint, Ecosystem Manager, Unity
SOURCES: Sensors, Social, Telemetry, Mobile, Tabular Data, Machine logs
Data Lake Definition Summary
• The data lake is a design pattern
  • Requires and uses many technologies
• The data lake is more than Hadoop
  • Amazon S3, Cassandra, Teradata
  • Other tools and technologies
• Hadoop is more than a data lake
• The data lake manages raw data
  • Refined in downstream processes
Data sources
Downstream consumers
Thank You
Questions/Comments
Email:
Follow Me: Twitter @DanGraham_
Rate This Session #417 with the PARTNERS Mobile App (rate it a 5 please)
Remember To Share Your Virtual Passes
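Data Lake Platforms

| Data lake definition | Hadoop | Amazon EMR | Cassandra | Teradata 1800 |
|---|---|---|---|---|
| Long term data containers | X | X | X | X |
| Capture, refine, and explore | X | X | X | X |
| Raw data at scale | X | X | X | X |
| Low cost technologies | X | X | X | X |
| Feeds downstream uses | X | X | X | X |
| Options | | | | |
| Schema-on-read | X | X | X | JSON, NVPs |
| File system | HDFS | S3 | CFS | RDBMS |
| SQL, Java, Python, Ruby, scripts | X | X | X | X |

Search engines: Solr, Solr (listed for two of the four platforms; the columns are not recoverable from the transcript)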
Value Creation via Data Integration
DATA LAKE
• Data integration on demand
• Data value assumed
• Typically schema-on-read
INTEGRATED DATA WAREHOUSE
• Data integration up front
• Data value manufactured
• Typically schema-on-write
Sources: SCM, CRM, ERP
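The contrast between integrating up front and integrating on demand can be sketched in a few lines. This is an illustration only, assuming JSON lines as the raw format; `SCHEMA`, `write_with_schema`, `read_with_schema`, and the sample record are hypothetical.

```python
import json

# Hypothetical schema: field name mapped to the type each value is cast to.
SCHEMA = {"store": str, "amount": float}


def write_with_schema(raw_line, table):
    """Schema-on-write: parse and type the record before it is stored."""
    record = json.loads(raw_line)
    table.append({name: cast(record[name]) for name, cast in SCHEMA.items()})


def read_with_schema(raw_lines):
    """Schema-on-read: leave stored lines untouched; apply the schema per query."""
    for raw_line in raw_lines:
        record = json.loads(raw_line)
        yield {name: cast(record[name]) for name, cast in SCHEMA.items()}


# Hypothetical raw record with a field the schema does not know about.
raw = ['{"store": "A1", "amount": "19.99", "note": "extra field"}']

warehouse = []
write_with_schema(raw[0], warehouse)       # integration cost paid up front

lake = list(raw)                           # raw fidelity kept, nothing parsed yet
lake_view = list(read_with_schema(lake))   # integration on demand

print(warehouse == lake_view, "note" in lake[0])
```

Both paths yield the same typed rows here, but only the lake copy still holds the unmodeled `note` field, which is the raw-data fidelity the design pattern emphasizes.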
Teradata’s Hadoop Data Lake Products
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
Access, Preparation, Acquisition
Listener, App Center
Viewpoint
HDFS
SOURCES: Sensors, Social, Telemetry, Mobile, Tabular Data, Machine logs