One Large Data Lake, Hold the Hype
Rocky Mountain DataCon 2016
Jared Winick, Senior Data/Solutions Engineer, Koverse
Outline
• Issues with the usage of “Data Lake”
• Defining Key Characteristics
• A Data Lake Implementation Example
• Discussion
Just because “Data Lake” is overused, misused, and abused doesn’t mean the concept is wrong.
The Concept of a Data Lake
We all can agree that a Data Lake is a centralized (at least logically) repository for all forms of data within an organization.
https://www.wired.com/2013/04/desktop-cluttered-help/
The Concept of a Data Lake
…but there must be more to it than putting all your data in HDFS or S3.
Defining the Key Characteristics
1. Indexing and search across all data
2. Interactive access for all users in the enterprise
3. Multi-level access control
4. Integration with data science tools
5. Abstractions
A Data Lake has a platform-application duality
Indexing / Search Across All Data
• A Data Lake is often an entry point for data
• It may lack structure or “correctness”
• Search enables you to validate and explore your data
Indexing / Search Across All Data
Search: A18923
Employees Data Set: { employeeId: “A18923”, email: “[email protected]”, firstName: “Jared”, … }
Network Events Data Set: { id: “a18923”, eventType: “login”, time: 1478557775010, … }
Find data across data sets. Understand its format and structure.
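The search example above can be sketched as a small inverted index that maps every field value to the records containing it, regardless of which data set or field the value lives in. This is only an illustrative, in-memory sketch: the sample records, the `build_index` and `search` helpers, and the case-insensitive matching are assumptions made for the example, not a description of how any particular product indexes data.

```python
from collections import defaultdict

# Hypothetical sample records modeled on the two data sets in the example.
DATA_SETS = {
    "Employees": [
        {"employeeId": "A18923", "firstName": "Jared"},
    ],
    "Network Events": [
        {"id": "a18923", "eventType": "login", "time": 1478557775010},
        {"id": "b20417", "eventType": "logout", "time": 1478557779999},
    ],
}


def build_index(data_sets):
    """Map each lowercased field value to (data set, record position, field)."""
    index = defaultdict(list)
    for name, records in data_sets.items():
        for position, record in enumerate(records):
            for field, value in record.items():
                index[str(value).lower()].append((name, position, field))
    return index


def search(index, data_sets, term):
    """Return (data set, field, record) for each exact, case-insensitive match."""
    matches = index.get(term.lower(), [])
    return [(name, field, data_sets[name][pos]) for name, pos, field in matches]


index = build_index(DATA_SETS)
hits = search(index, DATA_SETS, "A18923")
for name, field, record in hits:
    # Shows which data set and field matched, which helps reveal structure.
    print(name, field, record)
```

Because the index is keyed by value rather than by schema, one query finds the identifier under `employeeId` in one data set and under `id` in another, which is the kind of exploration the slide describes.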
Interactive Access for Everyone
• A Data Lake is strategic and should serve many different types of users.
• It should have self-service features.
• Together, this means it must support an interactive, multi-user load.
Multi-Level Access Control
• Every organization has data access control requirements these days.
• Different levels of granularity for different environments/use cases.
  – Data Set Level
  – Column/Field Level
  – Row/Record Level
• Far easier to engineer up front than add on later.
Multi-Level Access Control
[Figure: a table with columns Name, DeptId, DOB, and email, annotated to show the three access control levels: Data Set, Column, and Row]
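The three levels can be sketched as successive filters applied on every read. The `Policy` class, the `read` function, and the sample employee rows below are hypothetical names and data invented for illustration; a real system would enforce these checks inside the storage layer rather than in application code.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical rows using the column names from the figure.
EMPLOYEES = [
    {"Name": "Ana", "DeptId": 10, "DOB": "1980-01-01", "email": "ana@example.com"},
    {"Name": "Raj", "DeptId": 20, "DOB": "1975-05-05", "email": "raj@example.com"},
]


@dataclass
class Policy:
    data_sets: set                                       # data set level
    denied_columns: set = field(default_factory=set)     # column/field level
    allowed_dept_ids: Optional[set] = None               # row level; None = all rows


def read(data_set_name, records, policy):
    """Apply data set, row, and column checks, in that order."""
    if data_set_name not in policy.data_sets:
        raise PermissionError(f"no access to data set {data_set_name!r}")
    visible = []
    for record in records:
        if (policy.allowed_dept_ids is not None
                and record["DeptId"] not in policy.allowed_dept_ids):
            continue  # row-level filter
        visible.append(
            {k: v for k, v in record.items() if k not in policy.denied_columns}
        )
    return visible


analyst = Policy(
    data_sets={"Employees"},
    denied_columns={"DOB", "email"},
    allowed_dept_ids={10},
)
rows = read("Employees", EMPLOYEES, analyst)
print(rows)
```

Keeping all three checks in one read path reflects the point that access control is easier to design in from the start than to bolt on later.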
Integration With Data Science Tools
• The ultimate point of a Data Lake is to “monetize” data
  – For a corporation this is making or saving money
  – For a government this is better serving your citizens
  – For a research organization this is solving new problems/answering previously unknown questions
• Need to be able to analyze and transform data sets into new data sets
• From BI queries to text analytics to machine learning.
Integration With Data Science Tools
A Data Lake needs to support multiple internal analytic “customers” within the organization.
• SQL / BI tools for Data Analysts
• Spark for Data Engineers
• Notebooks and ML libraries for Data Scientists
Abstractions
• Provide a level of abstraction over your data
  – Data Sets / Collections
  – Records / Rows
  – Transformations
• Enables a consistent API for interacting with any data regardless of its shape, size, and content
  – Reusability
  – Increased development speed
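One way to picture these abstractions is as three small types: a record is a plain mapping, a data set is a named collection of records, and a transformation turns one data set into a new one. The `DataSet` and `Transform` classes and the `logins_only` function below are hypothetical names chosen for this sketch and do not correspond to any specific product API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]  # a record is just field name -> value


@dataclass
class DataSet:
    name: str
    records: List[Record]


@dataclass
class Transform:
    name: str
    fn: Callable[[Iterable[Record]], Iterable[Record]]

    def apply(self, source: DataSet, output_name: str) -> DataSet:
        """Run the transformation and return a new data set; the source is untouched."""
        return DataSet(output_name, list(self.fn(source.records)))


def logins_only(records):
    """Keep only records whose eventType is 'login'."""
    return (r for r in records if r.get("eventType") == "login")


events = DataSet("Network Events", [
    {"id": "a18923", "eventType": "login"},
    {"id": "a18923", "eventType": "logout"},
])
logins = Transform("logins_only", logins_only).apply(events, "Login Events")
print(logins.name, logins.records)
```

Because `Transform.apply` only depends on the `DataSet` interface, the same transformation can be reused on any data set with compatible fields, which is where the reusability and development speed benefits come from.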
The Koverse Data Lake
Architecture – High Level
[Figure: architecture stack showing HDFS, Zookeeper, Accumulo, Spark, and Koverse]
A distributed key/value store like Apache Accumulo enables storage of very large volumes of data while maintaining low latency access.
Architecture – Distributed Key/Value Store Benefits
These benefits apply to Apache Accumulo, but also likely to Apache HBase, Cassandra and other similar systems
1. Easily scale to trillions of key/values
2. Distributed storage
   1. Parallel processing in Hadoop MapReduce or Spark
   2. Fault tolerance
3. Millisecond read latencies with efficient scanning of ranges
4. Fine-grained access control features
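The efficient range scans in item 3 follow from keeping keys in sorted order: a scan locates the start of the range with a binary search and then reads contiguous entries. The sketch below is a single-process, in-memory stand-in written for illustration. `SortedKVStore` and the key layout are assumptions, and it does not use or imitate the Accumulo client API.

```python
import bisect


class SortedKVStore:
    """Toy sorted key/value store: keys stay ordered so ranges are contiguous."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys sorted on insert
        self._values[key] = value

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKVStore()
for key, value in [
    ("user:b20417:login", 3),
    ("user:a18923:logout", 2),
    ("user:a18923:login", 1),
]:
    store.put(key, value)

# ";" sorts immediately after ":" in ASCII, so this range covers every key
# that begins with the prefix "user:a18923:".
result = list(store.scan("user:a18923:", "user:a18923;"))
print(result)
```

In a distributed store the sorted key space is additionally split into ranges hosted on different servers, so a scan touches only the servers that own the requested range.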
Architecture - Details
[Figure: architecture detail. Accumulo holds a Record Table, an Index Table, and a Statistics/Aggregations Table. Diagram labels: Koverse (low-latency R/W), Spark (efficient range scans), Apps (users/apps use REST).]
Discussion