The concept of Datalake with Hadoop


Description

How to build a data lake using Hadoop, and how to process data in Hadoop

Transcript of The concept of Datalake with Hadoop

Page 1: The concept of Datalake with Hadoop

The concept of Data Lake: Data processing in Hadoop

[email protected] @avkashchauhan

Page 2: The concept of Datalake with Hadoop

Agenda this hour:

Data lake concept through live examples
Data processing in Hadoop

Page 3: The concept of Datalake with Hadoop

Data lake or data graveyard…

Page 4: The concept of Datalake with Hadoop

Data Warehouses and Databases

[Diagram: source data flows through an ETL pipeline into a database / data warehouse]

Page 5: The concept of Datalake with Hadoop

Concerns with existing approach

Significant time investment in data prototyping
Cost of data storage and repetitive processing
Time to get value out of data varies with several factors
Several point solutions to process data differently
Real users do not have immediate access to data, and when they do, it is a subset of the real data
Knowledge gap between those who want to use data and those who actually process it
One team (IT/CO) managing everyone's needs causes significant bottlenecks

Page 6: The concept of Datalake with Hadoop

Concept of Data Lake with HDFS

[Diagram: multiple data maps built on top of the HDFS-based data lake]

Page 7: The concept of Datalake with Hadoop

Technical Definition of Data Lake

Store all the information without any modification (a minimal ingestion sketch follows this list)
Data is stored without consideration of its type
Separate teams develop and manage their own point solutions
Security and governance are applied only at data access or other critical points
IT or corporate views the data the same as any other team
Unstructured information is still information
Depending on demand, scaling up or down is possible
Data redundancy guarantees data availability
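As a minimal sketch of what "store without any modification" can look like in practice, the fragment below copies a local file into HDFS byte for byte using Hadoop's Java FileSystem API; nothing is parsed and no schema is applied at write time. The source file name and the /datalake directory layout are hypothetical examples, not something the slides prescribe.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
    public static void main(String[] args) throws Exception {
        // The cluster address (fs.defaultFS) is read from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical source file and lake layout: a raw zone, partitioned by date
        Path local = new Path("clicks-2014-01-01.log");
        Path lake  = new Path("/datalake/raw/web/clicks/2014/01/01/");

        // The bytes land unchanged; type, format, and schema are decided at read time
        fs.mkdirs(lake);
        fs.copyFromLocalFile(local, lake);
        fs.close();
    }
}

Because nothing is transformed on the way in, each team can later read the same raw files with whatever point solution suits it.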

Page 8: The concept of Datalake with Hadoop

Hadoop is the Answer…

Page 9: The concept of Datalake with Hadoop

How Hadoop is the answer?

Hadoop has HDFS as its storage layer for unstructured data
HDFS provides fast and reliable data access to applications (see the sketch after this list)
Applications designed to support line-of-business (LOB) needs can process data on disk or in memory while accessing the full volume of data in real time
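A small sketch of that access path, assuming only the standard Java FileSystem API: the program opens a file directly out of HDFS and streams its bytes to stdout. Since HDFS serves raw bytes, the same code works regardless of the file's format.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Open any file in the lake by path; the format does not matter to HDFS
        try (FSDataInputStream in = fs.open(new Path(args[0]))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream to stdout
        }
        fs.close();
    }
}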

Page 10: The concept of Datalake with Hadoop

HDFS: Quick Intro…

Page 11: The concept of Datalake with Hadoop

Data processing in Hadoop

Unstructured data is processed through the MapReduce programming paradigm (a word-count sketch follows this list)
Data is stored as regular files in a supported file format
If a file format is not yet supported, support can be added programmatically
Data is accessed through any kind of program
Once data is read, the program can choose any way to process it within the MapReduce framework
Result data is stored back into HDFS for later use
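To make the read, process, write-back loop concrete, here is the classic word-count job, essentially the introductory example from the Hadoop MapReduce documentation: the mapper emits a (word, 1) pair per token, the reducer sums the counts per word, and the result is written back into HDFS at the output path given on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: one (word, 1) pair per token in each input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input already in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this would be launched with something like: hadoop jar wordcount.jar WordCount /datalake/raw/docs /datalake/results/wordcount (both paths hypothetical).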

Page 12: The concept of Datalake with Hadoop

What is HDFS?

HDFS is a fault-tolerant and self-healing distributed file system designed to turn a cluster of industry-standard servers into a massively scalable pool of storage.
Developed specifically for large-scale data processing workloads where scalability, flexibility, and throughput are critical.
HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond.

Page 13: The concept of Datalake with Hadoop

Conclusion:

The concept of a data lake is designed to use Hadoop as a single point of data storage, available for instant data processing by any individual team at any given time. It is designed to:

Reduce engineering overhead
Provide faster access to data for real users
Remove repetitive processing

Page 14: The concept of Datalake with Hadoop

Thanks…