The concept of Datalake with Hadoop


Description

How to build a data lake using Hadoop, and how to process data in Hadoop

Transcript of The concept of Datalake with Hadoop

Page 1: The concept of Datalake with Hadoop

The concept of Data Lake: Data processing in Hadoop

[email protected] @avkashchauhan

Page 2: The concept of Datalake with Hadoop

Agenda this hour:

Data lake concept through live examples
Data processing in Hadoop

Page 3: The concept of Datalake with Hadoop

Data lake or data graveyard…

Page 4: The concept of Datalake with Hadoop

Data Warehouses and Databases

[Diagram: source data flows through an ETL pipeline into a database / data warehouse]

Page 5: The concept of Datalake with Hadoop

Concerns with existing approach

Significant time investment in data prototyping
Cost of data storage and repetitive processing
Time to get value out of data varies with several factors
Several point solutions to process data differently
Real users do not have immediate access to data, and when they do, it is a subset of the real data
Knowledge gap between those who want to use data and those who actually process it
One team (IT/CO) managing everyone's needs causes significant bottlenecks

Page 6: The concept of Datalake with Hadoop

Concept of Data Lake with HDFS

[Diagram: multiple data maps built on top of the HDFS-based data lake]

Page 7: The concept of Datalake with Hadoop

Technical Definition of Data Lake

Store all the information without any modification (a minimal ingestion sketch follows this list)
Data is stored without consideration of its type
Separate teams develop and manage their own point solutions
Security and governance are applied only at data access or other critical points
IT or corporate views the data the same as any other team
Unstructured information is still information
Depending on demand, scaling up or down is possible
Data redundancy guarantees data availability
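As a minimal sketch of what "store without any modification" can look like in practice, the fragment below copies a local file into HDFS byte for byte using Hadoop's Java FileSystem API; nothing is parsed and no schema is applied at write time. The source file name and the /datalake directory layout are hypothetical examples, not something the slides prescribe.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
    public static void main(String[] args) throws Exception {
        // The cluster address (fs.defaultFS) is read from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical source file and lake layout: a raw zone, partitioned by date
        Path local = new Path("clicks-2014-01-01.log");
        Path lake  = new Path("/datalake/raw/web/clicks/2014/01/01/");

        // The bytes land unchanged; type, format, and schema are decided at read time
        fs.mkdirs(lake);
        fs.copyFromLocalFile(local, lake);
        fs.close();
    }
}

Because nothing is transformed on the way in, each team can later read the same raw files with whatever point solution suits it.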

Page 8: The concept of Datalake with Hadoop

Hadoop is the Answer…

Page 9: The concept of Datalake with Hadoop

How Hadoop is the answer?

Hadoop has HDFS as its storage layer for unstructured data
HDFS provides fast and reliable data access to applications (see the sketch after this list)
Applications designed to support line-of-business (LOB) needs can process data on disk or in memory while accessing the full volume of data in real time
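A small sketch of that access path, assuming only the standard Java FileSystem API: the program opens a file directly out of HDFS and streams its bytes to stdout. Since HDFS serves raw bytes, the same code works regardless of the file's format.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Open any file in the lake by path; the format does not matter to HDFS
        try (FSDataInputStream in = fs.open(new Path(args[0]))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream to stdout
        }
        fs.close();
    }
}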

Page 10: The concept of Datalake with Hadoop

HDFS: Quick Intro…

Page 11: The concept of Datalake with Hadoop

Data processing in Hadoop

Unstructured data is processed through the MapReduce programming paradigm (a word-count sketch follows this list)
Data is stored as regular files in a supported file format
If a file format is not yet supported, support can be added programmatically
Data is accessed through any kind of program
Once data is read, the program can choose any way to process it within the MapReduce framework
Result data is stored back into HDFS for later use
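To make the read, process, write-back loop concrete, here is the classic word-count job, essentially the introductory example from the Hadoop MapReduce documentation: the mapper emits a (word, 1) pair per token, the reducer sums the counts per word, and the result is written back into HDFS at the output path given on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: one (word, 1) pair per token in each input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input already in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this would be launched with something like: hadoop jar wordcount.jar WordCount /datalake/raw/docs /datalake/results/wordcount (both paths hypothetical).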

Page 12: The concept of Datalake with Hadoop

What is HDFS?

HDFS is a fault-tolerant and self-healing distributed file system designed to turn a cluster of industry-standard servers into a massively scalable pool of storage.
Developed specifically for large-scale data processing workloads where scalability, flexibility, and throughput are critical.
HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond.

Page 13: The concept of Datalake with Hadoop

Conclusion:

The concept of a data lake is designed to use Hadoop as a single point of data storage, available for instant data processing by any individual team at any given time. It is designed to:

Reduce engineering overhead
Provide faster access to data for real users
Remove repetitive processing

Page 14: The concept of Datalake with Hadoop

Thanks…