How to use Big Data and Data Lake concept in business using Hadoop and Spark - Darko Marjanovic
Darko Marjanović, CEO @ Things Solver, [email protected]
How to use Big Data and Data Lake concept in business using Hadoop and Spark
About me
• CEO and Co-Founder @ Things Solver
• Co-Founder @ Data Science Serbia
• Big Data, Machine Learning
• Hadoop, Spark, Python
Agenda
• Big Data
• Data Lake
• Data Lake vs Data Warehouse
• Hadoop, Spark, Hive
• Big Data application and Lambda architecture
• Examples
• Data Science Lab
Big Data
• Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.
• Anything that won't fit in Excel :)
Big Data
• Volume: the quantity of generated and stored data.
• Variety: the type and nature of the data.
• Velocity: the speed at which data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
• Veracity: the quality of captured data can vary greatly, which affects the accuracy of analysis.
Big Data
• Email, HTML, click stream...
• Facebook, Twitter...
• Video, pictures...
• Logs...
• Sensor data...
• Relational databases...
Big Data
Data Lake
“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”
Data Lake - James Dixon, Pentaho chief technology officer
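The idea that structure is "not defined until the data is needed" is often called schema-on-read. The following is a minimal single-machine sketch of it in plain Python, not how a real data lake is implemented. The sample records, the `RAW_LAKE` list and the `read_with_schema` helper are all hypothetical names made up for illustration.

```python
import json

# Raw events are stored exactly as they arrived; nothing is validated or reshaped on write.
RAW_LAKE = [
    '{"user": "ana", "action": "click", "ts": 1}',
    '{"user": "marko", "action": "purchase", "amount": 25.0, "ts": 2}',
    '{"sensor": "t-17", "temperature": 21.5, "ts": 3}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: keep records that have every field, cast the values."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# Two consumers impose two different structures on the same raw data.
purchases = read_with_schema(RAW_LAKE, {"user": str, "amount": float})
readings = read_with_schema(RAW_LAKE, {"sensor": str, "temperature": float})

print(purchases)
print(readings)
```

Neither schema had to exist when the records were written, which is the property the definition above describes.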
Data Lake
• Retain all data
• Support all data types
• Support all users
• Adapt easily to changes
• Provide faster insights
Data Lake
Data Lake Cons
• Data storage alone has no impact on the effectiveness of business decisions
• Inexpensive storage is not infinite or limitless
Data Warehouse
Wikipedia defines data warehouses as:
“…central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.”
Data Warehouse
Problems:
• New data sources, data types
• Real-time reports
• Streaming data
• Software price
• Infrastructure price
Data Lake vs Data Warehouse
Data Lake vs Data Warehouse
• ETL
  • ETL and BI projects are by nature investments in evolving processes; they have no distinct end point and remain ongoing, improving and re-targeting efforts.
  • ETL works from the output backwards, and hence only relevant data is extracted and processed.
  • Future ETL requirements that need additional data cannot be foreseen and defined in the original design.
• ELT
  • Isolating loading from transforming enables projects to be broken down into specific chunks that are more isolated and more manageable.
  • ELT is an emergent approach to data warehouse design and development, requiring a change in mentality and design approach compared to traditional ETL.
  • Future requirements can easily be incorporated into the warehouse structure, as all data is pulled into the Data Lake in its raw format.
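The contrast can be shown with a small in-memory sketch. This is an illustration under assumptions, not a real pipeline: the `SOURCE` orders and the helpers `etl`, `elt_load` and `revenue_by` are hypothetical, and plain Python lists stand in for the warehouse and the lake.

```python
SOURCE = [
    {"order_id": 1, "customer": "ana", "total": 40.0, "channel": "web"},
    {"order_id": 2, "customer": "marko", "total": 15.5, "channel": "store"},
    {"order_id": 3, "customer": "ana", "total": 9.5, "channel": "web"},
]


def etl(source):
    """ETL: transform before loading, keeping only what today's report needs."""
    warehouse = {}
    for row in source:
        warehouse[row["customer"]] = warehouse.get(row["customer"], 0.0) + row["total"]
    return warehouse  # the "channel" field never reaches the warehouse


def elt_load(source):
    """ELT: load everything untouched; transformations run later, on demand."""
    return [dict(row) for row in source]


def revenue_by(lake, key):
    """A transformation defined after loading; any raw field can serve as the key."""
    result = {}
    for row in lake:
        result[row[key]] = result.get(row[key], 0.0) + row["total"]
    return result


warehouse = etl(SOURCE)
lake = elt_load(SOURCE)
by_customer = revenue_by(lake, "customer")
by_channel = revenue_by(lake, "channel")  # a later requirement, still answerable

print(warehouse)
print(by_channel)
```

The ETL output cannot answer the per-channel question without going back to the source and redesigning the job, while the ELT copy can, because the raw rows were kept.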
Hadoop
• The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.
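MapReduce is the best-known of those simple programming models. The sketch below imitates its three phases (map, shuffle, reduce) in a single Python process to show the shape of the model. It does not use any Hadoop API, and the function names and sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data on hadoop", "spark on hadoop", "big data lake"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

On a cluster the map and reduce calls would run on different machines over different blocks of the input; only the two user-written functions stay this simple.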
Hadoop
• Pros
  • Linear scalability.
  • Commodity hardware.
  • Pricing and licensing.
  • All data types.
  • Analytical queries.
  • Integration with traditional systems.
• Cons
  • Implementation.
  • MapReduce ease of use.
  • Intense calculations with little data.
  • In memory.
  • Real-time analytics.
Apache Spark
• Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Apache Spark
• Pros
  • Up to 100x faster than MapReduce for in-memory workloads.
  • Ease of use.
  • Streaming, MLlib, GraphX and SQL.
  • Pricing and licensing.
  • In memory.
  • Integration with Hadoop.
  • Machine learning.
• Cons
  • Integration with traditional systems.
  • Limited memory per machine (GC).
  • Configuration.
Apache Spark
Apache Spark
• Resilient Distributed Datasets (RDDs) are the basic units of abstraction in Spark.
• An RDD is an immutable, partitioned set of objects.
• RDDs are lazily evaluated.
• RDDs are fully fault-tolerant. Lost data can be recovered using the lineage graph of RDDs (by rerunning operations on the input data).
• RDD operations:
  • Transformations: lazily evaluated (executed only when an action is called, which improves pipelining)
    • map, filter, groupByKey, join, ...
  • Actions: run immediately (to return a value to the application or write to storage)
    • count, collect, reduce, save, ...
• Don’t forget to cache()
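The interplay of lazy transformations, lineage, actions and caching can be imitated in a few lines of plain Python. The `MiniRDD` class below is a hypothetical toy written for this purpose; it is not Spark's implementation or API, it runs on one machine, and it has no partitions. It only mirrors the behaviour listed above.

```python
from functools import reduce


class MiniRDD:
    """Toy stand-in for an RDD: immutable source data plus a recorded lineage of steps."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)  # never modified
        self._lineage = lineage       # transformations recorded, not yet executed
        self._keep = False
        self._cached = None

    # Transformations: return a new MiniRDD and do no work.
    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self._lineage + (("filter", fn),))

    def cache(self):
        """Mark the result to be kept once the first action has computed it."""
        self._keep = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        data = list(self._source)
        for kind, fn in self._lineage:  # replay the lineage from the input data
            data = [fn(x) for x in data] if kind == "map" else [x for x in data if fn(x)]
        if self._keep:
            self._cached = data
        return data

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return len(self._compute())

    def reduce(self, fn):
        return reduce(fn, self._compute())


calls = []


def square(x):
    calls.append(x)  # records when the function really runs
    return x * x


rdd = MiniRDD(range(1, 6)).map(square).filter(lambda x: x % 2 == 1).cache()
calls_before_action = len(calls)  # still 0: transformations only recorded the plan
odd_squares = rdd.collect()       # first action runs the whole pipeline
total = rdd.reduce(lambda a, b: a + b)  # second action reuses the cached result

print(calls_before_action, odd_squares, total, len(calls))
```

Without the `cache()` call the second action would replay the lineage and call `square` five more times, which is the cost the last bullet warns about.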
Apache Spark
• DataFrames are a common abstraction across languages; they represent a table, or two-dimensional array with columns and rows.
• Spark DataFrames are distributed DataFrames. They allow querying structured data using SQL or a DSL (for example in Python or Scala).
• Like RDDs, DataFrames are immutable structures.
• They are executed in parallel.
• val df = sqlContext.read.json("pathToMyFile.json")
Hive
• Apache Hive is a data warehouse infrastructure for querying, analyzing and managing large datasets residing in distributed storage.
Hive
• Pros
  • Writing ad hoc queries on large volumes of data.
  • Imposing a structure on a variety of data formats.
  • Interactive SQL queries over large datasets residing in Hadoop.
  • SQL-like data access.
  • Accessing Hadoop data from a traditional DWH environment.
• Cons
  • Code efficiency can be lower than in traditional MapReduce.
  • Apache Hive has terrible performance for OLTP tasks.
Ecosystem
• Collecting Data
  • Kafka, Flume…
• Managing Data
  • Pig, Spark, Hive, Flink, MapReduce
• Resource Manager
  • YARN, Mesos
• Administration
  • Ambari, Bigtop
Big Data Application
Lambda Architecture
• Lambda Architecture is a useful framework to think about designing big data applications. Nathan Marz designed this generic architecture addressing common requirements for big data based on his experience working on distributed data processing systems at Twitter.
Lambda Architecture
• Data
• Batch Layer
• Serving Layer
• Speed Layer
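How the four parts fit together can be sketched with a toy page-view counter. This is a single-process illustration under assumptions, not a production design: the `LambdaCounter` class and its method names are hypothetical, and in-memory counters stand in for the batch store and the real-time store.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter organised into batch, speed and serving layers."""

    def __init__(self):
        self.master = []             # data: append-only master dataset
        self.batch_view = Counter()  # batch layer output, recomputed from scratch
        self.speed_view = Counter()  # speed layer: only events since the last batch run

    def ingest(self, page):
        """New data goes to the master dataset and to the speed layer at once."""
        self.master.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        """Batch layer: recompute the view over all data, then reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, page):
        """Serving layer: merge the precomputed batch view with the real-time view."""
        return self.batch_view[page] + self.speed_view[page]


app = LambdaCounter()
for page in ["home", "pricing", "home"]:
    app.ingest(page)
app.run_batch()     # batch view now covers the first three events
app.ingest("home")  # arrives after the batch run, held only by the speed layer

home_views = app.query("home")
pricing_views = app.query("pricing")
print(home_views, pricing_views)
```

Queries stay correct between batch runs because the speed layer covers exactly the events the batch view has not seen yet.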
Social Media Analysis
IoT Big Data Application
Planning and Optimizing Data Lake Architecture
• Tomorrow, 12h, Big Data Track
• Data Lake Architecture in Practice
• Optimizing Hive and Spark for Data Lakes
Data Science Lab
datascience.rs