CS226 Big-Data Managementeldawy/19FCS226/slides/CS226-01... · 2019-09-30 · Project Groups of 4-5...
CS226
Big-Data Management
Instructor: Ahmed Eldawy
Welcome (back) to UCR!
Class information
Classes: Monday, Wednesday, Friday 1:00 – 1:50 PM at Humanities and Social Sciences 1501
Instructor: Ahmed Eldawy
TA: Saheli Ghosh
Office hours: TBD
Website: http://www.cs.ucr.edu/~eldawy/19FCS226/
iLearn (Any UCRX students?)
Email: [email protected]
Subject: “[CS226] …”
Course work
Active participation in the class (5%)
Reading and review tasks (10%)
Assignments (20%)
Mid-term (15%)
Project (50%)
Project
Groups of 4-5 students
Milestones
Group Selection
Project proposal (5%)
Literature survey (10%)
Report outline (5%)
Class presentation (5%)
Final report (15%)
Poster presentation (10%)
Course goals
What are your goals?
Understand what big data means
Identify the internal components of big data platforms
Recognize the differences between different big data platforms
Explain how a distributed query runs on big data
Big-data Expert
Understand how the big-data platforms really work
Control those thousands of processors efficiently to carry out your task
Syllabus
Overview of big data
Big-data storage
Big-data processing
Big-data indexing
Big-SQL processing
Programming packages
Introduction
Jan 2012: World Economic Forum Report
Interest in Big Data in the US
■ March 2012: Obama administration unveils Big Data initiative: $200 million in R&D investment
■ June 2013: The Washington Post calls Obama “The Big Data President”
Interest in Big Data in Europe
March 2014: David Cameron and Angela Merkel talk about Big Data at a computer expo in Hannover, Germany
The Market of Big Data
Four (previously Three) V’s of Big Data
Big Data vs. Big Computation
Full scans (e.g., log processing)
Range scans
Point lookups
Iterations
Joins (self, binary, or multiway)
Proximity queries
Closures and graph traversals
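The first three access patterns in the list above can be contrasted on a toy in-memory dataset. This is only a sketch: the `records` list and the function names are hypothetical, and a real platform would run these over partitioned files rather than one sorted Python list.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy dataset: (key, value) records kept sorted by key.
records = [(k, f"value-{k}") for k in range(0, 100, 5)]
keys = [k for k, _ in records]

def full_scan(predicate):
    """Touch every record; the cost grows with the dataset size."""
    return [r for r in records if predicate(r)]

def range_scan(low, high):
    """Return records with low <= key <= high via binary search on the sorted keys."""
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    return records[start:end]

def point_lookup(key):
    """Return the single record with this key, or None if it is absent."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[i]
    return None
```

A full scan reads everything regardless of the predicate, while the range scan and point lookup rely on the sort order to skip most of the data.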
Big Data Applications
Web search
Marketing and advertising
Data cleaning
Knowledge base
Information retrieval
Internet of Things (IoT)
Visualization
Behavioral studies
Publicly Available Datasets
Data.gov
Data.gov.uk
Twitter Streaming API
Yahoo! Webscope [http://webscope.sandbox.yahoo.com/]
GDELT [http://www.gdeltproject.org/]
Instagram API
Big Data Landscape 2012
http://mattturck.com/2012/06/29/a-chart-of-the-big-data-ecosystem/
Big Data Landscape 2014
http://mattturck.com/2014/05/11/the-state-of-big-data-in-2014-a-chart/
Big Data Landscape 2016
http://mattturck.com/2016/02/01/big-data-landscape/
Big Data Landscape 2018
Components of Big Data
Storage of Big Data
Data is growing faster than Moore’s Law
Too much data to fit on a single machine
Partitioning
Replication
Fault-tolerance
Hadoop Distributed File System (HDFS)
The most widely used distributed file system
Fixed-sized partitioning
3-way replication
Write-once read-many
[Figure: a file stored as a sequence of fixed-size 128 MB blocks]
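Fixed-size partitioning and 3-way replication can be illustrated with a few lines of Python. This is a minimal sketch, not HDFS code: the function names are hypothetical, and the round-robin placement is a simplifying assumption (actual HDFS placement also considers racks and node load).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size shown on the slide
REPLICATION = 3                 # 3-way replication

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs; only the last block may be shorter."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: a simplified placement used only for illustration.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }
```

For a 300 MB file this yields two full 128 MB blocks plus one 44 MB block, and each block ends up on three different nodes so that losing one node does not lose data.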
Indexing
Data-aware organization
Global Index partitions the records into blocks
Local Indexes organize the records in a partition
Challenges:
Big volume
HDFS limitation
New programming paradigms
Ad-hoc indexes
[Figure: a global index over partitions, each with its own local index]
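The two index levels can be sketched as follows, assuming a simple range-partitioned global index and a sorted list as the local index. Both choices and all function names are illustrative assumptions; the slide does not prescribe a particular index structure.

```python
from bisect import bisect_left, bisect_right

def build_global_index(keys, num_partitions):
    """Pick split points so each partition receives roughly the same number of records."""
    ordered = sorted(keys)
    step = len(ordered) / num_partitions
    return [ordered[int(i * step)] for i in range(1, num_partitions)]

def partition_of(key, boundaries):
    """Global index lookup: which partition holds this key?"""
    return bisect_right(boundaries, key)

def build_partitions(records, boundaries):
    """Route each record to its partition, then sort each partition (the local index)."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        parts[partition_of(key, boundaries)].append((key, value))
    for part in parts:
        part.sort()
    return parts

def lookup(key, boundaries, parts):
    """Use the global index to pick one partition, then binary-search inside it."""
    part = parts[partition_of(key, boundaries)]
    part_keys = [k for k, _ in part]
    i = bisect_left(part_keys, key)
    if i < len(part) and part_keys[i] == key:
        return part[i][1]
    return None
```

A lookup touches a single partition instead of all of them, which is the main payoff of the global index; the local index then avoids scanning that partition.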
Fault Tolerance
Replication
Redundancy
Multiple masters
Streaming
Sub-second latency for queries
One scan over the data
(Partial) preprocessing
Continuous queries
Eviction strategies
In-memory indexes
[Figure: a continuous bit stream with a processing window over part of it]
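A processing window with eviction can be sketched with a bounded queue. This assumes a count-based sliding window with oldest-first eviction, which is one possible strategy among several; the class name is hypothetical.

```python
from collections import deque

class SlidingWindowCounter:
    """Maintain the number of 1-bits among the most recent `size` items.

    Each item is seen exactly once (one scan over the data); the running
    count acts as a tiny in-memory summary that answers a continuous query.
    """

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.ones = 0

    def add(self, bit):
        """Ingest one bit, evict the oldest if the window is full, return the count."""
        self.window.append(bit)
        self.ones += bit
        if len(self.window) > self.size:
            self.ones -= self.window.popleft()  # eviction: drop the oldest item
        return self.ones
```

Each update costs constant time, so the answer stays current without rescanning the stream.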
Task Execution
MapReduce
Map-Shuffle-Reduce
Resiliency through materialization
Resilient Distributed Datasets (RDD)
Directed-Acyclic-Graph (DAG)
In-memory processing
Resiliency through lineages
Hyracks
Stragglers
Load balance
[Figure: map tasks M1, M2, …, Mm feeding reduce tasks R1, R2, …, Rn]
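The Map-Shuffle-Reduce pipeline can be simulated on a single machine to show the data flow. This is a sketch using the classic word-count example; the function names are hypothetical and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def run_job(documents):
    """Run the three phases in sequence over a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, vs) for k, vs in grouped.items())
```

In a real cluster each map task would handle one input block and each reduce task one group of keys, with intermediate results materialized to disk between the phases.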
Query Optimization
Finding the most efficient query plan
e.g., grouped aggregation
Cost model (CPU – Disk – Network)
[Figure: two alternative query plans for grouped aggregation, built from Partition, Agg, and Merge operators]
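Two candidate plans for a grouped sum can be compared by how many records they send over the network. This is a sketch under stated assumptions: the original figure is lost, so the two plans below (partition then aggregate, versus aggregate locally then merge) are a plausible reading of it, and the shuffled-record count is a crude stand-in for a real cost model.

```python
from collections import Counter

def plan_partition_then_aggregate(partitions):
    """Plan A: shuffle every raw record by key, then aggregate once."""
    shuffled = [rec for part in partitions for rec in part]
    result = Counter()
    for key, value in shuffled:
        result[key] += value
    return dict(result), len(shuffled)

def plan_aggregate_then_merge(partitions):
    """Plan B: aggregate inside each partition, shuffle partial sums, then merge."""
    partials = []
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value
        partials.append(local)
    shuffled = sum(len(p) for p in partials)  # one partial sum per key per partition
    result = Counter()
    for partial in partials:
        for key, value in partial.items():
            result[key] += value
    return dict(result), shuffled
```

Both plans return the same answer, but plan B shuffles fewer records whenever keys repeat within a partition. An optimizer would weigh that network saving against the extra CPU work of the local aggregation.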
Provenance
Debugging in distributed systems is painful
We need to keep track of transformations on each record
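One simple way to track transformations per record is to carry a lineage list alongside each value. This is a minimal sketch with a hypothetical `TrackedRecord` class; production provenance systems are far more compact about what they store.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrackedRecord:
    """A value together with the history of transformations that produced it."""
    value: object
    lineage: tuple = ()

    def apply(self, name, fn):
        """Apply fn to the value and append (step name, input value) to the lineage."""
        return TrackedRecord(fn(self.value), self.lineage + ((name, self.value),))
```

For example, `TrackedRecord(" 42 ").apply("strip", str.strip).apply("to_int", int)` ends with the value `42` and a lineage that shows each step and the input it received, which makes it possible to trace a bad output back to the step that introduced it.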
Big Graphs
Motivated by social networks
Billions of nodes and trillions of edges
Tens of thousands of insertions per second
Complex queries with graph traversals
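A graph traversal query can be illustrated with breadth-first search over an adjacency list. This is a single-machine sketch with a hypothetical toy graph; at the scale named above the graph would be partitioned across machines and traversed in parallel.

```python
from collections import deque

def bfs_hops(adjacency, source):
    """Return the minimum number of hops from source to every reachable node."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:  # first visit gives the shortest hop count
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops
```

On a social network this answers questions such as "who is within two hops of this user", a typical building block of friend recommendation.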
Hadoop Ecosystem
Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
MapReduce Query Engine
Administration
Pig
Spark Ecosystem
Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
Kubernetes
Resilient Distributed Dataset (RDD), a.k.a. Spark Core
DataFrames, MLlib, GraphX, SparkR, Spark Streaming
Spark SQL
Hyracks Data-parallel Platform
Algebricks Algebra Layer
Hadoop MapReduce Compatibility
Pregelix
Hivesterix, AsterixDB, Other compilers
Hyracks jobs, Pregel jobs, MapReduce jobs
PigLatin, HiveQL, AsterixQL
Impala
Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
Query Executor
Query Planner
Query Parser
SpatialHadoop
Hadoop Distributed File System (HDFS) + Spatial Indexing
Yet Another Resource Negotiator (YARN)
MapReduce Processing + Spatial Query Processing
Spatial Visualization
Pig Latin + Pigeon
Reading Material
“The Age of Analytics in a Data-driven World” [Executive Summary] by McKinsey & Company