CS167 Introduction to Big-dataeldawy/20SCS167/slides/CS167-01-Intro.pdf · Components of Big Data...


CS167

Introduction to Big-data

Instructor: Ahmed Eldawy


Welcome to UCR! (Virtually)


Class information

Classes: Tuesday, Thursday 2:00 – 3:20 PM via Zoom

Instructor: Ahmed Eldawy

Office hours: Tuesday, Thursday 3:30-4:30

Conflicts?

TAs: Tin Vu and Akil Sevim

Website: http://www.cs.ucr.edu/~eldawy/20SCS167/

Email: [email protected] Subject: “[CS167] …”

Piazza: https://piazza.com/ucr/spring2020/cs167


Class Logistics

All classes will be recorded

Ask questions in the chat window

The TA will answer your questions by text (if possible)

The instructor will answer questions that need further attention

Raise your hand (virtually) if you have a question that you would like to ask verbally


Lab Logistics

All labs will be on Zoom

Attend the session that you are enrolled in

The TA will share their screen

Students will follow the instructions on their machines

Ask questions in the chat

If you have a question, you can share your screen with the TA to get help!!


Course work

Assignments (15%)

Labs (30%)

Mid-terms (15%+15%)

Final (25%)

All exams will be open slides, notes, and books.


Textbook

No required textbook

Recommended textbooks

1. “Spark: The Definitive Guide: Big Data Processing Made Simple”, 1st Edition, by Bill Chambers and Matei Zaharia

ISBN-13: 978-1491912218

ISBN-10: 1491912219

2. “Data Analytics Made Accessible”, 2020 edition, by Anil Maheshwari


Course goals

What are your goals?

Understand what big data means

Identify the internal components of big data platforms

Recognize the differences between different big data platforms

Explain how a distributed query runs on big data


Ant-Man/Wasp

Get smaller to understand how ants work and what they are capable of.

Use this knowledge to control thousands of ants and do amazing things!


Big-data Expert

Understand how the big-data platforms really work

Control those thousands of processors efficiently to carry out your task


Syllabus

Overview of big data

Big-data storage

Big-data processing

Big-data indexing

Big-SQL processing

Programming packages


Introduction


The Market of Big Data


Job Market

https://www.techicy.com/5-best-programming-languages-to-watch-out-in-2019-for-data-science.html


Four (originally Three) V’s of Big Data


Big Data vs. Big Computation

Full scans (e.g., log processing)

Range scans

Point lookups

Iterations

Joins (self, binary, or multiway)

Proximity queries

Closures and graph traversals
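To make the first three access patterns concrete, here is a minimal sketch over a small sorted in-memory dataset. The dataset and the function names are hypothetical and chosen only for illustration; a real system would run these over partitioned files or indexes.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted dataset of (key, value) records; keys are unique.
records = [(k, f"v{k}") for k in range(0, 100, 5)]  # keys 0, 5, ..., 95
keys = [k for k, _ in records]

def full_scan(predicate):
    """Touch every record, e.g. to process a whole log."""
    return [r for r in records if predicate(r)]

def range_scan(lo, hi):
    """Return records with lo <= key <= hi by exploiting the sort order."""
    return records[bisect_left(keys, lo):bisect_right(keys, hi)]

def point_lookup(key):
    """Return the single record with this key, or None if it is absent."""
    i = bisect_left(keys, key)
    return records[i] if i < len(keys) and keys[i] == key else None
```

A full scan costs time proportional to the whole dataset, while the range scan and point lookup only touch the records they return (plus a binary search), which is why the access pattern matters when choosing a storage layout.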


Big Data Applications

Web search

Marketing and advertising

Data cleaning

Knowledge base

Information retrieval

Internet of Things (IoT)

Visualization

Behavioral studies


Publicly Available Datasets

Data.gov

Data.gov.uk

UCR STAR [https://star.cs.ucr.edu]

Twitter Streaming API

Yahoo! Webscope [http://webscope.sandbox.yahoo.com/]

GDELT [http://www.gdeltproject.org/]

Instagram API


Big Data Landscape 2012

http://mattturck.com/2012/06/29/a-chart-of-the-big-data-ecosystem/


Big Data Landscape 2014

http://mattturck.com/2014/05/11/the-state-of-big-data-in-2014-a-chart/


Big Data Landscape 2016

http://mattturck.com/2016/02/01/big-data-landscape/


Big Data Landscape 2018


Components of Big Data


Components of Big Data

Coordination/Cluster Management: Oozie, YARN, Kubernetes

Cloud Services: Amazon Web Services, Microsoft Azure, and Google Cloud Platform

Big Data Distributed Storage: Hadoop Distributed File System, Cloud storage systems (Amazon S3 and Google File System), Key-value stores

Distributed Computing: MapReduce (Hadoop and Google), Resilient Distributed Dataset (Spark), Hyracks (AsterixDB)

High-level Languages: SparkSQL, Pig, SQL++, HiveQL

Big-data Libraries: MLlib (Machine Learning), GraphX


Storage of Big Data

Data is growing faster than Moore’s Law

Too much data to fit on a single machine

Partitioning

Replication

Fault-tolerance


Hadoop Distributed File System (HDFS)

The most widely used distributed file system

Fixed-sized partitioning

3-way replication

Write-once read-many

See also: GFS, Amazon S3, Azure Blob Storage

128MB 128MB 128MB 128MB 128MB 128MB …
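The fixed-size partitioning and 3-way replication above can be sketched as follows. This is a toy model under stated assumptions: the node names and the 300 MB file are made up, and the round-robin placement is only an illustration, not the placement policy HDFS actually uses.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size shown on the slide
REPLICATION = 3                 # 3-way replication

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs; only the last block may be shorter."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct nodes.
    This is NOT the HDFS placement policy, only an illustration."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(num_blocks)}

blocks = split_into_blocks(300 * 1024 * 1024)   # a hypothetical 300 MB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.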


Indexing

Data-aware organization

Global Index partitions the records into blocks

Local Indexes organize the records in a partition

Challenges:

Big volume

HDFS limitation

New programming paradigms

Ad-hoc indexes

[Figure: global index and local indexes]
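A minimal sketch of the two index levels, assuming integer keys and made-up split points. The global index routes a record to a partition, and the local index (here just a sorted array) organizes the records inside that partition.

```python
from bisect import bisect_left, bisect_right

# Hypothetical integer keys; the split points below are made up for the example.
boundaries = [100, 200]          # global index: 3 partitions (<100, 100-199, >=200)
keys = [250, 17, 120, 199, 5, 301, 100]

# Global index: route each record to a partition by key range.
partitions = [[] for _ in range(len(boundaries) + 1)]
for k in keys:
    partitions[bisect_right(boundaries, k)].append(k)

# Local index: here simply a sorted array inside each partition.
for p in partitions:
    p.sort()

def lookup(key):
    """Use the global index to pick one partition, then the local index to search it."""
    pid = bisect_right(boundaries, key)
    part = partitions[pid]
    i = bisect_left(part, key)
    found = i < len(part) and part[i] == key
    return pid, found
```

A lookup touches a single partition instead of all of them, which is the main benefit of the global index on a distributed file system.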


Fault Tolerance

Replication

Redundancy

Multiple masters


Key-value Stores


1 → Jack [email protected]

2 → Jill [email protected]

3 → Alex [email protected]

ID Name Email …

1 Jack [email protected]

2 Jill [email protected]

3 Alex [email protected]
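The relationship between the table and the key-value view can be sketched with a plain dictionary. The e-mail addresses below are placeholders (the slide's addresses are not reproduced here), and `get`/`put` are hypothetical helper names, not the API of any particular key-value store.

```python
# Placeholder e-mail addresses (hypothetical; not taken from the slide).
rows = [
    {"id": 1, "name": "Jack", "email": "jack@example.com"},
    {"id": 2, "name": "Jill", "email": "jill@example.com"},
    {"id": 3, "name": "Alex", "email": "alex@example.com"},
]

# Key-value view: the ID is the key, the rest of the row is an opaque value.
store = {row["id"]: (row["name"], row["email"]) for row in rows}

def get(key):
    return store.get(key)      # lookup by key only; None if absent

def put(key, value):
    store[key] = value         # insert or overwrite
```

The store only understands keys, so a query such as "find the user with a given e-mail" would need a full scan or a second mapping keyed by e-mail.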


Streaming

Sub-second latency for queries

One scan over the data

(Partial) preprocessing

Continuous queries

Eviction strategies

In-memory indexes

…1000100010101011101110101010110111010111011101110100…

Processing window
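The processing window, one-scan, and eviction ideas can be illustrated with a small sliding-window counter. This is a sketch under assumptions: the window size of 4 and the class name are made up, and the input is just the first few bits of the stream shown above.

```python
from collections import deque

class SlidingWindowCount:
    """Continuous query: number of 1-bits among the last `size` items."""
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.ones = 0

    def push(self, bit):
        self.window.append(bit)
        self.ones += bit
        if len(self.window) > self.size:      # eviction: drop the oldest item
            self.ones -= self.window.popleft()
        return self.ones

stream = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1]    # first bits of the slide's stream
q = SlidingWindowCount(size=4)
answers = [q.push(b) for b in stream]
```

Each item is seen exactly once, and the answer is updated incrementally in constant time per item instead of rescanning the window.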


Structured/Semi-structured


ID Name Email …

1 Jack [email protected]

2 Jill [email protected]

3 Alex [email protected]

Document 1

{ "id": 1, "name": "Jack", "email": "[email protected]", "address": {"street": "900 university ave", "city": "Riverside", "state": "CA"}, "friend_ids": [3, 55, 123] }

Document 2

{ "id": 2, "name": "Jill", "email": "[email protected]", "hobbies": ["hiking", "cooking"] }
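A short sketch of what reading semi-structured documents looks like in practice. The e-mail values are placeholders (assumptions, not taken from the slide); the other fields follow the two documents above.

```python
import json

# Placeholder e-mail values (hypothetical); other fields follow the slide.
docs = [
    '{"id": 1, "name": "Jack", "email": "jack@example.com", '
    '"address": {"street": "900 university ave", "city": "Riverside", "state": "CA"}, '
    '"friend_ids": [3, 55, 123]}',
    '{"id": 2, "name": "Jill", "email": "jill@example.com", '
    '"hobbies": ["hiking", "cooking"]}',
]

parsed = [json.loads(d) for d in docs]

# No fixed schema: collect the set of fields each document actually has.
fields = {d["id"]: sorted(d.keys()) for d in parsed}

# Readers must tolerate missing or nested fields.
cities = [d.get("address", {}).get("city") for d in parsed]
```

Unlike the table, each document carries its own structure, so code that reads the data has to handle fields that exist in one document and not in another.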


Distributed Computing

Coordination/Cluster Management

Cloud Services

High-level Languages

Big-data Libraries

Big Data Storage

Distributed Computing


Traditional Distributed Computing

Centralized Big Data

Coordinator

Workers

Ship data to computation paradigm

e.g., High performance computing (HPC)


Big-data Computing

Ship compute to data paradigm

Storage/Compute Nodes

Coordinator

Send program and task information to where the data is
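The ship-compute-to-data idea can be sketched with a toy cluster. Everything here is hypothetical (the `Node` class, the node names, and the data); the point is only that the function travels to the nodes and small partial results travel back.

```python
# Hypothetical cluster: each node holds its own slice of the data locally.
class Node:
    def __init__(self, name, local_records):
        self.name = name
        self.local_records = local_records

    def run(self, task):
        # The task (code) arrives here; the records never leave the node.
        return task(self.local_records)

nodes = [
    Node("node1", [3, 8, 1]),
    Node("node2", [7, 2]),
    Node("node3", [9, 4, 6, 5]),
]

def count_and_sum(records):
    """Task shipped to every node; returns a small partial result."""
    return len(records), sum(records)

# Coordinator: send the task out, then combine the small partial results.
partials = [node.run(count_and_sum) for node in nodes]
total_count = sum(c for c, _ in partials)
total_sum = sum(s for _, s in partials)
mean = total_sum / total_count
```

Only one (count, sum) pair per node crosses the network, regardless of how many records each node stores.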


Task Execution

MapReduce

Map-Shuffle-Reduce

Resiliency through materialization

Resilient Distributed Datasets (RDD)

Directed-Acyclic-Graph (DAG)

In-memory processing

Resiliency through lineages

Hyracks

Stragglers

Load balance

M1 M2 … Mm

R1 R2 Rn
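The Map-Shuffle-Reduce flow can be sketched in a few lines as a single-process word count. This is an illustration of the programming model only; the function names are hypothetical and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: combine the values of one key."""
    return key, sum(values)

lines = ["big data", "big compute", "data data"]
mapped = [pair for line in lines for pair in map_fn(line)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())
```

In a real cluster the map calls run on the mappers M1..Mm, the shuffle moves each key to one reducer, and the reduce calls run on R1..Rn.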


Query Optimization

Finding the most efficient query plan

e.g., grouped aggregation

Cost model (CPU – Disk – Network)

[Figure: two alternative query plans for grouped aggregation, built from Agg, Merge, and Partition operators]
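One way to read the grouped-aggregation example is as a choice between shuffling all records before aggregating and aggregating locally before merging. The sketch below compares the two under a deliberately simple cost model (number of records shipped over the network). The data, the function names, and this reading of the two plans are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical input: (group, value) records already spread over three workers.
workers = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("a", 5), ("b", 6)],
    [("b", 7), ("b", 8), ("a", 9)],
]

def plan_partition_then_agg(workers):
    """Plan 1: shuffle every record by group, then aggregate once."""
    shuffled = [rec for w in workers for rec in w]      # all records cross the network
    result = Counter()
    for group, value in shuffled:
        result[group] += value
    return dict(result), len(shuffled)

def plan_local_agg_then_merge(workers):
    """Plan 2: aggregate on each worker first, then merge the partial sums."""
    partials = []
    for w in workers:
        local = Counter()
        for group, value in w:
            local[group] += value
        partials.append(local)
    shipped = sum(len(p) for p in partials)             # only partial sums are shipped
    result = Counter()
    for p in partials:
        result.update(p)                                # Counter.update adds the sums
    return dict(result), shipped

result1, cost1 = plan_partition_then_agg(workers)
result2, cost2 = plan_local_agg_then_merge(workers)
```

Both plans return the same answer, but the second ships fewer records here because there are few distinct groups. With many distinct groups the local aggregation saves little, which is why an optimizer needs a cost model rather than a fixed rule.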


Provenance

Debugging in distributed systems is painful

We need to keep track of transformations on each record


Big Graphs

Motivated by social networks

Billions of nodes and trillions of edges

Tens of thousands of insertions per second

Complex queries with graph traversals


Declarative MapReduce

MapReduce has been used to create many reusable operators (e.g., relational operators)

Filter: Map

Aggregate: Map, Reduce

Grouped aggregate: Map, Reduce

Equi-join: Map, Reduce

Non-equi-join: Map, Reduce
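As an example of a relational operator expressed in MapReduce, here is a sketch of an equi-join in the common reduce-side style: the map functions tag each record with its source, and the reducer pairs up records that share a join key. The tables and function names are hypothetical, and this is one possible formulation rather than the one any specific system uses.

```python
from collections import defaultdict

# Hypothetical inputs: users(id, name) and orders(user_id, item).
users = [(1, "Jack"), (2, "Jill"), (3, "Alex")]
orders = [(1, "book"), (3, "pen"), (1, "lamp")]

def map_users(rec):
    uid, name = rec
    return uid, ("U", name)        # tag the value with its source table

def map_orders(rec):
    uid, item = rec
    return uid, ("O", item)

def reduce_join(uid, tagged_values):
    """All values sharing a join key arrive together; pair up the two sides."""
    names = [v for tag, v in tagged_values if tag == "U"]
    items = [v for tag, v in tagged_values if tag == "O"]
    return [(uid, n, i) for n in names for i in items]

# Shuffle: group the tagged map outputs by join key.
groups = defaultdict(list)
for key, value in [map_users(r) for r in users] + [map_orders(r) for r in orders]:
    groups[key].append(value)

joined = sorted(row for uid, vals in groups.items() for row in reduce_join(uid, vals))
```

User 2 has no orders, so it produces no output rows, which matches inner-join semantics.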


Declarative Languages

Describe what you want to do, not how to do it

The most popular example is SQL

Can we compile SQL queries into MapReduce program(s)?


Pig

A system built on top of Hadoop (now supports Spark as well)

Provides a SQL-ETL-like query language termed Pig Latin

Compiles Pig Latin programs into MapReduce programs


Additional Features

Lazy execution

Nothing actually gets executed until the STORE command is reached

Consolidation of map-only jobs

Map-only jobs (FILTER and FOREACH) can be consolidated into the next job’s map function or the previous job’s reduce function
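Both features can be illustrated with a toy pipeline in which FILTER and FOREACH are only recorded, and STORE runs all of them in a single pass. The `LazyPipeline` class is hypothetical and written only for this illustration; it is not how Pig is implemented.

```python
class LazyPipeline:
    """Toy pipeline: FILTER/FOREACH are only recorded; STORE triggers execution."""
    def __init__(self, source):
        self.source = source
        self.steps = []          # recorded map-only operations
        self.passes = 0          # how many times the input was scanned

    def filter(self, pred):
        self.steps.append(("filter", pred))
        return self

    def foreach(self, fn):
        self.steps.append(("foreach", fn))
        return self

    def store(self):
        """Run all recorded steps in ONE pass over the data (consolidation)."""
        self.passes += 1
        out = []
        for rec in self.source:
            keep = True
            for kind, f in self.steps:
                if kind == "filter":
                    if not f(rec):
                        keep = False
                        break
                else:
                    rec = f(rec)
            if keep:
                out.append(rec)
        return out

p = LazyPipeline(range(10)).filter(lambda x: x % 2 == 0).foreach(lambda x: x * x)
passes_before_store = p.passes     # nothing has run yet
result = p.store()
```

Deferring execution is what makes the consolidation possible: the system sees the whole chain of map-only steps before it decides how many passes over the data it needs.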


SparkSQL

Redesigned to consider Spark query model

Supports all the popular relational operators

Can be intermixed with RDD operations

Uses the DataFrame API as an enhancement to the RDD API

DataFrame = RDD + schema


Hadoop Ecosystem

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

MapReduce Query Engine

Administration

Pig


Spark Ecosystem

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

Resilient Distributed Dataset (RDD), a.k.a. Spark Core

DataFrames, MLlib, GraphX, SparkR, Spark Streaming

Spark SQL

Kubernetes


Hyracks Data-parallel Platform

Algebricks Algebra Layer

Hadoop MapReduce Compatibility

Pregelix

Hivesterix, AsterixDB, other compilers

Hyracks jobs, Pregel jobs, MapReduce jobs

Pig Latin, HiveQL, AsterixQL


Impala

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

Query Executor

Query Planner

Query Parser


SpatialHadoop

Hadoop Distributed File System (HDFS) + Spatial Indexing

Yet Another Resource Negotiator (YARN)

MapReduce Processing + Spatial Query Processing

Spatial Visualization

Pig Latin + Pigeon


Reading Material

“The Age of Analytics in a Data-driven World” [Executive Summary] by McKinsey & Company