COP 6727: Advanced Database Systems Spring 2013 Dr. Tao Li Florida International University.


Student Self-Introduction

• Name
– I will try to remember your names. If you have a long name, please let me know what I should call you.

• Anything you want us to know


Course Overview

• Meeting time
– Tuesday and Thursday, 12:30pm – 1:45pm
• Office hours
– Thursday, 2:30pm – 4:30pm, or by appointment
• Course webpage
– http://www.cs.fiu.edu/~taoli/class/CAP6727-S13/index.html

Course Objectives

• This is an advanced database course
– You should already have taken COP5725

• Assume knowledge of the fundamental concepts of relational databases.

• Cover the core principles and techniques of data and information management

• Discuss advanced techniques that can be applied to traditional database systems in order to provide efficient support of new emerging applications.

Tentative Topics

• Query processing and optimization
• Transaction management
• Database tuning
• Data stream systems
• Spatial databases
• XML
• Information retrieval and Web data management
• Scalable data processing
• Readings in recent developments in database systems and applications
– SQL vs. non-SQL databases
– Nearest neighbor queries
– High-dimensional indexing
– Database retrieval and ranking
– Stream processing
– Big Data
– Incremental and online query processing
– Mobile databases

Assignments and Grading

• Reading/Written Assignments
• Programming Projects
• Midterm Exam
• Final Project/Presentations
• Class attendance is mandatory.
• Evaluation will be a subjective process
– Effort is a very important component
• Regular In-class Students
– Quizzes and Class Participation: 5%
– Midterm Exam: 30%
– Final Project: 30%
– Assignments and Projects: 35%
• Online Students
– Midterm Exam: 30%
– Final Project: 30%
– Homework Assignments: 40%

Text and References

Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. Third Edition, McGraw Hill, 2003. ISBN: 0-07-246563-8.

In addition, the course materials will also be drawn from recent research literature.

Lecture 1 & 2

• Lecture 1 & 2: Introduction to MapReduce (most slides are adapted from Bill Graham, Spiros Papadimitriou, and Cloudera tutorials)

Outline

• Motivation for MapReduce

• What is MapReduce?

• What is Hadoop?

• What is Hive?


Motivation for MapReduce

• The Big Data

• How to handle big data?


The Big Data

• Big data is everywhere

• Documents
– Blogs (77 million Tumblr and 56.6 million WordPress blogs as of 2012), microblogs, news, reviews
• Images
– Instagram, Flickr (more than 6 billion images)
• Videos
– YouTube, all broadcasts
• Others
– Maps (Google Maps)
– Human genome
– Aeronautics and space data

Another view on “big”

• 2008: Google processes 20 PB a day

• 2009: Facebook has 2.5 PB user data + 15 TB/day

• 2009: eBay has 6.5 PB user data + 50 TB/day

• 2011: Yahoo! has 180-200 PB of data

• 2012: Facebook ingests 500 TB/day


Why do we care about those data?

• Modeling and predicting information flow
• Recommend/predict links in social networks
• Relevance classification / information filtering
• Sentiment analysis and opinion mining
• Topic modeling and evolution
• Measuring influence in social networks
• Concept mapping
• Search
• …

Big data analysis

• Scalability (with reasonable cost)
– Algorithmic improvements
– Intuitive way: divide and conquer

Divide and Conquer


Challenges

• Parallel processing is complicated
– How do we assign tasks to workers?
– What if we have more tasks than slots?
– What happens when tasks fail?
– How do you handle distributed synchronization?

Challenges – Cont’d

• Data storage is not trivial
– A traditional database is not reliable enough at this scale
• Data volumes are massive
• Reliably storing PBs of data is challenging
– Disk/hardware/network failures
– Probability of a failure event increases with the number of machines
• For example:
– 1000 hosts, each with 10 disks; each disk lasts 3 years
– How many failures per day?
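The question on this slide can be answered with a back-of-the-envelope estimate. The sketch below assumes that failures are spread evenly over a disk's lifetime and that a year has 365 days; both are simplifying assumptions for illustration, not measurements, and the function name is made up for this example.

```python
# Back-of-the-envelope estimate of disk failures per day in a cluster.
# Assumption: each disk fails exactly once per lifetime, and failures are
# spread uniformly over time.

def expected_failures_per_day(hosts, disks_per_host, lifetime_years):
    total_disks = hosts * disks_per_host
    lifetime_days = lifetime_years * 365
    return total_disks / lifetime_days


rate = expected_failures_per_day(hosts=1000, disks_per_host=10, lifetime_years=3)
print(f"{rate:.1f} disk failures per day")  # 10000 / 1095, about 9.1
```

Under these assumptions the cluster loses roughly nine disks every day, which is why storage at this scale has to treat failure as the normal case.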


What is MapReduce?

• A programming model for expressing distributed computations at a massive scale

• An execution framework for organizing and performing such computations

• An open-source implementation called Hadoop


Workflow of Large Data Problem


MapReduce paradigm

• Implement two functions:
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else*
• Values with the same key go to the same reducer
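The two functions and the grouping step can be imitated in a few lines of plain Python. This is only a single-process sketch of the programming model, not Hadoop code: `run_job` is a hypothetical stand-in for the framework, and the two input documents are made up.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts collected for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Stand-in for the framework: run mappers, group by key, run reducers."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # values with the same key end up together
    results = []
    for k2 in sorted(groups):  # a reducer receives its keys in sorted order
        results.extend(reducer(k2, groups[k2]))
    return results


records = [("doc1", "hello world"), ("doc2", "hello java")]
word_counts = run_job(records, map_fn, reduce_fn)
print(word_counts)  # [('hello', 2), ('java', 1), ('world', 1)]
```

In a real cluster the grouping step is a distributed shuffle across machines; here it is just a dictionary.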


MapReduce Flow


An Example


MapReduce paradigm – Cont’d

• There’s more!
• Partitioners decide which key goes to which reducer
– partition(k’, numPartitions) -> partNumber
– Divides the key space into chunks for parallel reducers
– Default is hash-based
• Combiners can combine Mapper output before sending it to the reducer
– Reduce(k2, list(v2)) -> list(v3)
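A hash-based partitioner and a word-count combiner might look roughly like the sketch below. It is written in plain Python for illustration; CRC32 is used here only as a stable stand-in hash (Hadoop's own partitioner hashes the key object itself), and the function names are assumptions made for this example.

```python
import zlib
from collections import Counter


def partition(key, num_partitions):
    """partition(k', numPartitions) -> partNumber, using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so CRC32 is used to keep the assignment repeatable.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(mapper_output):
    """Pre-aggregate (word, count) pairs from ONE mapper before the shuffle."""
    totals = Counter()
    for word, count in mapper_output:
        totals[word] += count
    return sorted(totals.items())


mapper_output = [("hello", 1), ("world", 1), ("hello", 1)]
combined = combine(mapper_output)
print(combined)  # [('hello', 2), ('world', 1)]

# Every occurrence of a key maps to the same reducer number.
for word, count in combined:
    print(word, "-> reducer", partition(word, 4))
```

The combiner shrinks three pairs to two before anything crosses the network, which is the whole point: less data in the shuffle, same final result for sums.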


MapReduce Flow


MapReduce additional details

• Reduce starts after all mappers complete

• Mapper output gets written to disk

• Intermediate data can be copied sooner

• Reducer gets keys in sorted order

• Keys are not sorted across reducers

• Global sort requires 1 reducer or smart partitioning
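One form of "smart partitioning" is range partitioning: if reducer i only receives keys that sort before all keys of reducer i+1, concatenating the reducer outputs gives a global sort. The sketch below illustrates the idea in plain Python; the split points and keys are made up, and in practice split points would be chosen by sampling the data.

```python
import bisect


def range_partition(key, boundaries):
    """Return the reducer number for a key, given sorted split points.

    Reducer 0 gets keys <= boundaries[0], reducer 1 the next range, and so on;
    the last reducer gets everything above the final split point.
    """
    return bisect.bisect_left(boundaries, key)


def run_reducers(keys, boundaries):
    """Send each key to its reducer; each reducer sorts only its own keys."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[range_partition(key, boundaries)].append(key)
    return [sorted(p) for p in partitions]


keys = ["pear", "apple", "zebra", "mango", "fig", "kiwi"]
parts = run_reducers(keys, boundaries=["g", "n"])
print(parts)  # [['apple', 'fig'], ['kiwi', 'mango'], ['pear', 'zebra']]

# Reading the reducer outputs in order yields a globally sorted sequence.
globally_sorted = [k for p in parts for k in p]
print(globally_sorted == sorted(keys))  # True
```

With the default hash partitioning the same concatenation would not be sorted, because neighbouring keys land on arbitrary reducers.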


MapReduce is good at

• Embarrassingly parallel algorithms

• Summing, grouping, filtering, joining

• Off-line batch jobs on massive data sets

• Analyzing an entire large dataset


MapReduce can do

• Iterative jobs (e.g., PageRank, K-means Clustering)
– Each iteration must read/write data to disk
– IO and latency cost of an iteration is high

MapReduce is not good at

• Jobs that need shared state/coordination
– Tasks are shared-nothing
– Shared state requires a scalable state store

• Low-latency jobs

• Jobs on small datasets

• Finding individual records


Summary of MapReduce

• Simple programming model

• Scalable, fault-tolerant

• Ideal for (pre-)processing large volumes of data


What is Hadoop?

• Hadoop is an open-source implementation based on GFS and MapReduce from Google

• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System

• Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004


Hadoop provides

• Redundant, fault-tolerant data storage

• Parallel computation framework

• Job coordination


Hadoop Stack


Who uses Hadoop?

• Yahoo!

• Facebook

• Last.fm

• Rackspace

• Digg

• Apache Nutch

• ...


• The Hadoop Distributed File System

• Redundant storage

• Designed to reliably store data using commodity hardware

• Designed to expect hardware failures

• Intended for large files

• Designed for batch inserts


Some Concepts about HDFS

• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files and blocks
• The SecondaryNameNode (SNN) holds a periodic checkpoint of the NN metadata (it is not a hot standby)
• DataNodes (DN) store and serve blocks
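To get a feel for these numbers, the sketch below computes how many blocks a file occupies and how much raw disk space its replicas take, using the 64 MB block size and replication factor of 3 from this slide. The 1000 MB file is a made-up example, and the helper names are assumptions for illustration.

```python
import math

BLOCK_SIZE_MB = 64   # block size from the slide (configurable in HDFS)
REPLICATION = 3      # replication factor from the slide (configurable)


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed; the last block may be only partly filled."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Raw disk space used across the cluster.

    The last block is not padded to the full block size, so raw usage
    is simply the file size times the replication factor.
    """
    return file_size_mb * replication


size_mb = 1000
print(block_count(size_mb))                 # 16 blocks
print(block_count(size_mb) * REPLICATION)   # 48 block replicas on DataNodes
print(raw_storage_mb(size_mb))              # 3000 MB of raw storage
```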

If a DataNode fails

• DNs check in with the NN to report health

• Upon failure, the NN orders DNs to replicate under-replicated blocks


Jobs and Tasks in Hadoop

• Job: a user-submitted map and reduce implementation to apply to a data set

• Task: a single mapper or reducer task
– Failed tasks get retried automatically
– Tasks run local to their data, ideally

• JobTracker (JT) manages job submission and task delegation

• TaskTrackers (TT) ask for work and execute tasks


Architecture


How to handle failed tasks?

• JT will retry failed tasks up to N attempts

• After N failed attempts for a task, job fails

• Some tasks are slower than others

• Speculative execution: the JT starts multiple instances of the same task

• The first one to complete wins; the others are killed
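The retry rule can be pictured as a simple loop. The sketch below is a plain-Python illustration of "retry up to N attempts, then fail the job", not the JobTracker's actual logic; `run_with_retries` and the `flaky` task are hypothetical names, and speculative execution is not modelled.

```python
def run_with_retries(task, max_attempts):
    """Call task(attempt) until it succeeds or max_attempts is used up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except Exception as exc:  # a failed attempt; try again if allowed
            last_error = exc
    # All attempts failed: the whole job is reported as failed.
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def flaky(attempt):
    """Simulated task that fails on its first two attempts."""
    if attempt < 3:
        raise IOError("simulated node failure")
    return f"done on attempt {attempt}"


print(run_with_retries(flaky, max_attempts=4))  # done on attempt 3
```

With `max_attempts=2` the same task would exhaust its attempts and the call would raise `RuntimeError`, mirroring a failed job.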


Data locality

• Move computation to the data

• Moving data between nodes has a cost

• Hadoop tries to schedule tasks on nodes with the data

• When that is not possible, the TT has to fetch data from a DN


Hadoop execution environment

• Local machine (standalone or pseudo-distributed)

• Virtual machine

• Cloud (e.g. Amazon EC2)

• Own cluster


Demo: word count

• Demo


Homework

• Write a Hadoop program to index the words within the text document dataset
– Example:
• Input:
– Doc1: Hello World!
– Doc2: Hello Java!
• Expected output:
– Hello \t Doc1 Doc2
– World \t Doc1
– Java \t Doc2
• Due: beginning of class on 01/10
• If you have any questions, send email to Jingxuan Li (jli003@cs.fiu.edu)
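The logic of the index can be checked on the two example documents before moving to Hadoop. The sketch below is plain Python in the map/reduce style, not a Hadoop program; the function names are made up, and treating a word as a run of letters (so that "World!" becomes "World") is an assumption about tokenization that the assignment does not spell out.

```python
import re
from collections import defaultdict


def index_map(doc_id, text):
    """Emit (word, doc_id) for every word; punctuation is dropped."""
    for word in re.findall(r"[A-Za-z]+", text):
        yield word, doc_id


def index_reduce(word, doc_ids):
    """Collect the distinct documents that contain the word."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in index_map(doc_id, text):
            groups[word].append(d)
    return dict(index_reduce(w, ids) for w, ids in groups.items())


docs = {"Doc1": "Hello World!", "Doc2": "Hello Java!"}
index = build_index(docs)
for word, doc_ids in index.items():
    print(word + "\t" + " ".join(doc_ids))
# Hello	Doc1 Doc2
# World	Doc1
# Java	Doc2
```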


Login Info

• Below is the login information for our Hadoop cluster
– Server: datamining-node03.cs.fiu.edu
– U: dbstudent p: ******* (announced during class)
– To access the working directory in HDFS (do not modify or remove the other directories!): hadoop fs -ls /user/dbstudent
• Input dataset for the homework (everyone will be working on this dataset, so do not modify it!): /user/dbstudent/dataset
• Output directory (including the source code and the indexing results) format: /user/dbstudent/output-PID

What is Hive?

• Data warehousing tool on top of Hadoop
• Originally developed at Facebook
– Now a Hadoop sub-project
• Data warehouse infrastructure
– Execution: MapReduce
– Storage: HDFS files
• Large datasets, e.g. Facebook daily logs
– 30GB (Jan’08), 200GB (Mar’08), 15+TB (2009)
• Hive QL: SQL-like query language

Motivation

• Missing components when using Hadoop MapReduce jobs to process data
– Command-line interface for “end users”
– Ad-hoc query support
– … without writing full MapReduce jobs
– Schema information

Hive Applications

• Log processing

• Text mining

• Document indexing

• Customer-facing business intelligence (e.g., Google Analytics)

• Predictive modeling, hypothesis testing


Hive Components

• Shell: allows interactive queries, like a MySQL shell connected to a database
– Also supports web and JDBC clients
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (M/R, HDFS, or metadata)
• Metastore: schema, location in HDFS

Data Model

• Tables
– Typed columns (int, float, string, date, boolean)
– Also list and map types (for JSON-like data)
• Partitions
– e.g., to range-partition tables by date
• Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
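The way partitions and buckets narrow down where a row lives can be sketched as follows. This is an illustration in plain Python, not Hive code: the table directory, the `dt` column, and the user id are hypothetical, and CRC32 stands in for whatever hash function Hive applies to the bucketing column.

```python
import zlib


def partition_dir(table_dir, column, value):
    """Directory for one partition, named column=value under the table directory."""
    return f"{table_dir}/{column}={value}"


def bucket_for(key, num_buckets):
    """Bucket number for a row, from a stable hash of the bucketing column."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


# Hypothetical table 'logs', partitioned by date and bucketed by user id.
path = partition_dir("/home/hive/warehouse/logs", "dt", "2009-01-01")
print(path)  # /home/hive/warehouse/logs/dt=2009-01-01
print(bucket_for("user42", 32))  # some bucket number between 0 and 31
```

A query restricted to one date only has to read one partition directory, and a sample of the table can be taken by reading a single bucket.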

Metastore

• Database: namespace containing a set of Tables

• Holds table definitions (column types, physical layout)

• Partition data

• Uses JPOX ORM for implementation; can be stored in Derby, MySQL, many other relational databases


Physical Layout

• Warehouse directory in HDFS
– e.g., /home/hive/warehouse
• Tables stored in subdirectories of warehouse
– Partitions, buckets form subdirectories of tables
• Actual data stored in flat files
– Control char-delimited text, or SequenceFiles
– With custom SerDe, can use arbitrary format

Useful command examples

• Start Hive: bin/hive
• Show all the tables: SHOW TABLES;
• Create a new table: CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
• Loading data into the table: LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;

Useful command examples – Cont’d

• Select data: SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10

• Join: INSERT OVERWRITE TABLE merged SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
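To see what the join produces without a Hive installation, the same query shape can be tried on a toy dataset. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Hive: the rows and frequencies are made up, the `merged` column names are assumptions, plain `INSERT INTO` replaces Hive's `INSERT OVERWRITE TABLE`, and `ORDER BY` replaces `SORT BY` (which in Hive sorts within each reducer rather than globally).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shakespeare (freq INTEGER, word TEXT);
    CREATE TABLE kjv (freq INTEGER, word TEXT);
    CREATE TABLE merged (word TEXT, shakespeare_freq INTEGER, kjv_freq INTEGER);
""")

# Made-up word frequencies, for illustration only.
conn.executemany("INSERT INTO shakespeare VALUES (?, ?)",
                 [(120, "the"), (3, "hath"), (2, "zounds")])
conn.executemany("INSERT INTO kjv VALUES (?, ?)",
                 [(200, "the"), (7, "hath"), (5, "begat")])

# Same join condition and filter as the Hive example on this slide.
conn.execute("""
    INSERT INTO merged
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN kjv k ON (s.word = k.word)
    WHERE s.freq >= 1 AND k.freq >= 1
""")

rows = conn.execute("SELECT * FROM merged ORDER BY word").fetchall()
print(rows)  # [('hath', 3, 7), ('the', 120, 200)]
```

Only words that appear in both tables survive the join, so "zounds" and "begat" are dropped.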


Summary of Hive

• Supports rapid iteration of ad-hoc queries

• Can perform complex joins with minimal code

• Scales to handle much more data than many similar systems


References

• White, T., Hadoop: The definitive guide, 2012

• http://hadoop.apache.org/

• http://hive.apache.org/

• MapReduce tutorial: http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v1.0

• Bill Graham, http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/BillGraham_IntroToHadoop_Aug30.pdf

• Spiros Papadimitriou, Jimeng Sun, and Rong Yan, http://cs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/07-1.pdf

• Cloudera, http://blog.cloudera.com/wp-content/uploads/2010/01/6-IntroToHive.pdf


Exercises

• To be announced
