Post on 26-Dec-2015
COP 6727: Advanced Database Systems
Spring 2013
Dr. Tao Li
Florida International University
COP6727 2
Student Self-Introduction
• Name
– I will try to remember your names, but if you have a long name, please let me know what I should call you
• Anything you want us to know
Course Overview
• Meeting time
– Tuesday and Thursday, 12:30pm – 1:45pm
• Office hours
– Thursday 2:30pm – 4:30pm, or by appointment
• Course webpage
– http://www.cs.fiu.edu/~taoli/class/CAP6727-S13/index.html
Course Objectives
• This is an advanced database course
– You should already have taken COP5725
• Assume knowledge of the fundamental concepts of relational databases
• Cover the core principles and techniques of data and information management
• Discuss advanced techniques that can be applied to traditional database systems to efficiently support emerging applications
Tentative Topics
• Query processing and optimization
• Transaction management
• Database tuning
• Data stream systems
• Spatial databases
• XML
• Information retrieval and Web data management
• Scalable data processing
• Readings in recent developments in database systems and applications
– SQL vs. non-SQL databases
– Nearest neighbor queries
– High-dimensional indexing
– Database retrieval and ranking
– Stream processing
– Big Data
– Incremental and online query processing
– Mobile databases
Assignments and Grading
• Reading/Written Assignments
• Programming Projects
• Midterm Exam
• Final Project/Presentations
• Class attendance is mandatory
• Evaluation will be a subjective process
– Effort is a very important component
• Regular In-class Students
– Quizzes and Class Participation: 5%
– Midterm Exam: 30%
– Final Project: 30%
– Assignments and Projects: 35%
• Online Students
– Midterm Exam: 30%
– Final Project: 30%
– Homework Assignments: 40%
Text and References
Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. Third Edition, McGraw Hill, 2003. ISBN: 0-07-246563-8.
In addition, the course materials will also be drawn from recent research literature.
Lecture 1 & 2
• Lecture 1 & 2: Introduction to MapReduce (most slides are adapted from Bill Graham, Spiros Papadimitriou, and Cloudera tutorials)
Outline
• Motivation for MapReduce
• What is MapReduce?
• What is Hadoop?
• What is Hive?
Motivation for MapReduce
• The Big Data
• How to handle big data?
The Big Data
• Big data is everywhere
• Documents
– Blogs (77 million Tumblr and 56.6 million WordPress as of 2012), microblogs, news, reviews
• Images
– Instagram, Flickr (more than 6 billion images)
• Videos
– YouTube, all broadcast
• Others
– Maps (Google Maps)
– Human genome
– Aeronautics and space data
Another view on “big”
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user data + 15 TB/day
• 2009: eBay has 6.5 PB user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day
Why do we care about those data?
• Modeling and predicting information flow
• Recommend/predict links in social networks
• Relevance classification / information filtering
• Sentiment analysis and opinion mining
• Topic modeling and evolution
• Measuring influence in social networks
• Concept mapping
• Search
• …
Big data analysis
• Scalability (with reasonable cost)
– Algorithmic improvements
– Intuitive way: divide and conquer
Divide and Conquer
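A minimal sketch of the divide-and-conquer idea, assuming a toy workload (summing numbers) and a local thread pool standing in for a cluster of workers. The function names `split`, `work`, and `divide_and_conquer` are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_chunks):
    """Divide: cut the input into roughly equal chunks."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def work(chunk):
    """Conquer: each worker processes its own chunk independently."""
    return sum(chunk)


def divide_and_conquer(data, num_workers=4):
    chunks = split(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(work, chunks))
    # Combine: merge the partial results into the final answer.
    return sum(partials)


total = divide_and_conquer(list(range(1, 101)))
print(total)  # 5050
```

The same three steps (partition the work, process partitions independently, combine the results) are what the following slides turn into a framework.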
Challenges
• Parallel processing is complicated
– How do we assign tasks to workers?
– What if we have more tasks than slots?
– What happens when tasks fail?
– How do you handle distributed synchronization?
Challenges (cont'd)
• Data storage is not trivial
– Traditional databases are not reliable
• Data volumes are massive
• Reliably storing PBs of data is challenging
– Disk/hardware/network failures
– Probability of a failure event increases with the number of machines
• For example:
– 1000 hosts, each with 10 disks, and a disk lasts 3 years
– How many failures per day?
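A back-of-the-envelope answer to the question above, assuming failures are spread evenly over time and every disk fails exactly once per 3-year lifetime (a simplification; real failure rates vary with disk age).

```python
hosts = 1000
disks_per_host = 10
disk_lifetime_days = 3 * 365  # a disk lasts 3 years

total_disks = hosts * disks_per_host
# Assumption: each disk fails once per lifetime, evenly spread over time.
failures_per_day = total_disks / disk_lifetime_days

print(f"{total_disks} disks, about {failures_per_day:.1f} failures per day")
# 10000 disks, about 9.1 failures per day
```

At roughly nine disk failures a day, failure is the normal case rather than the exception, which is why the storage layer has to handle it automatically.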
What is MapReduce?
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop
Workflow of Large Data Problem
MapReduce paradigm
• Implement two functions:
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else*
• Values with the same key go to the same reducer
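The two signatures can be tried out locally. This is a sketch, not Hadoop code: `run_job` is a hypothetical in-memory stand-in for the framework that only does the grouping step, and word count is used as the example job.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts of one word."""
    return [sum(counts)]


def run_job(inputs, map_fn, reduce_fn):
    """Stand-in for the framework: map, group values by key, then reduce."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # all values with the same key end up together
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}


docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]
counts = run_job(docs, map_fn, reduce_fn)
print(counts["the"])  # [3]
```

Only `map_fn` and `reduce_fn` are job-specific; everything in `run_job` is what the framework would do on the user's behalf, at cluster scale.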
MapReduce Flow
An Example
MapReduce paradigm (cont'd)
• There’s more!
• Partitioners decide which key goes to which reducer
– partition(k’, numPartitions) -> partNumber
– Divides the key space into chunks for parallel reducers
– Default is hash-based
• Combiners can combine mapper output before sending it to the reducer
– Reduce(k2, list(v2)) -> list(v3)
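A small sketch of both hooks, again in plain Python rather than Hadoop. CRC32 is used here only as a deterministic stand-in for whatever hash function the framework applies; the `combine` helper and the three-reducer setup are assumptions for illustration.

```python
import zlib
from collections import Counter


def partition(key, num_partitions):
    """Hash-based partitioner: the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(mapper_output):
    """Combiner: pre-aggregate one mapper's (word, count) pairs locally."""
    totals = Counter()
    for word, count in mapper_output:
        totals[word] += count
    return sorted(totals.items())


mapper_output = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
combined = combine(mapper_output)  # fewer pairs to send over the network
print(combined)  # [('fox', 1), ('the', 3)]

# Route each combined pair to one of 3 reducers.
buckets = {r: [] for r in range(3)}
for word, count in combined:
    buckets[partition(word, 3)].append((word, count))
```

A combiner is only safe when the reduce operation can be applied to partial results, as with sums or counts.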
MapReduce Flow
MapReduce additional details
• Reduce starts after all mappers complete
• Mapper output gets written to disk
• Intermediate data can be copied sooner
• Reducer gets keys in sorted order
• Keys not sorted across reducers
• Global sort requires 1 reducer or smart partitioning
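One reading of "smart partitioning" is range partitioning: if reducer i only receives keys smaller than those of reducer i+1, concatenating the per-reducer outputs gives a global sort. The sketch below assumes hand-picked boundary keys; how a real job chooses them (for example by sampling the input) is outside this example.

```python
import bisect


def range_partition(key, boundaries):
    """Send a key to the reducer whose key range contains it."""
    return bisect.bisect_right(boundaries, key)


def sorted_per_reducer(keys, boundaries):
    reducers = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        reducers[range_partition(key, boundaries)].append(key)
    # Each reducer sees only its own keys, in sorted order.
    return [sorted(r) for r in reducers]


keys = ["pear", "apple", "zebra", "mango", "fig", "kiwi"]
parts = sorted_per_reducer(keys, boundaries=["g", "n"])
print(parts)  # [['apple', 'fig'], ['kiwi', 'mango'], ['pear', 'zebra']]

# Concatenating reducer outputs in reducer order yields a global sort.
globally_sorted = [k for part in parts for k in part]
```

With a hash partitioner the same concatenation would not be sorted, because neighbouring keys land on arbitrary reducers.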
MapReduce is good at
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
MapReduce can do
• Iterative jobs (e.g., PageRank, K-means clustering)
– Each iteration must read/write data to disk
– I/O and latency cost of an iteration is high
MapReduce is not good at
• Jobs that need shared state/coordination
– Tasks are shared-nothing
– Shared state requires a scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
Summary of MapReduce
• Simple programming model
• Scalable, fault-tolerant
• Ideal for (pre-)processing large volumes of data
What is Hadoop?
• Hadoop is an open-source implementation based on GFS and MapReduce from Google
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System
• Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
Hadoop provides
• Redundant, fault-tolerant data storage
• Parallel computation framework
• Job coordination
Hadoop Stack
Who uses Hadoop?
• Yahoo!
• Last.fm
• Rackspace
• Digg
• Apache Nutch
• ...
HDFS
• The Hadoop Distributed File System
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
Some Concepts about HDFS
• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files and blocks
• The SecondaryNameNode (SNN) holds a periodic checkpoint of the NN metadata (it is not a hot standby)
• DataNodes (DN) store and serve blocks
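To make the block and replication numbers concrete, here is a small calculation using the defaults quoted above (64 MB blocks, 3 replicas). The helper `hdfs_footprint` is hypothetical, and the 1000 MB file is an arbitrary example.

```python
import math

BLOCK_SIZE_MB = 64  # default block size from the slide (configurable)
REPLICATION = 3     # default replication factor from the slide (configurable)


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies as much space as it actually holds.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)  # 16 48 3000
```

The NameNode has to track every one of those blocks and replicas in its metadata, which is one reason HDFS is intended for large files rather than many small ones.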
Write
Read
If a DataNode fails
• DNs check in with the NN to report health
• Upon failure, the NN orders DNs to replicate under-replicated blocks
Jobs and Tasks in Hadoop
• Job: a user-submitted map and reduce implementation to apply to a data set
• Task: a single mapper or reducer task
– Failed tasks get retried automatically
– Tasks run local to their data, ideally
• JobTracker (JT) manages job submission and task delegation
• TaskTrackers (TT) ask for work and execute tasks
Architecture
How to handle failed tasks?
• JT will retry failed tasks up to N attempts
• After N failed attempts for a task, job fails
• Some tasks are slower than others
• Speculative execution: the JT starts multiple copies of the same task
• The first copy to complete wins; the others are killed
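The retry rule (up to N attempts, then the job fails) can be sketched as a loop. This covers only the retry part, not speculative execution, and is not the JobTracker's actual code; `run_with_retries` and `make_flaky_task` are hypothetical helpers.

```python
def run_with_retries(task, max_attempts):
    """Run task(); retry on failure and give up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception:
            if attempt == max_attempts:
                # After N failed attempts for one task, the whole job fails.
                raise RuntimeError(f"task failed {max_attempts} times; job fails")


def make_flaky_task(failures_before_success):
    """Hypothetical task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise IOError("simulated task failure")
        return "done"

    return task


result, attempts = run_with_retries(make_flaky_task(2), max_attempts=4)
print(result, attempts)  # done 3
```

Retrying is safe here because tasks are shared-nothing: rerunning one has no side effects on the others.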
Data locality
• Move computation to the data
• Moving data between nodes has a cost
• Hadoop tries to schedule tasks on nodes with the data
• When not possible TT has to fetch data from DN
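A simplified sketch of locality-aware scheduling, assuming the scheduler knows which nodes hold a replica of the task's input block and how many free slots each node has. The function `pick_node` and the node names are made up for illustration; Hadoop's real scheduler considers more factors.

```python
def pick_node(block_locations, free_slots):
    """Return (node, is_local) for a task that reads one block.

    block_locations: nodes holding a replica of the block.
    free_slots: mapping of node name to number of free task slots.
    """
    for node in block_locations:
        if free_slots.get(node, 0) > 0:
            return node, True   # data-local: no block transfer needed
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False  # remote: the block must be fetched from a DataNode
    return None, False          # no capacity right now


free = {"node1": 0, "node2": 1, "node3": 2}
local_choice = pick_node(["node1", "node2"], free)
remote_choice = pick_node(["node1"], free)
print(local_choice)   # ('node2', True)
print(remote_choice)  # ('node2', False)
```

Having 3 replicas of each block gives the scheduler up to three candidate nodes for a data-local placement.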
Hadoop execution environment
• Local machine (standalone or pseudo-distributed)
• Virtual machine
• Cloud (e.g. Amazon EC2)
• Own cluster
Demo: word count
• Demo
Homework
• Write a Hadoop program to index the words within the text document dataset
– Example:
• Input:
– Doc1: Hello World!
– Doc2: Hello Java!
• Expected output:
– Hello \t Doc1 Doc2
– World \t Doc1
– Java \t Doc2
• Due: beginning of class on 01/10
• If you have any questions, send email to Jingxuan Li (jli003@cs.fiu.edu)
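To check your understanding of the expected output before writing the Hadoop program, the indexing logic can be simulated locally. This is a plain-Python sketch, not a Hadoop job; treating a word as a run of letters (so that "World!" becomes "World") is an assumption about tokenization that the assignment does not spell out.

```python
import re
from collections import defaultdict


def map_fn(doc_id, text):
    """Emit (word, doc_id) for every word in the document."""
    return [(word, doc_id) for word in re.findall(r"[A-Za-z]+", text)]


def reduce_fn(word, doc_ids):
    """Collapse the doc ids of one word into a sorted, de-duplicated list."""
    return sorted(set(doc_ids))


def build_index(docs):
    groups = defaultdict(list)
    for doc_id, text in docs:
        for word, d in map_fn(doc_id, text):
            groups[word].append(d)
    return {word: reduce_fn(word, ids) for word, ids in groups.items()}


docs = [("Doc1", "Hello World!"), ("Doc2", "Hello Java!")]
index = build_index(docs)
lines = [f"{word}\t{' '.join(ids)}" for word, ids in index.items()]
print("\n".join(lines))
```

In the real job, the document name is not passed to the mapper as a key by default, so obtaining it is part of the exercise.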
Login Info
• Below is the login information for our Hadoop cluster
– Server: datamining-node03.cs.fiu.edu
– U: dbstudent p: ******* (announced during the class)
– To access the working directory in HDFS (do not modify or remove the other directories!): hadoop fs -ls /user/dbstudent
• Input dataset for the homework (everyone will be working on this dataset, so do not modify it!): /user/dbstudent/dataset
• Output directory (including the source code and the indexing results) format: /user/dbstudent/output-PID
What is Hive?
• Data warehousing tool on top of Hadoop
• Originally developed at Facebook
– Now a Hadoop sub-project
• Data warehouse infrastructure
– Execution: MapReduce
– Storage: HDFS files
• Large datasets, e.g. Facebook daily logs
– 30GB (Jan’08), 200GB (Mar’08), 15+TB (2009)
• Hive QL: SQL-like query language
Motivation
• Missing components when using Hadoop MapReduce jobs to process data
– Command-line interface for “end users”
– Ad-hoc query support
– … without writing full MapReduce jobs
– Schema information
Hive Applications
• Log processing
• Text mining
• Document indexing
• Customer-facing business intelligence (e.g., Google Analytics)
• Predictive modeling, hypothesis testing
Hive Components
• Shell: allows interactive queries, like a MySQL shell connected to a database
– Also supports web and JDBC clients
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (M/R, HDFS, or metadata)
• Metastore: schema, location in HDFS
Data Model
• Tables
– Typed columns (int, float, string, date, boolean)
– Also list and map types (for JSON-like data)
• Partitions
– e.g., to range-partition tables by date
• Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
Metastore
• Database: namespace containing a set of Tables
• Holds table definitions (column types, physical layout)
• Partition data
• Uses JPOX ORM for implementation; can be stored in Derby, MySQL, many other relational databases
Physical Layout
• Warehouse directory in HDFS
– e.g., /home/hive/warehouse
• Tables stored in subdirectories of warehouse
– Partitions, buckets form subdirectories of tables
• Actual data stored in flat files
– Control char-delimited text, or SequenceFiles
– With custom SerDe, can use arbitrary format
Useful command examples
• Start Hive: bin/hive
• Show all the tables: SHOW TABLES;
• Create a new table: CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
• Load data into the table: LOAD DATA INPATH 'shakespeare_freq' INTO TABLE shakespeare;
Useful command examples (cont'd)
• Select data: SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;
• Join: INSERT OVERWRITE TABLE merged SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1;
Summary of Hive
• Supports rapid iteration of ad-hoc queries
• Can perform complex joins with minimal code
• Scales to handle much more data than many similar systems
References
• White, T., Hadoop: The definitive guide, 2012
• http://hadoop.apache.org/
• http://hive.apache.org/
• MapReduce tutorial: http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v1.0
• Bill Graham, http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/BillGraham_IntroToHadoop_Aug30.pdf
• Spiros Papadimitriou, Jimeng Sun, and Rong Yan, http://cs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/07-1.pdf
• Cloudera, http://blog.cloudera.com/wp-content/uploads/2010/01/6-IntroToHive.pdf
Exercises
• To be announced