CS-495/595 Hive, Lecture #9, Dr. Chuck Cartledge


Miscellanea Assignment #3 The Book Chapter 12 Break Project Conclusion References

CS-495/595 Hive

Lecture #9

Dr. Chuck Cartledge

18 Mar. 2015


Table of contents I

1 Miscellanea

2 Assignment #3

3 The Book

4 Chapter 12

5 Break

6 Project

7 Conclusion

8 References


Corrections and additions since last lecture.

Assignment #3 due in a few hours.


Pay attention to the assignment details.

Things like:

Getting the average of the correct procedure based on the numeric code

Grouping the practitioners by type

Getting the average for the state based on the numeric code

Addressing those geographic areas that aren’t in your cartographic file

If appropriate, a “heat scale”
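The checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the procedure codes, the amounts, and the set of mapped areas are all made up, and the real assignment data will look different.

```python
from collections import defaultdict

# Hypothetical billing records: (state, provider_type, procedure_code, amount).
# The layout, codes, and values are invented for illustration.
RECORDS = [
    ("VA", "Cardiology", "99213", 80.0),
    ("VA", "Cardiology", "99213", 100.0),
    ("VA", "Family Practice", "99213", 60.0),
    ("NC", "Cardiology", "99213", 90.0),
    ("PR", "Cardiology", "99213", 70.0),
    ("VA", "Cardiology", "99214", 150.0),  # different code, must be ignored
]

# Hypothetical set of areas present in the cartographic file.
MAP_AREAS = {"VA", "NC"}


def mean_by_key(records, code, key_fn):
    """Average the amounts for one procedure code, grouped by key_fn(record)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for rec in records:
        if rec[2] != code:  # codes are compared as strings
            continue
        bucket = totals[key_fn(rec)]
        bucket[0] += rec[3]
        bucket[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


state_avg = mean_by_key(RECORDS, "99213", lambda r: r[0])
type_avg = mean_by_key(RECORDS, "99213", lambda r: r[1])

# Areas that have data but no shape in the cartographic file.
unmapped = sorted(set(state_avg) - MAP_AREAS)

print(state_avg)
print(type_avg)
print(unmapped)
```

Filtering on the code before averaging, and listing the unmapped areas explicitly, keeps the two most common mistakes visible instead of silently dropping rows.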


Hadoop, The Definitive Guide

Version 3 is specified in the syllabus [5]

Version 4 came out in November 2015

We’ll use Version 3 as much as possible


Overview

Where to get it.

“. . . was created to make it possible for analysts with SQL skills . . . to run queries on huge data . . . ”

Installable much like Pig

Image from [3].

Available from “https://hive.apache.org/downloads.html”


Overview

How to get it running.

Assuming that you have Hadoop up and running

There are “logical” conflicts between Hadoop and Hive


Overview

Different interfaces

Like so many in the Hadoop ecosystem.

A command line interface (CLI, our old friend)

Web interface (not sure if this works on our cluster)

JDBC and ODBC

All feed a compiler, an optimizer, and an executor.


Overview

Simple things like creating tables

Tables can be created from external data files

MapReduce underlies everything, so rows and fields are user definable

Tables can be partitioned for optimization
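Why partitioning helps can be sketched as follows. Hive stores each partition of a table in its own directory named column=value, so a query that filters on the partition column only has to read the matching directory. The Python below imitates that layout in memory; the table name, column, and rows are hypothetical, and /user/hive/warehouse is assumed to be the warehouse directory (Hive's default).

```python
from collections import defaultdict

WAREHOUSE = "/user/hive/warehouse"  # assumed: Hive's default warehouse directory


def partition_dir(table, column, value):
    """Directory for one partition, using Hive's column=value naming."""
    return f"{WAREHOUSE}/{table}/{column}={value}"


def load(table, column, rows):
    """Group rows into per-partition 'files', keyed by directory name."""
    files = defaultdict(list)
    for row in rows:
        files[partition_dir(table, column, row[column])].append(row)
    return files


def scan(files, table, column, value):
    """Read only the partition matching the filter.

    Returns the rows and the number of partition directories touched.
    """
    wanted = partition_dir(table, column, value)
    touched = [d for d in files if d == wanted]
    rows = [r for d in touched for r in files[d]]
    return rows, len(touched)


# Hypothetical table name and rows.
rows = [
    {"state": "VA", "amount": 10.0},
    {"state": "NC", "amount": 20.0},
    {"state": "VA", "amount": 30.0},
]
files = load("payments", "state", rows)
va_rows, touched = scan(files, "payments", "state", "VA")
print(sorted(files))
print(len(va_rows), touched)
```

With an unpartitioned table the same filter would have to read every file; here only one of the two directories is touched.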


Overview

Blow up the same image.


Overview

More information about architecture

Focusing on where data is stored.

Actual data is stored either locally or in the HDFS

Metadata is stored in the metastore used by Hive

The metastore is a Derby database by default (data about the hive is stored in an RDBMS)

Hive tables are stored on the HDFS under /user/hive/warehouse

Image from [5].


Overview

Comparison with Traditional Databases.

Schema enforcement

Traditional: enforced at load time (schema on write)

Hive: enforced at query time (schema on read)

Updates, transactions, and indexes

Traditional: these are mainstays

Hive: limited, since HDFS does not support in-place updates to files
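The difference between the two schema styles can be sketched in Python. This is an illustration, not how either system is implemented: the three-column schema and the sample lines are hypothetical. The one behavior borrowed from Hive is that a field which does not fit the declared type reads back as NULL (None here) instead of failing the load.

```python
RAW_LINES = ["1,alice,34", "2,bob,unknown", "3,carol,29"]
SCHEMA = (("id", int), ("name", str), ("age", int))  # hypothetical schema


def parse_strict(line):
    """Convert every field; raise ValueError if any field does not fit."""
    parts = line.split(",")
    if len(parts) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields: {line!r}")
    return {name: cast(text) for (name, cast), text in zip(SCHEMA, parts)}


def load_schema_on_write(lines):
    """Traditional style: validate at load time and reject bad rows."""
    table, rejected = [], []
    for line in lines:
        try:
            table.append(parse_strict(line))
        except ValueError:
            rejected.append(line)
    return table, rejected


def query_schema_on_read(lines):
    """Hive style: the load kept the raw lines; the schema is applied now.

    A field that does not fit the schema becomes None.
    """
    for line in lines:
        parts = line.split(",")
        row = {}
        for i, (name, cast) in enumerate(SCHEMA):
            try:
                row[name] = cast(parts[i])
            except (ValueError, IndexError):
                row[name] = None
        yield row


table, rejected = load_schema_on_write(RAW_LINES)
read_rows = list(query_schema_on_read(RAW_LINES))
print(len(table), rejected)
print(read_rows[1])
```

Schema on write pays the validation cost once at load time; schema on read makes loading a plain file copy and defers the cost, and any surprises, to each query.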

Image from [1].


Overview

Same image


Overview

HiveQL has its roots in MySQL

Relatively small differences in syntax between HiveQL and MySQL

Limitations on VIEWs, indexes, and updates (workaround is to create new tables)

HiveQL supports “partitioning” the table (column or row) to better fit the HDFS and MapReduce paradigm

HiveQL does not support updating existing records
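The "create a new table" workaround can be sketched in Python with immutable tuples standing in for stored rows. The table contents and the helper name rewrite_table are hypothetical; in HiveQL the rough equivalent would be building a new table from a query over the old one.

```python
def rewrite_table(rows, predicate, change):
    """Return a new table in which rows matching predicate are replaced
    by change(row). The original table is left untouched."""
    return tuple(change(r) if predicate(r) else r for r in rows)


# Hypothetical table: tuples of (id, state, amount).
payments_v1 = ((1, "VA", 10.0), (2, "NC", 20.0), (3, "VA", 30.0))

# "Update" row 2 by deriving payments_v2 from payments_v1.
payments_v2 = rewrite_table(
    payments_v1,
    predicate=lambda r: r[0] == 2,
    change=lambda r: (r[0], r[1], 25.0),
)
print(payments_v1[1], payments_v2[1])
```

The cost is a full rewrite of the table for even a one-row change, which is why this style suits bulk, batch-oriented workloads rather than frequent small edits.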

Image from [2].

HiveQL has grown “organically” to meet the needs of its users.


Overview

Same image


Break time.

Take about 10 minutes.


Additional details

Techniques and tools to solve the problem

Database “heavy lifting”: MapReduce, Pig Latin, Hive

Display and analysis: custom code (language du jour), Excel, some cartographic package


Additional details

Class membership

Undergrads
1 CAMPBELL, CHRISTOPHER G.
2 CRUZ, JOSHUA T.
3 DAVIS, RANDALL A.
4 JIANG, MING H.
5 PHELPS, NATHAN A.
6 ZHANG, HEMIAO

Graduates
1 ALFURAYJ, HAIFA S.
2 ARAB, MARYAM
3 BETHU, ANVESH
4 DASARI, VICTOR PRABHU
5 GARNER, KEVIN M.
6 HAVANUR, SRINIVAS J.
7 LAMBI, ROHIT D.
8 PATEL, PRIYANK A.
9 POTINENI, BHAVYATEJA
10 SADANA, PRANEET
11 SAJJAN, PRASANNA KUMAR BASAVARAJ

So many people.


Additional details

Team membership

Undergrads
1 Campbell and Cruz
2 Davis and Phelps
3 Jiang and Zhang

Graduates
1 Alfurayj and Arab
2 Bethu and Dasari
3 Garner, Havanur and Sajjan
4 Lambi and Sadana
5 Patel and Potineni


Additional details

Scatter graphs

We don’t have a priori knowledge of any relationship between Medicare billings and pharmaceutical payments. So create a scatter graph of the data.

Just plot one value versus the other and look

Look for any apparent relationship

A relationship may not exist, may not be linear, and may not be monotonic
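The three cases (linear, monotonic but not linear, not monotonic) can be illustrated numerically. The sketch below uses two standard statistics that the lecture does not name: Pearson correlation, which measures linear association, and Spearman rank correlation, which measures monotonic association. The data are synthetic stand-ins, not Medicare or pharmaceutical figures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: strength of a linear relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: strength of a monotonic relationship."""
    return pearson(ranks(xs), ranks(ys))


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]
cubic = [x ** 3 for x in xs]    # monotonic but not linear
u_shape = [x ** 2 for x in xs]  # neither linear nor monotonic

for name, ys in (("linear", linear), ("cubic", cubic), ("u_shape", u_shape)):
    print(name, round(pearson(xs, ys), 3), round(spearman(xs, ys), 3))
```

The U-shaped data has both statistics near zero even though y is completely determined by x, which is the reason to look at the scatter graph before trusting any single number.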


Additional details

Examples of scatter graphs

Different types of relationships:

None (shotgun)

Positive (strong positive)

Negative (strong negative)

Independent (or low)

Independent and non-monotonic (or low)

Spurious

Image from [4].

The scatter graph will be our guide for computations.


Additional details

Presentation

Short (on the order of 5 minutes)

Address the 6 questions

PowerPoint or other format

Still need the “standard” PDF submission


What have we covered?

Gave some “final” hints/directions for assignment #3

Talked about Hive, HiveQL, origins, strengths, and weaknesses

Talked about the project

Next lecture: Discussion of current real-world applications of Big Data


References I

[1] Amr A. Awadallah, Schema-on-read vs schema-on-write, http://www.slideshare.net/awadallah/schemaonread-vs-schemaonwrite, 2014.

[2] Marc Holmes, Hive for SQL users, http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/, 2013.

[3] Jasper Pei Lee, Hive a SQL-like wrapper over Hadoop, https://jasperpeilee.wordpress.com/2011/11/22/hive-a-sql-like-wrapper-over-hadoop/, 2011.

[4] Bioscience Staff, Numbers numerical methods for bioscience students, http://web.anglia.ac.uk/numbers/graphsCharts.html.

[5] Tom White, Hadoop: The Definitive Guide, 3rd edition, O’Reilly Media, Inc., 2012.
