
Post on 17-Jun-2020


CS 626 Large Scale Data Science

Jun Zhang
Department of Computer Science
University of Kentucky

Based on materials prepared by Dr. Licong Cui

Lecture 1 – Introduction

1

Outline

Course Logistics

Student Introduction

Introduction to Big Data

2

Course Logistics

• Class hours: TR 12:30 pm - 1:45 pm
• Class location: F. Paul Anderson Tower Room 255
• Office hours: MW 9:00 am - 10:00 am
• Course documents: http://www.cs.uky.edu/~jzhang/CS626/cs626.html
  o Syllabus
  o Files
    - Slides
    - Homework and Project Assignments

3

Course Description

• Data => Actionable information
• Big Data Techniques
  – Hadoop/MapReduce
  – HBase
  – Hive
  – Pig
  – Spark

• Real-world data science problems
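As a rough illustration of the MapReduce idea named in the list above, the following is a minimal single-machine sketch of a word count in plain Python. It does not use the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and only mimic the map, shuffle, and reduce steps that a real framework would distribute across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every whitespace-separated word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    """Run the three phases in sequence on an in-memory list of documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "Big models"]))
```

In a real cluster the map and reduce calls would run in parallel on different machines; here they run sequentially only to show the data flow.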

4

Prerequisites and Expected Background

• Algorithm design and analysis
• Database systems (e.g., MySQL)
• Programming languages
  – Java (preferred)
  – Python
• Linux basics (e.g., ssh, scp)
• Your own computer requirements:
  – 64-bit OS
  – 10+ GB RAM

5

Alternative Hardware Systems

• Use CS Department’s OpenStack cluster
• Contact Mr. Jarad Downing (jarad@cs.uky.edu) to obtain an account and learn the requirements
• The Cloudera system has been installed on OpenStack
• More information about OpenStack is at: https://www.cs.uky.edu/docs/users/openstack.html

6

What Do You Need for the OpenStack Cluster?

• You need to connect to the UK campus via VPN; see: https://www.cs.uky.edu/docs/users/vpn.html
• You need to install NoMachine, which can be found here: https://www.nomachine.com
• You need to use your UK ID address and the credentials (cloudera/cloudera) to connect.

7

Textbook (Optional)

• Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (4th Edition)
• Author: Tom White
• ISBN-13: 978-1491901632
• ISBN-10: 1491901632

8

Grading Criteria

• Homework/Programming assignments (40%)
• Paper presentation (20%)
• Project (30%)
  – Project team: each team consists of up to 3 members
  – Clear statement of contribution for each team member
  – Deliverables: mid-project report (5%), live demos (5%), and final project report (20%)
• Attendance and participation (10%)
  – Attendance: 5%
  – Participation: 5% (participating in class discussions)

9

Grading Scale

85 – 100% = A
75 – 84% = B
60 – 74% = C
< 60% = E
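The grading criteria and scale can be combined into a small worked calculation. The sketch below is only an illustration: it assumes each component is scored on a 0-100 scale, and it assumes a letter band starts at its lower bound (so 84.5% falls in the B band), which the slides do not state. The names `WEIGHTS`, `final_percentage`, and `letter_grade` are hypothetical.

```python
# Component weights in percentage points, taken from the grading criteria.
WEIGHTS = {
    "homework": 40,
    "presentation": 20,
    "project": 30,
    "attendance_participation": 10,
}


def final_percentage(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(percentage):
    """Map a final percentage to a letter, assuming bands start at their lower bound."""
    if percentage >= 85:
        return "A"
    if percentage >= 75:
        return "B"
    if percentage >= 60:
        return "C"
    return "E"


if __name__ == "__main__":
    example = {
        "homework": 90,
        "presentation": 80,
        "project": 85,
        "attendance_participation": 100,
    }
    total = final_percentage(example)
    print(total, letter_grade(total))
```

For the example scores, the weighted total is (40·90 + 20·80 + 30·85 + 10·100) / 100 = 87.5, which falls in the A band.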

10

Course Policies

• Academic Integrity
  – Independently complete homework/programming assignments.
  – Proper acknowledgement is required if you borrow ideas or content from other sources.
• Submission Policy
  – See each assignment for deadlines.
  – Late submissions will not be accepted.

11

Course Policies

• Attendance Policy
  – In order to meet federal regulations, the instructor will monitor student participation in this class through attendance or assignments. Students whose attendance or participation cannot be determined at least once during the first three weeks of the semester may be dropped from the course.

12

Course Policies

• Attendance Policy
  – University policy: students are expected to withdraw from the class if more than 20% of the classes scheduled for the semester are missed (excused or unexcused)
• Excused Absences
  – http://www.uky.edu/Ombud/

13

Student Introduction

14

Introduction to Big Data

Why Big Data?
  o What launches the Big Data era?
  o What makes Big Data valuable?
Characteristics of Big Data

15

What launches the Big Data era?

Retail
  2 billion products sold in 2014

Social media
  204 million emails/min
  1.8 million likes, 200,000 photos/min
  278,000 tweets/min
  40,000 queries/sec, 3.5 billion/day

Healthcare
  Samaritan Medical Center, Watertown, NY: 120 TB as of 2013
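The per-second and per-day query figures quoted above can be cross-checked with simple unit conversion. The sketch below is a small arithmetic check, assuming the rates are constant over a 24-hour day; the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
MINUTES_PER_DAY = 24 * 60       # 1,440


def per_day_from_per_second(rate):
    """Convert an events-per-second rate to events per day."""
    return rate * SECONDS_PER_DAY


def per_day_from_per_minute(rate):
    """Convert an events-per-minute rate to events per day."""
    return rate * MINUTES_PER_DAY


if __name__ == "__main__":
    # 40,000 queries/sec is about 3.46 billion/day, consistent with "3.5 billion/day".
    print(per_day_from_per_second(40_000))
    # 278,000 tweets/min is about 400 million tweets/day.
    print(per_day_from_per_minute(278_000))
```

The first conversion gives 40,000 × 86,400 = 3,456,000,000, so the two query figures on the slide agree with each other after rounding.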

16

What Makes Big Data Valuable?

Big Data => Better Models => Higher Precision

17

Example: Recommendation Engines

18

Example: Using Big Data to Help Patients

Big Data for precision medicine
  o Personalized healthcare
  o Predict/Prevent disease
Data sources
  o Genome
  o Sensors
  o Electronic Health Record (EHR)
  o People

19

Genome Data

200 GB/genome

20

Sensor Data

21

Electronic Health Record (EHR)

22

People-generated Data - Fitness Device Data

2-5 GB/day

23

How Can Big Data Help?

Integration
  o Genome Data
  o Sensor Data
  o Electronic Health Records
  o People-generated Data

24

How Can Big Data Help?

Integration => Personalization => Precision

25

Basic principles for big data integration

• Create a common understanding of data definition

• Develop a set of data services to qualify the data and make it consistent and ultimately trustworthy

• Set up a streamlined way to integrate your big data sources and system of record
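The three principles above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source field names (`PatientID`, `HR`, `pid`, `bpm`), the shared field names, and the plausibility range for heart rate are all hypothetical and do not come from any real system.

```python
def normalize_ehr(row):
    """Common data definition: map a hypothetical EHR row onto the shared fields."""
    return {
        "patient_id": str(row["PatientID"]).strip(),
        "source": "ehr",
        "heart_rate_bpm": float(row["HR"]),
    }


def normalize_sensor(row):
    """Common data definition: map a hypothetical sensor reading onto the shared fields."""
    return {
        "patient_id": str(row["pid"]).strip(),
        "source": "sensor",
        "heart_rate_bpm": float(row["bpm"]),
    }


def is_trustworthy(record):
    """Data service: keep records with an ID and a plausible heart rate (assumed range)."""
    return bool(record["patient_id"]) and 20 <= record["heart_rate_bpm"] <= 250


def integrate(ehr_rows, sensor_rows):
    """Streamlined integration: merge qualified records into one store keyed by patient."""
    system_of_record = {}
    normalized = [normalize_ehr(r) for r in ehr_rows]
    normalized += [normalize_sensor(r) for r in sensor_rows]
    for record in normalized:
        if is_trustworthy(record):
            system_of_record.setdefault(record["patient_id"], []).append(record)
    return system_of_record


if __name__ == "__main__":
    ehr = [{"PatientID": " 17 ", "HR": "72"}]
    sensors = [{"pid": 17, "bpm": 75}, {"pid": 17, "bpm": 900}]
    print(integrate(ehr, sensors))
```

In the usage example, both sources end up under the same patient key "17", and the implausible reading of 900 bpm is rejected by the quality check.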

26

Characteristics of Big Data – 6V’s

• Volume
• Variety
• Velocity
• Veracity
• Valence
• Value

27

Volume of big data

• The amount of data
• Facebook has 250 billion images and 2.5 trillion posts (2016)
• The amount of data is ever increasing
• How to store the data
• How to process the data

28

Variety of big data

• Ever-increasing different forms of data
• Photographs, sensor data, tweets, encrypted packages
• Traditional data tables
• E-mail messages, with attachments
• Photos, videos and audio recordings

29

Velocity of big data

• The speed at which big data is created, stored, and/or analyzed
• Facebook users upload 900 million photos every day
• Packet analysis for cybersecurity
• Search engine queries
• Internet of Things

30

Veracity of big data

• Quality and trustworthiness of data
• Accuracy, preciseness, reliability
• Any bias, noise, or abnormality in data?
• Falsification?
• No good data, no good results

31

Valence of big data

• Connectedness of big data in the form of graphs
• Data bond with each other
• Forming connections between disparate data
• Positive valence and negative valence
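One possible way to put a number on connectedness, assumed here for illustration and not stated on the slide, is the ratio of actual connections between data items to the number of connections that could exist (the density of an undirected graph). The function name `valence` and the example items are hypothetical.

```python
def valence(num_items, connections):
    """Ratio of distinct undirected connections to all possible pairs of items.

    This density-style measure is an assumption used for illustration only.
    """
    if num_items < 2:
        return 0.0
    # Treat (a, b) and (b, a) as the same connection and ignore self-loops.
    unique = {frozenset(pair) for pair in connections if pair[0] != pair[1]}
    possible = num_items * (num_items - 1) // 2
    return len(unique) / possible


if __name__ == "__main__":
    # Four items, two distinct connections out of six possible pairs.
    links = [("genome", "ehr"), ("ehr", "genome"), ("ehr", "sensor")]
    print(valence(4, links))
```

A value near 0 means the items are mostly isolated, and a value of 1 means every item is connected to every other item.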

32

Value of big data

• The ability to convert big data information into a monetary reward
• The final goal of big data
• Data mining?
• Decision and results

33