COSC282 BIG DATA ANALYTICS FALL 2015people.cs.georgetown.edu/~huiyang/cosc-282/intro.pdf ·...

27
BIG DATA ANALYTICS FALL 2015 COSC282 1


Page 1: COSC282 BIG DATA ANALYTICS FALL 2015people.cs.georgetown.edu/~huiyang/cosc-282/intro.pdf · 2015-09-02 · WHAT BRINGS ME HERE • To have a wonderful semester with you :-) • As

BIG DATA ANALYTICS FALL 2015

COSC282

1


WHAT IS BIG DATA

to you?

2


3

source:http://semanticommunity.info/@api/deki/files/18406/

WhatIsaPetabyte-RobertAmes.png


QUICK INTRODUCTION

4


A QUICK INTRODUCTION OF ME

http://infosense.cs.georgetown.edu/

5


WHAT BRINGS ME HERE

• To have a wonderful semester with you :-)

• As an educator, I want to teach a class that is timely and useful, helping you in the job market

• For myself, to be honest, I am not a big fan of huge data volume. However, big data does not just mean bigger volume; it also means higher data variety and a faster rate of data change (velocity)

• I am a fan of complexity. ;-)

6

Three Vs of Big Data


BIG DATA RELATES TO EVERYONE

WHAT BRINGS ME HERE

7


COURSE PURPOSE

• be able to code and design large-scale data analytics tools

• master Spark programming

• understand how a web search engine works

• focus on text analytics, but the techniques we will learn are generic

• have fun!

8


LET’S START TO THINK

9


IF YOU ARE THE GOD OF DATA

• What are the typical uses of your data?

• understand trends and patterns

• prediction

• search

10


HOW GOOGLE WORKS

11


IF YOU ARE THE GOD OF DATA

• What will be the challenges/problems when your data is big?

• What is your solution?

• divide-and-conquer

• parallelization

• compression
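
The first two ideas above can be sketched on a single machine. The example below is only an illustration under stated assumptions: the object name ChunkedSum, the function parallelSum, and the chunk size are hypothetical and do not come from the course material. It splits the input into chunks (divide), sums each chunk in its own Future (parallelize), and adds the partial sums together (conquer).

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical single-machine sketch of divide-and-conquer plus parallelization.
object ChunkedSum {
  // Divide: cut the data into chunks of at most chunkSize elements.
  // Parallelize: sum every chunk in its own Future on the global thread pool.
  // Conquer: wait for all partial sums and add them up.
  def parallelSum(data: Seq[Int], chunkSize: Int): Long = {
    val partials: Seq[Future[Long]] =
      data.grouped(chunkSize).toSeq.map(chunk => Future(chunk.foldLeft(0L)(_ + _)))
    Await.result(Future.sequence(partials), 30.seconds).sum
  }

  def main(args: Array[String]): Unit = {
    println(parallelSum(1 to 10000, 1000)) // prints 50005000
  }
}
```

On a real cluster the chunks would live on different machines, which is the part that frameworks such as MapReduce and Spark take care of.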

12


MAPREDUCE
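
As a rough illustration of the programming model, here is a word count written with plain Scala collections. It is a single-machine sketch, not Hadoop code: the names WordCountSketch, mapPhase, and reducePhase are hypothetical, and the grouping step merely stands in for the shuffle that a real MapReduce system performs across machines.

```scala
// Hypothetical single-machine sketch of the map and reduce phases.
object WordCountSketch {
  // Map phase: emit a (word, 1) pair for every word in a line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle and reduce phase: group the pairs by word, then sum the counts per word.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(lines.flatMap(mapPhase))

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("big data is big", "data is everywhere"))
    println(counts("big"))        // prints 2
    println(counts("everywhere")) // prints 1
  }
}
```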

13


MAPREDUCE IS A LITTLE BIT OUTDATED

• It is great at one-pass computation

• but not efficient enough for multiple-pass algorithms

• things that require repeated hashing or other operations

• states go to file systems

• a lot of I/Os

• slow

14


SPARK

• Key idea:

• Load things into memory

• Resilient Distributed Datasets (RDDs)

• Clean APIs in Java, Scala, Python, R

• not for C++

• We will learn Scala
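
A small sketch of the in-memory idea, meant to be pasted at the scala> prompt of spark-shell, where sc (the SparkContext) is already defined. The variable names are made up for illustration. The point is that an RDD marked with cache() can be reused by a second pass without being rebuilt.

```scala
// Assumes spark-shell, where `sc` is predefined. Variable names are illustrative only.
val nums = sc.parallelize(1 to 10000).cache() // mark the RDD to be kept in memory once computed

// First pass: count the even numbers. This action also materializes the cached RDD.
val evens = nums.filter(_ % 2 == 0).count()

// Second pass: reuses the in-memory partitions instead of rebuilding the RDD.
val doubledSum = nums.map(_ * 2).reduce(_ + _)
```

Here evens is 5000 and doubledSum is 100010000. On data this small the cache makes no visible difference; it matters for multi-pass algorithms over large datasets.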

15


COURSE PLAN

• Key topics

• Spark

• Web search engine

• September - Spark Essentials

• October - Text Processing and Basic Search Engine

• November - PageRank and Web search

• December - Other Apps (Recommender Systems, Dynamic Search, Social Search)

16


“A homework is worth a thousand lectures.”

– GRACE HUI YANG

17


ASSIGNMENTS

• Build Google’s PageRank algorithm over Wikipedia

• A big project broken into small pieces

• (nearly) weekly

• 10 + 1 of them

• (almost) all due on a Wednesday at 11:59 PM

18


19


HIGHLIGHT OF TODAY

• Install Spark

• First Spark Program

20


SPARK INSTALLATION

• Note: Please do not install/run Spark using:

• Homebrew on Mac OS X

• Cygwin on Windows

21


STEP 1: INSTALL JAVA JDK 6/7 ON MAC OS X OR WINDOWS

• oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

• follow the license agreement instructions

• then click the download for your OS

• need JDK instead of JRE (for Maven, etc.)

22


STEP 2: GET SPARK

• We will use Spark 1.1.0

• 1. copy from the USB sticks

• 2. change (cd) into the newly created directory

• or you could download from spark.apache.org/downloads.html

23


STEP 3: RUN SPARK SHELL

• we’ll run Spark’s interactive shell…

• within the “spark” directory, run:

• ./bin/spark-shell

• then from the “scala>” prompt,

• let’s create some data…

• val data = 1 to 10000

24


STEP 4: CREATE AN RDD

• create an RDD based on that data…

• val distData = sc.parallelize(data)

• then use a filter to select values less than 20…

• distData.filter(_ < 20).collect()
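
To sanity-check what the collect() call should return, the same predicate can be applied to a local Scala Range, with no SparkContext involved. This is only a local stand-in for the RDD version; the object name FilterCheck and the helper lessThan are hypothetical.

```scala
// Hypothetical local check: same data and predicate as the slide, but on a Range instead of an RDD.
object FilterCheck {
  val data = 1 to 10000

  // Keep only the values strictly below the given limit.
  def lessThan(limit: Int): Seq[Int] = data.filter(_ < limit)

  def main(args: Array[String]): Unit = {
    // Prints the integers 1 through 19, separated by commas.
    println(lessThan(20).mkString(", "))
  }
}
```

The Spark version should return the same 19 values, as an Array[Int] gathered from the partitions onto the driver.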

25


ASSIGNMENT 1

• Use a filter to select values less than your age

• Submit screen captures of the results of the above programs

26


SUMMARY

• big data

• Spark

• search engine

• syllabus

• installation of Spark

• assignment 1

• due next Wednesday

27