· 2018-10-30 · Run the same PageRank application (in Task 2) on Azure Databricks to compare the...


Prepare data for the next iteration


https://spark.apache.org/sql/

https://spark.apache.org/streaming/

https://spark.apache.org/mllib/

https://spark.apache.org/graphx/

Status of RDD actions being computed

Info about cached RDDs and memory usage

In-depth job info


Resilient Distributed Datasets (RDD)
● Distributed collection of JVM objects
● Functional operators (map, filter, etc.)

DataFrame
● Distributed collection of Row objects
● Expression-based operations
● Fast, efficient internal representations

DataSet
● Internally rows, externally JVM objects
● Type safe and fast
● Slower than DataFrames

[Figure: an RDD operation (e.g. map, filter) applied in parallel to RDD1, RDD2 and RDD3 on Machines A, B and C, producing RDD1’, RDD2’ and RDD3’]

>>> input_RDD = sc.textFile("text.file")
>>> transform_RDD = input_RDD.filter(lambda x: "abcd" in x)
>>> print("Number of lines containing 'abcd': " + str(transform_RDD.count()))
>>> transform_RDD.saveAsTextFile("hdfs:///output")


people.json:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

val df = spark.read.json("people.json")

df.filter($"age" > 20).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

df.filter($"age" > 20).select("name").write.format("parquet").save("output")

Note: Parquet is a column-based storage format for Hadoop. You will need special dependencies to read this file


● map, filter
● reduceByKey, aggregateByKey, groupByKey


● How do we measure influence?
  ○ Intuitively, it should be the node with the most followers

● Influence scores are initialized to 1.0/number of vertices

[Figure: three-node graph, every node starting with a score of 0.333]

● In each iteration of the algorithm, the score of each user is redistributed between the users they are following

[Figure: scores after one iteration: 0.333/2 = 0.167, 0.333 + 0.333/2 = 0.500, and 0.333, with contributions labeled "From Node 0", "From Node 1" (twice) and "From Node 2"]

● Convergence is achieved when the scores of nodes do not change between iterations
● PageRank is guaranteed to converge

[Figure: scores after further iterations: 0.208, 0.396, 0.396]
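
The redistribution step described above can be sketched in a few lines of plain Python. This is a minimal sketch, not the course's reference solution: the three-node edge list is an assumption, chosen so that the arithmetic matches the numbers quoted in the slides, and neither a damping factor nor dangling-vertex handling is applied yet.

```python
# Hypothetical follower graph (an assumption, not given in the slides):
# node 0 follows 2; node 1 follows 0 and 2; node 2 follows 1.
FOLLOWS = {0: [2], 1: [0, 2], 2: [1]}


def iterate(scores, follows):
    """One iteration: each node splits its score equally among the nodes it follows."""
    new_scores = {node: 0.0 for node in follows}
    for node, targets in follows.items():
        share = scores[node] / len(targets)
        for target in targets:
            new_scores[target] += share
    return new_scores


def run(follows, iterations):
    """Initialize every score to 1.0/number of vertices, then iterate."""
    scores = {node: 1.0 / len(follows) for node in follows}
    for _ in range(iterations):
        scores = iterate(scores, follows)
    return scores


if __name__ == "__main__":
    for n in (1, 8):
        result = run(FOLLOWS, n)
        print(n, {node: round(score, 3) for node, score in sorted(result.items())})
```

Under these assumptions the scores for nodes 0, 1 and 2 are 0.167, 0.333 and 0.500 after one iteration, and roughly 0.208, 0.396 and 0.396 after eight. The total stays at 1.0 throughout because this graph has no dangling vertices.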

val links = spark.textFile(...).map(...).persist()
var ranks = ... // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  // Sum contributions by URL and get new ranks
  // (a is the random-jump probability, i.e. a = 1 - d; N is the number of pages)
  ranks = contribs.reduceByKey((x, y) => x + y)
                  .mapValues(sum => a/N + (1-a)*sum)
}

● Dangling or sink vertex
  ○ No outgoing edges
  ○ Redistribute contribution equally among all vertices
● Isolated vertex
  ○ No incoming and outgoing edges
  ○ No isolated nodes in Project 4.1 dataset
● Damping factor d
  ○ Represents the probability that a user clicking on links will continue clicking on them, traveling down an edge
  ○ Use d = 0.85

[Figure: example graph with a dangling vertex and an isolated vertex]
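
The handling of dangling vertices and the damping factor can be illustrated with a small plain-Python sketch. It assumes one common formulation of damped PageRank (the slides' exact formula is not reproduced in this transcript); the function names and the three-vertex graph are hypothetical.

```python
D = 0.85  # damping factor from the slides


def pagerank_step(scores, out_edges, d=D):
    """One iteration with damping; dangling mass is spread equally over all vertices."""
    n = len(scores)
    # Total score held by vertices with no outgoing edges (dangling or isolated).
    dangling_mass = sum(scores[v] for v in scores if not out_edges.get(v))
    incoming = {v: 0.0 for v in scores}
    for v, targets in out_edges.items():
        if targets:
            share = scores[v] / len(targets)
            for t in targets:
                incoming[t] += share
    return {
        v: (1.0 - d) / n + d * (incoming[v] + dangling_mass / n)
        for v in scores
    }


def pagerank(out_edges, vertices, iterations=10, d=D):
    """Start from 1.0/number of vertices and apply the step repeatedly."""
    scores = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        scores = pagerank_step(scores, out_edges, d)
    return scores


if __name__ == "__main__":
    # Hypothetical graph: "c" is dangling (no outgoing edges).
    edges = {"a": ["b", "c"], "b": ["c"], "c": []}
    ranks = pagerank(edges, ["a", "b", "c"])
    print(ranks, sum(ranks.values()))
```

Because the dangling mass is handed back to every vertex, the scores keep summing to 1.0 in each iteration, which is a convenient correctness check.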

● Adjacency matrix: [Figure: adjacency matrix]

● Transition matrix (rows sum to 1): [Figure: transition matrix]

Formula for calculating rank, with d = 0.85:

[Figure: rank formula]

Note: contributions from isolated and dangling vertices are constant in an iteration

Let [Figure: definition of the constant term]

This simplifies the formula to

[Figure: simplified rank formula]
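
The formula images from these slides did not survive in this transcript. As a hedged reference, one common way to write a damped PageRank update that agrees with the notes above is given below. The symbols $N$ (number of vertices), $r_i^{(t)}$ (rank of vertex $i$ at iteration $t$), $\mathrm{out}(j)$ (outgoing edges of $j$), $\mathcal{D}$ (set of dangling and isolated vertices) and $C^{(t)}$ are assumptions introduced for illustration and may differ from the slides' notation.

```latex
% Damped update with dangling mass redistributed equally among all vertices
r_i^{(t+1)} = \frac{1-d}{N}
  + d \left( \sum_{j \to i} \frac{r_j^{(t)}}{|\mathrm{out}(j)|}
  + \frac{1}{N} \sum_{k \in \mathcal{D}} r_k^{(t)} \right),
  \qquad d = 0.85

% The terms that do not depend on i are constant within an iteration
C^{(t)} = \frac{1-d}{N} + \frac{d}{N} \sum_{k \in \mathcal{D}} r_k^{(t)}

% Simplified form: compute C^{(t)} once per iteration, then add it to every vertex
r_i^{(t+1)} = C^{(t)} + d \sum_{j \to i} \frac{r_j^{(t)}}{|\mathrm{out}(j)|}
```

Computing $C^{(t)}$ once per iteration avoids emitting a contribution from every dangling vertex to every other vertex.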

● Run the same PageRank application (in Task 2) on Azure Databricks to compare the differences with Azure HDInsight

● Databricks is an Apache Spark-based analytics platform optimized for Azure

● One-click setup, an interactive workspace, and an optimized Databricks runtime

● Optimized connectors to Azure storage platforms for fast data access

● Software-as-a-Service

● reduceByKey, groupByKey, aggregateByKey

  ○ ./spark/bin/spark-shell
  ○ ./spark/bin/pyspark


https://spark.apache.org/docs/latest/sql-programming-guide.html

https://spark.apache.org/docs/latest/rdd-programming-guide.html

● Ensuring correctness
  ○ Make sure total scores sum to 1.0 in every iteration
  ○ Understand closures in Spark
    ■ Do not do something like this:
        val data = Array(1,2,3,4,5)
        var counter = 0
        var rdd = sc.parallelize(data)
        rdd.foreach(x => counter += x)
        println("Counter value: " + counter)
  ○ Graph representation
    ■ Adjacency lists use less memory than matrices
  ○ More detailed walkthroughs and sample calculations can be found here:
    https://s3.amazonaws.com/15619public/webcontent/pagerank_examples.pdf
● Optimization
  ○ Eliminate repeated calculations
  ○ Use the Spark Web UI
    ■ Monitor your instances to make sure they are fully utilized
    ■ Identify bottlenecks
  ○ Understand RDD manipulations
    ■ Actions vs transformations
    ■ Lazy transformations
  ○ Explore parameter tuning to optimize resource usage
  ○ Be careful with repartition on your RDDs
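
The closure pitfall in the hints above arises because, on a cluster, each task works on its own deserialized copy of the variables captured by the closure, so updates never reach the driver's `counter`. The plain-Python sketch below only imitates that behavior with `pickle`. It does not use Spark, and `simulate_foreach` and the partition layout are hypothetical. It then shows the usual fix of returning per-partition results and combining them, which in Spark corresponds to an action such as `reduce` or to an accumulator.

```python
import pickle

data = [1, 2, 3, 4, 5]
partitions = [data[:2], data[2:]]  # hypothetical split across two executors


def simulate_foreach(parts, state, update):
    """Imitate shipping a closure: every task mutates a serialized copy of `state`."""
    for part in parts:
        task_state = pickle.loads(pickle.dumps(state))  # executor-side copy
        for x in part:
            update(task_state, x)
        # task_state is discarded here; nothing is sent back to the driver


def add_to_counter(state, x):
    state["counter"] += x


# Broken pattern: the driver-side counter is never updated.
driver_state = {"counter": 0}
simulate_foreach(partitions, driver_state, add_to_counter)
broken_total = driver_state["counter"]

# Working pattern: each task returns a partial sum and the driver combines them.
correct_total = sum(sum(part) for part in partitions)

if __name__ == "__main__":
    print("Counter value (closure):", broken_total)
    print("Counter value (reduce):", correct_total)
```

In this simulation the closure-based counter stays at 0 while the combined partial sums give 15.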

TWITTER DATA ANALYTICS: TEAM PROJECT

Team Project
● Phase 1
  ○ Q1
  ○ Q2 (MySQL AND HBase)
● Phase 2
  ○ Q1
  ○ Q2 & Q3 (MySQL AND HBase)
● Phase 3
  ○ Q1
  ○ Q2 & Q3 (MySQL OR HBase)

Team Project Deadlines
● Writeup and queries were released on Monday, October 29th, 2018.
● Phase 2 milestones:
  ○ Q2:
    ■ Q2 on scoreboard, due on Sunday, 11/11
  ○ Phase 2, Live test:
    ■ Q1, Q2 and Q3, on Sunday, 11/11
  ○ Phase 2, code and report:
    ■ due on Tuesday, 11/13

Query 3, Definitions

● time_start, time_end: in Unix time / Epoch time format, e.g. time_start=1480000000

● uid_start, uid_end: marks the user search boundary, e.g. uid_end=492600000

● n1: the maximum number of topic words that should be included in the response

● n2: the maximum number of tweets that should be included in the response


Query 3: Effective Word Count (EWC)

EWC:
● one or more consecutive alphanumeric characters (A through Z, a through z, 0 through 9) with zero or more ' and/or - characters.

Query 3 is su-per-b! I'mmmm lovin' it! ⇐ 6 EWC

Don’t forget to remove the short URL and stop words before calculation
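
One way to read the EWC definition is sketched below in Python. Several details are assumptions rather than the project's official rules: the URL pattern, the interpretation of a word as a run of letters, digits, `'` and `-` containing at least one letter or digit, and the caller-supplied `stop_words` set (the official stop-word list is not part of this transcript).

```python
import re

# Assumption: short URLs are matched with a generic http(s) pattern;
# the project's official URL pattern is not shown in the slides.
URL_RE = re.compile(r"https?://\S+")
# A word: letters, digits, ' and -, containing at least one letter or digit.
WORD_RE = re.compile(r"[A-Za-z0-9'-]*[A-Za-z0-9][A-Za-z0-9'-]*")


def effective_words(text, stop_words=frozenset()):
    """Return the words that count toward EWC after URL and stop-word removal."""
    without_urls = URL_RE.sub(" ", text)
    words = WORD_RE.findall(without_urls)
    return [w for w in words if w.lower() not in stop_words]


def ewc(text, stop_words=frozenset()):
    """Effective word count of a tweet text."""
    return len(effective_words(text, stop_words))


if __name__ == "__main__":
    sample = "Query 3 is su-per-b! I'mmmm lovin' it!"
    print(effective_words(sample))
```

With an empty stop-word set the sample sentence yields seven candidate words (`Query`, `3`, `is`, `su-per-b`, `I'mmmm`, `lovin'`, `it`); the final EWC depends on which of them appear in the official stop-word list.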


Query 3, Impact Score

Impact Score = EWC*(favorite_count+retweet_count+followers_count)

Consider negative impact_score as 0.
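
The impact score rule translates directly into a short Python function. This is a minimal sketch; the parameter names mirror the formula above, and the sample numbers are hypothetical.

```python
def impact_score(ewc, favorite_count, retweet_count, followers_count):
    """EWC * (favorite_count + retweet_count + followers_count), with negatives treated as 0."""
    score = ewc * (favorite_count + retweet_count + followers_count)
    return max(score, 0)


if __name__ == "__main__":
    print(impact_score(6, 10, 5, 100))  # hypothetical tweet
    print(impact_score(4, 0, 0, -3))    # hypothetical malformed record
```

The first call gives 6 * 115 = 690; the second would be negative and is therefore clamped to 0.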


Query 3, Topic Words

Topic words:
● After filtering short URLs
● Exclude stop words
● Before censoring
● Case insensitive (lower case)

TF-IDF:
● TF: term frequency of a topic word w
● IDF: ln(Total number of tweets in range / Number of tweets with w in it)

Query 3, Topic Score

Topic Score = sum(x * ln(y + 1)) (i from 1 to n)

n: The total number of tweets in the given time and uid range

x: TF-IDF score of word w in tweet Ti

y: The impact score of Ti
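
The topic score can be computed as in the Python sketch below. It is a sketch under stated assumptions: the slides do not define TF precisely, so TF is taken here as the number of occurrences of w in a tweet divided by the number of words in that tweet, and the function name and input layout are hypothetical.

```python
import math


def topic_scores(tweets):
    """Compute the topic score of every word.

    tweets: list of (words, impact_score) pairs, already restricted to the
    requested time and uid range. `words` is the lower-cased word list of one
    tweet after URL and stop-word removal.
    Assumption: TF of w in a tweet = occurrences of w / number of words in it.
    """
    n = len(tweets)
    # Number of tweets containing each word, needed for the IDF term.
    doc_freq = {}
    for words, _ in tweets:
        for w in set(words):
            doc_freq[w] = doc_freq.get(w, 0) + 1

    scores = {}
    for words, impact in tweets:
        if not words:
            continue
        for w in set(words):
            tf = words.count(w) / len(words)
            idf = math.log(n / doc_freq[w])
            # x = TF-IDF of w in this tweet, y = impact score of this tweet
            scores[w] = scores.get(w, 0.0) + tf * idf * math.log(impact + 1)
    return scores


if __name__ == "__main__":
    sample = [(["cloud", "cloud", "spark"], 9), (["spark"], 0)]
    print(topic_scores(sample))
```

In the sample, `spark` appears in every tweet, so its IDF is ln(1) = 0 and its topic score is 0, while `cloud` scores (2/3) * ln(2) * ln(10).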


Query 3 Example

word1:score1\tword2:score2...\twordn1:scoren1
Impactscore1\ttid1\ttext1
…..

Example:
channel:2270.04 amp:1586.31 new:1166.24 just:1153.70 love:1063.31 like:1015.71 good:937.63
26200650 461159182406672384 I just buyed the comedy album of my bestest friend in the entire world @briangaar. https://t.co/hwDB4veaYG #RacesAsToad...

Don’t forget to censor the tweets

Warning!!! Any Hadoop Cluster

For any Hadoop cluster on AWS, Azure or GCP
● Don’t open ports to the public, except
  ○ Ports: 22, 80, 25, 443, or 465
● Follow the HBase Primer and use an SSH tunnel to access your YARN UI

Note:
● There will be a report due at the end of each phase, where you are expected to discuss optimizations
● WARNING: Check your AWS instance limits on the new account (should be > 10 instances)



Questions?
