HDInsight on Azure and Map-Reduce Richard Conway Windows Azure MVP Elastacloud Limited.
· 2018-10-30 · Run the same PageRank application (in Task 2) on Azure Databricks to compare the...
Prepare data for the next iteration
Status of RDD actions being computed
Info about cached RDDs and memory usage
In-depth job info
Resilient Distributed Datasets (RDD)
● Distributed collection of JVM objects
● Functional operators (map, filter, etc.)
DataFrame
● Distributed collection of Row objects
● Expression-based operations
● Fast, efficient internal representations
DataSet
● Internally rows, externally JVM objects
● Type safe and fast
● Slower than DataFrames
[Figure: an RDD operation (e.g. map, filter) turns partitions RDD1, RDD2 and RDD3, spread across Machines A, B and C, into RDD1', RDD2' and RDD3']
>>> input_RDD = sc.textFile("text.file")
>>> transform_RDD = input_RDD.filter(lambda x: "abcd" in x)
>>> print('Number of "abcd": ' + str(transform_RDD.count()))
>>> transform_RDD.saveAsTextFile("hdfs:///output")
people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
val df = spark.read.json("people.json")
df.filter($"age" > 20).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
df.filter($"age" > 20).select("name").write.format("parquet").save("output")
Note: Parquet is a column-based storage format for Hadoop. You will need special dependencies to read this file
Task Points Description Language
● map, filter
● reduceByKey, aggregateByKey, groupByKey
● How do we measure influence?
○ Intuitively, it should be the node with the most followers
● Influence scores are initialized to 1.0/number of vertices
[Figure: three-node graph, every node starts with a score of 0.333]
● In each iteration of the algorithm, scores of each user are redistributed between the users they are following
[Figure: first iteration. One node receives 0.333/2 = 0.167 (from Node 1), one receives 0.333 + 0.333/2 = 0.500 (from Node 1 and Node 0), and one receives 0.333 (from Node 2)]
● Convergence is achieved when the scores of nodes do not change between iterations
● PageRank is guaranteed to converge
[Figure: scores shown on the final slide: 0.208, 0.396, 0.396]
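The redistribution step can be sketched in plain Python, without Spark. The edge list below is an assumption inferred from the "From Node ..." labels in the figure (Node 0 follows Node 2, Node 1 follows Nodes 0 and 2, Node 2 follows Node 1); the function name `redistribute` is also illustrative. No damping or dangling-node handling is applied here.

```python
# Assumed edges: u -> v means user u follows user v (inferred from the figure labels).
follows = {0: [2], 1: [0, 2], 2: [1]}

def redistribute(scores, follows):
    """One iteration: each user splits its score equally among the users it follows."""
    new_scores = {node: 0.0 for node in scores}
    for user, followed in follows.items():
        share = scores[user] / len(followed)
        for target in followed:
            new_scores[target] += share
    return new_scores

n = len(follows)
scores = {node: 1.0 / n for node in follows}  # initialize to 1.0 / number of vertices
scores = redistribute(scores, follows)
print({node: round(s, 3) for node, s in sorted(scores.items())})
# prints {0: 0.167, 1: 0.333, 2: 0.5}
```

The scores still sum to 1.0 after the step, which is a quick sanity check worth running after every iteration.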
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x, y) => x + y)
    .mapValues(sum => a/N + (1-a)*sum)
}
● Dangling or sink vertex
○ No outgoing edges
○ Redistribute contribution equally among all vertices
● Isolated vertex
○ No incoming and outgoing edges
○ No isolated nodes in Project 4.1 dataset
● Damping factor d
○ Represents the probability that a user clicking on links will continue clicking on them, traveling down an edge
○ Use d = 0.85
[Figure: a dangling vertex and an isolated vertex]
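A minimal plain-Python sketch of how the dangling-vertex rule and the damping factor might fit together. The function name `pagerank`, the fixed iteration count and the sample graph are illustrative assumptions; the project writeup may define the update differently.

```python
def pagerank(out_links, d=0.85, iterations=10):
    """Damped PageRank; the score of dangling vertices is spread equally over all vertices."""
    vertices = set(out_links) | {v for targets in out_links.values() for v in targets}
    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Total score currently held by vertices with no outgoing edges.
        dangling = sum(ranks[v] for v in vertices if not out_links.get(v))
        contribs = {v: 0.0 for v in vertices}
        for u, targets in out_links.items():
            for v in targets:
                contribs[v] += ranks[u] / len(targets)
        ranks = {v: (1 - d) / n + d * (contribs[v] + dangling / n) for v in vertices}
    return ranks

# Hypothetical graph: vertex 3 is dangling (no outgoing edges).
graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
ranks = pagerank(graph)
total = sum(ranks.values())  # stays at 1.0 (up to rounding) in every iteration
```

Because the dangling mass is handed back to every vertex, no score leaks out of the graph and the total remains 1.0.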
● Adjacency matrix:
● Transition matrix: (rows sum to 1)
Formula for calculating rank
d = 0.85
[equation omitted in transcript]
Note: contributions from isolated and dangling vertices are constant in an iteration
Let [definition omitted in transcript]
This simplifies the formula to [equation omitted in transcript]
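One common way to write a damped rank update that matches the notes above (dangling contributions shared equally, d = 0.85) is sketched below. The symbols are assumptions rather than the slide's own notation: $N$ is the number of vertices, $\mathrm{out}(u)$ the out-degree of $u$, $D$ the set of vertices with no outgoing edges, and $r_t$ the ranks at iteration $t$.

```latex
r_{t+1}(v) = \frac{1-d}{N}
  + d \left( \sum_{u \to v} \frac{r_t(u)}{\mathrm{out}(u)}
  + \sum_{w \in D} \frac{r_t(w)}{N} \right)

% The second sum does not depend on v, so it can be computed once per iteration:
\delta_t = \sum_{w \in D} \frac{r_t(w)}{N}

r_{t+1}(v) = \frac{1-d}{N} + d\,\delta_t
  + d \sum_{u \to v} \frac{r_t(u)}{\mathrm{out}(u)}
```

Summing the last line over all $v$ gives $(1-d) + d \sum_{w \in D} r_t(w) + d \sum_{u \notin D} r_t(u) = 1$, so the total score is preserved.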
● Run the same PageRank application (in Task 2) on Azure Databricks to compare the differences with Azure HDInsight
● Databricks is an Apache Spark-based analytics platform optimized for Azure
● One-click setup, an interactive workspace, and an optimized Databricks runtime
● Optimized connectors to Azure storage platforms for fast data access
● Software-as-a-Service
● reduceByKey, groupByKey, aggregateByKey
○ ./spark/bin/spark-shell
○ ./spark/bin/pyspark
● Ensuring correctness
○ Make sure total scores sum to 1.0 in every iteration
○ Understand closures in Spark
■ Do not do something like this
val data = Array(1,2,3,4,5)
var counter = 0
var rdd = sc.parallelize(data)
rdd.foreach(x => counter += x)
println("Counter value: " + counter)
○ Graph representation
■ Adjacency lists use less memory than matrices
○ More detailed walkthroughs and sample calculations can be found here
● Optimization
○ Eliminate repeated calculations
○ Use the Spark Web UI
■ Monitor your instances to make sure they are fully utilized
■ Identify bottlenecks
○ Understand RDD manipulations
■ Actions vs transformations
■ Lazy transformations
○ Explore parameter tuning to optimize resource usage
○ Be careful with repartition on your RDDs
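The closure pitfall from the counter example can be mimicked without a cluster. The sketch below is a plain-Python simulation, not Spark code: `run_on_executors` is a hypothetical helper, and the deep copy stands in for Spark serializing the closure and shipping a separate copy to each executor.

```python
import copy

def run_on_executors(partitions, task, driver_state):
    """Mimic cluster execution: every partition works on its own copy of the captured state."""
    for part in partitions:
        executor_state = copy.deepcopy(driver_state)  # stands in for closure serialization
        for x in part:
            task(executor_state, x)

def add_to_counter(state, x):
    state["counter"] += x

data = [1, 2, 3, 4, 5]
partitions = [data[:2], data[2:]]

state = {"counter": 0}
run_on_executors(partitions, add_to_counter, state)
broken_total = state["counter"]  # the driver's copy was never updated

# Safer pattern: compute per-partition results and combine them on the driver,
# which is the role an action such as reduce() plays in Spark.
correct_total = sum(sum(part) for part in partitions)
print(broken_total, correct_total)  # prints 0 15
```

In real Spark code the same effect is achieved by replacing the `foreach` with an action that returns a value, or by using an accumulator.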
TWITTER DATA ANALYTICS: TEAM PROJECT
Team Project
● Phase 1:
○ Q1
○ Q2 (MySQL AND HBase)
● Phase 2
○ Q1
○ Q2 & Q3 (MySQL AND HBase)
● Phase 3
○ Q1
○ Q2 & Q3 (MySQL OR HBase)
Team Project Deadlines
● Writeup and queries were released on Monday, October 29th, 2018.
● Phase 2 milestones:
○ Q2:
■ Q2 on scoreboard, due on Sunday, 11/11
○ Phase 2, Live test:
■ Q1, Q2 and Q3, on Sunday, 11/11
○ Phase 2, code and report:
■ due on Tuesday, 11/13
Query 3, Definitions
● time_start, time_end: in Unix time / Epoch time format, e.g. time_start=1480000000
● uid_start, uid_end: marks the user search boundary, e.g. uid_end=492600000
● n1: the maximum number of topic words that should be included in the response
● n2: the maximum number of tweets that should be included in the response
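A small sketch of reading these six parameters from a request. The path `/q3`, the parameter order and most of the values are hypothetical; only the parameter names come from the definitions above.

```python
from urllib.parse import urlparse, parse_qs

PARAM_NAMES = ["time_start", "time_end", "uid_start", "uid_end", "n1", "n2"]

def parse_q3(url):
    """Return the six Query 3 parameters as integers."""
    params = parse_qs(urlparse(url).query)
    return {name: int(params[name][0]) for name in PARAM_NAMES}

# Hypothetical request string.
request = ("/q3?uid_start=492500000&uid_end=492600000"
           "&time_start=1480000000&time_end=1481000000&n1=10&n2=8")
q = parse_q3(request)
```

A missing parameter raises `KeyError` here; a real service would probably want to return an explicit error response instead.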
Query 3: Effective Word Count (EWC)
EWC:
● one or more consecutive alphanumeric characters (A through Z, a through z, 0 through 9) with zero or more ' or/and - characters.
Query 3 is su-per-b! I'mmmm lovin' it! ⇐ 6 EWC
Don’t forget to remove the short URL and stop words before calculation
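A minimal sketch of counting effective words under the definition above. The stop-word set and the URL pattern are placeholders (assumptions); the real stop-word list and short-URL rule come from the project writeup, and the resulting count depends on them.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9'-]+")    # runs of alphanumerics, ' and -
SHORT_URL = re.compile(r"https?://\S+")  # assumption: treat any http(s) link as a short URL
STOP_WORDS = {"is", "the", "a"}          # hypothetical; use the list from the writeup

def effective_word_count(text, stop_words=STOP_WORDS):
    """Count tokens that contain at least one alphanumeric and are not stop words."""
    text = SHORT_URL.sub(" ", text)
    count = 0
    for token in TOKEN.findall(text):
        if not any(ch.isalnum() for ch in token):
            continue  # tokens made only of ' or - are not words
        if token.lower() in stop_words:
            continue
        count += 1
    return count

sample = "Cloud computing is fun, it's 100% re-usable https://t.co/abc123"
sample_count = effective_word_count(sample)
```

With the placeholder list, the sample keeps Cloud, computing, fun, it's, 100 and re-usable, while "is" and the link are dropped.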
Query 3, Impact Score
Impact Score = EWC*(favorite_count+retweet_count+followers_count)
Consider negative impact_score as 0.
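The impact score rule, including the clamp for negative values, can be sketched as follows; the function and argument names simply mirror the fields in the formula.

```python
def impact_score(ewc, favorite_count, retweet_count, followers_count):
    """EWC * (favorite_count + retweet_count + followers_count), never below 0."""
    score = ewc * (favorite_count + retweet_count + followers_count)
    return max(score, 0)

example = impact_score(6, 2, 3, 100)   # 6 * 105 = 630
clamped = impact_score(6, 0, 0, -5)    # a negative product is treated as 0
```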
Query 3, Topic Words
Topic words:
● After filtering short urls
● Exclude stop words
● Before censor
● Case insensitive (lower case)
TF-IDF:
● TF: term frequency of a topic word w
● IDF: ln(Total number of tweets in range / Number of tweets with w in it)
Query 3, Topic Score
Topic Score = sum(x * ln(y + 1)) (i from 1 to n)
n: The total number of tweets in the given time and uid range
x: TF-IDF score of word w in tweet Ti
y: The impact score of Ti
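A sketch of the topic score computation, under one assumption that the slides leave open: TF is taken here as the number of occurrences of w in a tweet divided by that tweet's word count. The function name `topic_scores` and the input shape are illustrative.

```python
import math
from collections import Counter

def topic_scores(tweets):
    """tweets: list of (words, impact_score) pairs already filtered to the time/uid range."""
    n = len(tweets)
    doc_freq = Counter()
    for words, _ in tweets:
        doc_freq.update(set(words))  # number of tweets containing each word
    scores = Counter()
    for words, impact in tweets:
        for w, c in Counter(words).items():
            tf = c / len(words)                # assumption: TF normalized by tweet length
            idf = math.log(n / doc_freq[w])    # ln(tweets in range / tweets containing w)
            scores[w] += tf * idf * math.log(impact + 1)
    return scores

# Hypothetical input: two tweets, already lower-cased with stop words removed.
tweets = [(["cloud", "spark"], math.e - 1), (["cloud"], 0)]
result = topic_scores(tweets)
```

A word that appears in every tweet in the range gets an IDF of ln(1) = 0, so "cloud" scores 0 in this example while "spark" scores 0.5 * ln(2).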
Query 3 Example
word1:score1\tword2:score2...\twordn1:scoren1
Impactscore1\ttid1\ttext1…..
Example:
channel:2270.04 amp:1586.31 new:1166.24 just:1153.70 love:1063.31 like:1015.71 good:937.63
26200650 461159182406672384 I just buyed the comedy album of my bestest friend in the entire world @briangaar. https://t.co/hwDB4veaYG #RacesAsToad...
Don’t forget to censor the tweets
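The response layout above can be assembled with a small helper. This is a sketch: the function name is hypothetical, the newline between the topic-word line and the tweet lines is an assumption, and tweet texts are assumed to be censored already.

```python
def format_response(topic_words, tweets):
    """topic_words: (word, score) pairs; tweets: (impact_score, tid, censored_text) triples."""
    first_line = "\t".join(f"{word}:{score:.2f}" for word, score in topic_words)
    tweet_lines = [f"{impact}\t{tid}\t{text}" for impact, tid, text in tweets]
    return "\n".join([first_line] + tweet_lines)

body = format_response(
    [("channel", 2270.04), ("amp", 1586.31)],
    [(26200650, 461159182406672384, "hello")],  # hypothetical tweet text
)
```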
Warning!!! Any Hadoop Cluster
For any Hadoop cluster on AWS, Azure or GCP
● Don’t open ports to the public, except
○ Ports: 22, 80, 25, 443, or 465
● Follow the HBase Primer and use an SSH tunnel to communicate with your YARN UI
Note:
● There will be a report due at the end of each phase, where you are expected to discuss optimizations
● WARNING: Check your AWS instance limits on the new account (should be > 10 instances)
Phase (and query due) Start Deadline Code and Report Due
Questions?