Jump Start into Apache® Spark™ and Databricks
Transcript of Jump Start into Apache® Spark™ and Databricks
About Me: Denny Lee
Technology Evangelist, Databricks (working with Spark since v0.5)
Formerly:
• Senior Director of Data Sciences Engineering at Concur (now part of SAP)
• Principal Program Manager at Microsoft
Hands-on data engineer and architect for more than 15 years, developing internet-scale infrastructure for both on-premises and cloud, including Bing's Audience Insights, Yahoo's 24TB SSAS cube, and the Isotope incubation team (HDInsight).
We are Databricks, the company behind Spark.
• Founded by the creators of Apache Spark in 2013
• Contributed 75% of the Spark code added in 2014
• Created Databricks on top of Spark to make big data simple
Apache Spark Engine
• Spark Core
• Standard libraries: Spark SQL, Spark Streaming, MLlib, GraphX, …
• Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, and R APIs
Open Source Ecosystem
Large-Scale Usage
• Largest cluster: 8,000 nodes (Tencent)
• Largest single job: 1 PB (Alibaba, Databricks)
• Top streaming intake: 1 TB/hour (HHMI Janelia Farm)
• 2014 on-disk sort record: fastest open source engine for sorting a PB
Notable Users
Source: Slide 5 of Spark Community Update
Companies That Presented at Spark Summit 2015 in San Francisco
Quick Start
Quick Start Using Python | Quick Start Using Scala
Quick Start with Python
textFile = sc.textFile("/mnt/tardis6/docs/README.md")
textFile.count()
Quick Start with Scala
val textFile = sc.textFile("/mnt/tardis6/docs/README.md")
textFile.count()
RDDs
• RDDs have actions, which return values, and transformations, which return pointers to new RDDs.
• Transformations are lazy and are executed only when an action is run.
• Transformations: map(), flatMap(), filter(), mapPartitions(), mapPartitionsWithIndex(), sample(), union(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), pipe(), coalesce(), repartition(), partitionBy(), ...
• Actions: reduce(), collect(), count(), first(), take(), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByKey(), foreach(), ...
• RDDs can be persisted (cached) in memory or on disk.
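The lazy-transformation model above can be sketched without a Spark cluster using Python's own lazy generators (a hypothetical stand-in for the RDD API, not Spark itself): nothing is computed until the final list() call plays the role of an action.

```python
data = ["a b", "b c", "c d"]

# Like flatMap(): lazily split each line into words
words = (w for line in data for w in line.split())

# Like filter(): still lazy, no work has happened yet
filtered = (w for w in words if w != "c")

# Like an action (e.g. collect()): forces the whole pipeline to run
result = list(filtered)
```

In Spark the same shape would be `sc.parallelize(data).flatMap(...).filter(...).collect()`, with all the work deferred to the action in exactly the same way.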
Spark API Performance
History of Spark APIs
RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)

DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast, efficient internal representations

Dataset (2015)
• Internally rows, externally JVM objects
• "Best of both worlds": type safe + fast
Benefit of Logical Plan:Performance Parity Across Languages
[Chart: runtime for an example aggregation workload (secs), comparing RDD (Java/Scala, Python) against DataFrame (Java/Scala, Python, R, SQL)]
[Chart: Spark 1.3.1, 1.4, and 1.5 runtimes for 9 queries (Q1–Q9) on the NYC Taxi Dataset; runs labeled 1.5RunA1, 1.5RunA2, 1.5RunB1, 1.5RunB2, 1.4RunA1, 1.4RunA2]
Dataset API in Spark 1.6
Typed interface over DataFrames / Tungsten
case class Person(name: String, age: Long)
val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]

ds.filter(p => p.name.startsWith("M"))
  .toDF().groupBy($"name").avg("age")
Dataset
An "Encoder" converts a JVM object into a Dataset row.
Check out [SPARK-9999].

JVM Object → encoder → Dataset Row
Tungsten Execution
[Diagram: SQL, Python, R, Streaming, and Advanced Analytics sit on top of DataFrame (& Dataset), which runs on Tungsten execution.]
Ad Tech Example
AdTech Sample Notebook (Part 1)
Create External Table with RegEx
CREATE EXTERNAL TABLE accesslog (
  ipaddress STRING,
  ...
)
ROW FORMAT
  SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \\"(\\S+) (\\S+) (\\S+)\\" (\\d{3}) (\\d+) \\"(.*)\\" \\"(.*)\\" (\\S+) \\"(\\S+), (\\S+), (\\S+), (\\S+)\\"'
)
LOCATION "/mnt/mdl/accesslogs/"
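The SerDe regex above is dense; a simplified Python sketch of the same idea shows how the capture groups map to fields. Note this pattern covers only the core access-log fields, not the extra forwarded-IP captures in the slide's full regex, and the sample line is made-up data:

```python
import re

# Simplified access-log pattern:
# ip, identd, user, [timestamp], "method path proto", status, bytes
LOG_RE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(\S+) (\S+) (\S+)" (\d{3}) (\d+)'
)

line = ('203.0.113.5 - - [01/Aug/2015:12:00:00 +0000] '
        '"GET /index.html HTTP/1.1" 200 1024')
m = LOG_RE.match(line)
ipaddress, status = m.group(1), m.group(8)
```

Each capture group becomes one column of the external table when the same pattern is used as `input.regex`.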
External Web Service Call via Mapper
import urllib2

# Obtain the unique IP addresses from the accesslog table
ipaddresses = sqlContext.sql("select distinct ip1 from \
  accesslog where ip1 is not null").rdd

# getCCA2: obtains the two-letter country code based on IP address
def getCCA2(ip):
  url = 'http://freegeoip.net/csv/' + ip
  response = urllib2.urlopen(url).read()
  return response.split(",")[1]

# Map the distinct IP addresses to two-letter country codes
mappedIPs = ipaddresses.map(lambda x: (x[0], getCCA2(x[0])))
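The distinct-then-map pattern above (deduplicate first, then look up each IP once) can be sketched offline; the lookup table here is hypothetical stand-in data for the freegeoip web service:

```python
# Hypothetical stand-in for the freegeoip country-code lookup
geo_lookup = {"203.0.113.5": "AU", "198.51.100.7": "US"}

ips = ["203.0.113.5", "198.51.100.7", "203.0.113.5", None]

# Deduplicate and drop nulls first, so each IP costs exactly one
# lookup (one web-service call in the Spark version)
distinct_ips = sorted(ip for ip in set(ips) if ip is not None)
mapped_ips = [(ip, geo_lookup[ip]) for ip in distinct_ips]
```

Running `select distinct` before the map is what keeps the number of external calls proportional to unique IPs rather than total log lines.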
Join DataFrames and Register Temp Table
# Join countrycodes with mappedIPsDF so we can have IP address and
# three-letter ISO country codes
mappedIP3 = mappedIP2 \
  .join(countryCodesDF, mappedIP2.cca2 == countryCodesDF.cca2, "left_outer") \
  .select(mappedIP2.ip, mappedIP2.cca2, countryCodesDF.cca3, countryCodesDF.cn)

# Register the mapping table
mappedIP3.registerTempTable("mappedIP3")
Add Columns to DataFrames with UDFs
from user_agents import parse
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Create a UDF to extract Browser Family information
def browserFamily(ua_string):
  return xstr(parse(xstr(ua_string)).browser.family)
udfBrowserFamily = udf(browserFamily, StringType())

# Obtain the unique agents from the accesslog table
userAgentTbl = sqlContext.sql("select distinct agent from accesslog")

# Add a browserFamily column to the UserAgentInfo DataFrame
userAgentInfo = userAgentTbl.withColumn('browserFamily', \
  udfBrowserFamily(userAgentTbl.agent))
Use Python UDFs with Spark SQL
# Define function (converts Apache web log time)
def weblog2Time(weblog_timestr):
  ...

# Define and register the UDF
udfWeblog2Time = udf(weblog2Time, DateType())
sqlContext.registerFunction("udfWeblog2Time", lambda x: weblog2Time(x))

# Create DataFrame
accessLogsPrime = sqlContext.sql("select hash(a.ip1, a.agent) as UserId, \
  m.cca3, udfWeblog2Time(a.datetime), ...")
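The body of weblog2Time is elided on the slide; a plausible sketch, assuming the standard Apache access-log timestamp format (e.g. 01/Aug/2015:23:59:59 +0000), would be:

```python
from datetime import datetime

def weblog2time(weblog_timestr):
    # Parse an Apache access-log timestamp like "01/Aug/2015:23:59:59 +0000"
    # (format is an assumption; the slide does not show the body)
    return datetime.strptime(weblog_timestr, "%d/%b/%Y:%H:%M:%S %z")

dt = weblog2time("01/Aug/2015:23:59:59 +0000")
```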
References
Spark DataFrames: Simple and Fast Analysis on Structured Data [Michael Armbrust]
Apache Spark 1.6 presented by Databricks co-founder Patrick Wendell
Announcing Spark 1.6
Introducing Spark Datasets
Spark SQL Data Sources API: Unified Data Access for the Spark Platform
Join us at Spark Summit East, February 16-18, 2016 | New York City
Thanks!
Appendix
Spark Survey 2015 Highlights
Top 3 Apache Spark takeaways:
• Spark adoption is growing rapidly
• Spark use is growing beyond Hadoop
• Spark is increasing access to big data