Transformations and actions a visual guide training
-
Upload
spark-summit -
Category
Data & Analytics
-
view
190 -
download
2
Transcript of Transformations and actions a visual guide training
![Page 1: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/1.jpg)
TRANSFORMATIONS AND ACTIONS A Visual Guide of the
APIhttp://training.databricks.com/visualapi.pdf
![Page 2: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/2.jpg)
Databricks would like to give a special thanks to Jeff Thomspon for contributing 67 visual diagrams depicting the Spark API under the MIT license to the Spark community.
Jeff’s original, creative work can be found here and you can read more about Jeff’s project in his blog post.
After talking to Jeff, Databricks commissioned Adam Breindel to further evolve Jeff’s work into the diagrams you see in this deck.
Blog: data-frack
![Page 3: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/3.jpg)
making big data simple
Databricks Cloud:“A unified platform for building Big Data pipelines – from ETL to Exploration and Dashboards, to Advanced Analytics and Data Products.”
• Founded in late 2013
• by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 Million in 2 rounds
• ~55 employees
• We’re hiring!
• Level 2/3 support partnerships with
• Hortonworks
• MapR
• DataStax
(http://databricks.workable.com)
![Page 4: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/4.jpg)
key
RDD Elements
original item
transformed type
object on driver
RDD
partition(s)
A
B
user functions
user input
input
emitted value
Legend
![Page 5: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/5.jpg)
Randomized operation
Legend
Set Theory / Relational operation
Numeric calculation
![Page 6: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/6.jpg)
Operations =
TRANSFORMATIONS
ACTIONS
+
![Page 7: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/7.jpg)
• map• filter• flatMap• mapPartitions• mapPartitionsWithIndex• groupBy• sortBy
= medium
Essential Core & Intermediate Spark Operations
TR
AN
SFO
RM
AT
ION
SA
CT
IO
NS
General • sample
• randomSplit
Math / Statistical
= easy
Set Theory / Relational • union• intersection• subtract• distinct• cartesian• zip
• takeOrdered
Data Structure / I/O
• saveAsTextFile• saveAsSequenceFile• saveAsObjectFile• saveAsHadoopDataset• saveAsHadoopFile• saveAsNewAPIHadoopDataset• saveAsNewAPIHadoopFile
• keyBy• zipWithIndex• zipWithUniqueID• zipPartitions• coalesce• repartition• repartitionAndSortWithinPartitions
• pipe
• count• takeSample• max• min• sum• histogram• mean• variance• stdev• sampleVariance• countApprox• countApproxDistinct
• reduce• collect• aggregate• fold• first• take• forEach• top• treeAggregate• treeReduce• forEachPartition• collectAsMap
![Page 8: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/8.jpg)
= medium
Essential Core & Intermediate PairRDD Operations
TR
AN
SFO
RM
AT
ION
SA
CT
IO
NS
General • sampleByKey
Math / Statistical
= easy
Set Theory / Relational
Data Structure
• keys• values
• partitionBy
• countByKey• countByValue• countByValueApprox• countApproxDistinctByKey
• countApproxDistinctByKey
• countByKeyApprox• sampleByKeyExact
• cogroup (=groupWith)• join• subtractByKey• fullOuterJoin• leftOuterJoin• rightOuterJoin
• flatMapValues• groupByKey• reduceByKey• reduceByKeyLocally• foldByKey• aggregateByKey• sortByKey• combineByKey
![Page 9: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/9.jpg)
vs
narrow
wide
each partition of the parent RDD is used by at most one partition of the child RDD
multiple child RDD partitions may depend on a single parent RDD partition
![Page 10: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/10.jpg)
“One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track lineage across a wide range of transformations.”
“The most interesting question in designing this interface is how to represent dependencies between RDDs.”
“We found it both sufficient and useful to classify dependencies into two types: • narrow dependencies, where each partition of the parent RDD
is used by at most one partition of the child RDD• wide dependencies, where multiple child partitions may
depend on it.”
LINEAGE
![Page 11: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/11.jpg)
narrow
wide
each partition of the parent RDD is used by at most one partition of the child RDD
multiple child RDD partitions may depend on a single parent RDD partition
map, filter
union
join w/ inputs co-partitioned
groupByKey
join w/ inputs not co-partitioned
![Page 12: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/12.jpg)
TRANSFORMATIONS Core Operations
![Page 13: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/13.jpg)
MAP
3 items in RDD
(partitions not
shown)
RDD: x
![Page 14: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/14.jpg)
MAP
User function applied item by item
RDD: x RDD: y
![Page 15: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/15.jpg)
MAPRDD: x RDD: y
![Page 16: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/16.jpg)
MAPRDD: x RDD: y
![Page 17: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/17.jpg)
MAPRDD: x RDD: y
After map() has been applied…
before after
![Page 18: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/18.jpg)
MAPRDD: x RDD: y
Return a new RDD by applying a function to each element of this RDD.
![Page 19: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/19.jpg)
MAP
x = sc.parallelize(["b", "a", "c"]) y = x.map(lambda z: (z, 1))print(x.collect())print(y.collect())
['b', 'a', 'c']
[('b', 1), ('a', 1), ('c', 1)]
RDD: x
RDD: y
x:
y:
map(f, preservesPartitioning=False)
Return a new RDD by applying a function to each element of this RDD
val x = sc.parallelize(Array("b", "a", "c"))val y = x.map(z => (z,1))println(x.collect().mkString(", "))println(y.collect().mkString(", "))
![Page 20: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/20.jpg)
FILTER
3 items in RDD
(partitions not
shown)
RDD: x
![Page 21: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/21.jpg)
FILTER
Apply user function: keep item if function returns true
RDD: x RDD: y
emitsTrue
![Page 22: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/22.jpg)
FILTERRDD: x RDD: y
emitsFalse
![Page 23: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/23.jpg)
FILTERRDD: x RDD: y
emitsTrue
![Page 24: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/24.jpg)
FILTERRDD: x RDD: y
After filter() has been applied…
before after
![Page 25: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/25.jpg)
FILTER
x = sc.parallelize([1,2,3])y = x.filter(lambda x: x%2 == 1) #keep odd valuesprint(x.collect())print(y.collect())
[1, 2, 3]
[1, 3]
RDD: x
RDD: y
x:
y:
filter(f)
Return a new RDD containing only the elements that satisfy a predicate
val x = sc.parallelize(Array(1,2,3))val y = x.filter(n => n%2 == 1)println(x.collect().mkString(", "))println(y.collect().mkString(", "))
![Page 26: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/26.jpg)
FLATMAP
3 items in RDD
(partitions not
shown)
RDD: x
![Page 27: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/27.jpg)
FLATMAPRDD: x RDD: y
![Page 28: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/28.jpg)
FLATMAPRDD: x RDD: y
![Page 29: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/29.jpg)
FLATMAPRDD: x RDD: y
![Page 30: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/30.jpg)
FLATMAPRDD: x RDD: y
After flatmap() has been applied…
before after
![Page 31: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/31.jpg)
FLATMAPRDD: x RDD: y
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results
![Page 32: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/32.jpg)
FLATMAP
x = sc.parallelize([1,2,3]) y = x.flatMap(lambda x: (x, x*100, 42))print(x.collect())print(y.collect())
[1, 2, 3]
[1, 100, 42, 2, 200, 42, 3, 300, 42]
x:
y:
RDD: x
RDD: y
flatMap(f, preservesPartitioning=False)
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results
val x = sc.parallelize(Array(1,2,3))val y = x.flatMap(n => Array(n, n*100, 42))println(x.collect().mkString(", "))println(y.collect().mkString(", "))
![Page 33: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/33.jpg)
GROUPBY
4 items in RDD
(partitions not
shown)
RDD: x
James
Anna
Fred
John
![Page 34: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/34.jpg)
GROUPBYRDD: x
James
Anna
Fred
John
emits‘J’
J [ “John” ]
RDD: y
![Page 35: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/35.jpg)
F [ “Fred” ]
GROUPBYRDD: x
James
Anna
Fred
emits‘F’
J [ “John” ]John
RDD: y
![Page 36: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/36.jpg)
[ “Fred” ]
GROUPBYRDD: x
James
Anna
emits‘A’J [ “John” ]
A [ “Anna” ]
Fred
John
F
RDD: y
![Page 37: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/37.jpg)
[ “Fred” ]
GROUPBYRDD: x
James
Anna
emits‘J’J [ “John”, “James”
]
[ “Anna” ]
Fred
John
F
A
RDD: y
![Page 38: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/38.jpg)
GROUPBY
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])y = x.groupBy(lambda w: w[0])print [(k, list(v)) for (k, v) in y.collect()]
['John', 'Fred', 'Anna', 'James']
[('A',['Anna']),('J',['John','James']),('F',['Fred'])]
RDD: x
RDD: y
x:
y:
groupBy(f, numPartitions=None)
Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.
val x = sc.parallelize(Array("John", "Fred", "Anna", "James"))
val y = x.groupBy(w => w.charAt(0))println(y.collect().mkString(", "))
![Page 39: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/39.jpg)
GROUPBYKEY
5 items in RDD
(partitions not
shown)
Pair RDD: x
B
B
A
A
A
5
4
3
2
1
![Page 40: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/40.jpg)
GROUPBYKEYPair RDD: x
5
4
3
2
1
RDD: y
A [ 2 , 3 , 1 ]
B
B
A
A
A
![Page 41: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/41.jpg)
GROUPBYKEYPair RDD: x
RDD: y
B [ 5 , 4 ]
A [ 2 , 3 , 1 ]
5
4
3
2
1
B
B
A
A
A
![Page 42: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/42.jpg)
GROUPBYKEY
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])y = x.groupByKey()print(x.collect())print(list((j[0], list(j[1])) for j in y.collect()))
[('B', 5),('B', 4),('A', 3),('A', 2),('A', 1)]
[('A', [2, 3, 1]),('B',[5, 4])]
RDD: x
RDD: y
x:
y:
groupByKey(numPartitions=None)
Group the values for each key in the original RDD. Create a new pair where the original key corresponds to this collected group of values.
val x = sc.parallelize( Array(('B',5),('B',4),('A',3),('A',2),('A',1)))val y = x.groupByKey()println(x.collect().mkString(", "))println(y.collect().mkString(", "))
![Page 43: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/43.jpg)
MAPPARTITIONSRDD: x RDD: y
partitions
A
B
A
B
![Page 44: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/44.jpg)
REDUCEBYKEY VS GROUPBYKEY
val words = Array("one", "two", "two", "three", "three", "three")val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD .reduceByKey(_ + _) .collect()
val wordCountsWithGroup = wordPairsRDD .groupByKey() .map(t => (t._1, t._2.sum)) .collect()
![Page 45: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/45.jpg)
REDUCEBYKEY
(a, 1)(b, 1)
(a, 1)(b, 1)
(a, 1)(a, 1) (a, 2)
(b, 2)(b, 1)(b, 1)
(a, 1)(a, 1)
(a, 3)(b, 2)
(a, 1)(b, 1)(b, 1)
(a, 1)(a, 2) (a, 6)
(a, 3)
(b, 1)(b, 2) (b, 5)
(b, 2)
a b
![Page 46: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/46.jpg)
GROUPBYKEY
(a, 1)(b, 1)
(a, 1)(a, 1)(b, 1)(b, 1)
(a, 1)(a, 1)(a, 1)(b, 1)(b, 1)
(a, 1)(a, 1)
(a, 6)(a, 1)
(b, 5)
a b
(a, 1)(a, 1)(a, 1)
(b, 1)(b, 1)(b, 1)(b, 1)(b, 1)
![Page 47: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/47.jpg)
MAPPARTITIONS
x:
y:
mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD
A
B
A
B
x = sc.parallelize([1,2,3], 2)
def f(iterator): yield sum(iterator); yield 42
y = x.mapPartitions(f)
# glom() flattens elements on the same partitionprint(x.glom().collect())print(y.glom().collect())
[[1], [2, 3]]
[[1, 42], [5, 42]]
![Page 48: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/48.jpg)
MAPPARTITIONS
x:
y:
mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD
A
B
A
B
Array(Array(1), Array(2, 3))
Array(Array(1, 42), Array(5, 42))
val x = sc.parallelize(Array(1,2,3), 2)
def f(i:Iterator[Int])={ (i.sum,42).productIterator }
val y = x.mapPartitions(f)
// glom() flattens elements on the same partitionval xOut = x.glom().collect()val yOut = y.glom().collect()
![Page 49: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/49.jpg)
MAPPARTITIONSWITHINDEXRDD: x RDD: y
partitions
A
B
A
B
input
partition index
![Page 50: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/50.jpg)
x:
y:
mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition
A
B
A
B
x = sc.parallelize([1,2,3], 2)
def f(partitionIndex, iterator): yield (partitionIndex, sum(iterator))
y = x.mapPartitionsWithIndex(f)
# glom() flattens elements on the same partitionprint(x.glom().collect())print(y.glom().collect())
[[1], [2, 3]]
[[0, 1], [1, 5]]
MAPPARTITIONSWITHINDEX
partition index
B A
![Page 51: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/51.jpg)
x:
y:
mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
A
B
A
B
Array(Array(1), Array(2, 3))
Array(Array(0, 1), Array(1, 5))
MAPPARTITIONSWITHINDEX
partition index
B A
val x = sc.parallelize(Array(1,2,3), 2)
def f(partitionIndex:Int, i:Iterator[Int]) = { (partitionIndex, i.sum).productIterator }
val y = x.mapPartitionsWithIndex(f)
// glom() flattens elements on the same partitionval xOut = x.glom().collect()val yOut = y.glom().collect()
![Page 52: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/52.jpg)
SAMPLERDD: x RDD: y
1
3
5
4
3
2
1
![Page 53: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/53.jpg)
SAMPLE
x = sc.parallelize([1, 2, 3, 4, 5])y = x.sample(False, 0.4, 42)print(x.collect())print(y.collect())
[1, 2, 3, 4, 5]
[1, 3]
RDD: x
RDD: y
x:
y:
sample(withReplacement, fraction, seed=None)
Return a new RDD containing a statistical sample of the original RDD
val x = sc.parallelize(Array(1, 2, 3, 4, 5))val y = x.sample(false, 0.4)
// omitting seed will yield different outputprintln(y.collect().mkString(", "))
![Page 54: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/54.jpg)
UNION RDD: x RDD: y
4
3
3
2
1
A
BC
4
3
3
2
1
A
B
C
RDD: z
![Page 55: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/55.jpg)
UNION
x = sc.parallelize([1,2,3], 2)y = sc.parallelize([3,4], 1)z = x.union(y)print(z.glom().collect())
[1, 2, 3]
[3, 4]
[[1], [2, 3], [3, 4]]
x:
y:
union(otherRDD)
Return a new RDD containing all items from two original RDDs. Duplicates are not culled.
val x = sc.parallelize(Array(1,2,3), 2)val y = sc.parallelize(Array(3,4), 1)val z = x.union(y)val zOut = z.glom().collect()
z:
![Page 56: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/56.jpg)
5B
JOIN RDD: x RDD: y
42B
A1A
3A
![Page 57: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/57.jpg)
JOIN RDD: x RDD: y
RDD: z
(1, 3)A
5B
42B
A1A
3A
![Page 58: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/58.jpg)
JOIN RDD: x RDD: y
RDD: z(1, 4)A
(1, 3)A
5B
42B
A1A
3A
![Page 59: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/59.jpg)
JOIN RDD: x RDD: y
(2, 5) RDD: zB
(1, 4)A
(1, 3)A
5B
42B
A1A
3A
![Page 60: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/60.jpg)
JOIN
x = sc.parallelize([("a", 1), ("b", 2)])y = sc.parallelize([("a", 3), ("a", 4), ("b", 5)])z = x.join(y)print(z.collect()) [("a", 1), ("b", 2)]
[("a", 3), ("a", 4), ("b", 5)]
[('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]
x:
y:
union(otherRDD, numPartitions=None)
Return a new RDD containing all pairs of elements having the same key in the original RDDs
val x = sc.parallelize(Array(("a", 1), ("b", 2)))val y = sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))val z = x.join(y)println(z.collect().mkString(", "))
z:
![Page 61: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/61.jpg)
DISTINCT
4
3
3
2
1
RDD: x
![Page 62: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/62.jpg)
DISTINCTRDD: x
4
3
3
2
1
4
3
3
2
1
RDD: y
![Page 63: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/63.jpg)
DISTINCTRDD: x
4
3
3
2
1
4
3
2
1
RDD: y
![Page 64: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/64.jpg)
DISTINCT
x = sc.parallelize([1,2,3,3,4])y = x.distinct()
print(y.collect())[1, 2, 3, 3, 4]
[1, 2, 3, 4]
x:
y:
distinct(numPartitions=None)
Return a new RDD containing distinct items from the original RDD (omitting all duplicates)
val x = sc.parallelize(Array(1,2,3,3,4))val y = x.distinct()
println(y.collect().mkString(", "))
**
*¤¤
![Page 65: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/65.jpg)
COALESCERDD: x
A
B
C
![Page 66: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/66.jpg)
COALESCERDD: x
B
C
RDD: y
AAB
![Page 67: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/67.jpg)
C
COALESCERDD: x
B
C
RDD: y
AAB
![Page 68: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/68.jpg)
COALESCE
x = sc.parallelize([1, 2, 3, 4, 5], 3)y = x.coalesce(2)print(x.glom().collect())print(y.glom().collect())
[[1], [2, 3], [4, 5]]
[[1], [2, 3, 4, 5]]
x:
y:
coalesce(numPartitions, shuffle=False)
Return a new RDD which is reduced to a smaller number of partitions
val x = sc.parallelize(Array(1, 2, 3, 4, 5), 3)val y = x.coalesce(2)val xOut = x.glom().collect()val yOut = y.glom().collect()
C
B
C
AAB
![Page 69: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/69.jpg)
KEYBYRDD: x
James
Anna
Fred
John
RDD: y
J “John”
emits‘J’
![Page 70: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/70.jpg)
KEYBYRDD: x
James
Anna
‘F’
Fred “Fred”F
RDD: y
J “John”John
![Page 71: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/71.jpg)
James
“Anna”A
KEYBYRDD: x
Anna
‘A’
Fred
John
“Fred”F
RDD: y
J “John”
![Page 72: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/72.jpg)
J “James”
“Anna”A
KEYBYRDD: x
James
Anna
emits‘J’Fred
John
“Fred”F
RDD: y
J “John”
![Page 73: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/73.jpg)
KEYBY
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])y = x.keyBy(lambda w: w[0])print y.collect()
['John', 'Fred', 'Anna', 'James']
[('J','John'),('F','Fred'),('A','Anna'),('J','James')]
RDD: x
RDD: y
x:
y:
keyBy(f)
Create a Pair RDD, forming one pair for each item in the original RDD. The pair’s key is calculated from the value via a user-supplied function.
val x = sc.parallelize(Array("John", "Fred", "Anna", "James"))
val y = x.keyBy(w => w.charAt(0))println(y.collect().mkString(", "))
![Page 74: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/74.jpg)
PARTITIONBYRDD: x
J “John”
“Anna”A
“Fred”F
J “James”
![Page 75: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/75.jpg)
PARTITIONBYRDD: x
J “John”
“Anna”A
“Fred”F
J “James”
RDD: yRDD: y
J “James”
1
![Page 76: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/76.jpg)
PARTITIONBYRDD: x
J “John”
“Anna”A
“Fred”F
RDD: yRDD: y
J “James”
“Fred”F
0
J “James”
![Page 77: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/77.jpg)
PARTITIONBYRDD: x
J “John”
“Anna”A
RDD: yRDD: y
J “James”
“Anna”A
“Fred”F
0
“Fred”F
J “James”
![Page 78: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/78.jpg)
PARTITIONBYRDD: x
J “John”
RDD: yRDD: y
J “John”
J “James”
“Anna”A
“Fred”F1
“Anna”A
“Fred”F
J “James”
![Page 79: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/79.jpg)
PARTITIONBY
x = sc.parallelize([('J','James'),('F','Fred'), ('A','Anna'),('J','John')], 3)
y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1)print x.glom().collect()print y.glom().collect()
[[('J', 'James')], [('F', 'Fred')], [('A', 'Anna'), ('J', 'John')]]
[[('A', 'Anna'), ('F', 'Fred')], [('J', 'James'), ('J', 'John')]]
x:
y:
partitionBy(numPartitions, partitioner=portable_hash)
Return a new RDD with the specified number of partitions, placing original items into the partition returned by a user supplied function
![Page 80: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/80.jpg)
PARTITIONBY
Array(Array((A,Anna), (F,Fred)), Array((J,John), (J,James)))
Array(Array((F,Fred), (A,Anna)), Array((J,John), (J,James)))
x:
y:
partitionBy(numPartitions, partitioner=portable_hash)
Return a new RDD with the specified number of partitions, placing original items into the partition returned by a user supplied function.
import org.apache.spark.Partitionerval x = sc.parallelize(Array(('J',"James"),('F',"Fred"),
('A',"Anna"),('J',"John")), 3)
val y = x.partitionBy(new Partitioner() { val numPartitions = 2 def getPartition(k:Any) = {
if (k.asInstanceOf[Char] < 'H') 0 else 1 }})
val yOut = y.glom().collect()
![Page 81: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/81.jpg)
ZIP RDD: x RDD: y
3
2
1
A
B
9
4
1
A
B
![Page 82: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/82.jpg)
ZIP RDD: x RDD: y
3
2
1
A
B
4
A
RDD: z
9
4
1
A
B
1 1
![Page 83: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/83.jpg)
ZIP RDD: x RDD: y
3
2
1
A
B
4
A
RDD: z
9
4
1
A
B
2
1
4
1
![Page 84: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/84.jpg)
ZIP RDD: x RDD: y
3
2
1
A
B
4
3
A
B
RDD: z
9
4
1
A
B
2
1
9
4
1
![Page 85: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/85.jpg)
ZIP
x = sc.parallelize([1, 2, 3])y = x.map(lambda n:n*n)z = x.zip(y)
print(z.collect()) [1, 2, 3]
[1, 4, 9]
[(1, 1), (2, 4), (3, 9)]
x:
y:
zip(otherRDD)
Return a new RDD containing pairs whose key is the item in the original RDD, and whose value is that item’s corresponding element (same partition, same index) in a second RDD
val x = sc.parallelize(Array(1,2,3))val y = x.map(n=>n*n)val z = x.zip(y)
println(z.collect().mkString(", "))
z:
![Page 86: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/86.jpg)
ACTIONS Core Operations
![Page 87: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/87.jpg)
vs
distributed
driver
occurs across the cluster
result must fit in driver JVM
![Page 88: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/88.jpg)
GETNUMPARTITIONS
partition(s)
A
B
2
![Page 89: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/89.jpg)
x:
y:
getNumPartitions()
Return the number of partitions in RDD
[[1], [2, 3]]
2
GETNUMPARTITIONSA
B
2
x = sc.parallelize([1,2,3], 2)y = x.getNumPartitions()
print(x.glom().collect())print(y)
val x = sc.parallelize(Array(1,2,3), 2)val y = x.partitions.sizeval xOut = x.glom().collect()println(y)
![Page 90: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/90.jpg)
COLLECT
partition(s)
A
B [ ]
![Page 91: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/91.jpg)
x:
y:
collect()
Return all items in the RDD to the driver in a single list
[[1], [2, 3]]
[1, 2, 3]
A
B
x = sc.parallelize([1,2,3], 2)y = x.collect()
print(x.glom().collect())print(y)
val x = sc.parallelize(Array(1,2,3), 2)val y = x.collect()
val xOut = x.glom().collect()println(y)
COLLECT [ ]
![Page 92: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/92.jpg)
REDUCE
3
2
1
emits
4
3
![Page 93: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/93.jpg)
4
REDUCE
3
2
1
emits
3input
6
![Page 94: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/94.jpg)
4
REDUCE
3
2
1
10input
6
10
![Page 95: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/95.jpg)
x:
y:
reduce(f)
Aggregate all the elements of the RDD by applying a user function pairwise to elements and partial results, and returns a result to the driver
[1, 2, 3, 4]
10
x = sc.parallelize([1,2,3,4])y = x.reduce(lambda a,b: a+b)
print(x.collect())print(y)
val x = sc.parallelize(Array(1,2,3,4))val y = x.reduce((a,b) => a+b)
println(x.collect.mkString(", "))println(y)
REDUCE******
***
***
![Page 96: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/96.jpg)
AGGREGATE
3
2
1
4
A
B
![Page 97: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/97.jpg)
AGGREGATE
3
2
1
4
![Page 98: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/98.jpg)
AGGREGATE
3
2
1
4
emits
([], 0)
([1], 1)
![Page 99: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/99.jpg)
AGGREGATE
3
2
1
4
([], 0)
([2], 2) ([1],
1)
![Page 100: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/100.jpg)
AGGREGATE
3
2
1
4
([2], 2) ([1],
1)([1,2], 3)
![Page 101: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/101.jpg)
AGGREGATE
3
2
1
4
([2], 2) ([1],
1)([1,2], 3)
([3], 3)
([], 0)
![Page 102: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/102.jpg)
AGGREGATE
3
2
1
4
([2], 2) ([1],
1)([1,2], 3)
([3], 3)
([], 0)
([4], 4)
![Page 103: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/103.jpg)
AGGREGATE
3
2
1
4
([2], 2) ([1],
1)([1,2], 3)
([4], 4) ([3],
3)([3,4], 7)
![Page 104: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/104.jpg)
AGGREGATE
3
2
1
4
([2], 2) ([1],
1)([1,2], 3)
([4], 4) ([3],
3)([3,4], 7)
![Page 105: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/105.jpg)
AGGREGATE
3
2
1
4
([1,2], 3)
([3,4], 7)
![Page 106: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/106.jpg)
AGGREGATE
3
2
1
4
([1,2], 3)
([3,4], 7)
([1,2,3,4], 10)
([1,2,3,4], 10)
![Page 107: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/107.jpg)
aggregate(identity, seqOp, combOp)
Aggregate all the elements of the RDD by: - applying a user function to combine elements with user-supplied objects, - then combining those user-defined results via a second user function, - and finally returning a result to the driver.
seqOp = lambda data, item: (data[0] + [item], data[1] + item)combOp = lambda d1, d2: (d1[0] + d2[0], d1[1] + d2[1])
x = sc.parallelize([1,2,3,4])y = x.aggregate(([], 0), seqOp, combOp)
print(y)
AGGREGATE*******
*****
[( ),#]
x:
y:
[1, 2, 3, 4]
([1, 2, 3, 4], 10)
![Page 108: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/108.jpg)
def seqOp = (data:(Array[Int], Int), item:Int) => (data._1 :+ item, data._2 + item)
def combOp = (d1:(Array[Int], Int), d2:(Array[Int], Int)) => (d1._1.union(d2._1), d1._2 + d2._2) val x = sc.parallelize(Array(1,2,3,4))val y = x.aggregate((Array[Int](), 0))(seqOp, combOp)
println(y)
x:
y:
aggregate(identity, seqOp, combOp)
Aggregate all the elements of the RDD by: - applying a user function to combine elements with user-supplied objects, - then combining those user-defined results via a second user function, - and finally returning a result to the driver.
[1, 2, 3, 4]
(Array(3, 1, 2, 4),10)
AGGREGATE*******
*****
[( ),#]
![Page 109: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/109.jpg)
MAX
42
1
4
![Page 110: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/110.jpg)
x:
y:
max()
Return the maximum item in the RDD
[2, 4, 1]
4
MAX4
x = sc.parallelize([2,4,1])y = x.max()
print(x.collect())print(y)
val x = sc.parallelize(Array(2,4,1))val y = x.max
println(x.collect().mkString(", "))println(y)
2
1
4max
![Page 111: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/111.jpg)
SUM
72
1
4
![Page 112: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/112.jpg)
x:
y:
sum()
Return the sum of the items in the RDD
[2, 4, 1]
7
SUM7
x = sc.parallelize([2,4,1])y = x.sum()
print(x.collect())print(y)
val x = sc.parallelize(Array(2,4,1))val y = x.sum
println(x.collect().mkString(", "))println(y)
2
1
4Σ
![Page 113: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/113.jpg)
MEAN
2.333333332
1
4
![Page 114: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/114.jpg)
x:
y:
mean()
Return the mean of the items in the RDD
[2, 4, 1]
2.3333333
MEAN2.3333333
x = sc.parallelize([2,4,1])y = x.mean()
print(x.collect())print(y)
val x = sc.parallelize(Array(2,4,1))val y = x.mean
println(x.collect().mkString(", "))println(y)
2
1
4x
![Page 115: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/115.jpg)
STDEV
1.24721912
1
4
![Page 116: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/116.jpg)
x:
y:
stdev()
Return the standard deviation of the items in the RDD
[2, 4, 1]
1.2472191
STDEV1.2472191
x = sc.parallelize([2,4,1])y = x.stdev()
print(x.collect())print(y)
val x = sc.parallelize(Array(2,4,1))val y = x.stdev
println(x.collect().mkString(", "))println(y)
2
1
4σ
![Page 117: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/117.jpg)
COUNTBYKEY
{'A': 1, 'J': 2, 'F': 1}
J “John”
“Anna”A
“Fred”F
J “James”
![Page 118: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/118.jpg)
x:
y:
countByKey()
Return a map of keys and counts of their occurrences in the RDD
[('J', 'James'), ('F','Fred'), ('A','Anna'), ('J','John')]
{'A': 1, 'J': 2, 'F': 1}
COUNTBYKEY{A: 1, 'J': 2, 'F': 1}
x = sc.parallelize([('J', 'James'), ('F','Fred'), ('A','Anna'), ('J','John')])
y = x.countByKey()print(y)
val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),
('A',"Anna"),('J',"John")))
val y = x.countByKey()println(y)
![Page 119: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/119.jpg)
SAVEASTEXTFILE
![Page 120: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/120.jpg)
x:
y:
saveAsTextFile(path, compressionCodecClass=None)Save the RDD to the filesystem indicated in the path
[2, 4, 1]
[u'2', u'4', u'1']
SAVEASTEXTFILE
dbutils.fs.rm("/temp/demo", True)x = sc.parallelize([2,4,1])x.saveAsTextFile("/temp/demo")
y = sc.textFile("/temp/demo")print(y.collect())
dbutils.fs.rm("/temp/demo", true)val x = sc.parallelize(Array(2,4,1))x.saveAsTextFile("/temp/demo")
val y = sc.textFile("/temp/demo")println(y.collect().mkString(", "))
![Page 121: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/121.jpg)
LAB
![Page 122: Transformations and actions a visual guide training](https://reader035.fdocuments.net/reader035/viewer/2022062420/55cd178bbb61eb3c7a8b456b/html5/thumbnails/122.jpg)
Q&A