Apache Spark - Yandex
Date posted: 02-Nov-2021
• Apache Solr-based search engine for e-commerce
• map-reduce
• 3
Whole program comparison
Transformations
def filter(f: T => Boolean): RDD[T]
def map[U: ClassTag](f: T => U): RDD[U]
def foreachPartition(f: Iterator[T] => Unit): Unit
def zipWithIndex(): RDD[(T, Long)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
In one partition (narrow): filter, map, foreachPartition, zipWithIndex. The join and reduceByKey operations generally move data between partitions (a shuffle).
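The difference between narrow operations and those that need a shuffle can be tried in local mode. The following is a minimal sketch, not taken from the slides: the object name, the `local[2]` master, the application name, and the sample numbers are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NarrowVsWide {
  // Builds a small RDD, applies narrow steps, then one step that shuffles.
  def run(): Seq[(Int, Int)] = {
    // Local master and app name are assumptions for a single-machine run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("narrow-vs-wide")
    val sc = new SparkContext(conf)
    try {
      val nums = sc.parallelize(1 to 10, numSlices = 2)

      // Narrow: each output partition is computed from one input partition.
      val evens = nums.filter(_ % 2 == 0)
      val keyed = evens.map(n => (n % 4, n))

      // Values for one key may sit in different partitions,
      // so Spark regroups them by key before reducing.
      val sums = keyed.reduceByKey(_ + _, 2)

      // The lineage printout shows where the shuffle is introduced.
      println(sums.toDebugString)
      sums.collect().sortBy(_._1).toSeq
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", ")) // prints "(0,12), (2,18)"
}
```

Only the `reduceByKey` step forces data to be exchanged between partitions; `filter` and `map` are pipelined within each partition.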
Hadoop
Spark
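Scala
// spark is a SparkContext; the HDFS paths are elided in the original
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))  // split each line into words
                 .map(word => (word, 1))             // pair every word with a count of 1
                 .reduceByKey(_ + _)                 // sum the counts per word
counts.saveAsTextFile("hdfs://...")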
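To try the same word count without an HDFS cluster, in-memory lines can stand in for the input file. This is a sketch under stated assumptions: the object name, the `local[2]` master, the application name, and the sample sentences are hypothetical, and the result is collected to the driver instead of being saved to HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalWordCount {
  // Counts words in the given lines using the same flatMap/map/reduceByKey chain.
  def count(lines: Seq[String]): Map[String, Int] = {
    // Local master and app name are assumptions for a single-machine run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("local-word-count")
    val spark = new SparkContext(conf)
    try {
      spark.parallelize(lines)               // stands in for spark.textFile(...)
        .flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .collectAsMap()                      // bring the small result to the driver
        .toMap
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq("to be or not to be", "to do") // hypothetical input
    count(sample).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

Collecting to the driver is only reasonable for small results; for real data the `saveAsTextFile` call from the slide is the appropriate sink.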
SQL
• Classification and regression
  – linear support vector machine (SVM)
  – logistic regression
  – linear least squares, Lasso, and ridge regression
  – decision tree
  – naive Bayes
• Collaborative filtering
  – alternating least squares (ALS)
• Optimization
  – stochastic gradient descent
  – limited-memory BFGS (L-BFGS)
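As one illustration of the list above, collaborative filtering with ALS can be invoked through the RDD-based `spark.mllib` API. This is a hedged sketch, not the presenter's code: the object name, the ratings, the rank, the iteration count, and the regularization value are all made up for the example, and it assumes the `spark-mllib` artifact is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsSketch {
  // Trains a tiny ALS model and returns the predicted rating for (user 1, product 3).
  def predictSample(): Double = {
    // Local master and app name are assumptions for a single-machine run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("als-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical (user, product, rating) triples.
      val ratings = sc.parallelize(Seq(
        Rating(1, 1, 5.0), Rating(1, 2, 1.0),
        Rating(2, 1, 4.0), Rating(2, 3, 5.0),
        Rating(3, 2, 2.0), Rating(3, 3, 4.0)
      ))
      // rank = 2 latent factors, 5 iterations, lambda = 0.01 (all arbitrary choices).
      val model = ALS.train(ratings, 2, 5, 0.01)
      model.predict(1, 3)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(s"Predicted rating for user 1, product 3: ${predictSample()}")
}
```

The predicted value depends on random initialization, so it will vary between runs.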
Batch Interactive Streaming
API