Scala+spark 2nd

Transcript
Page 1: Scala+spark 2nd

"Scalable Language"

China Mobile

Page 2: Scala+spark 2nd

• Introduction
• Functional Programming (FP)
• Object-Oriented Programming (OO)
• Type System
• Monad

Page 3: Scala+spark 2nd

Built the current javac
One of the designers of Generic Java

Martin Odersky
-- the designer of Scala

Compiles to Java bytecode; interoperates with Java almost seamlessly
Statically typed, with a powerful type system

Who?

Page 4: Scala+spark 2nd

Lisp

Erlang

Haskell

Java

"All languages under heaven come from Lisp," and Scala's design philosophy is fairly close to Lisp's.

What?

JVM: calls to and from Java

Monad

Parallel computation model

Type system

Scala

Page 5: Scala+spark 2nd

Twitter: migrated from Ruby to Scala in 2009.

Guardian: migrated from Java to Scala in 2011.

Coursera

Spark

Meetup

Linkedin

Gilt

Foursquare

Who uses Scala? Scala usage is quite diverse: there are Spark applications, and many websites also use Scala for their backends.

Page 6: Scala+spark 2nd

19楼 (19lou): attracted by the Actor model, migrated from Java to Scala.

豌豆荚 (Wandoujia): home to Deng Caoyuan, one of the leading Scala experts in China.

Alibaba middleware team

Spark

蘑菇街 (Mogujie), 看处方 (Kanchufang)

乔布堂 (Qiaobutang)

唯品会 (Vipshop)

Who uses Scala? Large companies are mostly driven by Spark, and many use Scala for middleware, exposing language-agnostic interfaces externally.

Page 7: Scala+spark 2nd

Advantages

1. Multi-paradigm mix with strong expressive power
2. Can call Java packages, so compatibility is strong
3. Statically and strongly typed, compiled directly to bytecode; speed on par with Java

Page 8: Scala+spark 2nd

Disadvantages

1. Functional programming
2. Functional programming
3. The type system is complex and the learning curve is steep

Page 9: Scala+spark 2nd

λ expressions

Expr = Iden
     | Iden => Expr
     | (Expr) (Expr)

x, y, name, id, person

x => name, x => id, x => x

x(y), y(x), (x => name) y, (x => +(x)(1)) 3

x => y => +(x)(y)
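The grammar above can be written down directly as a Scala ADT; a minimal sketch (the constructor names are illustrative, not from the slide):

// Untyped lambda-calculus terms as a Scala ADT
sealed trait Expr
case class Iden(name: String) extends Expr                 // Iden
case class Lambda(param: String, body: Expr) extends Expr  // Iden => Expr
case class Apply(fn: Expr, arg: Expr) extends Expr         // (Expr) (Expr)

// x => y => +(x)(y)
val addTerm: Expr =
  Lambda("x", Lambda("y", Apply(Apply(Iden("+"), Iden("x")), Iden("y"))))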

Page 10: Scala+spark 2nd

λ calculus: α-conversion, β-reduction, η-conversion

x => x == y => y, x => +(x)(z) == y => +(y)(z)

(x => +(x)(3)) 2 == +(2)(3)

(x => y => +(x)(y)) 2 3 == +(2)(3)

x => f(x) == f

Page 11: Scala+spark 2nd

Church numerals

Zero = f => x => x
One  = f => x => f(x)
Two  = f => x => f(f(x))

Succ = n => f => x => f(n(f)(x))

type ChurchNumber[A] = (A => A) => A => A
def zero[A]: ChurchNumber[A] = f => a => a
def succ[A](n: ChurchNumber[A]): ChurchNumber[A] = f => a => f(n(f)(a))

val a1: Int = 0
val f1: Int => Int = x => x + 1

val a2: List[Int] = List()
val f2: List[Int] => List[Int] = list => 1 :: list

val a3: String = ""
val f3: String => String = s => "|" + s

println(zero(f1)(a1))
println(succ(succ(zero[Int]))(f1)(a1))

Number = Zero | Succ Number
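A hedged extension of the slide's encoding: addition of Church numerals. `plus` is not on the slide, but it follows directly from the definitions of ChurchNumber, zero and succ above.

def plus[A](m: ChurchNumber[A], n: ChurchNumber[A]): ChurchNumber[A] =
  f => a => m(f)(n(f)(a))   // apply f "m more times" on top of n

val two   = succ(succ(zero[Int]))
val three = succ(two)
println(plus(two, three)(f1)(a1))  // 5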

Page 12: Scala+spark 2nd

import scala.annotation.tailrec

type Segment = (List[Int], List[Int], List[Int])

object Split {
  def unapply(xs: List[Int]) = {
    val pivot = xs(xs.size / 2)
    @tailrec
    def partition(s: Segment, ys: List[Int]): Segment = {
      val (left, mid, right) = s
      ys match {
        case Nil => s
        case head :: tail if head < pivot  => partition((head :: left, mid, right), tail)
        case head :: tail if head == pivot => partition((left, head :: mid, right), tail)
        case head :: tail if head > pivot  => partition((left, mid, head :: right), tail)
      }
    }
    Some(partition((Nil, Nil, Nil), xs))
  }
}

def qsort(xs: List[Int]): List[Int] = xs match {
  case Nil => xs
  case Split(left, pivot, right) => qsort(left) ::: pivot ::: qsort(right)
}

Quick Sort

Tail recursion
Extractor

Pattern matching

Guard

Page 13: Scala+spark 2nd

Pattern Matching

def sum(list: List[Int]): Int =
  if (list.isEmpty) 0
  else list.head + sum(list.tail)

def sum(list: List[Int]): Int = list match {
  case List() => 0
  case head :: tail => head + sum(tail)
}

Page 14: Scala+spark 2nd

Tail recursion

def sum(list: List[Int], acc: Int): Int = list match {
  case Nil => acc
  case head :: tail => sum(tail, acc + head)
}

Page 15: Scala+spark 2nd
Page 16: Scala+spark 2nd

var list = (1 to 100).toArray

for (int i = 0; i < 100; i++) {
  list[i] += 1;
}

list = list.map(1 +)

Why functional programming?

Page 17: Scala+spark 2nd

var list = (1 to 100).toArray

for (int i = 0; i < 100; i++) {
  list[i] += 1;
}

list = list.view.map(1 +)

Why functional programming?

Page 18: Scala+spark 2nd

var list = (1 to 100).toArray

for (int i = 0; i < 100; i++) {
  list[i] += 1;
}

list = list.par.map(1 +)

Why functional programming?
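A small sketch pulling the three variants from the last few slides together (the data comes from the slides; in newer Scala versions .par needs the separate parallel-collections module):

val xs = (1 to 100).toArray

val strict   = xs.map(1 + _)       // evaluated right away, allocates a new array
val lazyView = xs.view.map(1 + _)  // a view: evaluated only when elements are demanded
val parallel = xs.par.map(1 + _)   // evaluated in parallel across partitions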

Page 19: Scala+spark 2nd

6 ^ 6

6 * 6 * 6 * 6 * 6 * 6

def ^(x: Int, y: Int): Int = {
  if (y == 0) 1
  else if (y % 2 == 0) ^(x * x, y / 2)
  else x * ^(x, y - 1)
}

Why functional programming?

Page 20: Scala+spark 2nd

Currying

5 + 3          : Int
5 + _          : Int => Int
+              : (Int, Int) => Int
+ (curried)    : Int => Int => Int

fold(z: Int)(f: (Int, Int) => Int): Int

val list = List(1, 2, 3, 4)

def fold0 = list.foldLeft(0) _    // : ((Int, Int) => Int) => Int
def fold1 = list.foldLeft(1) _    // : ((Int, Int) => Int) => Int

fold0((x, y) => x + y)
fold1((x, y) => x * y)
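A minimal currying sketch in the same spirit as the fold0/fold1 example (the `add` name is illustrative, not from the slide):

def add(x: Int)(y: Int): Int = x + y   // + in curried form: Int => Int => Int

val add5: Int => Int = add(5) _        // partially applied, like 5 + _
println(add5(3))                       // 8

val nums = List(1, 2, 3, 4)
println(nums.foldLeft(0)(_ + _))       // 10: fold0 with addition
println(nums.foldLeft(1)(_ * _))       // 24: fold1 with multiplication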

Page 21: Scala+spark 2nd

Side effects

class Pair[A](var x: A, var y: A) {
  def modifyX(x: A) = this.x = x
  def modifyY(y: A) = this.y = y
}

var pair = new Pair(1, 2)
var pair1 = new Pair(pair, pair)
var pair2 = new Pair(pair, new Pair(1, 2))

pair.modifyX(3)

Value vs. reference

Page 22: Scala+spark 2nd

Side effects and associativity

var variable = 0

implicit class FooInt(i: Int) {
  def |+|(j: Int) = {
    variable = (i + j) / 2
    i + j + variable
  }
}

(1 |+| 2) |+| 3   // = 10
1 |+| (2 |+| 3)   // = 12

Page 23: Scala+spark 2nd

Side effects and associativity

var variable = 0

implicit class FooInt(i: Int) {
  def |+|(j: Int) = {
    variable += 1
    i + j * variable
  }
}

(1 |+| 2) |+| 3   // = 9
1 |+| (2 |+| 3)   // = 11
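For contrast (not on the slide): once the shared mutable state is gone, the operator is associative and both groupings agree.

implicit class PureFooInt(i: Int) {
  def |+|(j: Int): Int = i + j   // no shared mutable state
}

println((1 |+| 2) |+| 3)  // 6
println(1 |+| (2 |+| 3))  // 6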

Page 24: Scala+spark 2nd

Higher-Order Functions

Transformations (return a NEW collection):
map(f: T => U): A[U]
filter(f: T => Boolean): A[T]
flatMap(f: T => A[T]): A[T]
groupBy(f: T => K): A[(K, List[T])]
sortBy(f: T => K): A[T]

Actions:
count: Int
force: A[T]
reduce(f: (T, T) => T): T

Page 25: Scala+spark 2nd

Map

[A] -> (A -> B) -> [B]        Higher-order function
(A -> B) -> ([A] -> [B])

List(1, 2, 3, 4).map(_.toString)

Page 26: Scala+spark 2nd

Filter

[A] -> (A -> Boolean) -> [A]        Higher-order function
(A -> Boolean) -> ([A] -> [A])

List(1, 2, 3, 4).filter(_ < 3)

Page 27: Scala+spark 2nd

Fold

Identity element (zero)

[A] -> B -> (B -> A -> B) -> B        Higher-order function
B -> (B -> A -> B) -> ([A] -> B)

val list = List("one", "two", "three")
list.foldLeft(0)((sum, str) => {
  if (str.contains("o")) sum + 1
  else sum
})

Page 28: Scala+spark 2nd

Flatten

[[A]] -> [A]        Higher-order function

List(List(1, 2), List(3, 5)).flatten
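A related sketch (not on the slide): flatMap behaves like map followed by flatten.

val nested = List(List(1, 2), List(3, 5))
println(nested.flatten)                                // List(1, 2, 3, 5)

println(List(1, 2).map(x => List(x, x * 10)).flatten)  // List(1, 10, 2, 20)
println(List(1, 2).flatMap(x => List(x, x * 10)))      // List(1, 10, 2, 20)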

Page 29: Scala+spark 2nd

Quick Sort

object Split {
  def unapply(xs: List[Int]) = {
    val pivot = xs(xs.size / 2)
    Some(xs.partitionBy(pivot))
  }
}

def qsort(xs: List[Int]): List[Int] = xs match {
  case Nil => xs
  case Split(left, pivot, right) => qsort(left) ::: pivot ::: qsort(right)
}

Page 30: Scala+spark 2nd

Quick Sort

type Segment = (List[Int], List[Int], List[Int])

implicit class ListWithPartition(val list: List[Int]) extends AnyVal {
  def partitionBy(p: Int): Segment = {
    val idenElem = (List[Int](), List[Int](), List[Int]())
    def partition(result: Segment, x: Int): Segment = {
      val (left, mid, right) = result
      if (x < p) (x :: left, mid, right)
      else if (x == p) (left, x :: mid, right)
      else (left, mid, x :: right)
    }
    list.foldLeft(idenElem)(partition)
  }
}

Implicit conversion
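Usage sketch: with this implicit class in scope, the extractor on the previous slide can call partitionBy on any List[Int] (the sample list below is illustrative):

val (left, mid, right) = List(3, 1, 4, 1, 5, 9, 2, 6).partitionBy(4)
// left:  elements < 4  -> List(2, 1, 1, 3)  (order reflects the fold's prepending)
// mid:   elements == 4 -> List(4)
// right: elements > 4  -> List(6, 9, 5)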

Page 31: Scala+spark 2nd

Map

[A] -> (A -> B) -> [B]

Par        Higher-order function

Page 32: Scala+spark 2nd

Lazy evaluation

val foo = List(1, 2, 3, 4, 5)
baz = foo.map(5 +).map(3 +).filter(_ > 10).map(4 *)
baz.take(2)

Yet we end up with three intermediate results:
foo.map(5 +)
foo.map(5 +).map(3 +)
foo.map(5 +).map(3 +).filter(_ > 10)

In an imperative language:
for (int i = 0; i < 5; ++i) {
  int x = foo[i] + 5 + 3;
  if (x > 10) bar.add(x * 4);
  else continue;
}

When we declare the computation, what we want is a promise (a computation), not the result itself.

Page 33: Scala+spark 2nd

Map

[A] -> (A -> B) -> [B]

View        Higher-order function

Page 34: Scala+spark 2nd

val fibs: Stream[Int] = 0 #:: 1 #:: fibs.zip(fibs.tail).map(n => n._1 + n._2)

Streams and lazy evaluation

Quora

Lazy evaluation
zip = ([A], [B]) => [(A, B)]
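Usage sketch: nothing in fibs is computed until elements are actually demanded.

println(fibs.take(10).toList)  // List(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)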

Page 35: Scala+spark 2nd

Lazy evaluation

lazy val x = 3 + 3

def number = {println("OK"); 3 + 3}

class LazyValue(expr: => Int) {
  var evaluated: Boolean = false
  var value: Int = -1

  def get: Int = {
    if (!evaluated) {
      value = expr
      evaluated = true
    }
    value
  }
}

val lazyValue = new LazyValue(number)

println(lazyValue.get)

println(lazyValue.get)

Thinking in Java

Map can be implemented with the decorator pattern

Call By Name

Page 36: Scala+spark 2nd

Object orientation

Scala is an object-oriented language; at the very least it is object-oriented to a higher degree than Java. Everything is an object, including 1, 2, 1.1, and so on.

The 1 + 2 we write is actually 1.+(2), although the compiler substitutes primitive types at compile time. And the function x: Int => x.toString is a Function1[Int, String].

So you can write map(5 +) but not map(+ 5).

Page 37: Scala+spark 2nd

Some syntactic sugar

class Sugar(i: Int) {
  def unary_- = -i
  def apply(expr: => Unit) = for (j <- 1 to i) expr
  def +(that: Int) = i + that
  def +:(that: Int) = i + that
}

val sugar = new Sugar(2)

-sugar
sugar(println("aha"))
sugar + 5
5 + sugar

Prefix

Infix (the dot and parentheses can be omitted)

Operator precedence, from lowest to highest:
  all letters
  |
  ^
  &
  < >
  = !
  :   (note: methods ending in ":" are right-associative)
  + -
  * / %
  other special characters

Right-associativity exists to support DSLs and to carry on functional-programming conventions. Please use these features sparingly.

Page 38: Scala+spark 2nd

Mix-ins are a way to do multiple inheritance. Like interfaces, they keep the complexity of multiple inheritance in check by restricting what the second parent may be, but they can carry default implementations.
1. Ordinary inheritance provides single inheritance
2. The second and any further parents must be traits
3. A trait cannot be instantiated on its own
Traits in Scala can be mixed in at compile time or at runtime.

Trait & Mix-in

But obviously a person can also run and sing... and, on top of that, program.

Suppose we want to describe a bird that can sing and run; since it is a bird, it can of course also fly.

abstract class Bird(kind: String) {
  val name: String
  def singMyName = println(s"$name is singing")
  val capability: Int
  def run = println(s"I can run $capability meters!!!")
  def fly = println(s"flying of kind: $kind")
}

(Not that I discriminate against birds, but if you meet one that can program, please let me know.)

Inheritance

Page 39: Scala+spark 2nd

trait Runnable {
  val capability: Int
  def run = println(s"I can run $capability meters!!!")
}

trait Singer {
  val name: String
  def singMyName = println(s"$name is singing")
}

abstract class Bird(kind: String) {
  def fly = println(s"flying of kind: $kind")
}

Inheritance

Page 40: Scala+spark 2nd

class Nightingale extends Bird("Nightingale") with Singer with Runnable {
  val capability = 20
  val name = "poly"
}

val myTinyBird = new Nightingale
myTinyBird.fly
myTinyBird.singMyName
myTinyBird.run

class Coder(language: String) {
  val capability = 10
  val name = "Handemelindo"
  def code = println(s"coding in $language")
}

val me = new Coder("Scala") with Runnable with Singer
me.code
me.singMyName
me.run

Inheritance

Page 41: Scala+spark 2nd

A companion

object Sugar {
  def apply(i: Int) = new Sugar(i)
}

The factory pattern can be implemented here

Companion object

Page 42: Scala+spark 2nd

Some companions

trait Tree
case class Leaf(info: String) extends Tree
case class Node(left: Tree, right: Tree) extends Tree

def traverse(tree: Tree): Unit = {
  tree match {
    case Leaf(info) => println(info)
    case Node(left, right) =>
      traverse(left)
      traverse(right)
  }
}

val tree: Tree = new Node(new Node(new Leaf("1"), new Leaf("2")), new Leaf("3"))
traverse(tree)

Case classes and ADTs

Inheritance as sum types; case classes as product types
Tree = Leaf String | Node Tree Tree

Page 43: Scala+spark 2nd

The type system

If you are a C programmer, a type is: an indicator that tells the computer how many bytes it needs to store these numbers.

If you are a Java programmer, a type is: a marker for where instances live, so that the compiler can check that your program is consistent.

If you are an R programmer, a type is: a flag for which statistical computations apply to these variables.

If you are a Ruby programmer, a type is: something you should avoid.

For a Scala programmer, a type is: what UML is to Java -- a guarantee of correctness, a blueprint of the program.
Guess what this is: e.g. [(K1, V1)] -> [(K2, [V2])] -> [(K2, V3)]

MapReduce

Page 44: Scala+spark 2nd

Value:             1        (1, 2)            [1, 2, 3]
Proper Type (*):   Int      Pair[Int, Int]    List[Int]      (and Any)
Type constructor:  List (* => *), Pair (* => * => *)
Kind:              *, * => *, ...

Kind => Type => Value

Generics of a Higher Kind - Martin Odersky

Page 45: Scala+spark 2nd

type Int :: *
type String :: *
type (Int => String) :: *
type List[Int] :: *

type List :: ?
type Function1 :: ??

Let's do some abstraction exercises

type List :: * => *
type Function1 :: * => * => *        Function1[-T, +R]

def id(x: Int) = x
type Id[A] = A

type id[A[_], B] = A[B]
def id(f: Int => Int, x: Int) = f(x)

Page 46: Scala+spark 2nd

type Pair[K[_], V[_]] = (K[A], V[A]) forSome { type A }

(* -> *) -> (* -> *) -> *

Suppose our program needs to return a result of the shape: (Set(x,x,x,x,x), List(x,x,x,x,x,x,x,x,x,x))

val pair: Pair[Set, List] = (Set("42"), List(52))

val pair: Pair[Set, List] = (Set(42), List(52))

Let's do some abstraction exercises

Page 47: Scala+spark 2nd

Recall: type Function1 :: * => * => *

For example, suppose we have the following function: def foo[A[_]](bar: A[Int]): A[Int] = bar

We can feed it a (* => *), for example: val foo1 = foo[List](List(1, 2, 3, 5, 8, 13))

But what if we have: def baz(x: Int) = println(x)

Type Lambda

What do we do? Fix Function1's second parameter to Unit, so that it becomes a * => *:
val foo2 = foo[({ type F[X] = Function1[X, Unit] })#F](baz)

Page 48: Scala+spark 2nd

trait Monoid[A] {
  val zero: A
  def append(x: A, y: A): A
}

object IntNum extends Monoid[Int] {
  val zero = 0
  def append(x: Int, y: Int) = x + y
}

object DoubleNum extends Monoid[Double] {
  val zero = 0d
  def append(x: Double, y: Double) = x + y
}

def sum[A](nums: List[A])(tc: Monoid[A]) = nums.foldLeft(tc.zero)(tc.append)

sum(List(1, 2, 3, 5, 8, 13))(IntNum)
sum(List(3.14, 1.68, 2.72))(DoubleNum)

Abstracting over the morphism

Page 49: Scala+spark 2nd

trait Monoid[A] {
  val zero: A
  def append(x: A, y: A): A
}

implicit object IntNum extends Monoid[Int] {
  val zero = 0
  def append(x: Int, y: Int) = x + y
}

implicit object DoubleNum extends Monoid[Double] {
  val zero = 0d
  def append(x: Double, y: Double) = x + y
}

def sum[A](nums: List[A])(implicit tc: Monoid[A]) = nums.foldLeft(tc.zero)(tc.append)

sum(List(1, 2, 3, 5, 8, 13))
sum(List(3.14, 1.68, 2.72))

Type Class

1. Decoupled abstraction
2. Composable
3. Overridable
4. Type-safe

val list = List(1, 3, 234, 56, 5346, 34)
list.sorted        // sorted[B >: A](implicit ord: math.Ordering[B])

Type Class
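A hedged extension of the Monoid type class above: supporting a new type is just one more implicit instance (the String instance is not on the slide).

implicit object StringConcat extends Monoid[String] {
  val zero = ""
  def append(x: String, y: String) = x + y
}

sum(List("a", "b", "c"))  // "abc"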

Page 50: Scala+spark 2nd

What type classes are for

List(1, 2, 3, 5) -> "1,2,3,5"

(1, 2) -> "1,2"

(List(1,2,3,5), List(8,13,21)) -> "1,2,3,5,8,13,21"

(List(1,2,3,5), (42.0, List("a", "b"))) -> "1,2,3,5,42.0,a,b"

Page 51: Scala+spark 2nd

What type classes are for

trait Writable[A] {
  def write(a: A): String
}

implicit def numericWritable[A: Numeric]: Writable[A] = new Writable[A] {
  def write(a: A): String = a.toString
}

implicit val stringWritable: Writable[String] = new Writable[String] {
  def write(a: String): String = a
}

implicit def listWritable[A: Writable]: Writable[List[A]] = new Writable[List[A]] {
  def write(a: List[A]): String = {
    val writableA = implicitly[Writable[A]]
    a.map(writableA.write).mkString(",")
  }
}

implicit def pairWritable[A: Writable, B: Writable]: Writable[(A, B)] = new Writable[(A, B)] {
  def write(p: (A, B)): String = {
    val writableA = implicitly[Writable[A]]
    val writableB = implicitly[Writable[B]]
    writableA.write(p._1) + "," + writableB.write(p._2)
  }
}
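A usage sketch matching the examples on the previous slide (the toStr helper is illustrative and assumes the implicits above are in scope):

def toStr[A: Writable](a: A): String = implicitly[Writable[A]].write(a)

toStr(List(1, 2, 3, 5))                     // "1,2,3,5"
toStr((1, 2))                               // "1,2"
toStr((List(1, 2, 3, 5), List(8, 13, 21)))  // "1,2,3,5,8,13,21"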

Page 52: Scala+spark 2nd

Hermann Weyl, "The Mathematical Way of Thinking": We now come to the decisive step of mathematical abstraction: we forget what these symbols stand for. We should not stop there; there are many operations that can be applied to these symbols without ever considering what they actually represent.

Monad

A monoid in the category of endofunctors

Philip Wadler

Page 53: Scala+spark 2nd

(1) Closure: for any a, b ∈ G, a*b ∈ G
(2) Associativity: for any a, b, c ∈ G, (a*b)*c = a*(b*c)
(3) Identity: there exists an identity element e such that for any a ∈ G, e*a = a*e = a
(4) Inverse: for any a ∈ G, there exists an inverse a^-1 such that a^-1*a = a*a^-1 = e

Group

What is a group (Group)? It satisfies (1)-(4).

What is a semigroup (SemiGroup)? It satisfies only (1) and (2).

What is a monoid (Monoid)? It satisfies (1), (2), and (3).

Page 54: Scala+spark 2nd

Monoid

Enough talk -- show me the code.

trait SemiGroup[T] {
  def append(a: T, b: T): T
}

trait Monoid[T] extends SemiGroup[T] {
  def zero: T
}

class ListMonoid[T] extends Monoid[List[T]] {
  def zero = Nil
  def append(a: List[T], b: List[T]) = a ++ b
}

Page 55: Scala+spark 2nd

Functor

What is a functor (Functor)?

Int    -> List[Int]
String -> List[String]

Functor

Page 56: Scala+spark 2nd

Functor

What is a functor (Functor)?

trait Functor[F[_]] {
  def map[A, B](f: (A) => B)(a: F[A]): F[B]
}

map[B](f: (A) => B): List[B]
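A hedged instance of the slide's Functor trait for List (not on the slide itself):

object ListFunctor extends Functor[List] {
  def map[A, B](f: (A) => B)(a: List[A]): List[B] = a.map(f)
}

ListFunctor.map((i: Int) => i.toString)(List(1, 2, 3))  // List("1", "2", "3")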

Page 57: Scala+spark 2nd
Page 58: Scala+spark 2nd

Monad: a monoid in the category of endofunctors

Recall the identity element of a monoid. Recall the fold function.
What is the identity on endofunctors? What is the associative operation on endofunctors?

unit x >>= f     ≡  f x
m >>= unit       ≡  m
(m >>= f) >>= g  ≡  m >>= (λx. f x >>= g)

Identity: lifts a value into the computational context
Associativity: composes simple computations into complex ones
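A minimal Monad sketch in the spirit of the laws above (unit lifts a value into the context, flatMap plays the role of >>=); this is an illustration, not the scalaz/cats definition.

trait Monad[M[_]] {
  def unit[A](a: A): M[A]
  def flatMap[A, B](m: M[A])(f: A => M[B]): M[B]
}

object OptionMonad extends Monad[Option] {
  def unit[A](a: A): Option[A] = Some(a)
  def flatMap[A, B](m: Option[A])(f: A => Option[B]): Option[B] = m.flatMap(f)
}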

Page 59: Scala+spark 2nd

Some common Monads: Option

Option (also called Maybe) represents a computation that may fail, written as Some(value) or None.

Some(x) fMap (f: A => B) = Some(f(x))
None    fMap (f: A => B) = None
unit = Some

val maybe: Option[Int] = Some(4)
val none: Option[Int] = None

def calculate(maybe: Option[Int]): Option[Int] = for {
  value <- maybe
} yield value + 5

calculate(maybe)
calculate(none)

Page 60: Scala+spark 2nd

Some common Monads: List

The collection itself is a proper type; it represents nondeterminism.
unit = List

val list1 = List(2, 4, 6, 8)
val list2 = List(1, 3, 5, 7)

for {
  value1 <- list1.map(1 +)
  value2 <- list2
} yield value1 + value2
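Roughly what the for comprehension above desugars to:

list1.map(1 +).flatMap(value1 => list2.map(value2 => value1 + value2))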

Page 61: Scala+spark 2nd

Future

Future wraps a computation; it represents a result that will arrive in the future.
unit = Future

val future1 = Future(SomeProcess)
val future2 = Future(AnotherProcess)

for {
  value1 <- future1.map(SomeTransformation)
  value2 <- future2
} yield value1 + value2

Some common Monads

Page 62: Scala+spark 2nd

for {
  (name, date(year, _, day)) <- nameList
  if name.length > 3
  char <- name
} yield char -> s"$name-$day@$year"

Usage

nameList.flatMap {
  case (name, date(year, _, day)) =>
    if (name.length > 3) {
      name.map { char =>
        char -> s"$name-$day@$year"
      }
    } else Map()
  case _ => Map()
}

val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r
val nameList = Map(
  "haskell" -> "1900-12-12",
  "godel"   -> "1906-04-28",
  "church"  -> "1903-06-14",
  "turing"  -> "1912/06/23")

Page 63: Scala+spark 2nd

var map = Map[Char, String]()
var i = 0
val list = nameList.toArray
while (i < list.size) {
  val name = list(i)._1
  val theDate = list(i)._2
  if (theDate.matches("\\d\\d\\d\\d-\\d\\d-\\d\\d")) {
    val parts = theDate.split("-")
    val year = parts(0)
    val day = parts(2)
    val charArray = name.toCharArray
    var j = 0
    while (j < charArray.length) {
      val char = charArray(j)
      map += char -> (name + "-" + day + "@" + year)
      j += 1
    }
  }
  i += 1
}

Page 64: Scala+spark 2nd
Page 65: Scala+spark 2nd

• Introduction
• MapReduce from an FP perspective
• RDD from an FP perspective
• RDD
• MLlib

Page 66: Scala+spark 2nd

Spark: a general-purpose parallel computing framework

Ecosystem:          Spark -- the platform is basically mature, but MLlib, Spark SQL, etc. are still evolving; MapReduce -- very mature, with many applications
Computation model:  Spark -- Monadic-style (not an actual Monad), Functor; MapReduce -- Map, Reduce
Storage:            Spark -- mainly memory; MapReduce -- mainly disk
Programming style:  Spark -- collection-oriented; MapReduce -- interface-oriented

Page 67: Scala+spark 2nd

Spark

The Spark stack:
• Libraries: Spark SQL, MLlib, GraphX, Spark Streaming
• Core: Spark (Map Reduce, Monadic style)
• Cluster managers: local mode, standalone mode, YARN, Mesos
• Storage: HDFS, Amazon S3, Hypertable, HBase, etc.

Page 68: Scala+spark 2nd

Advantages

1. Collection-oriented, convenient for development
2. Supports more kinds of computation than MR
3. In-memory computation is faster, and data can be persisted for iteration; when the data is not that "big", it can also be "fast"

Page 69: Scala+spark 2nd

Disadvantages

1. Memory is consumed quickly; consider serialization libraries such as Kryo
2. With lazy evaluation, computation time is hard to estimate and optimization is more difficult

Page 70: Scala+spark 2nd

[(K1, V1)] -> [(K2, [V2])] -> [(K2, V3)]

Word Count

[Line]
  -> flatMap(_.split("\\s+")).map((_, 1))
  -> [(Word, 1)]
  -> groupBy(_._1)
  -> [(Word, [1])]
  -> reduceBy(_._1)(_._2 + _._2)
  -> [(Word, n)]

Map Reduce
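A hedged sketch of the same pipeline on plain Scala collections (the sample lines are made up):

val lines = List("to be or not to be", "that is the question")

val counts: Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))
    .map((_, 1))
    .groupBy(_._1)
    .map { case (word, pairs) => word -> pairs.map(_._2).sum }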

Page 71: Scala+spark 2nd

RDD

Transformations (return a NEW RDD):
map(f: T => U)
filter(f: T => Boolean)
flatMap(f: T => Seq[U])
sample(fraction: Float)
groupByKey()
reduceByKey(f: (V, V) => V)
mapValues(f: V => W)
union()
join()
cogroup()
crossProduct()
sort(c: Comparator[K])
partitionBy(p: Partitioner[K])

Actions:
count()
collect()
reduce(f: (T, T) => T)
lookup(k: K)
save(path: String)
take(n: Int)

Page 72: Scala+spark 2nd

[(K1, V1)] -> [(K2, [V2])] -> [(K2, V3)]

Word Count

val lines = spark.textFile("hdfs://...")

val words = lines.flatMap(_.split("\\s+"))
val wordCounts = words.map((_, 1))
val result = wordCounts.reduceByKey(_ + _)

result.save("hdfs://…")

RDD

Page 73: Scala+spark 2nd

What is an RDD?

Properties of an RDD:
• An immutable, partitioned collection
• Can only be created by reading a file or by a Transformation
• Fault-tolerant
• Controllable storage level
• Cacheable
• Coarse-grained model
• Statically typed

new StorageLevel(useDisk, useMemory, deserialized, replication)

the cache() method

Recomputed via lineage
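A small sketch of the points above (assumes a SparkContext named sc; the pipeline is illustrative):

import org.apache.spark.storage.StorageLevel

val data   = sc.textFile("hdfs://...")                  // created by reading a file...
val counts = data.flatMap(_.split("\\s+")).map((_, 1))  // ...or through Transformations
counts.persist(StorageLevel.MEMORY_AND_DISK)            // controllable storage level
// counts.cache() would be shorthand for persist(MEMORY_ONLY);
// lost partitions are recomputed from the lineage.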

Page 74: Scala+spark 2nd

What is an RDD?

A lazy, parallel collection of computations

• Lazy: each Transformation describes what operation will be applied to the data
• Advantages of laziness: a single pass over the data, with enough information available to batch operations automatically

• Parallel: we put the data into a computational context, and that context parallelizes the computation automatically; the RDD is collection-oriented

Page 75: Scala+spark 2nd

How RDDs are implemented: a 5-tuple

• Partitions: atomic pieces of data, e.g. HDFS blocks; they represent the data
• Preferred locations: where each partition can be accessed faster
• Dependencies: the relationship with parent RDDs; a child is computed from its parents
• Computation: the function which, applied to the parents' data, yields this node's data
• Metadata: metadata such as the node's location and partitioning scheme

Page 76: Scala+spark 2nd

How RDDs are implemented

The lazy computations we have seen so far are all linear and can be written as a chain, e.g. map(5 +), map(7 *), filter(_ % 2 == 0), collect.

But what about other computations?

How do we represent lazy computations?

Page 77: Scala+spark 2nd

How RDDs are implemented: how do we represent lazy computations?

A DAG, processed in topological order:
1. Trace back to the source and start computing from there
2. Group data that does not need to be shuffled into the same stage

Page 78: Scala+spark 2nd

How RDDs are implemented: lineage (Lineage)

Lineage expresses the relationship between computations:

• Narrow dependencies: cheap. E.g. Map, Union. One or more parent RDD partitions map to a single child partition; can be computed locally.

• Wide dependencies: expensive. E.g. GroupBy. One parent RDD partition maps to multiple child partitions; requires shuffling.
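Illustration (assumes an existing pair RDD named pairs of type RDD[(String, Int)]):

val mapped  = pairs.mapValues(_ + 1)  // narrow dependency: computed per partition, no shuffle
val grouped = pairs.groupByKey()      // wide dependency: requires a shuffle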

Page 79: Scala+spark 2nd

How RDDs are executed

SparkContext <-> Cluster Manager <-> Executors
Each Executor holds a Cache and runs Tasks.

Page 80: Scala+spark 2nd

How RDDs are executed

1. An RDD is created directly from an external data source (HDFS, local files, etc.)
2. The RDD goes through a series of TRANSFORMATIONs
3. An ACTION is executed, converting the final RDD and writing it out to an external data source.

Meanwhile: partitioning is optimized, closures are shipped, data is shuffled, and load is balanced, all automatically.

Page 81: Scala+spark 2nd

MLlib

SVM with SGD
LR with SGD or L-BFGS
Naive Bayes
Various decision trees
Random forests
GBTs

LabeledPoint(Double, Vector)

Classification

val data = sc.textFile("....")
val parsedData = data.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
}

val numIterations = 20
val model = SVMWithSGD.train(parsedData, numIterations)

val labelAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count

Page 82: Scala+spark 2nd

MLlib

LabeledPoint(Double, Vector)

Regression

Linear
Ridge
Lasso
Isotonic

val data = sc.textFile("....")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
}

val numIterations = 20
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.reduce(_ + _) / valuesAndPreds.count

Page 83: Scala+spark 2nd

MLlib

Clustering: k-means and its variants, k-means++, Gaussian Mixture, LDA

Vector

Clustering

val data = sc.textFile("....")
val parsedData = data.map(_.split(' ').map(_.toDouble))

val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)

val WSSSE = clusters.computeCost(parsedData)

Page 84: Scala+spark 2nd

MLlib

ALS, supporting both explicit and implicit feedback

Rating(Int, Int, Double)

Collaborative Filtering

val data = sc.textFile("....")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

val numIterations = 20
val model = ALS.train(ratings, 1, 20, 0.01)

val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPreds = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map {
  case ((user, product), (r1, r2)) => math.pow((r1 - r2), 2)
}.reduce(_ + _) / ratesAndPreds.count

Page 85: Scala+spark 2nd

MLlib

FP-Growth

Array[Item]

Frequent Pattern

val data = sc.textFile("....")
val transactions: RDD[Array[String]] = data.map(_.split(","))

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}

Page 87: Scala+spark 2nd

Happy Hacking!

China Mobile

THANKS for your attention!