Scala+spark 2nd

"Scalable Language"

China Mobile

• Introduction
• Functional programming (FP)
• Object orientation (OO)
• Type system
• Monad

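Martin Odersky
-- Designer of Scala
Built the current javac
One of the designers of Generic Java

Compiles to Java bytecode; calls to and from Java are nearly seamless
Statically typed, with a powerful type system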

Who?

Lisp

Erlang

Haskell

Java

All languages descend from Lisp, and Scala's design philosophy is fairly close to Lisp's.

What?

JVM: interoperates with Java

Monad

Parallel computation model

Type system

Scala

http://en.wikipedia.org/wiki/Scala_(programming_language)

Twitter: migrated from Ruby to Scala in 2009.

The Guardian: migrated from Java to Scala in 2011.

Coursera

Spark

Meetup

Linkedin

Gilt

Foursquare

Who uses Scala? Usage is diverse: there are Spark applications, and many websites use Scala for their back end.

19楼: drawn by the Actor model, migrated from Java to Scala.

豌豆荚: home to 邓草原 (Deng Caoyuan), one of China's foremost Scala experts.

Alibaba middleware team

Spark

蘑菇街看处方

乔布堂

唯品会

Who uses Scala? At large companies adoption is mostly driven by Spark, and Scala is often used for middleware that exposes language-agnostic interfaces.

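Advantages

1. Multi-paradigm, with strong expressive power
2. Can call Java packages, so compatibility is good
3. Statically and strongly typed; compiles directly to JVM bytecode, with speed on par with Java

Disadvantages

3. The type system is complex and the learning curve is steep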

Functional programming

Lambda expressions

Expr = Iden
     | Iden => Expr
     | (Expr) (Expr)

x, y, name, id, person

x => name, x => id, x => x

x(y), y(x), (x => name) y, (x => +(x)(1)) 3

x => y => +(x)(y)

Lambda calculus

α-conversion:
x => x == y => y, x => +(x)(z) == y => +(y)(z)

β-reduction:
(x => +(x)(3)) 2 == +(2)(3)
(x => y => +(x)(y)) 2 3 == +(2)(3)

η-conversion:
x => f(x) == f

Church numerals

Zero = f => x => x
One = f => x => f(x)
Two = f => x => f(f(x))
Succ = n => f => x => f(n(f)(x))

type ChurchNumber[A] = (A => A) => A => A
def zero[A]: ChurchNumber[A] = f => a => a
def succ[A](n: ChurchNumber[A]): ChurchNumber[A] = f => a => f(n(f)(a))

val a1: Int = 0
val f1: Int => Int = x => x + 1

val a2: List[Int] = List()
val f2: List[Int] => List[Int] = list => 1 :: list

val a3: String = ""
val f3: String => String = s => "|" + s

// The type argument is spelled out so that A is not inferred as Nothing.
println(zero[Int](f1)(a1))
println(succ(succ(zero[Int]))(f1)(a1))

Number = Zero | Succ Number
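The encoding above can be exercised with a small sketch. The helpers add and toInt are not in the slides; they are assumptions added here only to make the numerals observable.

```scala
object ChurchDemo {
  // A Church numeral applies a function f to a start value n times.
  type ChurchNumber[A] = (A => A) => A => A

  def zero[A]: ChurchNumber[A] = f => a => a
  def succ[A](n: ChurchNumber[A]): ChurchNumber[A] = f => a => f(n(f)(a))

  // Hypothetical helper: m + n applies f n times, then m more times.
  def add[A](m: ChurchNumber[A], n: ChurchNumber[A]): ChurchNumber[A] =
    f => a => m(f)(n(f)(a))

  // Hypothetical helper: count the applications by using (_ + 1) from 0.
  def toInt(n: ChurchNumber[Int]): Int = n(_ + 1)(0)

  val two: ChurchNumber[Int] = succ(succ(zero[Int]))
  val three: ChurchNumber[Int] = succ(two)
}
```

For example, ChurchDemo.toInt(ChurchDemo.add(ChurchDemo.two, ChurchDemo.three)) counts five applications of the successor function.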

import scala.annotation.tailrec

type Segment = (List[Int], List[Int], List[Int])

object Split {
  // Extractor: splits a non-empty list into (smaller, equal, larger) around a pivot.
  def unapply(xs: List[Int]) = {
    val pivot = xs(xs.size / 2)
    @tailrec
    def partition(s: Segment, ys: List[Int]): Segment = {
      val (left, mid, right) = s
      ys match {
        case Nil => s
        case head :: tail if head < pivot => partition((head :: left, mid, right), tail)
        case head :: tail if head == pivot => partition((left, head :: mid, right), tail)
        case head :: tail if head > pivot => partition((left, mid, head :: right), tail)
      }
    }
    Some(partition((Nil, Nil, Nil), xs))
  }
}

def qsort(xs: List[Int]): List[Int] = xs match {
  case Nil => xs
  case Split(left, pivot, right) => qsort(left) ::: pivot ::: qsort(right)
}

Quick Sort

Tail recursion

Extractor

Pattern matching

Guard

Pattern Matching

def sum(list: List[Int]): Int =
  if (list.isEmpty) 0 else list.head + sum(list.tail)

def sum(list: List[Int]): Int = list match {
  case List() => 0
  case head :: tail => head + sum(tail)
}

Tail recursion

def sum(list: List[Int], acc: Int): Int = list match {
  case Nil => acc
  case head :: tail => sum(tail, acc + head)
}

Why functional programming?

var list = (1 to 100).toArray

Imperative:
for (int i = 0; i < 100; i++) { list[i] += 1 }

Functional:
list = list.map(1 +)

Lazy (view):
list.view.map(1 +)

Parallel (par):
list.par.map(1 +)


6 ^ 6

6 * 6 * 6 * 6 * 6 * 6

def ^(x: Int, y: Int): Int = {
  if (y == 0) 1
  else if (y % 2 == 0) ^(x * x, y / 2)
  else x * ^(x, y - 1)
}
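A minimal sketch of the same exponentiation-by-squaring idea under an ordinary name. The name power is an assumption; the sketch assumes y >= 0 and ignores Int overflow.

```scala
object PowerDemo {
  // Exponentiation by squaring: roughly log2(y) recursive steps
  // instead of y - 1 multiplications.
  def power(x: Int, y: Int): Int =
    if (y == 0) 1
    else if (y % 2 == 0) power(x * x, y / 2) // x^y = (x^2)^(y/2) for even y
    else x * power(x, y - 1)                 // x^y = x * x^(y-1) for odd y
}
```

PowerDemo.power(6, 6) gives the same value as 6 * 6 * 6 * 6 * 6 * 6, reached through power(36, 3) and power(1296, 1).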


Currying

5 + 3 : Int
5 +   : Int => Int
+     : (Int, Int) => Int
+     : Int => Int => Int

fold(z: Int)(f: (Int, Int) => Int): Int

val list = List(1, 2, 3, 4)

def fold0 = list.foldLeft(0) _   // : ((Int, Int) => Int) => Int
def fold1 = list.foldLeft(1) _   // : ((Int, Int) => Int) => Int

fold0((x, y) => x + y)
fold1((x, y) => x * y)
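The type signatures above can be tried out with a short sketch. The value names (plus, plusCurried, fivePlus) are assumptions chosen for illustration.

```scala
object CurryDemo {
  // + : (Int, Int) => Int
  val plus: (Int, Int) => Int = _ + _
  // + : Int => Int => Int, obtained with Function2#curried
  val plusCurried: Int => Int => Int = plus.curried
  // 5 + : Int => Int, the curried function with its first argument fixed
  val fivePlus: Int => Int = plusCurried(5)

  val list = List(1, 2, 3, 4)
  // foldLeft with its first argument list fixed: it still waits for the operator.
  val fold0: ((Int, Int) => Int) => Int = op => list.foldLeft(0)(op)
  val fold1: ((Int, Int) => Int) => Int = op => list.foldLeft(1)(op)
}
```

Here CurryDemo.fold0(_ + _) sums the list from 0 and CurryDemo.fold1(_ * _) multiplies it from 1.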

Side effects

class Pair[A](var x: A, var y: A) {
  def modifyX(x: A) = this.x = x
  def modifyY(y: A) = this.y = y
}

var pair = new Pair(1, 2)
var pair1 = new Pair(pair, pair)
var pair2 = new Pair(pair, new Pair(1, 2))

pair.modifyX(3)

Value vs. address

Side effects and associativity

var variable = 0

implicit class FooInt(i: Int) {
  def |+|(j: Int) = {
    variable = (i + j) / 2
    i + j + variable
  }
}

(1 |+| 2) |+| 3 = 10
1 |+| (2 |+| 3) = 12

Side effects and associativity

var variable = 0

implicit class FooInt(i: Int) {
  def |+|(j: Int) = {
    variable += 1
    i + j * variable
  }
}

(1 |+| 2) |+| 3 = 9
1 |+| (2 |+| 3) = 11
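A sketch of the second example with the hidden state made explicit. The Counter class and the function names are assumptions; each grouping starts from a fresh counter, which is what the two results on the slide imply.

```scala
object AssocDemo {
  // Pure operator: the result depends only on the arguments.
  def pure(i: Int, j: Int): Int = i + j

  // Shared mutable state, standing in for the slide's `variable`.
  final class Counter { var calls = 0 }

  // Impure operator: every call bumps the counter, and the result depends on it.
  def impure(c: Counter)(i: Int, j: Int): Int = {
    c.calls += 1
    i + j * c.calls
  }

  // (1 |+| 2) |+| 3, starting from a fresh counter
  def leftFirst: Int = { val c = new Counter; impure(c)(impure(c)(1, 2), 3) }
  // 1 |+| (2 |+| 3), starting from a fresh counter
  def rightFirst: Int = { val c = new Counter; impure(c)(1, impure(c)(2, 3)) }
}
```

The pure operator gives the same answer for both groupings; the impure one does not, so regrouping (and therefore parallel reduction) is no longer safe.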

Higher-Order Functions

Transformation
map(f: T => U): A[U]
filter(f: T => Boolean): A[T]
flatMap(f: T => A[U]): A[U]
groupBy(f: T => K): A[(K, List[T])]
sortBy(f: T => K): A[T]

Action
count: Int
force: A[T]
reduce(f: (T, T) => T): T

Map (higher-order function)

[A] -> (A -> B) -> [B]
(A -> B) -> ([A] -> [B])

List(1, 2, 3, 4).map(_.toString)

Filter (higher-order function)

[A] -> (A -> Boolean) -> [A]
(A -> Boolean) -> ([A] -> [A])

List(1, 2, 3, 4).filter(_ < 3)

Fold (higher-order function)

Starts from an initial element of type B.

[A] -> B -> (B -> A -> B) -> B
B -> (B -> A -> B) -> ([A] -> B)

val list = List("one", "two", "three")
list.foldLeft(0)((sum, str) => {
  if (str.contains("o")) sum + 1 else sum
})

Flatten

[[A]] -> [A]

List(List(1, 2), List(3, 5)).flatten
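The four operations above, collected into one runnable sketch. mapViaFold and filterViaFold are hypothetical helpers, added only to show that fold is general enough to express the other two.

```scala
object HofDemo {
  val strings: List[String] = List(1, 2, 3, 4).map(_.toString) // [A] -> (A -> B) -> [B]
  val small: List[Int] = List(1, 2, 3, 4).filter(_ < 3)        // [A] -> (A -> Boolean) -> [A]
  // Counts the words that contain the letter "o".
  val withO: Int = List("one", "two", "three")
    .foldLeft(0)((sum, str) => if (str.contains("o")) sum + 1 else sum)
  val flat: List[Int] = List(List(1, 2), List(3, 5)).flatten   // [[A]] -> [A]

  // map and filter rebuilt from foldRight, so element order is preserved.
  def mapViaFold[A, B](xs: List[A])(f: A => B): List[B] =
    xs.foldRight(List.empty[B])((a, acc) => f(a) :: acc)
  def filterViaFold[A](xs: List[A])(p: A => Boolean): List[A] =
    xs.foldRight(List.empty[A])((a, acc) => if (p(a)) a :: acc else acc)
}
```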

Quick Sort

object Split {
  def unapply(xs: List[Int]) = {
    val pivot = xs(xs.size / 2)
    Some(xs.partitionBy(pivot))
  }
}

def qsort(xs: List[Int]): List[Int] = xs match {
  case Nil => xs
  case Split(left, pivot, right) => qsort(left) ::: pivot ::: qsort(right)
}

Quick Sort

Implicit conversion

type Segment = (List[Int], List[Int], List[Int])

implicit class ListWithPartition(list: List[Int]) extends AnyVal {
  def partitionBy(p: Int): Segment = {
    val idenElem = (List[Int](), List[Int](), List[Int]())
    def partition(result: Segment, x: Int): Segment = {
      val (left, mid, right) = result
      if (x < p) (x :: left, mid, right)
      else if (x == p) (left, x :: mid, right)
      else (left, mid, x :: right)
    }
    list.foldLeft(idenElem)(partition)
  }
}

Map with Par (higher-order function)

[A] -> (A -> B) -> [B]

Lazy evaluation

val foo = List(1, 2, 3, 4, 5)
val baz = foo.map(5 +).map(3 +).filter(_ > 10).map(4 *)
baz.take(2)

Yet along the way we have built three intermediate results:
foo.map(5 +)
foo.map(5 +).map(3 +)
foo.map(5 +).map(3 +).filter(_ > 10)

In an imperative language:
for (int i = 0; i < 5; ++i) {
  int x = foo[i] + 5 + 3;
  if (x > 10) bar.add(x * 4);
  else continue;
}

What we want at declaration time is a wish (a computation), not a result.
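A sketch that makes the difference visible by recording the order in which the first two map steps touch each element. The names run, add5, add3, and trace are assumptions for illustration.

```scala
import scala.collection.mutable.ListBuffer

object LazyDemo {
  val foo = List(1, 2, 3, 4, 5)

  // Runs the pipeline above, strictly or through a view, and returns the
  // result together with a trace of the calls made by the first two maps.
  def run(useView: Boolean): (List[Int], List[String]) = {
    val trace = ListBuffer.empty[String]
    val add5 = (x: Int) => { trace += s"add5($x)"; x + 5 }
    val add3 = (x: Int) => { trace += s"add3($x)"; x + 3 }
    val result =
      if (useView) foo.view.map(add5).map(add3).filter(_ > 10).map(_ * 4).take(2).toList
      else foo.map(add5).map(add3).filter(_ > 10).map(_ * 4).take(2)
    (result, trace.toList)
  }
}
```

Both variants return the same two values. The strict pipeline finishes add5 over the whole list before add3 starts, which is the intermediate list; with the view each element passes through the whole chain before the next one is touched.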

Map with View (higher-order function)

[A] -> (A -> B) -> [B]

val fibs: Stream[Int] = 0 #:: 1 #:: fibs.zip(fibs.tail).map(n => n._1 + n._2)

Streams and lazy evaluation

zip = ([A], [B]) => [(A, B)]

Quora: http://www.quora.com/What-are-common-idioms-for-caching-and-memoization-in-pure-functional-languages-such-as-Haskell

Lazy evaluation

lazy val x = 3 + 3

def number = { println("OK"); 3 + 3 }

class LazyValue(expr: => Int) {
  var evaluated: Boolean = false
  var value: Int = -1

  def get: Int = {
    if (!evaluated) {
      value = expr
      evaluated = true
    }
    value
  }
}

val lazyValue = new LazyValue(number)
println(lazyValue.get)
println(lazyValue.get)

Thinking in Java: Map can be implemented with the decorator pattern.

Call By Name
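A sketch contrasting a by-name parameter, which is re-evaluated at every use, with a memoized value, which is evaluated at most once. Memo and twice are hypothetical names; Memo plays the role of a hand-written LazyValue but delegates the bookkeeping to lazy val.

```scala
object ByNameDemo {
  // Counts how often the expression below is actually evaluated.
  var evaluations = 0
  def number: Int = { evaluations += 1; 3 + 3 }

  // The by-name argument is captured unevaluated; lazy val runs it on first access only.
  final class Memo[A](expr: => A) {
    lazy val get: A = expr
  }

  // A plain by-name parameter is evaluated each time it is referenced.
  def twice(expr: => Int): Int = expr + expr
}
```

Constructing a Memo does not evaluate anything, reading get repeatedly evaluates once, and twice(number) evaluates number two times.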

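Object orientation

Scala is an object-oriented language, and a purer one than Java at that. Everything is an object, including 1, 2, and 1.1.

The 1 + 2 we write is really 1.+(2), although the compiler substitutes primitive types for it. Likewise, the function (x: Int) => x.toString is a Function1[Int, String].

So you can write map(5 +) but not map(+ 5).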

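Some syntactic sugar

class Sugar(i: Int) {
  def unary_- = -i
  def apply(expr: => Unit) = for (j <- 1 to i) expr
  def +(that: Int) = i + that
  def +:(that: Int) = i + that
}

val sugar = new Sugar(2)

-sugar                   // prefix
sugar(println("aha"))    // method name omitted (apply)
sugar + 5                // infix
5 +: sugar               // infix, right-associative

Infix operator precedence by first character, from lowest to highest:
(all letters)
|
^
&
= !
< >
:
+ -
* / %
(all other characters)

Note that an operator ending in : is right-associative. Right associativity exists to support DSLs and to carry over functional-programming habits. Use it with care.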

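Mix-in is a means of multiple inheritance. Like an Interface, it tames the complexity of multiple inheritance by restricting every parent after the first, but unlike an Interface it carries default implementations.
1. Ordinary inheritance remains single inheritance
2. The second and any further parents must be Traits
3. A Trait cannot be instantiated on its own
In Scala, a Trait can be mixed in when a class is defined or when an instance is created.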

Trait & Mix-in

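Inheritance

Suppose we want to describe a bird that can sing and run; being a bird, it can of course fly.

abstract class Bird(kind: String) {
  val name: String
  def singMyName = println(s"$name is singing")
  val capability: Int
  def run = println(s"I can run $capability meters!!!")
  def fly = println(s"flying of kind: $kind")
}

But clearly a person can also run and sing... and can code as well.

(I have nothing against birds, but if you ever meet one that can code, please let me know.)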

trait Runnable {
  val capability: Int
  def run = println(s"I can run $capability meters!!!")
}

trait Singer {
  val name: String
  def singMyName = println(s"$name is singing")
}

abstract class Bird(kind: String) {
  def fly = println(s"flying of kind: $kind")
}

class Nightingale extends Bird("Nightingale") with Singer with Runnable {
  val capability = 20
  val name = "poly"
}

val myTinyBird = new Nightingale
myTinyBird.fly
myTinyBird.singMyName
myTinyBird.run

class Coder(language: String) {
  val capability = 10
  val name = "Handemelindo"
  def code = println(s"coding in $language")
}

val me = new Coder("Scala") with Runnable with Singer
me.code
me.singMyName
me.run

A companion

object Sugar {
  def apply(i: Int) = new Sugar(i)
}

Companion object: the factory pattern can be implemented here.

Some companions

sealed trait Tree
case class Leaf(info: String) extends Tree
case class Node(left: Tree, right: Tree) extends Tree

def traverse(tree: Tree): Unit = {
  tree match {
    case Leaf(info) => println(info)
    case Node(left, right) =>
      traverse(left)
      traverse(right)
  }
}

val tree: Tree = Node(Node(Leaf("1"), Leaf("2")), Leaf("3"))
traverse(tree)

Case Class and ADT

Inheritance serves as the sum type; case class serves as the product type.
Tree = Leaf String | Node Tree Tree

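Type system

If you are a C programmer, a type is: an indicator that tells the computer how many bytes it needs to store these numbers.

If you are a Java programmer, a type is: a marker for where instances are stored, so that the compiler can check that your program is consistent.

If you are an R programmer, a type is: a label for which statistical computation should be applied to these variables.

If you are a Ruby programmer, a type is: something you should avoid.

For a Scala programmer, a type is: what UML is to Java, a guarantee of correctness and the blueprint of the program.

Guess what this is, e.g. [(K1, V1)] -> [(K2, [V2])] -> [(K2, V3)]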

MapReduce

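[Figure: three levels, Value, Type, and Kind]
Values: 1, (1, 2), [1, 2, 3]
Proper types, of kind *: Any, Int, Pair[Int, Int], List[Int]
Type constructors: List (* => *), Pair (* => * => *)
Relations shown: kinding and subtyping

Source: Generics of a Higher Kind - Martin Odersky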

http://adriaanm.github.io/files/higher.pdf

Let's do some abstraction exercises

type Int :: *
type String :: *
type (Int => String) :: *
type List[Int] :: *

type List :: ?
type Function1 :: ??

type List :: * => *
type Function1 :: * => * => *    Function1[-T, +R]

def id(x: Int) = x
type Id[A] = A

def id(f: Int => Int, x: Int) = f(x)
type id[A[_], B] = A[B]

type Pair[K[_], V[_]] = (K[A], V[A]) forSome { type A }
(* -> *) -> (* -> *) -> *

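Suppose our program has to return a result of this shape:
(Set(x,x,x,x,x), List(x,x,x,x,x,x,x,x,x,x))

val pair: Pair[Set, List] = (Set("42"), List(52))   // meant to be rejected: the element types differ

val pair: Pair[Set, List] = (Set(42), List(52))     // accepted: both sides hold Int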

Type Lambda

Recall that type Function1 :: * => * => *

As another example, take this function:
def foo[A[_]](bar: A[Int]): A[Int] = bar

We can feed it a (* => *), for example:
val foo1 = foo[List](List(1, 2, 3, 5, 8, 13))

But what if we have:
def baz(x: Int) = println(x)

What now? We fix the result type of Function1 to Unit, which turns * => * => * into * => *:
val foo2 = foo[ ({type F[X] = Function1[X, Unit]})#F ](baz)
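A sketch of the same partial application using a named type alias rather than the anonymous type lambda. The alias ToUnit and the log buffer are assumptions added for illustration; the alias plays the role of F in the type-lambda form.

```scala
import scala.collection.mutable.ListBuffer

object KindDemo {
  // foo accepts any type constructor A[_] of kind * => *.
  def foo[A[_]](bar: A[Int]): A[Int] = bar

  // Function1 has kind * => * => *; fixing its result type to Unit leaves * => *.
  type ToUnit[X] = X => Unit

  // A side-effecting Int => Unit, recording its arguments so the effect is observable.
  val log = ListBuffer.empty[Int]
  val baz: Int => Unit = x => { log += x; () }

  val foo1: List[Int] = foo[List](List(1, 2, 3, 5, 8, 13))
  val foo2: Int => Unit = foo[ToUnit](baz)
}
```

Naming the alias costs one extra line but keeps the call site readable.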

Abstracting over morphisms

trait Monoid[A] {
  val zero: A
  def append(x: A, y: A): A
}

object IntNum extends Monoid[Int] {
  val zero = 0
  def append(x: Int, y: Int) = x + y
}

object DoubleNum extends Monoid[Double] {
  val zero = 0d
  def append(x: Double, y: Double) = x + y
}

def sum[A](nums: List[A])(tc: Monoid[A]) = nums.foldLeft(tc.zero)(tc.append)

sum(List(1, 2, 3, 5, 8, 13))(IntNum)
sum(List(3.14, 1.68, 2.72))(DoubleNum)

trait Monoid[A] {
  val zero: A
  def append(x: A, y: A): A
}

implicit object IntNum extends Monoid[Int] {
  val zero = 0
  def append(x: Int, y: Int) = x + y
}

implicit object DoubleNum extends Monoid[Double] {
  val zero = 0d
  def append(x: Double, y: Double) = x + y
}

def sum[A](nums: List[A])(implicit tc: Monoid[A]) = nums.foldLeft(tc.zero)(tc.append)

sum(List(1, 2, 3, 5, 8, 13))
sum(List(3.14, 1.68, 2.72))

Type Class

1. Separation of abstractions
2. Composable
3. Overridable
4. Type-safe

val list = List(1, 3, 234, 56, 5346, 34)
list.sorted

sorted[B >: A](implicit ord: math.Ordering[B])

Type Class

What type classes are for

List(1, 2, 3, 5) -> "1,2,3,5"

(1, 2) -> "1,2"

(List(1,2,3,5), List(8,13,21)) -> "1,2,3,5,8,13,21"

(List(1,2,3,5), (42.0, List("a", "b"))) -> "1,2,3,5,42.0,a,b"

What type classes are for

trait Writable[A] {
  def write(a: A): String
}

implicit def numericWritable[A: Numeric]: Writable[A] = new Writable[A] {
  def write(a: A): String = a.toString
}

implicit val stringWritable: Writable[String] = new Writable[String] {
  def write(a: String): String = a
}

implicit def listWritable[A: Writable]: Writable[List[A]] = new Writable[List[A]] {
  def write(a: List[A]): String = {
    val writableA = implicitly[Writable[A]]
    a.map(writableA.write).mkString(",")
  }
}

implicit def PairWritable[A: Writable, B: Writable]: Writable[(A, B)] = new Writable[(A, B)] {
  def write(p: (A, B)): String = {
    val writableA = implicitly[Writable[A]]
    val writableB = implicitly[Writable[B]]
    writableA.write(p._1) + "," + writableB.write(p._2)
  }
}
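A self-contained sketch of how the Writable instances compose. Placing the instances in a companion object and adding the write entry point are assumptions made here so that implicit lookup works from any call site.

```scala
object WritableDemo {
  trait Writable[A] { def write(a: A): String }

  object Writable {
    // Any numeric type is written via toString.
    implicit def numericWritable[A: Numeric]: Writable[A] = new Writable[A] {
      def write(a: A): String = a.toString
    }
    implicit val stringWritable: Writable[String] = new Writable[String] {
      def write(a: String): String = a
    }
    // A list is writable whenever its elements are.
    implicit def listWritable[A: Writable]: Writable[List[A]] = new Writable[List[A]] {
      def write(as: List[A]): String =
        as.map(implicitly[Writable[A]].write).mkString(",")
    }
    // A pair is writable whenever both components are.
    implicit def pairWritable[A: Writable, B: Writable]: Writable[(A, B)] = new Writable[(A, B)] {
      def write(p: (A, B)): String =
        implicitly[Writable[A]].write(p._1) + "," + implicitly[Writable[B]].write(p._2)
    }
  }

  // Hypothetical entry point: the instance is chosen from the static type of `a`.
  def write[A](a: A)(implicit w: Writable[A]): String = w.write(a)
}
```

The compiler assembles the instance for a nested type such as (List[Int], (Double, List[String])) from the four building blocks, so no instance for that exact type has to be written by hand.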

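Hermann Weyl, The Mathematical Way of Thinking: We now come to the decisive step of mathematical abstraction: we forget what the symbols stand for. We need not stop there; many operations can be applied to these symbols without ever considering what they actually represent.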

Monad

A monoid in the category of endofunctors

Philip Wadler

What is a Group?

(1) Closure: for any a, b ∈ G, a*b ∈ G
(2) Associativity: for any a, b, c ∈ G, (a*b)*c = a*(b*c)
(3) Identity: there is an identity e such that for any a ∈ G, e*a = a*e = a
(4) Inverse: for any a ∈ G there is an inverse a^-1 such that a^-1*a = a*a^-1 = e

What is a SemiGroup? It satisfies only (1) and (2).

What is a Monoid? It satisfies (1), (2), and (3).

Monoid

Enough talk, show me the code

trait SemiGroup[T] {
  def append(a: T, b: T): T
}

trait Monoid[T] extends SemiGroup[T] {
  def zero: T
}

class listMonoid[T] extends Monoid[List[T]] {
  def zero = Nil
  def append(a: List[T], b: List[T]) = a ++ b
}
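A sketch that spot-checks the monoid laws on sample values and uses a monoid to fold a list. The helper names (concat, associative, hasIdentity, intAddition) are assumptions; checking a few samples illustrates the laws but does not prove them.

```scala
object MonoidDemo {
  trait SemiGroup[T] { def append(a: T, b: T): T }
  trait Monoid[T] extends SemiGroup[T] { def zero: T }

  def listMonoid[T]: Monoid[List[T]] = new Monoid[List[T]] {
    def zero: List[T] = Nil
    def append(a: List[T], b: List[T]): List[T] = a ++ b
  }

  val intAddition: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def append(a: Int, b: Int): Int = a + b
  }

  // Folding with a monoid: zero is the start value, append is the operator.
  def concat[T](xs: List[T])(m: Monoid[T]): T = xs.foldLeft(m.zero)(m.append)

  // Law (2): associativity, checked on one sample triple.
  def associative[T](m: SemiGroup[T])(a: T, b: T, c: T): Boolean =
    m.append(m.append(a, b), c) == m.append(a, m.append(b, c))

  // Law (3): zero is a left and right identity, checked on one sample value.
  def hasIdentity[T](m: Monoid[T])(a: T): Boolean =
    m.append(m.zero, a) == a && m.append(a, m.zero) == a
}
```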

What is a Functor?

Int    -> List[Int]
String -> List[String]

trait Functor[F[_]] {
  def map[A, B](f: (A) => B)(a: F[A]): F[B]
}

For List: map[B](f: (A) => B): List[B]
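A sketch of two instances of the Functor trait, for List and Option. The lift helper is an assumption; it shows the functor turning a function A => B into a function F[A] => F[B].

```scala
object FunctorDemo {
  trait Functor[F[_]] {
    def map[A, B](f: A => B)(a: F[A]): F[B]
  }

  val listFunctor: Functor[List] = new Functor[List] {
    def map[A, B](f: A => B)(a: List[A]): List[B] = a.map(f)
  }

  val optionFunctor: Functor[Option] = new Functor[Option] {
    def map[A, B](f: A => B)(a: Option[A]): Option[B] = a.map(f)
  }

  // Hypothetical helper: (A => B) => (F[A] => F[B])
  def lift[F[_], A, B](functor: Functor[F])(f: A => B): F[A] => F[B] =
    fa => functor.map(f)(fa)
}
```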

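Monad: a monoid on endofunctors

Recall the identity element of a monoid. Recall the fold function.
What is the identity element on endofunctors? What is the associative operation on endofunctors?

unit x >>= f ≡ f x
m >>= unit ≡ m
(m >>= f) >>= g ≡ m >>= (λx . f x >>= g)

Identity: lifts an element into the computational context.
Associativity: combines simple computations into complex ones.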
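The three laws above, spot-checked for Option with flatMap in the role of >>=. The functions f and g and the checker names are assumptions chosen for illustration; sample checks illustrate the laws rather than prove them.

```scala
object MonadLawsDemo {
  // unit lifts a plain value into the Option context.
  def unit[A](a: A): Option[A] = Some(a)

  // Two sample computations that may fail.
  val f: Int => Option[Int] = x => if (x > 0) Some(x * 2) else None
  val g: Int => Option[Int] = x => Some(x + 1)

  // unit x >>= f ≡ f x
  def leftIdentity(x: Int): Boolean = unit(x).flatMap(f) == f(x)
  // m >>= unit ≡ m
  def rightIdentity(m: Option[Int]): Boolean = m.flatMap(x => unit(x)) == m
  // (m >>= f) >>= g ≡ m >>= (λx . f x >>= g)
  def associativity(m: Option[Int]): Boolean =
    m.flatMap(f).flatMap(g) == m.flatMap(x => f(x).flatMap(g))
}
```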

Some common Monads: Option

Option, also called Maybe, represents a computation that may fail. It is expressed as Some(value) or None.
Some(x) flatMap (f: A => Option[B]) = f(x)
None flatMap (f: A => Option[B]) = None
unit = Some

val maybe: Option[Int] = Some(4)
val none: Option[Int] = None

def calculate(maybe: Option[Int]): Option[Int] = for {
  value <- maybe
} yield value + 5

calculate(maybe)
calculate(none)

Some common Monads: List

A collection is itself a proper type; it represents nondeterminism.
unit = List

val list1 = List(2, 4, 6, 8)
val list2 = List(1, 3, 5, 7)

for {
  value1 <- list1.map(1 +)
  value2 <- list2
} yield value1 + value2

Some common Monads: Future

A Future wraps a computation; it represents a result that will be available later.
unit = Future

val future1 = Future(SomeProcess)
val future2 = Future(AnotherProcess)

for {
  value1 <- future1.map(SomeTransformation)
  value2 <- future2
} yield value1 + value2

Usage

val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r
val nameList = Map(
  "haskell" -> "1900-12-12",
  "godel" -> "1906-04-28",
  "church" -> "1903-06-14",
  "turing" -> "1912/06/23"
)

With a for comprehension:

for {
  (name, date(year, _, day)) <- nameList
  if name.length > 3
  char <- name
} yield char -> s"$name-$day@$year"

Desugared:

nameList.flatMap {
  case (name, date(year, _, day)) =>
    if (name.length > 3) {
      name.map { char => char -> s"$name-$day@$year" }
    } else Map()
  case _ => Map()
}

Imperative:

var map = Map[Char, String]()
var i = 0
val list = nameList.toArray
while (i < list.size) {
  val name = list(i)._1
  val theDate = list(i)._2
  if (theDate.matches("\\d\\d\\d\\d-\\d\\d-\\d\\d")) {
    val parts = theDate.split("-")
    val year = parts(0)
    val day = parts(2)
    val charArray = name.toCharArray
    var j = 0
    while (j < charArray.length) {
      val char = charArray(j)
      map += char -> (name + "-" + day + "@" + year)
      j += 1
    }
  }
  i += 1
}

• Introduction
• MR from an FP point of view
• RDD from an FP point of view
• RDD
• MLlib

Spark

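A general-purpose parallel computing framework

Spark vs. Map Reduce
Ecosystem. Spark: the platform itself is largely mature, but related components such as MLlib and Spark SQL are still evolving. Map Reduce: very mature, with many applications.
Computation model. Spark: Monadic-like (not a Monad), Functor. Map Reduce: Map Reduce.
Storage. Spark: mainly memory. Map Reduce: mainly disk.
Programming style. Spark: collection-oriented. Map Reduce: interface-oriented.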

Spark

Map Reduce Monadic

[Figure: the Spark stack]
Libraries: Spark SQL, MLlib, GraphX, Spark Streaming
Engine: Spark
Deployment: local mode, standalone mode, YARN, Mesos
Storage: HDFS, Amazon S3, Hypertable, HBase, etc.

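Advantages

1. Collection-oriented, which makes development easy
2. Supports more kinds of computation than MR
3. In-memory computation is faster, and data can be persisted for iteration; when the data is not "big", it can still be "fast"

Disadvantages

1. Consumes memory quickly; consider serialization libraries such as Kryo
2. With lazy evaluation, computation time is hard to estimate and optimization is difficult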

Word Count as Map Reduce

[(K1, V1)] -> [(K2, [V2])] -> [(K2, V3)]
[Line] -> [(Word, 1)] -> [(Word, [1])] -> [(Word, n)]

[Line] -> [(Word, 1)]: flatMap(_.split("\\s+")).map((_, 1))
[(Word, 1)] -> [(Word, [1])]: groupBy(_._1)
[(Word, [1])] -> [(Word, n)]: reduceBy(_._1)(_._2 + _._2)
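The three steps can be tried on ordinary Scala collections, without a cluster. This is only a local sketch of the data flow; the function name wordCount and the empty-token filter are assumptions.

```scala
object WordCountDemo {
  def wordCount(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // [Line] -> [Word]
      .filter(_.nonEmpty)         // drop empty tokens produced by blank lines
      .map(word => (word, 1))     // [Word] -> [(Word, 1)]
      .groupBy(_._1)              // [(Word, 1)] -> Word -> [(Word, 1)]
      .map { case (word, pairs) =>
        (word, pairs.map(_._2).reduce(_ + _)) // [(Word, [1])] -> [(Word, n)]
      }
}
```

The map step emits a pair per word, the grouping step plays the part of the shuffle, and the final step reduces each group to a count.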

RDD

Transformation
map(f: T => U)
filter(f: T => Boolean)
flatMap(f: T => Seq[U])
sample(fraction: Float)
groupByKey()
reduceByKey(f: (V, V) => V)
mapValues(f: V => W)
union()
join()
cogroup()
crossProduct()
sort(c: Comparator[K])
partitionBy(p: Partitioner[K])

Action
count()
collect()
reduce(f: (T, T) => T)
lookup(k: K)
save(path: String)
take(n: Int)

[(K1, V1)] -> [(K2, [V2])] -> [(K2, V3)]

Word Count

val lines = spark.textFile("hdfs://...")

val words = lines.flatMap(_.split("\\s+"))
val wordCounts = words.map((_, 1))
val result = wordCounts.reduceByKey(_ + _)

result.save("hdfs://…")

RDD

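What is an RDD?

Characteristics of an RDD
• An immutable, partitioned collection
• Can only be created by reading a file or through a Transformation
• Fault-tolerant: lost data is recomputed through lineage
• Controllable storage level: new StorageLevel(useDisk, useMemory, deserialized, replication)
• Cacheable: the cache() method
• Coarse-grained model
• Statically typed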

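What is an RDD?

A lazy, parallel collection
• Lazy: each Transformation only records which operation will be performed on the data.
• Advantages of laziness: a single pass over the data, ample information for planning, and automatic batching.
• Parallel: we place the data in a computational context, and the context parallelizes the computation automatically. RDDs are collection-oriented.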

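Implementation of an RDD: a five-tuple

• Partitions: atomic pieces of the data, e.g. HDFS blocks; they represent the data
• Preferred Location: lists where a partition can be accessed faster
• Dependencies: the dependencies on parent nodes; a child node is computed from its parents
• Computation: represents the computation; applying it to the parents' data yields the child's data
• Metadata: stores metadata such as the node's location and partitioning scheme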
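A toy, single-machine sketch of three of the five components (partitions, dependencies, computation). All names here (MiniRdd, Source, Mapped, Filtered) are hypothetical and are not Spark's API; preferred locations and metadata are left out.

```scala
object MiniRddSketch {
  sealed trait MiniRdd[A] {
    def numPartitions: Int                   // Partitions
    def dependencies: List[MiniRdd[_]]       // Dependencies (the lineage)
    def compute(partition: Int): Iterator[A] // Computation for one partition

    // Transformations only build new nodes; nothing is computed yet.
    def map[B](f: A => B): MiniRdd[B] = Mapped(this, f)
    def filter(p: A => Boolean): MiniRdd[A] = Filtered(this, p)

    // Action: walks every partition and forces the computation.
    def collect(): List[A] =
      (0 until numPartitions).toList.flatMap(i => compute(i).toList)
  }

  final case class Source[A](data: Vector[Vector[A]]) extends MiniRdd[A] {
    def numPartitions: Int = data.size
    def dependencies: List[MiniRdd[_]] = Nil
    def compute(partition: Int): Iterator[A] = data(partition).iterator
  }

  final case class Mapped[A, B](parent: MiniRdd[A], f: A => B) extends MiniRdd[B] {
    def numPartitions: Int = parent.numPartitions
    def dependencies: List[MiniRdd[_]] = List(parent)
    def compute(partition: Int): Iterator[B] = parent.compute(partition).map(f)
  }

  final case class Filtered[A](parent: MiniRdd[A], p: A => Boolean) extends MiniRdd[A] {
    def numPartitions: Int = parent.numPartitions
    def dependencies: List[MiniRdd[_]] = List(parent)
    def compute(partition: Int): Iterator[A] = parent.compute(partition).filter(p)
  }
}
```

Because a child only stores its parent and a function, a lost partition can be rebuilt by calling compute again, which is the idea behind recomputation through lineage.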

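Implementation of an RDD: how to represent lazy computation

The lazy computations we have seen so far are all linear and can be represented as:
Map(+5) -> Map(*7) -> Filter(_ % 2 == 0) -> Collect

But what about other computations?

A DAG, processed by topological sort:
1. Trace back to the sources and start computing from there
2. Group the steps whose data needs no shuffling into the same processing stage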

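Implementation of an RDD: Lineage

Lineage represents the relationships between computations:
• Narrow Dependencies: low cost. Examples: Map, Union. One or more parent RDD partitions correspond to a single child RDD partition, so the work can stay local.
• Wide Dependencies: high cost. Example: GroupBy. One parent RDD partition corresponds to multiple child RDD partitions, so Shuffling is required.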

Execution of an RDD

[Figure: a SparkContext connects to a Cluster Manager; each Executor holds a Cache and runs several Tasks]

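Execution of an RDD

1. An RDD is created directly from an external data source (HDFS, local files, etc.)
2. The RDD goes through a series of TRANSFORMATIONs
3. An ACTION is executed, which transforms the last RDD and writes the output to an external data source.

Along the way, Spark automatically optimizes partitioning, ships closures, shuffles data, and balances load.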

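MLlib: Classification

SVM with SGD
LR with SGD or LBFGS
NB
Decision trees of various kinds
Random forests
GBT

Input: LabeledPoint(Double, Vector)

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("….")
// Each line: the label followed by the features, separated by spaces.
val parsedData = data.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}

val numIterations = 20
val model = SVMWithSGD.train(parsedData, numIterations)

// Compare each label with the model's prediction on the training set.
val labelAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count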

MLlib: Regression

Linear
Ridge
Lasso
Isotonic

Input: LabeledPoint(Double <- Vector)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val data = sc.textFile("….")
// Each line: the label, a comma, then the features separated by spaces.
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

val numIterations = 20
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
// Mean squared error on the training set.
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.reduce(_ + _) / valuesAndPreds.count

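MLlib: Clustering

k-means and its variant k-means++
Gaussian Mixture
LDA

Input: Vector

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("….")
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Within-set sum of squared errors.
val WSSSE = clusters.computeCost(parsedData)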

MLlib: Collaborative Filtering

ALS, supporting both explicit and implicit feedback

Input: Rating(Int, Int, Double)

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val data = sc.textFile("….")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

val numIterations = 20
// Arguments: ratings, rank, number of iterations, lambda.
val model = ALS.train(ratings, 1, numIterations, 0.01)

val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPreds = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map {
  case ((user, product), (r1, r2)) => math.pow(r1 - r2, 2)
}.reduce(_ + _) / ratesAndPreds.count

MLlib: Frequent Pattern

FP-Growth

Input: Array[Item]

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val data = sc.textFile("….")
val transactions: RDD[Array[String]] = data.map(_.split(","))

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}

Related resources

Paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Official documentation: https://spark.apache.org/docs/latest/mllib-guide.html
Official API: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
Berkeley's Spark course on edX: https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
Berkeley's MLlib course on edX: https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x

Happy Hacking!

China Mobile

Thanks for your attention!
