Spark...

  • date post

    16-Apr-2017
  • Category

    Engineering

  • Spark MLlib

    2016/12/21

  • Agenda

    SparkEMR

    MLlib

    Spark

    MLlib

  • Spark

    Spark Apache SparkSpark Framework

    YARN Yet Another Resource Negotiatornpm

  • @uryyyyyyy

    Spark / / React / etc...

    React Native, Kubernetes

  • D

    D http://qiita.com/cedretaber/items/90b7b34ee710eb5cc965


  • Opt Technologies

    Geek Night

    https://ichigayageek.connpass.com/

    http://tech-magazine.opt.ne.jp/


  • SparkEMR

  • AWS EMR

    HadoopSaaS

    Hadoop / Spark / Ganglia / Hive

    http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-release-components.html

    http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-hadoop-daemons.html

    AWS CLI etc...

    Google Cloud Dataproc


  • EMRSparkSpark2.0EMR

    http://qiita.com/uryyyyyyy/items/15f2e8f153aa86375227


  • EMRS3S3

    HDFSHadoop

    Hadoopaws-hadoop

    EMRHadoop

    Local Spark: spark-submit / shell

    `--packages org.apache.hadoop:hadoop-aws:2.7.2`

    EMR 4HadoopS3

    `DirectOutputCommitter`

    http://qiita.com/uryyyyyyy/items/e9ec40a8c748d82d4bc4#spark%E3%81%AE%E5%AE%9F%E8%A1%8C

  • EMREMRS3

    http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-manage-view-web-log-files.html

    EMR, S3/Local: http://qiita.com/uryyyyyyy/items/8bf386da45bcd8fde387

    S35

    S3ssh


  • EMRSpot InstanceEMREC2Spot Instance

    Master Node

    m1.medium

  • EMRBlack BoxSpark 1assemblyscala 2.1022.11

    Scala2.11

    Hadoop, Log4J, SLF4J

    EMR, Log4J

    EMR

  • Spark MLlib

  • 500k Product, 10M User

    100ms

    Feed

    1/

    unis http://lp.unis.tokyo/

    Criteo

    http://www.slideshare.net/RomainLerallut/recsys-2015-largescale-realtime-product-recommendation-at-criteo


  • -> 10M * 500k

    -> 10M * 100(rank)

    -> 100(rank) * 500k

    from http://www.slideshare.net/jeykottalam/mllib

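
The shapes on this slide suggest why factorization helps: instead of one 10M x 500k matrix, the model keeps a 10M x 100 user-factor matrix and a 100 x 500k product-factor matrix. A rough sketch of the entry counts, assuming dense storage (an assumption made here for illustration, not something the slide states):

```python
# Sizes quoted on the slide; dense storage is an illustrative assumption.
users = 10_000_000
products = 500_000
rank = 100

full_entries = users * products                  # 10M x 500k matrix
factor_entries = users * rank + rank * products  # user factors + product factors
reduction = full_entries // factor_entries       # integer ratio between the two

print(f"full matrix:     {full_entries:,} entries")
print(f"factor matrices: {factor_entries:,} entries")
print(f"reduction:       about {reduction:,}x")
```

Under these assumptions the two factor matrices together hold roughly 1.05 billion entries, against 5 trillion for the full matrix.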

  • LB(~100ms)

  • 1. S3, RedShift

    2. Spark, Cache

    3. Cache

  • 1. OoME

    2. RedShift

    3.

    4.

    5. 100ms

    6. Spark Job

    7.

    8. staging, Spark

  • 1. OoME

    OoME

  • 1. OoME

    recommendProductsForUsers

    https://github.com/apache/spark/blob/branch-1.6/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283

    Product * User Array

    500k * 10M, GC

    recommendProducts

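
The slide contrasts `recommendProductsForUsers`, which works over the whole Product * User combination, with the per-user `recommendProducts`. As a minimal sketch of the per-user idea only, the snippet below scores every product for one user by dot product and keeps the top n. It is a pure-Python stand-in, not MLlib's implementation; the function name `recommend_products` and the toy factor values are hypothetical.

```python
import heapq

def recommend_products(user_vec, product_factors, n):
    """Return the n (product_id, score) pairs with the highest dot product.

    Hypothetical helper for illustration; not the MLlib API.
    """
    scored = (
        (pid, sum(u * p for u, p in zip(user_vec, vec)))
        for pid, vec in product_factors.items()
    )
    # nlargest keeps only n candidates at a time instead of sorting everything.
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Toy rank-2 factors (made-up values).
product_factors = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
top = recommend_products([2.0, 1.0], product_factors, n=2)
print(top)
```

Scoring one user at a time keeps memory proportional to the number of products plus n, rather than to users times products.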

  • 2. RedShift

    10

    redshift

    spark-redshift

    S3SparkS3

    RedShift

  • LB(~100ms)

  • Typo

    URL/heathCheck

    orz...

  • 3

    10M

  • 3

    10M(user) * 100(product) * 1kB(record) = 1000GB
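
The estimate on this slide can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 kB = 1,000 bytes, 1 GB = 10^9 bytes), which is what makes the slide's figure come out exactly:

```python
# Figures from the slide; decimal units are an assumption.
users = 10_000_000
products_per_user = 100
record_bytes = 1_000  # 1 kB per record

total_bytes = users * products_per_user * record_bytes
total_gb = total_bytes / 1_000_000_000
print(f"{total_gb:.0f} GB")
```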

  • LB(~100ms)

  • 4.

    Spark

    Finagle, Spark local mode...

  • 4.

    RDD

    RDD

    10

    Spark

    serialize, kryo

    chill-bijection

  • Bijection...

    BiMap, serialize

    MLlib API: product/user ID, Int

    String, BiMap
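
The RDD-based MLlib recommendation API takes Int user and product IDs, so String IDs have to be mapped to integers and back, which is where the slide's BiMap comes in. The class below is a minimal sketch of such a two-way mapping; the `BiMap` class, its `encode`/`decode` methods, and the sample IDs are hypothetical and are not the API of the Bijection library.

```python
class BiMap:
    """Two-way mapping between string IDs and dense integer IDs (illustrative)."""

    def __init__(self):
        self._to_int = {}   # string ID -> integer ID
        self._to_str = []   # integer ID -> string ID (index is the integer ID)

    def encode(self, key):
        """Return the integer ID for key, assigning the next free one if new."""
        if key not in self._to_int:
            self._to_int[key] = len(self._to_str)
            self._to_str.append(key)
        return self._to_int[key]

    def decode(self, index):
        """Return the original string ID for an integer ID."""
        return self._to_str[index]

ids = BiMap()
a = ids.encode("user-abc")
b = ids.encode("user-xyz")
print(a, b, ids.decode(b))
```

Assigning IDs densely from 0 keeps them well inside the Int range as long as the number of distinct keys stays below 2^31.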

  • 5. 100ms

    PC, python (numpy)

    800k * 100 rank, 110ms

    Scala

    userId, ab, gatling

    100ms, 10ms

  • 5. 100ms

    netlib-java

    boxing, Java

    GC

    Scala, 10QPS

  • 6. Spark Job

    Spark

    Driver Node, Error

    Driver Node, YARN, kill

    Ganglia, Driver Node, GB

    MLlib

    https://github.com/apache/spark/blob/branch-2.0/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L690


  • 6. Spark Job

    Driver Node

    Driver Node

    --conf spark.driver.maxResultSize=g

    Driver Node

  • unis

    unitem()

    uniquery

    unicorn

    etc...

    unis + unique + query

    orz http://uniquery.jp/


  • 7.

    Kryo serialization failed: Buffer overflow. Available: 0, required: 10. To avoid this, increase spark.kryoserializer.buffer.max value.

    kryo

    --conf spark.kryoserializer.buffer.max=1296m

  • 8. staging, Spark

    dev

    staging

    ,,,

    CPU, Node, Node

  • 8. staging, Spark

    stagingS3unload2

    partition

    HashPartition

    180

  • JVM

    Ganglia, Spark WebUI

  • EMR

  • Appendix

  • Spark Tips: Spark on EMR (YARN) Tips http://qiita.com/uryyyyyyy/items/f8bb1c4a4137e896de7f

    http://www.slideshare.net/hadooparchbook/top-5-mistakes-when-writing-spark-applications

    EMR: http://qiita.com/uryyyyyyy/items/5cc7fa8957ad5953f111