Using Spark MLlib in DSP Development

  • date post

    15-Apr-2017
  • Category

    Science

Transcript of Using Spark MLlib in DSP Development

  • Using Spark MLlib in DSP Development

    2015.11.27

  • 1

    : (, )

    : DSP, , ()

  • DIMSUM -

    word2vec -

    Splash - word2vec

  • Item-item cosine similarity

    e.g. page view histories:

    A B C

    B C

    A C

    A D

    User-item page view matrix:

             A  B  C  D
    user1    1  1  1  0
    user2    0  1  1  0
    user3    1  0  0  1
    user4    1  0  1  0

    Item-item similarity matrix (cosine similarity between item columns), used to recommend items:

         A     B     C     D
    A    1     0.41  0.67  0.58
    B    0.41  1     0.82  0
    C    0.67  0.82  1     0
    D    0.58  0     0     1
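
The cosine similarity between two items is the dot product of their columns in the user-item matrix divided by the product of the column norms. A minimal sketch in plain Scala (no Spark), using the 4x4 user-item matrix from this slide; the object and method names are illustrative and not from the original deck:

```scala
object ItemCosine {
  // Rows are users, columns are items A, B, C, D (1 = the user viewed the item).
  val pageViews: Array[Array[Double]] = Array(
    Array(1.0, 1.0, 1.0, 0.0), // user1
    Array(0.0, 1.0, 1.0, 0.0), // user2
    Array(1.0, 0.0, 0.0, 1.0), // user3
    Array(1.0, 0.0, 1.0, 0.0)  // user4
  )

  // Extract column j of a row-major matrix.
  def column(m: Array[Array[Double]], j: Int): Array[Double] = m.map(row => row(j))

  // cos(a, b) = a.b / (|a| |b|); defined as 0 when either vector is all zeros.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Full item-item similarity matrix (n x n for n items).
  def similarityMatrix(m: Array[Array[Double]]): Array[Array[Double]] = {
    val n = m.head.length
    val cols = Array.tabulate(n)(j => column(m, j))
    Array.tabulate(n, n)((j, k) => cosine(cols(j), cols(k)))
  }
}
```

Computing every pair this way needs all columns in one place, which is what the distributed approaches on the following slides try to avoid.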

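  • Naive computation with MapReduce

    Zadeh, Reza Bosagh, and Gunnar Carlsson. (2013).

    #users = m (rows), #items = n (columns)

    m: number of rows, L: maximum number of nonzero entries in a row

    => the cost grows with m

    [Figure: the user-item matrix (rows user1, user2, ..., user i, user i+1; columns A to D) is split row-wise into partitions 1 to p. For row i, the map step emits an entry for each pair of columns (j, k); the reduce step aggregates the emitted entries.]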
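
The naive scheme can be sketched with ordinary Scala collections standing in for the map and reduce stages. This is an illustration under assumptions, not code from the cited paper: rows are sparse maps from column index to value, and `NaiveAllPairs`, `mapRow`, and `dotProducts` are hypothetical names.

```scala
object NaiveAllPairs {
  type Row = Map[Int, Double] // sparse row: column index -> value

  // Map step: one row emits a_ij * a_ik for every pair j < k of its nonzero columns.
  def mapRow(row: Row): Seq[((Int, Int), Double)] = {
    val entries = row.toSeq.sortBy(_._1)
    for {
      (j, aij) <- entries
      (k, aik) <- entries
      if j < k
    } yield ((j, k), aij * aik)
  }

  // Reduce step: sum the emitted products per column pair, giving the
  // dot product of columns j and k.
  def dotProducts(rows: Seq[Row]): Map[(Int, Int), Double] =
    rows.flatMap(mapRow).groupBy(_._1).map { case (pair, emitted) =>
      pair -> emitted.map(_._2).sum
    }
}
```

A row with L nonzero entries emits L(L-1)/2 records here, so the number of shuffled records grows with the number of rows m.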

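  • DIMSUM on MapReduce

    Zadeh, Reza Bosagh, and Gunnar Carlsson. (2013).

    Same map/reduce layout as the naive version, but each entry is emitted only with some probability ("emit with prob.").

    => the cost no longer depends on m

    The sampling rate is controlled by the oversampling parameter.

    [Figure: the user-item matrix (rows user1, user2, ..., user i, user i+1; columns A to D) is split row-wise into partitions 1 to p. For row i, the map step emits, with some probability, an entry for each pair of columns (j, k); the reduce step aggregates the emitted entries.]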
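
A sketch of the sampling idea, assuming an emit probability of min(1, γ/(‖c_j‖‖c_k‖)) for the column pair (j, k) and a matching rescaling in the reducer, which is how the mapper and reducer are described in the cited Zadeh and Carlsson paper as far as I can tell. Plain Scala collections stand in for MapReduce, all names are hypothetical, and MLlib's own implementation may differ in its details.

```scala
import scala.util.Random

object DimsumSketch {
  type Row = Map[Int, Double] // sparse row: column index -> value

  // Euclidean norm of every column, computed in one pass over the rows.
  def columnNorms(rows: Seq[Row]): Map[Int, Double] =
    rows.flatMap(_.toSeq).groupBy(_._1).map { case (j, entries) =>
      j -> math.sqrt(entries.map { case (_, v) => v * v }.sum)
    }

  // Estimated cosine similarity for column pairs j < k.
  // gamma is the oversampling parameter: larger gamma means more emissions
  // and a smaller estimation error.
  def similarities(rows: Seq[Row], gamma: Double, rng: Random): Map[(Int, Int), Double] = {
    val norms = columnNorms(rows)
    // Map step: emit a_ij * a_ik only with probability min(1, gamma / (|c_j| |c_k|)).
    val emitted = for {
      row <- rows
      (j, aij) <- row.toSeq
      (k, aik) <- row.toSeq
      if j < k
      if rng.nextDouble() < math.min(1.0, gamma / (norms(j) * norms(k)))
    } yield ((j, k), aij * aik)
    // Reduce step: rescale the sum so that its expectation is the cosine similarity.
    emitted.groupBy(_._1).map { case ((j, k), values) =>
      val normProduct = norms(j) * norms(k)
      // Pairs emitted with probability 1 are divided by the norm product,
      // sampled pairs by gamma.
      val scale = if (gamma > normProduct) normProduct else gamma
      (j, k) -> values.map(_._2).sum / scale
    }
  }
}
```

With a very large gamma every pair is emitted and the result equals the exact cosine similarity; with a small gamma, pairs of high-norm columns are sampled heavily, which is where the savings in shuffle size come from.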

  • DIMSUM

    Chernoff bound

    Zadeh, Reza Bosagh, and Gunnar Carlsson. (2013).

  • γ vs. error and shuffle size

    Zadeh, Reza Bosagh, and Gunnar Carlsson. (2013).

    Twitter: I/O 40%

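  • DIMSUM in MLlib

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    val rows: RDD[Vector] = ... // an RDD of local vectors
    val mat: RowMatrix = new RowMatrix(rows)
    val sim = mat.columnSimilarities(1000) // the argument is a similarity threshold (Double)

    MLlib

    γ: oversampling parameter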
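
`columnSimilarities` takes a similarity threshold rather than γ itself. The sketch below shows how the two relate, assuming the mapping γ = 10 log(n) / threshold (n = number of columns) that I believe the MLlib source uses; treat the formula as an assumption to check against your Spark version. `DimsumGamma` and `gammaFor` are hypothetical names, not MLlib API.

```scala
object DimsumGamma {
  // Assumed mapping from the threshold argument to the oversampling parameter gamma.
  def gammaFor(threshold: Double, numCols: Long): Double = {
    require(threshold >= 0.0, "threshold must be non-negative")
    if (threshold < 1e-6) Double.PositiveInfinity // no sampling: exact result, brute-force cost
    else 10.0 * math.log(numCols.toDouble) / threshold
  }
}
```

Under this assumption a larger threshold gives a smaller γ, hence more aggressive sampling and a cheaper but rougher estimate.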

  • DIMSUM -

    word2vec -

    Splash - word2vec

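  • word2vec: proposed by Mikolov et al. at Google; 2 model architectures (CBOW, Skip-gram) and 2 training methods (hierarchical softmax, negative sampling)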

    T. Mikolov et al. (2013)

  • web

    itemweb

  • A B Citems

    page view

  • (CV)

    label: 0 or 1

    conversion

    regression by ML

    CVCV

  • web

    3000 7~8000 Page View 1.6/1

    A B C

    B C

    A C

    A D

    page views

    Spark

  • word2vec in MLlib: based on the C implementation, Skip-gram

  • var f = blas.sdot(vectorSize, syn0, l1, 1, syn1neg, l2, 1)
    var g = 0.0
    if (f > MAX_EXP) {
      g = (label - 1) * alpha
    } else if (f < -MAX_EXP) {
      g = (label - 0) * alpha
    } else {
      val ind = ((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2.0)).toInt
      f = expTable.value(ind)
      g = ((label - f) * alpha).toFloat
    }
    blas.saxpy(vectorSize, g.toFloat, syn1neg, l2, 1, neu1e, 0, 1)
    blas.saxpy(vectorSize, g.toFloat, syn0, l1, 1, syn1neg, l2, 1)
    syn1negModify(target) += 1
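
The same update can be written without BLAS and without the sigmoid lookup table. This is a simplified, self-contained sketch of one negative-sampling step on plain arrays, not the code from the slide: it calls the exact sigmoid instead of `expTable`, and `SkipGramNegSampling`, `step`, `input`, and `output` are hypothetical names.

```scala
object SkipGramNegSampling {
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // One SGD step for an (input word, target word) pair.
  // label is 1 for the observed context word and 0 for a sampled negative word.
  // Updates `output` in place and returns the gradient to accumulate for `input`.
  def step(input: Array[Double], output: Array[Double], label: Int, alpha: Double): Array[Double] = {
    // Dot product of the two vectors (blas.sdot in the slide).
    val f = input.indices.map(i => input(i) * output(i)).sum
    // Prediction error scaled by the learning rate.
    val g = (label - sigmoid(f)) * alpha
    // Gradient for the input vector, taken before `output` changes
    // (the first saxpy, which accumulates into neu1e).
    val inputGrad = output.map(_ * g)
    // Update the output vector (the second saxpy, which updates syn1neg).
    for (i <- output.indices) output(i) += g * input(i)
    inputGrad
  }
}
```

The returned gradient is meant to be summed over the positive target and all negative samples and then added to the input vector once, which mirrors how `neu1e` is used in the C implementation.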

  • word2vec in MLlib (contd.): partition

    [Figure: the data is split into data shards, one per worker (worker 1, worker 2, worker 3), and the workers communicate with the driver in steps 1, 2, 3. Panels: original implementation, revised implementation.]

  • word2vec in MLlib (contd.)

    partition

    , , , 2015, p164

    T: iteration, C: const., :
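
The partition-per-worker layout above can be illustrated with a toy problem. In this sketch each shard runs SGD locally from the weights it received and the driver then combines the results; combining by a plain average is an assumption made for illustration and is not necessarily what MLlib's word2vec does. The loss, the names, and the parameters are all hypothetical.

```scala
object ShardedSgd {
  // Local SGD on one shard for the toy loss (w - x)^2 / 2,
  // starting from the weight sent out by the driver.
  def localTrain(w0: Double, shard: Seq[Double], alpha: Double): Double =
    shard.foldLeft(w0)((w, x) => w - alpha * (w - x))

  // One round: every shard trains independently, then the driver averages the results.
  def round(w: Double, shards: Seq[Seq[Double]], alpha: Double): Double =
    shards.map(shard => localTrain(w, shard, alpha)).sum / shards.size

  // T rounds of local training followed by averaging, starting from w = 0.
  def train(shards: Seq[Seq[Double]], rounds: Int, alpha: Double): Double =
    (1 to rounds).foldLeft(0.0)((w, _) => round(w, shards, alpha))
}
```

For example, `ShardedSgd.train(Seq(Seq(1.0, 1.0), Seq(3.0, 3.0)), 20, 0.5)` moves toward the overall mean of the data. Workers only synchronize once per round, so the number of partitions and the number of rounds both affect how quickly the combined model converges.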

  • parameter server

    J. Dean et al. (2012)

    Spark, parameter server

    Parameter server on Spark: Dist-ML (Intel)

  • DIMSUM -

    word2vec -

    Splash - word2vec

  • Splash

    UC Berkeley AMPLab, Spark, MLlib, 1/iteration, LDA, SGD, Logistic

  • Splash:partition :partition i

    local update

    global update

    Y. Zhang and M. I. Jordan (2015)

    n: local, T: iteration

  • toy problem

    : : Splash :

    Y. Zhang and M. I. Jordan (2015)

  • Splash Example

  • Splash

    Y. Zhang and M. I. Jordan (2015)

    MNIST 8M

    ref: https://amplab.cs.berkeley.edu/projects/splash/

  • DIMSUM MLlib MLlib