Scala Matsuri 2016: Japanese Text Mining with Scala and Spark


Transcript of Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Page 1: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Japanese Text Mining with Scala and Spark

Eduardo Gonzalez

Scala Matsuri 2016

Japanese title: Scala と Spark による日本語テキストマイニング

Page 2: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

About Me

• Eduardo Gonzalez
• Japan Business Systems
• Japanese System Integrator (SIer)
• Social Systems Design Center (R&D)
• Pittsburgh University
• Computer Science
• Japanese

@wm_eddie

Page 3: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Agenda

• Intro to Text mining with Spark
• Pre-processing Japanese Text
• Japanese Word Breaking
• Spark Gotchas
• Topic Extraction with LDA
• Intro to Word2Vec
• Recommendation with Word Embedding

The talk covers, in order: introduction, pre-processing (word segmentation and Spark pitfalls), topic analysis, Word2Vec, and recommendation.

Page 4: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Machine Learning Vocabulary

• Feature: A number that represents something about a data point
• Label: A feature of the data we want to predict
• Document: A block of text with a unique ID
• Model: A learned set of parameters that can be used for prediction
• Corpus: A collection of documents

The vocabulary assumed for machine learning: Feature, Label, Document, Model, and Corpus.

Page 5: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

What is Apache Spark

• A library that defines a Resilient Distributed Dataset type and a set of transformations
• RDDs are only representations of calculations
• A runtime that can execute RDDs in a distributed manner
• A master process that schedules and monitors executors
• Executors actually do the calculations and can keep results in their memory
• Spark SQL, MLlib and GraphX define special types of RDDs

Spark is a general-purpose distributed processing platform with components for SQL, machine learning, and graph processing.

Page 6: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Apache Spark Example

import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val sc = new SparkContext(new SparkConf())

  val text = sc.textFile("hdfs:///kjb.txt")

  // Split each line on spaces, pair every word with 1, then sum the 1s per word
  val counts = text.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  counts.collect().foreach(println)
}

A WordCount application built with Spark looks like this.
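To make the data flow of the WordCount slide easier to follow, here is a minimal sketch of the same pipeline on an in-memory Seq instead of an RDD. The object name LocalWordCount and the sample lines are made up for illustration, and groupBy stands in for reduceByKey, which plain Scala collections do not have.

```scala
object LocalWordCount {
  // Same steps as the Spark version: split lines into words, then count each word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" ").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit =
    wordCount(Seq("in the beginning", "the end")).foreach(println)
}
```

The Spark version differs mainly in where the work happens: each transformation is only recorded, and nothing runs until collect() is called.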

Page 7: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Spark’s Text-Mining Tools

• LDA for Topic Extraction
• Word2Vec, an unsupervised way to turn words into features based on their meaning
• CountVectorizer turns documents into vectors based on word count
• HashingTF-IDF calculates important words of a document with respect to the corpus
• And much more

Spark's text-mining tools include LDA, CountVectorizer, and HashingTF-IDF.

Page 8: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

How to use Spark LDA

import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

Page 9: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

sample_lda_data.txt

However, the LDA input data does not look like text.

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

(´Д` )

This does not look like text
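The sample file makes more sense once each row is read as one document's word-count vector. The sketch below parses rows the same way the LDA slide does, but on a plain Seq; the object name CountRows is an assumption for illustration.

```scala
object CountRows {
  // Each row is one document; column j holds how often vocabulary word j occurs in it.
  def parseRow(row: String): Array[Double] =
    row.trim.split(' ').map(_.toDouble)

  // Pair every parsed row with a document ID, mirroring zipWithIndex.map(_.swap).
  def index(rows: Seq[String]): Seq[(Long, Array[Double])] =
    rows.map(parseRow).zipWithIndex.map { case (counts, i) => (i.toLong, counts) }

  def main(args: Array[String]): Unit = {
    val docs = index(Seq("1 2 6 0 2 3 1 1 0 0 3", "1 3 0 1 3 0 0 2 0 0 1"))
    docs.foreach { case (id, counts) => println(s"doc $id: ${counts.mkString(" ")}") }
  }
}
```

So before LDA can run on real text, the text has to be turned into rows of this shape, which is what the following steps do.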

Page 10: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

LDA Step 0: Get words

Before running LDA, the words must first be extracted.

Page 11: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Word Segmentation

• Hard to actually get right.
• Simple in theory with English
• Str.Split(" ")
• But not enough for real data.
• (Take parens for example.)
• ["(Take", "parens", "for", "example.)"]
• Etc.

Real-world word extraction is hard; simply splitting on a delimiter does not work well.
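The parenthesis problem from the slide can be reproduced in a few lines. In this sketch, bySpace is the naive split, and byWordChars is a hypothetical improvement (not from the talk) that keeps only runs of letters and digits; both names are made up for illustration.

```scala
object NaiveSplit {
  // Naive approach: split on single spaces, so punctuation stays glued to the words.
  def bySpace(s: String): List[String] = s.split(" ").toList

  // Assumption for illustration: treat any run of letters or digits as a word.
  def byWordChars(s: String): List[String] =
    "[\\p{L}\\p{N}]+".r.findAllIn(s).toList

  def main(args: Array[String]): Unit = {
    val sentence = "(Take parens for example.)"
    println(bySpace(sentence))
    println(byWordChars(sentence))
  }
}
```

Even the regex version is only a patch: it would split "don't" into two tokens, which is one reason real tokenizers are more involved.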

Page 12: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Word Segmentation

• Since Japanese lacks spaces it’s hard even in theory
• A probabilistic approach is necessary
• Thankfully there are libraries that can help

Japanese has no word delimiter, so word extraction needs a probabilistic approach; libraries make this efficient.

Page 13: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Morphological Analyzers

• Include POS tagging, pronunciation and stemming
• MeCab
• Written in C++ with SWIG bindings to pretty much everything
• Kuromoji
• Written in Java, available via Maven
• Others

Libraries such as MeCab and Kuromoji are available for morphological analysis (POS tagging, pronunciation, stemming).

Page 14: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

JMecab & Spark/Hadoop

• Not impossible but difficult
• Add MeCab to each node
• Add jar to classpaths
• Include jar in project for compilation
• Not too bad with own hardware but painful with Amazon EMR or Azure HDInsight

JMecab has to be installed in advance, so it is manageable on-premises but hard to run in cloud environments.

Page 15: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Kuromoji & Spark/Hadoop

• Easy
• Include dependency in build.sbt
• Include jar file in FatJar with sbt-assembly

Kuromoji is easy to use: just add the dependency and build a FatJar.

Page 16: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Using Kuromoji

import org.atilika.kuromoji.Tokenizer

object Main extends App {
  import scala.collection.JavaConverters.asScalaBufferConverter

  val tokenizer = Tokenizer.builder().build()

  val ex1 = "リストのような構造の物から条件を満たす物を探す"
  val res1 = tokenizer.tokenize(ex1).asScala

  // Print the base (dictionary) form and the part of speech of each token
  for (token <- res1) {
    println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
  }
}

Page 17: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Using Kuromoji

With Kuromoji, text is analyzed like this:

厚生  名詞,一般,*,*
年金  名詞,一般,*,*
基金  名詞,一般,*,*
脱退  名詞,サ変接続,*,*
に  助詞,格助詞,一般,*
伴う  動詞,自立,*,*
手続き  名詞,サ変接続,*,*
について  助詞,格助詞,連語,*
の  助詞,連体化,*,*
リマ  名詞,固有名詞,地域,一般
インド  名詞,固有名詞,地域,国
です  助動詞,*,*,*

リスト  名詞,一般,*,*
の  助詞,連体化,*,*
よう  名詞,非自立,助動詞語幹,*
だ  助動詞,*,*,*
構造  名詞,一般,*,*
の  助詞,連体化,*,*
物  名詞,非自立,一般,*
から  助詞,格助詞,一般,*
条件  名詞,一般,*,*
を  助詞,格助詞,一般,*
満たす  動詞,自立,*,*
物  名詞,非自立,一般,*
を  助詞,格助詞,一般,*
探す  動詞,自立,*,*

Page 18: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 1: Build Vocabulary

Building the vocabulary

Page 19: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Vocabulary

lazy val tokenizer = Tokenizer.builder().build()

val text = sc.textFile("documents")
// One base-form word per token, across all lines
val words = for {
  line <- text
  token <- tokenizer.tokenize(line).asScala
} yield token.getBaseForm

// Map every distinct word to a unique index
val vocab = words.distinct().zipWithIndex().collectAsMap()
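The vocabulary is just a mapping from each distinct word to an index. A minimal sketch of the same idea on a local Seq follows; the object name LocalVocabulary is made up, and the input is assumed to be already tokenized.

```scala
object LocalVocabulary {
  // Assign each distinct word an index, like distinct().zipWithIndex() on the RDD.
  def build(words: Seq[String]): Map[String, Long] =
    words.distinct.zipWithIndex.map { case (word, i) => (word, i.toLong) }.toMap

  def main(args: Array[String]): Unit =
    build(Seq("条件", "を", "満たす", "物", "を", "探す")).foreach(println)
}
```

On a local Seq the indices follow first occurrence; on an RDD the order after distinct() is not guaranteed, so only the uniqueness of the indices should be relied on.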

Page 20: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 2: Create Corpus

Creating the corpus

Page 21: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Corpus

// Tokenize each document (line) into its base-form words
val documentWords: RDD[Array[String]] =
  text.map(line => tokenizer.tokenize(line).asScala.map(t => t.getBaseForm).toArray)

// Count how often each distinct word occurs within each document
val documentCounts: RDD[Array[(String, Int)]] =
  documentWords.map(words => words.distinct.map { word =>
    (word, words.count(_ == word))
  })

// Replace each word with its vocabulary index
val documentIndexAndCount: RDD[Seq[(Int, Double)]] =
  documentCounts.map(wordsAndCount => wordsAndCount.map {
    case (word, count) => (vocab(word).toInt, count.toDouble)
  })

// One sparse count vector per document, keyed by a unique document ID
val corpus: RDD[(Long, Vector)] =
  documentIndexAndCount.map(Vectors.sparse(vocab.size, _)).zipWithIndex.map(_.swap)
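What goes into Vectors.sparse is a list of (vocabulary index, count) pairs per document. The following sketch computes those pairs for a single tokenized document with plain collections; LocalCorpus and countPairs are hypothetical names, and every word is assumed to be in the vocabulary.

```scala
object LocalCorpus {
  // Turn one tokenized document into (vocabulary index, count) pairs, sorted by index.
  def countPairs(words: Seq[String], vocab: Map[String, Long]): Seq[(Int, Double)] =
    words.groupBy(identity).toSeq
      .map { case (word, occurrences) => (vocab(word).toInt, occurrences.size.toDouble) }
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val vocab = Map("ログ" -> 0L, "設定" -> 1L, "エラー" -> 2L)
    println(countPairs(Seq("エラー", "ログ", "エラー"), vocab))
  }
}
```

Grouping counts each word in one pass, whereas the slide's words.count call rescans the document once per distinct word; for short documents either is fine.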

Page 22: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 3: Learn Topics

Training the topic model

Page 23: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Learn Topics

val ldaModel = new LDA().setK(10).setMaxIterations(100).run(corpus)

// describeTopics returns, for each topic, the top term indices and their weights;
// vocabulary is the index-to-word array built on the Filter Stopwords slide
val topics = ldaModel.describeTopics(10).map {
  case (terms, weights) => terms.map(vocabulary(_)).zip(weights)
}

topics.zipWithIndex.foreach {
  case (topic, i) =>
    println(s"TOPIC $i")
    topic.foreach { case (term, weight) => println(s"$term\t$weight") }
    println(s"==========")
}

Page 24: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 4: Evaluate

Evaluating the results

Page 25: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Topics?

Topic 0:

です  0.10870545899718176
。  0.09623411796419644
さん  0.06105040403724023

Topic 1:

の  0.11035671185240525
を  0.07860862808644907
する  0.05605566895190625

Topic 2:

お願い  0.07579177409154919
ご  0.04431117457391179
よろしく  0.032788330612439916

The resulting topics consisted of particles and other function words.

Page 26: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 5: GOTO 2

Page 27: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

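Filter Stopwords

// Treat the 50 most frequent words as stopwords
val popular = words
  .map(w => (w, 1))
  .reduceByKey(_ + _)
  .sortBy(-_._2)
  .take(50)
  .map(_._1)
  .toSet

// Rebuild the vocabulary without the stopwords
val vocabIndicies = words.distinct().filter(w => !popular.contains(w)).zipWithIndex()
// collectAsMap() returns a scala.collection.Map, not the default immutable Map
val vocab: scala.collection.Map[String, Long] = vocabIndicies.collectAsMap()
val vocabulary = vocabIndicies.collect().map(_._1)

Removing stopwords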
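The stopword filter boils down to "drop the n most frequent words". Here is a minimal local sketch of that rule; the names LocalStopwords, mostFrequent and withoutStopwords are made up, and the alphabetical tie-break is an assumption added only to keep the result deterministic.

```scala
object LocalStopwords {
  // The n most frequent words; ties are broken alphabetically.
  def mostFrequent(words: Seq[String], n: Int): Set[String] =
    words.groupBy(identity).toSeq
      .map { case (word, occurrences) => (word, occurrences.size) }
      .sortBy { case (word, count) => (-count, word) }
      .take(n)
      .map(_._1)
      .toSet

  // Distinct words that survive once the n most frequent ones are removed.
  def withoutStopwords(words: Seq[String], n: Int): Seq[String] = {
    val popular = mostFrequent(words, n)
    words.distinct.filterNot(popular.contains)
  }

  def main(args: Array[String]): Unit =
    println(withoutStopwords(Seq("の", "の", "の", "を", "を", "サーバー", "設定"), 2))
}
```

The cut-off of 50 used on the slide is a tuning choice; a different corpus may need a different value or a curated stopword list.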

Page 28: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Topics!

Topic 0:

変更  0.032952997236706624
サーバー  0.03140777729144046
設定  0.021643554361727567
エラー  0.017955380768330902

Topic 1:

ログ  0.028665774057609564
時間  0.026686704628121154
時  0.02404938565591628
発生  0.020797622509804107

Topic 2:

様  0.0474658820402456
株式会社  0.026174292703953685
お世話  0.021939329774535308

Page 29: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Using the LDA model

• Prediction requires a LocalLDAModel
• Use .toLocal if isInstanceOf[DistributedLDAModel]
• Convert to Vector using same steps
• Be sure to filter out words not in the vocabulary
• Call topicDistributions to see topic scores

The LDA model is used to predict topics.
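The reminder to filter out unknown words matters because looking up a missing key in the vocabulary map throws an exception. A minimal sketch of that filtering step, with the hypothetical names KnownWords and knownOnly and an assumed pre-tokenized input:

```scala
object KnownWords {
  // Keep only tokens the vocabulary knows; unseen words have no vector index.
  def knownOnly(tokens: Seq[String], vocab: Map[String, Long]): Seq[String] =
    tokens.filter(vocab.contains)

  def main(args: Array[String]): Unit = {
    val vocab = Map("ログ" -> 0L, "設定" -> 1L)
    println(knownOnly(Seq("ログ", "未知語", "設定"), vocab))
  }
}
```

After this step the new document can go through the same count-vector conversion that was used for training.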

Page 30: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Topics Prediction

New document topics: 0.091084004103132,0.1044111561202625,0.09090943947509807,0.11607354553753861,0.10404284803971378,0.09697071269561051,0.09571658794577831,0.0919546186785918,0.09176248930132802,0.11707459810294643

New document topics: 0.09424474530277152,0.1183270779577911,0.09230776874419214,0.09835759337114718,0.13159581881630272,0.09279638945611612,0.094124104743527,0.09295449996673977,0.09291472297512052,0.09237727866629193

Topic prediction: the values are the scores for Topic 0, Topic 1, Topic 2, … in order.
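To read such a distribution, one simple option is to pick the topic with the highest score. The sketch below does that for the two score lists shown on this slide; DominantTopic and top are made-up names, and taking the maximum is only one possible way to use the scores.

```scala
object DominantTopic {
  // Index and score of the highest-scoring topic in one document's distribution.
  def top(distribution: Seq[Double]): (Int, Double) = {
    val (score, index) = distribution.zipWithIndex.maxBy(_._1)
    (index, score)
  }

  // The two distributions printed on the slide.
  val doc1 = Seq(0.091084004103132, 0.1044111561202625, 0.09090943947509807,
    0.11607354553753861, 0.10404284803971378, 0.09697071269561051,
    0.09571658794577831, 0.0919546186785918, 0.09176248930132802, 0.11707459810294643)
  val doc2 = Seq(0.09424474530277152, 0.1183270779577911, 0.09230776874419214,
    0.09835759337114718, 0.13159581881630272, 0.09279638945611612,
    0.094124104743527, 0.09295449996673977, 0.09291472297512052, 0.09237727866629193)

  def main(args: Array[String]): Unit = {
    println(top(doc1))
    println(top(doc2))
  }
}
```

Both distributions are fairly flat, so the winning topic leads only by a small margin.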

Page 31: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Now what?

• Find the minimum logLikelihood in a set of documents you know are OK

• Report anomaly whenever a new document has a lower logLikelihood

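Compute the minimum log-likelihood over a set of documents whose topics were predicted correctly, and classify a new document as an anomaly when its value falls below that minimum.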
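The rule above is a one-line threshold check. A minimal sketch follows, assuming the log-likelihood scores have already been computed; the object name AnomalyThreshold and all score values are invented for illustration.

```scala
object AnomalyThreshold {
  // The lowest score among documents known to be OK becomes the cut-off.
  def threshold(knownGoodScores: Seq[Double]): Double = knownGoodScores.min

  // A new document is flagged when it scores below the cut-off.
  def isAnomaly(score: Double, cutoff: Double): Boolean = score < cutoff

  def main(args: Array[String]): Unit = {
    // Hypothetical scores, not taken from the talk
    val cutoff = threshold(Seq(-120.5, -98.0, -143.2))
    println(isAnomaly(-200.0, cutoff))
    println(isAnomaly(-100.0, cutoff))
  }
}
```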

Page 32: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Anomaly Detection

val newDoc = sc.parallelize(Seq("平素は当社サービスをご利用いただき、誠にありがとうございます。"))

def stringToCountVector(strings: RDD[String]) = { . . . }

val score = lda.logLikelihood(stringToCountVector(newDoc))

/*
-2153492.694125671
*/

Page 33: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Word2Vec

• Creates vectors that represent points in meaning space
• Unsupervised but requires a lot of data to generate good vectors
• Google’s sample vectors trained on 100 billion words (~X00GB?)
• Vectors with less data can provide interesting similarities but can’t do so consistently

Word2Vec turns words into vectors, giving them a quantitative representation that makes it possible to compute the similarity between words.
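One common way to compare two word vectors is cosine similarity, sketched below with plain arrays. This is a general illustration, not necessarily how Spark scores synonyms; the scores printed later in the slides are whatever findSynonyms returned and need not match this formula. The object name Cosine is made up.

```scala
object Cosine {
  // Dot product of two equally sized vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Cosine of the angle between a and b: 1.0 for the same direction, 0.0 for orthogonal.
  def similarity(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  def main(args: Array[String]): Unit = {
    println(similarity(Array(1.0, 0.0), Array(2.0, 0.0)))
    println(similarity(Array(1.0, 0.0), Array(0.0, 3.0)))
  }
}
```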

Page 34: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Word2Vec Intuition

• Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

An example of actual word vectors

Page 35: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Vector Concatenation

[Figure: ITEM_01 and the words 営業活用, 営業, 情報共有, サポート, . . .]

Page 36: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 1: Make vectors

Generating word vectors

Page 37: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Making Word2VecModel

val documentWords: RDD[Seq[String]] = text.map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)

documentWords.cache()

val model = new Word2Vec().setVectorSize(300).fit(documentWords)

Page 38: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Step 2: Use vectors

Applying word vectors

Page 39: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Using Word2VecModel

model.findSynonyms("日本", 5).foreach(println)

/*
(マイクロソフト,3.750299190465294)
(ビジネス,3.7329870992662104)
(株式会社,3.323983664186244)
(システムズ,3.1331352923187987)
(ビジネスプロダクティビティ,2.595931613590554)
*/

An actual example of computing word similarity. The results vary greatly with the source data, so the source data is extremely important.

A big dataset is very important.

Page 40: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Recommendation

• Paragraph Vectors
• Not available in Spark T_T

Recommendation by vectorizing whole texts is not possible in Spark.

Page 41: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Embedding with Vector Concatenation

• Calculate the sum of the word vectors in the description
• Add it to vectors from Word2VecModel.getVectors with special keyword (Ex. ITEM_1234)
• Create new Word2VecModel using constructor
• ※Not state of the art but can produce reasonable recommendations without user rating data

Embedding by vector concatenation: for each item, sum the vectors of the words it contains.
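The steps above can be sketched without Spark by treating the model as a plain map from word to vector. ItemEmbedding, sumVectors and withItem are hypothetical names, the vectors are toy values, and unknown words are simply skipped here, which is a simplification compared with the fallback used in the talk's code.

```scala
object ItemEmbedding {
  // Element-wise sum of the vectors of all words that have one; unknown words are skipped.
  def sumVectors(words: Seq[String], vectors: Map[String, Array[Float]], size: Int): Array[Float] =
    words.flatMap(w => vectors.get(w)).foldLeft(Array.fill(size)(0f)) { (acc, v) =>
      acc.zip(v).map { case (x, y) => x + y }
    }

  // Register the summed vector under a special keyword next to the word vectors.
  def withItem(key: String, words: Seq[String],
               vectors: Map[String, Array[Float]], size: Int): Map[String, Array[Float]] =
    vectors + (key -> sumVectors(words, vectors, size))

  def main(args: Array[String]): Unit = {
    // Toy two-dimensional vectors, invented for illustration
    val vectors = Map("営業" -> Array(1f, 2f), "情報" -> Array(0.5f, 1f))
    val extended = withItem("ITEM_1234", Seq("営業", "情報", "未知"), vectors, 2)
    println(extended("ITEM_1234").mkString(", "))
  }
}
```

Because items and words end up in the same vector space, a nearest-neighbour lookup on an item key can return other items as well as words.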

Page 42: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Item Embedding (1/2)

val embeds = Map(
  "ITEM_001_01" -> "営業部門の情報共有と活用をサポートし",
  "ITEM_001_02" -> "組織的な営業力 売れる仕組みを構築します・",
  "ITEM_001_03" -> "営業情報のコミュニケーション基盤を構築する",
  "ITEM_002_01" -> "一般的なサーバ、ネットワーク機器や OSレベルの監視に加え",
  "ITEM_002_02" -> "またモニタリングポータルでは、アラームの発生状況",
  "ITEM_002_03" -> "監視システムにより取得されたパフォーマンス情報が逐次ダッシュボード形式",
  "ITEM_003_01" -> "IPネットワークインフラストラクチャを構築します",
  "ITEM_003_02" -> "導入にとどまらず、アプリケーションや OAシステムとの融合を図ったユニファイドコミュニケーション環境を構築",
  "ITEM_003_03" -> "企業内および企業外へのコンテンツの効果的な配信環境、閲覧環境をご提供します"
)

Page 43: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Item Embedding (2/2)

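import scala.util.Try
import breeze.linalg.DenseVector

def stringToVector(s: String): Array[Double] = {
  val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
  // Look up each word's vector; words missing from the model fall back to the vector for "は"
  val vectors = words.map(word => Try(model.transform(word)).getOrElse(model.transform("は")))
  val breezeVectors: Seq[DenseVector[Double]] = vectors.map(v => new DenseVector(v.toArray))
  // Sum the word vectors element-wise, starting from a zero vector of length vectorLength
  val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)

  concat.toArray
}

// Word2VecModel stores Float arrays, so convert each summed vector
val embedVectors: Map[String, Array[Float]] = embeds.map {
  case (key, value) => (key, stringToVector(value).map(_.toFloat))
}

// New model containing both the item vectors and the original word vectors
val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)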

Page 44: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Recommending Similar

embedModel.findSynonyms("ITEM_001_01", 5).foreach(println)

/*
(ITEM_001_03,12.577457221575695)
(ITEM_003_03,12.542920930725996)
(ITEM_003_02,12.315240961298104)
(ITEM_001_02,12.260734177166485)
(ITEM_002_01,10.866897938028856)
*/

Computing similarity

Page 45: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Recommending New

val newSentence = stringToVector("会計・受発注及び生産管理を中心としたシステム")
embedModel.findSynonyms(Vectors.dense(newSentence), 5).foreach(println)

/*
(ITEM_001_02,14.372981084681571)
(ITEM_003_03,14.343473534848325)
(ITEM_001_01,13.83593570884867)
(ITEM_002_01,13.61507040314043)
(ITEM_002_03,13.462141195072414)
*/

Recommendation from a new sample

Page 46: Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

Thank you

• Questions?

• Example source code at: https://github.com/wmeddie/spark-text