Scala Matsuri 2016: Japanese Text Mining with Scala and Spark


Japanese Text Mining with Scala and Spark
Eduardo Gonzalez
Scala Matsuri 2016


About Me
- Eduardo Gonzalez
- Japan Business Systems
  - Japanese System Integrator (SIer)
  - Social Systems Design Center (R&D)
- Pittsburgh University
  - Computer Science, Japanese
- @wm_eddie

Agenda
- Intro to Text Mining with Spark
- Pre-processing Japanese Text
- Japanese Word Breaking
- Spark Gotchas
- Topic Extraction with LDA
- Intro to Word2Vec
- Recommendation with Word Embedding

Machine Learning Vocabulary
- Feature: a number that represents something about a data point
- Label: a feature of the data we want to predict
- Document: a block of text with a unique ID
- Model: a learned set of parameters that can be used for prediction
- Corpus: a collection of documents

What is Apache Spark
- A library that defines a Resilient Distributed Dataset (RDD) type and a set of transformations
  - RDDs are only representations of calculations
- A runtime that can execute RDDs in a distributed manner
  - A master process schedules and monitors executors
  - Executors actually do the calculations and can keep results in their memory
- Spark SQL, MLlib and GraphX define special types of RDDs

Apache Spark Example

import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val sc = new SparkContext(new SparkConf())

  val text = sc.textFile("hdfs:///kjb.txt")

  val counts = text.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  counts.collect().foreach(println)
}

Spark's Text-Mining Tools
- LDA for topic extraction
- Word2Vec: an unsupervised way to turn words into features based on their meaning
- CountVectorizer: turns documents into vectors based on word counts
- HashingTF-IDF: calculates the important words of a document with respect to the corpus
- And much more
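
Of these tools, only LDA and Word2Vec appear later in the talk. As a rough sketch that is not from the slides, the HashingTF-IDF combination in the RDD-based MLlib API of this Spark era looks roughly like the following; the input path and the whitespace tokenization are placeholders:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Naive whitespace tokenization; real Japanese text needs a morphological analyzer (see below).
val docs: RDD[Seq[String]] = sc.textFile("hdfs:///docs.txt").map(_.split(" ").toSeq)

val tf: RDD[Vector] = new HashingTF().transform(docs)    // term frequencies via feature hashing
tf.cache()
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf) // down-weights words common to the whole corpus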

How to use Spark LDA

import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))

// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

sample_lda_data.txt:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

This does not look like text

LDA Step 0: Get words

Word Segmentation
- Hard to actually get right
- Simple in theory with English: str.split(" ")
- But not enough for real data. Take parens, for example:
  "(Take parens for example.)" becomes [(Take, parens, for, example.)]
- Etc.
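
A quick runnable version of the slide's own example:

// Whitespace splitting keeps punctuation glued to the neighbouring words.
val tokens = "(Take parens for example.)".split(" ")
// Array("(Take", "parens", "for", "example.)")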

Word Segmentation
- Since Japanese lacks spaces, it's hard even in theory
- A probabilistic approach is necessary
- Thankfully there are libraries that can help

Morphological Analyzers
- Include POS tagging, pronunciation and stemming
- MeCab: written in C++, with SWIG bindings to pretty much everything
- Kuromoji: written in Java, available via Maven
- Others

JMecab & Spark/Hadoop
- Not impossible, but difficult:
  - Add MeCab to each node
  - Add the jar to classpaths
  - Include the jar in the project for compilation
- Not too bad with your own hardware, but painful with Amazon EMR or Azure HDInsight

Kuromoji & Spark/Hadoop
- Easy:
  - Include the dependency in build.sbt
  - Include the jar file in the fat jar with sbt-assembly (see the sketch below)
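
A hedged sketch of what that build.sbt wiring can look like; the Kuromoji coordinates, the resolver URL and all version numbers here are assumptions, not values from the talk:

// build.sbt (illustrative versions only)
resolvers += "Atilika Open Source" at "http://www.atilika.org/nexus/content/repositories/atilika"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.1" % "provided",  // provided by the cluster
  "org.apache.spark" %% "spark-mllib" % "1.6.1" % "provided",
  "org.atilika.kuromoji" % "kuromoji" % "0.7.7"                // bundled into the fat jar
)

// project/plugins.sbt: addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
// `sbt assembly` then produces a single fat jar to pass to spark-submit.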

Using Kuromoji

import org.atilika.kuromoji.Tokenizer

object Main extends App {
  import scala.collection.JavaConverters.asScalaBufferConverter

  val tokenizer = Tokenizer.builder().build()

  val ex1 = ""
  val res1 = tokenizer.tokenize(ex1).asScala

  for (token <- res1) {
    println(token.getSurfaceForm)
  }
}

Step 2: Convert documents to count vectors

// documentWords, vocab and vocabulary are built as shown on the
// "Making Word2VecModel" and "Filter Stopwords" slides below.
val documentCounts: RDD[Seq[(String, Int)]] =
  documentWords.map(words => words.distinct.map { word => (word, words.count(_ == word)) })

val documentIndexAndCount: RDD[Seq[(Int, Double)]] =
  documentCounts.map(wordsAndCount => wordsAndCount.map {
    case (word, count) => (vocab(word).toInt, count.toDouble)
  })

val corpus: RDD[(Long, Vector)] =
  documentIndexAndCount.map(Vectors.sparse(vocab.size, _)).zipWithIndex.map(_.swap)

Step 3: Learn Topics

Learn Topics

val ldaModel = new LDA().setK(10).setMaxIterations(100).run(corpus)

val topics = ldaModel.describeTopics(10).map { case (terms, weights) =>
  terms.map(vocabulary(_)).zip(weights)
}

topics.zipWithIndex.foreach { case (topic, i) =>
  println(s"TOPIC $i")
  topic.foreach { case (term, weight) => println(s"$term\t$weight") }
  println(s"==========")
}

Step 4: Evaluate

Topics?

Topic 0:
0.10870545899718176
0.09623411796419644
0.06105040403724023

Topic 1:
0.11035671185240525
0.07860862808644907
0.05605566895190625

Topic 2:
0.07579177409154919
0.04431117457391179
0.032788330612439916

Step 5: GOTO 2

Filter Stopwords

val popular = words
  .map(w => (w, 1))
  .reduceByKey(_ + _)
  .sortBy(-_._2)
  .take(50)
  .map(_._1)
  .toSet

val vocabIndicies = words.distinct().filter(w => !popular.contains(w)).zipWithIndex()
val vocab: Map[String, Long] = vocabIndicies.collectAsMap()
val vocabulary = vocabIndicies.collect().map(_._1)

Topics!

Topic 0:
0.032952997236706624
0.03140777729144046
0.021643554361727567
0.017955380768330902

Topic 1:
0.028665774057609564
0.026686704628121154
0.02404938565591628
0.020797622509804107

Topic 2:
0.0474658820402456
0.026174292703953685
0.021939329774535308

Using the LDA model
- Prediction requires a LocalLDAModel
  - Use .toLocal if it isInstanceOf[DistributedLDAModel]
- Convert new text to a Vector using the same steps
  - Be sure to filter out words not in the vocabulary
- Call topicDistributions to see the topic scores (a minimal sketch follows)
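
A minimal sketch of those steps, not code from the slides; it assumes the ldaModel from the Learn Topics slide and the stringToCountVector helper shown on the Anomaly Detection slide (which repeats the same tokenize-and-count conversion), and "..." stands in for the new document text:

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}

// Prediction needs a LocalLDAModel, so convert if the trained model is distributed.
val localModel: LocalLDAModel =
  if (ldaModel.isInstanceOf[DistributedLDAModel]) ldaModel.asInstanceOf[DistributedLDAModel].toLocal
  else ldaModel.asInstanceOf[LocalLDAModel]

// Convert the new text with the same steps used for the training corpus, then score it.
val newDocs = stringToCountVector(sc.parallelize(Seq("...")))
localModel.topicDistributions(newDocs).foreach { case (id, dist) =>
  println("New document topics: " + dist.toArray.mkString(","))
}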

Topics Prediction

New document topics: 0.091084004103132,0.1044111561202625,0.09090943947509807,0.11607354553753861,0.10404284803971378,0.09697071269561051,0.09571658794577831,0.0919546186785918,0.09176248930132802,0.11707459810294643
New document topics: 0.09424474530277152,0.1183270779577911,0.09230776874419214,0.09835759337114718,0.13159581881630272,0.09279638945611612,0.094124104743527,0.09295449996673977,0.09291472297512052,0.09237727866629193

[Chart: topic distributions for the new documents across Topic 0, Topic 1, Topic 2, ...]

Now what?
- Find the minimum logLikelihood in a set of documents you know are OK
- Report an anomaly whenever a new document has a lower logLikelihood

Anomaly Detection

val newDoc = sc.parallelize(Seq(""))

def stringToCountVector(strings: RDD[String]) = {
  . . .
}

val score = lda.logLikelihood(stringToCountVector(newDoc))
/* -2153492.694125671 */
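
To turn that score into the anomaly check from the "Now what?" slide, one hedged sketch is the following; knownGoodDocs is a hypothetical collection, not something from the talk:

// Score documents known to be normal and use the worst (lowest) logLikelihood as the threshold.
val knownGoodDocs: Seq[RDD[String]] = ???  // e.g. historical documents reviewed by hand
val threshold: Double =
  knownGoodDocs.map(doc => lda.logLikelihood(stringToCountVector(doc))).min

// Flag any new document whose likelihood under the topic model is even lower.
if (score < threshold) println("Anomaly detected") else println("Looks normal")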

Word2Vec
- Creates vectors that represent points in "meaning space"
- Unsupervised, but requires a lot of data to generate good vectors
- Google's sample vectors were trained on 100 billion words (~X00GB?)
- Vectors built from less data can provide interesting similarities, but can't do so consistently

Word2Vec Intuition

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of NAACL HLT, 2013.
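
A hedged sketch of the paper's word-analogy idea ("king" - "man" + "woman" ≈ "queen") using Spark's Word2VecModel API; it assumes the trained model built later in these slides and that all three words are in its vocabulary:

import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vectors

def wordVec(w: String): BDV[Double] = new BDV(model.transform(w).toArray)

// Vector arithmetic in "meaning space": the result should land near "queen".
val analogy = wordVec("king") - wordVec("man") + wordVec("woman")
model.findSynonyms(Vectors.dense(analogy.toArray), 5).foreach(println)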

Vector Concatenation

[Diagram: combining the word vectors of an item description into a single ITEM_01 vector]

Step 1: Make vectors

Making Word2VecModel

val documentWords: RDD[Seq[String]] =
  text.map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)

documentWords.cache()

val model = new Word2Vec().setVectorSize(300).fit(documentWords)

Step 2: Use vectors

Using Word2VecModel

model.findSynonyms(, 5).foreach(println)
/*
(,3.750299190465294)
(,3.7329870992662104)
(,3.323983664186244)
(,3.1331352923187987)
(,2.595931613590554)
*/

A big dataset is very important.

Recommendation
- Paragraph Vectors
- Not available in Spark T_T

Embedding with Vector Concatenation
- Calculate the sum of the word vectors in an item's description
- Add it to the vectors from Word2VecModel.getVectors under a special keyword (e.g. ITEM_1234)
- Create a new Word2VecModel using the constructor
- Not state of the art, but can produce reasonable recommendations without user rating data

Item Embedding (1/2)

val embeds = Map(
  "ITEM_001_01" -> "",
  "ITEM_001_02" -> "",
  "ITEM_001_03" -> "",
  "ITEM_002_01" -> "OS",
  "ITEM_002_02" -> "",
  "ITEM_002_03" -> "",
  "ITEM_003_01" -> "IP",
  "ITEM_003_02" -> "OA",
  "ITEM_003_03" -> "")

Item Embedding (2/2)

def stringToVector(s: String): Array[Double] = {
  val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
  val vectors = words.map(word => Try(model.transform(word)).getOrElse(model.transform("")))
  val breezeVectors: Seq[DenseVector[Double]] = vectors.map(v => new DenseVector(v.toArray))
  val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)

  concat.toArray
}

val embedVectors: Map[String, Array[Float]] = embeds.map {
  case (key, value) => (key, stringToVector(value).map(_.toFloat))
}

val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)

Recommending Similar

embedModel.findSynonyms("ITEM_001_01", 5).foreach(println)
/*
(ITEM_001_03,12.577457221575695)
(ITEM_003_03,12.542920930725996)
(ITEM_003_02,12.315240961298104)
(ITEM_001_02,12.260734177166485)
(ITEM_002_01,10.866897938028856)
*/

Recommending New

val newSentence = stringToVector("")
embedModel.findSynonyms(Vectors.dense(newSentence), 5).foreach(println)
/*
(ITEM_001_02,14.372981084681571)
(ITEM_003_03,14.343473534848325)
(ITEM_001_01,13.83593570884867)
(ITEM_002_01,13.61507040314043)
(ITEM_002_03,13.462141195072414)
*/

Thank you! Questions?

Example source code at: https://github.com/wmeddie/spark-text