Recommending Music with Apache Spark ML 1.5.X

Transcript of the slide deck "1.5.recommending music with apache spark ml"

Page 1

Recommending Music with Apache Spark ML 1.5.X

[email protected]

Page 2

References

● Advanced Analytics with Spark (2015-04)
● Apache Spark Documentation: ML Programming Guide

Page 3

Why use ML instead of MLlib?

(In Spark 1.5, spark.ml is the newer DataFrame-based Pipeline API, while spark.mllib is the original RDD-based API; the pipeline abstraction makes it easier to chain preprocessing, model fitting, and tuning.)

Ref: End-to-end Data Pipeline with Apache Spark
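As a minimal illustration of the DataFrame-based style (a sketch, not from the slides; trainingDF is a hypothetical DataFrame of (uid, pid, rating) rows):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.recommendation.ALS

// spark.ml estimators consume DataFrames and can be chained in a Pipeline.
val als = new ALS().setUserCol("uid").setItemCol("pid").setRatingCol("rating")
val pipeline = new Pipeline().setStages(Array(als))
val pipelineModel = pipeline.fit(trainingDF)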

Page 4

High-level workflows

Page 5

Datasets preparation (1)

case class ArtistAlias(artAliasId: Long, artistId: Long)
case class UserPlayList(userId: Long, usrAliasId: Long, count: Long)
case class ArtistData(datArtistId: Long, artistName: String)

def parseUserPlay(): DataFrame = {
  // Each line of user_artist_data.txt: "userID artistID playCount", space-separated.
  val rawUA = sc.textFile("file:///tmp/user_artist_data.txt")
  rawUA.map { line =>
    val tokens = line.split(' ')
    UserPlayList(tokens(0).toLong, tokens(1).toLong, tokens(2).toLong)
  }.toDF()
}

userPlayList: DataFrame

userId: bigint usrAliasId: bigint count: bigint
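Note: calling .toDF() on an RDD of case classes requires the SQLContext implicits to be in scope (assumed setup for Spark 1.5, not shown on the slide):

import org.apache.spark.sql.{DataFrame, SQLContext}

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // enables rdd.toDF() and the $"col" syntax used later

val userPlayList: DataFrame = parseUserPlay()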

Page 6

Datasets preparation (2)

def parseArtAli(): DataFrame = {
  // artist_alias.txt maps variant/misspelled artist IDs to canonical artist IDs, tab-separated.
  val rawArtAli = sc.textFile("file:///tmp/artist_alias.txt")
  rawArtAli.flatMap { line =>
    val tokens = line.split('\t')   // tokenize on tab
    if (tokens.size != 2 || tokens(0).isEmpty || tokens(1).isEmpty) {
      None   // flatMap over Option silently drops malformed lines
    } else {
      Some(ArtistAlias(tokens(0).toLong, tokens(1).toLong))
    }
  }.toDF()
}

artistAlias: DataFrame

artAliasId: bigint artistId: bigint

Page 7

Datasets preparation (3)

def parseArtName(): DataFrame = {
  // artist_data.txt: artist ID and artist name, tab-separated.
  val rawArtData = sc.textFile("file:///tmp/artist_data.txt")
  rawArtData.flatMap { line =>
    val (id, name) = line.span(_ != '\t')   // split at the first tab
    if (name.isEmpty) {
      None
    } else {
      try {
        Some(ArtistData(id.toLong, name.trim))
      } catch {
        case e: NumberFormatException => None   // skip lines with a non-numeric ID
      }
    }
  }.toDF()
}

artistData: DataFrame

datArtistId: bigint artistName: string
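With all three parsers defined, the remaining DataFrames used below can be materialized the same way (names taken from the schema boxes on the slides):

val artistAlias: DataFrame = parseArtAli()
val artistData: DataFrame  = parseArtName()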

Page 8

Datasets preparation (4)

artistAlias: DataFrame
  artAliasId: bigint, artistId: bigint

userPlayList: DataFrame
  userId: bigint, usrAliasId: bigint, count: bigint

val userRating = userPlayList.
  join(artistAlias, userPlayList("usrAliasId") === artistAlias("artAliasId"), "left_outer").
  selectExpr(
    "cast(userId as int) as uid",
    "cast(nvl(artistId, usrAliasId) as int) as pid",   // fall back to the raw ID when no alias matches
    "cast(count as double) as rating")
// DataFrame = [uid: int, pid: int, rating: double]
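The nvl(artistId, usrAliasId) keeps the original ID whenever the left outer join finds no canonical artist. An equivalent formulation with the Column API (a sketch under the same schema, not from the slides):

import org.apache.spark.sql.functions.coalesce

val userRatingAlt = userPlayList.
  join(artistAlias, userPlayList("usrAliasId") === artistAlias("artAliasId"), "left_outer").
  select(
    userPlayList("userId").cast("int").as("uid"),
    coalesce(artistAlias("artistId"), userPlayList("usrAliasId")).cast("int").as("pid"),
    userPlayList("count").cast("double").as("rating"))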

Page 9

Pipeline construction (1)

val Array(trainData, cvData, testData) = userRating.randomSplit(Array(0.8, 0.1, 0.1))
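randomSplit is nondeterministic by default; for a reproducible split one can pass an explicit seed (an optional variant, not shown on the slide):

val Array(trainData, cvData, testData) =
  userRating.randomSplit(Array(0.8, 0.1, 0.1), seed = 42L)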

def computeRmse(predictions: RDD[(Double, Double)]) = {
  val dataCnt = predictions.count
  val rmse = math.sqrt(
    predictions.map { case (rating, prediction) =>
      (rating - prediction) * (rating - prediction)
    }.reduce(_ + _) / dataCnt)
  rmse
}
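computeRmse implements the standard root-mean-square error over the (rating, prediction) pairs:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (r_i - \hat{r}_i)^2}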

val als = new ALS().setUserCol("uid").setItemCol("pid").
  setRatingCol("rating").setPredictionCol("prediction")
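Imports assumed throughout these snippets (Spark 1.5; inferred from the code, not shown on the slides):

import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row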

trainData.count = 19,435,724
cvData.count = 2,431,596
testData.count = 2,429,538

Page 10

Pipeline construction (2)

// Grid search over 3 × 3 × 3 × 3 = 81 parameter combinations.
val modelsCv = for (alpha <- Array(1.0, 20.0, 40.0);
                    iter <- Array(5, 20, 40);
                    rank <- Array(1, 20, 40);
                    regParam <- Array(0.1, 1.0, 2.0)) yield {
  val paramMap = ParamMap(als.alpha -> alpha).
    put(als.maxIter -> iter).
    put(als.rank -> rank).
    put(als.regParam -> regParam)
  val model = als.fit(trainData, paramMap)
  val predictions = model.transform(cvData).
    filter(!$"prediction".isNaN && !$"prediction".isNull).
    map { case Row(uid: Int, pid: Int, rating: Double, prediction: Float) =>
      (rating, prediction.toDouble)   // ALSModel emits Float predictions
    }
  val rmse = computeRmse(predictions)
  (model, rmse, rank, alpha, iter, regParam)
}
// : Array[(ALSModel, Double, Int, Double, Int, Double)]

Page 11

Pipeline construction (3)

// modelsCv.size = 81
modelsCv.sortBy { case (_, rmse, _, _, _, _) => rmse }.
  foreach { case (model, rmse, rank, alpha, iter, regParam) =>
    println(f"${rmse},${alpha},${iter},${rank},${regParam}")
  }

NO.  RMSE       alpha  iteration  rank  RegParam
1    102.89456  1.0    20         40    2.0
2    102.89456  20.0   20         40    2.0
3    102.89456  40.0   20         40    2.0
4    103.01164  1.0    40         40    2.0
5    103.01164  20.0   40         40    2.0

Page 12

Pipeline construction (4)

alpha <- Array(1.0, 20.0, 40.0); iter <- Array(5, 20, 40); rank <- Array(1, 20, 40); regParam <- Array(0.1, 1.0, 2.0)

Averaged over the other parameters, RMSE is better when regParam > 1.

Averaged over the other parameters, RMSE is better when rank > 20.

Page 13

Model Evaluation (1)

val modelsTest = modelsCv.map { case (model, rmseCv, rank, alpha, iter, regParam) =>
  val predictions = model.transform(testData).
    filter(!$"prediction".isNaN && !$"prediction".isNull).
    map { case Row(uid: Int, pid: Int, rating: Double, prediction: Float) =>
      (rating, prediction.toDouble)
    }
  val rmseTest = computeRmse(predictions)
  (model, rmseCv, rmseTest, rank, alpha, iter, regParam)
}

// Rank each model by CV RMSE (idxCv), then re-rank by test RMSE (idxTest),
// so the printout shows how the two orderings differ.
modelsTest.sortBy { case (_, rmseCv, _, _, _, _, _) => rmseCv }.
  zipWithIndex.
  sortBy { case ((_, _, rmseTest, _, _, _, _), _) => rmseTest }.
  zipWithIndex.
  foreach { case (((model, rmseCv, rmseTest, rank, alpha, iter, regParam), idxCv), idxTest) =>
    println(f"${idxTest},${idxCv},${rmseTest},${rmseCv},${alpha},${iter},${rank},${regParam}")
  }

Page 14

Model Evaluation (2)

NOTest  NOCV  RMSETest   RMSECV     alpha  iteration  rank  RegParam
1       10    298.59934  108.38374  1.0    40         20    2.0
2       11    298.59934  108.38374  20.0   40         20    2.0
3       12    298.59934  108.38374  40.0   40         20    2.0
4       49    299.02734  118.62548  1.0    40         20    1.0
5       50    299.02734  118.62548  20.0   40         20    1.0
...     ...   ...        ...        ...    ...        ...   ...
16      2     301.45575  102.89456  20.0   20         40    2.0
17      1     301.45575  102.89456  1.0    20         40    2.0
18      3     301.45575  102.89456  40.0   20         40    2.0

Page 15

Model Evaluation (3)

alpha <- Array(1.0, 20.0, 40.0); iter <- Array(5, 20, 40); rank <- Array(1, 20, 40); regParam <- Array(0.1, 1.0, 2.0)

Averaged over the other parameters, RMSE is better when regParam > 1.

Averaged over the other parameters, RMSE is better when rank > 20.

Averaged over the other parameters, RMSE is better when iteration > 20.

Page 16

Parameters tuning (1)

NOTest  NOCV  RMSETest   RMSECV     alpha  iteration  rank  RegParam
1       1     83.698045  74.281254  1.0    20         40    8.0
2       2     83.698045  74.281254  10.0   20         40    8.0
3       4     85.655462  76.109640  10.0   20         20    8.0
4       3     85.655462  76.109640  1.0    20         20    8.0
5       5     87.739403  78.907838  1.0    20         40    4.0
6       6     87.739403  78.907838  10.0   20         40    4.0
7       9     94.220677  90.304407  10.0   20         40    2.0
8       10    94.220677  90.304407  1.0    20         40    2.0
9       7     95.795563  83.918632  1.0    20         20    4.0
10      8     95.795563  83.918632  10.0   20         20    4.0

Page 17

Parameters tuning (2)

alpha <- Array(1.0, 10.0); iter <- Array(20); rank <- Array(20, 40); regParam <- Array(2.0, 4.0, 8.0)

Before tuning              After tuning
RMSECV     RMSETest        RMSECV     RMSETest
102.89456  298.59934       74.281254  83.698045
102.89456  298.59934       74.281254  83.698045
102.89456  298.59934       76.109640  85.655462

Page 18

Future work

● Model parameter tuning
● Cross validation via ml.tuning.CrossValidator (a first attempt appears in the Appendix)

Page 19

Appendix

val evaluator = new RegressionEvaluator().setMetricName("rmse").
  setLabelCol(als.getRatingCol).setPredictionCol(als.getPredictionCol)

val paramGrid = new ParamGridBuilder().
  addGrid(als.rank, Array(1, 20, 40)).
  addGrid(als.maxIter, Array(5, 20, 40)).
  addGrid(als.regParam, Array(0.1, 1.0, 2.0)).
  addGrid(als.alpha, Array(1.0, 20.0, 40.0)).
  build

val cv = new CrossValidator().setEstimator(als).setEstimatorParamMaps(paramGrid).
  setEvaluator(evaluator).setNumFolds(3)

val cvModel = cv.fit(trainData)

/*
15/12/01 13:57:38 INFO storage.BlockManager: Removing RDD 40
java.lang.IllegalArgumentException: requirement failed: Column prediction must be of type DoubleType but was actually FloatType.
  at scala.Predef$.require(Predef.scala:233)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.evaluation.RegressionEvaluator.evaluate(RegressionEvaluator.scala:67)
  at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:94)
*/
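The failure above is CrossValidator rejecting ALSModel's Float prediction column. One possible workaround (a sketch, not from the slides): evaluate outside CrossValidator and cast the prediction column to double before handing it to the evaluator.

import org.apache.spark.ml.evaluation.RegressionEvaluator

val model = als.fit(trainData)
val predDF = model.transform(cvData).
  filter(!$"prediction".isNaN).
  selectExpr("uid", "pid", "rating", "cast(prediction as double) as prediction")
val rmseEval = evaluator.evaluate(predDF)   // evaluator as defined above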