
A Movie Recommendation System Based on Spark MLlib and Model-Based Collaborative Filtering (Part 2): Code Implementation

2016-12-24 21:15
Continued from A Movie Recommendation System Based on Spark MLlib and Model-Based Collaborative Filtering (Part 1).
1. Turn off the flood of INFO messages (less log output keeps the shell screen clean):
sc.setLogLevel("WARN")
2. Import the classes from the recommendation package, load the data, and parse it into RDD[Rating] objects.
① Import the recommendation package; recommendation._ means import every class in the recommendation package:
scala> import org.apache.spark.mllib.recommendation._
import org.apache.spark.mllib.recommendation._
② Load the data and pattern-match each line; user, product, and rating need to be converted to Int, Int, and Double respectively:
scala> val data = sc.textFile("/root/cccc.txt").map(_.split(",") match {case Array (user,product,rating) => Rating (user.toInt,product.toInt,rating.toDouble)})
data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating] = MapPartitionsRDD[29] at map at <console>:24
Alternatively (the one-liner in the original session did not compile; binding the split result to a name first fixes it):
val data = sc.textFile("/root/cccc.txt").map{line => val f = line.split(","); Rating(f(0).toInt, f(1).toInt, f(2).toDouble)}
/** If you do not use pattern matching, you can also use an if check (case is essentially another form of if). **/
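A minimal sketch of that if-based variant (the name dataNoMatch and the length check are illustrative, not from the original session); flatMap with Option simply drops malformed lines:
// Parse each "user,product,rating" line with an if check instead of pattern matching (sketch).
val dataNoMatch = sc.textFile("/root/cccc.txt").flatMap { line =>
  val f = line.split(",")
  if (f.length == 3) Some(Rating(f(0).toInt, f(1).toInt, f(2).toDouble))
  else None  // skip lines that do not have exactly three fields
}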
Extra: .first shows the first record of the data, and .count returns the number of records:
scala> data.first
res24: org.apache.spark.mllib.recommendation.Rating = Rating(1,1,5.0)

scala> data.count
res25: Long = 16  
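For reference, the 16 (user, product, rating) records that appear in the res28 output further down suggest that /root/cccc.txt holds the classic ALS sample data, one user,product,rating triple per line (the line order inside the file is an assumption, since the post never shows it):
1,1,5.0
1,2,1.0
1,3,5.0
1,4,1.0
2,1,5.0
2,2,1.0
2,3,5.0
2,4,1.0
3,1,1.0
3,2,5.0
3,3,1.0
3,4,5.0
4,1,1.0
4,2,5.0
4,3,1.0
4,4,5.0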
Part 2: Set the parameters and build the ALS model with the built-in function ALS.train(data, rank, iterations, lambda).
Meaning of each argument: ALS.train(data, number of latent factors, number of iterations, regularization parameter).
In detail:
k: the number of latent factors (written rank here); rank is usually chosen between 8 and 20.
iterations: the number of iterations.
lambda: the regularization parameter, which guards against overfitting. [Rule of thumb] λ is usually increased in roughly 3x steps: 0.01, 0.03, 0.1, 0.3, 1, 3, ...
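A minimal sketch of that rule of thumb (the names lambdas, pairs, m, and preds are illustrative; it scores each candidate on the training data only, which is enough to see the trend but no substitute for a held-out evaluation):
// Sweep lambda over the usual 3x ladder and compare training-set MSE (sketch).
val lambdas = Seq(0.01, 0.03, 0.1, 0.3, 1.0, 3.0)
val pairs = data.map(r => (r.user, r.product))
for (lam <- lambdas) {
  val m = ALS.train(data, 10, 20, lam)                                    // rank = 10, 20 iterations
  val preds = m.predict(pairs).map(r => ((r.user, r.product), r.rating))  // predictions keyed by (user, product)
  val mse = data.map(r => ((r.user, r.product), r.rating))
    .join(preds)
    .map { case (_, (actual, predicted)) => (actual - predicted) * (actual - predicted) }
    .mean()
  println(s"lambda = $lam  training MSE = $mse")
}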
/** To build the ALS model, the recommended parameters are enough: rank = 10, 20 iterations, lambda = 0.01 **/
val rank = 10; val iterations = 20; val lambda = 0.01
val model = ALS.train(data,rank,iterations,lambda)
Or:
val model = ALS.train(data,8,10,0.01)
After it runs you can see the MatrixFactorizationModel:
scala> val model = ALS.train(data,rank,iterations,lambda)
model: org.apache.spark.mllib.recommendation.MatrixFactorizationModel = org.apache.spark.mllib.recommendation.MatrixFactorizationModel@3667e643
So how do you look inside the black box of a MatrixFactorizationModel? (See the last module at the end of this post.)

And what are the dimensions of the rating matrix? (See the second-to-last module.)
Part 3: Make predictions. Part 4: Combine the predicted results with the original ratings.
Yesterday's approach:
val usersProducts = data.map{case Rating(user,product,rating) =>(user,product)}
val ratingAndPredictions = data.map{case Rating(user,product,rating) => ((user,product),rating)}.join(model.predict(usersProducts).map{case Rating(user,product,rating)=>((user,product),rating)})
Looking at the structure helps to understand it:
scala> usersProducts.collect
res27: Array[(Int, Int)] = Array((1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4), (4,1), (4,2), (4,3), (4,4))
The output format is ((user, product), (actual rating, predicted rating)):
scala> ratingAndPredictions.collect
res28: Array[((Int, Int), (Double, Double))] = Array(((1,4),(1.0,0.9999058733819626)), ((3,1),(1.0,0.9998962746677607)), ((2,3),(5.0,4.994863065698205)), ((1,2),(1.0,0.9999058733819626)), ((2,1),(5.0,4.994863065698205)), ((4,4),(5.0,4.994911307454755)), ((1,1),(5.0,4.994863065698205)), ((4,2),(5.0,4.994911307454755)), ((2,2),(1.0,0.9999058733819626)), ((4,1),(1.0,0.9998962746677607)), ((2,4),(1.0,0.9999058733819626)), ((3,2),(5.0,4.994911307454755)), ((3,4),(5.0,4.994911307454755)), ((3,3),(1.0,0.9998962746677607)), ((4,3),(1.0,0.9998962746677607)), ((1,3),(5.0,4.994863065698205)))
Today's in-class approach (same idea as yesterday's):
scala> val usersProducts = data.map{case Rating(user,product,rating) =>(user,product)}
usersProducts: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[210] at map at <console>:26
The key of each pair is itself a tuple:
scala> model.predict(usersProducts).map(x => ((x.user,x.product),x.rating))
res6: org.apache.spark.rdd.RDD[((Int, Int), Double)] = MapPartitionsRDD[219] at map at <console>:31
After the join, the key is a pair and so is the value:
scala> model.predict(usersProducts).map(x => ((x.user,x.product),x.rating)).join(data.map(x => ((x.user,x.product),x.rating)))
res7: org.apache.spark.rdd.RDD[((Int, Int), (Double, Double))] = MapPartitionsRDD[232] at join at <console>:31
Take the first record and look at it; the output format is ((user, product), (predicted rating, actual rating)):
scala> model.predict(usersProducts).map(x => ((x.user,x.product),x.rating)).join(data.map(x => ((x.user,x.product),x.rating))).take(1)
res8: Array[((Int, Int), (Double, Double))] = Array(((1,1),(4.996339908089835,5.0)))
Print one user's actual and predicted values and take their difference, then do the same for all users:
scala> res8(0)._2
res11: (Double, Double) = (4.996339908089835,5.0)
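Carrying that last step (the difference for all users) to its usual conclusion, a small sketch that computes the mean squared error; it assumes the ratingAndPredictions RDD built in the "yesterday" section, whose values are (actual, predicted), and squaring makes the ordering irrelevant anyway:
// Mean squared error over all (user, product) pairs (sketch).
val MSE = ratingAndPredictions.map { case ((user, product), (actual, predicted)) =>
  (actual - predicted) * (actual - predicted)
}.mean()
println(s"Mean Squared Error = $MSE")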
[Second-to-last module] What are the dimensions of the rating matrix?
It is a 4 x 4 matrix (rating matrix = users x products).
Verification:
scala> data.map(x => x.user)
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[210] at map at <console>:27
scala> data.map(x => x.user).distinct.count
res3: Long = 4
scala> data.map(x => x.product).distinct.count
res4: Long = 4
[Last module] How do you look inside the black box of a MatrixFactorizationModel?
Press Tab after model. in the shell to list its members:
scala> model.
asInstanceOf                isInstanceOf                predict                     productFeatures
rank                        recommendProducts           recommendProductsForUsers   recommendUsers
recommendUsersForProducts   save                        toString                    userFeatures
Print the first row of the user features:
scala> model.userFeatures take 1
res1: Array[(Int, Array[Double])] = Array((3,Array(-0.21575616300106049, -0.5715493559837341, 0.012001494877040386, 0.050375282764434814, 0.1884985715150833, 0.6539813280105591, -0.023888511583209038, 0.355787068605423)))
Print the first row of the product features:
scala> model.productFeatures take 1
res2: Array[(Int, Array[Double])] = Array((3,Array(-2.5677483081817627, -1.7736809253692627, -0.8949224948883057, 3.5357284545898438, 1.3151004314422607, -1.8309783935546875, -2.596622943878174, 0.4328916370868683)))
To get user 3's rating of product 3, take the inner product of the two feature vectors.
scala> val user3 = res1
user3: Array[(Int, Array[Double])] = Array((3,Array(-0.21575616300106049, -0.5715493559837341, 0.012001494877040386, 0.050375282764434814, 0.1884985715150833, 0.6539813280105591, -0.023888511583209038, 0.355787068605423)))
scala> val product3 = res2
product3: Array[(Int, Array[Double])] = Array((3,Array(-2.5677483081817627, -1.7736809253692627, -0.8949224948883057, 3.5357284545898438, 1.3151004314422607, -1.8309783935546875, -2.596622943878174, 0.4328916370868683)))
From here it is plain Scala: first take element (0), which is a pair, and ._2 takes the second item of the pair (the feature array).
scala> val user3 = res1(0)._2
user3: Array[Double] = Array(-0.21575616300106049, -0.5715493559837341, 0.012001494877040386, 0.050375282764434814, 0.1884985715150833, 0.6539813280105591, -0.023888511583209038, 0.355787068605423)
scala> val product3 = res2(0)._2
product3: Array[Double] = Array(-2.5677483081817627, -1.7736809253692627, -0.8949224948883057, 3.5357284545898438, 1.3151004314422607, -1.8309783935546875, -2.596622943878174, 0.4328916370868683)
Combine the two arrays with zip:
scala> user3 zip product3 map(x => x._1+x._2) sum
warning: there were 1 feature warning(s); re-run with -feature for details
res4: Double = -3.9307828275486827
Careful: this is not the predicted rating. The map above adds each pair of components, so the result is just the sum of all entries of both vectors. The inner product multiplies the paired components (x._1 * x._2) before summing, and that value should agree with the model's own prediction for user 3 and product 3 (res12.rating); see the sketch below.
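As a quick check on that inner product, a minimal sketch; user3 and product3 are the 8-element feature arrays extracted above, and MatrixFactorizationModel.predict(user, product) is the model's own single-pair prediction:
// Inner product of the user-3 and product-3 feature vectors (sketch).
val manualScore = user3.zip(product3).map { case (u, p) => u * p }.sum
// The model's own prediction for user 3 and product 3.
val modelScore = model.predict(3, 3)
println(s"manual = $manualScore  model = $modelScore")  // the two values should (nearly) coincide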