
A Movie Recommendation System Based on Spark MLlib and Model-Based Collaborative Filtering (Part 2): Code Implementation

2016-12-24 21:15
Continued from A Movie Recommendation System Based on Spark MLlib and Model-Based Collaborative Filtering (Part 1).
1. Turn off the flood of INFO messages (less log output keeps the shell screen clean):
sc.setLogLevel("WARN")
2. Import the classes from the recommendation package, load the data, and parse it into RDD[Rating] objects.
① Import the recommendation package; recommendation._ means import every class in the recommendation package:
scala> import org.apache.spark.mllib.recommendation._
import org.apache.spark.mllib.recommendation._
② Load the data and pattern-match each line; user, product, and rating need to be converted to Int, Int, and Double respectively:
scala> val data = sc.textFile("/root/cccc.txt").map(_.split(",") match {case Array (user,product,rating) => Rating (user.toInt,product.toInt,rating.toDouble)})
data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating] = MapPartitionsRDD[29] at map at <console>:24
Alternatively (the one-liner in the original session did not compile; binding the split result to a name first fixes it):
val data = sc.textFile("/root/cccc.txt").map{line => val f = line.split(","); Rating(f(0).toInt, f(1).toInt, f(2).toDouble)}
/** If you do not use pattern matching, you can also use an if check (case is essentially another form of if). **/
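A minimal sketch of that if-based variant (the name dataNoMatch and the length check are illustrative, not from the original session); flatMap with Option simply drops malformed lines:
// Parse each "user,product,rating" line with an if check instead of pattern matching (sketch).
val dataNoMatch = sc.textFile("/root/cccc.txt").flatMap { line =>
  val f = line.split(",")
  if (f.length == 3) Some(Rating(f(0).toInt, f(1).toInt, f(2).toDouble))
  else None  // skip lines that do not have exactly three fields
}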
Extra: .first shows the first record of the data, and .count returns the number of records:
scala> data.first
res24: org.apache.spark.mllib.recommendation.Rating = Rating(1,1,5.0)

scala> data.count
res25: Long = 16  
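For reference, the 16 (user, product, rating) records that appear in the res28 output further down suggest that /root/cccc.txt holds the classic ALS sample data, one user,product,rating triple per line (the line order inside the file is an assumption, since the post never shows it):
1,1,5.0
1,2,1.0
1,3,5.0
1,4,1.0
2,1,5.0
2,2,1.0
2,3,5.0
2,4,1.0
3,1,1.0
3,2,5.0
3,3,1.0
3,4,5.0
4,1,1.0
4,2,5.0
4,3,1.0
4,4,5.0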
Part 2: Set the parameters and build the ALS model with the built-in function ALS.train(data, rank, iterations, lambda).
Meaning of each argument: ALS.train(data, number of latent factors, number of iterations, regularization parameter).
In detail:
k: the number of latent factors (written rank here); rank is usually chosen between 8 and 20.
iterations: the number of iterations.
lambda: the regularization parameter, which guards against overfitting. [Rule of thumb] λ is usually increased in roughly 3x steps: 0.01, 0.03, 0.1, 0.3, 1, 3, ...
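A minimal sketch of that rule of thumb (the names lambdas, pairs, m, and preds are illustrative; it scores each candidate on the training data only, which is enough to see the trend but no substitute for a held-out evaluation):
// Sweep lambda over the usual 3x ladder and compare training-set MSE (sketch).
val lambdas = Seq(0.01, 0.03, 0.1, 0.3, 1.0, 3.0)
val pairs = data.map(r => (r.user, r.product))
for (lam <- lambdas) {
  val m = ALS.train(data, 10, 20, lam)                                    // rank = 10, 20 iterations
  val preds = m.predict(pairs).map(r => ((r.user, r.product), r.rating))  // predictions keyed by (user, product)
  val mse = data.map(r => ((r.user, r.product), r.rating))
    .join(preds)
    .map { case (_, (actual, predicted)) => (actual - predicted) * (actual - predicted) }
    .mean()
  println(s"lambda = $lam  training MSE = $mse")
}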
/** To build the ALS model, the recommended parameters are enough: rank = 10, 20 iterations, lambda = 0.01 **/
val rank = 10; val iterations = 20; val lambda = 0.01
val model = ALS.train(data,rank,iterations,lambda)
Or:
val model = ALS.train(data,8,10,0.01)
After it runs you can see the MatrixFactorizationModel:
scala> val model = ALS.train(data,rank,iterations,lambda)
model: org.apache.spark.mllib.recommendation.MatrixFactorizationModel = org.apache.spark.mllib.recommendation.MatrixFactorizationModel@3667e643
So how do you look inside the black box of a MatrixFactorizationModel? (See the last module at the end of this post.)

And what are the dimensions of the rating matrix? (See the second-to-last module.)
Part 3: Make predictions. Part 4: Combine the predicted results with the original ratings.
Yesterday's approach:
val usersProducts = data.map{case Rating(user,product,rating) =>(user,product)}
val ratingAndPredictions = data.map{case Rating(user,product,rating) => ((user,product),rating)}.join(model.predict(usersProducts).map{case Rating(user,product,rating)=>((user,product),rating)})
Looking at the structure helps to understand it:
scala> usersProducts.collect
res27: Array[(Int, Int)] = Array((1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4), (4,1), (4,2), (4,3), (4,4))
The output format is ((user, product), (actual rating, predicted rating)):
scala> ratingAndPredictions.collect
res28: Array[((Int, Int), (Double, Double))] = Array(((1,4),(1.0,0.9999058733819626)), ((3,1),(1.0,0.9998962746677607)), ((2,3),(5.0,4.994863065698205)), ((1,2),(1.0,0.9999058733819626)), ((2,1),(5.0,4.994863065698205)), ((4,4),(5.0,4.994911307454755)), ((1,1),(5.0,4.994863065698205)), ((4,2),(5.0,4.994911307454755)), ((2,2),(1.0,0.9999058733819626)), ((4,1),(1.0,0.9998962746677607)), ((2,4),(1.0,0.9999058733819626)), ((3,2),(5.0,4.994911307454755)), ((3,4),(5.0,4.994911307454755)), ((3,3),(1.0,0.9998962746677607)), ((4,3),(1.0,0.9998962746677607)), ((1,3),(5.0,4.994863065698205)))
Today's in-class approach (same idea as yesterday's):
scala> val usersProducts = data.map{case Rating(user,product,rating) =>(user,product)}
usersProducts: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[210] at map at <console>:26
The key of each pair is itself a tuple:
scala> model.predict(usersProducts).map(x => ((x.user,x.product),x.rating))
res6: org.apache.spark.rdd.RDD[((Int, Int), Double)] = MapPartitionsRDD[219] at map at <console>:31
After the join, the key is a pair and so is the value:
scala> model.predict(usersProducts).map(x => ((x.user,x.product),x.rating)).join(data.map(x => ((x.user,x.product),x.rating)))
res7: org.apache.spark.rdd.RDD[((Int, Int), (Double, Double))] = MapPartitionsRDD[232] at join at <console>:31
Take the first record and look at it; the output format is ((user, product), (predicted rating, actual rating)):
scala> model.predict(usersProducts).map(x => ((x.user,x.product),x.rating)).join(data.map(x => ((x.user,x.product),x.rating))).take(1)
res8: Array[((Int, Int), (Double, Double))] = Array(((1,1),(4.996339908089835,5.0)))
Print one user's actual and predicted values and take their difference, then do the same for all users:
scala> res8(0)._2
res11: (Double, Double) = (4.996339908089835,5.0)
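Carrying that last step (the difference for all users) to its usual conclusion, a small sketch that computes the mean squared error; it assumes the ratingAndPredictions RDD built in the "yesterday" section, whose values are (actual, predicted), and squaring makes the ordering irrelevant anyway:
// Mean squared error over all (user, product) pairs (sketch).
val MSE = ratingAndPredictions.map { case ((user, product), (actual, predicted)) =>
  (actual - predicted) * (actual - predicted)
}.mean()
println(s"Mean Squared Error = $MSE")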
[Second-to-last module] What are the dimensions of the rating matrix?
It is a 4 x 4 matrix (rating matrix = users x products).
Verification:
scala> data.map(x => x.user)
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[210] at map at <console>:27
scala> data.map(x => x.user).distinct.count
res3: Long = 4
scala> data.map(x => x.product).distinct.count
res4: Long = 4
[Last module] How do you look inside the black box of a MatrixFactorizationModel?
Press Tab after model. in the shell to list its members:
scala> model.
asInstanceOf                isInstanceOf                predict                     productFeatures
rank                        recommendProducts           recommendProductsForUsers   recommendUsers
recommendUsersForProducts   save                        toString                    userFeatures
Print the first row of the user features:
scala> model.userFeatures take 1
res1: Array[(Int, Array[Double])] = Array((3,Array(-0.21575616300106049, -0.5715493559837341, 0.012001494877040386, 0.050375282764434814, 0.1884985715150833, 0.6539813280105591, -0.023888511583209038, 0.355787068605423)))
Print the first row of the product features:
scala> model.productFeatures take 1
res2: Array[(Int, Array[Double])] = Array((3,Array(-2.5677483081817627, -1.7736809253692627, -0.8949224948883057, 3.5357284545898438, 1.3151004314422607, -1.8309783935546875, -2.596622943878174, 0.4328916370868683)))
To get user 3's rating of product 3, take the inner product of the two feature vectors.
scala> val user3 = res1
user3: Array[(Int, Array[Double])] = Array((3,Array(-0.21575616300106049, -0.5715493559837341, 0.012001494877040386, 0.050375282764434814, 0.1884985715150833, 0.6539813280105591, -0.023888511583209038, 0.355787068605423)))
scala> val product3 = res2
product3: Array[(Int, Array[Double])] = Array((3,Array(-2.5677483081817627, -1.7736809253692627, -0.8949224948883057, 3.5357284545898438, 1.3151004314422607, -1.8309783935546875, -2.596622943878174, 0.4328916370868683)))
From here it is plain Scala: first take element (0), which is a pair, and ._2 takes the second item of the pair (the feature array).
scala> val user3 = res1(0)._2
user3: Array[Double] = Array(-0.21575616300106049, -0.5715493559837341, 0.012001494877040386, 0.050375282764434814, 0.1884985715150833, 0.6539813280105591, -0.023888511583209038, 0.355787068605423)
scala> val product3 = res2(0)._2
product3: Array[Double] = Array(-2.5677483081817627, -1.7736809253692627, -0.8949224948883057, 3.5357284545898438, 1.3151004314422607, -1.8309783935546875, -2.596622943878174, 0.4328916370868683)
Combine the two arrays with zip:
scala> user3 zip product3 map(x => x._1+x._2) sum
warning: there were 1 feature warning(s); re-run with -feature for details
res4: Double = -3.9307828275486827
Careful: this is not the predicted rating. The map above adds each pair of components, so the result is just the sum of all entries of both vectors. The inner product multiplies the paired components (x._1 * x._2) before summing, and that value should agree with the model's own prediction for user 3 and product 3 (res12.rating); see the sketch below.
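As a quick check on that inner product, a minimal sketch; user3 and product3 are the 8-element feature arrays extracted above, and MatrixFactorizationModel.predict(user, product) is the model's own single-pair prediction:
// Inner product of the user-3 and product-3 feature vectors (sketch).
val manualScore = user3.zip(product3).map { case (u, p) => u * p }.sum
// The model's own prediction for user 3 and product 3.
val modelScore = model.predict(3, 3)
println(s"manual = $manualScore  model = $modelScore")  // the two values should (nearly) coincide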