您的位置：首页 > 其它

Coursera《Introduction to Recommender Systems》Program Assignment3 用户相似性计算

2013-10-16 21:51 435 查看

在Coursera上跟了明尼苏达大学《Introduction to Recommender Systems》的课，

课程的编程作业老师提供的模板是J***A，由于主要是用C++，对于J***A只是简单的翻过一本书，

编程作业都是用python 来自己搭建整个框架

由于我是用Python写作业所以会遇到这些问题

发出来是希望给使用其他编程语言的同学提个醒

Program Assignment 3的作业做出的结果更样例输出不一样，debug一遍代码觉得没有问题；

后来根据论坛上的反馈一点点找问题发现是我计算 user 之间的相似性跟模板不一样

但是根据课堂上的公式、wiki资料等我的计算公式不存在问题；

后来我故意改成我认为“错误”的公式发现跟模板输出一致。。。顿感大窘啊

（后来我又仔细看了下视频，助教给出了他那么做的解释，见后文。）

具体讨论我发在了论坛上不过反应平平：

The answer of PA3 is wrong! the error occurred in the templete!

我使用助教的公式 PA3 刷了满分

我贴一下 Python代码应该不“违法”

__author__ = 'LiFeiteng'
# -*- coding: utf-8 -*-
import numpy as np

class   UserUserRec:
	def __init__(self):
		self.U = 0
		self.M = 0
		self.user_dict = {}
		self.movie_dict = {}
		self.movie_title = {}
		self.user_ratings = np.matrix([])

	def GetRatingData(self, ratings_file):
		for line in open(ratings_file):
			user, movie, rating = line.split(",")
			if not self.user_dict.has_key(user):
				self.user_dict[user] = self.U
				self.U += 1
			if not self.movie_dict.has_key(movie):
				self.movie_dict[movie] = self.M
				self.M += 1
		print self.U, self.M
		self.user_ratings = np.matrix(np.zeros([self.U, self.M]))
		for line in open("ratings.csv", "r"):
			user, movie, rating = line.split(",")
			self.user_ratings[self.user_dict[user], self.movie_dict[movie]] = np.double(rating)

	def GetMovieTitles(self, movie_titles_file):
		for line in open(movie_titles_file):
			movie, title = line.split(",")
			#delete '\n'
			self.movie_title[movie] = title[:-1]

	def CosineUserSim(self, user1, user2):
        # 我觉得这里使用的公式是不对的
		user_rat = self.user_ratings[user1,:].copy()
		u1 = user_rat - np.mean(user_rat[user_rat>0.0])
		u1 = np.array(u1)*np.array(np.where(user_rat>0, 1, 0))

		user_rat = self.user_ratings[user2,:].copy()
		u2 = user_rat - np.mean(user_rat[user_rat>0.0])
		u2 = np.array(u2)*np.array(np.where(user_rat>0, 1, 0))

		if (np.linalg.norm(u1[0,:])*np.linalg.norm(u2[0,:])) == 0:
			sim = 0.0
		else: #问题出在这里的norm上，norm会计算user1 user2 不共同评分的项
			sim = np.dot(u1[0,:],u2[0,:])/(np.linalg.norm(u1[0,:])*np.linalg.norm(u2[0,:]))
		return np.double(sim)

	def MovieScore4User(self, user, movie):
        #以下省略 N 行
		return score4movie
# end of class UserUserRec

#### PA3
user_user_rec = UserUserRec()
user_user_rec.GetRatingData("ratings.csv")
user_user_rec.GetMovieTitles("movie-titles.csv")

user_user_rec.MovieScore4User('1024', '77')

outfile = open("outfile.txt","w")
for line in open("input.txt"):
	user, movie = line.split(":")
	movie = str(int(movie))
	score = user_user_rec.MovieScore4User(user, movie)
	str1 = ",".join([user, movie, format(score,".4f"), user_user_rec.movie_title[movie]])
	outfile.write(str1)

outfile.close()

正如论坛里面我的讨论， CosineUserSim(user1, user2)的计算方法是存在问题的。

here the
cosine between the users’ mean-centered rating vectors is same with the Pearson correlation

=====================================================

看下这个帖子，二楼是助教

Pearson vs. Cosine Similarity

助教的解释是：均值化的cosine相似性跟 Pearson相关系数不同——就是除以 norm的问题上
我觉得可能是实验证实或理论分析这样做比较好！或许我教学视频看得不够仔细呢。。。

又回过头看了下视频助教给出了这么做的解释：Video 4.3 24:00分起+
算是对原始用户相关性的一个改进，因为如果是使用相同评分项会导致“平凡情况”，如果两个用户只有一个评分项相同那么sim=1
这样做有些“鲁莽”，他们只有一个相同点而已；改进的方法还有很多其他，下文只是其中一种。

是我太鲁莽了。。。

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航