A Personalized Recommender System Based on Neighborhood Models (Item-Based)
2013-10-04 11:10
Related articles:
A personalized recommender system based on baseline and stochastic gradient descent
A personalized recommender system based on baseline, SVD, and stochastic gradient descent
Please credit the source when reposting: from zh's note, http://blog.csdn.net/wuzh670/
This post covers Section 2.2 (neighborhood models) of Koren's 2008 paper [1]; the remaining sections will be covered in follow-up posts.
Koren's paper uses the Netflix dataset, which is too large: running it on an ordinary PC takes a very long time. Since the goal of this post is mainly to introduce and summarize the method, the MovieLens dataset is used instead.
Variable definitions (for the other variables involved, see the related articles listed above):
The Pearson correlation coefficient is used to measure the correlation between items i and j.
The paper uses a shrunk correlation coefficient: the shrunk Pearson coefficient serves as the similarity between items i and j. The experiments below confirm that shrinkage gives better results.
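The shrinkage step can be sketched in isolation. This is a minimal illustrative function (the name `shrunk_similarity` and the dict-based signature are my own); the item averages are taken over all raters of each item and the shrinkage constant is 100, matching the full implementation later in this post:

```python
from math import sqrt, fabs

def shrunk_similarity(ratings_i, ratings_j, lam=100):
    """Shrunk Pearson similarity between two items.

    ratings_i, ratings_j: {user: rating} dicts for items i and j.
    lam: shrinkage constant (100, as in the implementation below).
    """
    common = set(ratings_i) & set(ratings_j)
    nij = len(common)
    if nij == 0:
        return 0.0
    # item averages over ALL raters, as in the full code below
    avg_i = sum(ratings_i.values()) / len(ratings_i)
    avg_j = sum(ratings_j.values()) / len(ratings_j)
    num = sum((ratings_i[u] - avg_i) * (ratings_j[u] - avg_j) for u in common)
    den_i = sum((ratings_i[u] - avg_i) ** 2 for u in common)
    den_j = sum((ratings_j[u] - avg_j) ** 2 for u in common)
    # default of 1 when the numerator vanishes mirrors the full code
    pearson = num / sqrt(den_i * den_j) if num != 0 else 1.0
    # shrink toward zero: the fewer co-rating users, the smaller the weight
    return fabs(pearson * nij / (nij + lam))
```

With only a handful of co-rating users, nij / (nij + 100) is small, so an accidental high correlation computed from little evidence contributes little to predictions.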
Predicted rating (the baseline estimate plus a weighted average of the user's residuals on neighboring items, as implemented in the code below): r(u,i) = mu + b(u) + b(i) + [ sum_j s(i,j) * (r(u,j) - mu - b(u) - b(j)) ] / [ sum_j s(i,j) ], where j ranges over the items rated by user u.
Evaluation metrics: RMSE and MAE.
The system uses 5-fold cross-validation (the MovieLens dataset ships with the five train/test splits already prepared).
Note: SGD will later be used to train optimal user and item biases; that part will be added in a follow-up.
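For reference, the two metrics over a set of (predicted, actual) rating pairs are computed as follows; this is a standalone sketch (the function name `rmse_mae` is my own), independent of the full implementation below:

```python
from math import sqrt, fabs

def rmse_mae(pairs):
    """Compute RMSE and MAE over (predicted, actual) rating pairs."""
    n = len(pairs)
    # RMSE: root of the mean squared error; penalizes large errors more
    rmse = sqrt(sum((p - a) ** 2 for p, a in pairs) / n)
    # MAE: mean absolute error; treats all errors linearly
    mae = sum(fabs(p - a) for p, a in pairs) / n
    return rmse, mae
```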
Full code:
'''
Created on Dec 16, 2012
@Author: Dennis Wu
@E-mail: hansel.zh@gmail.com
@Homepage: http://blog.csdn.net/wuzh670
@Weibo: http://weibo.com/hansel
Data set download from: http://www.grouplens.org/system/files/ml-100k.zip
'''
from math import sqrt, fabs


def load_data(filename_train, filename_test):
    """Load tab-separated rating files into nested {user: {item: rating}} dicts."""
    train = {}
    test = {}
    for line in open(filename_train):
        userId, itemId, rating, timestamp = line.strip().split('\t')
        train.setdefault(userId, {})
        train[userId][itemId] = float(rating)
    for line in open(filename_test):
        userId, itemId, rating, timestamp = line.strip().split('\t')
        test.setdefault(userId, {})
        test[userId][itemId] = float(rating)
    return train, test


def initialBias(train, userNum, movieNum, mean):
    """Estimate item biases bi and user biases bu with shrinkage
    (regularization constants 25 and 10)."""
    bu = {}
    bi = {}
    biNum = {}
    buNum = {}

    # accumulate residuals (rating - global mean) per item
    for u in range(1, userNum + 1):
        su = str(u)
        for i in train[su].keys():
            bi.setdefault(i, 0)
            biNum.setdefault(i, 0)
            bi[i] += train[su][i] - mean
            biNum[i] += 1

    # shrunk item bias: sum of residuals / (count + 25)
    for i in range(1, movieNum + 1):
        si = str(i)
        biNum.setdefault(si, 0)
        if biNum[si] >= 1:
            bi[si] = bi[si] * 1.0 / (biNum[si] + 25)
        else:
            bi[si] = 0.0

    # accumulate residuals (rating - mean - item bias) per user
    for u in range(1, userNum + 1):
        su = str(u)
        for i in train[su].keys():
            bu.setdefault(su, 0)
            buNum.setdefault(su, 0)
            bu[su] += train[su][i] - mean - bi[i]
            buNum[su] += 1

    # shrunk user bias: sum of residuals / (count + 10)
    for u in range(1, userNum + 1):
        su = str(u)
        buNum.setdefault(su, 0)
        if buNum[su] >= 1:
            bu[su] = bu[su] * 1.0 / (buNum[su] + 10)
        else:
            bu[su] = 0.0

    return bu, bi


def initial(train, userNum, movieNum):
    """Compute the global mean, per-item averages, shrunk item-item
    similarities, and baseline biases."""
    average = {}
    Sij = {}   # Sij[i][j] = list of users who rated both i and j
    mean = 0
    num = 0
    N = {}     # N[i] = number of ratings of item i
    for u in train.keys():
        for i in train[u].keys():
            mean += train[u][i]
            num += 1
            average.setdefault(i, 0)
            average[i] += train[u][i]
            N.setdefault(i, 0)
            N[i] += 1
            Sij.setdefault(i, {})
            for j in train[u].keys():
                if i == j:
                    continue
                Sij[i].setdefault(j, [])
                Sij[i][j].append(u)
    mean = mean / num
    for i in average.keys():
        average[i] = average[i] / N[i]

    # Pearson correlation over co-rating users, shrunk by nij / (nij + 100)
    pearson = {}
    itemSim = {}
    for i in Sij.keys():
        pearson.setdefault(i, {})
        itemSim.setdefault(i, {})
        for j in Sij[i].keys():
            pearson[i][j] = 1   # default when the numerator vanishes
            part1 = 0
            part2 = 0
            part3 = 0
            for u in Sij[i][j]:
                part1 += (train[u][i] - average[i]) * (train[u][j] - average[j])
                part2 += pow(train[u][i] - average[i], 2)
                part3 += pow(train[u][j] - average[j], 2)
            if part1 != 0:
                pearson[i][j] = part1 / sqrt(part2 * part3)
            nij = len(Sij[i][j])
            itemSim[i][j] = fabs(pearson[i][j] * nij / (nij + 100))

    # initialize user and item biases, respectively
    bu, bi = initialBias(train, userNum, movieNum, mean)
    return itemSim, mean, average, bu, bi


def neighborhoodModels(train, test, itemSim, mean, average, bu, bi):
    """Predict test ratings with the item-based neighborhood model and
    report RMSE and MAE."""
    pui = {}
    rmse = 0.0
    mae = 0.0
    num = 0
    for u in test.keys():
        pui.setdefault(u, {})
        for i in test[u].keys():
            # baseline estimate
            pui[u][i] = mean + bu[u] + bi[i]
            stat = 0
            stat2 = 0
            for j in train[u].keys():
                if i in itemSim and j in itemSim[i]:
                    stat += (train[u][j] - mean - bu[u] - bi[j]) * itemSim[i][j]
                    stat2 += itemSim[i][j]
            if stat2 > 0:   # guard against empty neighborhoods
                pui[u][i] += stat * 1.0 / stat2
            rmse += pow(pui[u][i] - test[u][i], 2)
            mae += fabs(pui[u][i] - test[u][i])
            num += 1
    rmse = sqrt(rmse / num)
    mae = mae / num
    return rmse, mae


if __name__ == "__main__":
    sumRmse = 0.0
    sumMae = 0.0
    for i in range(1, 6):
        # load data
        filename_train = 'data/u' + str(i) + '.base'
        filename_test = 'data/u' + str(i) + '.test'
        train, test = load_data(filename_train, filename_test)
        # initialize variables
        itemSim, mean, average, bu, bi = initial(train, 943, 1682)
        # item-based neighborhood model
        rmse, mae = neighborhoodModels(train, test, itemSim, mean, average, bu, bi)
        print('cross-validation %d: rmse: %s mae: %s' % (i, rmse, mae))
        sumRmse += rmse
        sumMae += mae
    print('neighborhood models final results: Rmse: %s Mae: %s' % (sumRmse / 5, sumMae / 5))
Experimental results:
Note: the first result was obtained with the unshrunk Pearson correlation coefficient, the second with the shrunk coefficient.
The comparison of the two runs shows that shrinking the correlation coefficient is both necessary and effective.
REFERENCES
1. Y. Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. Proc. 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'08), pp. 426-434, 2008.
2. Y. Koren. The BellKor Solution to the Netflix Grand Prize. 2009.