
Notes on "Recommender System Practice" (1)

2017-09-09 16:11
Chapter 2

Reading the MovieLens dataset

MovieLens|GroupLens

The dataset contains three .dat files: users, movies, and ratings.

All three are tabular text files whose fields are separated by "::". According to the README, the columns are as follows:

users.dat: UserID::Gender::Age::Occupation::Zip-code

movies.dat: MovieID::Title::Genres

ratings.dat: UserID::MovieID::Rating::Timestamp
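To make the "::" layout concrete, here is a minimal standard-library sketch that splits records in the ratings.dat layout into fields. The helper name `parse_dat_line` and the two sample rows are illustrative assumptions, not part of the original post.

```python
def parse_dat_line(line, sep="::"):
    """Split one '::'-separated record into its raw string fields."""
    return line.rstrip("\n").split(sep)


# Illustrative rows in the ratings.dat layout: UserID::MovieID::Rating::Timestamp
sample = ["1::1193::5::978300760\n", "1::661::3::978302109\n"]

rows = [parse_dat_line(line) for line in sample]

# For TopN recommendation only the (user, movie) pair is needed later on.
user_movie_pairs = [(int(r[0]), int(r[1])) for r in rows]
print(user_movie_pairs)
```

Titles in movies.dat can contain single colons, so splitting on the full two-character separator is the safer choice.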

Read the files with the read_table function from pandas, join the three tables with merge, and save the result as a .csv file.

import math
from operator import itemgetter

import pandas as pd

# A multi-character separator such as '::' is parsed as a regex, which requires engine='python'
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')

# Assumption: movies.dat is not UTF-8 (some titles contain accented characters), hence latin-1
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames,
                       engine='python', encoding='latin-1')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')

# merge joins on the shared column names: user_id first, then movie_id
all_data = pd.merge(pd.merge(ratings, users), movies)
data = all_data[['user_id', 'movie_id']]
data.to_csv('data.csv', index=False)


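For TopN recommendation, as opposed to rating prediction, we do not care what score a user gave a movie, only whether the user interacted with the item at all. So all we need here is each user ID and the IDs of the movies that user rated. We read back the table saved earlier and extract that information. The book's code stores user IDs and movie IDs in a dict, which does not work without some preprocessing: most users rated more than one movie, and each movie was rated by many different users, so neither 'user_id' nor 'movie_id' can serve as a dict key that maps to a single value. I therefore build the inverted table (movie to users) directly: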

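data = pd.read_csv('data.csv')

# Build the inverted table (movie -> users) and the forward table (user -> movies),
# which recommend() needs later
item_user = dict()
user_item = dict()
for user, item in zip(data['user_id'], data['movie_id']):
    item_user.setdefault(item, set()).add(user)
    user_item.setdefault(user, set()).add(item)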
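The point about dict keys can be seen on a handful of rows. The sketch below uses made-up (user_id, movie_id) pairs; it contrasts a plain one-value-per-key dict, which silently overwrites earlier rows, with a dict of sets.

```python
from collections import defaultdict

# Hypothetical (user_id, movie_id) pairs, for illustration only.
pairs = [(1, 10), (1, 20), (2, 10), (3, 20), (3, 30)]

# Naive approach: one value per key, so a later row overwrites an earlier one.
naive = {}
for user, item in pairs:
    naive[user] = item

# Inverted table: each movie maps to the set of users who rated it.
item_users = defaultdict(set)
for user, item in pairs:
    item_users[item].add(user)

print(naive)             # user 1 keeps only movie 20; movie 10 is lost
print(dict(item_users))  # every (user, movie) pair is preserved
```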
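# Compute N(u) and the co-occurrence matrix C[u][v]:
C = {}
N = {}
for i, users in item_user.items():
    for u in users:
        # N[u]: number of movies user u has rated
        N.setdefault(u, 0)
        N[u] += 1
        C.setdefault(u, {})
        for v in users:
            if u == v:
                continue
            # C[u][v]: number of movies rated by both u and v
            C[u].setdefault(v, 0)
            C[u][v] += 1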
# Build the similarity matrix in a new dict, so the counts in C are not overwritten:
W = {}
for u, related_users in C.items():
    W[u] = {}
    for v, cuv in related_users.items():
        W[u][v] = cuv / math.sqrt(N[u] * N[v])
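Written out, the loop above computes the cosine similarity between two users. Here $N(u)$ denotes the set of items user $u$ has interacted with, so the code's `N[u]` is $|N(u)|$ and `C[u][v]` is $|N(u) \cap N(v)|$:

```latex
w_{uv} = \frac{|N(u) \cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}}
```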
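def recommend(user, user_item, W, K):
    rank = {}
    interacted_items = user_item[user]
    # Pick the K most similar users: sort by similarity (the value), not by user id
    for v, wuv in sorted(W.get(user, {}).items(), key=itemgetter(1), reverse=True)[0:K]:
        for i in user_item[v]:
            if i not in interacted_items:
                rank.setdefault(i, 0)
                rank[i] += wuv
    return rank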
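As a sanity check, the whole UserCF pipeline can be run on a tiny hand-checkable dataset. The sketch below is a compact restatement of the steps above under assumed toy data (four users, five items); the helper name `user_similarity` is hypothetical.

```python
import math
from collections import defaultdict
from operator import itemgetter

# Hypothetical toy data: user -> set of items the user interacted with.
user_item = {
    "A": {"a", "b", "d"},
    "B": {"a", "c"},
    "C": {"b", "e"},
    "D": {"c", "d", "e"},
}


def user_similarity(user_item):
    """Cosine similarity between users, computed via the item -> users table."""
    item_users = defaultdict(set)
    for user, items in user_item.items():
        for item in items:
            item_users[item].add(user)
    co = defaultdict(lambda: defaultdict(int))  # co[u][v] = number of shared items
    for users in item_users.values():
        for u in users:
            for v in users:
                if u != v:
                    co[u][v] += 1
    return {
        u: {v: cuv / math.sqrt(len(user_item[u]) * len(user_item[v]))
            for v, cuv in related.items()}
        for u, related in co.items()
    }


def recommend(user, user_item, W, K):
    """Score unseen items by summing the similarity of the K nearest users who have them."""
    rank = defaultdict(float)
    seen = user_item[user]
    neighbours = sorted(W[user].items(), key=itemgetter(1), reverse=True)[:K]
    for v, wuv in neighbours:
        for item in user_item[v]:
            if item not in seen:
                rank[item] += wuv
    return dict(rank)


W = user_similarity(user_item)
rank = recommend("A", user_item, W, K=3)
print(rank)
```

With K=3, user A is recommended items c and e with the same score, 1/sqrt(6) + 1/3: each is held by one neighbour with similarity 1/sqrt(6) plus user D with similarity 1/3.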
# Python's dict.setdefault() is similar to get(), except that when the key is missing it also inserts the key with the default value. Signature: dict.setdefault(key, default=None)
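A short illustration of that difference, using a throwaway dict named `counts`:

```python
counts = {}

# get() returns the default but leaves the dict untouched.
value = counts.get("x", 0)
print(value, "x" in counts)  # 0 False

# setdefault() inserts the key when it is missing, so += is safe afterwards.
counts.setdefault("x", 0)
counts["x"] += 1
print(counts)  # {'x': 1}

# On an existing key, setdefault() returns the stored value and changes nothing.
print(counts.setdefault("x", 100))  # 1
```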

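# For the improved UserCF that penalizes popular items, the only change is to add 1/math.log(1+len(users)) to C[u][v] instead of 1, where users is the set of users who rated the current item. ItemCF needs only minor changes to the UserCF code, so it is not covered in detail here.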
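A minimal sketch of that popularity penalty, assuming a made-up item -> users table in which one item ("hit") was rated by everyone. The function name `user_similarity_iif` is hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical inverted table: item -> set of users who rated it.
item_users = {
    "a": {"A", "B"},
    "b": {"A", "C"},
    "hit": {"A", "B", "C", "D"},  # a popular item rated by every user
}


def user_similarity_iif(item_users):
    """User similarity where each shared item counts 1/log(1 + popularity)."""
    C = defaultdict(lambda: defaultdict(float))
    N = defaultdict(int)
    for users in item_users.values():
        penalty = 1 / math.log(1 + len(users))
        for u in users:
            N[u] += 1
            for v in users:
                if u != v:
                    C[u][v] += penalty
    return {
        u: {v: cuv / math.sqrt(N[u] * N[v]) for v, cuv in related.items()}
        for u, related in C.items()
    }


W = user_similarity_iif(item_users)
print(W["A"]["B"], W["A"]["D"])
```

Users A and D share only the popular item, so their similarity ends up lower than that of A and B, who also share the niche item "a".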

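def ItemSimilarity(train):
    # train maps each user to the collection of items that user interacted with
    C = dict()
    N = dict()
    for u, items in train.items():
        for i in items:
            N.setdefault(i, 0)
            N[i] += 1
            C.setdefault(i, {})
            for j in items:
                if i == j:
                    continue
                C[i].setdefault(j, 0)
                # Penalize very active users: each of their item pairs counts for less
                C[i][j] += 1 / math.log(1 + len(items) * 1.0)
    W = dict()
    for i, related_items in C.items():
        W[i] = {}
        for j, cij in related_items.items():
            W[i][j] = cij / math.sqrt(N[i] * N[j])
    return W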
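The post leaves the ItemCF recommendation step out. One way it could look is sketched below on toy data: for every item the user already has, take its K most similar items and add their similarity to the score of each unseen item. The names `item_similarity` and `recommend_items` and the data are assumptions for illustration, not the author's code.

```python
import math
from collections import defaultdict
from operator import itemgetter

# Hypothetical toy data: user -> set of items.
train = {"A": {"a", "b"}, "B": {"a", "b", "c"}, "C": {"b", "c"}}


def item_similarity(train):
    """Item similarity with the 1/log(1 + |items of user|) penalty for active users."""
    C = defaultdict(lambda: defaultdict(float))
    N = defaultdict(int)
    for items in train.values():
        penalty = 1 / math.log(1 + len(items))
        for i in items:
            N[i] += 1
            for j in items:
                if i != j:
                    C[i][j] += penalty
    return {
        i: {j: cij / math.sqrt(N[i] * N[j]) for j, cij in related.items()}
        for i, related in C.items()
    }


def recommend_items(user, train, W, K):
    """Score unseen items by their similarity to the items the user already has."""
    rank = defaultdict(float)
    seen = train[user]
    for i in seen:
        nearest = sorted(W.get(i, {}).items(), key=itemgetter(1), reverse=True)[:K]
        for j, wij in nearest:
            if j not in seen:
                rank[j] += wij
    return dict(rank)


W = item_similarity(train)
rank = recommend_items("A", train, W, K=2)
print(rank)
```

User A has items a and b, so the only candidate is c, scored as W["a"]["c"] + W["b"]["c"].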

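# Recall
def Recall(train, test, W, K, N):
    # K: number of neighbours used by recommend(); N: length of the recommendation list
    hit = 0
    All = 0
    for user in train.keys():
        tu = test.get(user, set())
        rank = recommend(user, train, W, K)
        # Keep only the N highest-scoring items
        top_n = sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
        for item, pui in top_n:
            if item in tu:
                hit += 1
        All += len(tu)
    return hit / (All * 1.0)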
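# Precision
def Precision(train, test, W, K, N):
    # K: number of neighbours used by recommend(); N: length of the recommendation list
    hit = 0
    All = 0
    for user in train.keys():
        tu = test.get(user, set())
        rank = recommend(user, train, W, K)
        # Keep only the N highest-scoring items, so that All += N is meaningful
        top_n = sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
        for item, pui in top_n:
            if item in tu:
                hit += 1
        All += N
    return hit / (All * 1.0)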
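# Coverage
def Coverage(train, test, W, K, N):
    # K: number of neighbours used by recommend(); N: length of the recommendation list
    recommend_items = set()
    all_items = set()
    for user in train.keys():
        # Iterating directly works whether train[user] is a set or a dict
        for item in train[user]:
            all_items.add(item)
        rank = recommend(user, train, W, K)
        top_n = sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
        for item, pui in top_n:
            recommend_items.add(item)
    return len(recommend_items) / (len(all_items) * 1.0)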
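# Popularity
def Popularity(train, test, W, K, N):
    # K: number of neighbours used by recommend(); N: length of the recommendation list
    item_popularity = dict()
    for user, items in train.items():
        for item in items:
            item_popularity.setdefault(item, 0)
            item_popularity[item] += 1
    ret = 0
    n = 0
    for user in train.keys():
        rank = recommend(user, train, W, K)
        top_n = sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
        for item, pui in top_n:
            # Average the log of popularity, since item popularity is long-tailed
            ret += math.log(1 + item_popularity[item])
            n += 1
    ret /= n * 1.0
    return ret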
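The metrics are easy to check by hand on a tiny example. The sketch below is a simplification: it assumes the top-N lists have already been computed and stored in a dict `recs`, instead of calling recommend() inside each metric. All data and function names here are made up for illustration.

```python
# Hypothetical precomputed top-2 lists, held-out test items, and training items.
recs = {"u1": ["a", "b"], "u2": ["b", "c"]}
test = {"u1": {"a", "x"}, "u2": {"c"}}
train = {"u1": {"c", "d"}, "u2": {"a", "d"}, "u3": {"b"}}


def precision_recall(recs, test):
    """Precision = hits / items recommended; recall = hits / relevant test items."""
    hit = 0
    n_recommended = 0
    n_relevant = 0
    for user, items in recs.items():
        relevant = test.get(user, set())
        hit += sum(1 for item in items if item in relevant)
        n_recommended += len(items)
        n_relevant += len(relevant)
    return hit / n_recommended, hit / n_relevant


def coverage(recs, train):
    """Share of all training items that appear in at least one recommendation list."""
    all_items = set().union(*train.values())
    recommended = set().union(*recs.values())
    return len(recommended) / len(all_items)


precision, recall = precision_recall(recs, test)
cov = coverage(recs, train)
print(precision, recall, cov)
```

Here two of the four recommended items are hits (precision 0.5), two of the three test items are found (recall 2/3), and three of the four training items get recommended (coverage 0.75).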


Reference blog: http://blog.csdn.net/Cherrie3/article/details/52757118?locationNum=2&fps=1