Movie recommendation based on collaborative filtering method

Zhen Zhu

• PROPOSAL

Online movies have become increasingly popular over the past ten years, and tremendous volumes of movie resources are uploaded to the Internet every day. Recommender systems have also become a vital part of Internet companies [1]. It is therefore both interesting and important to build an intelligent recommender system that can effectively identify users' real interests. For this reason, we chose the rating prediction task on the MovieLens data set as our project for this course.

• RELATED WORK

Such recommendation problems are usually solved by the collaborative filtering (CF) [3] approach, which relies only on information about the behavior of users in the past. There are two primary methods in CF: the neighborhood approach [4] and latent factor
modeling [5]. Both of these methods have been proven to be successful for the recommendation problem. A key step in CF is to combine these models. Methods such as regression, gradient boosting decision trees, and neural networks have been used to ensemble
CF models.

• OUTLINE OF APPROACH

The framework of our approach is shown in Fig. 1.

Fig. 1. The framework of the approach

1. We use the MovieLens data set for our project and transform the data into the form required by the data model.

2. Use several different algorithms to calculate the similarity between users or between items.

3. For the user-based method, perform the neighborhood calculation.

4. Use user-based CF and item-based CF to generate recommendations.

5. Design an evaluator to compare the results of the different approaches.

• SIMILARITY CALCULATION

Euclidean similarity

In this approach, we calculate the similarity between two points from their Euclidean distance. The Euclidean distance between points p and q is the length of the line segment connecting them. In Cartesian coordinates, if p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) are two points in Euclidean n-space, then the distance d from p to q (or from q to p) is given by the following formula:

d(p, q) = d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}

The position of a point in a Euclidean n-space is a Euclidean vector, so p and q are Euclidean vectors, starting from the origin of the space, whose tips indicate the two points. The Euclidean norm (or Euclidean length, or magnitude) of a vector measures the length of the vector:

\|p\| = \sqrt{p_1^2 + p_2^2 + \cdots + p_n^2} = \sqrt{p \cdot p}
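As a concrete illustration, here is a minimal Python sketch that turns this distance into a similarity score. The 1/(1 + d) mapping into (0, 1] is one common convention for distance-based similarity, not necessarily the exact variant used in our experiments:

```python
import math

def euclidean_similarity(p, q):
    """Similarity in (0, 1] derived from the Euclidean distance.

    p, q: equal-length lists of ratings on the items both users rated.
    """
    d = math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))
    return 1.0 / (1.0 + d)  # d = 0 gives 1; larger distances approach 0

# Two users' ratings on the same three movies:
print(euclidean_similarity([5, 3, 4], [4, 3, 5]))  # ~0.414
```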

Log-likelihood similarity

The natural logarithm of the likelihood function, called the log-likelihood, is more convenient to work with for similarity calculation. Because the logarithm is a monotonically increasing function, it achieves its maximum value at the same points as the likelihood itself, so the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques. Finding the maximum of a function often involves taking its derivative and solving for the parameter being maximized, and this is often easier when the function being maximized is a log-likelihood rather than the original likelihood function.

For example, the gamma distribution has two parameters, α and β. The likelihood function for a single observed value x is

L(\alpha, \beta \mid x) = \frac{\beta^{\alpha} x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}

Finding the maximum likelihood estimate of β directly from this expression looks rather daunting. Its logarithm is much simpler to work with:

\ln L(\alpha, \beta \mid x) = \alpha \ln \beta + (\alpha - 1) \ln x - \beta x - \ln \Gamma(\alpha)

where the gamma function is

\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t} \, dt
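Completing the example with a standard calculation: differentiating this log-likelihood with respect to β and setting the derivative to zero yields the estimate in closed form:

\frac{\partial}{\partial \beta} \ln L(\alpha, \beta \mid x) = \frac{\alpha}{\beta} - x = 0 \quad \Longrightarrow \quad \hat{\beta} = \frac{\alpha}{x}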

Pearson Correlation Similarity

Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. When applied to a population, it is commonly represented by the Greek letter ρ and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient. The formula is as follows:

\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}

where cov(X, Y) is the covariance and σ_X, σ_Y are the standard deviations of X and Y.

The formula for ρ can also be expressed in terms of mean and expectation. Since

\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)],

where σ_X and σ_Y are defined as above, μ_X = E[X] is the mean of X, and E[·] is the expectation, the formula for ρ can also be written as

\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}
Cosine similarity

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude:
two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive
space, where the outcome is neatly bounded in [0, 1].

The cosine of the angle between two vectors can be derived from the Euclidean dot product formula:

a \cdot b = \|a\| \, \|b\| \cos\theta

Given two vectors of attributes, A and B, the cosine similarity cos(θ) is represented using a dot product and magnitudes as follows:

\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}

The resulting similarity ranges from -1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating decorrelation and in-between values indicating intermediate similarity or dissimilarity.

For distance calculation, the most common ways are the following (the cosine distance and the angular distance, respectively):

D_C = 1 - \cos(\theta) \quad \text{or} \quad D_\theta = \frac{\arccos(\cos(\theta))}{\pi}
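A minimal Python sketch of the cosine similarity and the cosine distance defined above:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

print(cosine_similarity([5, 0, 3], [4, 0, 4]))  # ~0.970
```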

• NEIGHBORHOOD CALCULATION

Neighborhood calculation is used only for user-based collaborative filtering. We use it to select the top-N most similar users for the final recommendation.

Fixed-size neighborhoods (K-neighborhoods)

K-neighborhoods means that the recommendations are derived from a neighborhood of the K most similar users, so we need to choose a suitable K. If K is smaller, the recommendation is based on fewer similar users and excludes some less-similar users from consideration. If K is larger, the recommendation is based on more users but may include some less-similar ones. Only a suitable K can improve the effect of the recommendation.

Fig. 2. An illustration of defining a neighborhood of most similar users by picking a fixed number of closest neighbors. Distance illustrates similarity: farther means less similar.
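A minimal sketch of this selection in Python, assuming a dict-of-dicts rating structure {user: {item: rating}} and any of the pairwise similarity functions above (both are illustrative assumptions, not a fixed API):

```python
def k_neighborhood(target, ratings, similarity, k=2):
    """Return the k users most similar to `target`.

    ratings: {user: {item: rating}}; similarity is computed over
    the items that `target` and the other user have both rated.
    """
    scores = []
    for user, prefs in ratings.items():
        if user == target:
            continue
        common = set(ratings[target]) & set(prefs)  # co-rated items
        if not common:
            continue  # no overlap, similarity undefined
        a = [ratings[target][i] for i in common]
        b = [prefs[i] for i in common]
        scores.append((similarity(a, b), user))
    scores.sort(reverse=True)  # most similar first
    return [user for _, user in scores[:k]]
```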

Threshold-based neighborhood

What if we don't want to build a neighborhood of the n most similar users, but rather try to pick the "pretty similar" users and ignore everyone else? We can pick a similarity threshold and take any users that are at least that similar. The threshold should be between -1 and 1, since all of the similarity metrics above return values in this range.

Fig. 3. An illustration of defining a neighborhood of most-similar users with a similarity threshold
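The threshold variant is the same computation with a cutoff instead of a fixed count (a sketch reusing the structure above):

```python
def threshold_neighborhood(target, ratings, similarity, threshold=0.5):
    """Return every user whose similarity to `target` is at least `threshold`."""
    neighbors = []
    for user, prefs in ratings.items():
        if user == target:
            continue
        common = set(ratings[target]) & set(prefs)
        if common:
            a = [ratings[target][i] for i in common]
            b = [prefs[i] for i in common]
            if similarity(a, b) >= threshold:
                neighbors.append(user)
    return neighbors
```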

• COLLABORATIVE FILTERING

Collaborative filtering (CF) approaches assume that those who agreed in the past tend to agree again in the future. For example, a collaborative filtering recommendation system for movie tastes could make predictions about which movies a user should like, given a partial list of that user's tastes (likes or dislikes). CF methods have two important steps: first, CF collects taste information from many users; second, it uses the information gleaned from those users to predict a user's interests and recommend items to that user. Researchers have devised a number of collaborative filtering algorithms, which can mainly be divided into two categories: user-based and item-based algorithms.

User-based CF

User-based CF is also called nearest-neighbor-based collaborative filtering; it utilizes the entire user-item data set to generate a prediction. These systems use statistical techniques to find a user's nearest neighbors, i.e., users who have similar preferences. Once the nearest neighborhood of users is found, these systems combine the preferences of the neighbors to produce a prediction or a top-N recommendation for the target user, as sketched below. These techniques are popular and widely used in practice.
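A minimal sketch of the combination step, using a similarity-weighted average of the neighbors' ratings; this is one common way to combine neighbor preferences, not necessarily the exact formula used in our experiments:

```python
def predict_rating(target, item, neighbors, ratings, sims):
    """Predict `target`'s rating for `item` from neighbors who rated it.

    sims: {user: similarity to target}, as produced by the neighborhood step.
    """
    num = den = 0.0
    for u in neighbors:
        if item in ratings[u]:
            num += sims[u] * ratings[u][item]
            den += abs(sims[u])
    return num / den if den > 0 else None  # None: no neighbor rated the item
```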

User-based CF algorithms have been popular and successful for several years, but their widespread use has revealed some potential challenges. In practice, a user typically has purchased only a small percentage of all items, perhaps 1% of 2 million items, so a recommender system based on nearest-neighbor algorithms may be unable to make any item recommendations for a particular user, and the accuracy of its recommendations may be poor. Nearest-neighbor algorithms also require computation that grows with the number of users. With millions of users, a typical web-based recommender system running the existing algorithms will suffer serious scalability problems.

Item-based CF

Unlike the user-based CF algorithm discussed above, the item-based approach focuses on the items that the target user has rated: it calculates the similarity between those items and each candidate item and then selects the k most similar items. Weighting candidates by those similarities, we find the items most relevant to the target user among the items he or she has not yet used.

One critical step in the item-based collaborative filtering algorithm is to compute the similarity between items. The basic idea in computing the similarity between two items i and j is to build a co-occurrence matrix. Then, a dot product of this matrix with the target user's rating vector yields a score for each item, and we recommend the items with the highest scores.
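A minimal sketch of this idea on the dict-of-dicts structure used above: build the item-item co-occurrence counts, take their dot product with the target user's ratings, and rank the unseen items by score (an illustrative implementation, not the exact one used in our experiments):

```python
from collections import defaultdict
from itertools import combinations

def item_based_recommend(target, ratings, top_n=5):
    """Recommend top_n unseen items for `target` via co-occurrence counts."""
    cooc = defaultdict(lambda: defaultdict(int))
    for prefs in ratings.values():
        for i, j in combinations(prefs, 2):  # every co-rated item pair
            cooc[i][j] += 1
            cooc[j][i] += 1
    scores = defaultdict(float)
    for i, r in ratings[target].items():     # dot product with user's ratings
        for j, count in cooc[i].items():
            if j not in ratings[target]:     # score only unseen items
                scores[j] += count * r
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```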

• EXPERIMENT

Data sets

This data set was collected by the GroupLens Research Project at the University of Minnesota and is well suited for recommendation research. It consists of 100,000 ratings (1-5) from 943 users on 1,682 movies; each user has rated at least 20 movies and is described by attributes (age, gender, occupation). The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from the data set.
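The ratings file u.data in this release is tab-separated with the fields user id, item id, rating, and timestamp. A minimal loader into the {user: {item: rating}} structure used in the sketches above:

```python
def load_movielens(path="u.data"):
    """Parse MovieLens 100K ratings into {user_id: {movie_id: rating}}."""
    ratings = {}
    with open(path) as f:
        for line in f:
            user, item, rating, _timestamp = line.split("\t")
            ratings.setdefault(int(user), {})[int(item)] = float(rating)
    return ratings
```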

Evaluation

We use the precision value and the recall value to measure the results:

P = \frac{C}{T}, \qquad R = \frac{C}{N}

where C is the number of items that the target user likes and that were identified by the system, N is the total number of items the user likes in the test data, and T is the number of items identified (recommended) by the system.
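A minimal sketch of this evaluator, following the definitions of C, N, and T above:

```python
def precision_recall(recommended, liked):
    """recommended: items the system identified for the user;
    liked: items the user actually likes in the test data."""
    c = len(set(recommended) & set(liked))  # C: liked items the system found
    t = len(recommended)                    # T: number identified by the system
    n = len(liked)                          # N: total the user likes in test data
    return (c / t if t else 0.0,            # precision
            c / n if n else 0.0)            # recall

print(precision_recall([1, 2, 3, 4, 5], [2, 5, 9]))  # (0.4, 0.666...)
```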

Results

We run experiments with both user-based CF and item-based CF. The neighborhood size K is set to 2, the neighborhood threshold to 0.5, and the number of recommendations to 5. The training data accounts for 70% of the ratings and the test data for 30%. The results are shown in Table I and Table II.

TABLE I. EVALUATION ON USER-BASED CF

Similarity algorithm             Rating type     Neighborhood calculation   Precision   Recall    Difference
Euclidean similarity             Score (1-10)    K-neighborhoods            0.1541      0.1440    0.0739
Log-likelihood similarity        Score (1-10)    K-neighborhoods            0.1131      0.1362    0.8677
Cosine similarity                Score (1-10)    K-neighborhoods            0.1501      0.1206    1.1111
Pearson correlation similarity   Score (1-10)    K-neighborhoods            0.1215      0.1073    0.5506
Euclidean similarity             Boolean (0,1)   K-neighborhoods            0.0813      0.1066    2.8611
Log-likelihood similarity        Boolean (0,1)   K-neighborhoods            0.1140      0.1311    2.3249
Cosine similarity                Boolean (0,1)   K-neighborhoods            0.0969      0.1166    2.6268
Pearson correlation similarity   Boolean (0,1)   K-neighborhoods            0.0726      0.0754    2.6168
Euclidean similarity             Score (1-10)    Threshold-based            0.0306      0.0424    0.5041
Log-likelihood similarity        Score (1-10)    Threshold-based            0.0036      0.0036    0.8120
Cosine similarity                Score (1-10)    Threshold-based            0.0089      0.0086    0.8117
Pearson correlation similarity   Score (1-10)    Threshold-based            0.0411      0.0325    0.6221
Euclidean similarity             Boolean (0,1)   Threshold-based            0.2091      0.2643    6.8776
Log-likelihood similarity        Boolean (0,1)   Threshold-based            0.1065      0.1334    82.5830
Cosine similarity                Boolean (0,1)   Threshold-based            0.2244      0.2837    85.7349
Pearson correlation similarity   Boolean (0,1)   Threshold-based            0.1794      0.1814    5.8100

(Threshold-based = threshold-based neighborhood; values rounded to four decimal places.)

TABLE II. EVALUATION ON ITEM-BASED CF

Similarity algorithm             Rating type     Precision   Recall    Difference
Euclidean similarity             Score (1-10)    0.0018      0.0018    0.7946
Log-likelihood similarity        Score (1-10)    0.0000      0.0000    0.8183
Cosine similarity                Score (1-10)    0.0000      0.0000    0.8304
Pearson correlation similarity   Score (1-10)    0.0087      0.0092    0.6706
Euclidean similarity             Boolean (0,1)   0.0026      0.0026    47.1326
Log-likelihood similarity        Boolean (0,1)   0.1252      0.1433    91.8503
Cosine similarity                Boolean (0,1)   0.0722      0.0741    106.2615
Pearson correlation similarity   Boolean (0,1)   0.0015      0.0015    0.7373
Slope One                        Score (1-10)    0.0015      0.0015    0.7373

(Values rounded to four decimal places.)

Unexpectedly, the experimental results of the user-based methods are mostly better than those of the item-based methods. Predictions on Boolean data outperform predictions on score data in most settings (the exception is user-based CF with K-neighborhoods). Choosing neighborhoods by threshold works better than using a fixed size, and cosine similarity works better than the other similarity measures: precision P reaches about 22.4% and recall R reaches about 28.4%.

• CONCLUSIONS

We find that the similarity algorithms and CF approaches are sensitive to the data set. The item-based method does better in many application scenarios; however, on this data set its results are lower than those of the user-based method. So we may need to use different methods on different data sets. If possible, we should test several approaches and make the decision based on the experimental results.

REFERENCES

[1] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), pages 287-296, New York, NY, USA, 2011. ACM.

[2] Y. Niu, Y. Wang, G. Sun, A. Yue, B. Dalessandro, C. Perlich, and B. Hamner. The Tencent dataset and KDD-Cup'12. In KDD-Cup Workshop, 2012.

[3] Y. Niu, Y. Wang, G. Sun, A. Yue, B. Dalessandro, C. Perlich, and B. Hamner. The Tencent dataset and KDD-Cup'12. In KDD-Cup Workshop, 2012.

[4] Y. Koren. The BellKor solution to the Netflix Grand Prize. Technical report, 2009.

[5] M. Piotte and M. Chabbert. The Pragmatic Theory solution to the Netflix Grand Prize. Technical report, 2009.

[6] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 (NIPS '07), pages 1257-1264, 2008.

[7] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30-37, 2009.