Making Recommendations #
First, prepare a dataset of critics and the ratings each of them gave to a handful of movies:

critics = {
    'Lisa Rose': {
        'Lady in the Water': 2.5,
        'Snakes on a Plane': 3.5,
        'Just My Luck': 3.0,
        'Superman Returns': 3.5,
        'You, Me and Dupree': 2.5,
        'The Night Listener': 3.0,
    },
    # ... ratings for the remaining critics in the dataset are omitted here ...
    'Toby': {
        'Snakes on a Plane': 4.5,
        'You, Me and Dupree': 1.0,
        'Superman Returns': 4.0,
    },
}
Euclidean Distance Score #

The Euclidean distance score is a good way to measure how far apart two preference vectors are. Only the items that both people have rated enter the calculation, with a couple of small adjustments noted below.
from math import sqrt


def sim_distance(prefs, p1, p2):
    '''
    Returns a distance-based similarity score for person1 and person2.
    '''
    # Get the list of shared_items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    # If they have no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[p1][item] - prefs[p2][item], 2)
                          for item in prefs[p1] if item in prefs[p2]])
    return 1 / (1 + sqrt(sum_of_squares))
Taking the reciprocal of 1 plus the distance avoids any division by zero and keeps very dissimilar users from producing huge values, so the score always lands between 0 and 1, where 1 means the two people have identical preferences.
>>> recommendations.sim_distance(recommendations.critics,"Toby","Lisa Rose")
0.3483314773547883
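To see where that number comes from: Toby and Lisa Rose share three rated movies, with rating differences 4.5 - 3.5 = 1.0, 4.0 - 3.5 = 0.5, and 1.0 - 2.5 = -1.5. The sum of squares is 1 + 0.25 + 2.25 = 3.5, and 1 / (1 + sqrt(3.5)) ≈ 0.3483, which matches the output above.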
Pearson Correlation Score #

The Pearson correlation coefficient measures how similar two people's ratings are, and it tends to give better results than the Euclidean score when the data is not well normalized.

Picture a chart whose two axes are the two people, with one data point for every movie they have both rated. We try to draw a line that comes as close as possible to all of these points, called the best-fit line. If both people had rated every movie identically, all the points would sit on a diagonal line with slope 1, a perfect correlation. Unlike the Euclidean score, this approach corrects for grade inflation, where one critic consistently scores higher than the other. The score is computed as

r = (Σxy - Σx·Σy/n) / sqrt((Σx² - (Σx)²/n) · (Σy² - (Σy)²/n))

(See also: using regression analysis to understand complex relationships.)
def sim_pearson(prefs, p1, p2):
    '''
    Returns the Pearson correlation coefficient for p1 and p2.
    '''
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    # If they have no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Number of shared items
    n = len(si)
    # Sums of all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    # Sums of the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # Calculate r (Pearson score)
    num = pSum - sum1 * sum2 / n
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    r = num / den
    return r
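As a quick check with the data shown above (only the Toby and Lisa Rose entries are needed), the Pearson score between them comes out at roughly 0.99, the value reused in the weighted-scoring example further down:

# Pearson similarity between Toby and Lisa Rose
print(sim_pearson(critics, 'Toby', 'Lisa Rose'))  # ≈ 0.9912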
Ranking Users #

Now that we have similarity functions, we can score every other user against a given user and return the ones who are most similar to them.
def topMatches(prefs, person, n=5, similarity=sim_pearson):
    '''
    Returns the best matches for person from the prefs dictionary.
    Number of results and similarity function are optional params.
    '''
    # Compute the similarity between person and every other user, then sort
    # the results so the closest matches come first
    scores = [(similarity(prefs, person, other), other) for other in prefs
              if other != person]
    scores.sort()
    scores.reverse()
    return scores[0:n]
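For example, with the full critics dictionary (not just the two entries shown above), Toby's three closest critics can be listed like this:

# The three critics whose tastes are closest to Toby's, by Pearson score
print(topMatches(critics, 'Toby', n=3))
# Lisa Rose tops the list at roughly 0.99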
Getting Recommendations #

We could simply take the movies that similar users have watched but this user has not and recommend those directly, but that would be too arbitrary. Instead, a weighted scoring scheme is used: every rating from another user is multiplied by that user's similarity to the target user. Taking the first row as an example, Toby and Rose have a similarity of 0.99, so each movie Rose has rated gets a weighted score such as 3.0 * 0.99 = 2.97.
def getRecommendations(prefs, person, similarity=sim_pearson):
    '''
    Gets recommendations for a person by using a weighted average
    of every other user's rankings
    '''
    totals = {}
    simSums = {}
    for other in prefs:
        # Don't compare me to myself
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        # Ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in prefs[other]:
            # Only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * Score
                totals.setdefault(item, 0)
                # The final score is calculated by multiplying each item by the
                # similarity and adding these products together
                totals[item] += prefs[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim
    # Create the normalized list
    rankings = [(total / simSums[item], item) for (item, total) in
                totals.items()]
    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings
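Getting ranked suggestions for Toby is then a single call; each entry pairs a predicted rating with a movie he has not rated yet (the concrete ordering below assumes the full critics dataset):

# Weighted, normalized recommendations for Toby, best first
for score, movie in getRecommendations(critics, 'Toby'):
    print(movie, round(score, 2))
# Expected order: The Night Listener, then Lady in the Water, then Just My Luck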
To recommend similar items instead of similar people, do the reverse: swap people and items in the dataset, so the same functions that measured person-to-person similarity now measure item-to-item similarity.
def transformPrefs(prefs):
    '''
    Transform the recommendations into a mapping where persons are described
    with interest scores for a given title e.g. {title: person} instead of
    {person: title}.
    '''
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            # Flip item and person
            result[item][person] = prefs[person][item]
    return result
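Once the mapping is flipped, the functions written earlier work on items without any changes; for example, the movies whose rating pattern is closest to Superman Returns (again assuming the full critics dataset) can be found with topMatches:

# Flip {person: {movie: rating}} into {movie: {person: rating}}
movies = transformPrefs(critics)
# Each movie is now described by its critics' ratings, so topMatches compares movies
print(topMatches(movies, 'Superman Returns'))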
Exercises #
Tanimoto Similarity (Tanimoto Coefficient) #

For reference:

Tanimoto formula: T(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)
Jaccard formula: J(A, B) = |A ∩ B| / |A ∪ B|

The Tanimoto score ignores the exact rating a user gave an item; it only cares whether an association between the user and the item exists at all, so it operates on sets of items rather than on numeric scores.
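A minimal sketch of how this could look on the same prefs dictionary, treating each person's rated items as a set (the name sim_tanimoto and the set-based reading are my own choices for illustration, not from the original code):

def sim_tanimoto(prefs, p1, p2):
    '''
    Set-based Tanimoto similarity: |A ∩ B| / (|A| + |B| - |A ∩ B|).
    Looks only at which items each person has rated, not at the rating values.
    '''
    a = set(prefs[p1])
    b = set(prefs[p2])
    shared = len(a & b)
    union = len(a) + len(b) - shared
    # No rated items at all means there is no basis for comparison
    if union == 0:
        return 0
    return shared / float(union)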