
Making Recommendations #

Preparing the dataset


critics = {
    'Lisa Rose': {
        'Lady in the Water': 2.5,
        'Snakes on a Plane': 3.5,
        'Just My Luck': 3.0,
        'Superman Returns': 3.5,
        'You, Me and Dupree': 2.5,
        'The Night Listener': 3.0,
    },
    # ... the remaining critics (for example 'Toby', used below) are omitted here
}

Euclidean Distance Score #

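The Euclidean distance is a good measure of how far apart two vectors are. Here it is computed only over the items that both people have rated, with a couple of small tricks.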

from math import sqrt


def sim_distance(prefs, p1, p2):
    '''
    Returns a distance-based similarity score for person1 and person2.
    '''

    # Get the list of shared_items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    # If they have no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[p1][item] - prefs[p2][item], 2) for item in
                         prefs[p1] if item in prefs[p2]])
    return 1 / (1 + sqrt(sum_of_squares))

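Finally, the function adds 1 to the distance before taking the reciprocal. This avoids division by zero when the two people give identical ratings, and it turns a small distance into a large score. The result always lies between 0 and 1, and a higher value means the two people are more similar.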

>>> recommendations.sim_distance(recommendations.critics,"Toby","Lisa Rose")
0.3483314773547883
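The score above can be checked by hand. The sketch below assumes Toby's ratings for the three films he shares with Lisa Rose are 4.5, 4.0 and 1.0 (the dataset listing above is truncated, so these values are an assumption, chosen because they reproduce the score shown).

```python
from math import sqrt

# Lisa Rose's ratings come from the dataset above; Toby's are assumed.
lisa = {'Snakes on a Plane': 3.5, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5}
toby = {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 1.0}

# Only films rated by both critics take part in the distance.
shared = [film for film in lisa if film in toby]

# Squared differences: 1.0 + 0.25 + 2.25 = 3.5
sum_of_squares = sum((lisa[f] - toby[f]) ** 2 for f in shared)

# Add 1 before inverting so identical ratings give 1.0 instead of a division by zero.
score = 1 / (1 + sqrt(sum_of_squares))
print(round(score, 4))  # 0.3483
```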

Pearson Correlation Score #

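The Pearson correlation coefficient is another way to measure how similar two people are. It tends to give better results when the data is not well normalized.

Here the X and Y axes each represent a person, and each point is an item placed at the ratings those two people gave it. We look for the straight line that comes as close as possible to all of the points. This line is called the best-fit line.

When every point falls exactly on the line, the correlation score is a perfect 1. The measure also corrects for grade inflation: a critic who consistently rates higher than another can still correlate perfectly with them, as long as the gap is consistent. The formula is more involved than the Euclidean one and is related to the line fitting used in regression analysis.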
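For reference, the computation carried out by `sim_pearson` can be written as a single expression. This is reconstructed from the code rather than taken from the original notes: x and y are the two people's ratings over their n shared items, and the five sums correspond to `pSum`, `sum1`, `sum2`, `sum1Sq` and `sum2Sq`.

```latex
r = \frac{\sum xy - \dfrac{\sum x \, \sum y}{n}}
         {\sqrt{\left(\sum x^{2} - \dfrac{\left(\sum x\right)^{2}}{n}\right)
                \left(\sum y^{2} - \dfrac{\left(\sum y\right)^{2}}{n}\right)}}
```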

def sim_pearson(prefs, p1, p2):
    '''
    Returns the Pearson correlation coefficient for p1 and p2.
    '''

    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    # If there are no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Sum calculations
    n = len(si)
    # Sums of all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    # Sums of the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # Calculate r (Pearson score)
    num = pSum - sum1 * sum2 / n
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    r = num / den
    return r
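To see the grade-inflation point in numbers, here is a small sketch with two hypothetical critics, one of whom always rates exactly one point higher. The `pearson` helper is a simplified stand-in that works on plain lists, not the function above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x_sq = sum(x * x for x in xs)
    sum_y_sq = sum(y * y for y in ys)
    p_sum = sum(x * y for x, y in zip(xs, ys))
    num = p_sum - sum_x * sum_y / n
    den = sqrt((sum_x_sq - sum_x ** 2 / n) * (sum_y_sq - sum_y ** 2 / n))
    return 0 if den == 0 else num / den


# Hypothetical ratings: the second critic is always one point more generous.
strict = [1.0, 2.0, 3.0, 4.0]
generous = [2.0, 3.0, 4.0, 5.0]

r = pearson(strict, generous)
euclid = 1 / (1 + sqrt(sum((a - b) ** 2 for a, b in zip(strict, generous))))
print(round(r, 6), round(euclid, 4))  # 1.0 0.3333
```

The Pearson score treats the two critics as perfectly correlated, while the distance-based score penalizes the constant offset.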

Scoring Users #

Now that we have similarity measures, we can score every other user against a given person and find the users who are most similar to them.

def topMatches(
    prefs,
    person,
    n=5,
    similarity=sim_pearson,
):
    '''
    Returns the best matches for person from the prefs dictionary. 
    Number of results and similarity function are optional params.
    '''

    # Score every other user in the dictionary against person, then sort so the best matches come first
    scores = [(similarity(prefs, person, other), other) for other in prefs
              if other != person]
    scores.sort()
    scores.reverse()
    return scores[0:n]
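As a quick illustration of the ranking step, the sketch below uses a made-up dataset and a deliberately simple similarity (the number of shared items). The names `top_matches`, `shared_count` and `toy` are hypothetical and only serve this example.

```python
def top_matches(prefs, person, similarity, n=5):
    """Return the n (score, name) pairs most similar to person."""
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    # Sorting in descending order puts the highest similarity first.
    return sorted(scores, reverse=True)[:n]


def shared_count(prefs, p1, p2):
    """Toy similarity: how many items both people have rated."""
    return len(set(prefs[p1]) & set(prefs[p2]))


# Hypothetical ratings, not part of the critics dataset.
toy = {
    'ann': {'a': 1, 'b': 2, 'c': 3},
    'bob': {'a': 3, 'b': 1, 'c': 2},
    'cat': {'a': 2},
    'dan': {'b': 5, 'c': 4},
}
print(top_matches(toy, 'ann', shared_count, n=2))  # [(3, 'bob'), (2, 'dan')]
```

Because the similarity function is passed in as a parameter, the same ranking code works with the Euclidean or Pearson score.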

Recommending Items #

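We could simply take the movies that similar users have watched and this user has not, and recommend those. That is too haphazard, though, so here we use a weighted scoring scheme instead.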

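For example, Toby and Lisa Rose have a similarity of about 0.99, so a movie that Rose rated 3.0 contributes a weighted score of 3.0 * 0.99 = 2.97. Summing these weighted scores over all critics and dividing by the sum of their similarities gives the predicted rating for that movie.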

def getRecommendations(prefs, person, similarity=sim_pearson):
    '''
    Gets recommendations for a person by using a weighted average
    of every other user's rankings
    '''

    totals = {}
    simSums = {}
    for other in prefs:
        # Don't compare me to myself
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        # Ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in prefs[other]:
            # Only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * Score
                totals.setdefault(item, 0)
                # The final score is calculated by multiplying each item by the
                #   similarity and adding these products together
                totals[item] += prefs[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim
    # Create the normalized list
    rankings = [(total / simSums[item], item) for (item, total) in
                totals.items()]
    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings
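The normalization step is easier to follow with the arithmetic written out for a single movie. The similarities and ratings below are hypothetical values chosen for illustration, not results computed from the dataset.

```python
# Hypothetical similarities to the target user, and each critic's rating
# of one movie the target user has not seen.
similarity = {'critic_a': 0.99, 'critic_b': 0.38, 'critic_c': 0.89}
rating = {'critic_a': 3.0, 'critic_b': 3.0, 'critic_c': 4.5}

# Weighted sum: each rating counts in proportion to the critic's similarity.
total = sum(similarity[c] * rating[c] for c in rating)

# Dividing by the sum of similarities keeps a movie rated by many critics
# from scoring higher just because more people reviewed it.
sim_sum = sum(similarity[c] for c in rating)

predicted = total / sim_sum
print(round(predicted, 2))  # 3.59
```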

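So far we have started from a given person. To go the other way and match products, we swap people and items in the dictionary, so that the same similarity functions compare items instead of people.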

def transformPrefs(prefs):
    '''
    Transform the recommendations into a mapping where persons are described
    with interest scores for a given title e.g. {title: person} instead of
    {person: title}.
    '''

    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            # Flip item and person
            result[item][person] = prefs[person][item]
    return result
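A small sketch of the flip on a two-person excerpt. Lisa Rose's ratings come from the dataset above, Toby's rating is assumed, and `transform_prefs` is a compact rewrite used only for this example.

```python
def transform_prefs(prefs):
    """Turn {person: {item: score}} into {item: {person: score}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, score in ratings.items():
            result.setdefault(item, {})[person] = score
    return result


by_person = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5},
    'Toby': {'Snakes on a Plane': 4.5},  # assumed value
}
by_item = transform_prefs(by_person)
print(by_item['Snakes on a Plane'])  # {'Lisa Rose': 3.5, 'Toby': 4.5}
```

Applying the transformation twice returns the original dictionary, which makes it easy to sanity-check.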

Exercises #

Tanimoto Similarity (Tanimoto Coefficient) #

Reference: Tanimoto Similarity and Jaccard Indexes with FeatureBase.

Tanimoto formula: T(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)

Jaccard formula: J(A, B) = |A ∩ B| / |A ∪ B|

Since |A ∪ B| = |A| + |B| - |A ∩ B|, the two formulas give the same value for sets.

The Tanimoto coefficient does not care what rating a user gave an item. It only cares whether the user and the item are associated at all.
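One possible way to approach the exercise is sketched below. It assumes the same `{person: {item: score}}` layout as `critics` and treats each person as the set of items they have rated. The function name `sim_tanimoto` and the toy data are hypothetical.

```python
def sim_tanimoto(prefs, p1, p2):
    """Tanimoto coefficient over the sets of items each person has rated."""
    a, b = set(prefs[p1]), set(prefs[p2])
    shared = len(a & b)
    union = len(a) + len(b) - shared
    # Two people with no ratings at all have an empty union; return 0 then.
    return shared / union if union else 0.0


# Hypothetical data: the rating values are ignored, only the keys matter.
toy = {
    'ann': {'x': 5.0, 'y': 1.0, 'z': 3.0},
    'bob': {'y': 4.0, 'z': 2.0, 'w': 4.5},
}
print(sim_tanimoto(toy, 'ann', 'bob'))  # 2 / (3 + 3 - 2) = 0.5
```

Because it takes `(prefs, p1, p2)` like the other scores, a function of this shape could be passed as the `similarity` argument to `topMatches`.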