
Product Affinity Analysis Example - Data Mining with Python

2016-10-26 11:31


= Before You Start =

Install Python 3 (download link: python3)

pip install xxxx


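The default pip source is slow. Switching to a mirror made downloads several hundred times faster in my experience.

PyCharm and Sublime Text

Data and source code used in the book: click to download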

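= Getting Started =

A simple affinity analysis example

Given enough data, we can test a hypothesis against it. Affinity analysis determines how similar individuals are and how closely they are related. Some example applications:

* Offering varied services or targeted advertising to website users

* Recommending related products to customers who buy a product

* Finding people who are related to each other based on their genes

In the example that follows,

a store sells five products: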

features = ["bread", "milk", "cheese", "apples", "bananas"]


The purchase records are stored in affinity_dataset.txt. We load the file with the numpy library and print the first few rows to check its format:

import numpy as np

dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
# X.shape is (n_samples, n_goods), the number of rows and columns of X.
# Each sample (row) is one person's purchase record; each good (column) is one of the five products.
n_samples, n_goods = X.shape
print(X[:5])

> [[ 0.  0.  1.  1.  1.]
>  [ 1.  1.  0.  1.  0.]
>  [ 1.  0.  1.  1.  0.]
>  [ 0.  0.  1.  1.  1.]
>  [ 0.  1.  0.  0.  1.]]
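
If the dataset file is not at hand, the format can still be explored with stand-in data. The sketch below is only an illustration: `fake_purchases`, `X_demo`, the seed, and the row count of 100 are assumptions made here, not part of the original dataset. It writes random 0/1 flags in a whitespace-separated layout and reads them back with `np.loadtxt`, which also shows why the printed rows contain floats such as `0.` and `1.`.

```python
import io

import numpy as np

# Hypothetical stand-in for affinity_dataset.txt: one row per purchase record,
# five whitespace-separated 0/1 flags (bread, milk, cheese, apples, bananas).
rng = np.random.default_rng(seed=14)  # arbitrary seed, only for repeatability
fake_purchases = rng.integers(0, 2, size=(100, 5))

# Write the rows to an in-memory text buffer instead of a real file.
buffer = io.StringIO()
np.savetxt(buffer, fake_purchases, fmt="%d")
buffer.seek(0)

# np.loadtxt parses every value as float64 unless a dtype is given.
X_demo = np.loadtxt(buffer)
print(X_demo.shape)  # (100, 5)
print(X_demo.dtype)  # float64
```

Random flags carry no real purchasing pattern, so any rules mined from `X_demo` are meaningless; the point is only the file layout.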


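From the structure of the data we can propose a hypothesis, rule N: "If someone buys x1, they will also buy x2", and then find a way to test it.

We test it with support and confidence. Support is the number of samples in which rule N holds, that is, both x1 and x2 were bought; every such sample adds 1 to the support. Confidence is the number of times rule N holds divided by the number of samples in which x1 was bought. It is a ratio between 0 and 1 and can be read as a percentage.
We can compute both with a short algorithm.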
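
As a worked example of the two measures, the sketch below applies them to the five rows printed earlier. The helper `rule_stats` and the constants `APPLES` and `CHEESE` are names introduced here for illustration; they do not appear in the book's code.

```python
import numpy as np

# The five rows printed earlier; columns are bread, milk, cheese, apples, bananas.
sample_rows = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])


def rule_stats(rows, premise, conclusion):
    """Return (support, confidence) for the rule 'premise -> conclusion'."""
    bought_premise = rows[:, premise] == 1
    bought_both = bought_premise & (rows[:, conclusion] == 1)
    support = int(bought_both.sum())
    # Assumes the premise was bought at least once; otherwise this divides by zero.
    confidence = support / int(bought_premise.sum())
    return support, confidence


APPLES, CHEESE = 3, 2  # column indices
support, confidence = rule_stats(sample_rows, APPLES, CHEESE)
print(support, confidence)  # 3 0.75
```

Four of the five rows contain apples and three of those also contain cheese, so the rule "apples -> cheese" has a support of 3 and a confidence of 3/4.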


We use defaultdict to set up the counters, then compute the support and confidence:

from collections import defaultdict

valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurances = defaultdict(int)

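# For each individual (sin) in X: if they bought a product (the premise), add 1 to its
# purchase count in num_occurances. Then loop over every conclusion: if they also bought
# it, add 1 to valid_rules for (premise, conclusion); otherwise add 1 to invalid_rules.
for sin in X:
    # Loop over all n_goods products; range(4) would never use the last product as a premise
    for premise in range(n_goods):
        if sin[premise] == 0:
            continue
        num_occurances[premise] += 1
        for conclusion in range(n_goods):
            # A rule whose premise equals its conclusion is meaningless
            if premise == conclusion:
                continue
            if sin[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1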

# Support is the number of samples in which each rule held
support = valid_rules
confidence = defaultdict(float)
# Each key is a (premise, conclusion) tuple, unpacked directly in the for statement
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurances[premise]
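
The nested loops can be cross-checked with a matrix product. This is an alternative sketch, not the book's method; the names `X_small`, `co_counts`, `occurrences`, and `confidence_matrix` are introduced here, and the data is again the five rows printed earlier. It assumes every product was bought at least once, so no division by zero occurs.

```python
import numpy as np

# The five rows printed earlier; columns are bread, milk, cheese, apples, bananas.
X_small = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
], dtype=float)

# co_counts[i, j] = number of rows in which both product i and product j were bought.
co_counts = X_small.T @ X_small

# The diagonal counts how often each product was bought at all.
occurrences = np.diag(co_counts)

# confidence_matrix[i, j] = confidence of the rule "i -> j".
# Diagonal entries are always 1.0 and should be ignored (premise == conclusion).
confidence_matrix = co_counts / occurrences[:, np.newaxis]

print(co_counts[3, 2])          # 3.0
print(confidence_matrix[3, 2])  # 0.75
```

Off-diagonal entries of `co_counts` correspond to the support counts in `valid_rules`, which makes this a convenient check on the loop version.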


As a final step, we sort the rules by support and by confidence:

from operator import itemgetter

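# reverse=True sorts from largest to smallest
# itemgetter(1) builds a function that returns the item at index 1 of its argument, here the value
# of each (rule, value) pair; given several indices, it returns a tuple of the matching items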
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)

# print_rule and print_line are printing helpers; their source is in the full listing at the end of this post
for index in range(5):
    print("Rule #{}".format(index + 1))
    premise, conclusion = sorted_support[index][0]
    # features is the list of product names defined earlier
    print_rule(premise, conclusion, support, confidence, features)
    print_line()

for index in range(5):
    print("Rule #{}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)
    print_line()
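
To see what `itemgetter(1)` and `sorted` do here in isolation, the sketch below ranks a tiny dictionary. The counts in `toy_support` are made up for illustration and are not results from the dataset.

```python
from operator import itemgetter

# Hypothetical support counts keyed by (premise, conclusion); the values are invented.
toy_support = {(0, 1): 14, (3, 2): 25, (2, 4): 17}

# itemgetter(1) returns the element at index 1 of whatever it is called with.
get_value = itemgetter(1)
print(get_value(((3, 2), 25)))  # 25

# .items() yields (rule, count) pairs; sort them by count, largest first.
ranked = sorted(toy_support.items(), key=itemgetter(1), reverse=True)
print(ranked)  # [((3, 2), 25), ((2, 4), 17), ((0, 1), 14)]

# Slicing takes the top entries without failing when fewer rules exist than requested.
top_two = ranked[:2]
```

Indexing with `range(5)` as in the loops above raises an IndexError if fewer than five rules were found, which is why a slice can be the safer choice on small datasets.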


Source code for this section

import numpy as np
from collections import defaultdict
import pprint
from operator import itemgetter

def print_line():
    print("====================================================")
    print("\n")

dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
n_samples, n_goods = X.shape
goods = ["bread", "milk", "cheese", "apples", "bananas"]

print(X[:5])
print_line()

num_apple_purchases = 0
for sin in X:
    if sin[3] == 1:
        num_apple_purchases += 1
print("{0} people bought Apples".format(num_apple_purchases))
print_line()

valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurances = defaultdict(int)

for sin in X:
    # Loop over all n_goods products so the last one can also be a premise
    for premise in range(n_goods):
        if sin[premise] == 0:
            continue
        num_occurances[premise] += 1
        for conclusion in range(n_goods):
            if premise == conclusion:
                continue
            if sin[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1

support = valid_rules
confidence = defaultdict(float)

for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurances[premise]

def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule : If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Support : {}".format(support[(premise, conclusion)]))
    print(" - Confidence : {0:.3f}".format(confidence[(premise, conclusion)]))
print_line()

premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, goods)
print_line()

pprint.pprint(support)
pprint.pprint(list(support.items()))
print_line()

sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)

for index in range(5):
    print("Rule #{}".format(index + 1))
    premise, conclusion = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, goods)
    print_line()

for index in range(5):
    print("Rule #{}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, goods)
    print_line()