
Learning Data Mining with Python (Python数据挖掘入门与实践), Chapter 2, Section 2.1: About random_state

2019-04-02 11:01

random_state

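In my previous post I dug a hole for myself, so now I'll try to climb out of it.
Building on the earlier code, let's see whether setting random_state to each value from 0 to 20
changes the size of the test set or the accuracy.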

import numpy as np
import csv
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from matplotlib import pyplot as plt

# '下载地址' is a placeholder for the folder where the dataset was downloaded
data_filename = r'下载地址\ionosphere.data'
X = np.zeros((351, 34), dtype='float')
y = np.zeros((351,), dtype='bool')

# Load the 34 features per row; the last column is the label ('g' = good)
with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        data = [float(datum) for datum in row[:-1]]
        X[i] = data
        y[i] = row[-1] == 'g'

# Split, train and score once for each random_state from 0 to 20
for random_state in range(0, 21):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    estimator = KNeighborsClassifier()
    estimator.fit(X_train, y_train)
    y_predicted = estimator.predict(X_test)
    accuracy = np.mean(y_test == y_predicted) * 100
    print(y_predicted.shape, random_state, '{0:.1f}%'.format(accuracy))

The full output takes up too much space, so I've kept only the results for random_state within 14 plus or minus 2:
(88,) 12 83.0%
(88,) 13 83.0%
(88,) 14 86.4%
(88,) 15 85.2%
(88,) 16 79.5%
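This shows that the test set still has 88 samples, which follows from the default test_size of 0.25 (351 × 0.25 = 87.75, rounded up to 88).
The accuracy, however, fluctuates.
My guess is that which specific samples end up in the test set and the training set is what changes.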
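The size claim above can be checked without the data file. The following is a minimal sketch, assuming a plain array of row numbers (`indices`, a stand-in that is not in the original code) is split instead of the real features; it compares the size of each test split against 351 × 0.25 rounded up.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 351                 # number of rows in the ionosphere data
indices = np.arange(n_samples)  # stand-in for the real samples (assumption)

# With the default test_size of 0.25, the test-set size is rounded up
expected_test_size = math.ceil(0.25 * n_samples)

# Collect the test-set size for every random_state from 0 to 20
test_sizes = []
for seed in range(0, 21):
    train_idx, test_idx = train_test_split(indices, random_state=seed)
    test_sizes.append(len(test_idx))

print(expected_test_size, set(test_sizes))
```

If the set printed at the end holds a single value, the seed has no effect on how many samples land in the test set, only on which ones.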

So let's check it.
First, using the book's random_state = 14, let's see exactly how the samples get split.

(Blank space here: time to think)

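It seems I first need a way to tell whether two rows of X are identical.
Time to consult an expert:
"python3 np矩阵判断是否相同,包含顺序和值" (Python 3: checking whether two NumPy matrices are identical, including order and values)

So I can use:
(matrix_a == matrix_b).all()
to check whether two matrices are identical. The parentheses matter: without them, matrix_b.all() is evaluated first.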
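A small sketch of that comparison on toy arrays (the arrays `a`, `b` and `c` are made up for illustration). It also shows `np.array_equal`, a NumPy function that performs the same check in a single call and additionally handles arrays of different shapes.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = a.copy()
c = np.array([[3.0, 4.0], [1.0, 2.0]])  # same values, different row order

# Element-wise comparison first, then reduce with .all()
same_ab = bool((a == b).all())
same_ac = bool((a == c).all())

# np.array_equal does the comparison and the reduction in one call
same_ab_alt = bool(np.array_equal(a, b))

print(same_ab, same_ac, same_ab_alt)  # prints: True False True
```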

So I changed the code to test, for random_state = 14,
which row of X each row of X_test corresponds to.
This is also a chance to revisit enumerate(), a very handy function.

import numpy as np
import csv
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from matplotlib import pyplot as plt

# '下载地址' is a placeholder for the folder where the dataset was downloaded
data_filename = r'下载地址\ionosphere.data'
X = np.zeros((351, 34), dtype='float')
y = np.zeros((351,), dtype='bool')

with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        data = [float(datum) for datum in row[:-1]]
        X[i] = data
        y[i] = row[-1] == 'g'

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# For each test row, record its position in X_test (i) and every position
# in X (j) holding an identical row. There is no break after a match, so a
# row that is duplicated in X produces more than one (i, j) pair.
i_order, j_order = [], []
for i, row_i in enumerate(X_test):
    for j, row_j in enumerate(X):
        if (row_i == row_j).all():
            i_order.append(i)
            j_order.append(j)
print(i_order, j_order)

Running it gives the following:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87]
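i_order holds the positions of the 88 test samples. Note that 34 appears twice: test row 34 matches two rows of X (102 and 248 in the list below), so the dataset appears to contain a duplicated row and both lists have 89 entries. j_order holds the positions of the test samples in the original dataset: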
[14, 1, 44, 245, 288, 140, 309, 252, 339, 243, 318, 69, 214, 303, 158, 152, 321, 145, 106, 276, 298, 20, 281, 49, 326, 85, 292, 223, 227, 112, 264, 115, 121, 240, 102, 248, 125, 225, 60, 6, 62, 241, 3, 161, 236, 300, 287, 226, 137, 98, 94, 33, 283, 31, 337, 5, 70, 175, 278, 97, 235, 143, 315, 111, 171, 304, 215, 204, 151, 146, 176, 212, 317, 221, 182, 91, 113, 141, 21, 224, 117, 188, 183, 131, 170, 253, 301, 79, 40]
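The same index list can be obtained without comparing rows. `train_test_split` accepts any number of equal-length arrays and splits them all in the same way, so an array of row numbers can ride along with X and y. The sketch below uses random data as a stand-in for the ionosphere features, and the names `row_ids`, `ids_train` and `ids_test` are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ionosphere data (assumption: same shape, random values)
rng = np.random.RandomState(0)
X = rng.rand(351, 34)
y = rng.rand(351) > 0.5

# Row numbers 0..350, split alongside X and y
row_ids = np.arange(len(X))

X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, row_ids, random_state=14)

# ids_test[k] is the row of X that became X_test[k]
print(ids_test[:5])
```

Because each test row is tagged with its own row number, duplicated rows in the data no longer produce extra matches, and no nested loop is needed.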

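According to the documentation for random_state, this result should come out the same no matter how many times the code is run.
But... I can't make sense of the individual indices in j_order.
Let's look at some more.
The j_order result for random_state=10: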
[43, 306, 138, 275, 65, 6, 262, 172, 343, 218, 341, 294, 227, 102, 248, 142, 148, 163, 151, 1, 323, 207, 154, 225, 52, 195, 312, 255, 181, 87, 308, 78, 252, 219, 80, 322, 241, 12, 334, 245, 331, 270, 304, 47, 263, 26, 230, 205, 121, 174, 212, 303, 64, 24, 349, 199, 276, 111, 114, 69, 76, 170, 27, 56, 126, 92, 173, 131, 34, 300, 57, 146, 196, 98, 164, 217, 29, 204, 210, 249, 17, 100, 184, 88, 229, 25, 282, 315, 223]
The j_order result for random_state=33:
[162, 95, 342, 122, 321, 0, 42, 112, 349, 69, 142, 322, 174, 223, 4, 225, 145, 287, 1, 213, 137, 64, 300, 345, 220, 293, 273, 318, 310, 147, 268, 5, 123, 110, 306, 276, 165, 101, 207, 209, 35, 194, 62, 81, 334, 155, 70, 8, 266, 229, 217, 324, 119, 302, 339, 183, 295, 315, 283, 319, 336, 193, 269, 304, 140, 330, 226, 317, 89, 274, 294, 199, 11, 144, 167, 314, 282, 267, 52, 348, 152, 263, 215, 47, 126, 173, 196, 284]

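So the split really is random, but whenever random_state has the same value, every run produces the same shuffle.
I won't dig any deeper than that.
Which value to use comes down to personal preference!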
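The conclusion above can be checked directly. This is a minimal sketch that again splits an array of row numbers rather than the real data (an assumption made so the snippet runs without the file): the split is performed twice with random_state=14 and once with random_state=10, and the resulting test sets are compared.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(351)  # stand-in for the 351 samples (assumption)

# Same seed twice, then a different seed
_, test_a = train_test_split(row_ids, random_state=14)
_, test_b = train_test_split(row_ids, random_state=14)
_, test_c = train_test_split(row_ids, random_state=10)

same_seed_matches = bool(np.array_equal(test_a, test_b))
different_seed_matches = bool(np.array_equal(test_a, test_c))

print(same_seed_matches, different_seed_matches)
```

The first value should be True and the second False, which matches the different j_order lists shown for random_state=14 and random_state=10.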

References:
1.https://www.geek-share.com/detail/2721837827.html
