Reading the scikit-learn source: datasets.samples_generator.make_blobs
2016-12-04 11:09
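I ran into make_blobs while reading the clustering part of sklearn. It generates sample data to order: you choose the number of samples, the number of features, and the cluster centers. The code below is taken from the official source.
While reading the code I added comments and notes as I went.
The inline # comments are my own interpretation. If any of them miss the mark, corrections in the comment section are welcome.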
# Imports this excerpt relies on
import numbers

import numpy as np
from sklearn.utils import check_array, check_random_state


def make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0,
               center_box=(-10.0, 10.0), shuffle=True, random_state=None):
    """Generate isotropic Gaussian blobs for clustering.

    Read more in the :ref:`User Guide <sample_generators>`.

    Parameters
    ----------
    n_samples : int, optional (default=100)
        The total number of points equally divided among clusters.

    n_features : int, optional (default=2)
        The number of features for each sample.

    centers : int or array of shape [n_centers, n_features], optional
        (default=3)
        The number of centers to generate, or the fixed center locations.

    cluster_std : float or sequence of floats, optional (default=1.0)
        The standard deviation of the clusters.

    center_box : pair of floats (min, max), optional (default=(-10.0, 10.0))
        The bounding box for each cluster center when centers are
        generated at random.

    shuffle : boolean, optional (default=True)
        Shuffle the samples.

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

    Returns
    -------
    X : array of shape [n_samples, n_features]
        The generated samples.

    y : array of shape [n_samples]
        The integer labels for cluster membership of each sample.

    Examples
    --------
    >>> from sklearn.datasets.samples_generator import make_blobs
    >>> X, y = make_blobs(n_samples=10, centers=3, n_features=2,
    ...                   random_state=0)
    >>> print(X.shape)
    (10, 2)
    >>> y
    array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])

    See also
    --------
    make_classification: a more intricate variant
    """
    # Build the random number generator from random_state
    generator = check_random_state(random_state)

    # Check the type of centers.
    # If it is an int, draw that many centers at random inside center_box.
    if isinstance(centers, numbers.Integral):
        # uniform samples from a uniform distribution
        # over the range [center_box[0], center_box[1]),
        # with output shape (centers, n_features)
        centers = generator.uniform(center_box[0], center_box[1],
                                    size=(centers, n_features))
    else:
        # Otherwise convert centers to an np.array and read n_features from it
        centers = check_array(centers)
        n_features = centers.shape[1]

    # If cluster_std is a single real number, every center uses that
    # standard deviation
    if isinstance(cluster_std, numbers.Real):
        cluster_std = np.ones(len(centers)) * cluster_std

    # Accumulators for the samples and labels to return
    X = []
    y = []

    n_centers = centers.shape[0]
    # // is integer (floor) division: the base number of samples per center
    n_samples_per_center = [int(n_samples // n_centers)] * n_centers

    # Hand out the remainder one sample at a time to the first few centers
    for i in range(n_samples % n_centers):
        n_samples_per_center[i] += 1

    # enumerate yields (index, value) pairs.
    # zip pairs up the elements that sit at the same position in each
    # sequence (it stops at the shortest one).
    for i, (n, std) in enumerate(zip(n_samples_per_center, cluster_std)):
        # normal samples from a normal distribution.
        # The noise drawn with the given scale and size is added to the
        # center, so the points scatter around it.
        # The addition relies on NumPy broadcasting:
        # >>> np.array([1,2])+np.array([[3,4],[5,6]])
        # array([[4, 6],[6, 8]])
        X.append(centers[i] + generator.normal(scale=std,
                                               size=(n, n_features)))
        # Labels come in contiguous runs: n copies of i
        y += [i] * n

    # concatenate joins the per-center arrays along the first axis into one
    # 2-D array, which can look confusing at first:
    # >>> np.concatenate([[[1,2],[3,4]],
    # ...                 [[5,6],[7,8]],
    # ...                 [[9,10],[10,11]]])
    # array([[ 1,  2],
    #        [ 3,  4],
    #        [ 5,  6],
    #        [ 7,  8],
    #        [ 9, 10],
    #        [10, 11]])
    # Replacing X.append with X.extend above would avoid concatenate, but X
    # would then be a plain list of rows and would still need np.array(X)
    # before the index-based shuffle below.
    X = np.concatenate(X)
    y = np.array(y)

    # Shuffle the samples
    if shuffle:
        # Build the row indices
        indices = np.arange(n_samples)
        # Shuffle the indices in place
        generator.shuffle(indices)
        # Apply the same permutation to X and y so rows keep their labels
        X = X[indices]
        y = y[indices]

    return X, y
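The steps that took me longest to follow (splitting the remainder, broadcasting, and append versus extend) can be tried in isolation. What follows is only a minimal NumPy sketch, not the library code: split_counts is a helper name made up for this post, and the three centers, the seed 0, and the standard deviation 1.0 are arbitrary values picked for illustration.

```python
import numpy as np


def split_counts(n_samples, n_centers):
    """Hypothetical helper: split n_samples across n_centers.

    The first (n_samples % n_centers) centers each get one extra sample.
    """
    counts = [n_samples // n_centers] * n_centers
    for i in range(n_samples % n_centers):
        counts[i] += 1
    return counts


rng = np.random.RandomState(0)  # arbitrary seed, only for reproducibility
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])  # made-up centers
counts = split_counts(10, len(centers))

blocks = []  # one (n, 2) array per center, as with X.append
rows = []    # flat list of 1-D rows, as with X.extend
labels = []
for i, n in enumerate(counts):
    # centers[i] has shape (2,); broadcasting adds it to every row
    # of the (n, 2) noise array
    block = centers[i] + rng.normal(scale=1.0, size=(n, 2))
    blocks.append(block)
    rows.extend(block)  # iterating a 2-D array yields its rows
    labels += [i] * n

X = np.concatenate(blocks)    # stack the blocks along axis 0
X_from_rows = np.array(rows)  # the extend variant still needs a conversion
y = np.array(labels)

# Shuffle X and y with the same permutation so each row keeps its label
indices = np.arange(len(y))
rng.shuffle(indices)
X_shuffled, y_shuffled = X[indices], y[indices]
```

Both routes end in the same array, so choosing between append plus concatenate and extend plus np.array is mostly a matter of taste. Either way a real ndarray is required before indexing with indices.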