
Python Data Analysis: scikit-learn Basics (Part 2)

2019-05-06 14:55

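The scikit-learn workflow

Prepare the dataset
  • Data processing
    • Dataset format: a 2-D array of shape (n_samples, n_features)
    • Use np.reshape() to convert the data to this shape
  • Feature engineering
    • Feature extraction
    • Feature scaling (normalization)
    • ……
  • Split into training and test sets: train_test_split()
  • Feature scaling: preprocessing.scale(), which standardizes each feature to zero mean and unit variance

Select a suitable model

Train the model
  • The estimator object
    • Is learned from the training data
    • Can be a classification, regression, or feature extraction algorithm
  • The fit() method trains the estimator
  • An estimator's parameters can be initialized before training or updated afterwards
  • get_params() returns the parameters that were defined earlier
  • score() evaluates the estimator
    • Regression models are scored with the coefficient of determination (R^2)
    • Classification models are scored with accuracy

Tune the parameters
  • Rely on experience
  • Cross-validation: cross_val_score()

Test the model
  • model.predict(X_test) returns the predicted labels for the test samples
  • model.score(X_test, y_test) computes a score from the predicted and true values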
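The estimator methods listed in the outline can be tried on a tiny example. The following is a minimal sketch, not part of the original post: the toy arrays and the choice of `KNeighborsClassifier` and `LinearRegression` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up for illustration): 6 samples, 1 feature.
X = np.arange(6, dtype=float).reshape(-1, 1)
y_reg = 2.0 * X.ravel() + 1.0          # exactly linear regression target
y_clf = np.array([0, 0, 0, 1, 1, 1])   # two classes

# Parameters can be set when the estimator is constructed ...
clf = KNeighborsClassifier(n_neighbors=3)
print(clf.get_params()['n_neighbors'])

# ... or updated afterwards with set_params().
clf.set_params(n_neighbors=1)

# fit() learns from the training data; score() evaluates the fitted estimator.
clf.fit(X, y_clf)
acc = clf.score(X, y_clf)   # accuracy for classifiers

reg = LinearRegression().fit(X, y_reg)
r2 = reg.score(X, y_reg)    # coefficient of determination (R^2) for regressors
print(acc, r2)
```

Both scores here are computed on the training data itself, so they only show which metric each kind of estimator reports; they say nothing about generalization.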
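    import numpy as np
    from sklearn.model_selection import train_test_split

    # 10 samples with 4 integer features each; labels are drawn from {0, 1, 2}
    X = np.random.randint(0, 100, (10, 4))
    y = np.random.randint(0, 3, 10)
    y.sort()  # sort the labels in place so the classes appear in order

    print('Samples:')
    print(X)
    print('Labels:', y)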
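The outline asks for a 2-D array of shape (n_samples, n_features) and mentions np.reshape() for getting there. A small sketch of that conversion, using made-up values:

```python
import numpy as np

# A single feature measured on 5 samples, stored as a 1-D array (values are made up).
feature = np.array([3.0, 1.5, 4.2, 0.7, 2.9])
print(feature.shape)   # (5,)

# -1 lets NumPy infer the number of rows: 5 samples, 1 feature.
X_single = np.reshape(feature, (-1, 1))
print(X_single.shape)  # (5, 1)

# A flat array holding 4 features for each of 3 samples, stored row by row.
flat = np.arange(12)
X_multi = flat.reshape(3, 4)
print(X_multi.shape)   # (3, 4)
```

The single-feature case is the one that most often needs this step, because a 1-D array carries no information about whether it holds 5 samples of one feature or one sample of 5 features.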

    # Split into a training set and a test set.
    # A fixed random_state makes the random split return the same result on every run.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=7)

    print('Training set:')
    print(X_train)
    print(y_train)

    print('Test set:')
    print(X_test)
    print(y_test)
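The effect of random_state can be checked directly. This sketch uses its own made-up arrays (`X_demo`, `y_demo`) so that it does not depend on the random data above, and passes an integer `test_size` to request exactly 3 test samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(10, 4)   # 10 samples, 4 features (made-up values)
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])

# Two calls with the same random_state return the same partition.
split_a = train_test_split(X_demo, y_demo, test_size=3, random_state=7)
split_b = train_test_split(X_demo, y_demo, test_size=3, random_state=7)
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)   # (7, 4) (3, 4)
```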

    # Feature scaling (standardization)
    from sklearn import preprocessing

    # Three features on very different scales
    x1 = np.random.randint(0, 1000, 5).reshape(5, 1)
    x2 = np.random.randint(0, 10, 5).reshape(5, 1)
    x3 = np.random.randint(0, 100000, 5).reshape(5, 1)

    X = np.concatenate([x1, x2, x3], axis=1)
    print(X)
    print('Dataset after scaling (each column has zero mean and unit variance):')
    print(preprocessing.scale(X))
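What preprocessing.scale() computes can be reproduced by hand, which makes the transformation easier to reason about. A short sketch on a made-up 3x2 array:

```python
import numpy as np
from sklearn import preprocessing

X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])

# Column-wise standardization by hand: subtract the mean, then divide by the
# population standard deviation (NumPy's default, ddof=0).
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled = preprocessing.scale(X_demo)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns hold the same values even though the raw features differ by a factor of 200, which is the point of the step: no feature dominates merely because of its units.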

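Checking how much feature scaling matters

Without scaling:

    from sklearn import svm
    from sklearn.datasets import make_classification

    # Synthetic two-feature dataset; scale=100 multiplies every feature by 100
    X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                               random_state=25, n_clusters_per_class=1, scale=100)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=7)
    # Note: since scikit-learn 0.22, SVC defaults to gamma='scale', which adapts to the
    # feature variance; pass gamma='auto' to get the older default behaviour.
    svm_classifier = svm.SVC()
    svm_classifier.fit(X_train, y_train)
    print(svm_classifier.score(X_test, y_test))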

With scaling:

    from sklearn import svm
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                               random_state=25, n_clusters_per_class=1, scale=100)
    # Scale the features (zero mean and unit variance per column)
    X = preprocessing.scale(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=7)
    svm_classifier = svm.SVC()
    svm_classifier.fit(X_train, y_train)
    print(svm_classifier.score(X_test, y_test))
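The scaled version above standardizes the whole dataset before it is split, so the test samples contribute to the mean and standard deviation. A common variation, shown here as a sketch rather than as the method of the original post, learns the scaling from the training set only by chaining `StandardScaler` and `SVC` with `make_pipeline`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           random_state=25, n_clusters_per_class=1, scale=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=7)

# The scaler's mean and standard deviation are learned from X_train only;
# the same transformation is then applied to X_test inside score().
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

Wrapping both steps in one pipeline also means the scaling is redone inside every fold when the model is passed to cross_val_score().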

Cross-validation:

    from sklearn import datasets
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    import matplotlib.pyplot as plt
    # In a Jupyter notebook, also run: %matplotlib inline

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=10)

    # Try K = 1..30 and record the mean 10-fold cross-validation accuracy for each
    k_range = range(1, 31)
    cv_scores = []
    for n in k_range:
        knn = KNeighborsClassifier(n_neighbors=n)
        # accuracy is the scoring metric for classification problems
        scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
        cv_scores.append(scores.mean())

    plt.plot(k_range, cv_scores)
    plt.xlabel('K')
    plt.ylabel('Accuracy')
    plt.show()

    # Choose the best K (K=5 is used here) and evaluate on the held-out test set
    best_knn = KNeighborsClassifier(n_neighbors=5)
    best_knn.fit(X_train, y_train)
    print(best_knn.score(X_test, y_test))
    print(best_knn.predict(X_test))
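Instead of reading the best K off the plot, it can be picked from the cross-validation scores programmatically. This is one possible sketch; the names `k_values`, `cv_means`, and `best_k` are introduced here for illustration and do not appear in the original post.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=1/3., random_state=10)

# Mean 10-fold cross-validation accuracy for each candidate K.
k_values = list(range(1, 31))
cv_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train,
                    cv=10, scoring='accuracy').mean()
    for k in k_values
]

# np.argmax returns the first index of the maximum, so ties go to the smallest K.
best_k = k_values[int(np.argmax(cv_means))]
print('best K:', best_k)

# Refit on the full training set and evaluate once on the held-out test set.
best_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_knn.score(X_test, y_test))
```

The test set is touched only once, after K has been chosen, so the final score is not biased by the parameter search.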
