
Sklearn Study Notes 1: Feature Engineering, Preprocessing

2017-08-28 16:06

I. Preprocessing

1. Binarizer: binarization

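from sklearn.preprocessing import Binarizer
import numpy as np

'''
Binarization:
Use cases: text count data, and estimators that assume Bernoulli-distributed (boolean) features
Behavior: thresholds each numeric feature; values greater than the threshold become 1, all others become 0
'''

x_train = np.array([[1, 2, -1],
                    [2, 3, -2],
                    [1, -1, 1]])

bina = Binarizer(threshold=1.0, copy=True)
bina.fit(x_train)  # fit learns nothing for Binarizer; it exists for pipeline compatibility
print(bina.transform(x_train))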
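
To make the threshold rule concrete, here is a minimal sketch that compares Binarizer with a plain NumPy comparison. The float input array and the variable names are chosen only for illustration; the point to notice is that a value equal to the threshold maps to 0, because the comparison is strictly "greater than".

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Illustrative data (same values as the example above, stored as floats).
x = np.array([[1, 2, -1],
              [2, 3, -2],
              [1, -1, 1]], dtype=float)

threshold = 1.0
binarized = Binarizer(threshold=threshold).fit_transform(x)

# Equivalent NumPy expression: strictly greater than the threshold becomes 1.
manual = (x > threshold).astype(x.dtype)

print(binarized)
print(np.array_equal(binarized, manual))
```

The entries equal to 1.0 come out as 0, so lower the threshold slightly if they should count as "on".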

2. Imputer: filling in missing values

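from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22
import numpy as np
'''
Missing-value imputation:
Strategies: "mean", "median", "most_frequent"
'''
x_train = np.array([[1, np.nan, -1],
                    [2, 3, -2],
                    [1, -1, 1]])
# SimpleImputer always imputes per column; the old Imputer's axis=1 (per-row) option no longer exists
imp = SimpleImputer(missing_values=np.nan, strategy='mean', copy=True)
imp.fit(x_train)
print(imp.transform(x_train))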
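
The three strategies can give quite different fill values. The sketch below assumes scikit-learn 0.20 or later, where SimpleImputer in sklearn.impute is the replacement for the older Imputer class. The small two-column array is made up for illustration, so that the mean, median and mode of the second column all differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: the second column has one missing value.
# Its observed values are 3, -1, -1, 7.
x = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [1.0, -1.0],
              [4.0, -1.0],
              [2.0, 7.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Position [0, 1] is where the NaN was; record what replaced it.
    filled[strategy] = imputer.fit_transform(x)[0, 1]
    print(strategy, filled[strategy])
```

Statistics are computed per column from the observed values only, so the choice of strategy matters most when a column is skewed or contains outliers.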

3. Normalizer: normalization

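from sklearn.preprocessing import Normalizer
import numpy as np
'''
Normalization (rescales each sample, i.e. each row, to unit norm):
Use case:
The inner product of two L2-normalized TF-IDF vectors is their cosine similarity. The closer the cosine is to 1, the more closely the two directions agree and the more similar the vectors are.

'''
x_train = np.array([[1, -5, -1],
                    [2, 3, -2],
                    [1, -1, 1]])
normalizer = Normalizer(norm='l2', copy=True)
'''
norm options: 'l1', 'l2', 'max'
'''
normalizer.fit(x_train)
print(normalizer.transform(x_train))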
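
The cosine-similarity remark is easy to check numerically. The following is a minimal sketch that uses two plain numeric rows rather than real TF-IDF vectors (an assumption made to keep the example short): the dot product of the L2-normalized rows should match the cosine computed directly from the raw rows.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two illustrative row vectors (stand-ins for TF-IDF vectors).
vectors = np.array([[1.0, -5.0, -1.0],
                    [2.0, 3.0, -2.0]])

# Scale each row to unit L2 norm.
unit = Normalizer(norm="l2").fit_transform(vectors)
dot_of_normalized = unit[0] @ unit[1]

# Cosine similarity computed directly from the unnormalized rows.
a, b = vectors
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot_of_normalized, cosine)
```

Note that Normalizer works row by row, unlike the scalers in this article, which work column by column.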

4. OneHotEncoder: one-hot encoding

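from sklearn.preprocessing import OneHotEncoder
import numpy as np
'''
One-hot encoding:
Encodes categorical features in one-of-K form
'''
x_train = np.array([1, 3, 4]).reshape(-1, 1)
# n_values and categorical_features were removed in scikit-learn 0.22; categories replaces n_values
onehot = OneHotEncoder(categories='auto', dtype=np.float64, handle_unknown='error')
'''
categories: the categories of each feature ('auto' infers them from the training data)
dtype: data type of the output
handle_unknown: what to do when transform meets a category not seen during fit ('error' or 'ignore')
Sparse output: a sparse matrix is returned by default; the switch is sparse_output (named sparse before scikit-learn 1.2)
Encoding only some columns (the old categorical_features) is now done with ColumnTransformer
'''
onehot.fit(x_train)
print(onehot.transform(x_train).toarray())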
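
The handle_unknown option only matters at transform time. As a rough sketch, the snippet below fits on the categories 1, 3 and 4 and then transforms a made-up array containing the unseen value 2. With handle_unknown='ignore' the unseen category is encoded as a row of zeros; with 'error' the same call would raise instead.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x_train = np.array([1, 3, 4]).reshape(-1, 1)
x_new = np.array([3, 2]).reshape(-1, 1)  # hypothetical new data; 2 was never seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(x_train)

# The default output is a sparse matrix, so convert it for printing.
encoded = encoder.transform(x_new).toarray()

print(encoder.categories_)  # categories learned from x_train
print(encoded)              # the row for the unseen value 2 is all zeros
```

An all-zero row is indistinguishable from "no category", so 'ignore' is best suited to cases where rare unseen values can safely be dropped.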

5. StandardScaler and MinMaxScaler: standardization and min-max scaling

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import numpy as np
'''
StandardScaler standardization (zero mean and unit variance per feature):
Use cases: e.g. PCA, SVM with an RBF kernel
Caveat: do not fit separate scalers on the training set and the test set. Fit on the training set only, then apply the same fitted scaler to the test set:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

'''
x_train = np.array([[1, 2, -1],
                    [2, 3, -2],
                    [1, -1, 1]])
stan = StandardScaler(copy=True, with_mean=True, with_std=True)
stan.fit(x_train)
print(stan.transform(x_train))

maxmin = MinMaxScaler(feature_range=(0, 1), copy=True)
maxmin.fit(x_train)
print(maxmin.transform(x_train))

# feature_range: the (min, max) range each feature is rescaled to
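
The train/test caveat can be illustrated with a short sketch. The single-row X_test here is invented for the example. The scaler is fitted on the training data only, and the test row is then standardized with the training mean and standard deviation, which the manual NumPy line reproduces.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 2.0, -1.0],
                    [2.0, 3.0, -2.0],
                    [1.0, -1.0, 1.0]])
X_test = np.array([[3.0, 0.0, 0.0]])  # hypothetical held-out sample

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from the training set
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# Same computation by hand: StandardScaler uses the population std (ddof=0).
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_test_scaled)
print(np.allclose(X_test_scaled, manual))
```

Fitting a second scaler on X_test would use different statistics, so the two sets would no longer live on the same scale.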

6. RobustScaler: robust scaling

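from sklearn.preprocessing import RobustScaler
robust = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)  # centers on the median, scales by the 25th to 75th percentile range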
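
Since this section gives only the constructor, here is a small sketch of what it does in practice. The one-column array with a single outlier (100.0) is made up for illustration. RobustScaler centers on the median and divides by the interquartile range, so the ordinary points keep a sensible spread, while StandardScaler's mean and standard deviation are pulled toward the outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Illustrative single feature with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler(quantile_range=(25.0, 75.0)).fit(x)
standard = StandardScaler().fit(x)

# center_ is the median; scale_ is the 75th minus the 25th percentile.
print(robust.center_, robust.scale_)

robust_scaled = robust.transform(x).ravel()
standard_scaled = standard.transform(x).ravel()
print(robust_scaled)    # inliers spread evenly around 0
print(standard_scaled)  # inliers squeezed together by the outlier
```

The outlier itself is not removed or clipped; it simply stops dominating the statistics used for scaling.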
