
Sklearn Study Notes 1: Feature Engineering, Preprocessing

2017-08-28 16:06

I. Preprocessing

1. Binarizer: binarization

from sklearn.preprocessing import Binarizer
import numpy as np

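'''
Binarizing data:
Use cases: estimators that assume Bernoulli-distributed (binary) features, text data
Behavior: thresholds each numeric feature; values greater than the threshold become 1, all others become 0
'''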

x_train = np.array([[1, 2, -1],
                    [2, 3, -2],
                    [1, -1, 1]])

bina = Binarizer(threshold=1.0, copy=True)
bina.fit(x_train)
bina.transform(x_train)
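
To make the thresholding concrete, here is a minimal sketch that compares the Binarizer output with a plain NumPy comparison. The variable names `binarized` and `manual` are my own, and the input is converted to floats as an assumption to keep the dtype handling simple.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

x_train = np.array([[1.0, 2.0, -1.0],
                    [2.0, 3.0, -2.0],
                    [1.0, -1.0, 1.0]])

# Values strictly greater than the threshold map to 1, everything else to 0.
binarized = Binarizer(threshold=1.0).fit_transform(x_train)

# The same rule written directly in NumPy, for comparison.
manual = (x_train > 1.0).astype(x_train.dtype)

print(binarized)
```

Note that a value equal to the threshold (here 1.0) is mapped to 0, because the comparison is strict.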


2. Imputer: filling in missing values

from sklearn.preprocessing import Imputer
import numpy as np
'''
Imputing missing values:
Fill strategies: "mean", "median", "most_frequent"
'''
x_train = np.array([[1, np.nan, -1],
                    [2, 3, -2],
                    [1, -1, 1]])
# axis=1 imputes along rows; axis=0 imputes along columns
imp = Imputer(missing_values='NaN', strategy='mean', axis=1, verbose=0, copy=True)
imp.fit(x_train)
imp.transform(x_train)
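
The `Imputer` class above belongs to the 2017 API; it was removed from `sklearn.preprocessing` in scikit-learn 0.22. A rough equivalent on newer releases, as a sketch, is `SimpleImputer` from `sklearn.impute`. It has no `axis` argument and always imputes column by column, so it does not reproduce the row-wise `axis=1` behavior used above. The variable name `filled` is my own.

```python
import numpy as np
from sklearn.impute import SimpleImputer

x_train = np.array([[1, np.nan, -1],
                    [2, 3, -2],
                    [1, -1, 1]])

# Column-wise mean imputation: each NaN is replaced by the mean of the
# observed values in the same column.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imp.fit_transform(x_train)

print(imp.statistics_)  # per-column fill values learned during fit
print(filled)
```

Here the missing entry sits in the second column, whose observed values are 3 and -1, so it is filled with their mean.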


3. Normalizer: normalization

from sklearn.preprocessing import Normalizer
import numpy as np
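'''
Normalizing data: each sample (row) is rescaled to unit norm.
Use case:
The dot product of two L2-normalized TF-IDF vectors is their cosine similarity. The closer the cosine is to 1, the more closely the two directions align, and the more similar the vectors are.
'''
x_train = np.array([[1, -5, -1],
                    [2, 3, -2],
                    [1, -1, 1]])
normalizer = Normalizer(norm='l2', copy=True)
'''
Norm options: 'l1', 'l2', 'max'
'''
normalizer.fit(x_train)
normalizer.transform(x_train)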
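
The cosine-similarity remark can be checked with a small sketch. The two vectors below are made-up toy values rather than real TF-IDF output, and the names `vectors`, `unit`, `cosine_from_unit` and `cosine_manual` are my own.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two hypothetical feature vectors, one per row.
vectors = np.array([[3.0, 4.0, 0.0],
                    [4.0, 3.0, 0.0]])

# After L2 normalization every row has length 1.
unit = Normalizer(norm="l2").fit_transform(vectors)

# Dot product of the unit vectors.
cosine_from_unit = np.dot(unit[0], unit[1])

# Cosine similarity computed from the textbook definition.
cosine_manual = np.dot(vectors[0], vectors[1]) / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1])
)

print(cosine_from_unit, cosine_manual)
```

Both computations agree, which is why normalizing once up front lets later similarity queries use plain dot products.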


4. OneHotEncoder: one-hot encoding

from sklearn.preprocessing import OneHotEncoder
import numpy as np
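'''
One-hot encoding:
Encodes categorical features in one-of-K form
'''
x_train = np.array([1, 3, 4]).reshape(-1, 1)
onehot = OneHotEncoder(n_values='auto', categorical_features='all', dtype=np.float64, sparse=True, handle_unknown='error')
'''
n_values: number of distinct values per feature
categorical_features: which features to encode ('all', or an array of indices or a mask)
dtype: data type of the output
sparse: whether to return a sparse matrix
handle_unknown: what to do when transform meets a category not seen during fit ('error' or 'ignore')
'''
onehot.fit(x_train)
print(onehot.transform(x_train).toarray())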
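
The constructor above follows the 2017 API. In more recent scikit-learn releases the `n_values` and `categorical_features` arguments are no longer accepted, and the categories are inferred from the data by default. The sketch below sticks to arguments that I believe are stable across recent versions and shows what `handle_unknown='ignore'` does with an unseen category. The names `encoded` and `unseen` are my own.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x_train = np.array([1, 3, 4]).reshape(-1, 1)

# Categories are learned from the data; unknown values seen at transform
# time are encoded as an all-zero row instead of raising an error.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(x_train)

encoded = onehot.transform(x_train).toarray()
unseen = onehot.transform(np.array([[2]])).toarray()

print(onehot.categories_)  # learned categories, one array per feature
print(encoded)
print(unseen)
```

With three distinct training values, each sample becomes a row with a single 1, while the value 2, which never appeared during fit, becomes a row of zeros.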


5. StandardScaler and MinMaxScaler: standardization and min-max scaling

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import numpy as np
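'''
StandardScaler standardizes each feature to zero mean and unit variance.
Use cases: PCA, SVMs with an RBF kernel, and similar
Note: do not fit separate scalers on the training set and the test set. Fit on the training set, then apply the same fitted scaler to the test set:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
'''
x_train = np.array([[1, 2, -1],
                    [2, 3, -2],
                    [1, -1, 1]])
stan = StandardScaler(copy=True, with_mean=True, with_std=True)
stan.fit(x_train)
stan.transform(x_train)

maxmin = MinMaxScaler(feature_range=(0, 1), copy=True)
maxmin.fit(x_train)
maxmin.transform(x_train)

# feature_range: the (min, max) range that each feature is scaled into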
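
As a sketch of the fit-on-train, transform-on-test rule, the example below scales a held-out sample with statistics learned from the training data only and compares the result with the formulas written out in NumPy. The sample `x_test` is a made-up value, and names such as `manual_std` and `manual_mm` are my own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x_train = np.array([[1.0, 2.0, -1.0],
                    [2.0, 3.0, -2.0],
                    [1.0, -1.0, 1.0]])
x_test = np.array([[3.0, 0.0, 0.0]])  # hypothetical held-out sample

# Standardization: (x - train mean) / train standard deviation, per column.
scaler = StandardScaler().fit(x_train)
x_test_std = scaler.transform(x_test)
manual_std = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)

# Min-max scaling: (x - train min) / (train max - train min), per column.
minmax = MinMaxScaler(feature_range=(0, 1)).fit(x_train)
x_test_mm = minmax.transform(x_test)
col_min = x_train.min(axis=0)
col_max = x_train.max(axis=0)
manual_mm = (x_test - col_min) / (col_max - col_min)

print(x_test_std)
print(x_test_mm)
```

Because the test sample's first feature is larger than anything seen in training, its min-max value lands outside the (0, 1) range. That is expected when the scaler is fitted on the training set only.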


6. RobustScaler: robust scaling

RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
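
The notes give only the constructor signature here, so the following is a minimal sketch of what those defaults do: centering subtracts the median, and scaling divides by the range between the 25th and 75th percentiles. The one-column data with a single outlier is made up for illustration, and the name `scaled` is my own.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with an obvious outlier in the last row.
x_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler(with_centering=True, with_scaling=True,
                      quantile_range=(25.0, 75.0), copy=True)
scaled = robust.fit_transform(x_train)

print(robust.center_)  # per-feature median
print(robust.scale_)   # per-feature interquartile range
print(scaled.ravel())
```

The outlier barely influences the learned median and interquartile range, so the ordinary values stay on a comparable scale, which a mean and standard deviation would not guarantee.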