
scikit-learn: 4.3. Preprocessing data (standardization / normalization / binarization, encoding categorical features, missing values)

2015-07-23 09:10
Reference: http://scikit-learn.org/stable/modules/preprocessing.html

This post covers the utility functions and transformer classes of the sklearn.preprocessing package, including standardization, normalization, binarization, encoding of categorical features, and imputation of missing values.

1. Standardization, or mean removal and variance scaling

Standardization transforms features so that they look like standard normally distributed data (Gaussian with zero mean and unit variance). The point is that all features end up on the same standardized scale, which prevents a feature with a large variance from dominating the estimator's result.

(Before the details, see the further discussion on the importance of centering and scaling data: http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)

In practice we usually do not care about the exact shape of the distribution: we simply subtract each feature's mean from that feature and then divide by the feature's standard deviation.
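
A minimal NumPy sketch of this manual recipe (using a small X, the same matrix that reappears in the scale example below; the printed digits are only illustrative):

>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> (X - X.mean(axis=0)) / X.std(axis=0)   # center each column, then divide by its std
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])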

This section mainly covers the scale function, the StandardScaler class, and MinMaxScaler:

The function scale provides a quick and easy way to perform this operation on a single array-like dataset:

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)

>>> X_scaled                                          
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])


Scaled data has zero mean and unit variance:

>>> X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])

>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])


The preprocessing module further provides a utility class StandardScaler, which computes the mean and standard deviation on a training set and can then apply the same transformation to the test set.
This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:

>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)

>>> scaler.mean_                                      
array([ 1. ...,  0. ...,  0.33...])

>>> scaler.std_                                       
array([ 0.81...,  0.81...,  1.24...])

>>> scaler.transform(X)                               
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])


The scaler instance can then be used on new data to transform it the same way it did on the training set:

>>> scaler.transform([[-1.,  1., 0.]])                
array([[-2.44...,  1.22..., -0.26...]])


It is possible to disable either centering or scaling by passing with_mean=False or with_std=False to the constructor of StandardScaler.
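
For example, a minimal sketch of a scaler with centering disabled (it only divides by the per-feature standard deviation; useful e.g. when centering would destroy sparsity — the digits below are worked out by hand and only illustrative):

>>> scaler_no_center = preprocessing.StandardScaler(with_mean=False).fit(X)
>>> scaler_no_center.transform(X)
array([[ 1.22..., -1.22...,  1.60...],
       [ 2.44...,  0.  ...,  0.  ...],
       [ 0.  ...,  1.22..., -0.80...]])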

MinMaxScaler, as its name suggests, scales features to a given [min, max] range, most commonly [0, 1]. The motivation is to preserve zero entries in sparse data and to add robustness when features have very small standard deviations.

The underlying formula is:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

(where min and max are the bounds of the desired feature range, e.g. 0 and 1)


(Note that these scalers also work on 1-d arrays, not only on a 2-d X, so for regression tasks you can consider scaling the target variable as well.)

Here is an example of scaling to [0, 1]:

>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])


Once the scaler has been fitted on the training set (whether via fit or fit_transform), the same transformation can be applied to the test set:

>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])
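
MinMaxScaler also accepts an explicit feature_range; a minimal sketch that rescales the same X_train to [-1, 1] (the output digits follow from the formula above and are only illustrative):

>>> min_max_scaler_sym = preprocessing.MinMaxScaler(feature_range=(-1, 1))
>>> min_max_scaler_sym.fit_transform(X_train)
array([[ 0.        , -1.        ,  1.        ],
       [ 1.        ,  0.        , -0.33333333],
       [-1.        ,  1.        , -1.        ]])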


(Beyond centering and scaling, some models assume linear independence between features, e.g. when PCA is applied to images. To remove the linear correlation between features, see sklearn.decomposition.PCA or sklearn.decomposition.RandomizedPCA with whiten=True.)

2. Normalization

Normalization is the process of scaling individual samples to have unit norm.

Normalization is very useful when you use a quadratic form, such as the dot product or any other kernel, to quantify the similarity of any pair of samples. It is also the basis of the Vector Space Model (VSM) commonly used in text classification and clustering.

The function normalize supports both the l1 and the l2 norm:


>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')

>>> X_normalized                                      
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
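
The same call with norm='l1' instead divides each sample by the sum of its absolute values; a minimal sketch (digits worked out by hand):

>>> preprocessing.normalize(X, norm='l1')
array([[ 0.25, -0.25,  0.5 ],
       [ 1.  ,  0.  ,  0.  ],
       [ 0.  ,  0.5 , -0.5 ]])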

The utility class Normalizer does the same thing, but implements the Transformer API (i.e. fit on the training set, then apply the same transform to the test set):

>>> normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing (each sample is handled independently of the others)
>>> normalizer
Normalizer(copy=True, norm='l2')
>>> normalizer.transform(X)                            
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
>>> normalizer.transform([[-1.,  1., 0.]])             
array([[-0.70...,  0.70...,  0.  ...]])


3. Feature Binarization

Feature binarization is the process of thresholding numerical features to get boolean values.

This preprocessing matters for estimators that assume the input data follows a multi-variate Bernoulli distribution. In text processing, although tf or tf-idf features usually work well, binary features are sometimes used too.

The utility class Binarizer:

>>> binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)
>>> binarizer.transform(X)
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
>>> binarizer = preprocessing.Binarizer(threshold=1.1) # adjust the threshold of the binarizer
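
For completeness, the transform with the adjusted threshold: only values strictly greater than 1.1 map to 1:

>>> binarizer.transform(X)
array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])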


4. Encoding categorical features

Categorical features often take non-numeric values. For example, a person could be described by the features [["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]].

A first idea is to represent such features with integers, e.g. ["male", "from US", "uses Internet Explorer"] becomes [0, 1, 3] and ["female", "from Asia", "uses Chrome"] becomes [1, 2, 1].
However, this integer representation cannot be fed directly to scikit-learn estimators, which expect continuous numerical input and would interpret the categories as ordered, which is usually not what we want (the set of browsers {0, 1, 2, 3} has no meaningful order).

One possible solution is to convert each categorical feature into one-of-K (one-hot) features. This conversion is implemented by OneHotEncoder, and the result can be used with scikit-learn estimators.

What is a one-hot feature? This estimator transforms each categorical feature with m possible values into m binary features, with only one active. (This encoding is extremely common in CTR prediction!)

Continuing the person-features example above:

>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'float'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

By default, the number of values each feature can take is inferred automatically from the training set passed to fit; it can also be specified explicitly with the n_values parameter.
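
A minimal sketch with n_values given explicitly, so the encoder reserves 2, 3, and 4 slots for the three features even if the training data does not contain every possible value:

>>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
>>> enc.fit([[1, 2, 3], [0, 2, 0]])  
OneHotEncoder(categorical_features='all', dtype=<... 'float'>,
       handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])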

The above covers categorical features represented as integers. For categorical features represented as dicts, see Loading features from dicts; a translated write-up is in section 2, Loading features from dicts, of /article/1323243.html.

5. Imputation of missing values

scikit-learn estimators assume that all values in an array are numerical. For samples with missing values, a better strategy than discarding them is to infer the missing values from the known part of the data.

The Imputer class provides basic strategies for imputing missing values, using the mean, the median, or the most frequent value of the row or column in which the missing values are located. It also allows for different missing value encodings, i.e. the missing_values parameter controls which placeholder (np.nan, 0, ...) marks a missing entry. An example makes it clear:

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)  # missing values encoded as np.nan; fill with the column (axis=0) mean
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))                           
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]
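
The other built-in strategies work the same way; a minimal sketch with strategy='median' (the column medians of the fitted data are 4 and 3, so the digits below follow by hand):

>>> imp_median = Imputer(missing_values='NaN', strategy='median', axis=0)
>>> imp_median.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)
>>> print(imp_median.transform([[np.nan, 2], [6, np.nan], [7, 6]]))
[[ 4.  2.]
 [ 6.  3.]
 [ 7.  6.]]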


Imputer also supports sparse matrices:

>>> import scipy.sparse as sp
>>> X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)

Because missing values are encoded here by 0, they are stored implicitly in the sparse matrix; this is very useful for high-dimensional sparse features.
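
For completeness, a sketch of fitting and transforming in this sparse setting (the column means over the non-missing entries are 4 and 3.666..., consistent with the dense example above):

>>> imp.fit(X)
Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)
>>> X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
>>> print(imp.transform(X_test))
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]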

All of the preprocessing steps above can be applied as needed in the early stages of a sklearn.pipeline.Pipeline.
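
A minimal sketch of such a pipeline (the choice of LogisticRegression as the final estimator is just an illustrative assumption):

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> pipe = Pipeline([('scale', preprocessing.StandardScaler()),
...                  ('clf', LogisticRegression())])
>>> # pipe.fit(X_train, y_train) fits the scaler on the training data only;
>>> # pipe.predict(X_test) then applies the same scaling before predicting.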

Done.