
python Normalization, Standardization, and Regularization

2017-06-30 20:50
        In an earlier article we analyzed some of the features. Looking at the summary statistics returned by describe, many of the features are sparse.
# First drop rows with missing values, then look at the summary statistics
ads = ads.dropna(axis=0)
print(ads.describe())
# As the output below shows, the 25% (and even 75%) quantiles of a large number of features are 0, i.e. the features are sparsely distributed
#              0            1            2            3            4     \
# count  2359.000000  2359.000000  2359.000000  2359.000000  2359.000000
# mean     63.912251   155.631624     3.912982     0.759644     0.002120
# std      54.881130   130.237867     6.047220     0.427390     0.045999
# min       1.000000     1.000000     0.001500     0.000000     0.000000
# 25%      25.000000    80.500000     1.033450     1.000000     0.000000
# 50%      51.000000   110.000000     2.111100     1.000000     0.000000
# 75%      84.000000   184.000000     5.333300     1.000000     0.000000
# max     640.000000   640.000000    60.000000     1.000000     1.000000
#
#          5            6            7            8            9     \
# count  2359.0  2359.000000  2359.000000  2359.000000  2359.000000
# mean      0.0     0.006359     0.004663     0.004663     0.014837
# std       0.0     0.079504     0.068141     0.068141     0.120925
# min       0.0     0.000000     0.000000     0.000000     0.000000
# 25%       0.0     0.000000     0.000000     0.000000     0.000000
# 50%       0.0     0.000000     0.000000     0.000000     0.000000
# 75%       0.0     0.000000     0.000000     0.000000     0.000000
# max       0.0     1.000000     1.000000     1.000000     1.000000
#
#           ...              1549         1550         1551         1552  \
# count     ...       2359.000000  2359.000000  2359.000000  2359.000000
# mean      ...          0.003815     0.001272     0.002120     0.002543
# std       ...          0.061662     0.035646     0.045999     0.050379
# min       ...          0.000000     0.000000     0.000000     0.000000
# 25%       ...          0.000000     0.000000     0.000000     0.000000
# 50%       ...          0.000000     0.000000     0.000000     0.000000
# 75%       ...          0.000000     0.000000     0.000000     0.000000
# max       ...          1.000000     1.000000     1.000000     1.000000
#
#               1553         1554         1555        1556         1557  \
# count  2359.000000  2359.000000  2359.000000  2359.00000  2359.000000
# mean      0.008478     0.013989     0.014837     0.00975     0.000848
# std       0.091705     0.117470     0.120925     0.09828     0.029111
# min       0.000000     0.000000     0.000000     0.00000     0.000000
# 25%       0.000000     0.000000     0.000000     0.00000     0.000000
# 50%       0.000000     0.000000     0.000000     0.00000     0.000000
# 75%       0.000000     0.000000     0.000000     0.00000     0.000000
# max       1.000000     1.000000     1.000000     1.00000     1.000000
#
#               1558
# count  2359.000000
# mean      0.161509
# std       0.368078
# min       0.000000
# 25%       0.000000
# 50%       0.000000
# 75%       0.000000
# max       1.000000
#
# [8 rows x 1559 columns]
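The sparsity can be confirmed directly from the data. A minimal sketch (assuming ads is the DataFrame used above; this check is illustrative and not part of the original code) that counts the columns whose 75% quantile is still 0:
# Count columns whose 75% quantile is 0, i.e. (for these non-negative features)
# columns that are 0 in at least three quarters of the rows
quantiles_75 = ads.quantile(0.75)
n_sparse = int((quantiles_75 == 0).sum())
print(n_sparse, "of", ads.shape[1], "columns are 0 in at least 75% of the rows")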
Preprocessing the features appropriately can improve a model's results. The main techniques are normalization (min-max scaling), standardization, and regularization in the Normalizer sense, i.e. scaling each sample to unit norm. sklearn.preprocessing provides ready-made classes for all three, and they are very convenient to use.
# Import the data-preprocessing utilities from sklearn.preprocessing
from sklearn.preprocessing import MinMaxScaler
df_all = ads.values
X = df_all[:, :-1]
y = df_all[:, -1]
# Normalization (min-max scaling): removes the differences in scale between features,
# making them comparable and easy to process together, while preserving the sparsity of the data.
# In a neural network, for example, normalization can speed up convergence during training.
X_scaler = MinMaxScaler().fit_transform(X)
print(X_scaler)
# [[ 0.19405321  0.19405321  0.01664208 ...,  0.          0.          0.        ]
#  [ 0.08763693  0.73082942  0.13682009 ...,  0.          0.          0.        ]
#  [ 0.05007825  0.35837246  0.1161379  ...,  0.          0.          0.        ]
#  ...,
#  [ 0.15649452  0.21752739  0.02307724 ...,  0.          0.          0.        ]
#  [ 0.03442879  0.18622848  0.08693217 ...,  0.          0.          0.        ]
#  [ 0.06103286  0.06103286  0.01664208 ...,  0.          0.          0.        ]]
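As a sanity check on what MinMaxScaler computes, the same result can be reproduced by hand with the min-max formula (x - x_min) / (x_max - x_min). A minimal sketch, reusing X from above; the epsilon guard for constant columns is my own illustrative choice:
import numpy as np
# Manual column-wise min-max scaling; constant columns would divide by zero,
# so guard the denominator (MinMaxScaler handles this case internally)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / np.maximum(col_max - col_min, 1e-12)
print(np.allclose(X_manual, MinMaxScaler().fit_transform(X)))  # should print True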
# Standardization: rescales each feature to zero mean and unit variance,
# which makes it easier to exploit the properties of the standard normal distribution in later processing.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
X_scaler = scaler.transform(X)
# Note: the old scaler.std_ attribute has been removed from newer scikit-learn;
# scale_ holds the per-feature standard deviation
print(scaler.mean_, scaler.scale_)
print(X_scaler)
# [[ 1.11332804 -0.23524739 -0.48180809 ..., -0.12272017 -0.09922646
#   -0.02912965]
#  [-0.12597621  2.39895364  0.71081076 ..., -0.12272017 -0.09922646
#   -0.02912965]
#  [-0.5633777   0.57114068  0.50556553 ..., -0.12272017 -0.09922646
#   -0.02912965]
#  ...,
#  [ 0.67592654 -0.12004909 -0.41794703 ..., -0.12272017 -0.09922646
#   -0.02912965]
#  [-0.74562833 -0.27364682  0.21573459 ..., -0.12272017 -0.09922646
#   -0.02912965]
#  [-0.43580227 -0.88803773 -0.48180809 ..., -0.12272017 -0.09922646
#   -0.02912965]]
# Regularization (sample-wise normalization): unlike the methods above, it processes each sample,
# scaling every sample to unit norm; this is useful for computing similarity between samples.
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X)
X_scaler = scaler.transform(X)
print(X_scaler)
# [[ 0.70693714  0.70693714  0.0056555  ...,  0.          0.          0.        ]
#  [ 0.12088013  0.99248947  0.01741204 ...,  0.          0.          0.        ]
#  [ 0.14193112  0.98921689  0.02997585 ...,  0.          0.          0.        ]
#  ...,
#  [ 0.58495049  0.81082246  0.00802772 ...,  0.          0.          0.        ]
#  [ 0.18799975  0.98086824  0.0426457  ...,  0.          0.          0.        ]
#  [ 0.70589457  0.70589457  0.01764736 ...,  0.          0.          0.        ]]
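A quick way to check the defining property of each transform (a minimal sketch, reusing X and the classes imported above): after StandardScaler every non-constant column has mean 0 and standard deviation 1, and after Normalizer every non-zero row has L2 norm 1, so dot products between rows directly give cosine similarities.
import numpy as np
# Column means should be ~0 after standardization (all-zero columns simply stay 0)
X_std = StandardScaler().fit_transform(X)
print(np.abs(X_std.mean(axis=0)).max())    # close to 0
# Row norms should be 1 after sample-wise normalization (all-zero rows would stay 0)
X_norm = Normalizer().fit_transform(X)
print(np.linalg.norm(X_norm, axis=1)[:5])  # each entry close to 1.0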