
Preprocessing Data in Python and Saving It with the pickle Module

2018-01-21 19:17

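In machine learning, raw data usually cannot be used as-is. It first has to be preprocessed: dropping some features, removing dirty records, normalizing values, one-hot encoding labels, and so on. These steps are part of feature engineering. Repeating them every time the program starts is wasteful, so the preprocessed data can be saved to disk with the pickle module and reloaded later, much like saving your progress in a video game.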

The notMNIST dataset is used below to walk through the preprocessing steps.

1. Import the required modules

First, import the modules that will be needed.

import hashlib
import os
import pickle
from urllib.request import urlretrieve

import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils import resample
from tqdm import tqdm
from zipfile import ZipFile

print('All modules imported.')

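hashlib: provides common digest algorithms such as MD5 and SHA-1. A digest algorithm (also called a hash algorithm) maps data of any length to a fixed-length value, usually written as a hexadecimal string. A digest is not encryption, because it cannot be reversed to recover the input. It is used to detect corrupted or modified data and to verify passwords without storing them in plaintext. MD5 is adequate for catching a damaged download, which is how it is used here, but it is no longer considered secure against deliberate tampering. (See: hashlib)

os: wraps common operating-system functionality, mainly access to directories and files, in a way that is portable across platforms.

pickle: the serialization module. It converts in-memory Python objects into a byte stream that can be stored or transmitted.

urllib.request: urllib is the package for working with URLs, and urllib.request is its extensible library for opening them. (See: the urllib.request module in Python)

PIL: the Python Imaging Library, the de facto standard image-processing library for Python. It is a third-party package rather than part of the standard library, and is installed today through its maintained fork, Pillow. (See: PIL)

tqdm: displays progress bars; simple and intuitive. (See: an analysis of the Python tqdm module, tqdm)

zipfile: reads and writes ZIP archives, that is, compresses and extracts files in the zip format. (See: zipfile)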
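To make the "fixed-length output" property of digests concrete, here is a minimal sketch using only hashlib. The input byte strings are arbitrary values chosen for illustration.

```python
import hashlib

# Two inputs of very different sizes (arbitrary demo data).
data_short = b"a"
data_long = b"a" * 1_000_000

md5_short = hashlib.md5(data_short).hexdigest()
md5_long = hashlib.md5(data_long).hexdigest()
sha1_short = hashlib.sha1(data_short).hexdigest()

# An MD5 hex digest is always 32 characters, a SHA-1 hex digest always 40,
# regardless of how long the input is.
print(len(md5_short), len(md5_long), len(sha1_short))

# Different inputs give different digests.
print(md5_short != md5_long)
```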

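2. Download the dataset

When a model is trained on a cloud host, the dataset is usually uploaded to cloud storage first and then downloaded from there directly onto the host, which is typically much faster than uploading it from a local machine.

The data is packaged as ".zip" archives.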

def download(url, file):
    """
    Download a file from <url>.
    :param url: URL of the file
    :param file: local path to save the file to
    """
    # Skip the download if the file is already present locally.
    if not os.path.isfile(file):
        print('Downloading ' + file + '...')
        urlretrieve(url, file)
        print('Download Finished')

# Download the training set and the test set.
download('https://s3.amazonaws.com/udacity-sdc/notMNIST_train.zip', 'notMNIST_train.zip')
download('https://s3.amazonaws.com/udacity-sdc/notMNIST_test.zip', 'notMNIST_test.zip')

# Verify the MD5 checksums to make sure the downloaded files are intact.
assert hashlib.md5(open('notMNIST_train.zip', 'rb').read()).hexdigest() == 'c8673b3f28f489e9cdf3a3d74e2ac8fa',\
        'notMNIST_train.zip file is corrupted.  Remove the file and try again.'
assert hashlib.md5(open('notMNIST_test.zip', 'rb').read()).hexdigest() == '5d3c7e653e63471c88df796156a9dfa9',\
        'notMNIST_test.zip file is corrupted.  Remove the file and try again.'

# Reached only once both files are present and verified.
print('All files downloaded.')
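The checksum lines above read each archive into memory in one go and never close the file explicitly. One possible alternative is to hash the file in chunks inside a with block. The helper name file_md5 and the chunk size are assumptions made for this sketch, and the self-check runs on a throwaway temporary file rather than on the real archives.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read() until it returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Self-check: the chunked digest must equal the digest of the whole content.
payload = b'notMNIST' * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name
try:
    chunked = file_md5(tmp_path, chunk_size=64)
    whole = hashlib.md5(payload).hexdigest()
    print(chunked == whole)
finally:
    os.remove(tmp_path)
```

With this helper, the check for the training archive could be written as assert file_md5('notMNIST_train.zip') == 'c8673b3f28f489e9cdf3a3d74e2ac8fa'.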

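3. Extract the features and labels

Define a function that extracts the features and labels from a zip file and stores them as NumPy arrays. The training set is then resampled to a fixed number of examples so that the model can be tested quickly later on.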

def uncompress_features_labels(file):
    """
    Extract the features and labels from a zip file.
    :param file: the zip file to extract from
    """
    features = []
    labels = []

    with ZipFile(file) as zipf:
        # Progress bar
        filenames_pbar = tqdm(zipf.namelist(), unit='files')

        # Extract the features and labels from every file in the archive
        for filename in filenames_pbar:
            # Skip directory entries, whose names end with '/'
            if not filename.endswith('/'):
                # ZipFile.open() gives binary access to the archived file
                with zipf.open(filename) as image_file:
                    image = Image.open(image_file)
                    image.load()
                    # Flatten the image into a one-dimensional float32 array
                    feature = np.array(image, dtype=np.float32).flatten()

                # The first character of the file name is the label.
                # os.path.split() returns (directory, file name); the file name contains no '/'
                label = os.path.split(filename)[1][0]

                features.append(feature)
                labels.append(label)
    return np.array(features), np.array(labels)

# Extract the features and labels of the training and test sets from the zip files
train_features, train_labels = uncompress_features_labels('notMNIST_train.zip')
test_features, test_labels = uncompress_features_labels('notMNIST_test.zip')

# Resample the training data to docker_size_limit examples.
# Note: resample() draws with replacement by default, so duplicates are possible.
docker_size_limit = 150000
train_features, train_labels = resample(train_features, train_labels, n_samples=docker_size_limit)

# Flags for the feature-engineering steps, so that none of them is skipped
is_features_normal = False
is_labels_encod = False

# Reached once all features and labels have been extracted
print('All features and labels uncompressed.')
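The directory check and the label rule used in the function above can be tried out without downloading anything. The sketch below builds a tiny in-memory archive; the entry names such as demo/A_font1.png are hypothetical and only mimic the "label is the first character of the file name" convention.

```python
import io
import os
from zipfile import ZipFile

# Build a small archive in memory. The entry names are made up for this demo.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as zipf:
    zipf.writestr('demo/', b'')                # a directory entry
    zipf.writestr('demo/A_font1.png', b'x')    # placeholder bytes, not a real image
    zipf.writestr('demo/B_font2.png', b'y')

labels = []
with ZipFile(buffer) as zipf:
    for name in zipf.namelist():
        # Directory entries end with '/', so they carry no label.
        if name.endswith('/'):
            continue
        # os.path.split() -> (directory, file name); take the first character.
        labels.append(os.path.split(name)[1][0])

print(labels)  # ['A', 'B']
```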

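4. Feature engineering

Only two steps are performed here: normalizing the features and binarizing (one-hot encoding) the labels.

Normalizing

The features of the training and test sets are normalized with Min-Max scaling.

Min-Max Scaling: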

$$X' = a + \frac{(X - X_{\min})(b - a)}{X_{\max} - X_{\min}}$$

# Normalization for grayscale image data
def normalize_grayscale(image_data):
    """
    Scale grayscale pixel values from [0, 255] to [0.1, 0.9].
    :param image_data: the image data to normalize
    :return: the normalized data
    """
    a = 0.1
    b = 0.9
    grayscale_min = 0
    grayscale_max = 255
    return a + ( ( (image_data - grayscale_min)*(b - a) )/( grayscale_max - grayscale_min ) )

# Normalize only if it has not been done yet, then set the flag to True
if not is_features_normal:
    train_features = normalize_grayscale(train_features)
    test_features = normalize_grayscale(test_features)
    is_features_normal = True
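As a quick sanity check of the scaling formula, the sketch below applies it to three sample pixel values. The function name min_max_scale and the sample values are assumptions for illustration. With a = 0.1 and b = 0.9, a pixel value of 0 should map to about 0.1, 51 to about 0.26, and 255 to about 0.9.

```python
import numpy as np

def min_max_scale(x, x_min, x_max, a=0.1, b=0.9):
    """Min-Max scaling of x from [x_min, x_max] to [a, b]."""
    return a + (x - x_min) * (b - a) / (x_max - x_min)

# Three sample grayscale values: the minimum, an interior value, the maximum.
pixels = np.array([0, 51, 255], dtype=np.float32)
scaled = min_max_scale(pixels, 0, 255)

print(scaled)
print(scaled.min() >= 0.1 - 1e-6, scaled.max() <= 0.9 + 1e-6)
```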

Label binarization

LabelBinarizer() from sklearn.preprocessing is used to binarize the labels A~J, that is, to one-hot encode them.

if not is_labels_encod:
    # One-hot encode the labels, turning each letter into a vector of 0s and 1s
    encoder = LabelBinarizer()
    encoder.fit(train_labels)
    train_labels = encoder.transform(train_labels)
    test_labels = encoder.transform(test_labels)

    # Convert to float32 so the labels can be multiplied with float tensors in TensorFlow later
    train_labels = train_labels.astype(np.float32)
    test_labels = test_labels.astype(np.float32)
    is_labels_encod = True
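To show what a one-hot encoded label matrix looks like, here is a NumPy-only sketch. It is an illustration of the output shape and not how LabelBinarizer is implemented, and the four sample labels are made up. The columns follow the sorted class order, which is also the order LabelBinarizer uses for its classes.

```python
import numpy as np

# Made-up sample labels with three distinct classes.
labels = np.array(['C', 'A', 'B', 'A'])

# np.unique returns the distinct classes in sorted order: ['A', 'B', 'C'].
classes = np.unique(labels)

# Compare every label against every class: row i has a 1 in the column of its class.
one_hot = (labels[:, None] == classes[None, :]).astype(np.float32)

print(classes)
print(one_hot)
```

Each row contains exactly one 1, so the row sums are all 1.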

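Split off a validation set and shuffle the data

First check that feature engineering has been completed, then use train_test_split from sklearn.model_selection to randomly split the training data into a training set and a validation set.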

assert is_features_normal, 'You skipped the step to normalize the features'
assert is_labels_encod, 'You skipped the step to One-Hot Encode the labels'

# Randomly split the data into a training set and a validation set (5% for validation)
train_features, valid_features, train_labels, valid_labels = train_test_split(
    train_features,
    train_labels,
    test_size=0.05,
    random_state=832289)
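Conceptually, a shuffled split permutes the row indices once and cuts the permutation in two, so each feature row stays paired with its label. The NumPy-only sketch below illustrates that idea. It is not scikit-learn's implementation, and the function name shuffled_split, the seed, and the toy arrays are assumptions.

```python
import numpy as np

def shuffled_split(features, labels, test_size=0.05, seed=0):
    """Shuffle row indices once, then cut them into validation and training parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_valid = int(np.ceil(len(features) * test_size))
    valid_idx, train_idx = order[:n_valid], order[n_valid:]
    # The same indices are applied to both arrays, keeping rows and labels aligned.
    return features[train_idx], features[valid_idx], labels[train_idx], labels[valid_idx]

# Toy data: row i is [2*i, 2*i + 1] and its label is i.
features = np.arange(200).reshape(100, 2)
labels = np.arange(100)

tr_x, va_x, tr_y, va_y = shuffled_split(features, labels)
print(len(tr_x), len(va_x))  # 95 5
```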

5. Save the data

Use the pickle module to store the processed data in a pickle file so that it can be reused later. In other words, create a checkpoint.

# Save the data for later use
pickle_file = 'notMNIST.pickle'
# Write the file only if it does not already exist
if not os.path.isfile(pickle_file):
    print('Saving data to pickle file...')
    try:
        with open(pickle_file, 'wb') as pfile:
            pickle.dump(
                {
                    'train_dataset': train_features,
                    'train_labels': train_labels,
                    'valid_dataset': valid_features,
                    'valid_labels': valid_labels,
                    'test_dataset': test_features,
                    'test_labels': test_labels,
                },
                pfile, pickle.HIGHEST_PROTOCOL)
    except Exception as e:
        print('Unable to save data to', pickle_file, ':', e)
        raise

print('Data cached in pickle file.')

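pickle.HIGHEST_PROTOCOL: the highest pickle protocol version available in the running interpreter. Protocol versions ranged from 0 to 4 when this was written; Python 3.8 added protocol 5. (See: protocol version)

pickle.dump(obj, file[, protocol]): serializes obj and writes the resulting byte stream to file, which must be open for writing in binary mode.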
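The round trip can be tried in memory with pickle.dumps and pickle.loads, the bytes-based counterparts of pickle.dump and pickle.load. This is a minimal sketch, and the small dictionary is a stand-in for the real dataset.

```python
import pickle

# A small stand-in for the real checkpoint dictionary.
checkpoint = {'train_labels': [1.0, 0.0], 'note': 'demo'}

# Serialize to a bytes object using the highest protocol available.
blob = pickle.dumps(checkpoint, pickle.HIGHEST_PROTOCOL)

# Deserialize and confirm that the object survived the round trip.
restored = pickle.loads(blob)

print(type(blob).__name__, restored == checkpoint)
print(pickle.HIGHEST_PROTOCOL >= pickle.DEFAULT_PROTOCOL)
```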

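6. Load the data

The next time the program runs, the data does not need to be processed again. It can be loaded straight from the saved pickle file, as follows.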

# Show plots inline in the page (Jupyter Notebook only).
# The comment is kept on its own line because a trailing comment is parsed as a magic argument.
%matplotlib inline

# Load the modules
import pickle
import math

import numpy as np
import tensorflow as tf
from tqdm import tqdm
import matplotlib.pyplot as plt

# Read the data
pickle_file = 'notMNIST.pickle'
with open(pickle_file, 'rb') as f:
    pickle_data = pickle.load(f)       # deserialize; the inverse of pickle.dump
    train_features = pickle_data['train_dataset']
    train_labels = pickle_data['train_labels']
    valid_features = pickle_data['valid_dataset']
    valid_labels = pickle_data['valid_labels']
    test_features = pickle_data['test_dataset']
    test_labels = pickle_data['test_labels']
    del pickle_data  # free memory

print('Data and modules loaded.')
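A checkpoint written by an older version of the script might lack some of the keys read above. One possible guard is a small loader that verifies the keys before returning the data. The helper name load_checkpoint and the dummy file contents are assumptions for this sketch. Also keep in mind that unpickling can execute arbitrary code, so pickle.load should only be used on files you trust.

```python
import os
import pickle
import tempfile

EXPECTED_KEYS = {
    'train_dataset', 'train_labels',
    'valid_dataset', 'valid_labels',
    'test_dataset', 'test_labels',
}

def load_checkpoint(path):
    """Load a pickle checkpoint and verify that all expected keys are present."""
    with open(path, 'rb') as f:
        data = pickle.load(f)
    missing = EXPECTED_KEYS - set(data)
    if missing:
        raise KeyError('checkpoint is missing keys: ' + ', '.join(sorted(missing)))
    return data

# Demo with a throwaway file holding dummy values instead of real arrays.
dummy = {key: [0.0] for key in EXPECTED_KEYS}
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    pickle.dump(dummy, tmp, pickle.HIGHEST_PROTOCOL)
    tmp_path = tmp.name
try:
    loaded = load_checkpoint(tmp_path)
    print(sorted(loaded))
finally:
    os.remove(tmp_path)
```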
