
Dataset Splitting: Training, Test, and Validation Sets

2020-08-07 11:33


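Splitting a dataset by ratio (at the sentence level)

The code below splits a dataset into training, test, and validation sets at an 8:1:1 ratio, keeping each sentence intact: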

# -*- coding: utf-8 -*-
"""
Created on Fri Jul 31 16:26:46 2020
Randomly split a dataset, at the sentence level, into training, test, and validation sets.
@author: jpcheng2
"""
import random


def split(all_list, shuffle=False, ratio=0.8, ratio1=0.9):
    """Cut all_list at ratio and ratio1 and return (train, test, dev)."""
    num = len(all_list)
    offset = int(num * ratio)
    offset1 = int(num * ratio1)
    if num == 0 or offset < 1:
        # Too few items to split: return three values so callers can
        # always unpack (train, test, dev); everything goes to dev.
        return [], [], all_list
    if shuffle:
        random.shuffle(all_list)  # shuffles the list in place
    train = all_list[:offset]
    test = all_list[offset:offset1]
    dev = all_list[offset1:]
    return train, test, dev


def file_shffle_split(file, train, test, dev):
    # Read the input: a sentence is a run of lines closed by a blank line.
    sentences = []
    li = []
    with open(file, 'r', encoding='utf-8') as infilm:
        for data in infilm:
            if data != '\n':
                li.append(data)
            elif li:  # skip repeated blank lines
                sentences.append(li)
                li = []
    if li:
        # Keep the last sentence when the file does not end with a blank line.
        sentences.append(li)

    # 0.8 and 0.9 give the 8:1:1 split described above.
    traindatas, testdatas, devdatas = split(sentences, shuffle=True, ratio=0.8, ratio1=0.9)

    # Write the training, test, and validation sets,
    # with one blank line after every sentence.
    for path, datas in ((train, traindatas), (test, testdatas), (dev, devdatas)):
        with open(path, 'w', encoding='utf-8') as outfilm:
            for sentence in datas:
                for word in sentence:
                    outfilm.write(word)
                outfilm.write('\n')


file_shffle_split('microsoft_test.txt', '0001_train.txt', '0001_test.txt', '0001_dev.txt')
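
The cut points in split are int(num * ratio) and int(num * ratio1), so the set sizes are rounded down and the validation set receives the remainder. The sketch below only illustrates that arithmetic; split_sizes is a hypothetical helper, not part of the original script.

```python
def split_sizes(num, ratio=0.8, ratio1=0.9):
    """Return the (train, test, dev) sizes that the cut points produce.

    Hypothetical helper for illustration only; it mirrors the offset
    arithmetic of split() without touching any data.
    """
    offset = int(num * ratio)    # end of the training slice
    offset1 = int(num * ratio1)  # end of the test slice
    return offset, offset1 - offset, num - offset1


print(split_sizes(10))  # (8, 1, 1)
print(split_sizes(25))  # (20, 2, 3)
print(split_sizes(3))   # (2, 0, 1): the test set can be empty for tiny inputs
```

For small datasets the truncation matters: with 3 sentences both cut points land on 2, so no sentence is left for the test set.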

Input: one file, microsoft_test.txt
Output: three files, '0001_train.txt', '0001_test.txt', and '0001_dev.txt'
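
The script treats a blank line as the end of a sentence, which suggests an input with one token per line. The following is a minimal in-memory sketch under that assumption; SAMPLE, read_sentences, split_sentences, and the seed parameter are all hypothetical names introduced here, and the tag format in the sample is made up. Unlike the script, it shuffles a copy, so the caller's list is left unchanged and a fixed seed gives a repeatable split.

```python
import random

# Hypothetical sample in the assumed format: one token (with a tag) per
# line, and a blank line closing each sentence.
SAMPLE = "I O\nlike O\nParis B-LOC\n\nHello O\nworld O\n\n"


def read_sentences(text):
    """Group non-blank lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in text.splitlines(keepends=True):
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:  # last sentence when the text has no trailing blank line
        sentences.append(current)
    return sentences


def split_sentences(sentences, ratio=0.8, ratio1=0.9, seed=None):
    """Shuffle a copy of the sentences and cut it into (train, test, dev)."""
    shuffled = list(sentences)             # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    offset = int(len(shuffled) * ratio)
    offset1 = int(len(shuffled) * ratio1)
    return shuffled[:offset], shuffled[offset:offset1], shuffled[offset1:]


print(len(read_sentences(SAMPLE)))  # 2

# Ten one-token sentences split 8:1:1.
ten = [[f"w{i} O\n"] for i in range(10)]
train, test, dev = split_sentences(ten, seed=0)
print(len(train), len(test), len(dev))  # 8 1 1
```

Because whole sentences are shuffled rather than individual lines, every token stays with its original sentence in whichever set it ends up.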
