Python: dataformat.py, a general-purpose data format conversion script

2011-12-10 15:26


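Requirement: Hadoop testing needs large amounts of generated data. A table may have, say, 56 columns while the program logic only uses a few of them, so the generated data only needs those few columns.

Build just those few columns, then convert them into the full format of the target table.

Modules used: os, getopt, sys

Input: a text file in the source format

Output: a text file in the target format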

#!/usr/bin/python
# -*- coding: utf-8 -*-
# dataformat.py
# This script converts data from your source format to the destination data format.
# 2011-08-05 created, version 0.1
# 2011-10-29 added column-to-column mapping and default column values; rebuilt all functions. version 0.2
# next: generate data automatically from regular expressions

import os,getopt,sys

# Read the file and return all of its lines
def read_file(path):
    with open(path, "r") as f:
        return f.readlines()

# Convert one source line (already split into parts) to the target format and return it
def one_line_proc(parts, total, ft_map, outsp, empty_fill):
    outline = ""
    for i in range(1, total+1):
        if i in ft_map:
            fill_index = ft_map[i]
            if fill_index.startswith("d"):
                # A leading "d" marks a default value: use the text after it
                outline += fill_index[1:]
            else:
                # Otherwise it is a 1-based column number in the source file
                outline += parts[int(fill_index)-1]
        else:
            # Unconfigured column
            outline += empty_fill
        if i != total:
            outline += outsp
    return outline

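# Conversion entry point: read the file, convert every line, write the result
# The input and output separators both default to \t
def process(inpath, total, to, outpath, insp="\t", outsp="\t", empty_fill=""):
    # ft_map: target column -> source column number (as a string) or "d" + default value
    ft_map = {}
    in_count = 0
    used_row = []
    # First pass: reserve the source columns named in explicit "target:source"
    # entries and count the entries that read from the source file
    for to_row in to:
        if r"\:" not in to_row and len(to_row.split(":"))==2:
            used_row.append(int(to_row.split(":")[1]))
        if r"\=" not in str(to_row) and len(str(to_row).split("="))==2:
            pass
        else:
            in_count += 1

    # Second pass: build the mapping
    for to_row in to:
        if r"\=" not in str(to_row) and len(str(to_row).split("="))==2:
            # "target=value": default value for the column
            ft_map[int(to_row.split("=")[0])] = "d"+to_row.split("=")[1]
        elif r"\:" not in to_row and len(to_row.split(":"))==2:
            # "target:source": explicit mapping
            ft_map[int(to_row.split(":")[0])] = to_row.split(":")[1]
        else:
            # Plain target column: take the lowest source column not used yet
            to_index = 0
            for i in range(1, 100):
                if i not in used_row:
                    to_index = i
                    break
            ft_map[int(to_row)] = str(to_index)
            used_row.append(to_index)

    # Highest source column that any entry reads from
    max_src = max([int(v) for v in ft_map.values() if not v.startswith("d")] + [0])

    lines = read_file(inpath)
    result = []
    for line in lines:
        parts = line.strip("\n").split(insp)
        # Skip lines that have too few fields for the configured columns
        if len(parts) >= max(in_count, max_src):
            outline = one_line_proc(parts, total, ft_map, outsp, empty_fill)
            result.append(outline+"\n")
    f = open(outpath,"w")
    f.writelines(result)
    f.close()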

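# Print the help message
def help_msg():
    print("Purpose: convert a source data file into the target data format")
    print("Options:")
    print("\t -i inputfilepath  [required, path of the source file]")
    print("\t -t n              [required, n is a number: total field count of the target data]")
    print("\t -a '1,3,4'        [required, comma-separated field numbers; listed fields are filled from the source data, unlisted fields get the -f value]")
    print("\t -o outputfilepath [optional, defaults to inputfilepath.dist ]")
    print("\t -F 'FS'           [optional, field separator of the source file, defaults to \\t ]")
    print("\t -P 'OFS'          [optional, field separator of the output file, defaults to \\t ]")
    print("\t -f 'fill'         [optional, content for unlisted fields, defaults to empty ]")
    sys.exit(0)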

# Program entry: parse the arguments and run
def main():
    # Required settings start as None; optional ones start at their defaults
    inpath = outpath = total = to = None
    insp = outsp = "\t"
    empty_fill = ""

    try:
        opts,args = getopt.getopt(sys.argv[1:],"F:P:t:a:i:o:f:h")
    except getopt.GetoptError:
        print(sys.argv[0]+" : params are not defined well!")
        sys.exit(1)

    for op,value in opts:
        if op == "-h":
            help_msg()
        elif op == "-i":
            inpath = value
        elif op == "-o":
            outpath = value
        elif op == "-t":
            total = int(value)
        elif op == "-a":
            to = value.split(",")
        elif op == "-F":
            # Interpret escapes such as \t or \x01; this replaces the
            # Python 2-only value.decode("string_escape")
            insp = value.encode("utf-8").decode("unicode_escape")
        elif op == "-P":
            outsp = value.encode("utf-8").decode("unicode_escape")
        elif op == "-f":
            empty_fill = value

    if len(opts) < 3:
        print(sys.argv[0]+" : at least 3 options are required")
        sys.exit(1)

    if inpath is None:
        print(sys.argv[0]+" : -i is required, the input file path must be given!")
        sys.exit(1)

    if total is None:
        print(sys.argv[0]+" : -t is required, the field count of the result file must be given!")
        sys.exit(1)

    if to is None:
        print(sys.argv[0]+" : -a is required, the fields to fill must be given!")
        sys.exit(1)

    if not os.path.exists(inpath):
        print(sys.argv[0]+" : file %s does not exist"%inpath)
        sys.exit(1)

    if outpath is None:
        outpath = inpath+".dist"

    process(inpath,total,to,outpath,insp,outsp,empty_fill=empty_fill)

if __name__ =="__main__":
    main()

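Usage:

Features: configurable input separator, output separator, fill value for unconfigured fields, and per-column default values

Columns can be filled in order or through an out-of-order mapping

Input: path of the input file

Options:

-i "path"	required	path of the input file
-t n	required	total number of columns in the target table
-a "r1,r2"	required	list of column numbers to fill; entries may carry default values or mappings
-o "path"	optional	path of the output file, defaults to the input file path plus .dist
-F "IFS"	optional	field separator of the input file, defaults to \t
-P "OFS"	optional	field separator of the output file, defaults to \t
-f ""	optional	content for unconfigured columns, defaults to empty
-h	standalone	show the help message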
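
As a rough illustration of how the options above end up as settings, the sketch below parses a sample argument list with getopt, using the same option string as the script. The helper name parse_options and the sample arguments are assumptions made for this example, and separator escapes are decoded with unicode_escape as a Python 3 stand-in for Python 2's string_escape.

```python
import getopt

def parse_options(argv):
    """Parse dataformat.py-style options into a dict (hypothetical helper)."""
    # Defaults as described in the option list
    conf = {"insp": "\t", "outsp": "\t", "empty_fill": "", "outpath": None}
    opts, _ = getopt.getopt(argv, "F:P:t:a:i:o:f:h")
    for op, value in opts:
        if op == "-i":
            conf["inpath"] = value
        elif op == "-o":
            conf["outpath"] = value
        elif op == "-t":
            conf["total"] = int(value)
        elif op == "-a":
            conf["to"] = value.split(",")
        elif op == "-F":
            # Turn a typed escape such as \t or \x01 into the real character
            conf["insp"] = value.encode("utf-8").decode("unicode_escape")
        elif op == "-P":
            conf["outsp"] = value.encode("utf-8").decode("unicode_escape")
        elif op == "-f":
            conf["empty_fill"] = value
    # The output path falls back to the input path plus ".dist"
    if conf["outpath"] is None and "inpath" in conf:
        conf["outpath"] = conf["inpath"] + ".dist"
    return conf

# Sample arguments: the output separator is given as the escape \x01 (Ctrl-A)
conf = parse_options(["-i", "in_file", "-t", "65", "-a", "22,39,63",
                      "-P", "\\x01", "-f", "0"])
print(conf)
```

Passing the separator as an escape sequence avoids typing a literal control character on the command line.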

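Column fill configuration examples:

Basic usage [most common]

Command: ./dataformat.py -i in_file -t 65 -a "22,39,63" -F "^I" -P "^A" -f "0"

Notes:

Fields in in_file are separated by \t [-F can be omitted to use the default].
Columns 1, 2 and 3 of in_file fill columns 22, 39 and 63 of in_file.dist [the default output path].
in_file.dist has 65 columns separated by ^A; unconfigured columns are filled with 0.
The order in -a follows the source column order: with -a "39,22,63", column 1 goes to column 39, column 2 to column 22, and column 3 to column 63.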
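
The in-order fill described above can be sketched in a few lines. This is a simplified stand-in, not the script's own code: the function name fill_sequential is hypothetical, and the example uses 6 target columns with "|" as the separator instead of 65 columns with ^A so the result is easy to read.

```python
def fill_sequential(parts, total, targets, outsp="\t", empty_fill=""):
    """Place parts[0], parts[1], ... into the target columns listed in targets."""
    # Target column (1-based) -> value from the source row, in -a order
    placed = {int(col): parts[i] for i, col in enumerate(targets)}
    row = [placed.get(col, empty_fill) for col in range(1, total + 1)]
    return outsp.join(row)

# Source columns 1, 2, 3 go to target columns 2, 5, 6
line = fill_sequential(["a", "b", "c"], 6, ["2", "5", "6"], outsp="|", empty_fill="0")
# Changing the order in the list changes which source column lands where
reordered = fill_sequential(["a", "b", "c"], 6, ["5", "2", "6"], outsp="|", empty_fill="0")
print(line)       # 0|a|0|0|b|c
print(reordered)  # 0|b|0|0|a|c
```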

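Column default values: [for filling some columns with the same value without maintaining it in the source file]

Command: ./dataformat.py -i in_file -t 30 -a "3=tag_1,9,7,12=0.0" -o out_file

Notes:

in_file is separated by \t, and the output out_file is separated by \t.
Columns 1 and 2 of in_file fill columns 9 and 7 of out_file.
out_file has 30 columns; column 3 is always filled with the string "tag_1", column 12 with 0.0, and all other unconfigured columns are empty.
Note: if a default value contains an equals sign or a colon, escape it as \= or \: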
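
To make the default-value syntax concrete, here is a minimal sketch that separates the col=value entries from the plain column numbers and builds one row. The helper name split_defaults is an assumption for this example, the row is shortened to 12 columns instead of 30, and the \= and \: escaping mentioned above is deliberately left out.

```python
def split_defaults(spec):
    """Separate 'col=value' defaults from plain column numbers in an -a string."""
    defaults, columns = {}, []
    for item in spec.split(","):
        if "=" in item:
            col, value = item.split("=", 1)
            defaults[int(col)] = value
        else:
            columns.append(int(item))
    return defaults, columns

defaults, columns = split_defaults("3=tag_1,9,7,12=0.0")

# Build one 12-column row from a two-field source line
source = ["x", "y"]
row = []
for col in range(1, 13):
    if col in defaults:
        row.append(defaults[col])                # fixed value for this column
    elif col in columns:
        row.append(source[columns.index(col)])   # n-th plain entry takes source column n
    else:
        row.append("")                           # unconfigured column stays empty
print("\t".join(row))
```

Source column 1 ("x") lands in column 9 and source column 2 ("y") in column 7, while columns 3 and 12 carry their defaults on every row.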

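Out-of-order column-to-column mapping:

Command: ./dataformat.py -i in_file -t 56 -a "3:2,9,5:3,1=abc,11"

Notes:

Separators, input and output are the same as above.
The number before the colon is the output file column; the number after it is the input file column.
Target column 3 is filled from input column 2, and target column 5 from input column 3.
Target column 1 is always filled with "abc".
Target column 9 is filled from input column 1 and target column 11 from input column 4 [no mapping configured, so each takes the lowest input column not used yet].
The script only runs simple checks, such as the field count, on the mapping logic; for complex cases it is best to configure every mapping explicitly, because relying on the defaults is hard to follow.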
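
The rule for entries without an explicit mapping is easier to see in code. The sketch below resolves the -a list from this example under that rule: input columns named in target:source pairs are reserved first, then each plain entry takes the lowest input column still free. The function name build_map and the ("src", n) / ("default", text) tuples are illustrative choices, not the script's internal format.

```python
def build_map(items):
    """Return {target column: ("src", n) or ("default", text)} for an -a list."""
    # Input columns claimed by explicit target:source pairs are reserved first
    used = {int(it.split(":")[1]) for it in items if ":" in it and "=" not in it}
    mapping = {}
    for it in items:
        if "=" in it:
            col, text = it.split("=", 1)
            mapping[int(col)] = ("default", text)
        elif ":" in it:
            col, src = it.split(":")
            mapping[int(col)] = ("src", int(src))
        else:
            # Plain entry: lowest input column not used so far
            src = 1
            while src in used:
                src += 1
            used.add(src)
            mapping[int(it)] = ("src", src)
    return mapping

mapping = build_map("3:2,9,5:3,1=abc,11".split(","))
for target in sorted(mapping):
    print(target, mapping[target])
```

Printing the resolved mapping like this before a large run is one way to check that the implicit assignments match what was intended.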
