
【Detailed Walkthrough】Handling Large Data Sets in Python and Optimizing dict Traversal

2016-01-05 13:48

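Preface: the requirement in this example is a script that runs every day at 00:00. The script pulls data from a database that is updated in real time.

Each day it produces an Excel sheet showing how the data at today's midnight differs from the data at yesterday's midnight.

The requirement itself is simple; the real difficulty is the volume of data. Each run pulls more than 200,000 rows from the database.

That volume causes two problems:

1. The data does not fit in an Excel sheet. The .xls format that xlwt writes holds at most 65,536 rows per sheet, and writing more raises an error;

2. Traversal and filtering. We compare the two days of data and generate a diff table, which means walking through both data sets. With this much data, the traversal takes too long.

We discuss these two key problems in turn.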

【Problem 1: Switch from an Excel Sheet to a CSV File】

A CSV file has no row limit, so all 200,000-plus rows can be written straight into it.
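As a rough illustration of the point above, the following Python 3 sketch writes more rows than an .xls sheet can hold into CSV output. The function name `write_rows` and the sample data are made up for this example, and an in-memory buffer stands in for a real file.

```python
import csv
import io

XLS_MAX_ROWS = 65536  # per-sheet row limit of the legacy .xls format


def write_rows(rows, stream):
    """Write every (key, value) row to a CSV stream and return the row count."""
    writer = csv.writer(stream)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical data: 200,000 (member_id, balance) pairs.
sample_rows = ((member_id, member_id * 0.5) for member_id in range(200000))
buffer = io.StringIO()  # stands in for a real file opened with newline=""
written = write_rows(sample_rows, buffer)
print(written > XLS_MAX_ROWS)
```

With a real file, the stream would be opened with `open(path, "w", newline="")` instead of using `io.StringIO`.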

【Problem 2: Traversing dict Data】

Following our old habits, the code that builds the comparison dict looked like this:

def getCurrentCompareMessageDict0(dict0, dict1):
    '''Unoptimized: build the dict of differences between dict0 (yesterday) and dict1 (today).'''
    dlist0 = list(dict0.keys())
    dlist1 = list(dict1.keys())
    dict2 = {}
    for i in range(len(dlist1)):
        # "not in" on a list scans the whole list, once for every key
        if dlist1[i] not in dlist0:
            key = dlist1[i]
            value = [0, dict1[dlist1[i]]]
            dict2[key] = value
        else:
            if dict1[dlist1[i]]/100.0 != dict0[dlist1[i]]:
                key = dlist1[i]
                value = [dict0[dlist1[i]], dict1[dlist1[i]]]
                dict2[key] = value
    return dict2

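That is, we first build the key lists of the two dicts.

Then a for loop runs up to the length of the key list and uses DICT[KEY] lookups to filter the data.

This approach is extremely slow.

After some study we improved it as follows: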

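def getCurrentCompareMessageDict(dict0, dict1):
    '''Optimized: build the dict of differences between dict0 (yesterday) and dict1 (today).'''
    # Python 2 code: needs "import string"; has_key() and string.atof() do not exist in Python 3
    dict2 = {}
    for d, x in dict1.items():
        if dict0.has_key(str(d)):
            if x/100.0 != string.atof(dict0[str(d)]):
                key = d
                value = [string.atof(dict0[str(d)]), x]
                dict2[key] = value
        else:
            # key is missing from yesterday's data
            key = d
            value = [0, x]
            dict2[key] = value
    return dict2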

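With this version, comparing and filtering two data sets of more than 200,000 rows each finished within 1 second.

In our tests, the optimized version was roughly 400 times faster!

What exactly was optimized?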

First, the dict is now traversed with

for d, x in dict1.items():

where d is the key and x is the value. It can also be written as

for (d, x) in dict1.items():

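Material we found online claims that the parenthesized form is faster for fewer than 200 iterations and the unparenthesized form is faster for more than 200 (reference link: "A comparison of two traversal styles in Python"). We did not test this. The parentheses only group the loop targets and both spellings compile to the same code, so no difference should be expected.

We used the form without parentheses.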
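One way to check whether the two spellings can differ at all is to compare how Python parses them. This is a small Python 3 sketch; the variable name `pairs` in the source strings is arbitrary and never executed.

```python
import ast

plain = "for d, x in pairs.items():\n    pass\n"
parenthesized = "for (d, x) in pairs.items():\n    pass\n"

# ast.dump() omits line/column positions by default, so this compares structure only.
same_tree = ast.dump(ast.parse(plain)) == ast.dump(ast.parse(parenthesized))
print(same_tree)
```

If the two trees are equal, the interpreter runs the same loop in both cases, and any measured difference would be timing noise.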

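Second, the test for whether a key exists in the dict became

if dict0.has_key(str(d)):

has_key returns a boolean, True or False. It exists only in Python 2; the Python 3 equivalent is: str(d) in dict0.

The original test:

if dlist1[i] not in dlist0:

is discarded!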
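The cost of the discarded test can be seen in a small timing sketch (Python 3). The sizes and names here are made up for illustration, and the list variant is kept small so it finishes quickly; the absolute timings will vary by machine.

```python
import timeit

n = 2000  # kept small so the slow variant finishes quickly
old = {str(k): k for k in range(n)}            # stands in for yesterday's dict
new_keys = [str(k) for k in range(n, 2 * n)]   # keys that are all missing from old
old_key_list = list(old.keys())


def missing_via_list():
    # Each "not in" scans old_key_list from the start: O(n) per lookup.
    return sum(1 for k in new_keys if k not in old_key_list)


def missing_via_dict():
    # Each "not in" is a hash lookup on the dict: O(1) on average.
    return sum(1 for k in new_keys if k not in old)


print("list:", timeit.timeit(missing_via_list, number=1))
print("dict:", timeit.timeit(missing_via_dict, number=1))
```

Both functions return the same count; only the lookup structure differs. The gap widens as n grows, because the list version does on the order of n * n comparisons.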

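Only two steps were made more efficient: the traversal and the membership test. As for which of the two mattered, both helped, but the membership test dominates: "not in dlist0" scans a list of 200,000-plus keys for every key in dict1, whereas has_key is a single hash lookup.

The two steps are not separate in the code; they are tied together.

Only with the for d, x in dict.items() traversal can d and x be used directly, without a further lookup.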
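For readers on Python 3, the same comparison can be sketched with the `in` operator and `float()`. This is an illustrative rewrite, not the article's code: the name `diff_balances` is made up, and it assumes, as in the article's data, that yesterday's dict maps string ids to amount strings read from CSV while today's dict maps integer ids to raw values that are 100 times the amount.

```python
def diff_balances(previous, current):
    """Return {member_id: [previous_amount, current_raw_value]} for changed or new members.

    Assumptions: `previous` maps str ids to amount strings (from CSV);
    `current` maps int ids to raw integers that are 100 times the amount.
    """
    changed = {}
    for member_id, raw in current.items():
        key = str(member_id)
        if key in previous:                      # hash lookup, no list scan
            before = float(previous[key])
            if raw / 100.0 != before:
                changed[member_id] = [before, raw]
        else:
            changed[member_id] = [0, raw]        # member did not exist yesterday
    return changed


previous = {"1": "10.5", "2": "3.0"}
current = {1: 1050, 2: 400, 3: 700}
print(diff_balances(previous, current))
```

Member 1 is unchanged and is left out; member 2 changed and member 3 is new, so both appear in the result.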

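Those are the two key problems. A few other issues came up along the way and are worth describing:

1. Checking in Python whether two lists contain exactly the same elements.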

>>> a = [(1,1),(2,2),(3,3),(4,4)]
>>> b = [(4,4),(1,1),(2,2),(3,3)]

>>> a.sort()
>>> b.sort()

>>> a==b
True

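That is, sort first and then compare. This checks only that the elements match, ignoring order. Note that sort() reorders both lists in place.

Reference link: "Checking in Python whether two arrays contain exactly the same elements"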
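If the original order of the lists has to be preserved, two alternatives are sketched below (Python 3): `sorted()`, which returns new lists, and `collections.Counter`, which compares element counts without sorting.

```python
from collections import Counter

a = [(1, 1), (2, 2), (3, 3), (4, 4)]
b = [(4, 4), (1, 1), (2, 2), (3, 3)]

# sorted() returns new lists, so a and b keep their original order.
same_sorted = sorted(a) == sorted(b)

# Counter compares element counts; elements must be hashable but need not be orderable.
same_counts = Counter(a) == Counter(b)

print(same_sorted, same_counts)
```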

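2. How do you convert a string to a number in Python?

The final code uses

string.atof(float string)

string.atoi(integer string)

Note: this requires

import string

These two functions exist only in Python 2. The built-ins float() and int() perform the same conversions in both Python 2 and Python 3.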
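A short sketch of the built-in conversions, with a guard for fields that are not numeric. The helper name `to_amount` is made up for this example.

```python
def to_amount(text):
    """Convert a CSV field to float; return None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


print(to_amount("12.50"), int("42"), to_amount("n/a"))
```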

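3. Reading a CSV file

Until now we had only written CSV files. Here we need to read one and load its data into a dict for convenient use.

The method is as follows: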

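def getHandleDataDict(fileName):
    '''Load yesterday's midnight data into a dict.'''
    dict = {}
    csvfile = file(fileName, 'rb')  # file() is a Python 2 built-in
    reader = csv.reader(csvfile)
    for i in reader:
        key = i[0]    # first column
        value = i[1]  # second column, still a string
        dict[key] = value
    csvfile.close()
    return dict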

The two key lines are:

csvfile=file(fileName, 'rb')
reader=csv.reader(csvfile)

for i in reader:

Each i is one row of data that goes into the dict. Each i is a list: i[0] is the key and i[1] is the value.
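In Python 3 the `file()` built-in is gone; a sketch of the same idea with `open()` and a `with` block follows. The function name `load_balances` and the sample file are made up for this demonstration.

```python
import csv
import os
import tempfile


def load_balances(path):
    """Read a two-column CSV into {first_column: second_column} (both str)."""
    balances = {}
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            balances[row[0]] = row[1]
    return balances


# Hypothetical sample file, created only for this demonstration.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "balance_data.csv")
    with open(sample_path, "w", newline="") as handle:
        csv.writer(handle).writerows([[1001, 12.5], [1002, 0.75]])
    loaded = load_balances(sample_path)

print(loaded)
```

Both keys and values come back as strings, which matters for the KeyError issue described next.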

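4. Python's KeyError

This is not the first time we have run into this error; we single it out here to stress its importance.

KeyError means the dict does not contain the key. If we fetch the value with dict[key] in that situation, a KeyError is raised.

The cause may be a key of the wrong data type, or the key may simply be absent. Both cases need to be considered.

In this example we hit the wrong-data-type case, which is why issue 2 above, converting strings to numbers, came up.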
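The type-mismatch case can be reproduced in a few lines (Python 3). The ids and amounts below are invented sample values.

```python
from_csv = {"1001": "12.5"}   # keys read from CSV are strings
member_id = 1001              # ids coming from the database are integers

try:
    from_csv[member_id]
    outcome = "found"
except KeyError:
    outcome = "KeyError"      # 1001 and "1001" are different keys

fixed = from_csv[str(member_id)]        # convert the key type first
fallback = from_csv.get("9999", "0")    # genuinely absent key: use a default

print(outcome, fixed, fallback)
```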

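【Thoughts on How the Script Was Written】

There is also the question of how to approach writing the script. First, here is the final version of the code.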

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Python 2 script (print statement, file(), has_key(), string.atof())

__author__ = "$Author: wangxin.xie$"
__version__ = "$Revision: 1.0 $"
__date__ = "$Date: 2015-01-05 10:01$"

###############################################################
# Purpose: compare balances at today's 00:00 with yesterday's 00:00; runs daily at 00:00
###############################################################
import sys
import datetime
import xlwt
import csv
import string
from myyutil.DBUtil import DBUtil

####################### Global variables ####################################

memberDBUtil = DBUtil('moyoyo_member')

today = datetime.datetime.today()
todayStr = datetime.datetime.strftime(today, "%Y-%m-%d")
handleDate = today - datetime.timedelta(1)
handleDateStr = datetime.datetime.strftime(handleDate, "%Y-%m-%d")

fileDir = 'D://'
handleCsvFileName = fileDir+handleDateStr+'_balance_data.csv'
currentCsvfileName = fileDir+todayStr+'_balance_data.csv'
currentexcelFileName = fileDir+todayStr+'_balance_compare_message.xls'

style1 = xlwt.XFStyle()
font1 = xlwt.Font()
font1.height = 220
font1.name = 'SimSun'
style1.font = font1

csvfile1 = file(currentCsvfileName, 'wb')
writer1 = csv.writer(csvfile1, dialect='excel')
##################################################################
def genCurrentBalanceData():
    '''Fetch the current balance data.'''
    sql = '''
        SELECT MEMBER_ID,
               (TEMP_BALANCE_AMOUNT + TEMP_FROZEN_AMOUNT)
        FROM moyoyo_member.MONEY_INFO
        WHERE (TEMP_BALANCE_AMOUNT + TEMP_FROZEN_AMOUNT) != 0
    '''
    rs = memberDBUtil.queryList(sql, ())
    if not rs: return []  # empty list, so callers can still take len() of it
    return rs

def getCurrentDataDict(rs):
    '''Assemble the current data into a dict.'''
    dict = {}
    for i in range(len(rs)):
        key = rs[i][0]
        value = rs[i][1]
        dict[key] = value
    return dict

def writeCsv(x, writer):
    '''Write one row to the CSV file.'''
    writer.writerow(x)

def writeCurrentCsvFile():
    '''Write the CSV file holding the current data.'''
    rs = genCurrentBalanceData()
    dict = getCurrentDataDict(rs)
    for d, x in dict.items():
        writeCsv([d, x/100.0], writer1)
    csvfile1.close()
    return dict

def getHandleDataDict(fileName):
    '''Load yesterday's midnight data into a dict.'''
    dict = {}
    csvfile = file(fileName, 'rb')
    reader = csv.reader(csvfile)
    for i in reader:
        key = i[0]
        value = i[1]
        dict[key] = value
    csvfile.close()
    return dict

def getCurrentCompareMessageDict(dict0, dict1):
    '''Build the dict of differences between dict0 (yesterday) and dict1 (today).'''
    dict2 = {}
    for d, x in dict1.items():
        if dict0.has_key(str(d)):
            if x/100.0 != string.atof(dict0[str(d)]):
                key = d
                value = [string.atof(dict0[str(d)]), x]
                dict2[key] = value
        else:
            key = d
            value = [0, x]
            dict2[key] = value
    return dict2

def writeExcelHeader():
    '''Write the header row of the Excel sheet.'''
    wb = xlwt.Workbook(encoding = "UTF-8", style_compression = True)
    # sheet name: "balance comparison list"
    sht0 = wb.add_sheet("余额信息对比列表", cell_overwrite_ok = True)
    sht0.col(0).width = 3000
    sht0.col(1).width = 4000
    sht0.col(2).width = 4000
    # headers: user ID, balance at 00:00 on day N-1, balance at 00:00 on day N
    sht0.write(0, 0, '用户ID', style1)
    # use handleDate.day: today.day - 1 would be 0 on the first day of a month
    sht0.write(0, 1, str(handleDate.day)+'日零点余额', style1)
    sht0.write(0, 2, str(today.day)+'日零点余额', style1)
    return wb

def writeCurrentCompareMessageInfo(sht, dict):
    '''Write the comparison rows.'''
    dlist = list(dict.keys())
    for i in range(len(dlist)):
        sht.write(i+1, 0, dlist[i], style1)
        sht.write(i+1, 1, dict[dlist[i]][0], style1)
        sht.write(i+1, 2, dict[dlist[i]][1]/100.0, style1)

def writeCurrentCompareMessageExcel(dict):
    '''Write the Excel sheet of comparison data.'''
    wb = writeExcelHeader()
    sheet0 = wb.get_sheet(0)
    writeCurrentCompareMessageInfo(sheet0, dict)
    wb.save(currentexcelFileName)

def main():
    print "===%s start===%s"%(sys.argv[0], datetime.datetime.strftime(datetime.datetime.now(), "%Y-%m-%d %H:%M:%S"))
    # fetch the midnight snapshot first, before any other work
    currentDataDict = writeCurrentCsvFile()
    handleDataDict = getHandleDataDict(handleCsvFileName)
    currentCompareMessageDict = getCurrentCompareMessageDict(handleDataDict, currentDataDict)
    writeCurrentCompareMessageExcel(currentCompareMessageDict)
    print "===%s end===%s"%(sys.argv[0], datetime.datetime.strftime(datetime.datetime.now(), "%Y-%m-%d %H:%M:%S"))

if __name__ == '__main__':
    try:
        main()
    finally:
        if memberDBUtil: memberDBUtil.close()

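We bring up the thinking behind the script

because, while writing it, we did not pay enough attention to many points that mattered.

This is especially true of the workflow: what to do first and what to do next, how to handle the data once we have it, and whether any step can be dropped.

All of these need attention when writing each function.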

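Idea 1: Let the script's scheduled run time guide its structure

The requirement says the script must take its data at midnight every day, and the data in the database changes in real time.

Since the data must be taken at 00:00, the function that fetches it has to come first.

In other words, the order of the functions in a script is closely tied to the time at which the script is required to run.

Why the script is scheduled for 00:00, and what has to happen at 00:00, deserve careful thought,

because what is ultimately at stake is the accuracy of the data.

Suppose we ran something else first, such as reading yesterday's midnight CSV file,

and only started fetching the data 20-odd seconds later. The data fetched then would no longer be the midnight data.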

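Idea 2: Do not repeat work.

Let us look at how the data flows in this example.

Dict0--------yesterday's midnight data, stored in a CSV file.

Dict1--------the data the script fetches from the database when it runs at today's midnight. It is written to a CSV file first.

Dict2--------the dict of differences filtered out by comparing yesterday's data with the freshly fetched data.

What needs attention is how the code builds Dict2. Dict0 naturally has to be read from its file; Dict1 already exists in the code as Dict1 and can simply be returned.

Earlier, however, we made a mistake: we read the Dict1 data back from the CSV file we had just generated.

That is unnecessary. The data is already in memory, so there is no need to extract it from the file.

Doing so needlessly lengthens the script's run time. It is a basic lapse in logic.

Hence this function in the final version of the code: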

def writeCurrentCsvFile():
    '''Write the CSV file holding the current data.'''
    rs = genCurrentBalanceData()
    dict = getCurrentDataDict(rs)
    for d, x in dict.items():
        writeCsv([d, x/100.0], writer1)
    csvfile1.close()
    return dict

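After the CSV file is written, the dict that was used is returned directly, because it is needed later.

The generated CSV file exists only to be compared against tomorrow's data.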
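The same write-then-return pattern can be sketched in Python 3 as follows. The function name `write_snapshot` and the sample rows are assumptions for illustration, and an in-memory buffer stands in for today's CSV file.

```python
import csv
import io


def write_snapshot(rows, stream):
    """Write (member_id, raw_value) rows as CSV and return them as a dict.

    The returned dict feeds the comparison step directly; the CSV output is
    kept only for the next day's run.
    """
    snapshot = dict(rows)
    writer = csv.writer(stream)
    for member_id, raw in snapshot.items():
        writer.writerow([member_id, raw / 100.0])
    return snapshot


# Hypothetical rows standing in for the database result.
rows = [(1001, 1250), (1002, 75)]
stream = io.StringIO()  # stands in for today's CSV file
snapshot = write_snapshot(rows, stream)
print(snapshot)
```

The caller keeps using `snapshot` and never reads the freshly written CSV back in.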

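Idea 3: What the data is produced for.

Having made the mistake above, we can reflect on the purpose of the data, and of the files.

What do we build each dict for? Bear in mind that a piece of data may serve more than one purpose.

csv0 exists to provide dict0, and dict0 exists to be compared with today's data.

dict1 exists to generate the CSV for tomorrow and to generate today's dict2.

In other words, csv1 does not exist for the sake of dict1 at all. It is only preparation for tomorrow.

Once this is clear, we will not make the foolish mistake of reading dict1 back from csv1.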
