
Python memory management

2016-05-09 14:50
Continuously updated.

1. Manually releasing memory

import gc

del obj_name   # drop the reference held by this name (obj_name must already be bound)
gc.collect()   # the function is collect(), not collection()
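A minimal sketch of why both steps matter: `del` only removes a name's reference, so objects caught in a reference cycle stay alive until `gc.collect()` runs the cycle detector.

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

# Build a reference cycle: each object keeps the other alive
a, b = Node(), Node()
a.ref, b.ref = b, a

del a, b                    # names are gone, but the cycle still holds both objects
unreachable = gc.collect()  # returns the number of unreachable objects found and freed
print(unreachable)          # > 0: the cycle was detected and reclaimed
```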


2. pandas read_csv() tips

While taking part in a competition recently, I was surprised to find that a plain pd.read_csv(filename) can use far more memory than the size of the file itself.

Solution one: specify column dtypes

import pandas as pd
import numpy as np

test = pd.read_csv("test_data.csv",
                   names=['User_id', 'Location_id', 'Merchant_id'],
                   header=None,
                   dtype={'User_id': np.uint32,
                          'Location_id': np.uint32,
                          'Merchant_id': np.uint32})
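To see the effect, compare the memory footprint with and without explicit dtypes. This is a sketch using a small in-memory sample (the `csv_data` string is a hypothetical stand-in for test_data.csv): pandas infers int64 (8 bytes per value) by default, while uint32 halves the numeric storage.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical three-column sample standing in for test_data.csv
csv_data = "1,10,100\n2,20,200\n3,30,300\n" * 1000
cols = ['User_id', 'Location_id', 'Merchant_id']

# Default parse: pandas infers int64 for these columns
df_default = pd.read_csv(io.StringIO(csv_data), names=cols, header=None)

# Explicit uint32 dtypes: 4 bytes per value instead of 8
df_typed = pd.read_csv(io.StringIO(csv_data), names=cols, header=None,
                       dtype={c: np.uint32 for c in cols})

print(df_default.memory_usage(deep=True).sum())
print(df_typed.memory_usage(deep=True).sum())
```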


Solution two

Whichever operating system you are on, make sure Python and all of its related libraries are 64-bit builds.
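A quick way to check which build you are running: the pointer size of the interpreter is 64 bits on a 64-bit build and 32 on a 32-bit one.

```python
import struct
import platform

# Pointer size in bits: 64 on a 64-bit interpreter, 32 on a 32-bit one
bits = struct.calcsize("P") * 8
print(bits, platform.architecture()[0])
```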

Solution three: for cases the first two methods cannot handle; applicable when the task can be split into chunks

import pandas as pd
import numpy as np
import subprocess

columns = ['User_id', 'Seller_id', 'Item_id', 'Category_id', 'Online_Action_id', 'Time_Stamp']
dtypes = {'User_id': np.uint32, 'Seller_id': np.uint32, 'Item_id': np.uint32,
          'Category_id': np.uint16, 'Online_Action_id': np.uint8, 'Time_Stamp': np.uint32}

# On Linux, get the number of lines in the CSV file:
# nlines = subprocess.check_output('wc -l %s' % 'ijcai2016_taobao', shell=True)
# nlines = int(nlines.split()[0])

# On Windows (cmd): find /v /c "" ijcai2016_taobao

# The line count can also be obtained via SQL
nlines = 44528127

chunksize = 1000

with open('pre_taobao.csv', 'w') as out:  # open() takes the path positionally, not name=
    for start in range(0, nlines, chunksize):
        df = pd.read_csv("ijcai2016_taobao",
                         header=None,     # no header row; column names are set manually below
                         dtype=dtypes,    # reuse the compact dtypes from solution one
                         nrows=chunksize, # number of rows to read at each iteration
                         skiprows=start)  # skip the rows that were already read
        df.columns = columns
        for row in range(len(df)):        # renamed so it no longer shadows the chunk index
            if df['Time_Stamp'][row] < 20151101:
                str_tmp = ','.join(str(df[c][row]) for c in columns)
                out.write(str_tmp)
                out.write('\n')
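An alternative to the manual skiprows loop: read_csv itself accepts a chunksize parameter and then returns an iterator of DataFrames, so the file is scanned once instead of being re-skipped from the top on every iteration. A sketch under the same assumed schema (the in-memory `csv_data` stands in for the real ijcai2016_taobao file):

```python
import io
import numpy as np
import pandas as pd

cols = ['User_id', 'Seller_id', 'Item_id', 'Category_id', 'Online_Action_id', 'Time_Stamp']
dtypes = {'User_id': np.uint32, 'Seller_id': np.uint32, 'Item_id': np.uint32,
          'Category_id': np.uint16, 'Online_Action_id': np.uint8, 'Time_Stamp': np.uint32}

# Hypothetical stand-in for ijcai2016_taobao; real code would pass the filename instead
csv_data = "1,2,3,4,0,20151001\n5,6,7,8,1,20151201\n" * 500

kept = 0
for chunk in pd.read_csv(io.StringIO(csv_data), names=cols, header=None,
                         dtype=dtypes, chunksize=100):
    before = chunk[chunk['Time_Stamp'] < 20151101]  # same pre-November filter as above
    kept += len(before)
    # before.to_csv('pre_taobao.csv', mode='a', header=False, index=False)

print(kept)
```

Filtering each chunk with a vectorized comparison also avoids the slow row-by-row Python loop.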