
Python Algorithms: A Detailed Implementation of Replacement-Selection Sort

2013-05-10 15:23
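As mentioned in the article on loser trees, merging many small files in a single pass greatly reduces the number of file reads and writes, which cuts I/O time and makes the sort more efficient. But what if we could also reduce the number of small files produced by the split? When the small files cannot all be merged in one pass, producing fewer of them is another way to speed up sorting a large file.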

That is exactly what this article introduces: replacement-selection sort.

The process is as follows:

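Suppose the in-memory work area can hold at most n records. Read n records from the large file into the work area. Select the record with the smallest key, mark it as lastSmall, and output it to a file (or put it in a buffer first and write the buffer to the file once it reaches a certain size). Then read the next record from the large file into the work area, select the smallest record whose key is not less than lastSmall's, output it to the file (or buffer) that lastSmall went to, and make it the new lastSmall. Repeat this step. When the work area no longer holds any record whose key is not less than lastSmall's, one split is complete and one small file has been generated. Repeat the above until the end of the large file is reached.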
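
The steps above can be sketched on an in-memory list before any file handling or loser tree is involved. This is only an illustration under assumptions: the function name `replacement_selection`, the linear scan over the work area, and the use of plain comparable values as records are not from the original code, which works on lines of a file.

```python
from itertools import islice

_EMPTY = object()  # hypothetical marker meaning "the input is exhausted"

def replacement_selection(records, n):
    """Split `records` into sorted runs using a work area of n records."""
    source = iter(records)
    # Load the first n records into the work area.
    work_area = list(islice(source, n))
    runs = []
    current_run = []
    last_small = _EMPTY
    while work_area:
        # Only records not smaller than last_small can extend the current run.
        candidates = [r for r in work_area
                      if last_small is _EMPTY or r >= last_small]
        if not candidates:
            # Nothing fits any more: close this run and start a new one.
            runs.append(current_run)
            current_run = []
            last_small = _EMPTY
            continue
        last_small = min(candidates)
        current_run.append(last_small)
        work_area.remove(last_small)
        # Refill the freed slot with the next input record, if there is one.
        nxt = next(source, _EMPTY)
        if nxt is not _EMPTY:
            work_area.append(nxt)
    if current_run:
        runs.append(current_run)
    return runs

print(replacement_selection([5, 3, 8, 1, 9, 2], 3))
```

With a work area of 3 records, the six inputs come out as two runs, the first of which is longer than the work area itself. The linear scan costs O(n) per record, which is why the article replaces it with a loser tree.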

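Selecting the record with the smallest key in the work area can be done with a loser tree. The comparison just needs to be redefined: besides the key, each comparison also looks at a run number (rowNum in the code below) that takes priority over the key. The record with the larger run number is the larger one; when the run numbers are equal, the record with the larger key is the larger one. So when the key of the newly selected minimum record is less than lastSmall's key, that record's run number is incremented by one and the loser tree is adjusted again.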
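
The (run number, key) ordering can be tried out without writing a loser tree at all. The sketch below swaps in Python's standard `heapq` module as a stand-in for the loser tree, since tuples already compare the run number first and the key second. The names `split_into_runs` and `_END` are hypothetical and the records are plain comparable values, so treat it as an illustration of the comparison rule rather than the implementation that follows.

```python
import heapq
from itertools import islice

_END = object()  # hypothetical marker meaning "the input is exhausted"

def split_into_runs(records, n):
    """Replacement selection driven by (run number, key) tuples."""
    source = iter(records)
    # Every record starts in run 1; tuples compare run number, then key.
    heap = [(1, rec) for rec in islice(source, n)]
    heapq.heapify(heap)
    runs = {}
    current_run = 1
    last_small = _END
    while heap:
        run, key = heap[0]
        if run == current_run and last_small is not _END and key < last_small:
            # Too small for the current run: bump its run number, re-select.
            heapq.heapreplace(heap, (run + 1, key))
            continue
        if run > current_run:
            # The smallest record belongs to the next run: a new run starts.
            current_run = run
        heapq.heappop(heap)
        runs.setdefault(run, []).append(key)
        last_small = key
        # New records enter with the current run number, as in the article.
        nxt = next(source, _END)
        if nxt is not _END:
            heapq.heappush(heap, (current_run, nxt))
    return [runs[r] for r in sorted(runs)]

print(split_into_runs([5, 3, 8, 1, 9, 2], 3))
```

A record that is too small is not rejected when it is read; it is only pushed into the next run at the moment it would otherwise be selected, which is the rule described above.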

Complete code:

#!/usr/bin/python
# Filename: ReplaceSelection.py

#---------------------------------Data Struct----------------------------------
class RSNode:
    '''The struct of the Replace_Selection method'''
    def __init__(self, rowNum, value):
        self.rowNum = rowNum  # Run (segment) number the record belongs to.
        self.value = value    # The record itself, which is also its sort key.
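#---------------------------------Loser Tree-----------------------------------

def createLoserTree(loserTree, dataArray, n):
    # Fill the tree with placeholder records. Run number 0 ranks them below
    # every real record (real runs start at 1), so a placeholder is never
    # compared by value with a line of text. The increasing values i-n make
    # leaf 0, then leaf 1, ... the winner in turn.
    for i in range(n):
        loserTree.append(0)
        dataArray.append(RSNode(0, i-n))

    for i in range(n):
        adjust(loserTree, dataArray, n, n-1-i)

def adjust(loserTree, dataArray, n, s):
    # Replay the matches from leaf s up to the root. s always holds the
    # winner so far and each loserTree[t] keeps the loser of that match.
    t = (s + n) // 2  # Integer division gives the parent of leaf s.
    while t > 0:
        # rowNum has a higher Priority than value.
        if dataArray[s].rowNum > dataArray[loserTree[t]].rowNum:
            s, loserTree[t] = loserTree[t], s
        elif dataArray[s].rowNum == dataArray[loserTree[t]].rowNum and dataArray[s].value > dataArray[loserTree[t]].value:
            s, loserTree[t] = loserTree[t], s
        t //= 2
    loserTree[0] = s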
#-------------------------------------Use---------------------------------------

from time import ctime

# A method to write file.
def writeFile(tarDir, tmp):
    # Append mode, because one small file may be written in several batches.
    with open(tarDir, 'a') as file_writer:
        file_writer.writelines(tmp)
    # Clear the array tmp in place, so the caller's list is emptied too.
    del tmp[:]

def splitFile(fileLocation, tarDirectory, n):
    file_reader = open(fileLocation, 'r')
    loserTree = []
    dataArray = []
    n = int(n)
    END = float('inf')  # Run number of the end marker; it loses to every real record.
    createLoserTree(loserTree, dataArray, n)
    line = file_reader.readline()
    # First, read file, fill the data array with front items of the file.
    for i in range(n):
        dataArray[i] = RSNode(1, line)
        # Adjust the loser tree after every change of the data array.
        adjust(loserTree, dataArray, n, i)
        line = file_reader.readline()
    lastRowNum = 1  # Number of the current run; also used to name the new little files.
    lastSmall = dataArray[loserTree[0]]  # The record most recently moved to the output.
    tmp = [lastSmall.value]  # A temporary array that stores sorted lines before writing.
    if line:
        dataArray[loserTree[0]] = RSNode(lastRowNum, line)
    while line:
        # Write tmp into file if its size reaches the maximum we defined.
        if len(tmp) == n:
            writeFile(tarDirectory + 'file' + str(lastRowNum) + '.txt', tmp)

        # Adjust the loser tree after every change of the data array.
        adjust(loserTree, dataArray, n, loserTree[0])

        if dataArray[loserTree[0]].rowNum > lastRowNum:
            # Finish one trip of search and finish one file.
            writeFile(tarDirectory + 'file' + str(lastRowNum) + '.txt', tmp)
            lastRowNum += 1
        elif dataArray[loserTree[0]].value < lastSmall.value:
            # Too small for the current run: rowNum + 1 and return to adjust.
            dataArray[loserTree[0]].rowNum += 1
            continue

        # Can add the winner into tmp (equal keys stay in the current run).
        lastSmall = dataArray[loserTree[0]]
        tmp.append(lastSmall.value)
        line = file_reader.readline()
        if line:  # Not yet at the end of the file.
            dataArray[loserTree[0]] = RSNode(lastRowNum, line)

    # Don't forget to write the items left in the loser tree into the file.
    # From now on every record that leaves the tree is replaced by an end marker.
    dataArray[loserTree[0]] = RSNode(END, '')
    while True:  # This loop is almost like the one above.
        if len(tmp) == n:
            writeFile(tarDirectory + 'file' + str(lastRowNum) + '.txt', tmp)

        adjust(loserTree, dataArray, n, loserTree[0])
        if dataArray[loserTree[0]].rowNum == END:
            break  # Only end markers are left in the tree.

        if dataArray[loserTree[0]].rowNum > lastRowNum:
            writeFile(tarDirectory + 'file' + str(lastRowNum) + '.txt', tmp)
            lastRowNum += 1
        elif dataArray[loserTree[0]].value < lastSmall.value:
            dataArray[loserTree[0]].rowNum += 1
            continue

        lastSmall = dataArray[loserTree[0]]
        tmp.append(lastSmall.value)
        dataArray[loserTree[0]] = RSNode(END, '')

    # And don't forget tmp. If tmp is not empty, write it into file.
    if tmp:
        writeFile(tarDirectory + 'file' + str(lastRowNum) + '.txt', tmp)

    file_reader.close()
#----------------------------Test-------------------------------------
if __name__ == '__main__':
    import sys
    from time import ctime
    try:
        fileLocation = sys.argv[1]
        tarDir = sys.argv[2]
        n = sys.argv[3]
    except IndexError:
        print('Wrong Arguments!')
        print('''You need 3 parameters in total.
1. The path of your file.
2. The path of the target files.
3. The size of the LoserTree.

You should do like this:
python ReplaceSelection.py /root/hehe.txt /root/hehe/ 6''')
        sys.exit()
    timeNow = ctime()
    print('Now the time is ' + str(timeNow) + ',')
    print('and the work is coming, please wait...')
    splitFile(fileLocation, tarDir, n)
    print('Work Over!')
    print('Now the time is ' + str(ctime()))