
Word-Splitting Statistics in Python

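2016-01-25 22:38
Python's slicing makes text processing very convenient. Below is a simple example that splits a text into words and counts how often each n-letter combination occurs.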
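
As a quick illustration of the slicing idea, the sketch below lists every n-letter slice of a single word. The helper name `ngrams` is hypothetical and is not part of the script in this post.

```python
def ngrams(word, n):
    """Return every contiguous n-letter slice of word, left to right."""
    # range() is empty when the word is shorter than n, so the result is []
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(ngrams("python", 2))  # ['py', 'yt', 'th', 'ho', 'on']
print(ngrams("hi", 3))      # []
```

Because a word shorter than n simply yields an empty list, no special length check is needed before slicing.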

(The input file must be UTF-8 encoded; the script was run on Windows with Python 3.)

# -*- coding: utf-8 -*-
import re
import os

Total = 0   # total number of letters
words = []  # every word found in the file, lowercased

# Collect all words
with open('Data.txt', encoding='utf-8') as readfile:
    for line in readfile:
        for word in line.strip().split():
            # Keep only runs of ASCII letters; '+' avoids the empty
            # matches that '*' would return
            for w in re.findall(r'[a-zA-Z]+', word):
                words.append(w.lower())

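# Compute the statistics: count every n-letter slice of every word
def MySta(n):
    Dic = {}       # n-letter combination -> count
    total_num = 0  # total number of slices counted
    for word in words:
        # Words shorter than n give an empty range and are skipped
        for i in range(len(word) - n + 1):
            letter = word[i:i+n]
            total_num += 1
            Dic[letter] = Dic.get(letter, 0) + 1

    return Dic, total_num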

# Print the n most frequent and the n least frequent letter combinations,
# with their counts and relative frequencies
def PrintSta(Dic, total_num, n):
    # n < 1 is rejected as well: word_lst[-0:] would return the whole list
    if n < 1 or n > len(Dic):
        print('n is out of range')
        return
    # Sort (count, combination) pairs in descending order
    word_lst = sorted(((freq, word) for word, freq in Dic.items()), reverse=True)

    print('*-----------------------------------------*')
    print('Combination\tCount\tFrequency')
    for freq, word in word_lst[:n]:
        print('{0}\t{1}\t{2:.5}'.format(word, freq, freq/total_num))
    for freq, word in word_lst[-n:]:
        print('{0}\t{1}\t{2:.5}'.format(word, freq, freq/total_num))

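# Example run (an assumed call; choose n and the list length to suit your data)
Dic, total_num = MySta(2)
PrintSta(Dic, total_num, 10)

os.system("pause")  # Windows only: keeps the console window open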
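
The same counting can also be done with `collections.Counter` from the standard library, which takes care of both the tallying and the ranking. This is a minimal sketch of an alternative, not the method used in the script above; the function name `count_ngrams` and the sample text are made up for illustration.

```python
import re
from collections import Counter


def count_ngrams(text, n):
    """Count every n-letter slice of each ASCII-letter word in text."""
    counts = Counter()
    for word in re.findall(r'[a-zA-Z]+', text):
        word = word.lower()
        # Words shorter than n contribute nothing (empty range)
        counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    return counts


sample = "banana bandana"  # stand-in for the contents of Data.txt
counts = count_ngrams(sample, 2)
total = sum(counts.values())

# most_common(k) returns the k highest counts, largest first
for ngram, freq in counts.most_common(3):
    print('{0}\t{1}\t{2:.5}'.format(ngram, freq, freq / total))
```

One difference to keep in mind: `most_common` orders ties by first appearance, whereas sorting `(count, combination)` tuples in reverse breaks ties in reverse alphabetical order.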