
Word Segmentation Statistics in Python

2016-01-25 22:38

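Python slicing makes text processing very convenient. The simple example below splits a text file into words and counts the letter combinations (runs of n consecutive letters) that occur in them.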

(The input file must be UTF-8 encoded. The code was run on Windows with Python 3.)

# -*- coding: utf-8 -*-
import re
import os

Total = 0   # total number of letters (not used below)
words = []  # every word found in the file, lower-cased

# Collect every word in the file
with open('Data.txt', encoding='utf-8') as readfile:
    for line in readfile:
        for word in line.strip().split():
            # keep only runs of ASCII letters, so "don't" yields "don" and "t"
            for w in re.findall(r'[a-zA-Z]+', word):
                words.append(w.lower())

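# Count the letter combinations of length n
def MySta(n):
    Dic = {}        # letter combination -> number of occurrences
    total_num = 0   # total number of combinations counted
    for word in words:
        # slice out every run of n consecutive letters;
        # words shorter than n give an empty range and are skipped
        for i in range(len(word) - n + 1):
            letter = word[i:i+n]
            total_num += 1
            Dic[letter] = Dic.get(letter, 0) + 1

    return Dic, total_num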

# Print the n most frequent and the n least frequent letter combinations,
# with their counts and relative frequencies
def PrintSta(Dic, total_num, n):
    if n > len(Dic):
        print('n is out of range')
        return
    word_lst = []
    for word, freq in Dic.items():
        word_lst.append((freq, word))

    # sort by count, highest first
    word_lst.sort(reverse=True)
    print('*-----------------------------------------*')
    print('letters\tcount\tfrequency')
    for freq, word in word_lst[:n]:
        print('{0}\t{1}\t{2:.5}'.format(word, freq, freq/total_num))
    for freq, word in word_lst[-n:]:
        print('{0}\t{1}\t{2:.5}'.format(word, freq, freq/total_num))

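# Example call. The original post does not show one, so the values here
# (letter pairs, top and bottom 5) are only illustrative.
Dic, total_num = MySta(2)
PrintSta(Dic, total_num, 5)

os.system("pause")  # Windows only: keeps the console window open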
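
The counting loop above can also be written with collections.Counter from the standard library. What follows is only a minimal sketch under stated assumptions, not the code from the original post: the function names count_ngrams and top_and_bottom are hypothetical, and the words are passed in as an argument instead of being read from Data.txt into a global list.

```python
from collections import Counter


def count_ngrams(words, n):
    """Count every run of n consecutive letters across the given words.

    Hypothetical helper: it mirrors the slicing loop of MySta, but takes
    the word list as a parameter instead of reading a global variable.
    """
    counts = Counter()
    for word in words:
        # word[i:i + n] yields one letter combination per start index;
        # words shorter than n give an empty range and contribute nothing.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def top_and_bottom(counts, k):
    """Return the k most common and the k least common (letters, count) pairs."""
    ranked = counts.most_common()  # sorted by count, highest first
    top = ranked[:k]
    # ranked[-0:] would return the whole list, so handle k == 0 explicitly
    bottom = ranked[-k:] if k > 0 else []
    return top, bottom
```

For example, count_ngrams(["banana", "band"], 2) counts "an" three times out of eight letter pairs in total. The total needed for the relative frequency is sum(counts.values()), so it does not have to be tracked separately.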
