
Chinese Word Segmentation in Python: Forward Maximum Matching

2017-07-15 11:21, 656 views

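Although I have been working with NLP for quite a while, my coding skills are still not very strong. I am posting this for two reasons: first, to share my code (I am not yet good enough to produce an open-source library for others to download, so I did not upload it to GitHub); second, in the hope that more experienced readers will point out mistakes and help me improve.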

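The theory of Chinese word segmentation is covered extensively online, so I will not repeat it here. Honestly, my math is only average and those theoretical treatments give me a headache: I struggle to understand them in the first place, let alone turn them into code. I have found, though, that understanding existing code helps me understand the theory more deeply. Then again, the very first implementation could only have been written by someone who understood the theory first, so I really admire people who are good at both theory and coding.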

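Maximum matching is one of the more basic and simple approaches in this field and does not involve much math, which is why I dare to publish this. I am still working through the more advanced algorithms; there is a long way to go.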

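Why Python? Probably because few people around me use it and I wanted to try something different. I will admit that I have not used C/C++, Java, or other high-level languages much either. Once you know one programming language the others are generally easy to pick up, but Python was the one that paid off for me first, so I stuck with it.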

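Here is a brief description of what the code does. It reads the test corpus and the dictionary into memory, assumes a maximum word length of 4, and matches forward using the maximum-matching principle. Along the way it splits the test corpus on carriage return plus line feed ('\r\n') and stores the pieces in a list, so each sentence is handled independently; as a consequence, the final segmentation output contains no line breaks either. The result is written to a new file, with words (and punctuation marks) separated by two spaces ('  '). After segmentation finishes, the reference segmentation, i.e. the gold standard (which also separates words with two spaces), is read into memory, and precision, recall, and F-score are computed.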
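To make the matching rule concrete, here is a minimal sketch of forward maximum matching on a single sentence. The function name `forward_max_match` and the toy dictionary are made up for illustration; this is not the full program, only the core idea under the same assumption of a maximum word length of 4.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Segment one sentence by forward maximum matching (illustrative sketch)."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest window first, shrinking down to two characters.
        for size in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
        else:
            # No multi-character word starts here: emit a single character.
            tokens.append(sentence[i])
            i += 1
    return tokens


# Toy dictionary, invented for this example only.
toy_dict = {'研究', '研究生', '生命', '起源'}
print('  '.join(forward_max_match('研究生命起源', toy_dict)))
# prints: 研究生  命  起源
```

The output also shows the well-known weakness of the method: because the longest match always wins, '研究生' is taken first and the intended reading '研究  生命  起源' is missed.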

The whole program finishes in a few seconds on a text of roughly 500,000 characters; readers are welcome to try it.

A couple of additional notes:

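1. If the data is left as undecoded bytes, a typical Chinese character has length 3, because its UTF-8 encoding takes three bytes; once the data is decoded as UTF-8 into a string, every character has length 1. (Lengths are measured with len().)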

2. I ran this program under Python 3.6.
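The first note can be checked directly in Python 3. The snippet below is only an illustration; the variable name `ch` is arbitrary.

```python
ch = '中'
print(len(ch))                  # 1: one character in a decoded str
print(len(ch.encode('utf-8')))  # 3: three bytes in its UTF-8 encoding
```

This is why the program decodes both files as UTF-8 before slicing: the windows of 4, 3, and 2 are meant to count characters, not bytes.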

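I am still gradually getting to grips with object-oriented programming. I had intended to wrap this in a class before posting, but given my current skill level I did not. I need to keep working at it.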

Here is the code:

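import codecs

MAXLEN = 4  # maximum word length tried by the matcher

# Test corpus. codecs.open does no newline translation, so '\r\n' is preserved.
with codecs.open('PATH_TO_TEST_CORPUS', 'r', 'utf-8') as corpus:
    corpusReader = corpus.read()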

# Dictionary: one word per line
with codecs.open('PATH_TO_DICTIONARY', 'r', 'utf-8') as dic:
    diclines = dic.readlines()

# Store four-, three- and two-character words separately.
# Sets give constant-time membership tests during matching.
char_4 = set()
char_3 = set()
char_2 = set()

for line in diclines:
    word = line.rstrip('\r\n')  # drop the line terminator, whichever style
    if len(word) == 4:
        char_4.add(word)
    elif len(word) == 3:
        char_3.add(word)
    elif len(word) == 2:
        char_2.add(word)

# One entry per corpus line, so every sentence is segmented on its own.
sentences = corpusReader.split('\r\n')

print('Please wait a few seconds...')
temp = ''

for sentence in sentences:
    i = 0
    while i < len(sentence):
        # Try the longest candidate first. The bound is <= so that a word
        # ending exactly at the end of the sentence can still match.
        if i + MAXLEN <= len(sentence):
            possible_word = sentence[i:i + MAXLEN]
            if possible_word in char_4:
                temp += possible_word + '  '
                i += MAXLEN
                continue

        if i + 3 <= len(sentence):
            possible_word = sentence[i:i + 3]
            if possible_word in char_3:
                temp += possible_word + '  '
                i += 3
                continue

        if i + 2 <= len(sentence):
            possible_word = sentence[i:i + 2]
            if possible_word in char_2:
                temp += possible_word + '  '
                i += 2
                continue

        # No dictionary word starts here: emit a single character.
        temp += sentence[i] + '  '
        i += 1

# Words are separated by two spaces; no line breaks are written.
temp = temp.strip()
with codecs.open('divide_result.txt', 'w', 'utf-8') as segResult:
    segResult.write(temp)

print('Segmentation ends, calculating precision rate, recall rate and f-score.')
with codecs.open('divide_result.txt', 'r', 'utf-8') as segResult:
    my = segResult.read()

# Gold-standard segmentation
with codecs.open('PATH_TO_GOLD_SEGMENTATION', 'r', 'utf-8') as gold_corpus:
    gold = gold_corpus.read()

# Join the gold lines into one string, mirroring the segmenter output.
gold = ''.join(gold.split('\r\n'))

gold_list = gold.strip().split('  ')
my_list = my.split('  ')
gold_len = len(gold_list)
my_len = len(my_list)
correct = 0

# Text consumed so far on each side. A word counts as correct only when both
# sides have consumed the same text and the next tokens are identical.
gold_before = ''
my_before = ''

i = 0
j = 0
while i < len(gold_list) and j < len(my_list):
    if len(gold_before) < len(my_before):
        # Gold is behind: advance gold only.
        gold_before += gold_list[i]
        i += 1
    elif len(gold_before) > len(my_before):
        # Our output is behind: advance it only.
        my_before += my_list[j]
        j += 1
    else:
        # Both sides are aligned: compare the next tokens, then advance both.
        if gold_before == my_before and gold_list[i] == my_list[j]:
            correct += 1
        gold_before += gold_list[i]
        my_before += my_list[j]
        i += 1
        j += 1

precision = correct / my_len
recall = correct / gold_len
# Guard against division by zero when no word matched.
f_score = 2 * precision * recall / (precision + recall) if correct else 0.0
print('precision rate:', precision)
print('recall rate:', recall)
print('f-score:', f_score)
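As a cross-check on the evaluation step, the same three scores can be computed by turning each token list into a set of character spans and intersecting the sets. This is a different technique from the prefix comparison used in the program, offered only as a sketch; the names `to_spans` and `prf` are hypothetical, and it assumes both token lists cover exactly the same underlying text.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def prf(gold_tokens, my_tokens):
    """Return (precision, recall, f_score) from span overlap."""
    gold_spans = to_spans(gold_tokens)
    my_spans = to_spans(my_tokens)
    # A word is correct when the same span appears on both sides.
    correct = len(gold_spans & my_spans)
    precision = correct / len(my_spans) if my_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data, invented for this example only.
gold_tokens = ['研究', '生命', '起源']
my_tokens = ['研究生', '命', '起源']
print(prf(gold_tokens, my_tokens))
```

In this toy case only '起源' occupies the same span on both sides, so one of three words is correct and all three scores come out at one third.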


Tags: python, Chinese word segmentation algorithm
