
wiki_word2vec_python experiment

2016-07-14 20:29
1. Install the Python version of gensim on Linux

word2vec depends on NumPy and SciPy, so install those two libraries first.

On Ubuntu:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

Then install gensim:

pip install gensim
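A quick sanity check that the stack imports cleanly (a minimal sketch; the printed version depends on what apt/pip installed):

# confirm numpy, scipy and gensim are importable
import numpy
import scipy
import gensim
print(gensim.__version__)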
2. With the prerequisites in place, we get to the key part.

Chinese data (1.3 GB) download:

https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

English data (11 GB) download:

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Convert the XML wiki dump to text format with the command:

python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text

process_wiki.py (the Python processing code):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print "Usage: python process_wiki.py <input.xml.bz2> <output.txt>"
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    # stream articles out of the compressed dump without building a dictionary
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished, saved " + str(i) + " articles")

The result is stored in wiki.en.text: one article per line, with tokens separated by spaces.
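To spot-check the result (a minimal sketch, assuming wiki.en.text sits in the current directory):

# print the first 200 characters of the first article
with open('wiki.en.text') as f:
    print(f.readline()[:200])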

3. The converted text does not distinguish Simplified from Traditional Chinese (running process_wiki.py on the Chinese dump yields wiki.zh.text), so we normalize everything to Simplified with opencc:

opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini

opencc needs to be installed beforehand:

sudo apt-get install opencc
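A rough way to verify the conversion (a sketch; 臺 is just one sample Traditional character, so a low count is suggestive rather than conclusive):

# -*- coding: utf-8 -*-
# count lines that still contain the Traditional character 臺
with open('wiki.zh.text.jian') as f:
    print(sum(1 for line in f if '臺' in line))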

4. Word-segment the normalized wiki.zh.text.jian.

Segmentation uses jieba. The script (saved here as jieba_seg.py; naming it jieba.py would shadow the jieba package when the script imports it):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import jieba

def cut_words(sentence):
    # segment one line and re-join the tokens with spaces
    return " ".join(jieba.cut(sentence)).encode('utf-8')

f = open("/home/xuanwei/工作/word2Vec/wiki.zh.text.jian")
target = open("/home/xuanwei/工作/word2Vec/wiki.zh.text.jian.seg", 'a+')
print 'open files:'
# read the corpus in chunks of whole lines (~100 KB at a time)
lines = f.readlines(100000)
num_n = 0
while lines:
    num_n += 1
    after_cut = map(cut_words, lines)
    target.writelines(after_cut)
    print 'processed %d chunks' % num_n
    lines = f.readlines(100000)

f.close()
target.close()
This step takes quite a long time...
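If the segmentation is too slow, jieba's built-in parallel mode can help (a sketch; jieba.enable_parallel uses multiple worker processes and only works on POSIX systems such as Linux):

# -*- coding: utf-8 -*-
import jieba
jieba.enable_parallel(4)  # 4 worker processes; POSIX only
print(" ".join(jieba.cut(u"足球是世界第一运动")))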
5. Now for the most important step: training.

Run:

python train_word2vec_model.py wiki.zh.text.jian.seg wiki.zh.text.model wiki.zh.text.vector

The code (train_word2vec_model.py):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print "Usage: python train_word2vec_model.py <seg_corpus> <model_out> <vectors_out>"
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # 400-dimensional vectors, 5-word window, ignore words seen fewer than 5 times
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())
    model.save(outp1)                                # full model (training can be resumed)
    model.save_word2vec_format(outp2, binary=False)  # plain-text word vectors
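The plain-text wiki.zh.text.vector written by save_word2vec_format can also be reloaded directly (a sketch against the gensim API of that era; newer gensim versions moved this method to KeyedVectors):

from gensim.models import Word2Vec
# load the text-format vectors back into a queryable model
model = Word2Vec.load_word2vec_format('wiki.zh.text.vector', binary=False)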
6. Finally, let's load the model and query it:
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")

In [3]: model.most_similar(u"足球")
Out[3]:
[(u'\u8054\u8d5b', 0.6553816199302673),
(u'\u7532\u7ea7', 0.6530429720878601),
(u'\u7bee\u7403', 0.5967546701431274),
(u'\u4ff1\u4e50\u90e8', 0.5872289538383484),
(u'\u4e59\u7ea7', 0.5840631723403931),
(u'\u8db3\u7403\u961f', 0.5560152530670166),
(u'\u4e9a\u8db3\u8054', 0.5308005809783936),
(u'allsvenskan', 0.5249762535095215),
(u'\u4ee3\u8868\u961f', 0.5214947462081909),
(u'\u7532\u7ec4', 0.5177896022796631)]

In [4]: result = model.most_similar(u"足球")

In [5]: for e in result:
   ...:     print e[0], e[1]
   ...:
联赛 0.65538161993
甲级 0.653042972088
篮球 0.596754670143
俱乐部 0.587228953838
乙级 0.58406317234
足球队 0.556015253067
亚足联 0.530800580978
allsvenskan 0.52497625351
代表队 0.521494746208
甲组 0.51778960228

That completes the word2vec training: word2vec maps each word to a vector, and those vectors capture information such as how related words are to one another.
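A couple of other queries the trained model supports (standard gensim Word2Vec calls; the numbers will vary from run to run):

# -*- coding: utf-8 -*-
# cosine similarity between two words, and picking the odd one out
print(model.similarity(u"足球", u"篮球"))
print(model.doesnt_match(u"足球 篮球 香蕉".split()))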

References:

1. 我爱自然语言处理 (52nlp): http://www.52nlp.cn/%E4%B8%AD%E8%8B%B1%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91%E8%AF%AD%E6%96%99%E4%B8%8A%E7%9A%84word2vec%E5%AE%9E%E9%AA%8C
2. CodeSky 代码之空: http://codesky.me/archives/ubuntu-python-jieba-word2vec-wiki-tutol.wind