
Training a Chinese word2vec model

2017-12-03 16:20

1. Preparing and preprocessing the data

First you need a reasonably large Chinese corpus; Chinese Wikipedia is a good choice (the Sogou news corpus is another option). The Chinese Wikipedia dump can be downloaded from
https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

The Chinese Wikipedia data set is not very large; the compressed XML file is roughly 1 GB. First, process this compressed XML file with process_wiki_data.py by running:
python process_wiki_data.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# process_wiki_data.py: parse the Wikipedia XML dump and convert it to plain text

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print("Usage: python process_wiki_data.py <wiki_dump.xml.bz2> <output_text_file>")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    # WikiCorpus strips the wiki markup; each article becomes a list of tokens
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")


The run produces output like:

2016-08-11 20:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text

2016-08-11 20:40:08,329: INFO: Saved 10000 articles

2016-08-11 20:40:45,501: INFO: Saved 20000 articles

2016-08-11 20:41:23,659: INFO: Saved 30000 articles

2016-08-11 20:42:01,748: INFO: Saved 40000 articles

2016-08-11 20:42:33,779: INFO: Saved 50000 articles

......

2016-08-11 20:55:23,094: INFO: Saved 200000 articles

2016-08-11 20:56:14,692: INFO: Saved 210000 articles

2016-08-11 20:57:04,614: INFO: Saved 220000 articles

2016-08-11 20:57:57,979: INFO: Saved 230000 articles

2016-08-11 20:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)

2016-08-11 20:58:16,622: INFO: Finished Saved 232894 articles


In Python, word segmentation can be done with jieba, turning wiki.zh.text into the segmented file wiki.zh.text.seg; a sketch is given below.
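A minimal segmentation sketch (the script name segment_wiki.py is hypothetical; it assumes jieba is installed and reuses the file names above):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# segment_wiki.py (hypothetical name): segment wiki.zh.text with jieba and
# write the space-separated tokens to wiki.zh.text.seg, one article per line.

import codecs
import jieba

with codecs.open('wiki.zh.text', 'r', encoding='utf-8') as fin, \
     codecs.open('wiki.zh.text.seg', 'w', encoding='utf-8') as fout:
    for line in fin:
        # jieba.cut returns a generator of unicode tokens
        fout.write(' '.join(jieba.cut(line.strip())) + '\n')

Run it once over the whole corpus before training; on the full Wikipedia text this step can take a while.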

Then train the word2vec model:
python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# train_word2vec_model.py: train a word2vec model on the segmented corpus

import logging
import os.path
import sys
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print("Usage: python train_word2vec_model.py <segmented_corpus> <model_output> <vector_output>")
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # 400-dimensional vectors, context window of 5, drop words seen fewer than 5 times
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)


Training log:

2016-08-12 09:50:02,586: INFO: running python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector

2016-08-12 09:50:02,592: INFO: collecting all words and their counts

2016-08-12 09:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types

2016-08-12 09:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types

2016-08-12 09:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types

2016-08-12 09:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types

...

2016-08-12 09:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types

2016-08-12 09:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types

2016-08-12 09:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types

2016-08-12 09:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences

2016-08-12 09:52:13,672: INFO: total 278291 word types after removing those with count<5

2016-08-12 09:52:13,673: INFO: constructing a huffman tree from 278291 words

2016-08-12 09:52:29,323: INFO: built huffman tree with maximum node depth 25

2016-08-12 09:52:29,683: INFO: resetting layer weights

2016-08-12 09:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0

2016-08-12 09:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s

2016-08-12 09:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s

2016-08-12 09:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s

2016-08-12 09:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s

2016-08-12 09:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s

2016-08-12 09:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s

2016-08-12 09:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s

......

2016-08-12 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s

2016-08-12 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s

2016-08-12 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s

2016-08-12 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s

2016-08-12 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s

2016-08-12 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None

2016-08-12 19:22:13,884: INFO: not storing attribute syn0norm

2016-08-12 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy

2016-08-12 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy

2016-08-12 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector


Testing the model:

In [1]: import gensim



In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")



In [3]: model.most_similar(u"足球")

Out[3]:

[(u'\u8054\u8d5b', 0.6553816199302673),

(u'\u7532\u7ea7', 0.6530429720878601),

(u'\u7bee\u7403', 0.5967546701431274),

(u'\u4ff1\u4e50\u90e8', 0.5872289538383484),

(u'\u4e59\u7ea7', 0.5840631723403931),

(u'\u8db3\u7403\u961f', 0.5560152530670166),

(u'\u4e9a\u8db3\u8054', 0.5308005809783936),

(u'allsvenskan', 0.5249762535095215),

(u'\u4ee3\u8868\u961f', 0.5214947462081909),

(u'\u7532\u7ec4', 0.5177896022796631)]



In [4]: result = model.most_similar(u"足球")



In [5]: for e in result:
   ....:     print e[0], e[1]
   ....:

联赛 0.65538161993

甲级 0.653042972088

篮球 0.596754670143

俱乐部 0.587228953838

乙级 0.58406317234

足球队 0.556015253067

亚足联 0.530800580978

allsvenskan 0.52497625351

代表队 0.521494746208

甲组 0.51778960228



In [6]: result = model.most_similar(u"男人")



In [7]: for e in result:
   ....:     print e[0], e[1]
   ....:

女人 0.77537125349

家伙 0.617369174957

妈妈 0.567102909088

漂亮 0.560832381248

잘했어 0.540875017643

谎言 0.538448691368

爸爸 0.53660941124

傻瓜 0.535608053207

예쁘다 0.535151124001

mc刘 0.529670000076



In [8]: result = model.most_similar(u"女人")



In [9]: for e in result:
   ....:     print e[0], e[1]
   ....:

男人 0.77537125349

我的某 0.589010596275

妈妈 0.576344847679

잘했어 0.562340974808

美丽 0.555426716805

爸爸 0.543958246708

新娘 0.543640494347

谎言 0.540272831917

妞儿 0.531066179276

老婆 0.528521537781



In [10]: result = model.most_similar(u"青蛙")



In [11]: for e in result:
   ....:     print e[0], e[1]
   ....:

老鼠 0.559612870216

乌龟 0.489831030369

蜥蜴 0.478990525007

猫 0.46728849411

鳄鱼 0.461885392666

蟾蜍 0.448014199734

猴子 0.436584025621

白雪公主 0.434905380011

蚯蚓 0.433413207531

螃蟹 0.4314712286



In [12]: result = model.most_similar(u"姨夫")



In [13]: for e in result:
   ....:     print e[0], e[1]
   ....:

堂伯 0.583935439587

祖父 0.574735701084

妃所生 0.569327116013

内弟 0.562012672424

早卒 0.558042645454

曕 0.553856015205

胤祯 0.553288519382

陈潜 0.550716996193

愔之 0.550510883331

叔父 0.550032019615



In [14]: result = model.most_similar(u"衣服")



In [15]: for e in result:
   ....:     print e[0], e[1]
   ....:

鞋子 0.686688780785

穿着 0.672499775887

衣物 0.67173999548

大衣 0.667605519295

裤子 0.662670075893

内裤 0.662210345268

裙子 0.659705817699

西装 0.648508131504

洋装 0.647238850594

围裙 0.642895817757



In [16]: result = model.most_similar(u"公安局")



In [17]: for e in result:
   ....:     print e[0], e[1]
   ....:

司法局 0.730189085007

公安厅 0.634275555611

公安 0.612798035145

房管局 0.597343325615

商业局 0.597183346748

军管会 0.59476184845

体育局 0.59283208847

财政局 0.588721752167

戒毒所 0.575558543205

新闻办 0.573395550251



In [18]: result = model.most_similar(u"铁道部")



In [19]: for e in result:
   ....:     print e[0], e[1]
   ....:

盛光祖 0.565509021282

交通部 0.548688530922

批复 0.546967327595

刘志军 0.541010737419

立项 0.517836689949

报送 0.510296344757

计委 0.508456230164

水利部 0.503531932831

国务院 0.503227233887

经贸委 0.50156635046



In [20]: result = model.most_similar(u"清华大学")



In [21]: for e in result:
   ....:     print e[0], e[1]
   ....:

北京大学 0.763922810555

化学系 0.724210739136

物理系 0.694550514221

数学系 0.684280991554

中山大学 0.677202701569

复旦 0.657914161682

师范大学 0.656435549259

哲学系 0.654701948166

生物系 0.654403865337

中文系 0.653147578239



In [22]: result = model.most_similar(u"卫视")



In [23]: for e in result:
   ....:     print e[0], e[1]
   ....:

湖南 0.676812887192

中文台 0.626506924629

収蔵 0.621356606483

黄金档 0.582251906395

cctv 0.536769032478

安徽 0.536752820015

非同凡响 0.534517168999

唱响 0.533438682556

最强音 0.532605051994

金鹰 0.531676828861



In [24]: result = model.most_similar(u"习近平")



In [25]: for e in result:
   ....:     print e[0], e[1]
   ....:

胡锦涛 0.809472680092

江泽民 0.754633367062

李克强 0.739740967751

贾庆林 0.737033963203

曾庆红 0.732847094536

吴邦国 0.726941585541

总书记 0.719057679176

李瑞环 0.716384887695

温家宝 0.711952567101

王岐山 0.703570842743



In [26]: result = model.most_similar(u"林丹")



In [27]: for e in result:
   ....:     print e[0], e[1]
   ....:

黄综翰 0.538035452366

蒋燕皎 0.52646958828

刘鑫 0.522252976894

韩晶娜 0.516120731831

王晓理 0.512289524078

王适 0.508560419083

杨影 0.508159279823

陈跃 0.507353425026

龚智超 0.503159761429

李敬元 0.50262516737



In [28]: result = model.most_similar(u"语言学")



In [29]: for e in result:
   ....:     print e[0], e[1]
   ....:

社会学 0.632598280907

人类学 0.623406708241

历史学 0.618442356586

比较文学 0.604823827744

心理学 0.600066184998

人文科学 0.577783346176

社会心理学 0.575571238995

政治学 0.574541330338

地理学 0.573896467686

哲学 0.573873817921



In [30]: result = model.most_similar(u"计算机")



In [31]: for e in result:
   ....:     print e[0], e[1]
   ....:

自动化 0.674171924591

应用 0.614087462425

自动化系 0.611132860184

材料科学 0.607891201973

集成电路 0.600370049477

技术 0.597518980503

电子学 0.591316461563

建模 0.577238917351

工程学 0.572855889797

微电子 0.570086717606



In [32]: model.similarity(u"计算机", u"自动化")

Out[32]: 0.67417196002404789



In [33]: model.similarity(u"女人", u"男人")

Out[33]: 0.77537125129824813



In [34]: model.doesnt_match(u"早餐 晚餐 午餐 中心".split())

Out[34]: u'\u4e2d\u5fc3'



In [35]: print model.doesnt_match(u"早餐 晚餐 午餐 中心".split())

中心
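If you only need the word vectors, the plain-text file wiki.zh.text.vector written by save_word2vec_format can also be loaded directly. A minimal sketch using the old gensim API from this post (newer gensim versions expose the same loader as gensim.models.KeyedVectors.load_word2vec_format):

import gensim

# Load only the vectors from the text-format file produced above;
# binary=False because it was saved as text, not in the binary word2vec format.
vectors = gensim.models.Word2Vec.load_word2vec_format("wiki.zh.text.vector", binary=False)
print(vectors.most_similar(u"足球")[:3])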