
Python Web Scraping Study Notes: Day 31

2017-08-26 21:08

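Today was mainly about getting to know NLTK (Natural Language Toolkit), Python's natural language toolkit. It is a third-party Python library, so you have to install it yourself; running pip (pip install nltk) is enough.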

The installation took a little while.

Once the installation finishes, open IDLE and type

import nltk

If no error is raised, congratulations, the installation succeeded; otherwise, search Google for the error message. Next, type

nltk.download()

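This command downloads the data packages. After you run it, a list of packages appears for you to choose from. If you don't know which ones to pick, just download them all; the total is not very large.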

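The main pieces covered are:

bigrams

trigrams

ngrams

All three of these produce n-grams from a text, that is, runs of 2, 3, or n consecutive tokens.

FreqDist provides functions for counting frequencies.

nltk.book provides nine sample texts, so people who have no samples of their own can study with them.

Text: NLTK mainly works with Text objects, and the Text class wraps a sequence of tokens from Python in a Text object.

The exercises below use text6 from nltk.book.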
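To make n-grams and frequency counts concrete, here is a plain-Python sketch that does not call NLTK at all. The helper `simple_ngrams` and the sample sentence are made up for illustration. The helper yields tuples of n consecutive tokens, which is the same shape of result that NLTK's ngrams produces, and `collections.Counter` stands in for FreqDist (FreqDist is itself a Counter subclass, so `most_common` behaves alike).

```python
from collections import Counter


def simple_ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple (illustrative helper, not NLTK's code)."""
    tokens = list(tokens)
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


# Made-up sample text, split on whitespace for simplicity
sample_tokens = "the cat sat on the mat and the cat slept".split()

bigram_list = list(simple_ngrams(sample_tokens, 2))
bigram_counts = Counter(bigram_list)  # plays the role of FreqDist here

if __name__ == "__main__":
    print(bigram_list[:3])
    print(bigram_counts.most_common(1))
```

With NLTK installed, the same steps would use ngrams(tokens, 2) and FreqDist in place of the two stand-ins.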

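Next, a tagging scheme: Penn Treebank tags. They mark the role each word plays in a sentence (similar to parts of speech, but finer-grained). Look it up if you are interested.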

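NLTK uses a context-free grammar of English to identify parts of speech. A context-free grammar can basically be seen as a set of rules that uses an ordered list to determine which words may follow a given word. NLTK's context-free grammar determines which parts of speech may follow a given part of speech. Whenever it meets an ambiguous word such as "dust", NLTK applies these rules to settle on a suitable part of speech. (Strictly speaking, the default tagger behind pos_tag in current NLTK releases is a pretrained statistical model rather than a hand-written rule set, but the rule-based picture is a useful way to think about it.)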

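Beyond that, NLTK can also be trained, that is, it can use machine learning to build an entirely new set of context-free grammar rules, for example for a foreign language. If you hand-tag a large part of a text in that language with Penn Treebank tags, you can pass the result to NLTK and train it to tag other, untagged text. Training is an indispensable part of any machine learning task.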
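The train-then-tag workflow can be illustrated with a toy tagger in plain Python. This is not NLTK's training API: `train_most_frequent_tag`, `tag_tokens`, the fallback tag, and the tiny hand-tagged corpus are all hypothetical. It only memorizes the most frequent tag seen for each word, so unlike a real tagger it cannot use context to resolve ambiguous words.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Learn, for each word, the tag it carried most often in the training data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


def tag_tokens(model, tokens, default_tag="NN"):
    """Tag tokens with the learned model; unseen words get default_tag (an assumption)."""
    return [(token, model.get(token.lower(), default_tag)) for token in tokens]


# Tiny hand-tagged corpus using Penn Treebank tags (made up for illustration)
training_data = [
    [("the", "DT"), ("dust", "NN"), ("settles", "VBZ")],
    [("they", "PRP"), ("dust", "VBP"), ("the", "DT"), ("shelf", "NN")],
    [("the", "DT"), ("dust", "NN"), ("is", "VBZ"), ("thick", "JJ")],
]

model = train_most_frequent_tag(training_data)
tagged = tag_tokens(model, ["The", "dust", "is", "everywhere"])

if __name__ == "__main__":
    print(tagged)
```

Here "dust" ends up tagged NN because it was a noun in two of the three training sentences; the unseen word "everywhere" simply receives the default tag.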

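Many ambiguity problems in natural language can be resolved with NLTK's pos_tag. Instead of searching only for a target word or phrase, search for the target word or phrase together with its tag; this can greatly improve the accuracy and efficiency of a crawler's searches.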

Its tag reference table: [table not preserved]
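For quick reference, a handful of common Penn Treebank tags can be kept in a small Python lookup table. This is only a subset chosen for illustration, not the complete tag set, and `PENN_TAGS` and `describe_tag` are hypothetical names.

```python
# A small subset of the Penn Treebank tag set (not the complete table)
PENN_TAGS = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
    "RB": "adverb",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, third person singular present",
}


def describe_tag(tag):
    """Return a readable description, or a placeholder for tags outside this subset."""
    return PENN_TAGS.get(tag, "unknown (not in this subset)")


if __name__ == "__main__":
    for tag in ("JJ", "NNS", "VBZ"):
        print(tag, "->", describe_tag(tag))
```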

Exercise: print all 4-grams that begin with "the"

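from nltk import ngrams
from nltk.book import *  # loads the sample texts text1..text9

print(text6)  # shows which text this is
fourgrams = ngrams(text6, 4)  # every run of 4 consecutive tokens
for fourgram in fourgrams:
    if fourgram[0] == 'the':  # keep only 4-grams whose first token is "the"
        print(fourgram)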

Exercise: use tagging to find each word's part of speech

from nltk import pos_tag
from nltk import word_tokenize

# Split the passage into tokens, then tag each token with its part of speech
text = word_tokenize("Strange women lying in ponds distributing swords is no basis for a system of government.  Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.")
print(pos_tag(text))
input()  # wait for Enter so the console window stays open

The output has the following form:

[('Strange', 'JJ'), ('women', 'NNS'), ('lying', 'VBG'), ('in', 'IN'), ('ponds', 'NNS'), ('distributing', 'VBG'), ('swords', 'NNS'), ('is', 'VBZ'), ('no', 'DT'), ('basis', 'NN'), ('for', 'IN'), ('a', 'DT'), ('system', 'NN'), ('of', 'IN'), ('government', 'NN'), ('.', '.'), ('Supreme', 'NNP'), ('executive', 'NN'), ('power', 'NN'), ('derives', 'VBZ'), ('from', 'IN'), ('a', 'DT'), ('mandate', 'NN'), ('from', 'IN'), ('the', 'DT'), ('masses', 'NNS'), (',', ','), ('not', 'RB'), ('from', 'IN'), ('some', 'DT'), ('farcical', 'JJ'), ('aquatic', 'JJ'), ('ceremony', 'NN'), ('.', '.')]

Each word is placed in a tuple; the other element of the tuple is that word's part-of-speech tag.
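Because the result is an ordinary list of (word, tag) tuples, it can be processed with plain Python. As a sketch, the snippet below groups words by tag. The `tagged_words` list is copied by hand from the beginning of the output above rather than produced by calling NLTK, and `group_by_tag` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_tag(tagged_words):
    """Collect words under their tag, preserving the order in which they appear."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        groups[tag].append(word)
    return dict(groups)


# First few (word, tag) pairs, copied by hand from the sample output
tagged_words = [
    ("Strange", "JJ"), ("women", "NNS"), ("lying", "VBG"), ("in", "IN"),
    ("ponds", "NNS"), ("distributing", "VBG"), ("swords", "NNS"), ("is", "VBZ"),
]

groups = group_by_tag(tagged_words)

if __name__ == "__main__":
    for tag, words in groups.items():
        print(tag, words)
```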

Exercise: find a given word in a passage, but only where it has a specified part of speech

from nltk import word_tokenize, sent_tokenize, pos_tag

nouns = ['NN', 'NNS', 'NNP', 'NNPS']  # Penn Treebank noun tags
sentences = sent_tokenize("Google is one of the best companies in the world. I constantly google myself to see what I'm up to.")
for sentence in sentences:
    # Only tag sentences that mention the word at all
    if "google" in sentence.lower():
        taggedtext = pos_tag(word_tokenize(sentence))
        for item in taggedtext:
            # Keep the word only when it is tagged as a noun
            if item[0].lower() == "google" and item[1] in nouns:
                print(item)

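That's it for today. With this tool, getting this far hardly even counts as a beginning; I have only learned a few functions. Checking in~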


Tags: python
