
Python Natural Language Processing Study Notes: Chapter 3

2013-11-09 10:19
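From this chapter onward, the example programs assume that you begin your interactive session or program with the following import statements: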

>>> from __future__ import division

>>> import nltk, re, pprint

Reading data stored on the web:

>>> from __future__ import division

>>> import nltk, re, pprint

>>> from urllib import urlopen

>>> url = "http://www.gutenberg.org/files/2554/2554.txt"

>>> raw = urlopen(url).read()

>>> type(raw)

<type 'str'>

>>> len(raw)

1176893

>>> raw[:75]

'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
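
The session above is Python 2, where urlopen lives in urllib and read() returns a str. As a rough sketch of the same download under Python 3, the block below uses urllib.request.urlopen and decodes the bytes explicitly. The helper name fetch_text and the UTF-8 default are assumptions for illustration, not part of the original notes.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download `url` and return its body as a str.

    `fetch_text` is a hypothetical helper; the default encoding is an
    assumption and should be changed to match the actual file.
    """
    with urlopen(url) as response:
        data = response.read()  # bytes in Python 3, not str
    # Undecodable bytes become U+FFFD instead of raising an error.
    return data.decode(encoding, errors="replace")


# Usage (needs network access):
# raw = fetch_text("http://www.gutenberg.org/files/2554/2554.txt")
# print(type(raw), len(raw))
```

The length reported under Python 3 may differ from the 1176893 shown above, because len() then counts decoded characters rather than bytes.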

If you use an Internet proxy that Python does not detect correctly, you may need to specify the proxy manually, as follows:

>>> proxies = {'http': 'http://www.someproxy.com:3128'}

>>> raw = urlopen(url, proxies=proxies).read()
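
The proxies argument shown above belongs to the Python 2 urlopen. Under Python 3, one way to get the same effect is to build an opener around urllib.request.ProxyHandler. This is only a sketch: make_proxy_opener is a hypothetical helper name, and the proxy address is the same placeholder used in the session above.

```python
from urllib.request import ProxyHandler, build_opener


def make_proxy_opener(proxies):
    """Return an opener that sends requests through the given proxies.

    `make_proxy_opener` is a hypothetical helper name.
    """
    return build_opener(ProxyHandler(proxies))


# The proxy address is the placeholder from the session above.
proxies = {"http": "http://www.someproxy.com:3128"}
opener = make_proxy_opener(proxies)

# Usage (needs network access and a real proxy):
# raw = opener.open(url).read()
```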

Tokenizing the downloaded text:

>>> tokens = nltk.word_tokenize(raw)

>>> type(tokens)

<type 'list'>

>>> len(tokens)

244484

>>> tokens[:10]

['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

>>> tokens[:15]

['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by', 'Fyodor', 'Dostoevsky', 'This', 'eBook', 'is']
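
To get a feel for what tokenization does, here is a deliberately simplified regular-expression tokenizer. It is not the algorithm behind nltk.word_tokenize(); simple_tokenize is a hypothetical name, and the pattern only separates runs of word characters from individual punctuation marks.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    A simplified stand-in for illustration, not NLTK's tokenizer.
    """
    return re.findall(r"\w+|[^\w\s]", text)


sample = "The Project Gutenberg EBook of Crime and Punishment, by"
print(simple_tokenize(sample))
```

On this fragment the sketch yields the same ten tokens as tokens[:10] above. It diverges on harder cases: contractions such as "don't" are handled specially by NLTK (note the "n't know" entry in the collocations output later on), while this pattern splits at the apostrophe.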

Further processing of the resulting tokens:

>>> text = nltk.Text(tokens)

>>> type(text)

<class 'nltk.text.Text'>

>>> text[1020:1060]

['had', 'successfully', 'avoided', 'meeting', 'his', 'landlady', 'on', 'the', 'staircase.', 'His', 'garret', 'was', 'under', 'the', 'roof', 'of', 'a', 'high', ',', 'five-storied', 'house', 'and', 'was', 'more', 'like', 'a', 'cupboard', 'than', 'a', 'room.',
'The', 'landlady', 'who', 'provided', 'him', 'with', 'garret', ',', 'dinners', ',']

>>> text.collocations()

Building collocations list

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya

Romanovna; Marfa Petrovna; Rodion Romanovitch; Sofya Semyonovna; old

woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;

great deal; Project Gutenberg; Andrey Semyonovitch; Nikodim Fomitch;

young man; Dmitri Prokofitch; n't know; Ilya Petrovitch; Good heavens
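
The idea behind the collocations output can be approximated by counting adjacent word pairs. The sketch below uses raw frequency only, which is an assumption made for brevity; it is not how text.collocations() scores pairs, and frequent_bigrams is a hypothetical name.

```python
from collections import Counter


def frequent_bigrams(tokens, top_n=5):
    """Return the top_n most frequent adjacent pairs of alphabetic tokens.

    Raw counts only; a simplification, not NLTK's collocation scoring.
    """
    pairs = zip(tokens, tokens[1:])
    counts = Counter(
        (a, b) for a, b in pairs if a.isalpha() and b.isalpha()
    )
    return counts.most_common(top_n)


demo = "Katerina Ivanovna said that Katerina Ivanovna , the old woman".split()
print(frequent_bigrams(demo, top_n=1))
```

Plain counts favour pairs of very common words such as "of the", so results on the full novel will look different from the name-heavy list above.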

>>> raw.find("PART I")

5338

>>> raw.rfind("End of Project Gutenberg's Crime")

1157743

>>> raw = raw[5303:1157681]

>>> raw.find("PART I")

35
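
Note that find() returned 5338 here while the slice starts at the hard-coded offset 5303, which is why "PART I" ends up at position 35 (5338 - 5303) instead of 0 after slicing. A sketch that avoids hard-coded offsets by reusing the values returned by find() and rfind() follows; trim_between is a hypothetical helper name.

```python
def trim_between(raw, start_marker, end_marker):
    """Return the part of raw from start_marker up to the last end_marker.

    `trim_between` is a hypothetical helper, shown for illustration.
    """
    start = raw.find(start_marker)
    end = raw.rfind(end_marker)
    if start == -1 or end == -1 or end < start:
        raise ValueError("markers not found in the expected order")
    return raw[start:end]


# Usage with the markers from the session above:
# raw = trim_between(raw, "PART I", "End of Project Gutenberg's Crime")
# raw.find("PART I") is then 0
```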

Working with HTML files on the web:

>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"

>>> html = urlopen(url).read()

>>> html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

Use print html to display the downloaded page.

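Extracting text from HTML is an extremely common task, so NLTK provides a helper function, nltk.clean_html(), that takes an HTML string as its argument and returns the raw text. We can then tokenize that raw text.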
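
As far as I know, newer NLTK releases (3.x) no longer implement clean_html() and instead recommend a dedicated HTML library such as BeautifulSoup. As a dependency-free sketch of the same idea, the block below strips markup with the standard-library html.parser module under Python 3. TextExtractor and strip_html are hypothetical names, and the result will not match clean_html() exactly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Blondes <b>to die out</b></p>"
        "<script>var x=1;</script></body></html>")
print(strip_html(page))
```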

>>> raw = nltk.clean_html(html)  # strip the HTML markup

>>> tokens = nltk.word_tokenize(raw)  # split the text into a list of tokens

>>> tokens  # display all the tokens

>>> tokens = tokens[96:399]

>>> text = nltk.Text(tokens)

>>> text.concordance('gene')

Building index...

Displaying 4 of 4 matches:

hey say too few people now carry the gene for blondes to last beyond the next

have blonde hair , it must have the gene on both sides of the family in the g

ere is a disadvantage of having that gene or by chance. They do n't disappear

des would disappear is if having the gene was a disadvantage and I do not thin
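
A concordance lists every occurrence of a word together with its surrounding context. The sketch below does this over a token list with a fixed number of tokens on each side; NLTK's own display is aligned by character width instead, so treat this as an approximation. The function name and the width parameter are assumptions for illustration.

```python
def concordance(tokens, word, width=4):
    """Return one context string per occurrence of word (case-insensitive).

    `width` is the number of tokens kept on each side of the match.
    """
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines


demo_tokens = "few people now carry the gene for blondes to last".split()
for line in concordance(demo_tokens, "gene", width=2):
    print(line)
```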

Processing search results: