
Natural Language Processing: Processing Raw Text

2017-06-08 20:13
This post introduces several ways to access web text programmatically.

1. Accessing web resources

>>> import feedparser
>>> llog=feedparser.parse('http://weibo.com/ttarticle/p/show?id=2309404116343489194022')
>>> llog.keys()
['feed', 'status', 'version', 'encoding', 'bozo', 'headers', 'href', 'namespaces', 'entries', 'bozo_exception']
>>> type(llog['feed'])
<class 'feedparser.FeedParserDict'>
>>> llog['feed'].keys()
['meta', 'summary']
>>> llog['feed']['meta']
{'content': u'text/html; charset=gb2312', 'http-equiv': u'Content-type'}
>>> llog['feed']['summary']
u'<span id="message"></span>\n\n&&&&&&&&&&&&&&&&&&&&&&&&&'
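
The `meta` entry above carries a `Content-type` value with a `charset` parameter. As a minimal sketch that is not part of the original session, the snippet below shows one way to read that charset with the Python 3 standard library and use it to decode raw page bytes. The helper names `charset_from_content_type` and `decode_page` are hypothetical, and the sample bytes stand in for a real HTTP response body.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset() or default


def decode_page(raw_bytes, content_type):
    """Decode raw bytes using the charset named in content_type."""
    charset = charset_from_content_type(content_type)
    # errors="replace" keeps decoding even if a few bytes do not fit the charset.
    return raw_bytes.decode(charset, errors="replace")


# Stand-in for a response body; a real one would come from an HTTP request.
sample = "自然语言处理".encode("gb2312")
text = decode_page(sample, "text/html; charset=gb2312")
print(text)
```

Falling back to UTF-8 when no charset is declared is an assumption made here for illustration; a real page may need a different default.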


3. Processing HTML

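There are generally three approaches: regular-expression matching, nltk.clean_html(), and BeautifulSoup. Regular expressions are tedious to write, and nltk.clean_html() is no longer supported, so the simplest and most common choice is the BeautifulSoup package.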
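
To illustrate why the regular-expression route gets tedious, here is a minimal sketch that uses only Python's `re` and `html` modules. The function name `strip_tags_with_regex` and the sample string are made up for illustration. It copes with simple markup, but each special case (scripts, styles, comments, entities) needs its own pattern, and real pages tend to keep producing new ones.

```python
import html
import re


def strip_tags_with_regex(markup):
    """Crude tag stripper: fine for simple markup, fragile on real pages."""
    # Drop <script> and <style> elements together with their contents.
    markup = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", markup)
    # Drop HTML comments.
    markup = re.sub(r"(?s)<!--.*?-->", " ", markup)
    # Drop every remaining tag.
    markup = re.sub(r"<[^>]+>", " ", markup)
    # Turn entities such as &#39; back into characters.
    markup = html.unescape(markup)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", markup).strip()


sample = (
    '<p class="title"><b>The Dormouse&#39;s story</b></p>'
    "<script>var x = 1;</script>"
)
print(strip_tags_with_regex(sample))
```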

from bs4 import BeautifulSoup

html_doc='''
<html><head><title>The Document's story</title></head>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
'''
# Parse with the built-in html.parser backend and extract all text nodes
soup = BeautifulSoup(html_doc, 'html.parser')
content = soup.get_text()
print(content)


The output is as follows:

runfile('D:/my project/e_book/XXMLV-2/4.Python_代码/test.py', wdir='D:/my project/e_book/XXMLV-2/4.Python_代码')

The Document's story
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
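
When BeautifulSoup is not installed, the standard library's `html.parser` module can do a similar extraction. The sketch below is a swapped-in alternative, not the method used above: the class name `TextAndLinkExtractor` is hypothetical, and the sample is a shortened variant of the story paragraph. It collects the visible text and, in addition, the `href` of every link.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Collect visible text chunks and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs arrives as a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Skip chunks that contain only whitespace.
        if data.strip():
            self.text_parts.append(data.strip())


sample = (
    '<p class="story">Once upon a time there were three little sisters: '
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>'
)
parser = TextAndLinkExtractor()
parser.feed(sample)
parser.close()
print(" ".join(parser.text_parts))
print(parser.links)
```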