
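ImportError: No module named BeautifulSoup when processing HTML in the exercises from "Natural Language Processing with Python"

2015-12-30 17:20 671 views

While working through the exercises in Natural Language Processing with Python, I ran into the following error: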

# -*- coding: utf-8 -*-
"""
Created on Wed Dec 30 17:10:30 2015
@author: mahao
"""
# Incorrect version
import nltk
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
#print(html[:60])
raw = nltk.clean_html(html)
#print(nltk.word_tokenize(raw))
tokens = nltk.word_tokenize(raw)
tokens = tokens[96:399]
text = nltk.Text(tokens)
print(text.concordance('gene'))
"""
Error: ImportError: No module named 'BeautifulSoup'
"""

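In the end I consulted the Beautiful Soup 4.2.0 documentation. It turns out that Beautiful Soup 4 is imported from a package named bs4, so the import of the old BeautifulSoup module has to be changed to:

from bs4 import BeautifulSoup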
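To see which of the two module names is actually importable in a given environment, a small check along these lines may help. It is only a sketch: the helper name `first_available` is made up for illustration, and it relies solely on the standard library's `importlib.util.find_spec`, so it runs whether or not Beautiful Soup is installed.

```python
import importlib.util


def first_available(candidates):
    """Return the first module name in `candidates` that can be imported, or None."""
    for name in candidates:
        # find_spec() returns None for a top-level module that is not installed
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# "bs4" is the Beautiful Soup 4 package; "BeautifulSoup" is the old module name
found = first_available(["bs4", "BeautifulSoup"])
print("Beautiful Soup module found:", found)
```

If this prints `None`, neither version is installed and the ImportError has nothing to do with the rename.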

However, after making that change, running the script produces a different error:

NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

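The problem turned out to be the call to clean_html(). This Stack Overflow question explains it: http://stackoverflow.com/questions/10524387/beautifulsoup-get-text-does-not-strip-all-tags-and-javascript

Apparently, later versions of NLTK no longer support the clean_html() and clean_url() functions: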

Support for clean_html and clean_url will be dropped for future versions of nltk. Please use BeautifulSoup for now...it's very unfortunate.

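I then went back to the Beautiful Soup 4.2.0 documentation and found that the way BeautifulSoup is used has changed as well: build a BeautifulSoup object from the HTML and call its get_text() method. The final program: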

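import nltk
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
#print(html[:60])
# Name the parser explicitly; without it Beautiful Soup picks one and may warn
soup = BeautifulSoup(html, "html.parser")
# get_text() replaces the removed nltk.clean_html()
raw = soup.get_text()
#print(nltk.word_tokenize(raw))
tokens = nltk.word_tokenize(raw)
tokens = tokens[96:399]
text = nltk.Text(tokens)
# concordance() prints its results itself and returns None
text.concordance('gene')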

And then it worked!
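As the title of the Stack Overflow question above suggests, get_text() on its own also returns the contents of script and style elements. One possible way around that, sketched here on a made-up HTML string instead of the live BBC page, is to remove those elements before extracting the text. The helper name `html_to_text` and the sample markup are assumptions for illustration only, and the sketch requires the bs4 package to be installed.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML document as one whitespace-normalized string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> elements so their contents don't end up in the text
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator=" " keeps text from adjacent elements apart; split/join collapses whitespace
    return " ".join(soup.get_text(separator=" ").split())


# Hypothetical sample page, not taken from the article used in the exercise
sample = (
    "<html><head><title>Sample page</title>"
    "<style>p {color: red;}</style><script>var x = 1;</script></head>"
    "<body><p>A study of the <b>gene</b> in question.</p></body></html>"
)
print(html_to_text(sample))
```

The returned string can be passed to nltk.word_tokenize() in place of raw in the program above. Note that the token offsets used there (96:399) would probably need adjusting, since less text is extracted.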

Python 3 is a headache, but a happy one.
