[Reading] NLTK Essentials: Building Machine Learning Applications with NLTK and Python Libraries (2)
2018-03-07 17:23
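Heading into NLTK = =, to get better at processing everyday information (P.S. the end goal is to control every operation through language; I guess that is the technology of the future)
-----------Enough talk-------------on to the examples-----------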
>>> import urllib
>>> import urllib.request
>>> response = urllib.request.urlopen('http://python.org/')
>>> html = response.read()
>>> print(len(html))
48851
>>>
-------------Sure enough, urllib.request in Python 3 takes the place of urllib2 in Python 2------------
Analyzing the body of the document
------------------------------------------------
Cleaning
------------------------------------------------
>>> tokens = [tok for tok in html.split()]
>>> print("Total no of tokens:"+ str(len(tokens)))
Total no of tokens:2936
>>> print(tokens[0:100])
[b'<!doctype', b'html>', b'<!--[if', b'lt', b'IE', b'7]>', b'<html', b'class="no-js', b'ie6', b'lt-ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'7]>', b'<html', b'class="no-js', b'ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'8]>', b'<html', b'class="no-js', b'ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'gt', b'IE', b'8]><!--><html', b'class="no-js"', b'lang="en"', b'dir="ltr">', b'<!--<![endif]-->', b'<head>', b'<meta', b'charset="utf-8">', b'<meta', b'http-equiv="X-UA-Compatible"', b'content="IE=edge">', b'<link', b'rel="prefetch"', b'href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">', b'<meta', b'name="application-name"', b'content="Python.org">', b'<meta', b'name="msapplication-tooltip"', b'content="The', b'official', b'home', b'of', b'the', b'Python', b'Programming', b'Language">', b'<meta', b'name="apple-mobile-web-app-title"', b'content="Python.org">', b'<meta', b'name="apple-mobile-web-app-capable"', b'content="yes">', b'<meta', b'name="apple-mobile-web-app-status-bar-style"', b'content="black">', b'<meta', b'name="viewport"', b'content="width=device-width,', b'initial-scale=1.0">', b'<meta', b'name="HandheldFriendly"', b'content="True">', b'<meta', b'name="format-detection"', b'content="telephone=no">', b'<meta', b'http-equiv="cleartype"', b'content="on">', b'<meta', b'http-equiv="imagetoolbar"', b'content="false">', b'<script', b'src="/static/js/libs/modernizr.js"></script>', b'<link', b'href="/static/stylesheets/style.css"', b'rel="stylesheet"', b'type="text/css"', b'title="default"', b'/>', b'<link', b'href="/static/stylesheets/mq.css"', b'rel="stylesheet"', b'type="text/css"', b'media="not', b'print,', b'braille,']
>>>
-----------------Simplifying--------------
>>> tokens = re.split('\W+',html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\d\AppData\Local\Programs\Python\Python36\lib\re.py", line 212, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: cannot use a string pattern on a bytes-like object
>>>
This doesn't work = = (html is bytes, but the pattern is a str)--------------------------ignoring it for now---------------
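The TypeError above comes from mixing types: `response.read()` returns bytes, while `'\W+'` is a str pattern. A minimal sketch of two ways around it follows. The short byte string is a hypothetical stand-in for the downloaded page, and UTF-8 is assumed as the page encoding.

```python
import re

# Hypothetical stand-in for the bytes returned by response.read()
html = b'<title>Welcome to Python.org</title>'

# Option 1: decode the bytes to str first (UTF-8 is an assumption),
# then split with a str pattern; the raw string keeps \W intact
text = html.decode('utf-8')
tokens = [tok for tok in re.split(r'\W+', text) if tok]

# Option 2: keep the bytes and split with a bytes pattern instead
byte_tokens = [tok for tok in re.split(rb'\W+', html) if tok]

print(tokens)
print(byte_tokens)
```

Both variants filter out the empty strings that `re.split` produces when the input starts or ends with a separator. Tag names such as `title` still show up as tokens, so this only simplifies the splitting; it does not clean the HTML.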
The book actually points out that NLTK has dropped its HTML-cleaning function (clean_html) = =, so I got fooled this one time = =
---------------------
>>> clean = nltk.clean_html(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\oil\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\util.py", line 356, in clean_html
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function
>>>
--------------------------------------
Let's do it with BeautifulSoup instead = =
Turns out I never installed BeautifulSoup = =
$ pip install beautifulsoup4
--------------------------------------
>>> import urllib.request
>>> import urllib
>>> response = urllib.request.urlopen('http://python.org/')
>>> html = response.read()
>>> import BeautifulSoup4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'BeautifulSoup4'
>>> import bs4
>>> soup = BeautifulSoup(html,"lxml")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'BeautifulSoup' is not defined
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html,"lxml")
>>> clean = soup.get_text()
>>> tokens = [tok for tok in clean.split()]
>>> print(tokens[:100])
['Welcome', 'to', 'Python.org', '{', '"@context":', '"http://schema.org",', '"@type":', '"WebSite",', '"url":', '"https://www.python.org/",', '"potentialAction":', '{', '"@type":', '"SearchAction",', '"target":', '"https://www.python.org/search/?q={search_term_string}",', '"query-input":', '"required', 'name=search_term_string"', '}', '}', 'var', '_gaq', '=', '_gaq', '||', '[];', "_gaq.push(['_setAccount',", "'UA-39055973-1']);", "_gaq.push(['_trackPageview']);", '(function()', '{', 'var', 'ga', '=', "document.createElement('script');", 'ga.type', '=', "'text/javascript';", 'ga.async', '=', 'true;', 'ga.src', '=', "('https:'", '==', 'document.location.protocol', '?', "'https://ssl'", ':', "'http://www')", '+', "'.google-analytics.com/ga.js';", 'var', 's', '=', "document.getElementsByTagName('script')[0];", 's.parentNode.insertBefore(ga,', 's);', '})();', 'Notice:', 'While', 'Javascript', 'is', 'not', 'essential', 'for', 'this', 'website,', 'your', 'interaction', 'with', 'the', 'content', 'will', 'be', 'limited.', 'Please', 'turn', 'Javascript', 'on', 'for', 'the', 'full', 'experience.', 'Skip', 'to', 'content', '▼', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', ' ▲', 'The', 'Python', 'Network']
>>>
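The token list above still carries the page's JSON-LD and Google Analytics JavaScript, because in this run get_text() also returned the text inside `<script>` tags. A sketch of one way to drop such elements before extracting text follows. It runs on a small hypothetical HTML string rather than the live page, and it uses Python's built-in "html.parser" backend so that lxml does not have to be installed. Newer BeautifulSoup releases may already skip script and style text, in which case the explicit removal is harmless.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the downloaded html
html = b"""<html><head><title>Welcome to Python.org</title>
<script>var _gaq = _gaq || [];</script>
<style>body { color: black; }</style></head>
<body><p>The Python Network</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Remove <script> and <style> elements so their contents
# do not end up in the extracted text
for tag in soup(["script", "style"]):
    tag.decompose()

clean = soup.get_text(separator=" ")
tokens = clean.split()
print(tokens)
```

Only the human-readable words remain, which is closer to what a frequency count over the page should see.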
-------------------Another error below------------
>>> for tok in tokens:
...     if tok in freq_dis:
...         freq_dis[tok]+=1
...     else:
...         freq_dis[tok]=1
... sorted_freq_dist= sorted(freq_dis.items(), key = operator.itemgetter(1),reverse=Ture)
  File "<stdin>", line 6
    sorted_freq_dist= sorted(freq_dis.items(), key = operator.itemgetter(1),reverse=Ture)
^
SyntaxError: invalid syntax
>>> sorted_freq_dist= sorted(freq_dis.items(), key = operator.itemgetter(1),reverse=Ture)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'Ture' is not defined
>>> print(sorted_freq_dist[:25])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sorted_freq_dist' is not defined
>>>
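Two separate mistakes show up in the transcript above. At the interactive prompt an indented block has to be closed with an empty line before the next statement is typed, which explains the SyntaxError, and `True` was misspelled as `Ture`, which explains the NameError. A corrected sketch of the hand-rolled frequency count follows, run on a short hypothetical token list and assuming `freq_dis` starts out as an empty dict.

```python
import operator

# Hypothetical token list standing in for the tokens extracted from the page
tokens = ['Welcome', 'to', 'Python', 'to', 'the', 'Python', 'to']

# Count how often each token occurs
freq_dis = {}
for tok in tokens:
    if tok in freq_dis:
        freq_dis[tok] += 1
    else:
        freq_dis[tok] = 1

# Sort (token, count) pairs by count, highest first; note the spelling of True
sorted_freq_dist = sorted(freq_dis.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_freq_dist[:25])
```

When this is typed at the `>>>` prompt rather than saved to a file, an empty line is needed after `freq_dis[tok] = 1` to end the loop.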
--------------This is exhausting-----------------------
Next, NLTK's own method makes up for the failure above
>>> import nltk
>>> Freq_dist_nltk = nltk.FreqDist(tokens)
>>> print(Freq_dist_nltk)
<FreqDist with 615 samples and 1125 outcomes>
>>>
>>> for k,v in Freq_dist_nltk.items():
...     print(str(k)+':'+str(v))
...
Welcome:1
to:18
Python.org:1
{:3
"@context"::1
"http://schema.org",:1
"@type"::2
"WebSite",:1
"url"::1
"https://www.python.org/",:1
"potentialAction"::1
"SearchAction",:1
"target"::1
"https://www.python.org/search/?q={search_term_string}",:1
"query-input"::1
"required:1
name=search_term_string":1
}:2
var:3
_gaq:2
=:14
||:2
[];:1
_gaq.push(['_setAccount',:1
'UA-39055973-1']);:1
_gaq.push(['_trackPageview']);:1
(function():1
ga:1
document.createElement('script');:1
ga.type:1
'text/javascript';:1
ga.async:1
……………………
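The listing above appears to follow the order in which tokens were first seen, not their frequency. A sketch of how the most frequent tokens can be listed instead uses `FreqDist.most_common()`. The token list is a short hypothetical one; `N()` and `B()` correspond to the "outcomes" and "samples" figures printed for the FreqDist earlier.

```python
import nltk

# Hypothetical token list standing in for the tokens extracted from the page
tokens = ['Welcome', 'to', 'Python', 'to', 'the', 'Python', 'to']
freq_dist = nltk.FreqDist(tokens)

print(freq_dist.N())  # total number of tokens counted (outcomes)
print(freq_dist.B())  # number of distinct tokens (samples)

# Print the two most frequent tokens, highest count first
for token, count in freq_dist.most_common(2):
    print(token + ':' + str(count))
```

This gives the sorted view that the hand-written `sorted(...)` call was meant to produce.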
----------------------------
Visualization
The code is as follows:
>>> Freq_dist_nltk.plot(50,cumulative=False)
That is the only line needed
----------------------------------
Next, further processing with a stop word list: # failed, I don't have the stop word txt file
>>> stopwords = [word.strip().lower() for word in open("PATH/english.stop.text")]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'PATH/english.stop.text'
>>> clean_tokens = [tok for tok in tokens if len(tok.lower())>1 and (tok.lower()not in stopwords)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
NameError: name 'stopwords' is not defined
>>> Freq_list_nltk = nltk.FreqDist(clean_tokens)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'clean_tokens' is not defined
>>> Freq_dist_nltk.plot(50, cumulative = False)
>>>
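The three errors above all trace back to the first one: `PATH/english.stop.text` does not exist on this machine, so `stopwords` and then `clean_tokens` were never defined, and the final plot still shows the unfiltered distribution. A minimal sketch of the filtering step follows, with a tiny inline stop word set and a short token list, both hypothetical and only there so the example runs without any external file.

```python
# Hypothetical, deliberately tiny stop word list; a real list would be much longer
stopwords = {'to', 'the', 'is', 'for', 'of', 'and'}

# Hypothetical token list standing in for the tokens extracted from the page
tokens = ['Welcome', 'to', 'Python', 'The', 'Python', 'Network', 'is', '=', 's']

# Keep tokens longer than one character whose lowercase form is not a stop word
clean_tokens = [tok for tok in tokens
                if len(tok) > 1 and tok.lower() not in stopwords]

print(clean_tokens)
```

The filtered list can then be passed to `nltk.FreqDist` exactly as in the transcript. NLTK also ships its own English stop word list, available as `nltk.corpus.stopwords.words('english')` once `nltk.download('stopwords')` has been run, which avoids the missing text file altogether.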
-----------------------------
2018-03-07 17:23:27: that wraps up Chapter 1; what comes next goes into more detail