
[Reading Notes] NLTK Basics Tutorial: Building Machine Learning Applications with NLTK and Python Libraries (2)

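2018-03-07 17:23
Heading toward NLTK = =, to get better at processing everyday information. (PS: eventually I want to control every operation with language; I guess that is the technology of the future.)
-----------enough chatter-------------on to the examples-----------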
>>> import urllib
>>> import urllib.request
>>> response = urllib.request.urlopen('http://python.org/')
>>> html = response.read()
>>> print(len(html))
48851

>>>
-------------as expected, urllib.request in Python 3 is what urllib2 was in Python 2------------
Analyzing the body of the document
------------------------------------------------
Cleaning
------------------------------------------------
>>> tokens = [tok for tok in html.split()]
>>> print("Total no of tokens:"+ str(len(tokens)))
Total no of tokens:2936

>>> print(tokens[0:100])
[b'<!doctype', b'html>', b'<!--[if', b'lt', b'IE', b'7]>', b'<html', b'class="no-js', b'ie6', b'lt-ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'7]>', b'<html', b'class="no-js', b'ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'8]>', b'<html', b'class="no-js', b'ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'gt', b'IE', b'8]><!--><html', b'class="no-js"', b'lang="en"', b'dir="ltr">', b'<!--<![endif]-->', b'<head>', b'<meta', b'charset="utf-8">', b'<meta', b'http-equiv="X-UA-Compatible"', b'content="IE=edge">', b'<link', b'rel="prefetch"', b'href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">', b'<meta', b'name="application-name"', b'content="Python.org">', b'<meta', b'name="msapplication-tooltip"', b'content="The', b'official', b'home', b'of', b'the', b'Python', b'Programming', b'Language">', b'<meta', b'name="apple-mobile-web-app-title"', b'content="Python.org">', b'<meta', b'name="apple-mobile-web-app-capable"', b'content="yes">', b'<meta', b'name="viewport"', b'content="width=device-width,', b'initial-scale=1.0">', b'<meta', b'name="HandheldFriendly"', b'content="True">', b'<meta', b'name="format-detection"', b'content="telephone=no">', b'<meta', b'http-equiv="cleartype"', b'content="on">', b'<meta', b'http-equiv="imagetoolbar"', b'content="false">', b'<script', b'src="/static/js/libs/modernizr.js"></script>', b'<link', b'href="/static/stylesheets/style.css"', b'rel="stylesheet"', b'type="text/css"', b'title="default"', b'/>', b'<link', b'href="/static/stylesheets/mq.css"', b'rel="stylesheet"', b'type="text/css"', b'media="not', b'print,', b'braille,']

>>>
-----------------simplifying--------------
>>> tokens = re.split('\W+',html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\d\AppData\Local\Programs\Python\Python36\lib\re.py", line 212, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: cannot use a string pattern on a bytes-like object

>>>
Couldn't get it to work = =--------------------------ignoring it for now---------------
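The TypeError above comes from passing a str pattern to re.split while html is still the bytes object returned by response.read(). A minimal sketch of one way around it, assuming the page is UTF-8 encoded (the sample bytes below are made up for illustration, not the real python.org response):

```python
import re

# Illustrative stand-in for the bytes returned by response.read().
html = b'<html><head><title>Welcome to Python.org</title></head></html>'

# Decode to str first. UTF-8 is an assumption; check the page's charset.
text = html.decode('utf-8')

# A raw string keeps \W from being treated as a string escape.
# re.split leaves empty strings at the edges, so filter them out.
tokens = [tok for tok in re.split(r'\W+', text) if tok]
print(tokens)
```

Another option would be to keep the bytes and use a bytes pattern, re.split(rb'\W+', html), which yields bytes tokens instead.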
To be fair, the book does stress that NLTK has dropped its HTML-cleaning function = =, so I got fooled this one time = =
---------------------
>>> clean = nltk.clean_html(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\oil\AppData\Local\Programs\Python\Python36\lib\site-packages\nltk\util.py", line 356, in clean_html
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

>>>
--------------------------------------
So let's do it with BeautifulSoup instead = =
Turns out I don't have BeautifulSoup installed = =
$ pip install beautifulsoup4

--------------------------------------
>>> import urllib.request
>>> import urllib
>>> response = urllib.request.urlopen('http://python.org/')
>>> html = response.read()

>>> import BeautifulSoup4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'BeautifulSoup4'
>>> import bs4
>>> soup = BeautifulSoup(html,"lxml")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'BeautifulSoup' is not defined
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html,"lxml")
>>> clean = soup.get_text()
>>> tokens = [tok for tok in clean.split()]
>>> print(tokens[:100])
['Welcome', 'to', 'Python.org', '{', '"@context":', '"http://schema.org",', '"@type":', '"WebSite",', '"url":', '"https://www.python.org/",', '"potentialAction":', '{', '"@type":', '"SearchAction",', '"target":', '"https://www.python.org/search/?q={search_term_string}",', '"query-input":', '"required', 'name=search_term_string"', '}', '}', 'var', '_gaq', '=', '_gaq', '||', '[];', "_gaq.push(['_setAccount',", "'UA-39055973-1']);", "_gaq.push(['_trackPageview']);", '(function()', '{', 'var', 'ga', '=', "document.createElement('script');", 'ga.type', '=', "'text/javascript';", 'ga.async', '=', 'true;', 'ga.src', '=', "('https:'", '==', 'document.location.protocol', '?', "'https://ssl'", ':', "'http://www')", '+', "'.google-analytics.com/ga.js';", 'var', 's', '=', "document.getElementsByTagName('script')[0];", 's.parentNode.insertBefore(ga,', 's);', '})();', 'Notice:', 'While', 'Javascript', 'is', 'not', 'essential', 'for', 'this', 'website,', 'your', 'interaction', 'with', 'the', 'content', 'will', 'be', 'limited.', 'Please', 'turn', 'Javascript', 'on', 'for', 'the', 'full', 'experience.', 'Skip', 'to', 'content', '▼', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', ' ▲', 'The', 'Python', 'Network']

>>>
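The tokens above still contain the page's JSON-LD and JavaScript, because the text inside script tags is kept. Here is a rough sketch of skipping script and style content, written with the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs. TextExtractor, visible_tokens and the sample markup are hypothetical names and data, not from the book. With BeautifulSoup, a common approach is to remove those tags before calling get_text().

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if not self._skip_depth:
            self.parts.append(data)


def visible_tokens(markup):
    """Return whitespace-separated tokens of the visible text."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return ' '.join(parser.parts).split()


# Made-up sample markup, loosely modelled on the output above.
sample = ('<html><head><title>Welcome to Python.org</title>'
          '<script>var _gaq = _gaq || [];</script></head>'
          '<body><p>The Python Network</p></body></html>')
print(visible_tokens(sample))
```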
-------------------and another error below------------
>>> for tok in tokens:
...     if tok in freq_dis:
...         freq_dis[tok]+=1
...     else:
...         freq_dis[tok]=1
... sorted_freq_dist= sorted(freq_dis.items(), key = operator.itemgetter(1),reverse=Ture)
  File "<stdin>", line 6
    sorted_freq_dist= sorted(freq_dis.items(), key = operator.itemgetter(1),reverse=Ture)
                   ^
SyntaxError: invalid syntax
>>> sorted_freq_dist= sorted(freq_dis.items(), key = operator.itemgetter(1),reverse=Ture)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'Ture' is not defined
>>> print(sorted_freq_dist[:25])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sorted_freq_dist' is not defined

>>>
--------------this is exhausting-----------------------
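Looking back at the transcript, the failures seem to have three causes: the for loop was not closed with a blank line before the next statement was typed (hence the SyntaxError), True was misspelled as Ture, and the transcript never shows freq_dis being created or operator being imported. A corrected sketch, using a small made-up token list in place of the real page tokens:

```python
import operator

# Illustrative sample; in the session this would be the BeautifulSoup tokens.
tokens = ['to', 'Python', 'to', 'the', 'Python', 'to']

# The dict has to exist before the loop can update it.
freq_dis = {}
for tok in tokens:
    if tok in freq_dis:
        freq_dis[tok] += 1
    else:
        freq_dis[tok] = 1

# Sort (token, count) pairs by count, highest first. Note: True, not Ture.
sorted_freq_dist = sorted(freq_dis.items(),
                          key=operator.itemgetter(1), reverse=True)
print(sorted_freq_dist[:25])
```

In the interactive interpreter, press Enter on an empty line after the loop body before typing the sorted(...) line.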
Next, let's make up for the failure above by using NLTK's own method
>>> import nltk
>>> Freq_dist_nltk = nltk.FreqDist(tokens)
>>> print(Freq_dist_nltk)
<FreqDist with 615 samples and 1125 outcomes>

>>>
>>> for k,v in Freq_dist_nltk.items():
...     print(str(k)+':'+str(v))
...
Welcome:1
to:18
Python.org:1
{:3
"@context"::1
"http://schema.org",:1
"@type"::2
"WebSite",:1
"url"::1
"https://www.python.org/",:1
"potentialAction"::1
"SearchAction",:1
"target"::1
"https://www.python.org/search/?q={search_term_string}",:1
"query-input"::1
"required:1
name=search_term_string":1
}:2
var:3
_gaq:2
=:14
||:2
[];:1
_gaq.push(['_setAccount',:1
'UA-39055973-1']);:1
_gaq.push(['_trackPageview']);:1
(function():1
ga:1
document.createElement('script');:1
ga.type:1
'text/javascript';:1
ga.async:1

……………………
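Printing every item gets long quickly. As far as I know, nltk.FreqDist in NLTK 3 is a subclass of collections.Counter, so the usual Counter methods such as most_common() should work on it too. The sketch below uses a plain Counter and a made-up token list so that it runs without NLTK installed:

```python
from collections import Counter

# Illustrative tokens, not the real page content.
tokens = ['Welcome', 'to', 'Python.org', 'to', '=', 'to', '=']

freq = Counter(tokens)

# Distinct tokens and total token count, like the "samples" and "outcomes"
# reported when a FreqDist is printed.
print(len(freq), sum(freq.values()))

# Only the two most frequent tokens, in the same key:value style as above.
for tok, count in freq.most_common(2):
    print(tok + ':' + str(count))
```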
----------------------------
Time for visualization



The code is as follows:
>>> Freq_dist_nltk.plot(50,cumulative=False)

Just this one line
----------------------------------
Then process the tokens further by filtering out stop words: # failed, I don't have the stop-word txt file
>>> stopwords = [word.strip().lower() for word in open("PATH/english.stop.text")]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'PATH/english.stop.text'
>>> clean_tokens = [tok for tok in tokens if len(tok.lower())>1 and (tok.lower()not in stopwords)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
NameError: name 'stopwords' is not defined
>>> Freq_list_nltk = nltk.FreqDist(clean_tokens)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'clean_tokens' is not defined
>>> Freq_dist_nltk.plot(50, cumulative = False)

>>>
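The filtering step only needs some collection of stop words, not that particular file. A minimal sketch with a tiny hand-written stop-word set and made-up tokens, both purely illustrative; NLTK also ships an English list via nltk.corpus.stopwords.words('english'), once nltk.download('stopwords') has been run:

```python
from collections import Counter

# Tiny hand-written stop-word set, for illustration only.
stopwords = {'to', 'the', 'is', 'for', 'on', 'of',
             'your', 'with', 'will', 'be', 'this', 'not'}

# Illustrative tokens, loosely modelled on the page tokens above.
tokens = ['Welcome', 'to', 'Python.org', '{', 'The',
          'Python', 'Network', 'is', '=', 'Python']

# Drop single-character tokens and stop words (case-insensitive).
clean_tokens = [tok for tok in tokens
                if len(tok) > 1 and tok.lower() not in stopwords]

freq = Counter(clean_tokens)
print(clean_tokens)
print(freq.most_common(1))
```

Note that the last plot call in the transcript still uses Freq_dist_nltk, the unfiltered distribution. To see the effect of the filter, the plot would have to be made from the distribution built from clean_tokens.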
-----------------------------
2018-03-07 17:23:27. That wraps up Chapter 1; what comes next goes into more detail.