
Python Crawler Advanced (12): Automatic Summarization and Main-Content Extraction


I. Text Length Analysis

1. Line breaks in HTML

In the HTML source, line breaks are produced by inline elements, block-level elements, and the <br> tag.

Article bodies are typically wrapped in a large number of <p> tags.
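As a rough, minimal sketch (the HTML fragment below is invented, not taken from any real page), stripping the tags shows that lines coming from <p> body paragraphs retain far more text than lines coming from navigation or layout markup:

import re

# Invented fragment: navigation markup on one line, body paragraphs in <p> tags
html = (
    '<div class="nav"><a href="/">Home</a><a href="/news">News</a></div>\n'
    '<p>This paragraph stands in for a long run of article text that would '
    'normally span dozens or hundreds of characters.</p>\n'
    '<p>Body paragraphs like this one dominate the character count of their lines.</p>\n'
)

# Same tag-stripping regex that is used later in this article
reg = re.compile("<[^>]*>")
for line in html.split('\n'):
    text = reg.sub('', line)
    print(len(text), repr(text[:40]))

The navigation line collapses to a handful of characters ("HomeNews"), while the paragraph lines keep almost all of theirs.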

2. Removing JavaScript and CSS

Using lxml's Cleaner class, the CSS and scripts contained in the HTML can be removed:

>>> from lxml.html import clean
>>> cleaner = clean.Cleaner(style=True, scripts=True, comments=True, javascript=True, page_structure=False, safe_attrs_only=False)
>>> content = cleaner.clean_html(content.decode('utf-8')).encode('utf-8')
style = True        removes CSS

scripts = True      removes scripts and plugins

comments = True     removes comments

javascript = True   removes JavaScript

3. Removing all HTML tags

Using the regular expression below, we strip out all HTML tags and their attributes, leaving only the text:

reg = re.compile("<[^>]*>")
content = reg.sub('', content)

f = open('cleaned.txt', 'wb+')
f.write(content)
f.close()



4. Analysis based on text length

From the steps above we obtain a file in which each line contains text only, so we can count the number of characters on every line. In general, body-text lines are long and clustered together, as shown below:

Code to fetch the page:

HtmlRetrival.py

import gzip
import re
import urllib.request

from io import BytesIO


class HtmlRetrival:

    dir_name = 'files'

    def __init__(self, url):
        self.url = url

    def get_content(self):
        request_headers = {
            'connection': "keep-alive",
            'cache-control': "no-cache",
            'upgrade-insecure-requests': "1",
            'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36",
            'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            'accept-language': "zh-CN,en-US;q=0.8,en;q=0.6",
            'accept-charset': 'utf-8'
        }

        # Cache pages locally under files/<last path segment of the URL>
        filename = self.dir_name + '/' + re.findall('/([^/]+)$', self.url)[0]

        try:
            # Reuse the cached copy if it exists
            f = open(filename, 'r+', encoding='utf8')
            content = f.read()
            f.close()
        except Exception:
            req = urllib.request.Request(self.url, headers=request_headers)
            response = urllib.request.urlopen(req)
            if response.info().get('Content-Encoding') == 'gzip':
                # gzip-compressed responses have to be decompressed from raw bytes
                buf = BytesIO(response.read())
                fzip = gzip.GzipFile(fileobj=buf)
                content = fzip.read()
            else:
                content = response.read()

            content = content.decode('utf8')
            f = open(filename, 'w+', encoding='utf8')
            f.write(content)
            f.close()

        return content

extract_demo1.py

# -*- coding: utf-8 -*-

import re
from HtmlRetrival import HtmlRetrival
from lxml.html import clean
import pylab

html_retrieval = HtmlRetrival('http://news.sina.com.cn/w/zx/2017-03-25/doc-ifycstww1059968.shtml')
content = html_retrieval.get_content()

# Strip CSS, scripts and comments while keeping the page structure
cleaner = clean.Cleaner(style=True, scripts=True, comments=True, javascript=True, page_structure=False, safe_attrs_only=False)
content = cleaner.clean_html(content)

# Remove all remaining HTML tags
reg = re.compile("<[^>]*>")
content = reg.sub('', content)

f = open('cleaned.txt', 'w+', encoding='utf8')
f.write(content)
f.close()

# Count the characters on each line and plot the distribution
lines = content.split('\n')
indexes = range(0, len(lines))
counts = []
for line in lines:
    counts.append(len(line))

pylab.plot(indexes, counts, linewidth=1.0)
pylab.savefig('word_count.png')
pylab.show()

5. Text-tag ratio (len(text)/len(tag))

Non-body region (screenshot omitted)

Body region (screenshot omitted)

Hence the text-tag ratio of body regions is generally larger. Moreover, because body regions have few tags and many characters while non-body regions have many tags and few characters, this processing further reduces noise.

Computing the text-tag ratio:

# text only re
reg = re.compile("<[^>]*>")

lines = content.split('\n')
cleaned_lines = []
counts = []
tags = []
ratios = []

for line in lines:
    # get tags count
    tag = len(re.findall("<[^>]*>", line))
    # get text from a html line
    line = reg.sub('', line)
    cleaned_lines.append(line)
    # avoid division by zero for lines with no tags
    if tag == 0:
        tag = 1
    counts.append(len(line))
    tags.append(tag)
    ratios.append(len(line) / tag)


6. K-Means

We covered the k-means clustering algorithm earlier; refer to that article for details.

For the Python implementation: Python already has a library that includes this algorithm (scikit-learn), so it can be called directly.

# Cluster the features and predict which cluster each element falls in, with k = 2
kmeans = KMeans(k).fit(feature)
labels = kmeans.predict(feature)
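As a minimal, self-contained illustration (the ratio values below are invented), the scikit-learn calls above can be exercised on a toy 1-D feature like this; note that KMeans expects a 2-D array of shape (n_samples, n_features):

import numpy as np
from sklearn.cluster import KMeans

# Invented text-tag ratios: mostly short noisy lines plus a run of long body lines
ratios = [0.5, 1.0, 0.8, 42.0, 55.0, 48.0, 0.3, 0.9]
feature = np.array(ratios).reshape(-1, 1)   # shape (n_samples, 1)

kmeans = KMeans(n_clusters=2).fit(feature)
labels = kmeans.predict(feature)

print(labels)                    # e.g. [0 0 0 1 1 1 0 0] -- the body lines form their own cluster
print(kmeans.cluster_centers_)   # one center near the noise level, one near the body level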

Advantages of the algorithm: it is simple, fast, and easy to apply to large amounts of data.

Disadvantages of the algorithm: the number of clusters k has to be chosen in advance, and the result is sensitive to initialization and to outliers.

Smoothing the data

In practice, consider a travel journal whose main content is a mix of images and text: genuine text lines can then easily be filtered out as non-body content. So the data should be smoothed first.

The smoothing formula is e_k = (TTRArray[k-r] + ... + TTRArray[k+r]) / (2r + 1), where e_k is the smoothed value at position k, r is the number of neighbouring lines included on each side, and TTRArray is the list of ratios.

After smoothing, large values become smaller and small values become larger.

for k in range(r, len(ratios) - r):
    ratio_smoth.append(sum(ratios[k-r:k+r+1]) / (2*r + 1))
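A quick check of the effect on invented numbers (not taken from a real page): an isolated spike is pulled down toward its neighbours, so a single caption or image line is less likely to split a run of body text:

# Invented ratios: one isolated spike in a run of low values
ratios = [1, 1, 30, 1, 1, 1, 1]
r = 2
smoothed = []
for k in range(r, len(ratios) - r):
    smoothed.append(sum(ratios[k-r:k+r+1]) / (2*r + 1))

print(smoothed)   # [6.8, 6.8, 6.8] -- the spike at 30 is flattened toward its neighbours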

Cluster-center means
When the distribution is more even the standard deviation shrinks; when it is uneven the standard deviation grows. The mean of the body-text cluster is expected to be higher than the standard deviation.

for k in range(1, 4):

    # Cluster the features and predict which cluster each element falls in
    kmeans = KMeans(k).fit(feature)
    labels = kmeans.predict(feature)

    # Find the cluster centers
    centers = kmeans.cluster_centers_

    print(centers)

    # Group line indexes by the cluster they were assigned to
    clusters = {}
    n = 0
    for item in labels:
        if item in clusters:
            clusters[item].append(n)
        else:
            clusters[item] = [n]
        n += 1

    index = 0
    for i in centers:
        if i[0] > std:
            result_list += clusters[index]
        index += 1

Full code:

# -*- coding: utf-8 -*-
import gzip
import re

import urllib.request

from io import BytesIO
from lxml.html import clean
from sklearn.cluster import KMeans
import numpy as np

request_headers = {
    'connection': "keep-alive",
    'cache-control': "no-cache",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    'accept-language': "zh-CN,en-US;q=0.8,en;q=0.6",
    'accept-charset': 'utf-8'
}

url = 'http://bbs.qyer.com/thread-2571140-1.html'
# url = 'http://news.sina.com.cn/c/gat/2017-03-13/doc-ifychhuq4322931.shtml'
# url = 'http://www.mafengwo.cn/i/6161801.html'

filename = re.findall('/([^/]+)$', url)[0]

try:
    # reuse the cached raw page if it exists
    f = open(filename, 'rb')
    content = f.read()
    f.close()
except Exception:
    req = urllib.request.Request(url, headers=request_headers)
    response = urllib.request.urlopen(req)
    if response.info().get('Content-Encoding') == 'gzip':
        # gzip-compressed responses have to be decompressed from raw bytes
        buf = BytesIO(response.read())
        fzip = gzip.GzipFile(fileobj=buf)
        content = fzip.read()
    else:
        content = response.read()

    f = open(filename, 'wb+')
    f.write(content)
    f.close()

# remove all javascript and css
cleaner = clean.Cleaner(style=True, scripts=True, comments=True, javascript=True, page_structure=False, safe_attrs_only=False)
content = cleaner.clean_html(content.decode('utf-8'))

# text only re
reg = re.compile("<[^>]*>")

lines = content.split('\n')
cleaned_lines = []
counts = []
tags = []
ratios = []

for line in lines:
    # get tags count
    tag = len(re.findall("<[^>]*>", line))
    # get text from a html line
    line = reg.sub('', line)
    cleaned_lines.append(line)
    # avoid division by zero for lines with no tags
    if tag == 0:
        tag = 1
    counts.append(len(line))
    tags.append(tag)
    ratios.append(len(line) / tag)

# smooth the ratio: average over the nearest 5 lines to suppress noise
r = 2
ratio_smoth = [0, 0]
for k in range(r, len(ratios) - r):
    ratio_smoth.append(sum(ratios[k-r:k+r+1]) / (2*r + 1))

# Convert to a 2-D array.
# reshape turns an array into an m * n array;
# when m is -1, the number of rows is inferred from the total size and the column count.
feature = np.array(ratio_smoth).reshape(-1, 1)

result_list = []

# mean and standard deviation of the smoothed ratios
mean = np.mean(ratio_smoth)
std = np.std(ratio_smoth)

print(mean)
print(std)

for k in range(1, 4):

    # cluster the features and predict which cluster each element falls in
    kmeans = KMeans(k).fit(feature)
    labels = kmeans.predict(feature)

    # find the cluster centers
    centers = kmeans.cluster_centers_

    print(centers)

    # group line indexes by the cluster they were assigned to
    clusters = {}
    n = 0
    for item in labels:
        if item in clusters:
            clusters[item].append(n)
        else:
            clusters[item] = [n]
        n += 1

    index = 0
    for i in centers:
        if i[0] > std:
            result_list += clusters[index]
        index += 1

result_set = sorted(set(result_list))

f = open('test.txt', 'w+', encoding='utf8')
index = 0

print(result_set)

# lines in clusters whose center exceeds the standard deviation are treated as body text
for i in result_set:
    f.write(cleaned_lines[i])

f.close()


Still, if the body text itself is irregular or sparse, no method of this kind can recover all of it. For example, a person's microblog posts vary wildly in length: how should those be extracted?

II. Tag Templates

1. Generic templates and configuration

# Define the XPath selectors for the title and the body content
tags = {
    'title': '//h3[@class="b_tle"]',
    'content': '//td[@class="editor bbsDetailContainer"]//*[self::p or self::span or self::h1]'
}


For a specific kind of website we can quickly find out which tags it uses, and configure those selectors as a template.
Typically the body is wrapped in <p>, <span> and <h1> tags; the main work is locating the body container, which has to be done by hand.

XPath supports "or", so we can use

self::p or self::span or self::h1

to merge body text that is marked up with several different tags.
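A minimal sketch of how this "or" selector behaves (the HTML fragment and class name below are invented, not taken from the target site):

from lxml import etree

# Invented fragment: body text spread across <h1>, <p> and <span> inside one container
html = '''
<div class="article">
  <h1>Title line</h1>
  <p>First paragraph.</p>
  <span>A stray sentence in a span.</span>
  <ul><li>Navigation item that should be ignored</li></ul>
</div>
'''

tree = etree.HTML(html)
nodes = tree.xpath('//div[@class="article"]//*[self::p or self::span or self::h1]')
print([node.text for node in nodes])
# ['Title line', 'First paragraph.', 'A stray sentence in a span.']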

2. Comparing usage scenarios

text/tag ratio:

Large-scale crawling of sites for which no template exists.

Advantages: no rules needed, broadly applicable.

Disadvantages: lower accuracy.

Templates:

Targeted crawling; worth using for the core websites you care about.

Advantages: high accuracy, good quality, fast.

Disadvantages: only works for the specific sites it was written for, and the tags must be configured by hand.

Code:

# -*- coding: utf-8 -*-

from HtmlRetrival import HtmlRetrival
from lxml import etree

html_re = HtmlRetrival('http://bbs.qyer.com/thread-2631045-1.html')
content = html_re.get_content()

# Define the XPath selectors for the title and the body content
tags = {
    'title': '//h3[@class="b_tle"]',
    'content': '//td[@class="editor bbsDetailContainer"]//*[self::p or self::span or self::h1]'
}

tr = etree.HTML(content)
info = {}

f = open('template.txt', 'w', encoding='utf8')

for tag in tags:
    info[tag] = []
    f.write('\r\n\r\n' + tag + '\r\n\r\n')
    eles = tr.xpath(tags[tag])
    for ele in eles:
        # skip elements with no direct text
        if ele is None or ele.text is None:
            continue
        info[tag].append(ele.text)
        f.write(ele.text + '\r\n')

f.close()

III. PyGoose

https://github.com/grangier/python-goose

Installation:

git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install

Note that python-goose appears to support only Python 2 at the moment.

Usage tutorial

Code:

# -*- coding: utf-8 -*-

from HtmlRetrival import HtmlRetrival

from goose import Goose
from goose.text import StopWordsChinese  # import the Chinese language support

# tell Goose which stop-word class (i.e. which language) the site uses
g = Goose({'stopwords_class': StopWordsChinese})

url = 'http://bbs.qyer.com/thread-2571140-1.html'

html_retrieval = HtmlRetrival(url)
content = html_retrieval.get_content()

# pass in the HTML text; a URL can also be passed in directly
article = g.extract(raw_html=content)

f = open('pythongoose.txt', 'wb+')
f.write(article.cleaned_text.encode('utf-8'))
f.close()


Simple and brutally effective!