
Python crawler: extracting the first paragraph of a Baidu Baike entry and storing it in a database

2016-11-10 15:03

Reference: Python Web Crawler (1): URL Access and Parameter Settings (Python网络爬虫（1）–url访问及参数设置)

http://www.mamicode.com/info-detail-477628.html

1. URL access: simply call the urllib2 library function

import urllib2

url = 'http://www.baidu.com/'
# Open the URL and read the whole response body
response = urllib2.urlopen(url)
html = response.read()

print html

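2. Access with parameters, using Baidu search as an example

With Chrome's search engine set to Baidu, typing test in the address bar runs a Baidu search.

The address bar then shows that the Baidu search URL is https://www.baidu.com/s?ie=UTF-8&wd=test

Modify the code to add the request parameters:

# coding=utf-8
import urllib
import urllib2

# URL address
url = 'https://www.baidu.com/s'
# Parameters
values = {
    'ie': 'UTF-8',
    'wd': 'test'
}
# Encode the parameters
data = urllib.urlencode(values)
# Build the request from the URL and the encoded parameters
req = urllib2.Request(url, data)

# Send the request
response = urllib2.urlopen(req)
html = response.read()

print html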

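Running this code returns a page saying that the requested page does not exist. The cause is the request method: urllib2.Request(url, data) sends a POST request, so try GET instead by changing the code to:

# coding=utf-8
import urllib
import urllib2

# URL address
url = 'https://www.baidu.com/s'
# Parameters
values = {
    'ie': 'UTF-8',
    'wd': 'test'
}
# Encode the parameters
data = urllib.urlencode(values)
# Build the full URL by appending the query string (GET)
#req = urllib2.Request(url, data)
url = url + '?' + data

# Request the full URL
#response = urllib2.urlopen(req)
response = urllib2.urlopen(url)
html = response.read()

print html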
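
The POST-versus-GET behaviour described above can be checked offline. The sketch below assumes Python 3, where urllib2.Request is available as urllib.request.Request and urllib.urlencode as urllib.parse.urlencode; it only builds request objects and never sends them.

```python
from urllib.parse import urlencode
from urllib.request import Request

base = "https://www.baidu.com/s"
query = urlencode({"ie": "UTF-8", "wd": "test"})

# Supplying a request body (bytes in Python 3) makes the request a POST.
post_req = Request(base, data=query.encode("ascii"))

# Putting the parameters in the URL and omitting the body keeps it a GET.
get_req = Request(base + "?" + query)

print(post_req.get_method())
print(get_req.get_method())
print(get_req.full_url)
```

Checking get_method() before calling urlopen is a cheap way to confirm which kind of request is about to be sent.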

Running it again shows that the https URL is redirected, so switch to http:

# coding=utf-8
import urllib
import urllib2

# URL address
#url = 'https://www.baidu.com/s'
url = 'http://www.baidu.com/s'
# Parameters
values = {
    'ie': 'UTF-8',
    'wd': 'test'
}
# Encode the parameters
data = urllib.urlencode(values)
# Build the full URL by appending the query string (GET)
#req = urllib2.Request(url, data)
url = url + '?' + data

# Request the full URL
#response = urllib2.urlopen(req)
response = urllib2.urlopen(url)
html = response.read()

print html

Running it once more, the page is fetched normally.

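Fetching the Baidu Baike page for an entered keyword:

As the reference above shows, with the GET method the submitted data appears directly in the address bar, so the parameters are easy to identify and set:

values = {
    'ie': 'UTF-8',
    'wd': 'test'
}

With Baidu Baike, however, the parameters submitted with an entry do not appear in the address bar. So how do we know which parameters to send? The answer is in the page source.

In the source of the Baidu search page, locate the "百度一下" (search) button and, next to it, the input box. The GET parameter 'wd' is the name attribute of that input box. In the same way, finding the search box on the Baidu Baike page and reading its name attribute tells us which parameter to send.

The name attribute of the Baike search box is "word", so the parameters can be set as follows:

values = {
    'word': keyword  # keyword is the user's input
}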
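
Reading the name attribute off a search form can also be done in code. The following is a minimal sketch in Python 3 using only the standard library; the class name InputNameFinder and the SAMPLE markup are hypothetical and written for illustration, not copied from Baidu Baike's real page source.

```python
from html.parser import HTMLParser


class InputNameFinder(HTMLParser):
    """Collect the name attribute of every <input> found inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.form_method = None
        self.names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            # HTML forms default to GET when no method is given.
            self.form_method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self.in_form and attrs.get("name"):
            self.names.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Hypothetical markup for illustration only.
SAMPLE = """
<form action="/search/word" method="GET">
  <input type="text" name="word" autocomplete="off">
  <button type="submit">Search</button>
</form>
"""

finder = InputNameFinder()
finder.feed(SAMPLE)
print(finder.form_method, finder.names)
```

Each collected name is a key to put in the values dictionary before encoding it.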

Pressing F12 on the Baidu Baike page shows the Request URL, the Request Method, and so on.

Analyzing the Request URL gives url = 'http://baike.baidu.com/search/word?'

The Request Method is GET.

So the code that fetches the page for the entered term is:

import urllib
import urllib2

url = 'http://baike.baidu.com/search/word?'
keyword = raw_input("enter:")

values = {
    'word': keyword
}
data = urllib.urlencode(values)
url = url + data
response = urllib2.urlopen(url)
html = response.read()
# Save the page so it can be inspected in a browser or editor
with open('ex2.html', 'w') as f:
    f.write(html.strip())

Running the program saves the returned page to ex2.html.
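
Baike entries are usually Chinese words, so it helps to see what the encoding step does to them. This is a small sketch assuming Python 3 (urllib.parse instead of the Python 2 urllib module); build_baike_url is a hypothetical helper name, and nothing is sent over the network.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE = "http://baike.baidu.com/search/word?"


def build_baike_url(keyword):
    """Return the search URL for keyword (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII text as UTF-8 bytes.
    return BASE + urlencode({"word": keyword})


url = build_baike_url("爬虫")
print(url)

# Decoding the query string recovers the original keyword.
decoded = parse_qs(urlsplit(url).query)["word"][0]
print(decoded)
```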

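Extracting the text introduction

The text sits inside several levels of nested tags, which makes it awkward to handle.

See the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

import urllib
import urllib2
from bs4 import BeautifulSoup

url = 'http://baike.baidu.com/search/word?'
keyword = raw_input("enter:")

values = {
    'word': keyword
}
data = urllib.urlencode(values)
url = url + data
response = urllib2.urlopen(url)
html = response.read()
# Save the page so it can be inspected in a browser or editor
with open('ex2.html', 'w') as f:
    f.write(html.strip())

soup = BeautifulSoup(html, "html.parser")
# Every paragraph of the entry is a div with class "para"
tags = soup.find_all("div", attrs={"class": "para"})

for tag in tags:
    # stripped_strings yields the text of nested tags as well
    for string in tag.stripped_strings:
        print string,

Running it prints the text of every div with class "para" on the page.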
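
To show what flattening nested tags involves, here is a dependency-free sketch in Python 3. It swaps Beautiful Soup for the standard library's html.parser, so it is a different technique from the code above. The class name ParaExtractor and the SAMPLE markup are hypothetical, and unlike the exact class match used elsewhere in this post, it accepts any div whose class list contains "para".

```python
from html.parser import HTMLParser


class ParaExtractor(HTMLParser):
    """Collect the text of each <div class="para">, including nested tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside the current para; 0 = outside
        self.parts = []   # text pieces of the para being read
        self.paras = []   # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # A div nested inside a para: track it so its end tag is not
            # mistaken for the end of the para itself.
            self.depth += 1
        elif "para" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.paras.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


# Hypothetical markup for illustration only.
SAMPLE = """
<div class="main">
  <div class="para">A <b>web crawler</b> fetches <a href="#">pages</a>.</div>
  <div class="para">Second paragraph.</div>
</div>
"""

parser = ParaExtractor()
parser.feed(SAMPLE)
first_para = parser.paras[0] if parser.paras else ""
print(first_para)
```

Taking paras[0] gives the first paragraph only, which is the piece the next section stores in the database.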

Crawling the first paragraph and storing it in a database

ex2.py

# -*- coding: utf-8 -*-
import urllib
import urllib2
from bs4 import BeautifulSoup
from DB import DBCLASS

# https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
def getTag(url):
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response, "html.parser")
    tags = soup.find_all("div")

    text = u""
    for tag in tags:
        # tag['class'] raises KeyError for a div without a class, so use get()
        if tag.get('class') == [u'configModuleBanner']:
            # Treated here as the divider that ends the opening text: stop
            return text
        elif tag.get('class') == [u'para']:
            # get_text() also covers nested tags, whose .string can be None
            para = tag.get_text()
            print para
            text += para + u"\n"
    # No divider found: return whatever was collected
    return text

if __name__ == '__main__':
    url = 'http://baike.baidu.com/search/word?'
    db = DBCLASS()
    keyword = raw_input("enter:")
    insertvalue = []
    # Keep reading keywords until the user types quit
    while keyword != "quit":
        values = {
            'word': keyword
        }
        data = urllib.urlencode(values)
        text = getTag(url + data)
        insertvalue.append((keyword, text))
        keyword = raw_input("enter:")

    db.insertValue(insertvalue)

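DB.py

import sqlite3

class DBCLASS:
    def __init__(self):
        self.cx = sqlite3.connect("baike.db")
        self.cx.text_factory = str
        self.cu = self.cx.cursor()
        # Create the table on first use so that inserts do not fail.
        # To start from an empty table, run first:
        # self.cu.execute("drop table if exists discription")
        self.cu.execute("create table if not exists discription(word text primary key, first_para text)")

    def insertValue(self, insertvalue):
        # word is the primary key, so inserting a word twice raises an error
        for value in insertvalue:
            self.cu.execute("insert into discription (word, first_para) values (?, ?)", value)
        self.cx.commit()

After a run, each entered word and its extracted text are stored in the discription table of baike.db.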
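
Because word is the primary key, looking up the same term in a later run would fail on a plain insert. One possible way around this, sketched here in Python 3 with an in-memory database, is SQLite's "insert or replace" together with executemany. The function name save_entries and the sample rows are hypothetical; the table and column names follow DB.py.

```python
import sqlite3


def save_entries(conn, entries):
    """Store (word, first_para) pairs, overwriting rows whose word exists."""
    conn.execute(
        "create table if not exists discription("
        "word text primary key, first_para text)"
    )
    conn.executemany(
        "insert or replace into discription (word, first_para) values (?, ?)",
        entries,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
save_entries(conn, [("python", "first version"), ("sqlite", "an embedded database")])
save_entries(conn, [("python", "second version")])  # same key: replaces the row

rows = conn.execute(
    "select word, first_para from discription order by word"
).fetchall()
print(rows)
```

Whether overwriting is the right policy depends on the use case; keeping the old row and skipping the new one ("insert or ignore") is an equally valid choice.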
