
On the Self-Cultivation of a Web Crawler: Notes on the 小甲鱼 Python Tutorial

2017-03-28 21:44

On the Self-Cultivation of a Web Crawler

How Python accesses the Internet

URL

The general format of a URL is as follows (parts in square brackets [] are optional):

protocol://hostname[:port]/path/[;parameters][?query]#fragment


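A URL consists of three parts:

The first part is the protocol: http, https, ftp, file, ed2k…

The second part is the domain name or IP address of the server that hosts the resource (sometimes with a port number; every transfer protocol has a default port, e.g. 80 for http)

The third part is the specific location of the resource, such as a directory or file name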
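As a rough illustration of the parts described above, the standard library's urllib.parse.urlsplit can break a URL into its components. The URL below is made up purely for illustration and is not taken from the tutorial.

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only so that every component is present.
url = "http://www.example.com:8080/docs/index.html?lang=en#top"
parts = urlsplit(url)

print(parts.scheme)    # protocol
print(parts.hostname)  # domain name or IP address of the server
print(parts.port)      # explicit port, or None when it is omitted
print(parts.path)      # location of the resource on the server
print(parts.query)     # optional query string
print(parts.fragment)  # optional fragment
```

When the port is left out of the URL, parts.port is None and the client falls back to the protocol's default.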

urllib

urllib.request for opening and reading URLs

urllib.error containing the exceptions raised by urllib.request

urllib.parse for parsing URLs

urllib.robotparser for parsing robots.txt files

Example 1:

import urllib.request
response = urllib.request.urlopen("http://www.fishc.com")
html = response.read()
print(html)  # raw bytes

html = html.decode("utf-8")  # decode the bytes into a str
print(html)
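The example above hard-codes "utf-8" when decoding. A minimal sketch of an alternative is to take the charset from the response headers and fall back to UTF-8 only when none is declared. To keep the sketch runnable without network access it opens a data: URL instead of a real site; the page content is invented for illustration.

```python
import urllib.parse
import urllib.request

# Invented page content, wrapped in a data: URL so no network is needed.
page = "<html><body>Hello, FishC</body></html>"
url = "data:text/html;charset=utf-8," + urllib.parse.quote(page)

with urllib.request.urlopen(url) as response:
    raw = response.read()  # bytes, exactly as in the example above
    # Charset declared in the Content-Type header, if any.
    charset = response.headers.get_content_charset() or "utf-8"

html = raw.decode(charset)  # str
print(type(raw), type(html))
print(html)
```

With a real http:// URL the same two steps apply: read the bytes, then decode them with whatever charset the server declares.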


Hands-on practice

Exercise 1:

Download a picture of a cat from the placekitten site!

Create a new script, download_cat.py

import urllib.request
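response = urllib.request.urlopen("http://placekitten.com/g/500/600")  # urlopen returns a response object
cat_img = response.read()
# Each of the following can be printed with print():
# response.geturl()  returns the URL that was fetched
# response.getcode() returns the status code; 200 means the site responded normally
# response.info()    returns the response headers
with open('cat_500_600.jpg', 'wb') as f:
    f.write(cat_img)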
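The script above reads the whole image into memory before writing it. For larger files, one possible variation is to copy the response to disk in chunks with shutil.copyfileobj. The helper name save_stream is hypothetical, and an in-memory io.BytesIO stands in for the HTTP response so the sketch runs offline.

```python
import io
import shutil
import tempfile
from pathlib import Path


def save_stream(source, destination):
    """Copy a binary file-like object to destination in chunks.

    Returns the number of bytes now stored in the file.
    """
    destination = Path(destination)
    with open(destination, "wb") as f:
        shutil.copyfileobj(source, f)
    return destination.stat().st_size


# Stand-in for the response object: anything with a .read() method works.
fake_image = io.BytesIO(b"\xff\xd8\xff\xe0 not a real JPEG")
target = Path(tempfile.mkdtemp()) / "cat_500_600.jpg"

size = save_stream(fake_image, target)
print(size, "bytes written to", target)
```

In the real script, the object returned by urllib.request.urlopen would be passed as source.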


Exercise 2:

Translate text using Youdao Translate

Create a new script

import urllib.request
import urllib.parse

url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null'

data={}
data['type']= 'AUTO'
data['i'] = 'I love you !'
data['doctype']= 'json'
data['xmlVersion'] = 1.8
data['keyfrom'] ='fanyi.web'
data['ue'] = 'UTF-8'
data['typoResult']='true'
data = urllib.parse.urlencode(data).encode('utf-8')

response = urllib.request.urlopen(url,data)
html = response.read().decode('utf-8')

print(html)


Output: data in JSON format

{"type":"EN2ZH_CN","errorCode":0,"elapsedTime":8,"translateResult":[[{"src":"I love you !","tgt":"我爱你!"}]]}


This can be handled by importing the json module:

import json
json.loads(html)  # convert to a dict, whose keys then give access to the translation result

target = json.loads(html)
type(target)

target['translateResult'][0][0]['tgt']
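The chained lookup above raises an exception if the server returns something other than the expected structure. A minimal defensive sketch follows; the function name extract_translation is hypothetical, and the sample string is the output shown earlier in these notes.

```python
import json

# Sample response copied from the output shown earlier in these notes.
html = ('{"type":"EN2ZH_CN","errorCode":0,"elapsedTime":8,'
        '"translateResult":[[{"src":"I love you !","tgt":"我爱你!"}]]}')


def extract_translation(text):
    """Return the translated string, or None if the payload is not as expected."""
    try:
        target = json.loads(text)
        return target["translateResult"][0][0]["tgt"]
    except (ValueError, KeyError, IndexError, TypeError):
        # ValueError covers invalid JSON; the others cover a changed structure.
        return None


print(extract_translation(html))
print(extract_translation("not json"))
```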


The final code, consolidated and tidied up:

import urllib.request
import urllib.parse
import json

content = input("Enter the text to translate: ")

url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null'

data={}
data['type']= 'AUTO'
data['i'] = content
data['doctype']= 'json'
data['xmlVersion'] = 1.8
data['keyfrom'] ='fanyi.web'
data['ue'] = 'UTF-8'
data['typoResult']='true'
data = urllib.parse.urlencode(data).encode('utf-8')

response = urllib.request.urlopen(url,data)
html = response.read().decode('utf-8')

target = json.loads(html)
print("Translation: %s"%(target['translateResult'][0][0]['tgt']))
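To make the form-encoding step in the scripts above more concrete, the following sketch shows what urllib.parse.urlencode produces for a few of the same fields, and why .encode('utf-8') is applied afterwards. Only a subset of the fields is used, to keep the output short.

```python
from urllib.parse import urlencode, parse_qs

# A subset of the form fields used in the scripts above.
form = {"type": "AUTO", "i": "I love you !", "doctype": "json", "xmlVersion": 1.8}

encoded = urlencode(form)       # str: key=value pairs joined with '&'
body = encoded.encode("utf-8")  # bytes: urlopen requires bytes for its data argument

print(encoded)
print(parse_qs(encoded)["i"])   # decoding recovers the original text
```

Spaces become '+' and characters such as '!' are percent-encoded, which matches the application/x-www-form-urlencoded content type.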


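Supplement to Exercise 2:

Chrome -> Inspect Element -> Network

Two methods a client can use when making a request to a server:

POST: submit data to the specified server for processing

GET: request data from the server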
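In urllib, the choice between these two methods follows from whether a data argument is supplied. The sketch below only builds Request objects and inspects them; nothing is sent over the network. The shortened URL and the form field are placeholders for illustration.

```python
import urllib.parse
import urllib.request

# Shortened form of the address used in these notes; no request is sent here.
url = "http://fanyi.youdao.com/translate"

get_req = urllib.request.Request(url)
post_req = urllib.request.Request(
    url, data=urllib.parse.urlencode({"i": "hello"}).encode("utf-8")
)

print(get_req.get_method())   # no data supplied
print(post_req.get_method())  # data supplied
```

This is why urlopen(url, data) in the translation script results in a POST, while urlopen(url) alone results in a GET.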

Headers

# General
Request URL:http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null # address the translation is submitted to
Request Method:POST # request method
Status Code:200 OK # normal response
Remote Address:121.195.178.202:80 # server IP address and port
Referrer Policy:no-referrer-when-downgrade

# Request Headers: sent by the client with the request; the server uses them to judge whether the visit comes from a human
Accept:application/json, text/javascript, */*; q=0.01
Accept-Encoding:gzip, deflate
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Content-Length:136
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:JSESSIONID=abcVFB9yzWTB85OEsRuSv; SESSION_FROM_COOKIE=fanyiweb; OUTFOX_SEARCH_USER_ID=-892334323@111.187.32.171; _ntes_nnid=7440c4163851030df4c41a906ffa7303,1490700544257; OUTFOX_SEARCH_USER_ID_NCOO=148702195.1749897; ___rl__test__cookies=1490700570495
Host:fanyi.youdao.com
Origin:http://fanyi.youdao.com
Referer:http://fanyi.youdao.com/
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.19 Safari/537.36 # tells the server whether the request comes from a browser or from code; can be customized
X-Requested-With:XMLHttpRequest

# Response Headers
Connection:keep-alive
Content-Encoding:gzip
Content-Language:zh-CN
Content-Type:application/json;charset=utf-8
Date:Tue, 28 Mar 2017 11:29:30 GMT
Server:nginx
Transfer-Encoding:chunked
Vary:Accept-Encoding

# Query String Parameters
smartresult:dict
smartresult:rule
smartresult:ugc
sessionFrom:null

# Form Data: the content submitted via POST
type:AUTO
i:I love you!
doctype:json
xmlVersion:1.8
keyfrom:fanyi.web
ue:UTF-8
action:FY_BY_CLICKBUTTON
typoResult:true
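Since the User-Agent header can be customized, one way to do so with urllib is to attach headers to a Request object, roughly as sketched below. The header values are copied from the capture above, the URL is shortened, and the request is only constructed, not sent. Whether a given server accepts such a request is an assumption that would need to be checked against the live site.

```python
import urllib.request

# Shortened form of the address used in these notes; no request is sent here.
url = "http://fanyi.youdao.com/translate"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/58.0.3029.19 Safari/537.36",
    "Referer": "http://fanyi.youdao.com/",
}

req = urllib.request.Request(url, headers=headers)
# Headers can also be added after the Request has been created.
req.add_header("X-Requested-With", "XMLHttpRequest")

# urllib stores header names in capitalized form, e.g. "User-agent".
for name, value in req.header_items():
    print(name, ":", value)

# response = urllib.request.urlopen(req)  # this line would actually send the request
```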


Preview

Response