
Using re regular expressions in Python crawlers

2018-02-27 16:32
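Once you have written enough crawlers, you will inevitably need data that does not live inside HTML5 tags. At that point the only option is to bite the bullet and extract it with re regular expressions.

import re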
import urllib2
from lxml import etree

ins_url = 'https://www.instagram.com/ahmad_monk/'
id = 22543622
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0"}
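# Download the profile page's JS bundle and search it for the queryId
Ruery = urllib2.urlopen(urllib2.Request('https://www.instagram.com/static/bundles/ProfilePageContainer.js/8f295f2125e3.js',headers = headers)).read().decode('utf-8')
# Raw string so that \w reaches the regex engine unchanged
pattern = re.compile(r'queryId:"(\w+)",queryParams')
# findall returns every captured queryId; take the second match
queryId = re.findall(pattern,Ruery)[1]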
first = 12
pattern2 = re.compile('"end_cursor":"(.*?)"')
# Text being matched:
# {"count":43295}}],"count":275,"page_info":{"has_next_page":true,"end_cursor":"AQCVTPsiY5qFjN9Usq6x3fLEAjcGoFCv6MelbGha_EEq3_4K6bjqCMqU7rVaJw-XeojaNP2DrkKJ7qcFI65qv3-JB0THTG5b4gg05F5qhTZc4w"}},"saved_media":{"nodes"
# For content like this, use the non-greedy pattern (.*?)
request = urllib2.Request(ins_url)
response = urllib2.urlopen(request)
R = response.read()
after = re.findall(pattern2,R)[0]
print after



And with that, the required field has been extracted.
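
To see why the non-greedy (.*?) matters for this kind of content, here is a small offline sketch that runs both variants against the sample text quoted in the comment above. It needs no network access. The variable names (cursor, sample, greedy, non_greedy, negated) are made up for illustration, and the negated character class [^"]* is an alternative that is not in the original code. The sketch is written to run under Python 3, while the listing above is Python 2.

```python
import re

# The end_cursor value from the sample text in the listing's comment.
cursor = ('AQCVTPsiY5qFjN9Usq6x3fLEAjcGoFCv6MelbGha_EEq3_4K6bjqCMqU7rVaJw'
          '-XeojaNP2DrkKJ7qcFI65qv3-JB0THTG5b4gg05F5qhTZc4w')

# Rebuild the sample fragment around it (same text as in the comment).
sample = ('{"count":43295}}],"count":275,"page_info":{"has_next_page":true,'
          '"end_cursor":"' + cursor + '"}},"saved_media":{"nodes"')

# Greedy: .* runs to the LAST double quote in the text, so the capture
# swallows everything after the cursor as well.
greedy = re.findall(r'"end_cursor":"(.*)"', sample)

# Non-greedy: .*? stops at the FIRST double quote after the value.
non_greedy = re.findall(r'"end_cursor":"(.*?)"', sample)

# Alternative (an assumption, not from the original): match anything
# except a double quote, which gives the same result here.
negated = re.findall(r'"end_cursor":"([^"]*)"', sample)

print(greedy[0])
print(non_greedy[0])
print(negated[0])
```

The greedy capture ends in `nodes` because the last double quote in the sample sits right after that word, while the other two patterns return only the cursor. Note that `(\w+)` would not work for this field, since the cursor contains `-`, which `\w` does not match.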

 