Fully Cracking the Anti-Crawler Protection on Sogou WeChat Official Account Articles
2017-11-27 13:43
It is quite simple: selenium + chromedriver. The Sogou part is handled by driving a real Chrome browser, while mp.weixin.qq.com belongs to Tencent and has no anti-crawler measures, so plain urllib or requests is enough there.
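For the Tencent side, here is a minimal sketch of fetching one article with requests (the URL below is a placeholder; substitute any article link harvested from the Sogou results):

import requests
from bs4 import BeautifulSoup

# Placeholder link -- use a real article URL collected from the Sogou search results
url = "https://mp.weixin.qq.com/s?__biz=..."
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
resp.encoding = "utf-8"
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.get_text())  # article title, just to verify the fetch worked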
You need to scan the QR code to log in; without logging in, Sogou only lets you fetch 10 pages of results.
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import threading
import requests
import urllib.request

# Browser-like User-Agent for the mp.weixin.qq.com requests (assumed;
# the original post defines `header` elsewhere)
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

driver = webdriver.Chrome()
driver.get("http://weixin.sogou.com/")
# Bring up the QR-code login dialog
driver.find_element_by_xpath('//*[@id="loginBtn"]').click()
# input() conveniently pauses the script, giving you time to scan the QR code
find = input("Enter the keyword you want to search for: ")
driver.find_element_by_xpath('//*[@id="query"]').send_keys(find)
driver.find_element_by_xpath('//*[@id="searchForm"]/div/input[3]').click()
time.sleep(2)

# Walk through the result pages and collect every article link
url_list = []
while True:
    bs_obj = BeautifulSoup(driver.page_source, "html.parser")
    for box in bs_obj.findAll("div", {"class": "txt-box"}):
        url_list.append(box.h3.a.attrs['href'])
    next_link = bs_obj.find("a", {"id": "sogou_next"})
    if next_link is None:  # no further result pages
        break
    driver.get("http://weixin.sogou.com/weixin" + next_link.attrs['href'])
    time.sleep(1)

def get_img(url, num, connect, cursor):
    # The article itself lives on mp.weixin.qq.com, which does not block
    # crawlers, so plain requests works here
    response = requests.get(url, headers=header).content
    content = str(response, encoding="utf-8")
    bs_obj = BeautifulSoup(content, "html.parser")
    count = 0
    for img in bs_obj.findAll("img"):
        try:
            # WeChat lazy-loads images, so the real URL sits in data-src
            imgurl = get_total_url(img.attrs["data-src"])
            store_name = "%s%s" % (num, count)
            path = r"C:\Users\Mr.Guo\Pictures\weixin"
            check_mkdir(path)
            urllib.request.urlretrieve(imgurl, path + r"\%s.jpeg" % store_name)
            insert_into_table(connect, cursor, store_name, content)
            count += 1
        except Exception:
            # Images without data-src (or failed downloads) are skipped
            pass

# connect / cursor are database handles assumed to be created elsewhere
for url_num in range(len(url_list)):
    t = threading.Thread(target=get_img, args=(url_list[url_num], url_num, connect, cursor))
    t.start()
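The snippet above calls a few helpers (get_total_url, check_mkdir, insert_into_table) that the post does not show. A minimal sketch of what they might look like, assuming the article HTML is stored through an ordinary DB-API cursor (table name and schema are my assumptions, not from the original):

import os

def get_total_url(url):
    # data-src values are often protocol-relative ("//mmbiz.qpic.cn/...");
    # normalize them so urlretrieve can fetch them (assumed behaviour)
    if url.startswith("//"):
        return "http:" + url
    return url

def check_mkdir(path):
    # Create the download directory on first use
    if not os.path.exists(path):
        os.makedirs(path)

def insert_into_table(connect, cursor, store_name, content):
    # Persist the article HTML keyed by the image-name prefix;
    # "weixin_articles" is a hypothetical table name
    cursor.execute(
        "INSERT INTO weixin_articles (store_name, content) VALUES (%s, %s)",
        (store_name, content),
    )
    connect.commit()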