Python project: a crawler for the 妹子图 (girl pics) on jandan.net - Part 1
2016-03-08 14:20
This is a practice project: scraping the girl pics. The page URLs follow the pattern
http://jandan.net/ooxx/page-1777#comments — only the page number (1777 here) needs to change.
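As a rough sketch (the variable names here are mine, not from the original script), the full list of page URLs can be built like this:

base = "http://jandan.net/ooxx/page-"
# The script below walks pages 1530 through 1882.
page_urls = [base + str(i) + "#comments" for i in range(1530, 1883)]
print(page_urls[0])  # http://jandan.net/ooxx/page-1530#comments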
Inspecting the page source shows that each picture appears in two places:
a thumbnail, <img src="http://ww1.sinaimg.cn/mw600/4bf31e43jw1f09htnzkh5j20dw0kumz0.jpg" /></p>
and the full-size original,
<a href="http://ww1.sinaimg.cn/large/4bf31e43jw1f09htnzkh5j20dw0kumz0.jpg" target="_blank" class="view_img_link">[查看原图]</a>
Here we grab the originals, locating the links by their class and target attributes.
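A minimal sketch of that lookup with BeautifulSoup, run here against the single anchor tag shown above (the html string is just sample input):

from bs4 import BeautifulSoup

html = '<a href="http://ww1.sinaimg.cn/large/4bf31e43jw1f09htnzkh5j20dw0kumz0.jpg" target="_blank" class="view_img_link">[查看原图]</a>'
soup = BeautifulSoup(html, 'lxml')
for a in soup.find_all("a", target="_blank", class_="view_img_link"):
    print(a['href'])  # the full-size image URL

The same find_all call is what the full script below runs on each downloaded page.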
The result is one TXT file of image URLs per page; the next part covers merging the files and downloading the images.
The source code is below. You will need to supply the proxy IP file (ip2.txt) yourself :-D

# coding:utf-8
####################################################
# coding by 刘云飞
####################################################
import requests
import os
import time
import random
from bs4 import BeautifulSoup
import threading  # imported but not used in this part

url = "http://jandan.net/ooxx/page-"
img_lists = []      # all full-size image URLs collected so far
url_lists = []      # gallery page URLs to crawl
not_url_lists = []  # pages that failed and should be retried
ips = []            # proxy addresses loaded from ip2.txt
thread_list = []    # not used in this part

# Load the proxy list, one address per line.
with open('ip2.txt', 'r') as f:
    lines = f.readlines()
    for line in lines:
        ip_one = "http://" + line.strip()
        ips.append(ip_one)

headers = {
    'Host': 'jandan.net',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/42.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Referer': 'http://jandan.net/ooxx/',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
}

# Build the list of gallery pages to crawl.
for i in range(1530, 1883):
    url_lists.append(url + str(i) + '#comments')

def writeToTxt(name, urls):
    # Write a list of URLs to a text file, one per line.
    with open(name, 'w+') as f:
        for urlOne in urls:
            f.write(urlOne + "\n")

def get_img_url(url):
    # Fetch one gallery page through a random proxy and collect its image URLs.
    single_ip_addr = random.choice(ips)
    lists_tmp = []
    page = int(url[28:32])  # assumes a four-digit page number
    filename = str(page) + ".txt"
    proxies = {'http': single_ip_addr}
    try:
        res = requests.get(url, headers=headers, proxies=proxies)
        print(res.status_code)
        if res.status_code == 200:
            text = res.text
            Soup = BeautifulSoup(text, 'lxml')
            # The full-size links carry target="_blank" and class="view_img_link".
            results = Soup.find_all("a", target="_blank", class_="view_img_link")
            for img in results:
                lists_tmp.append(img['href'])
                img_lists.append(img['href'])  # also record it in the global list of all image URLs
            print(url + " --->>>>抓取完毕!!")
            writeToTxt(filename, lists_tmp)
        else:
            not_url_lists.append(url)
            print("not ok")
    except Exception:
        not_url_lists.append(url)
        print("not ok")

for url in url_lists:
    page = int(url[28:32])
    filename = str(page) + ".txt"
    if os.path.exists(filename):
        # Already scraped in an earlier run, skip it.
        print(url + " is pass")
    else:
        # time.sleep(1)
        get_img_url(url)

print(img_lists)

with open("img_url.txt", 'w+') as f:
    for url in img_lists:
        f.write(url + "\n")

print("共有 " + str(len(img_lists)) + " 张图片。")
print("all done!!!")

# Keep the failed pages so they can be retried later.
with open("not_url_lists.txt", 'w+') as f:
    for url in not_url_lists:
        f.write(url + "\n")
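A note on ip2.txt: the script reads one proxy per line and prefixes it with http:// before handing it to requests, so the file is presumably just a plain list of host:port entries, for example (made-up addresses, for illustration only):

1.2.3.4:8080
5.6.7.8:3128

If you do not want to deal with proxies at all, dropping the proxies argument from requests.get should also work, with requests then going out from your own IP.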