
Learning by Doing: Batch-Downloading Images from Douban Online Activities

2017-01-15 23:05

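Background: while browsing Douban today I came across an online activity called "来一句王家卫式的话" (roughly, "Give us a line in the style of Wong Kar-wai"). I have watched quite a few of Wong Kar-wai's films and have always liked their dialogue, but I am impatient and could not sit through every page of entries, and I mostly have only scraps of free time. That led to the following plan:

Use a crawler to collect the URL of every image.

Use Java to request each URL and save the corresponding image locally.

No time like the present. Here we use jsoup to scrape the data from the page.

Step 1: Get the "view all" link

Open the Douban home page, find the online activities section, and click the activity "来一句王家卫式的话".

The activity has many images, too many to show on a single page, so a visitor would normally click "全部186张" (all 186 photos) to keep browsing. The first thing to do, therefore, is to get the link behind "全部186张".

Inspecting the page shows that this tag carries the attribute id="pho-num", which is unique in the whole document. We can therefore look the tag up by that attribute and then read its href value.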

Document doc = Jsoup.connect("https://www.douban.com/online/123060577/")
.timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();

Element element = doc.getElementsByAttributeValue("id","pho-num").get(0);
System.out.println(element.attr("href"));

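Running this prints the "view all" URL, so we now have the address for browsing the full album.

Step 2: Get the URL of each image

We can find the relevant markup by searching the page source for keywords such as "来自 TZ" or "来自白良宴" ("来自" means "from"), which quickly leads to the tags we need.

What we need is the src attribute of the img tags. The page may contain other img tags that we do not want, however, so we first select the parent tag of each photo and then take the img tag inside it.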

Document doc = Jsoup.connect("https://www.douban.com/online/123060577/album/1638403254/")
.timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();
Elements elements = doc.getElementsByAttributeValue("class","photo_wrap");
Element imageElement = null;
for (Element elementChild : elements) {
    imageElement = elementChild.getElementsByTag("img").get(0);
    System.out.println(imageElement.attr("src"));
}

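This prints one image URL per line. The current page has 90 entries, which is too many to show, so I only took a screenshot of part of the output.

Recursively crawling the following pages

We now have the src values of the img tags on the current page, but that is not all of them. To get every image we mimic what a user would do: take the link behind "后页" (next page), then walk that page exactly as before.

Once we have found the tag that contains "后页", the rest is easy: read the link that a click on "后页" would follow.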

Document doc = Jsoup.connect("https://www.douban.com/online/123060577/album/1638403254/")
.timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();
Element nextPage = doc.getElementsByAttributeValue("class","next").get(0);
Element aTag = nextPage.getElementsByTag("a").get(0);
System.out.println(aTag.attr("href"));

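This prints the link to the next page of images.

Checking for the last page

However many images there are, there is always a last page, and on a last page the "next" link is usually empty. This album currently has only three pages, so we go straight to the last one.

On the last page the "后页" element contains no hyperlink tag, so we can use that to tell whether the current page is the last one.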
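
The same last-page check can also drive a plain loop instead of recursion. The sketch below is not the article's code: `PageWalker`, `walk`, and the `nextPageOf` function are hypothetical names, and the page fetch is abstracted into a function so the control flow can be tried without network access. It stops when a page has no next link, and also when a link repeats.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: follows "next page" links until none is left.
final class PageWalker {

    // nextPageOf returns the URL behind "后页", or empty on the last page.
    static List<String> walk(String firstPage, Function<String, Optional<String>> nextPageOf) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Optional<String> current = Optional.of(firstPage);
        // seen.add returns false for a URL already visited, which guards against link cycles
        while (current.isPresent() && seen.add(current.get())) {
            visited.add(current.get());
            current = nextPageOf.apply(current.get());
        }
        return visited;
    }

    public static void main(String[] args) {
        // Stand-in for three album pages; the page names are made up for the demo
        Map<String, String> next = Map.of("page1", "page2", "page2", "page3");
        List<String> pages = walk("page1", url -> Optional.ofNullable(next.get(url)));
        System.out.println(pages);
    }
}
```

In a real crawler, `nextPageOf` would load the page with jsoup and return the href of the anchor inside the "next" element when one exists.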

Getting all image URLs of the activity in one go

With the analysis above in place, collecting every image URL of the activity in one run is not hard.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GetPicLink {

    static List<String> urlLists = new ArrayList<>();

    public static void main(String[] args) {
        try {
            // 1. From the activity page, get the "view all" link
            Document doc = Jsoup.connect("https://www.douban.com/online/123060577/")
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();
            Element element = doc.getElementsByAttributeValue("id","pho-num").get(0);

            // Collect the image URLs page by page into urlLists
            spideAPage(element.attr("href"));

            System.out.println(urlLists.size());
            for (String string : urlLists) {
                System.out.println(string);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void spideAPage(String pageUrl) {
        // 2. Given the "view all" link (or a later page), collect every image URL on that page
        try {
            Document doc = Jsoup.connect(pageUrl)
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();

            // Get the image URLs
            Elements elements = doc.getElementsByAttributeValue("class","photo_wrap");
            for (Element elementChild : elements) {
                Element imageElement = elementChild.getElementsByTag("img").get(0);
                // Add the current image URL to urlLists
                urlLists.add(imageElement.attr("src"));
            }

            // Look for the link behind "后页" (next page)
            Elements aTags = doc.getElementsByAttributeValue("class","next").select("a");
            // On the last page there is no <a> tag here, so aTags is empty and the recursion stops
            // (calling get(0) on it would throw java.lang.IndexOutOfBoundsException)
            if (!aTags.isEmpty()) {
                // 3. Recurse until the last page
                spideAPage(aTags.get(0).attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

What remains is to request each image URL and save its bytes locally.

Adding the download code

public static void downLoadFromUrl(String urlStr, String fileName, String savePath) throws IOException {
    URL url = new URL(urlStr);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    // Set the connect timeout to 3 seconds
    conn.setConnectTimeout(3 * 1000);
    // Send a browser-like User-Agent so the server does not reject the crawler with a 403 error
    conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");

    // Location where the file is saved; create the directory if it does not exist yet
    File saveDir = new File(savePath);
    if (!saveDir.exists()) {
        saveDir.mkdirs();
    }
    File file = new File(saveDir, fileName);

    // Read the response into a byte array and write it to the file;
    // try-with-resources closes both streams even if an exception is thrown
    try (InputStream inputStream = conn.getInputStream();
         FileOutputStream fos = new FileOutputStream(file)) {
        byte[] getData = readInputStream(inputStream);
        fos.write(getData);
    }

    System.out.println("info:" + url + " download success");
}

private static byte[] readInputStream(InputStream inputStream) throws IOException {
    byte[] buffer = new byte[1024];
    int len = 0;
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    // Copy the stream in 1 KB chunks until end of stream (read returns -1)
    while ((len = inputStream.read(buffer)) != -1) {
        bos.write(buffer, 0, len);
    }
    bos.close();
    return bos.toByteArray();
}
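
Reading the whole response into a byte array works well for small images. As an alternative sketch (not the article's code), the standard `java.nio.file.Files.copy` can stream an `InputStream` straight to disk. The class name `StreamSaver` and the method `save` are hypothetical, and the demo feeds it an in-memory stream rather than a real HTTP connection.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: saves a stream to dir/fileName without buffering it all in memory.
final class StreamSaver {

    // Creates dir if needed, copies the stream into dir/fileName, and returns the bytes written.
    static long save(InputStream in, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir);
        try (InputStream source = in) {
            return Files.copy(source, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("douban-demo");
        long written = save(new ByteArrayInputStream(new byte[] {1, 2, 3}), dir, "0.jpg");
        System.out.println(written + " bytes written to " + dir.resolve("0.jpg"));
    }
}
```

With a real download, the stream passed in would be the one returned by `conn.getInputStream()`.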

Start downloading

for (int i = 0; i < urlLists.size(); i++) {
    downLoadFromUrl(urlLists.get(i), i + "", "/home/liuhang/Desktop/test");
}

After the run, the downloaded files are in the target directory.
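
The loop above names each file by its index alone, so the saved files have no extension. One possible way to keep it is to take the extension from the URL path. This sketch assumes the image URL ends in a file name such as `p1.png` (the example URL below is made up), and `ImageNames` and `fileNameFor` are hypothetical names.

```java
import java.net.URI;

// Hypothetical helper: derives a local file name from an index and an image URL.
final class ImageNames {

    // Builds "<index><ext>", where ext comes from the last path segment of the URL,
    // falling back to ".jpg" when that segment has no extension.
    static String fileNameFor(int index, String imageUrl) {
        String path = URI.create(imageUrl).getPath();
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        String ext = dot >= 0 ? lastSegment.substring(dot) : ".jpg";
        return index + ext;
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor(0, "https://example.com/view/photo/p1.png"));
    }
}
```

Because `getPath()` excludes the query string, a dot inside query parameters does not affect the result.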

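Next, let's test it on the activity "午後的一張相片".

When looping over several activities, each one needs its own directory. Here I use the trailing part of the activity URL as the directory name.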

// https://www.douban.com/online/123077659/
System.out.println("https://www.douban.com/online/123077659/".substring("https://www.douban.com/online/".length(), "https://www.douban.com/online/123077659/".length() - 1));

This prints the directory name "123077659". In that directory we also add a description file whose content is the activity title.
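
The substring arithmetic above depends on the exact prefix and on a trailing slash. A slightly more tolerant sketch parses the path with `java.net.URI` instead. It assumes activity URLs have the form `https://www.douban.com/online/<id>/` as in this article, and `ActivityId.fromUrl` is a hypothetical helper.

```java
import java.net.URI;

// Hypothetical helper: extracts the activity ID from an online-activity URL.
final class ActivityId {

    // Returns the path segment that follows "online", for example "123077659".
    static String fromUrl(String activityUrl) {
        String[] segments = URI.create(activityUrl).getPath().split("/");
        for (int i = 0; i < segments.length - 1; i++) {
            if (segments[i].equals("online")) {
                return segments[i + 1];
            }
        }
        throw new IllegalArgumentException("Not an online-activity URL: " + activityUrl);
    }

    public static void main(String[] args) {
        System.out.println(fromUrl("https://www.douban.com/online/123077659/"));
    }
}
```

The same call also returns the activity ID when given an album URL under that activity, which keeps the directory name flat.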

Adding the description file

private static void writeActivityTitle(String title, String folderName) {
    File folder = new File(folderName);
    if (!folder.exists()) {
        folder.mkdirs();
    }
    // try-with-resources closes the writer and the underlying stream even if the write fails
    try (OutputStreamWriter osw = new OutputStreamWriter(
            new FileOutputStream(new File(folder, "filename.txt")), "UTF-8")) {
        osw.write(title);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Summary: getting all images of a single online activity

Below is the complete code for getting all images of a single online activity.

package doubanpic;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class test {

    static List<String> urlLists = new ArrayList<>();

    public static void main(String[] args) {
        try {
            // 1. From the activity page, get the "view all" link
            Document doc = Jsoup.connect("https://www.douban.com/online/123060577/")
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();
            Element element = doc.getElementsByAttributeValue("id","pho-num").get(0);

            // Collect the image URLs page by page into urlLists
            spideAPage(element.attr("href"));

            System.out.println(urlLists.size());
            for (String string : urlLists) {
                System.out.println(string);
            }

            for (int i = 0; i < urlLists.size(); i++) {
                // Replace "thumb" with "photo" in the URL; otherwise only the thumbnail is downloaded
                downLoadFromUrl(urlLists.get(i).replace("thumb", "photo"), i + ".jpg", "/home/liuhang/Desktop/test");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void downLoadFromUrl(String urlStr, String fileName, String savePath) throws IOException {
        URL url = new URL(urlStr);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Set the connect timeout to 3 seconds
        conn.setConnectTimeout(3 * 1000);
        // Send a browser-like User-Agent so the server does not reject the crawler with a 403 error
        conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");

        // Location where the file is saved; create the directory if it does not exist yet
        File saveDir = new File(savePath);
        if (!saveDir.exists()) {
            saveDir.mkdirs();
        }
        File file = new File(saveDir, fileName);

        // Read the response into a byte array and write it to the file;
        // try-with-resources closes both streams even if an exception is thrown
        try (InputStream inputStream = conn.getInputStream();
             FileOutputStream fos = new FileOutputStream(file)) {
            byte[] getData = readInputStream(inputStream);
            fos.write(getData);
        }

        System.out.println("info:" + url + " download success" + "    " + file.getAbsolutePath());
    }

    public static byte[] readInputStream(InputStream inputStream) throws IOException {
        byte[] buffer = new byte[1024];
        int len = 0;
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // Copy the stream in 1 KB chunks until end of stream (read returns -1)
        while ((len = inputStream.read(buffer)) != -1) {
            bos.write(buffer, 0, len);
        }
        bos.close();
        return bos.toByteArray();
    }

    private static void writeActivityTitle(String title, String folderName) {
        File folder = new File(folderName);
        if (!folder.exists()) {
            folder.mkdirs();
        }
        // try-with-resources closes the writer and the underlying stream even if the write fails
        try (OutputStreamWriter osw = new OutputStreamWriter(
                new FileOutputStream(new File(folder, "filename.txt")), "UTF-8")) {
            osw.write(title);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void spideAPage(String pageUrl) {
        // 2. Given the "view all" link (or a later page), collect every image URL on that page
        try {
            Document doc = Jsoup.connect(pageUrl)
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();

            // Get the image URLs
            Elements elements = doc.getElementsByAttributeValue("class","photo_wrap");
            for (Element elementChild : elements) {
                Element imageElement = elementChild.getElementsByTag("img").get(0);
                // Add the current image URL to urlLists
                urlLists.add(imageElement.attr("src"));
            }

            // Look for the link behind "后页" (next page)
            Elements aTags = doc.getElementsByAttributeValue("class","next").select("a");
            // On the last page there is no <a> tag here, so aTags is empty and the recursion stops
            // (calling get(0) on it would throw java.lang.IndexOutOfBoundsException)
            if (!aTags.isEmpty()) {
                // 3. Recurse until the last page
                spideAPage(aTags.get(0).attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

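Getting all images of all activities

Get the link of each activity and pass it to the methods analyzed above.

The Douban online activity list lives at a URL like this:

https://www.douban.com/online/list?g=h

Getting the links of all activities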

public static void main(String[] args) {
    try {
        Document doc = Jsoup.connect("https://www.douban.com/online/?r=i")
                .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();

        // The page has many <a> tags, so we match them by href (the second argument is treated
        // as a regular expression) and filter out the "线上活动" (online activities) link itself; see 17.png
        Elements elements = doc.getElementsByAttributeValueMatching("href", "https://www.douban.com/online/*");
        for (Element element : elements) {
            if (!"线上活动".equals(element.text()) && !"".equals(element.text())) {
                System.out.println(element.attr("href") + " === " + element.text());
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

This prints each activity link together with its title, so we now have the links of all activities.
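
Filtering on the link text works but depends on the wording of the page. Another option is to filter the collected hrefs with a stricter regular expression. This sketch assumes an activity page URL is always `https://www.douban.com/online/` followed by a numeric ID, the form seen throughout this article, and `ActivityLinks` and `onlyActivities` are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only hrefs that look like a single activity page.
final class ActivityLinks {

    // Dots are escaped and the ID must be all digits, so list and index pages are rejected
    private static final Pattern ACTIVITY =
            Pattern.compile("^https://www\\.douban\\.com/online/\\d+/?$");

    // Keeps only activity URLs, dropping duplicates while preserving order.
    static Set<String> onlyActivities(List<String> hrefs) {
        Set<String> result = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (ACTIVITY.matcher(href).matches()) {
                result.add(href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(onlyActivities(List.of(
                "https://www.douban.com/online/",
                "https://www.douban.com/online/123060577/",
                "https://www.douban.com/online/list?g=h",
                "https://www.douban.com/online/123060577/")));
    }
}
```

A set is used because the same activity can be linked more than once on a page, for example from both its image and its title.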

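Things are getting a little tangled, so let me spell out the approach.

Getting all images of all online activities breaks down into getting all images of each single activity and then looping over the activities.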
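
That decomposition can be sketched as a loop that keeps one image list per activity instead of one shared list. This is only an illustration: `ActivityCrawler`, `crawlAll`, and the `imagesOf` function are hypothetical, and the per-activity crawl is passed in as a function so the structure can be tried without touching the network.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: applies a per-activity crawl to every activity URL.
final class ActivityCrawler {

    // Runs imagesOf for each activity URL; an activity whose crawl fails is skipped, not fatal.
    static Map<String, List<String>> crawlAll(List<String> activityUrls,
                                               Function<String, List<String>> imagesOf) {
        Map<String, List<String>> imagesByActivity = new LinkedHashMap<>();
        for (String activityUrl : activityUrls) {
            try {
                imagesByActivity.put(activityUrl, imagesOf.apply(activityUrl));
            } catch (RuntimeException e) {
                System.err.println("Skipping " + activityUrl + ": " + e.getMessage());
            }
        }
        return imagesByActivity;
    }

    public static void main(String[] args) {
        // The activity names and image paths here are placeholders for the demo
        Map<String, List<String>> result = crawlAll(
                List.of("activity-a", "activity-b"),
                url -> List.of(url + "/1.jpg", url + "/2.jpg"));
        result.forEach((activity, images) -> System.out.println(activity + " -> " + images));
    }
}
```

Keeping a separate list per activity means the images of one activity cannot leak into the download folder of another.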

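Summary: getting all images of all online activities

The earlier sections explain things in enough detail, so here is the code directly. I have tested it and it works.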

package doubanpic;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GetPicLink {

    static List<String> urlLists = new ArrayList<>();
    static Map<String,String> sMap = new HashMap<>();

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://www.douban.com/online/?r=i")
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();

            // The page has many <a> tags, so we match them by href (the second argument is treated
            // as a regular expression) and filter out the "线上活动" (online activities) link itself; see 17.png
            Elements elements = doc.getElementsByAttributeValueMatching("href", "https://www.douban.com/online/*");
            for (Element element : elements) {
                if (!"线上活动".equals(element.text()) && !"".equals(element.text())) {
                    System.out.println(element.attr("href"));
                    sMap.put(element.attr("href"), element.text());
                }
            }

            Set<String> keys = sMap.keySet();
            for (String string : keys) {
                // 1. From the activity page, get the "view all" link
                doc = Jsoup.connect(string)
                        .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();
                Elements viewAll = doc.getElementsByAttributeValue("id","pho-num");
                if (viewAll.isEmpty()) {
                    continue; // this activity page has no "view all" link
                }
                Element activityElement = viewAll.get(0);

                // Start from an empty list so images of earlier activities are not downloaded again
                urlLists.clear();

                // Collect the image URLs page by page into urlLists
                System.out.println(activityElement.attr("href"));
                try {
                    spideAPage(activityElement.attr("href"));
                } catch (Exception e) {
                    continue;
                }

                // Save a text file that stores the title of the current online activity
                String folderName = activityElement.attr("href").substring("https://www.douban.com/online/".length(), activityElement.attr("href").length() - 1);
                writeActivityTitle(sMap.get(string), "/home/liuhang/Desktop/douban/" + folderName);

                System.out.println("urlLists.size() is :" + urlLists.size());
                for (int i = 0; i < urlLists.size(); i++) {
                    downLoadFromUrl(urlLists.get(i), i + "", "/home/liuhang/Desktop/douban/" + folderName);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void writeActivityTitle(String title, String folderName) {
        File folder = new File(folderName);
        if (!folder.exists()) {
            folder.mkdirs();
        }
        // try-with-resources closes the writer and the underlying stream even if the write fails
        try (OutputStreamWriter osw = new OutputStreamWriter(
                new FileOutputStream(new File(folder, "filename.txt")), "UTF-8")) {
            osw.write(title);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void spideAPage(String pageUrl) throws Exception {
        // 2. Given the "view all" link (or a later page), collect every image URL on that page
        try {
            Document doc = Jsoup.connect(pageUrl)
                    .timeout(10000).userAgent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)").get();

            // Get the image URLs
            Elements elements = doc.getElementsByAttributeValue("class","photo_wrap");
            for (Element elementChild : elements) {
                Elements images = elementChild.getElementsByTag("img");
                if (!images.isEmpty()) {
                    // Add the current image URL to urlLists
                    urlLists.add(images.get(0).attr("src"));
                }
            }

            // Look for the link behind "后页" (next page)
            Elements aTags = doc.getElementsByAttributeValue("class","next").select("a");
            // On the last page there is no <a> tag here, so aTags is empty and the recursion stops
            // (calling get(0) on it would throw java.lang.IndexOutOfBoundsException)
            if (!aTags.isEmpty()) {
                // 3. Recurse until the last page
                spideAPage(aTags.get(0).attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void downLoadFromUrl(String urlStr, String fileName, String savePath) throws IOException {
        URL url = new URL(urlStr);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Set the connect timeout to 3 seconds
        conn.setConnectTimeout(3 * 1000);
        // Send a browser-like User-Agent so the server does not reject the crawler with a 403 error
        conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");

        // Location where the file is saved; create the directory if it does not exist yet
        File saveDir = new File(savePath);
        if (!saveDir.exists()) {
            saveDir.mkdirs();
        }
        File file = new File(saveDir, fileName);

        // Read the response into a byte array and write it to the file;
        // try-with-resources closes both streams even if an exception is thrown
        try (InputStream inputStream = conn.getInputStream();
             FileOutputStream fos = new FileOutputStream(file)) {
            byte[] getData = readInputStream(inputStream);
            fos.write(getData);
        }

        System.out.println("info:" + url + " download success" + "    " + file.getAbsolutePath());
    }

    public static byte[] readInputStream(InputStream inputStream) throws IOException {
        byte[] buffer = new byte[1024];
        int len = 0;
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // Copy the stream in 1 KB chunks until end of stream (read returns -1)
        while ((len = inputStream.read(buffer)) != -1) {
            bos.write(buffer, 0, len);
        }
        bos.close();
        return bos.toByteArray();
    }
}

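The handling of where the images are stored is fairly crude; you could, for example, add a file chooser so the user can pick the folder. Finally, do not forget to add the jsoup jar to the classpath.

Good night. Life is all about tinkering.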
