
Crawling Images from the Web with Java and Jsoup

2015-06-03 11:13
http://jsoup.org/

About Jsoup

Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

In essence, Jsoup issues an HTTP request, hands you back the DOM, and then lets you pick out the elements you need with jQuery-like selectors, as the sketch below shows.

Since most sites today serve dynamic data, much of the data you want ends up anchored under some specific element.
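
As a minimal, purely illustrative sketch (the URL and selector below are just examples, not part of this project), fetching a page and selecting elements with Jsoup looks like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupHelloWorld {
    public static void main(String[] args) throws Exception {
        // issue the HTTP request and parse the response into a DOM
        Document doc = Jsoup.connect("http://jsoup.org/").get();
        // pick elements with CSS selectors, much like jQuery's $("a[href]")
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}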

The crawler utility class

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CrawUtil {

    private static final String USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)";
    private static final int TIME_OUT = 60000;

    /**
     * Fetch the page at the given url and return it as a Document.
     * @param uri the url to crawl
     * @return the parsed Document
     * @throws IOException
     */
    public static Document getDoc(String uri) throws IOException {
        return Jsoup.connect(uri).userAgent(USER_AGENT)
                .ignoreContentType(true).timeout(TIME_OUT).get();
    }
}
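
A quick usage sketch of the utility (the URL is only a placeholder, and CrawUtilDemo is a made-up class name):

import java.io.IOException;

import org.jsoup.nodes.Document;

public class CrawUtilDemo {
    public static void main(String[] args) throws IOException {
        // fetch a page through the utility and poke at the parsed DOM
        Document doc = CrawUtil.getDoc("http://jsoup.org/");
        System.out.println(doc.title());
        // jQuery-style CSS selection on the parsed document
        System.out.println(doc.select("a[href]").size() + " links found");
    }
}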


Crawling "beauty" wallpapers from the Sogou wallpaper gallery
(programmers get bored sometimes)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.commons.lang3.StringUtils;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import net.sf.json.JSONArray;
import net.sf.json.JSONObject;

import com.luqiao.core.util.craw.CrawUtil;
import com.luqiao.core.util.httpclient.HttpClientUtil;

/**
 * Crawl images from the Sogou wallpaper site
 * @author anfu.yang
 * @date 2015-06-02 09:49:43
 */
public class CrawImages {

    private static String url = "http://bizhi.sogou.com/cate/getCate/4/";
    private static String detail_url = "http://bizhi.sogou.com/detail/info/";
    private static String wp_id = "1171179";
    private static String downloadPrefix = "H:\\Downloads\\";

    /**
     * Fetch wallpaper IDs from the Sogou wallpaper category feed (returns JSON)
     * @throws IOException
     */
    @SuppressWarnings("unchecked")
    public static List<String> getCategory() throws IOException {
        String getCate = HttpClientUtil.doGet(url + wp_id, null);
        JSONObject obj = JSONObject.fromObject(getCate);
        // remember the smallest wallpaper id so the next call pages further back
        wp_id = obj.getString("min_wp_id");
        JSONArray array = obj.getJSONArray("wallpapers");
        List<String> wpids = new ArrayList<String>();
        for (int i = 0; i < array.size(); i++) {
            Map<String, Object> wallpaper = JSONObject.fromObject(array.get(i));
            wpids.add((String) wallpaper.get("wp_id"));
        }
        return wpids;
    }

    /**
     * Get the image url from a wallpaper's detail page
     * @param wpid
     * @return
     * @throws IOException
     */
    public static String getImgUrl(String wpid) throws IOException {
        Document detail = CrawUtil.getDoc(detail_url + wpid);
        String img_url = "";
        Elements elements = detail.getElementsByClass("unews_wp_big");
        if (elements.size() > 0) {
            Element img = elements.get(0).getElementsByTag("img").get(0);
            img_url = img.attr("src");
        }
        return img_url;
    }

    /**
     * Download the image
     * @param imgUrl
     */
    public static void downloadImg(String imgUrl) {
        // the file name is everything after the last "/" in the url
        String suffix = StringUtils.substring(imgUrl, StringUtils.lastIndexOf(imgUrl, "/") + 1);
        HttpClientUtil.download(imgUrl, downloadPrefix + suffix);
    }

    public static void main(String[] args) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(20);
        List<String> wpids = getCategory();
        execute(wpids, pool);
    }

    /**
     * Recursively fetch image ids and submit download tasks to the pool
     * @param wpids
     * @param pool
     * @throws IOException
     */
    public static void execute(List<String> wpids, ExecutorService pool) throws IOException {
        for (int i = 0; i < wpids.size(); i++) {
            Download t = new CrawImages().new Download();
            t.setWpid(wpids.get(i));
            pool.execute(t);
            // once the last id of this batch is queued, fetch the next batch and recurse
            if ((i + 1) == wpids.size()) {
                wpids.clear();
                wpids = getCategory();
                execute(wpids, pool);
            }
        }
        pool.shutdown();
    }

    /**
     * Download task run by the thread pool
     * @author anfu.yang
     * @date 2015-06-03 11:22:28
     */
    class Download implements Runnable {
        String wpid;

        @Override
        public void run() {
            try {
                String img_url = getImgUrl(wpid);
                System.out.println("Downloading... : " + img_url);
                downloadImg(img_url);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public String getWpid() {
            return wpid;
        }

        public void setWpid(String wpid) {
            this.wpid = wpid;
        }
    }
}


A quick rundown of the implementation:

1. First, go to the Sogou wallpaper categories and find the URL of the "beauty" category.

2. Watch the page's JS requests and capture the JSON they return.

3. Parse that JSON to get each image's detail page.

4. Parse the detail page's elements to get the image URL.

5. Run the downloads on multiple threads (a download sketch follows below).
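
HttpClientUtil.download above is a project-specific helper that isn't shown here. As a rough, hypothetical sketch of that last step, the same download could be done with Jsoup itself (the image URL and target folder are placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ImageDownloadSketch {
    public static void main(String[] args) throws Exception {
        String imgUrl = "http://example.com/some-wallpaper.jpg"; // placeholder url
        // ignoreContentType lets Jsoup fetch a non-HTML response;
        // maxBodySize(0) lifts the default 1MB body limit for large images
        Connection.Response resp = Jsoup.connect(imgUrl)
                .ignoreContentType(true)
                .maxBodySize(0)
                .execute();
        // the file name is everything after the last "/" in the url
        String fileName = imgUrl.substring(imgUrl.lastIndexOf('/') + 1);
        Files.write(Paths.get("H:\\Downloads\\", fileName), resp.bodyAsBytes());
        System.out.println("Saved " + fileName);
    }
}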

If you're interested, try putting together a similar crawler for Sogou Image Search. I won't paste that code here.