
Java Crawler, HttpClient Demo 2: Simulating a Browser and Downloading a Web Image

2017-12-04 11:57

Project hosting: Gitee repository:

https://gitee.com/HDMBS/JavaSpiderDemo.git

This program depends on the following Maven artifacts:

<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.3</version>
</dependency>

<dependency>
<groupId>net.sourceforge.htmlcleaner</groupId>
<artifactId>htmlcleaner</artifactId>
<version>2.9</version>
</dependency>

<!-- https://mvnrepository.com/artifact/commons-io/commons-io (used to copy the downloaded resource) -->
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.5</version>
</dependency>

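import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.FileUtils;
import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// The class name is a placeholder; the original post shows only the main method.
public class HttpClientDemo2 {

    public static void main(String[] args) throws IOException {
        // Simulate a real HTTP exchange and fetch an image; please open it with a text editor.
        /*
         * 1. Set the User-Agent request header:
         *    httpGet.setHeader("User-Agent",
         *        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0");
         *
         * 2. Read the response Content-Type:
         *    HttpEntity entity = respond.getEntity();
         *    System.out.println(entity.getContentType().getValue());
         *
         * 3. Read the response status code:
         *    200: OK   403: Forbidden   404: Not Found   500: Internal Server Error
         *
         *    CloseableHttpResponse respond = httpclient.execute(httpGet);
         *    System.out.println(respond.getStatusLine().getStatusCode());
         *
         * 4. Copy the resource: Commons IO 2.5 copies the downloaded stream to a local file.
         */

        // URL to request
        final String URL = "https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1510654808536&di=957d74f32cf5983dfe1b5448a145038a&imgtype=0&src=http%3A%2F%2Fwww.people.com.cn%2Fmediafile%2Fpic%2F20160812%2F56%2F6695386280472753768.jpg";

        // Crawlers mostly use GET, which HTTP defines as a safe (read-only) method.
        // Request headers and other settings can be added to this request object.
        HttpGet httpGet = new HttpGet(URL);

        // Set the User-Agent header so that the request looks like it comes from a browser.
        httpGet.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0");

        // CloseableHttpClient (HttpClient 4.3 and later) plays the role of a simulated browser.
        // try-with-resources closes the response first and then the client, even when an
        // exception is thrown. The original closed the client before the response.
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
                // Send the GET request; the response object exposes the HTTP response details.
                CloseableHttpResponse respond = httpclient.execute(httpGet)) {

            // Print the status code.
            System.out.println(respond.getStatusLine().getStatusCode());

            // Get the response entity (the body).
            HttpEntity entity = respond.getEntity();
            if (entity != null) {
                // Print the response content type; the header may be absent.
                Header contentType = entity.getContentType();
                if (contentType != null) {
                    System.out.println(contentType.getValue());
                }

                // Read the body as a stream and copy it to a local file with Commons IO.
                // The target path also determines the saved file's name.
                try (InputStream content = entity.getContent()) {
                    FileUtils.copyToFile(content, new File("E:/spider_depot/ssdfsd.jpg"));
                }
            }

            // For a text response, the entity could instead be converted to a string
            // with an explicit encoding:
            /* String entitystr = EntityUtils.toString(entity, "utf-8"); */
            /* System.out.println(entitystr); */
        }
    }
}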
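Notes 2 and 3 in the program's comment block (content type and status code) can also drive what the crawler does, instead of only being printed. The following is a minimal sketch, not part of the original project: the class ResponseChecks and both of its methods are hypothetical helpers that use only the JDK, and the list of image types is an assumption that covers just a few common cases.

```java
// Hypothetical helper; not part of the original post or of HttpClient.
final class ResponseChecks {

    private ResponseChecks() {
    }

    /** Returns true for 2xx status codes, such as the value of getStatusLine().getStatusCode(). */
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    /**
     * Maps a Content-Type header value such as "image/jpeg" or
     * "image/png; charset=binary" to a file extension. Only a few common
     * image types are covered; anything else falls back to "bin".
     */
    static String extensionFor(String contentType) {
        if (contentType == null) {
            return "bin";
        }
        // Drop parameters after ';' and normalise whitespace and case.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(java.util.Locale.ROOT);
        switch (mediaType) {
            case "image/jpeg":
                return "jpg";
            case "image/png":
                return "png";
            case "image/gif":
                return "gif";
            case "image/webp":
                return "webp";
            default:
                return "bin";
        }
    }
}
```

In the program above, one could call ResponseChecks.isSuccess(respond.getStatusLine().getStatusCode()) before copying the stream, and build the target file name from ResponseChecks.extensionFor(...) applied to the Content-Type value, so that a 403 error page is not saved under a .jpg name.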


