
Java Crawler Study Diary 1: Introduction to Basic Crawler Principles

2016-04-22 17:18

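Understanding URLs
1. URI

What is a URI? Every resource available on the web, such as an HTML document, image, video, or program, is identified by a Uniform Resource Identifier (URI).
A URI usually consists of three parts:
the naming scheme used to access the resource;
the name of the host that stores the resource;
the name of the resource itself, expressed as a path.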

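Take the following URI:
http://www.webmonkey.com.cn/html/html40/
It can be read as follows: the resource is accessed over the HTTP protocol, lives on the host www.webmonkey.com.cn, and is reached through the path "/html/html40/".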
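
The three parts can also be pulled out programmatically. The following is a minimal sketch using the JDK's java.net.URI; the class name UriParts is hypothetical and exists only for this illustration.

```java
import java.net.URI;

/** Splits a URI into the three parts described above (illustrative class). */
final class UriParts {
    final String scheme; // naming scheme used to access the resource, e.g. "http"
    final String host;   // host that stores the resource
    final String path;   // name of the resource on that host

    UriParts(String text) {
        // URI.create throws IllegalArgumentException if the syntax is invalid
        URI uri = URI.create(text);
        this.scheme = uri.getScheme();
        this.host = uri.getHost();
        this.path = uri.getPath();
    }

    public static void main(String[] args) {
        UriParts p = new UriParts("http://www.webmonkey.com.cn/html/html40/");
        // prints: http | www.webmonkey.com.cn | /html/html40/
        System.out.println(p.scheme + " | " + p.host + " | " + p.path);
    }
}
```
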
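2. URL
A URL is a subset of URI. The name stands for Uniform Resource Locator. A URL is a string that describes an information resource on the Internet, and it is used mainly by WWW client and server programs.
A URL consists of three parts:
the protocol (also called the service type);
the address of the host that stores the resource, as a domain name or IP address (sometimes with a port);
the specific location of the resource on the host, such as a directory and file name.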

Example of an HTTP URL
Example: http://www.baidu.com/talk/talk.htm
The host's domain name is www.baidu.com, and the hypertext file (file type ".htm") is "talk.htm" in the directory "/talk".

Example of a file URL
Example: file://ftp.youku.com/pub/files/foobar.txt
This URL refers to a file named "foobar.txt" stored in the directory "pub/files/" on the host ftp.youku.com.
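
To make the protocol, host, port, and path split concrete, here is a small sketch built on java.net.URL. The class and method names (UrlParts, describe) are hypothetical. It assumes that a missing port should be reported as the protocol's default, which is 80 for HTTP.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlParts {
    /** Returns "protocol host port path" for the given URL string. */
    static String describe(String text) {
        try {
            URL url = new URL(text);
            // getPort() returns -1 when the URL has no explicit port;
            // getDefaultPort() then supplies the protocol's default port.
            int port = url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
            return url.getProtocol() + " " + url.getHost() + " " + port + " " + url.getPath();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a valid URL: " + text, e);
        }
    }
}
```

For example, describe("http://www.baidu.com/talk/talk.htm") yields "http www.baidu.com 80 /talk/talk.htm", and adding ":8080" after the host changes the reported port to 8080.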

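Fetching page content by URL
The previous section covered how a URL is built; this section explains how to fetch a web page from a URL. Fetching a page means reading the network resource that the URL points to from the network stream and saving it locally. It works like a program imitating a browser: the URL is sent to the server as part of an HTTP request, and the server's response is then read.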
GET method:
Create a URL object from the URL string
java.net.URL url = new URL(path);

Open a network stream from the URL object
InputStream stream = url.openStream();
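
The two calls above can be combined into a small helper that reads the whole stream into a string. This is a sketch rather than a complete fetcher: the names SimpleFetcher and fetch are hypothetical, and it assumes the content is UTF-8 encoded, which is not true of every page.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class SimpleFetcher {
    /** Reads everything the URL's stream provides and decodes it as UTF-8 (an assumption). */
    static String fetch(URL url) throws IOException {
        // try-with-resources closes the stream even if reading fails
        try (InputStream in = url.openStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) { // -1 marks the end of the stream
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

Because openStream() also works for file: URLs, the helper can be tried on a local file before pointing it at a real site.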

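In real projects the network environment is more complicated. Imitating a browser client with only the java.net API takes a lot of code: you have to handle HTTP status codes, configure HTTP proxies, deal with HTTPS, and so on. To simplify application development, Apache's open-source HTTP client project, HttpClient, is commonly used instead. For example: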
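Create a client, which is like opening a browser
HttpClient httpClient = new org.apache.commons.httpclient.HttpClient();

Create a GET method, which is like typing an address into the browser's address bar
GetMethod getMethod = new org.apache.commons.httpclient.methods.GetMethod(path); // path is the URL string

Execute it; the return value is the response status code
int statusCode = httpClient.executeMethod(getMethod);

Handle only responses whose status code is 200 (request succeeded)
statusCode == HttpStatus.SC_OK

Get the response body as a stream
InputStream input = getMethod.getResponseBodyAsStream();

Open a file output stream
String filename = "output path" + outputFileName; // placeholders: output directory and file name

OutputStream output = new FileOutputStream(filename);

Write to the file
int tempByte;
// read() returns -1 at end of stream; 0 is a valid byte value
while ((tempByte = input.read()) != -1) {
    output.write(tempByte);
}

Close the input and output streams
input.close();
output.close();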
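
One detail matters in a byte-copy loop like the one above: InputStream.read() returns -1 at the end of the stream, while 0 is an ordinary byte value, so the loop should stop on -1 and not on 0. The sketch below shows a buffered copy helper that follows this rule; the names StreamCopy and copy are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class StreamCopy {
    /** Copies all bytes from in to out and returns how many were copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int n;
        // read(byte[]) returns the number of bytes read, or -1 at end of stream
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {'a', 0, 'b'}; // contains a zero byte in the middle
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        System.out.println(copy(new ByteArrayInputStream(data), out)); // prints 3
    }
}
```

Reading into a buffer also avoids one method call per byte, which is noticeably faster for large pages.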

The following code can be run directly (with Commons HttpClient 3.x on the classpath):

package spider;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

/**
 *
 * @author CallMeWhy
 *
 */
public class Spider {
    private static HttpClient httpClient = new HttpClient();

    /**
     * @param path
     *            link to the target page
     * @return true if the target page was downloaded successfully
     * @throws Exception
     *             I/O error while reading the page stream or writing the local file
     */
    public static boolean downloadPage(String path) throws Exception {
        // Input and output streams
        InputStream input = null;
        OutputStream output = null;
        // Create the GET method
        GetMethod getMethod = new GetMethod(path);
        try {
            // Execute the request; the return value is the status code
            int statusCode = httpClient.executeMethod(getMethod);
            // For simplicity, only status code 200 is handled
            if (statusCode == HttpStatus.SC_OK) {
                input = getMethod.getResponseBodyAsStream();
                // Derive the file name from the URL
                String filename = path.substring(path.lastIndexOf('/') + 1)
                        + ".html";
                // Open the file output stream
                output = new FileOutputStream(filename);
                // Write to the file; read() returns -1 at end of stream
                int tempByte;
                while ((tempByte = input.read()) != -1) {
                    output.write(tempByte);
                }
                return true;
            }
            return false;
        } finally {
            // Close the streams and release the connection, even on failure
            if (input != null) {
                input.close();
            }
            if (output != null) {
                output.close();
            }
            getMethod.releaseConnection();
        }
    }

    public static void main(String[] args) {
        try {
            // Fetch the Baidu home page and save it to a file
            Spider.downloadPage("https://www.baidu.com");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
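
The file name in the listing above is taken from the text after the last "/", so a URL that ends in a slash produces just ".html". One possible way to handle that case is sketched below. The helper PageNames.fileNameFor is hypothetical and not part of HttpClient, and the fallbacks it uses (host name, then "index") are assumptions made for this example.

```java
import java.net.URI;

final class PageNames {
    /** Hypothetical helper: derives a local file name from a page URL. */
    static String fileNameFor(String pageUrl) {
        URI uri = URI.create(pageUrl);
        String path = uri.getPath();
        // Drop trailing slashes so ".../html40/" yields "html40"
        while (path != null && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        // With no path left, fall back to the host name (assumption)
        String name = (path == null || path.isEmpty())
                ? uri.getHost()
                : path.substring(path.lastIndexOf('/') + 1);
        if (name == null) {
            name = "index"; // last-resort fallback (assumption)
        }
        // Keep an existing .htm/.html extension, otherwise append ".html"
        return name.endsWith(".html") || name.endsWith(".htm") ? name : name + ".html";
    }
}
```

With this helper, "https://www.baidu.com" maps to "www.baidu.com.html" and "http://www.baidu.com/talk/talk.htm" maps to "talk.htm".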

POST method:

package spider;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.PostMethod;

public class PostSpider {
    private static HttpClient httpClient = new HttpClient();
    // Configure the proxy server
    static {
        // Proxy server IP address and port
        httpClient.getHostConfiguration().setProxy("127.0.0.1", 8080);
    }

    public static boolean downloadPage(String path) throws HttpException, IOException {
        boolean flag = false;
        InputStream input = null;
        OutputStream output = null;
        PostMethod postMethod = new PostMethod(path);
        // Set the parameters of the POST request
        NameValuePair[] postData = new NameValuePair[2];
        postData[0] = new NameValuePair("name", "xxxxxx");
        postData[1] = new NameValuePair("password", "xxxxxx");
        postMethod.addParameters(postData);
        try {
            // Execute the request; the return value is the status code
            int statusCode = httpClient.executeMethod(postMethod);
            // Other status codes could be handled too; only 200 is handled here
            if (statusCode == HttpStatus.SC_OK) {
                input = postMethod.getResponseBodyAsStream();
                // File name
                String filename = path.substring(path.lastIndexOf('/') + 1)
                        + ".html";
                // Open the file output stream
                output = new FileOutputStream(filename);
                // Write to the file; read() returns -1 at end of stream
                int tempByte;
                while ((tempByte = input.read()) != -1) {
                    output.write(tempByte);
                }
                flag = true;
            }
        } finally {
            // Close the streams and release the connection, even on failure
            if (input != null) {
                input.close();
            }
            if (output != null) {
                output.close();
            }
            postMethod.releaseConnection();
        }
        return flag;
    }

    public static void main(String[] args) {
        try {
            PostSpider.downloadPage("https://www.baidu.com");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

In the code above, change the proxy server settings and the POST parameters to suit your environment.
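
For reference, POST parameters such as the name/password pairs above are normally sent in the request body in application/x-www-form-urlencoded form. The sketch below builds such a body with the JDK's URLEncoder so the format is visible. FormBody is a hypothetical class name, and UTF-8 is an assumed charset that may differ from what a given client or server uses.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormBody {
    /** Encodes name/value pairs as application/x-www-form-urlencoded text. */
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&'); // pairs are separated by '&'
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("name", "a b");
        fields.put("password", "p&w=1");
        System.out.println(encode(fields)); // prints name=a+b&password=p%26w%3D1
    }
}
```
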
This article comes from the “西越” blog; please retain this source link: http://yiqiuqiuqiu.blog.51cto.com/5079820/1766789
