
Java Network Programming (1) - A Java Web Crawler - Scraping Your Own CSDN Blog Titles and View Counts (with Source Code)

2016-05-29 12:00
Copyright notice: this article's original URL is http://blog.csdn.net/caib1109/article/details/51518790

Non-commercial reposting is welcome; the author reserves all rights.

What is a crawler

What technologies does a Java crawler need

What are the advantages of a Spring-based Java crawler
1 Scheduled execution via the Spring Task component

2 Spring's dependency injection (DI) lowers the coupling between site-specific crawlers

3 Spring's @Value makes reading configured URLs or database settings easy

Detailed design of a Spring-based Java crawler
1 Project class diagram

2 Sending POST/GET requests with apache.httpclient

3 HTML parsing - using the jericho package

An example - scraping all CSDN posts and their view counts

4 Parameterized crawler configuration

5 Coordinating the crawler with multiple threads

6 Scheduled startup

0 What is a crawler

The web holds a vast amount of information. A search for the keyword "crawler", for example, returns 1,000,000 results; no one can check by hand which of them are actually needed.

The purpose of a crawler, then, is to fetch web pages automatically and save the useful information.

1 What technologies does a Java crawler need

Sending POST and GET requests to the target site

Parsing the HTML pages the target site returns to extract the useful information

Writing the scraped information to a file

Scheduled startup, e.g. crawling once a day at 23:00 to check for updates

To sum up: a Java crawler touches on HTML parsing of front-end pages, the HTTP protocol, and basic Java file I/O, which makes it an ideal introductory project for Java network programming. Everyone doing Java network programming should build a Java crawler.

2 What are the advantages of a Spring-based Java crawler

2.1 Scheduled execution via the Spring Task component

2.2 Spring's dependency injection (DI) lowers the coupling between site-specific crawlers
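
As an illustration (a hypothetical sketch, not code from this project; SiteCrawler and the class names are invented), every site-specific crawler can implement one shared interface, and Spring injects the concrete implementations, so the coordinating code never names a specific site:

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

// each type would live in its own source file
public interface SiteCrawler {
    void crawl();
}

@Component
class CsdnCrawler implements SiteCrawler {
    public void crawl() { /* fetch and parse CSDN pages */ }
}

@Component
class CrawlerRunner {
    @Autowired
    private List<SiteCrawler> crawlers; // Spring injects every SiteCrawler bean it finds

    public void runAll() {
        for (SiteCrawler c : crawlers) {
            c.crawl();
        }
    }
}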

2.3 Spring's @Value makes reading configuration (URLs or database settings) easy

Concretely:

@Repository
public class RewardsTestDatabase {

    @Value("#{jdbcProperties.databaseName}")
    public void setDatabaseName(String dbName) { … }

    @Value("#{jdbcProperties.databaseKeyGenerator}")
    public void setKeyGenerator(KeyGenerator kg) { … }
}


Here "jdbcProperties" is a Properties bean configured in applicationContext.xml; its id must match the bean name used in the #{...} expressions:

<!-- exposes src/jdbcProperties.properties as a Properties bean named "jdbcProperties" -->
<bean id="jdbcProperties"
    class="org.springframework.beans.factory.config.PropertiesFactoryBean">
    <property name="fileEncoding" value="UTF-8"></property>
    <property name="locations">
        <list>
            <value>classpath:jdbcProperties.properties</value>
        </list>
    </property>
</bean>


Terminology:

Spring Expression Language - "#{strategyBean.databaseKeyGenerator}". Spring EL is a new feature of Spring 3.

3 Detailed design of the Spring-based Java crawler

3.1 Project class diagram

3.2 Sending POST/GET requests with apache.httpclient

Dependencies:

apache.httpclient 4.5.2 - HttpGet, HttpPost

apache.httpcore 4.4 - BasicNameValuePair implements NameValuePair

commons-logging.jar - the logging package; it must be on the classpath, otherwise httpclient 4.5.2 fails at runtime with NoClassDefFoundError: org/apache/commons/logging/LogFactory

For detailed usage, see the thorough and accurate write-up by wangpeng047@CSDN.

Below are the GET/POST request helpers I wrote:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.DefaultProxyRoutePlanner;
import org.apache.http.util.EntityUtils;

public class HttpRequestTool {
    private static HttpHost proxy;

    /**
     * Set an HTTP proxy for the httpclient.
     *
     * @param proxyHost
     *            e.g. 127.0.0.1
     * @param port
     *            e.g. 8080
     * @return true if the proxy was accepted and set
     */
    public static boolean setProxy(String proxyHost, String port) {
        if (proxyHost == null || port == null)
            return false;
        proxyHost = proxyHost.trim();
        port = port.trim();
        /*
         * Validating each octet of an IPv4 address:
         * 0-9     is matched by \\d
         * 10-99   is matched by [1-9]\\d
         * 100-199 is matched by 1\\d\\d
         * 200-249 is matched by 2[0-4]\\d
         * 250-255 is matched by 25[0-5]
         * (x|y|z) is alternation; ^...$ anchors the whole string
         */
        if (!Pattern.compile(
                "^((\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.){3}(\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])$")
                .matcher(proxyHost).find())
            return false;
        if (Pattern.compile("[^\\d]").matcher(port).find())
            return false;
        int iPort = Integer.parseInt(port);
        if (iPort > 65535)
            return false;
        proxy = new HttpHost(proxyHost, iPort);
        return true;
    }

    /**
     * Simple GET without headers or parameters.
     *
     * @param host
     * @param resourcePath
     * @return the response body
     * @throws URISyntaxException
     * @throws IOException
     */
    public static String getMethod(String host, String resourcePath) throws URISyntaxException, IOException {
        return getMethod("http", host, null, resourcePath, null, null);
    }

    /**
     * GET with headers and parameters.
     *
     * @param protocol
     * @param host
     * @param port
     * @param resourcePath
     * @param headKeyValueArray
     * @param paraKeyValueList
     * @return the response body
     * @throws URISyntaxException
     * @throws IOException
     */
    public static String getMethod(String protocol, String host, String port, String resourcePath,
            Header[] headKeyValueArray, List<NameValuePair> paraKeyValueList)
            throws URISyntaxException, IOException {
        URIBuilder builder = new URIBuilder().setScheme(protocol).setHost(host);
        if (port != null)
            builder.setPort(Integer.parseInt(port));
        if (resourcePath != null)
            builder.setPath("/" + resourcePath);
        // GET request parameters
        if (paraKeyValueList != null)
            builder.addParameters(paraKeyValueList); // non-ASCII parameters are URL-encoded automatically
        // do not use the deprecated httpGet.setParams(HttpParams) method
        URI uri = builder.build();
        HttpGet httpGet = new HttpGet(uri);
        if (headKeyValueArray != null)
            httpGet.setHeaders(headKeyValueArray);
        CloseableHttpClient httpclient = (proxy == null)
                ? HttpClients.createDefault()
                : HttpClients.custom().setRoutePlanner(new DefaultProxyRoutePlanner(proxy)).build();
        BufferedReader br = null;
        InputStreamReader isr = null;
        CloseableHttpResponse httpResponse = null;
        try {
            httpResponse = httpclient.execute(httpGet);
            System.out.println(httpResponse.getStatusLine());
            HttpEntity bodyEntity = httpResponse.getEntity();
            // assume UTF-8 (CSDN pages are UTF-8); adjust for other sites
            isr = new InputStreamReader(bodyEntity.getContent(), "UTF-8");
            br = new BufferedReader(isr);
            StringBuilder httpBody = new StringBuilder();
            String resTemp;
            while ((resTemp = br.readLine()) != null) {
                resTemp = resTemp.trim();
                if (!"".equals(resTemp))
                    httpBody.append(resTemp).append("\n");
            }
            EntityUtils.consume(bodyEntity);
            return httpBody.toString();
        } finally {
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (isr != null) {
                try {
                    isr.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (httpResponse != null) {
                try {
                    httpResponse.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            try {
                httpclient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    /**
     * POST with headers and parameters.
     *
     * @param protocol
     * @param host
     * @param port
     * @param resourcePath
     * @param headKeyValueArray
     * @param paraKeyValueList
     * @return the response body
     * @throws IOException
     * @throws URISyntaxException
     */
    public static String postMethod(String protocol, String host, String port, String resourcePath,
            Header[] headKeyValueArray, List<NameValuePair> paraKeyValueList)
            throws IOException, URISyntaxException {
        CloseableHttpClient httpclient = (proxy == null)
                ? HttpClients.createDefault()
                : HttpClients.custom().setRoutePlanner(new DefaultProxyRoutePlanner(proxy)).build();
        CloseableHttpResponse httpResponse = null;
        try {
            URIBuilder builder = new URIBuilder().setScheme(protocol).setHost(host);
            if (port != null) {
                builder.setPort(Integer.parseInt(port));
            }
            if (resourcePath != null) {
                builder.setPath("/" + resourcePath);
            }
            URI uri = builder.build();
            HttpPost httpPost = new HttpPost(uri);
            if (headKeyValueArray != null) {
                httpPost.setHeaders(headKeyValueArray);
            }
            // POST parameters travel in the request body as a UTF-8 form entity
            if (paraKeyValueList != null) {
                httpPost.setEntity(new UrlEncodedFormEntity(paraKeyValueList, "UTF-8"));
            }
            httpResponse = httpclient.execute(httpPost);
            System.out.println(httpResponse.getStatusLine());
            return EntityUtils.toString(httpResponse.getEntity(), "UTF-8");
        } finally {
            if (httpResponse != null) {
                httpResponse.close();
            }
            httpclient.close();
        }
    }
}

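A minimal usage sketch of the helper above (the proxy address is illustrative; skip setProxy for a direct connection):

public class HttpRequestToolDemo {
    public static void main(String[] args) throws Exception {
        // optional: route all requests through a local proxy
        HttpRequestTool.setProxy("127.0.0.1", "8080");
        // GET http://blog.csdn.net/caib1109 and print the body
        String html = HttpRequestTool.getMethod("blog.csdn.net", "caib1109");
        System.out.println(html);
    }
}
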

3.3 HTML parsing - using the jericho package

jericho-html-3.4.jar requires JDK 7 or above

It depends on log4j-api-2.4.1.jar and log4j-core-2.4.1.jar

import java.io.IOException;
import java.net.URISyntaxException;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class CsdnGet {
    protected Logger logger = LogManager.getLogger(this.getClass());

    public void dealHtml() throws URISyntaxException, IOException {
        String str = HttpRequestTool.getMethod("write.blog.csdn.net", "postlist");
        // build the jericho parse tree (Source) from the raw HTML
        Source source = new Source(str);
        // common ways to locate HTML elements:
        Element element = source.getElementById("elementid");
        //Element ele = source.getFirstElementByClass("elementclass");
        //List<Element> eleList = source.getAllElementsByClass("elementclass");
        //List<Element> children = element.getChildElements(); // all child tags; especially useful for <table>
        // extract an element's text content as a String
        logger.info(element.getTextExtractor().toString());
    }
}


An example - scraping all CSDN posts and their view counts

import java.io.IOException;
import java.net.URISyntaxException;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

import org.apache.http.Header;
import org.apache.http.message.BasicHeader;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import dto.Title_Num;

public class CsdnGet {
    protected Logger logger = LogManager.getLogger(this.getClass());
    private static final String articleListBox = "lstBox",
                                pageBox = "page_nav";

    public void getHtml() {
        String str = null;
        try {
            // set your own proxy here, or remove this call for a direct connection
            HttpRequestTool.setProxy("10.37.84.117", "8080");
            Header[] headerList = {
                    new BasicHeader("Host", "write.blog.csdn.net"),
                    new BasicHeader("User-Agent",
                            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"),
                    new BasicHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
                    new BasicHeader("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3"),
                    new BasicHeader("Accept-Encoding", "gzip, deflate"),
                    new BasicHeader(
                            "Cookie",
                            "/* use a packet-capture tool to grab the cookie of your own CSDN blog homepage */"),
                    new BasicHeader("Connection", "keep-alive") };
            // collects every title and its view count
            List<Title_Num> itemlist = new LinkedList<Title_Num>();
            str = HttpRequestTool.getMethod("http", "write.blog.csdn.net", "80", "postlist", headerList, null);
            Source source = new Source(str);
            getArticlesOnePage(source, itemlist);
            // locate the html element that holds the total page count
            String pageInfo = source.getFirstElementByClass(pageBox).getFirstElement("span").getTextExtractor().toString();
            // extract the total page count with a regular expression
            Matcher matcher = Pattern.compile("[^\\d](\\d{1,})[^\\d]").matcher(pageInfo);
            int iTotalPage = 1; // default to a single page if no count is found
            if (matcher.find())
                iTotalPage = Integer.parseInt(matcher.group(1));
            if (iTotalPage > 1) {
                for (int i = 2; i <= iTotalPage; i++) {
                    String pageSuffix = String.format("postlist/0/0/enabled/%d", i);
                    str = HttpRequestTool.getMethod("http", "write.blog.csdn.net", "80", pageSuffix, headerList, null);
                    source = new Source(str);
                    getArticlesOnePage(source, itemlist);
                }
            }
            // print the results
            for (Title_Num title_Num : itemlist) {
                System.out.println(title_Num.getTitle() + " : " + title_Num.getNumber());
            }
        } catch (URISyntaxException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void getArticlesOnePage(Source source, List<Title_Num> itemlist) {
        // parse the article list of one result page
        List<Element> articles = source.getElementById(articleListBox).getChildElements();
        articles.remove(0); // skip the first child, the list header row
        for (Element article : articles) {
            int col = 0;
            Title_Num title_Num = new Title_Num();
            for (Element column : article.getChildElements()) {
                if (col == 0) // first column: article title
                    title_Num.setTitle(column.getTextExtractor().toString());
                if (col == 2) // third column: view count
                    title_Num.setNumber(Integer.parseInt(column.getTextExtractor().toString()));
                col++;
            }
            itemlist.add(title_Num);
        }
    }

    public static void main(String[] args) {
        new CsdnGet().getHtml();
    }
}

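The Title_Num DTO imported above is not listed in the original post; a minimal sketch consistent with how it is used:

package dto;

public class Title_Num {
    private String title; // article title
    private int number;   // view count

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public int getNumber() { return number; }
    public void setNumber(int number) { this.number = number; }
}
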

3.4 Parameterized crawler configuration
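
The original outline leaves this section empty. As a sketch of what it might contain, the @Value technique from section 2.3 can be applied to the crawler's own settings (crawlerProperties and the key names below are assumptions, mirroring the jdbcProperties setup):

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class CrawlerConfig {
    // assumes a Properties bean named "crawlerProperties" defined in
    // applicationContext.xml, analogous to "jdbcProperties" in section 2.3
    @Value("#{crawlerProperties.proxyHost}")
    private String proxyHost;

    @Value("#{crawlerProperties.proxyPort}")
    private String proxyPort;

    public String getProxyHost() { return proxyHost; }
    public String getProxyPort() { return proxyPort; }
}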

3.5 Coordinating the crawler with multiple threads
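
This section is likewise only outlined. One plausible arrangement (an assumption, not the author's actual design) is to fetch the result pages concurrently with a fixed-size thread pool:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerPool {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 worker threads
        for (int page = 1; page <= 10; page++) {
            final int p = page;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // fetch one result page; the URL pattern comes from the example
                        // above (request headers omitted here for brevity)
                        String html = HttpRequestTool.getMethod("write.blog.csdn.net",
                                String.format("postlist/0/0/enabled/%d", p));
                        System.out.println("page " + p + ": " + html.length() + " chars");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown(); // accept no new tasks; let the submitted ones finish
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}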

3.6 Scheduled startup

Suppose we want to scrape our own CSDN blog titles and view counts every day and compare them with yesterday's numbers to see how much each article's readership has grown.

Do we have to start the crawler process by hand every day?

No. Spring's Task component provides exactly this scheduled-startup capability.
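
A minimal sketch with Spring's @Scheduled annotation (the class name is illustrative; the cron expression fires every day at 23:00, matching the schedule suggested in section 1):

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class DailyCrawlTask {
    // requires <task:annotation-driven/> (or @EnableScheduling) in the Spring config
    @Scheduled(cron = "0 0 23 * * *") // second minute hour day month weekday
    public void crawlDaily() {
        new CsdnGet().getHtml();
    }
}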
