Java Network Programming (1) - A Java Web Crawler - Scraping Your Own CSDN Blog Titles and View Counts (with Source Code)
2016-05-29 12:00
645 views
Copyright notice: original article at http://blog.csdn.net/caib1109/article/details/51518790
Non-commercial reposting is welcome; the author reserves all rights.
What is a crawler
What technologies a Java crawler needs
Advantages of a Java crawler built on the Spring framework
1 Scheduled execution provided by the Spring Task component
2 Spring dependency injection (DI) reduces coupling between site-specific crawlers
3 Spring's @Value makes reading URLs or database settings from config files easy
Detailed design of a Java crawler built on the Spring framework
1 Project class diagram
2 Sending POST/GET requests with apache.httpclient
3 HTML parsing - using the jericho package
An example - scraping all CSDN blog posts and their view counts
4 Parameterized crawler configuration
5 Coordinating the crawler with multiple threads
6 Scheduled startup
0 What is a crawler
The web holds a huge amount of information. A search for the keyword "crawler", for example, returns about 1,000,000 results, and no human could sift through them to find the ones that matter. The purpose of a crawler, therefore, is to fetch web pages automatically and save the useful information.
1 What technologies does a Java crawler need
Send POST and GET requests to the target site
Parse the HTML page the target site returns and extract the useful information
Write the extracted information to a file
Start on a schedule, e.g. crawl once every day at 23:00 to check for updates
Summary: a Java crawler exercises HTML parsing of front-end pages, the HTTP protocol, and basic Java file I/O, which makes it an unusually good introductory project for Java network programming. Everyone doing Java network programming should build a crawler at least once.
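Before bringing in Spring or HttpClient, the parse-and-save steps in the list above can be sketched with nothing but the JDK. The HTML snippet and the `extractTitles` helper below are invented for illustration; a real crawler would feed in a fetched page and write the results to a file rather than stdout:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlerSketch {

    // Pull the text of every <li class="title">...</li> out of an HTML fragment.
    // A real crawler would use a proper HTML parser instead of a regex.
    public static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<String>();
        Matcher m = Pattern.compile("<li class=\"title\">([^<]+)</li>").matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        // In a real crawler this string would be the body of an HTTP GET response.
        String html = "<ul><li class=\"title\">Post A</li><li class=\"title\">Post B</li></ul>";
        // Parse the HTML and keep only the useful information.
        for (String title : extractTitles(html)) {
            // A real crawler would append this to a file instead of printing it.
            System.out.println(title);
        }
    }
}
```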
2 What are the advantages of a Java crawler built on the Spring framework
2.1 Scheduled execution provided by the Spring Task component
2.2 Spring dependency injection (DI) reduces coupling between site-specific crawlers
2.3 Spring's @Value makes reading config files (URLs or database settings) easy
Concretely:

@Repository
public class RewardsTestDatabase {

    @Value("#{jdbcProperties.databaseName}")
    public void setDatabaseName(String dbName) { … }

    @Value("#{jdbcProperties.databaseKeyGenerator}")
    public void setKeyGenerator(KeyGenerator kg) { … }
}
Here "jdbcProperties" is configured in applicationContext.xml:
<!-- jdbcProperties.properties under the src directory -->
<bean id="config" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
    <property name="fileEncoding" value="UTF-8"></property>
    <property name="locations">
        <list>
            <value>classpath:jdbcProperties.properties</value>
        </list>
    </property>
</bean>
Terminology:
Spring Expression Language - e.g. "#{strategyBean.databaseKeyGenerator}". Spring EL is a new feature in Spring 3.
3 Detailed design of a Java crawler built on the Spring framework
3.1 Project class diagram
3.2 Sending POST/GET requests with apache.httpclient
Dependencies: apache.httpclient 4.5.2 - HttpGet, HttpPost
apache.httpcore 4.4 - BasicNameValuePair implements NameValuePair
commons-logging.jar - the logging package must be on the classpath, otherwise httpclient 4.5.2 fails at runtime with NoClassDefFoundError: org/apache/commons/logging/LogFactory
For detailed usage see the write-up by wangpeng047@CSDN, which is complete and accurate.
Below are the GET/POST request helpers I wrote:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.DefaultProxyRoutePlanner;
import org.apache.http.util.EntityUtils;

public class HttpRequestTool {

    private static HttpHost proxy;

    /**
     * Set an HTTP proxy for the client, e.g. setProxy("127.0.0.1", "8080").
     */
    public static boolean setProxy(String proxyHost, String port) {
        if (proxyHost == null || port == null)
            return false;
        proxyHost = proxyHost.trim();
        port = port.trim();
        /*
         * Each IPv4 octet (0-255) is matched by:
         *   0-9     \d
         *   10-99   [1-9]\d
         *   100-199 1\d\d
         *   200-249 2[0-4]\d
         *   250-255 25[0-5]
         * (a|b|c) is alternation; ^...$ anchors the whole string.
         */
        if (!Pattern.compile("^((\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\.){3}(\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])$")
                .matcher(proxyHost).find())
            return false;
        if (Pattern.compile("[^\\d]").matcher(port).find())
            return false;
        int iPort = Integer.parseInt(port);
        if (iPort > 65535)
            return false;
        proxy = new HttpHost(proxyHost, iPort);
        return true;
    }

    /**
     * Simple GET without headers or parameters.
     */
    public static String getMethod(String host, String resourcePath)
            throws URISyntaxException, IOException {
        return getMethod("http", host, null, resourcePath, null, null);
    }

    /**
     * GET with headers and parameters.
     */
    public static String getMethod(String protocol, String host, String port,
            String resourcePath, Header[] headKeyValueArray,
            List<NameValuePair> paraKeyValueList)
            throws URISyntaxException, IOException {
        URIBuilder builder = new URIBuilder().setScheme(protocol).setHost(host);
        if (port != null)
            builder.setPort(Integer.parseInt(port));
        if (resourcePath != null)
            builder.setPath("/" + resourcePath);
        // GET query parameters; non-ASCII values are UTF-8 encoded automatically.
        // Do not use the deprecated httpGet.setParams(HttpParams params).
        if (paraKeyValueList != null)
            builder.addParameters(paraKeyValueList);
        URI uri = builder.build();
        HttpGet httpGet = new HttpGet(uri);
        if (headKeyValueArray != null)
            httpGet.setHeaders(headKeyValueArray);
        CloseableHttpClient httpclient = (proxy == null)
                ? HttpClients.createDefault()
                : HttpClients.custom().setRoutePlanner(new DefaultProxyRoutePlanner(proxy)).build();
        BufferedReader br = null;
        InputStreamReader isr = null;
        CloseableHttpResponse httpResponse = null;
        try {
            httpResponse = httpclient.execute(httpGet);
            System.out.println(httpResponse.getStatusLine());
            HttpEntity bodyEntity = httpResponse.getEntity();
            isr = new InputStreamReader(bodyEntity.getContent());
            br = new BufferedReader(isr);
            StringBuilder httpBody = new StringBuilder();
            String resTemp;
            while ((resTemp = br.readLine()) != null) {
                resTemp = resTemp.trim();
                if (!"".equals(resTemp))
                    httpBody.append(resTemp).append("\n");
            }
            EntityUtils.consume(bodyEntity);
            return httpBody.toString();
        } finally {
            // close quietly, innermost first
            if (br != null) try { br.close(); } catch (IOException e) { e.printStackTrace(); }
            if (isr != null) try { isr.close(); } catch (IOException e) { e.printStackTrace(); }
            if (httpResponse != null) try { httpResponse.close(); } catch (IOException e) { e.printStackTrace(); }
        }
    }

    /**
     * POST with headers and form parameters.
     */
    public static String postMethod(String protocol, String host, String port,
            String resourcePath, Header[] headKeyValueArray,
            List<NameValuePair> paraKeyValueList)
            throws IOException, URISyntaxException {
        CloseableHttpClient httpclient = (proxy == null)
                ? HttpClients.createDefault()
                : HttpClients.custom().setRoutePlanner(new DefaultProxyRoutePlanner(proxy)).build();
        CloseableHttpResponse httpResponse = null;
        try {
            URIBuilder builder = new URIBuilder().setScheme(protocol).setHost(host);
            if (port != null) {
                builder.setPort(Integer.parseInt(port));
            }
            if (resourcePath != null) {
                builder.setPath("/" + resourcePath);
            }
            URI uri = builder.build();
            HttpPost httpPost = new HttpPost(uri);
            if (headKeyValueArray != null) {
                httpPost.setHeaders(headKeyValueArray);
            }
            // UTF-8 encoded form body. (The original mixed in HttpClient 3.x calls
            // such as postMethod.setRequestBody(); this is the 4.x equivalent.)
            if (paraKeyValueList != null) {
                httpPost.setEntity(new UrlEncodedFormEntity(paraKeyValueList, "UTF-8"));
            }
            httpResponse = httpclient.execute(httpPost);
            System.out.println(httpResponse.getStatusLine());
            return EntityUtils.toString(httpResponse.getEntity(), "UTF-8");
        } finally {
            if (httpResponse != null) {
                httpResponse.close();
            }
        }
    }
}
3.3 HTML parsing - using the jericho package
jericho-html-3.4.jar requires JDK 7 or later and depends on log4j-api-2.4.1.jar and log4j-core-2.4.1.jar.
import java.io.IOException;
import java.net.URISyntaxException;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

import org.apache.http.Header;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class CsdnGet {

    protected Logger logger = LogManager.getLogger(this.getClass());

    public void dealHtml(Header[] headerList) throws IOException, URISyntaxException {
        String str = HttpRequestTool.getMethod("http", "write.blog.csdn.net", "80", "postlist", headerList, null);
        // build the jericho tree structure (Source) from the page's HTML source
        Source source = new Source(str);
        // common ways to locate HTML tags:
        // Element ele = source.getElementById("elementid");
        // Element ele = source.getFirstElementByClass("elementclass");
        // List<Element> eleList = source.getAllElementsByClass("elementclass");
        // List<Element> eleList = ele.getChildElements(); // all child tags, particularly useful for <table>
        // and to get a tag's text content as a String:
        // ele.getTextExtractor().toString();
    }
}
An example - scraping all CSDN blog posts and their view counts
import java.io.IOException;
import java.net.URISyntaxException;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

import org.apache.http.Header;
import org.apache.http.message.BasicHeader;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import dto.Title_Num;

public class CsdnGet {

    protected Logger logger = LogManager.getLogger(this.getClass());
    private static final String articleListBox = "lstBox", pageBox = "page_nav";

    public void getHtml() {
        String str = null;
        try {
            HttpRequestTool.setProxy("10.37.84.117", "8080");
            Header[] headerList = {
                    new BasicHeader("Host", "write.blog.csdn.net"),
                    new BasicHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"),
                    new BasicHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
                    new BasicHeader("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3"),
                    new BasicHeader("Accept-Encoding", "gzip, deflate"),
                    new BasicHeader("Cookie", "/* capture your CSDN blog homepage cookie with a packet-sniffing tool */"),
                    new BasicHeader("Connection", "keep-alive") };
            // list that collects every title and view count
            List<Title_Num> itemlist = new LinkedList<Title_Num>();
            str = HttpRequestTool.getMethod("http", "write.blog.csdn.net", "80", "postlist", headerList, null);
            Source source = new Source(str);
            getArticlesOnePage(source, itemlist);
            // locate the HTML tag that holds the total page count
            String pageInfo = source.getFirstElementByClass(pageBox).getFirstElement("span").getTextExtractor().toString();
            // extract the total page count with a regular expression
            Matcher matcher = Pattern.compile("[^\\d](\\d{1,})[^\\d]").matcher(pageInfo);
            String sTotalPage = null;
            if (matcher.find())
                sTotalPage = matcher.group(1);
            int iTotalPage = Integer.parseInt(sTotalPage);
            if (iTotalPage > 1) {
                for (int i = 2; i <= iTotalPage; i++) {
                    String pageSuffix = String.format("postlist/0/0/enabled/%d", i);
                    str = HttpRequestTool.getMethod("http", "write.blog.csdn.net", "80", pageSuffix, headerList, null);
                    source = new Source(str);
                    getArticlesOnePage(source, itemlist);
                }
            }
            // print the results
            for (Title_Num title_Num : itemlist) {
                System.out.println(title_Num.getTitle() + title_Num.getNumber());
            }
        } catch (URISyntaxException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void getArticlesOnePage(Source source, List<Title_Num> itemlist) {
        // parse one page of the article list
        List<Element> articles = source.getElementById(articleListBox).getChildElements();
        articles.remove(0); // drop the header row
        for (Element article : articles) {
            int col = 0;
            Title_Num title_Num = new Title_Num();
            for (Element column : article.getChildElements()) {
                if (col == 0)
                    title_Num.setTitle(column.getTextExtractor().toString());
                if (col == 2)
                    title_Num.setNumber(Integer.parseInt(column.getTextExtractor().toString()));
                col++;
            }
            itemlist.add(title_Num);
        }
    }

    public static void main(String[] args) {
        new CsdnGet().getHtml();
    }
}
3.4 Parameterized crawler configuration
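The original article leaves this section empty. As a placeholder sketch, here is one plain-JDK way to externalize the crawler's target; the property names (`crawler.host` and so on) are invented for illustration, and in the Spring setup from section 2.3 the same values would instead be injected with `@Value`:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class CrawlerConfig {

    // Build "host:port/path" from properties text. In a real project the text
    // would come from a crawler.properties file on the classpath.
    public static String endpoint(String propsText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propsText));
        return props.getProperty("crawler.host") + ":"
                + props.getProperty("crawler.port") + "/"
                + props.getProperty("crawler.path");
    }

    public static void main(String[] args) throws IOException {
        String propsText = "crawler.host=write.blog.csdn.net\n"
                + "crawler.port=80\n"
                + "crawler.path=postlist\n";
        System.out.println(endpoint(propsText));
    }
}
```

Changing the target site then means editing the properties file, not recompiling the crawler.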
3.5 Coordinating the crawler with multiple threads
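This section is also left empty in the original. A common coordination pattern, sketched here with invented stand-ins for the real page fetches, is to hand each page number to a fixed thread pool and collect the bodies through `Future`s; in the real crawler each task would call `HttpRequestTool.getMethod(...)` for its page:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCrawl {

    // Fetch all pages concurrently on a small pool; results come back in page order.
    public static List<String> fetchAll(List<Integer> pages, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> futures = new ArrayList<Future<String>>();
        for (Integer page : pages) {
            // stand-in for HttpRequestTool.getMethod(...) on that page's URL
            futures.add(pool.submit(() -> "page-" + page));
        }
        List<String> bodies = new ArrayList<String>();
        for (Future<String> f : futures) {
            bodies.add(f.get()); // blocks until that page is done; order is preserved
        }
        pool.shutdown();
        return bodies;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> pages = new ArrayList<Integer>();
        for (int i = 1; i <= 5; i++) pages.add(i);
        System.out.println(fetchAll(pages, 3));
    }
}
```

Iterating the futures in submission order keeps the output deterministic even though the fetches themselves run in parallel.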
3.6 Scheduled startup
Suppose we want to scrape our own CSDN blog titles and view counts every day and compare them with the previous day's numbers to see how much each article's readership has grown. Does that mean starting the crawler process by hand every day?
No. Spring's Task component can handle the scheduled startup.
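With Spring Task the crawl method would carry an annotation such as `@Scheduled(cron = "0 0 23 * * ?")` to fire at 23:00 each day. For readers not using Spring, the JDK's `ScheduledExecutorService` gives the same effect; the tiny initial delay below stands in for the real "time until next 23:00" computation so the sketch finishes instantly:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DailyCrawlScheduler {

    // Schedule the crawl to repeat every 24 hours after an initial delay, and
    // report whether the first run happened within the wait budget.
    public static boolean runOnce(long initialDelayMillis) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch firstRun = new CountDownLatch(1);
        Runnable crawl = () -> {
            // the real task would call new CsdnGet().getHtml() here
            firstRun.countDown();
        };
        // initialDelay would normally be (next 23:00 - now); the period is one day
        scheduler.scheduleAtFixedRate(crawl, initialDelayMillis,
                TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
        boolean ran = firstRun.await(5, TimeUnit.SECONDS);
        scheduler.shutdownNow();
        return ran;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runOnce(10) ? "crawl ran" : "crawl missed");
    }
}
```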