How to make a Web crawler using Java?
2015-12-23 16:45
There is a lot of useful information on the Internet. How can we get that information automatically? Yes, with a Web crawler.
In this post, I will show you how to build a prototype Web crawler step by step using Java. Making a Web crawler is not as difficult as it sounds. Just follow the guide and you will get there in about an hour or less, and then enjoy the huge amount of information it can gather for you. The goal of this tutorial is to be the simplest tutorial in the world for making a crawler in Java. Since this is only a prototype, you will need to spend more time customizing it for your needs.
I assume you know the following:
Basic Java programming
A little bit about SQL and the MySQL database
If you don't want to use a database, you can use a file to track the crawling history instead, as sketched below.
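For example, here is a minimal sketch of such a file-based history. The class name FileHistory and the one-URL-per-line format are my own choices for illustration, not part of this tutorial:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: tracks crawled URLs in a text file, one URL per line.
public class FileHistory {
    private final Path file;
    private final Set<String> visited = new HashSet<>();

    public FileHistory(String fileName) throws IOException {
        file = Paths.get(fileName);
        if (Files.exists(file)) {
            // Reload the history from a previous run.
            visited.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    // Returns true (and records the URL) only the first time a URL is seen.
    public boolean addIfNew(String url) throws IOException {
        if (!visited.add(url)) {
            return false; // already crawled
        }
        Files.write(file, Collections.singletonList(url), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }
}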
1. The goal
In this tutorial, the goal is the following:
Given a school root URL, e.g., "mit.edu", return all pages from this school that contain the string "research".
A typical crawler works in the following steps:
Parse the root web page ("mit.edu") and get all links from this page. To access each URL and parse its HTML, I will use JSoup, which is a convenient and simple Java library (see the sketch right after these steps).
Using the URLs retrieved in step 1, fetch and parse those pages in the same way.
While doing the above, we need to track which pages have already been processed, so that each web page gets processed only once. This is why we need a database.
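To make step 1 concrete before the full crawler in section 4, here is a minimal, self-contained sketch of fetching a page and listing its links with JSoup (it assumes the JSoup JAR from section 4 is already on the classpath):

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    public static void main(String[] args) throws IOException {
        // Fetch and parse the root page.
        Document doc = Jsoup.connect("http://www.mit.edu").get();

        // Select every anchor tag that has an href attribute.
        for (Element link : doc.select("a[href]")) {
            // "abs:href" resolves relative links against the page URL.
            System.out.println(link.attr("abs:href"));
        }
    }
}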
2. Set up a MySQL database
If you are using Ubuntu, you can follow this guide to install Apache, MySQL, PHP, and phpMyAdmin.
If you are using Windows, you can simply use WampServer. Download it from wampserver.com, install it in a minute, and you are good to go for the next step.
I will use phpMyAdmin to manipulate the MySQL database. It is simply a GUI for working with MySQL. It is totally fine if you use any other tool or no GUI tool at all.
3. Create a database and a table
Create a database named "Crawler" and create a table called "Record" like the following:
CREATE TABLE IF NOT EXISTS `Record` (
  `RecordID` INT(11) NOT NULL AUTO_INCREMENT,
  `URL` text NOT NULL,
  PRIMARY KEY (`RecordID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;
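If you would rather use the MySQL console than phpMyAdmin, the database itself can be created with one statement first (this is standard MySQL syntax, not a step from the original phpMyAdmin workflow):

CREATE DATABASE IF NOT EXISTS `Crawler` DEFAULT CHARACTER SET utf8;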
4. Start crawling using Java
1). Download the JSoup core library from http://jsoup.org/download.
Download mysql-connector-java-xxxbin.jar from http://dev.mysql.com/downloads/connector/j/
2). Now create a project in Eclipse named "Crawler" and add the JSoup and mysql-connector JAR files you downloaded to the Java Build Path. (Right-click the project --> select "Build Path" --> "Configure Build Path" --> click the "Libraries" tab --> click "Add External JARs".)
3). Create a class named "DB" which is used for handling database actions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DB {

    public Connection conn = null;

    public DB() {
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/Crawler";
            conn = DriverManager.getConnection(url, "root", "admin213");
            System.out.println("conn built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    // Run a SELECT statement and return its result set.
    public ResultSet runSql(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.executeQuery(sql);
    }

    // Run any other statement (INSERT, TRUNCATE, ...).
    public boolean runSql2(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.execute(sql);
    }

    @Override
    protected void finalize() throws Throwable {
        if (conn != null && !conn.isClosed()) {
            conn.close();
        }
    }
}
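Closing the connection in finalize() is good enough for a short-lived prototype like this, but finalizers are not guaranteed to run; in anything beyond a prototype you would close the connection explicitly, for example with try-with-resources.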
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Main {

    public static DB db = new DB();

    public static void main(String[] args) throws SQLException, IOException {
        db.runSql2("TRUNCATE Record;");
        processPage("http://www.mit.edu");
    }

    public static void processPage(String URL) throws SQLException, IOException {
        // Check whether the given URL is already in the database
        // (a prepared statement avoids breaking on quotes in the URL).
        PreparedStatement check = db.conn.prepareStatement("SELECT * FROM Record WHERE URL = ?");
        check.setString(1, URL);
        ResultSet rs = check.executeQuery();
        if (!rs.next()) {
            // Store the URL in the database so it is not parsed again.
            String sql = "INSERT INTO `Crawler`.`Record` (`URL`) VALUES (?);";
            PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
            stmt.setString(1, URL);
            stmt.execute();

            // Fetch and parse the page we were asked to process.
            Document doc = Jsoup.connect(URL).get();

            // Get the useful information: report the page if it mentions "research".
            if (doc.text().contains("research")) {
                System.out.println(URL);
            }

            // Get all links and recursively call processPage on each of them,
            // staying within the mit.edu domain.
            for (Element link : doc.select("a[href]")) {
                if (link.attr("href").contains("mit.edu")) {
                    processPage(link.attr("abs:href"));
                }
            }
        }
    }
}
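One caveat worth knowing when you customize this: recursing once per page can overflow the call stack on a large site. Here is a sketch of the same crawl rewritten with an explicit queue; the class name IterativeCrawler and the in-memory visited set (standing in for the database) are my own choices for illustration:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class IterativeCrawler {
    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<>(); // pages still to visit
        Set<String> visited = new HashSet<>();       // pages already processed

        frontier.add("http://www.mit.edu");
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already processed
            }
            try {
                Document doc = Jsoup.connect(url).get();
                if (doc.text().contains("research")) {
                    System.out.println(url);
                }
                for (Element link : doc.select("a[href]")) {
                    if (link.attr("href").contains("mit.edu")) {
                        frontier.add(link.attr("abs:href"));
                    }
                }
            } catch (Exception e) {
                // Skip pages that fail to download or parse.
            }
        }
    }
}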
The following was the output when I ran the code on May 26, 2014.
Will you let me know if this is not the simplest crawler in the world?
Java Crawler Source Code Download
Java Crawler on GitHub