
How to make a Web crawler using Java?

2015-12-23 16:45
There is a lot of useful information on the Internet. How can we collect that information automatically? Yes, with a Web crawler.

In this post, I will show you how to build a prototype Web crawler step by step using Java. Making a Web crawler is not as difficult as it sounds. Just follow the guide and you will get there in about an hour or less, and then enjoy the huge amount of information it can collect for you. The goal of this tutorial is to be the simplest tutorial in the world for making a crawler in Java. As this is only a prototype, you will need to spend more time customizing it for your needs.

I assume you know the following:

Basic Java programming
A little bit about SQL and MySQL databases

If you don't want to use a database, you can use a file to track the crawling history.

1. The goal

In this tutorial, the goal is the following:

Given a school root URL, e.g., "mit.edu", return all pages from this school that contain the string "research".

A typical crawler works in the following steps:

Parse the root web page ("mit.edu"), and get all links from this page. To access each URL and parse its HTML, I will use JSoup, which is a convenient and simple Java library. (A minimal example follows this list.)
Parse the URLs retrieved in step 1 the same way, and repeat.
While doing the above steps, we need to track which pages have been processed before, so that each web page gets processed only once. This is why we need a database.
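If you have never used JSoup before, here is a minimal sketch of step 1: it fetches the root page and prints every link it finds. The "a[href]" selector matches every anchor tag that has an href attribute, and "abs:href" resolves relative links against the page URL.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {
    public static void main(String[] args) throws Exception {
        // fetch and parse the root page
        Document doc = Jsoup.connect("http://www.mit.edu").get();
        // print the absolute URL of every link on the page
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}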

2. Set up MySQL database

If you are using Ubuntu, you can follow this guide to install Apache, MySQL, PHP, and phpMyAdmin.

If you are using Windows, you can simply use WampServer. Download it from wampserver.com, install it in a minute, and you are good to go for the next step.

I will use phpMyAdmin to manipulate the MySQL database. It is simply a GUI for using MySQL. It is totally fine if you use any other tool or no GUI tool at all.

3. Create a database and a table

Create a database named "Crawler" and create a table called "Record" like the following:

CREATE TABLE IF NOT EXISTS `Record` (
`RecordID` INT(11) NOT NULL AUTO_INCREMENT,
`URL` text NOT NULL,
PRIMARY KEY (`RecordID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
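If you would rather create the table from Java than from phpMyAdmin, here is a minimal sketch using plain JDBC. It assumes the "Crawler" database already exists, and the user "root" and password "admin213" match the DB class in step 4; substitute whatever credentials you set when installing MySQL.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateTable {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        String url = "jdbc:mysql://localhost:3306/Crawler";
        // credentials are an assumption; use your own MySQL user and password
        try (Connection conn = DriverManager.getConnection(url, "root", "admin213");
             Statement sta = conn.createStatement()) {
            sta.execute("CREATE TABLE IF NOT EXISTS `Record` ("
                + "`RecordID` INT(11) NOT NULL AUTO_INCREMENT,"
                + "`URL` text NOT NULL,"
                + "PRIMARY KEY (`RecordID`)"
                + ") ENGINE=InnoDB DEFAULT CHARSET=utf8");
            System.out.println("table created");
        }
    }
}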



4. Start crawling using Java

1). Download the JSoup core library from http://jsoup.org/download. Download mysql-connector-java-xxx-bin.jar from http://dev.mysql.com/downloads/connector/j/
2). Now create a project in Eclipse named "Crawler" and add the JSoup and mysql-connector jar files you downloaded to the Java Build Path (right-click the project --> select "Build Path" --> "Configure Build Path" --> click the "Libraries" tab --> click "Add External JARs").

3). Create a class named "DB" which is used for handling database actions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DB {

    public Connection conn = null;

    public DB() {
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/Crawler";
            // use the MySQL username and password you chose during installation
            conn = DriverManager.getConnection(url, "root", "admin213");
            System.out.println("conn built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    // run a SELECT and return the result set
    public ResultSet runSql(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.executeQuery(sql);
    }

    // run a statement that modifies data (INSERT, TRUNCATE, ...)
    public boolean runSql2(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.execute(sql);
    }

    @Override
    protected void finalize() throws Throwable {
        // close the connection when this object is garbage-collected
        if (conn != null && !conn.isClosed()) {
            conn.close();
        }
    }
}
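A quick sanity check for the class, assuming the "Crawler" database and "Record" table from step 3 already exist. It inserts one row and reads the table back:

import java.sql.ResultSet;

public class DBTest {
    public static void main(String[] args) throws Exception {
        DB db = new DB();
        // insert one row, then read the whole table back
        db.runSql2("INSERT INTO `Record` (`URL`) VALUES ('http://www.mit.edu')");
        ResultSet rs = db.runSql("SELECT * FROM `Record`");
        while (rs.next()) {
            System.out.println(rs.getInt("RecordID") + " " + rs.getString("URL"));
        }
    }
}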

4). Create a class named "Main", which will be our crawler.

import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static DB db = new DB();

    public static void main(String[] args) throws SQLException, IOException {
        db.runSql2("TRUNCATE Record;");
        processPage("http://www.mit.edu");
    }

    public static void processPage(String URL) throws SQLException, IOException {
        // check if the given URL is already in the database
        // (a prepared statement, so quotes in the URL cannot break the query)
        PreparedStatement check = db.conn.prepareStatement("SELECT * FROM Record WHERE URL = ?");
        check.setString(1, URL);
        ResultSet rs = check.executeQuery();
        if (!rs.next()) {
            // store the URL in the database to avoid parsing it again
            String sql = "INSERT INTO `Crawler`.`Record` (`URL`) VALUES (?);";
            PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
            stmt.setString(1, URL);
            stmt.execute();

            // get useful information: fetch and parse the page we were given
            Document doc = Jsoup.connect(URL).get();

            if (doc.text().contains("research")) {
                System.out.println(URL);
            }

            // get all links and recursively call the processPage method
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                if (link.attr("href").contains("mit.edu"))
                    processPage(link.attr("abs:href"));
            }
        }
    }
}

Now you have your own Web crawler. Of course, you will need to filter out some links you don't want to crawl; a sketch of such a filter follows.
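For example, here is a sketch of one possible filter; the helper name and the skip rules are my suggestions, not part of the code above. You would guard the recursive call with it, e.g. if (shouldCrawl(link.attr("abs:href"))) processPage(link.attr("abs:href"));

// a suggested helper: returns true only for links worth crawling
public static boolean shouldCrawl(String url) {
    if (!url.contains("mit.edu"))
        return false; // stay inside the school's domain
    if (url.startsWith("mailto:"))
        return false; // an email address, not a web page
    if (url.contains("#"))
        return false; // same page, different anchor
    String lower = url.toLowerCase();
    // skip files that JSoup cannot parse as HTML
    return !(lower.endsWith(".pdf") || lower.endsWith(".jpg") || lower.endsWith(".zip"));
}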

When I ran the code on May 26, 2014, the output was a list of mit.edu URLs, one per crawled page whose text contains "research".
Let me know if this is not the simplest crawler in the world!


Java Crawler Source Code Download

Java Crawler on GitHub