[Tested] Implementing a Web Crawler with Java + MySQL
Published: 2019-05-10


Source: http://johnhany.net/2013/11/web-crawler-using-java-and-mysql/#imageclose-413

Code download: https://github.com/johnhany/WPCrawler

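        A web crawler, also called a web spider (some projects call it a "walker"), is commonly defined as "a network program that systematically scans the Internet for the purpose of building an index". Many open-source web crawler projects are available online.

        Sometimes you need to collect information from the web. When that information can be obtained in one uniform way but is tedious to gather by hand, a crawler is the right tool: for example, counting how many articles a site publishes each month and which tags it uses, collecting a corpus for a natural language processing project, or collecting images for a pattern recognition project. A web crawler is also one of the essential components of a search engine.

        Many web crawlers are written in Python, Java, or C#. The one presented here is written in Java. To save time and space, the program only scans pages under this blog's address (that is, http://johnhany.net/, excluding everything under http://johnhany.net/wp-content/) and collects all the tags in use from the URLs it finds. With small changes that remove these restrictions, the code can scan the wider web; with small changes to the output format, it can serve as a sitemap generator for a blog.

        The code can also be downloaded from the repository linked above.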


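Environment Requirements

        My development environment is Windows 7, with MySQL provided by XAMPP.

        MySQL must expose a port that the program can reach through a JDBC URL.

        Three open-source Java libraries are also required:

         Apache HttpComponents (HttpClient) provides the HTTP interface, used to send HTTP requests to the target URL and fetch the page content;

         HTML Parser (org.htmlparser) parses the page and extracts links from DOM nodes;

         MySQL Connector/J (the JDBC driver com.mysql.jdbc.Driver) connects the Java program to MySQL so that the database can be operated from Java code.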


Code

        The code is split across three files: crawler.java, httpGet.java, and parsePage.java. The package name is net.johnhany.wpcrawler.

crawler.java

package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class crawler {

    public static void main(String args[]) throws Exception {
        String frontpage = "http://johnhany.net/";
        Connection conn = null;

        //connect the MySQL database
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
            conn = DriverManager.getConnection(dburl, "root", "");
            System.out.println("connection built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }

        String sql = null;
        String url = frontpage;
        Statement stmt = null;
        ResultSet rs = null;
        int count = 0;

        if (conn != null) {
            //create database and table that will be needed
            try {
                sql = "CREATE DATABASE IF NOT EXISTS crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "USE crawler";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);

                sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
                stmt = conn.createStatement();
                stmt.executeUpdate(sql);
            } catch (SQLException e) {
                e.printStackTrace();
            }

            //crawl every link in the database
            while (true) {
                //get page content of link "url"
                httpGet.getByString(url,conn);
                count++;

                //set boolean value "crawled" to true after crawling this page
                sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
                stmt = conn.createStatement();

                if (stmt.executeUpdate(sql) > 0) {
                    //get the next page that has not been crawled yet
                    sql = "SELECT * FROM record WHERE crawled = 0";
                    stmt = conn.createStatement();
                    rs = stmt.executeQuery(sql);
                    if (rs.next()) {
                        url = rs.getString(2);
                    } else {
                        //stop crawling if reach the bottom of the list
                        break;
                    }

                    //set a limit of crawling count
                    if (count > 1000 || url == null) {
                        break;
                    }
                }
            }
            conn.close();
            conn = null;

            System.out.println("Done.");
            System.out.println(count);
        }
    }
}

httpGet.java

package net.johnhany.wpcrawler;

import java.io.IOException;
import java.sql.Connection;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class httpGet {

    public final static void getByString(String url, Connection conn) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            System.out.println("executing request " + httpget.getURI());

            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

                public String handleResponse(
                        final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };
            String responseBody = httpclient.execute(httpget, responseHandler);
            /*
            //print the content of the page
            System.out.println("----------------------------------------");
            System.out.println(responseBody);
            System.out.println("----------------------------------------");
            */
            parsePage.parseFromString(responseBody,conn);

        } finally {
            httpclient.close();
        }
    }
}

parsePage.java

package net.johnhany.wpcrawler;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import java.net.URLDecoder;

public class parsePage {

    public static void parseFromString(String content, Connection conn) throws Exception {
        Parser parser = new Parser(content);
        HasAttributeFilter filter = new HasAttributeFilter("href");

        try {
            NodeList list = parser.parse(filter);
            int count = list.size();

            //process every link on this page
            for (int i=0; i<count; i++) {
                Node node = list.elementAt(i);

                if (node instanceof LinkTag) {
                    LinkTag link = (LinkTag) node;
                    String nextlink = link.extractLink();
                    String mainurl = "http://johnhany.net/";
                    String wpurl = mainurl + "wp-content/";

                    //only save page from "http://johnhany.net"
                    if (nextlink.startsWith(mainurl)) {
                        String sql = null;
                        ResultSet rs = null;
                        PreparedStatement pstmt = null;
                        Statement stmt = null;
                        String tag = null;

                        //do not save any page from "wp-content"
                        if (nextlink.startsWith(wpurl)) {
                            continue;
                        }

                        try {
                            //check if the link already exists in the database
                            sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
                            stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,ResultSet.CONCUR_UPDATABLE);
                            rs = stmt.executeQuery(sql);

                            if (rs.next()) {

                            } else {
                                //if the link does not exist in the database, insert it
                                sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
                                pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                pstmt.execute();
                                System.out.println(nextlink);

                                //use substring for better comparison performance
                                nextlink = nextlink.substring(mainurl.length());
                                //System.out.println(nextlink);

                                if (nextlink.startsWith("tag/")) {
                                    tag = nextlink.substring(4, nextlink.length()-1);
                                    //decode in UTF-8 for Chinese characters
                                    tag = URLDecoder.decode(tag, "UTF-8");
                                    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
                                    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                                    //if the links are different from each other, the tags must be different
                                    //so there is no need to check if the tag already exists
                                    pstmt.execute();
                                }
                            }
                        } catch (SQLException e) {
                            //handle the exceptions
                            System.out.println("SQLException: " + e.getMessage());
                            System.out.println("SQLState: " + e.getSQLState());
                            System.out.println("VendorError: " + e.getErrorCode());
                        } finally {
                            //close and release the resources of PreparedStatement, ResultSet and Statement
                            if (pstmt != null) {
                                try {
                                    pstmt.close();
                                } catch (SQLException e2) {}
                            }
                            pstmt = null;

                            if (rs != null) {
                                try {
                                    rs.close();
                                } catch (SQLException e1) {}
                            }
                            rs = null;

                            if (stmt != null) {
                                try {
                                    stmt.close();
                                } catch (SQLException e3) {}
                            }
                            stmt = null;
                        }

                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}

How the Program Works

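        The Internet is, as the name suggests, a network: a path may exist between any two nodes. From a graph-theory point of view, crawling the Internet is a traversal of a directed graph (a link points from one page to another, so the edges are directed). The two common traversal strategies are depth-first and breadth-first; for the theory, see the material on tree traversal (depth-first search and breadth-first search). This program crawls breadth-first.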
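To make the breadth-first idea concrete, here is a minimal sketch that traverses a small in-memory link graph. It is not part of the crawler: the class name BfsSketch, the method bfs, and the sample pages are hypothetical, and it assumes Java 9 or later for Map.of and List.of.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {

    // Visit every page reachable from start, nearest pages first.
    // links maps a page to the pages it links to (a directed graph).
    public static List<String> bfs(Map<String, List<String>> links, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();      // oldest discovered page first (FIFO)
            order.add(page);
            for (String next : links.getOrDefault(page, List.of())) {
                if (seen.add(next)) {          // add() returns false for pages already seen
                    queue.add(next);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical site: "/" links to "/a" and "/b", both of which link to "/c".
        Map<String, List<String>> links = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/c", "/"),
                "/b", List.of("/c"),
                "/c", List.of());
        System.out.println(bfs(links, "/"));   // prints [/, /a, /b, /c]
    }
}
```

The queue plays the role that the record table plays in the crawler: newly found links go to the back, and the oldest unvisited one is processed next.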

        Execution starts in main() of crawler.java.

Class.forName("com.mysql.jdbc.Driver");
String dburl = "jdbc:mysql://localhost:3306?useUnicode=true&characterEncoding=utf8";
conn = DriverManager.getConnection(dburl, "root", "");
System.out.println("connection built");

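        First, DriverManager is used to connect to the MySQL server. The port is 3306, the default MySQL port in XAMPP; the port value is shown in the XAMPP main window:

[Screenshot: XAMPP main window showing the MySQL port]

        Once Apache and MySQL are both running, open "http://localhost/phpmyadmin/" in a browser to see the database. After the program finishes, you can check here whether it ran correctly.

[Screenshot: phpMyAdmin]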

 

sql = "CREATE DATABASE IF NOT EXISTS crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "USE crawler";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists record (recordID int(5) not null auto_increment, URL text not null, crawled tinyint(1) not null, primary key (recordID)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

sql = "create table if not exists tags (tagnum int(4) not null auto_increment, tagname text not null, primary key (tagnum)) engine=InnoDB DEFAULT CHARSET=utf8";
stmt = conn.createStatement();
stmt.executeUpdate(sql);

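        With the connection established, the program creates a database named "crawler" with two tables. The table "record" has the columns "recordID", "URL", and "crawled", which store the record number, the link address, and whether the address has been crawled. The table "tags" has the columns "tagnum" and "tagname", which store the tag number and the tag name.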

 

while (true) {
    httpGet.getByString(url,conn);
    count++;

    sql = "UPDATE record SET crawled = 1 WHERE URL = '" + url + "'";
    stmt = conn.createStatement();

    if (stmt.executeUpdate(sql) > 0) {
        sql = "SELECT * FROM record WHERE crawled = 0";
        stmt = conn.createStatement();
        rs = stmt.executeQuery(sql);
        if (rs.next()) {
            url = rs.getString(2);
        } else {
            break;
        }
    }
}

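        Next, a while loop processes the addresses in the record table one by one. Each iteration passes the address url to httpGet.getByString(), then sets crawled to true for that address in record to mark it as processed. It then looks up the next address whose crawled flag is false and continues until no unprocessed address remains. (The full listing also stops once count exceeds 1000.)

        One detail to note: executeQuery() returns a ResultSet rs, which holds the rows returned by the query together with a cursor. The cursor initially sits before the first row, so rs.next() must be called once to move it onto the first row; that call returns true. Each later call to rs.next() advances the cursor by one row and returns true, until no rows are left and rs.next() returns false.

        Another detail: statements that create databases or tables, as well as INSERT and UPDATE statements, are run with executeUpdate(); SELECT statements are run with executeQuery(). executeQuery() always returns a ResultSet, while executeUpdate() returns the number of rows affected by the statement (0 for statements such as CREATE that return nothing).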
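The crawl loop described above can be tried out without a database. The sketch below replaces the record table with a LinkedHashMap and the HTTP fetch with a function argument; FrontierSketch, crawl, and fetchLinks are hypothetical names, and the sketch assumes that the SELECT returns rows in insertion order. Unlike the listing, it seeds the start URL into the table before the loop, so the first "UPDATE" always finds a row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class FrontierSketch {

    // Crawl from start, visiting at most limit pages; returns pages in crawl order.
    // fetchLinks is a stand-in for "download the page and extract its links".
    public static List<String> crawl(String start, int limit,
                                     Function<String, List<String>> fetchLinks) {
        // Stand-in for the record table: URL -> crawled flag, kept in insertion order.
        Map<String, Boolean> record = new LinkedHashMap<>();
        record.put(start, false);                 // seed the start page
        List<String> order = new ArrayList<>();

        String url = start;
        while (url != null && order.size() < limit) {
            for (String link : fetchLinks.apply(url)) {
                record.putIfAbsent(link, false);  // insert only URLs not yet in the table
            }
            record.put(url, true);                // UPDATE record SET crawled = 1
            order.add(url);

            url = null;                           // SELECT the first row with crawled = 0
            for (Map.Entry<String, Boolean> row : record.entrySet()) {
                if (!row.getValue()) {
                    url = row.getKey();
                    break;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical three-level site used only for this demonstration.
        Map<String, List<String>> web = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/b", "/c"));
        System.out.println(crawl("/", 10, u -> web.getOrDefault(u, List.of())));
        // prints [/, /a, /b, /c]
    }
}
```

Because the first uncrawled row is always the oldest one, this loop visits pages in breadth-first order.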

 

        The getByString() method in httpGet.java sends a request to the given URL and downloads the page content.

HttpGet httpget = new HttpGet(url);
System.out.println("executing request " + httpget.getURI());

ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
    public String handleResponse(
            final HttpResponse response) throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity) : null;
        } else {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
    }
};
String responseBody = httpclient.execute(httpget, responseHandler);

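        This code follows the sample provided with the HttpClient component of Apache HttpComponents and can be used as-is in many cases. It produces a string, responseBody, that holds the full text of the page.

        Next, responseBody is passed to the parseFromString() method in parsePage.java to extract the links.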

Parser parser = new Parser(content);
HasAttributeFilter filter = new HasAttributeFilter("href");

try {
    NodeList list = parser.parse(filter);
    int count = list.size();

    //process every link on this page
    for (int i=0; i<count; i++) {
        Node node = list.elementAt(i);
        if (node instanceof LinkTag) {
            // ... link handling, shown in the excerpts that follow
        }
    }
} catch (ParserException e) {
    e.printStackTrace();
}

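        In an HTML file, links usually live in the href attribute of a tags, so the code creates an attribute filter for href. parser.parse(filter) returns a NodeList holding the nodes that pass this filter; the for loop walks through them and keeps only the LinkTag nodes, which yields all the links on the page.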
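For readers who want to experiment with link extraction without installing HTML Parser, the following sketch swaps in a plain regular expression from the Java standard library. This is a different and much cruder technique than the parser used in the article (it only handles quoted href values and does not resolve relative links); HrefSketch and extractLinks are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefSketch {

    // Matches an <a ...> tag and captures the quoted value of its href attribute.
    private static final Pattern A_HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the href values of all <a> tags, in document order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<link href=\"s.css\">"
                + "<a href=\"http://johnhany.net/tag/java/\">java</a>"
                + "<A class='c' HREF='http://example.com/'>other</A>";
        // The <link> tag is ignored; only the two <a> tags are reported.
        System.out.println(extractLinks(html));
    }
}
```

As in the article's loop, tags that carry an href attribute but are not a tags (here the link tag) are skipped.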

        The links are then filtered further with nextlink.startsWith(): only links that start with "http://johnhany.net/" are processed, and links that start with "http://johnhany.net/wp-content/" are skipped.
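The two startsWith() checks can be isolated in a small predicate, which makes the rule easy to test or to relax later. This is only a sketch: LinkFilterSketch and shouldRecord are hypothetical names, while the two URL prefixes are the ones used in parsePage.java.

```java
public class LinkFilterSketch {

    static final String MAIN_URL = "http://johnhany.net/";
    static final String WP_URL = MAIN_URL + "wp-content/";

    // True if the link belongs to the blog and is not under wp-content.
    public static boolean shouldRecord(String link) {
        return link != null
                && link.startsWith(MAIN_URL)
                && !link.startsWith(WP_URL);
    }

    public static void main(String[] args) {
        System.out.println(shouldRecord("http://johnhany.net/tag/java/"));          // prints true
        System.out.println(shouldRecord("http://johnhany.net/wp-content/a.png"));   // prints false
        System.out.println(shouldRecord("http://example.com/"));                    // prints false
    }
}
```

Removing the restriction, as the introduction suggests, would then amount to changing this one method.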

sql = "SELECT * FROM record WHERE URL = '" + nextlink + "'";
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,ResultSet.CONCUR_UPDATABLE);
rs = stmt.executeQuery(sql);

if (rs.next()) {

} else {
    //if the link does not exist in the database, insert it
    sql = "INSERT INTO record (URL, crawled) VALUES ('" + nextlink + "',0)";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
}

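        The code looks up the link in the record table. If it is already there (rs.next() returns true), nothing is done; if it is not (rs.next() returns false), the address is inserted with crawled set to false. recordID is declared AUTO_INCREMENT, so MySQL assigns the number itself. Statement.RETURN_GENERATED_KEYS only makes the generated number retrievable through getGeneratedKeys(); this program never reads it, so the flag is optional here.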
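The listing builds its SQL by concatenating the URL into the statement text, which breaks as soon as a URL contains a single quote. A possible alternative, sketched below, binds the URL as a parameter of a PreparedStatement. SafeInsertSketch, concatenatedQuery, and insertIfAbsent are hypothetical names; the table and column names are those from the article, and insertIfAbsent assumes an open connection with the crawler database selected.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SafeInsertSketch {

    // Shows what the concatenated query text looks like for a given URL.
    public static String concatenatedQuery(String url) {
        return "SELECT * FROM record WHERE URL = '" + url + "'";
    }

    // Inserts url into record unless it is already present.
    // Returns true if a row was inserted.
    public static boolean insertIfAbsent(Connection conn, String url) throws SQLException {
        try (PreparedStatement select =
                     conn.prepareStatement("SELECT 1 FROM record WHERE URL = ?")) {
            select.setString(1, url);             // the driver handles quoting
            try (ResultSet rs = select.executeQuery()) {
                if (rs.next()) {
                    return false;                 // already recorded
                }
            }
        }
        try (PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO record (URL, crawled) VALUES (?, 0)")) {
            insert.setString(1, url);
            return insert.executeUpdate() == 1;
        }
    }

    public static void main(String[] args) {
        // The quote inside the URL ends the SQL string literal early.
        System.out.println(concatenatedQuery("http://johnhany.net/it's/"));
    }
}
```

The try-with-resources blocks also replace the manual close() calls in the finally block of the original listing.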

nextlink = nextlink.substring(mainurl.length());

if (nextlink.startsWith("tag/")) {
    tag = nextlink.substring(4, nextlink.length()-1);
    tag = URLDecoder.decode(tag, "UTF-8");
    sql = "INSERT INTO tags (tagname) VALUES ('" + tag + "')";
    pstmt = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    pstmt.execute();
}

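        The leading "http://johnhany.net/" is stripped from the link so that the following string comparisons are shorter. If the remainder starts with "tag/", the characters after it are a tag name (the code also drops the final character, which is the trailing slash). The name is URL-decoded as UTF-8 so that Chinese characters are stored correctly, then inserted into the tags table. In the same way, checks for "article/", "author/", or "2013/11/" could be added to classify other links.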
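The tag extraction step can be written as a small standalone function. In this sketch, TagSketch and extractTag are hypothetical names; one deliberate difference from the listing is that the trailing slash is removed only when it is actually present, instead of always dropping the last character.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class TagSketch {

    static final String MAIN_URL = "http://johnhany.net/";

    // Returns the decoded tag name for a tag URL, or null if the link is not a tag page.
    public static String extractTag(String link) throws UnsupportedEncodingException {
        if (!link.startsWith(MAIN_URL)) {
            return null;
        }
        String path = link.substring(MAIN_URL.length());
        if (!path.startsWith("tag/")) {
            return null;
        }
        String tag = path.substring("tag/".length());
        if (tag.endsWith("/")) {
            tag = tag.substring(0, tag.length() - 1);   // drop the trailing slash
        }
        // Percent-encoded bytes are interpreted as UTF-8.
        return URLDecoder.decode(tag, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(extractTag("http://johnhany.net/tag/java/"));   // prints java
        System.out.println(extractTag("http://johnhany.net/2013/11/"));    // prints null
    }
}
```

A percent-encoded Chinese tag such as tag/%E7%88%AC%E8%99%AB/ goes through the same path and comes back as the two original characters.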


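Results

These two database screenshots show part of the program's output:

[Screenshot: database table contents]

[Screenshot: database table contents]

        The full output is available via the original article. You can compare it with this blog's sitemap to see what further changes would be needed to build a sitemap generator on top of this program.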
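As a starting point for the sitemap idea, the sketch below turns a list of crawled URLs into a minimal sitemap document. SitemapSketch, toSitemap, and escapeXml are hypothetical names; the URL list is assumed to come from the record table, and optional elements such as lastmod are left out.

```java
import java.util.List;

public class SitemapSketch {

    // Escapes the characters that are not allowed as-is in XML text.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Builds a sitemap with one <url><loc>...</loc></url> entry per URL.
    public static String toSitemap(List<String> urls) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            sb.append("  <url><loc>").append(escapeXml(url)).append("</loc></url>\n");
        }
        sb.append("</urlset>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toSitemap(List.of(
                "http://johnhany.net/",
                "http://johnhany.net/?a=1&b=2")));
    }
}
```

The ampersand must be replaced first in escapeXml; otherwise the ampersands introduced by the other replacements would be escaped a second time.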

Reposted from: http://asgvb.baihongyu.com/
