JAVA13 - Crawler Project in Practice

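Principles for building a project from scratch

  • Treat every project as the best project of your career and polish it carefully
    • Build your reputation
    • Write the documentation meticulously
    • Keep raising code quality
  • Use standardized, industry-recognized patterns and processes
  • Have (almost) no local dependencies, so that users can run the project without any obstacles
  • Take small, quick steps
    • A sense of accomplishment
    • The smaller the change, the easier it is to debug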

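Development conventions

  • [Mandatory] Develop on GitHub with a trunk-and-branch model
    • Never push directly to master
    • All changes go through pull requests (PRs)
  • [Mandatory] Automated code quality checks and tests
    • The earlier a problem is caught, the cheaper it is to fix
    • Checkstyle/SpotBugs
    • At least basic automated test coverage
  • [Preferred] Automate everything
  • Follow a standardized commit process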

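How the project moves forward

  • Working in a team
    • Modularization
      • Each module has a clear responsibility and clear boundaries
      • Basic documentation
      • Basic interfaces
    • Small commits
      • Large changes are harder to review
      • Conflicts in large changes are harder to resolve
  • Working alone
    • Implement the functionality first
    • While implementing, keep extracting the common parts
      • Whenever you write long, verbose code, it is time to refactor
      • Whenever you copy and paste, it is time to refactor
    • Arrive at modules and interfaces through refactoring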

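Crawler algorithm

  • Start from one node and traverse all reachable nodes
  • Use breadth-first search
    • Breadth-first: nodes on the same level are visited first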

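Algorithm flow

  1. Start with a link pool
  2. Take a link from the pool and check whether it has already been processed
    1. Already processed: take another link from the pool
    2. Not yet processed: is it a link we want?
      1. No: take another link from the pool
      2. Yes: process it and put the newly discovered links into the link pool
        1. If it is a news page, store it
        2. Then add the link to the pool of processed links
  3. Go back to step 2

(Figure: algorithm diagram)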
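
The flow above can be sketched on a small in-memory "site" without any network access. This is only a minimal sketch: the class name `BfsCrawlSketch`, the method `crawl`, and the page names are made up for illustration, and the "is it a link we want?" filter is left out. Note that the listings in this post take the most recently added link from the pool, which behaves more like a stack; a first-in, first-out queue as used here is what gives a breadth-first visiting order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class BfsCrawlSketch {
    /**
     * Visits every page reachable from start in breadth-first order.
     * linksByPage is a hypothetical stand-in for "download the page and extract its links".
     */
    static List<String> crawl(String start, Map<String, List<String>> linksByPage) {
        Queue<String> linkPool = new ArrayDeque<>(); // links waiting to be processed
        Set<String> handled = new HashSet<>();       // links that were already processed
        List<String> visitOrder = new ArrayList<>();

        linkPool.add(start);
        while (!linkPool.isEmpty()) {
            String link = linkPool.poll();           // FIFO: oldest link first
            if (!handled.add(link)) {
                continue;                            // already processed, take the next one
            }
            visitOrder.add(link);                    // "process" the page
            linkPool.addAll(linksByPage.getOrDefault(link, List.of()));
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
                "home", List.of("a", "b"),
                "a", List.of("c", "home"),
                "b", List.of("c"),
                "c", List.of());
        System.out.println(crawl("home", site)); // prints [home, a, b, c]
    }
}
```

Both pages on the first level ("a" and "b") are visited before "c", and the link back to "home" is skipped because it is already in the handled set.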

Original code

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;

public class Main {
    public static void main(String[] args) throws IOException {
        ArrayList<String> unHandledPool = new ArrayList<>();
        HashSet<String> handledPool = new HashSet<>();
        unHandledPool.add("https://sina.cn");

        while (true) {
            if (unHandledPool.isEmpty()) {
                break;
            }

            String link = unHandledPool.get(unHandledPool.size() - 1);
            unHandledPool.remove(link);

            if (handledPool.contains(link)) {
                continue;
            }

            if (link.startsWith("//")) {
                link = "https:" + link;
            }

            if (!link.contains("passport.sina.cn") && ("https://sina.cn".equals(link) || link.contains("news.sina.cn"))) {
                HttpGet request = new HttpGet(link);
                request.setHeader("User-Agent",
                        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36");
                // Close the client together with the response so connections are not leaked
                try (CloseableHttpClient httpClient = HttpClients.createDefault();
                     CloseableHttpResponse response = httpClient.execute(request)) {
                    String html = EntityUtils.toString(response.getEntity());
                    Document parse = Jsoup.parse(html);
                    for (Element a : parse.select("a")) {
                        String href = a.attr("href");
                        unHandledPool.add(href);
                    }
                    Elements articles = parse.select("article");
                    if (!articles.isEmpty()) {
                        for (Element article : articles) {
                            System.out.println(article.child(0).text());
                        }
                    }
                    handledPool.add(link);
                }
            }
        }
    }
}

First refactoring

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;

public class Main {
    public static void main(String[] args) throws IOException {
        ArrayList<String> unHandledPool = new ArrayList<>();
        HashSet<String> handledPool = new HashSet<>();

        unHandledPool.add("https://sina.cn");
        while (!unHandledPool.isEmpty()) {
            String link = unHandledPool.remove(unHandledPool.size() - 1);

            if (handledPool.contains(link)) {
                continue;
            }

            if (isWantTo(link)) {
                Document jsoup = Jsoup.parse(httpGetHTML(link));
                jsoup.select("a").stream().map(a -> a.attr("href")).forEach(unHandledPool::add);
                storeNewsToDatabase(jsoup);
                handledPool.add(link);
            }
        }
    }


    /**
     * Get HTML content
     *
     * @param link http or https link
     * @return html body
     * @throws IOException IOException
     */
    private static String httpGetHTML(String link) throws IOException {
        if (link.startsWith("//")) {
            link = "https:" + link;
        }
        HttpGet request = new HttpGet(link);
        request.setHeader("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36");
        // Close the client together with the response so connections are not leaked
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(request)) {
            return EntityUtils.toString(response.getEntity());
        }
    }

    /**
     * Determine whether it is the desired link
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isWantTo(String link) {
        return isNotLogin(link) && (isIndex(link) || isNews(link));
    }

    /**
     * Determine whether the link is NOT a login page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNotLogin(String link) {
        return !link.contains("passport.sina.cn");
    }

    /**
     * Determine whether it is the home page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isIndex(String link) {
        return "https://sina.cn".equals(link);
    }

    /**
     * Determine whether it is a news page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNews(String link) {
        return link.contains("news.sina.cn");
    }

    /**
     * Save news to database
     *
     * @param jsoup Jsoup parse
     */
    private static void storeNewsToDatabase(Document jsoup) {
        Elements articles = jsoup.select("article");
        if (!articles.isEmpty()) {
            for (Element article : articles) {
                System.out.println(article.child(0).text());
            }
        }
    }
}
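
The listing above only handles one special case of `href` values, the protocol-relative form starting with `//`. Real pages also contain relative paths, `mailto:` links, and malformed values. As a hedged sketch of one way to deal with them, the helper below resolves an `href` against the URL of the page it was found on using `java.net.URI`. The class `LinkNormalizer` and its method `normalize` are hypothetical names and are not part of the project code.

```java
import java.net.URI;
import java.util.Optional;

final class LinkNormalizer {
    /**
     * Resolves href against pageUrl and keeps only http/https results.
     * Returns an empty Optional for other schemes or for values that are not valid URIs.
     */
    static Optional<String> normalize(String pageUrl, String href) {
        try {
            URI resolved = URI.create(pageUrl).resolve(href.trim());
            String scheme = resolved.getScheme();
            if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                return Optional.of(resolved.toString());
            }
            return Optional.empty();
        } catch (IllegalArgumentException e) {
            // URI.create and URI.resolve(String) reject syntactically invalid input
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String page = "https://sina.cn/";
        System.out.println(normalize(page, "//news.sina.cn/a.html")); // Optional[https://news.sina.cn/a.html]
        System.out.println(normalize(page, "/sports"));               // Optional[https://sina.cn/sports]
        System.out.println(normalize(page, "mailto:a@b.cn"));         // Optional.empty
    }
}
```

Normalizing before checking the handled pool would also avoid treating `//news.sina.cn/x` and `https://news.sina.cn/x` as two different links.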

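Maven lifecycle

Maven executes its lifecycle phases in order from the beginning. When it reaches a phase, it checks whether any work is bound to that phase.
If nothing is bound, it moves on to the next phase; if work is bound, it executes that work.
Work is bound to a lifecycle phase in the form of plugins; a single task provided by a plugin is called a goal.

  • Which lifecycle phases does Maven have? See the Maven lifecycle documentation

  • Commonly used phases

    1. compile -> maven-compiler-plugin
    2. test -> maven-surefire-plugin
    3. package
    4. verify
    5. install
    6. deploy
  • Below is an example of binding a plugin to the compile phase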

<plugins>
    <plugin>
        <artifactId>maven-checkstyle-plugin</artifactId>
        <version>3.1.0</version>
        <configuration>
            <configLocation>${basedir}/.circleci/checkstyle.xml</configLocation>
            <includeTestSourceDirectory>true</includeTestSourceDirectory>
            <enableRulesSummary>false</enableRulesSummary>
        </configuration>
        <executions>
            <execution>
                <id>compile</id>
                <phase>compile</phase>
                <goals>
                    <goal>check</goal>
                </goals>
            </execution>
        </executions>
        <dependencies>
            <dependency>
                <groupId>com.puppycrawl.tools</groupId>
                <artifactId>checkstyle</artifactId>
                <version>8.29</version>
            </dependency>
        </dependencies>
    </plugin>
</plugins>
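
To make the "walk through the phases in order and run whatever is bound to each one" idea concrete, here is a toy model in Java. It is not how Maven is implemented; the class `LifecycleSketch` is hypothetical and the phase list is only a subset of the default lifecycle. It shows that asking for a phase runs the goals of every earlier phase as well, and that several goals can be bound to the same phase.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LifecycleSketch {
    // A subset of the phases of Maven's default lifecycle, in execution order
    static final List<String> PHASES = List.of(
            "validate", "initialize", "compile", "test", "package", "verify", "install", "deploy");

    private final Map<String, List<String>> goalsByPhase = new LinkedHashMap<>();

    /** Binds a goal to a phase, similar in spirit to an <execution> with a <phase> element. */
    void bind(String phase, String goal) {
        if (!PHASES.contains(phase)) {
            throw new IllegalArgumentException("unknown phase: " + phase);
        }
        goalsByPhase.computeIfAbsent(phase, p -> new ArrayList<>()).add(goal);
    }

    /** Returns the goals that run, in order, when the build is asked for targetPhase. */
    List<String> run(String targetPhase) {
        int last = PHASES.indexOf(targetPhase);
        if (last < 0) {
            throw new IllegalArgumentException("unknown phase: " + targetPhase);
        }
        List<String> executed = new ArrayList<>();
        for (String phase : PHASES.subList(0, last + 1)) {
            // Phases with nothing bound are simply skipped
            executed.addAll(goalsByPhase.getOrDefault(phase, List.of()));
        }
        return executed;
    }

    public static void main(String[] args) {
        LifecycleSketch build = new LifecycleSketch();
        build.bind("compile", "compiler:compile");
        build.bind("compile", "checkstyle:check");
        build.bind("test", "surefire:test");
        System.out.println(build.run("compile")); // [compiler:compile, checkstyle:check]
        System.out.println(build.run("package")); // [compiler:compile, checkstyle:check, surefire:test]
    }
}
```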

Second refactoring: introducing a database so that an interrupted crawl can be resumed

Database definition

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;

public class Main {
    private static Connection connection;

    public static void main(String[] args) throws IOException, SQLException {
        setConnection();
        while (!selectUnHandledUrls().isEmpty()) {
            String url = removeUnHandledMaxUrl();
            if (url == null) break;
            if (selectInHandledWhereUrl(url)) continue;
            if (isWantTo(url)) {
                Document jsoup = parseHtmlToHrefAndInsertIntoLinksUnHandled(url);
                jsoupCssSelectAirtcleAndParseToInsertIntoNews(jsoup);
                updateUrlWithSql(url, "INSERT INTO LINKS_IN_HANDLED(link) VALUES ( ? )");
            }
        }
    }

    /**
     * Load links from databases
     *
     * @return ArrayList
     * @throws SQLException SQLException
     */
    private static ArrayList<String> selectUnHandledUrls() throws SQLException {
        ArrayList<String> list = new ArrayList<>();
        try (PreparedStatement preparedStatement = connection.prepareStatement("SELECT link FROM LINKS_UN_HANDLED");
             ResultSet resultSet = preparedStatement.executeQuery()) {
            while (resultSet.next()) {
                list.add(resultSet.getString(1));
            }
            return list;
        }
    }

    /**
     * Take the link with the largest id from LINKS_UN_HANDLED and delete it from the table
     *
     * @return the deleted link, or null if the table is empty
     * @throws SQLException SQLException
     */
    private static String removeUnHandledMaxUrl() throws SQLException {
        try (PreparedStatement preparedStatement = connection.prepareStatement(
                "SELECT * FROM LINKS_UN_HANDLED WHERE id=(SELECT max(id) FROM LINKS_UN_HANDLED)");
             ResultSet resultSet = preparedStatement.executeQuery()) {
            if (resultSet.next()) {
                String url = resultSet.getString(2);
                updateUrlWithSql(url, "DELETE FROM LINKS_UN_HANDLED WHERE link = ?");
                return url;
            }
            return null;
        }
    }

    /**
     * SELECT link FROM LINKS_IN_HANDLED WHERE link = ?
     *
     * @param url http or https link
     * @return if exists return true, else return false
     * @throws SQLException SQLException
     */
    private static boolean selectInHandledWhereUrl(String url) throws SQLException {
        try (PreparedStatement preparedStatement
                     = connection.prepareStatement("SELECT link FROM LINKS_IN_HANDLED WHERE link = ?")) {
            preparedStatement.setString(1, url);
            try (ResultSet resultSet = preparedStatement.executeQuery()) {
                return resultSet.next();
            }
        }
    }

    /**
     * parse html to a tag, and 'INSERT INTO LINKS_UN_HANDLED'
     *
     * @param url http or https link
     * @return Jsoup Document
     * @throws SQLException SQLException
     * @throws IOException  IOException
     */
    private static Document parseHtmlToHrefAndInsertIntoLinksUnHandled(String url) throws SQLException, IOException {
        Document jsoup = Jsoup.parse(httpGetHTML(url));
        for (Element a : jsoup.select("a")) {
            String href = a.attr("href");
            updateUrlWithSql(href, "INSERT INTO LINKS_UN_HANDLED(link) VALUES ( ? )");
        }
        return jsoup;
    }

    /**
     * executeUpdate for link
     *
     * @param link http or https link
     * @param sql  sql
     * @throws SQLException SQLException
     */
    private static void updateUrlWithSql(String link, String sql) throws SQLException {
        try (PreparedStatement preparedStatement
                     = connection.prepareStatement(sql)) {
            preparedStatement.setString(1, link);
            preparedStatement.executeUpdate();
        }
    }


    /**
     * Get HTML content
     *
     * @param link http or https link
     * @return html body
     * @throws IOException IOException
     */
    private static String httpGetHTML(String link) throws IOException {
        if (link.startsWith("//")) {
            link = "https:" + link;
        }
        HttpGet request = new HttpGet(link);
        request.setHeader("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36");
        // Close the client together with the response so connections are not leaked
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(request)) {
            return EntityUtils.toString(response.getEntity());
        }
    }

    /**
     * Save news to database
     *
     * @param jsoup Jsoup parse
     */
    private static void jsoupCssSelectAirtcleAndParseToInsertIntoNews(Document jsoup) {
        Elements articles = jsoup.select("article");
        if (!articles.isEmpty()) {
            for (Element article : articles) {
                System.out.println(article.child(0).text());
            }
        }
    }

    /**
     * Determine whether it is the desired link
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isWantTo(String link) {
        return isNotLogin(link) && (isIndex(link) || isNews(link));
    }

    /**
     * Determine whether it is the home page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isIndex(String link) {
        return "https://sina.cn".equals(link);
    }

    /**
     * Determine whether the link is NOT a login page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNotLogin(String link) {
        return !link.contains("passport.sina.cn");
    }

    /**
     * Determine whether it is a news page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNews(String link) {
        return link.contains("news.sina.cn");
    }

    /**
     * Set database connection
     *
     * @throws SQLException SQLException
     */
    public static void setConnection() throws SQLException {
        File projectDir = new File(System.getProperty("basedir", System.getProperty("user.dir")));
        String jdbcUrl = "jdbc:h2:file:" + new File(projectDir, "news").getAbsolutePath();
        connection = DriverManager.getConnection(jdbcUrl, "root", "toor");
    }
}

Automating database migrations with Flyway

  • pom.xml
<plugin>
    <groupId>org.flywaydb</groupId>
    <artifactId>flyway-maven-plugin</artifactId>
    <version>8.0.4</version>
    <configuration>
        <url>jdbc:h2:file:./news</url>
        <user>root</user>
        <password>toor</password>
    </configuration>
    <dependencies>
        <dependency>
            <groupId>com.h2database</groupId>
            <artifactId>h2</artifactId>
            <version>1.4.200</version>
        </dependency>
    </dependencies>
</plugin>

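Usage

  • Naming rules: see the naming rules documentation

  • Directory layout: see the directory layout documentation

  • Running the migration
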
mvn flyway:migrate
  • You can also bind it to Maven's initialize lifecycle phase
<executions>
    <execution>
        <id>initialize</id>
        <phase>initialize</phase>
        <goals>
            <goal>migrate</goal>
        </goals>
    </execution>
</executions>
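
As a quick illustration of the naming rules mentioned above: with Flyway's default settings, a versioned migration file is named with the prefix `V`, a version, the separator `__` (two underscores), a description, and the suffix `.sql`, for example `V1__Create_tables.sql`, and is typically placed under `src/main/resources/db/migration`. The sketch below checks a file name against that shape. It assumes the defaults (all of them are configurable), ignores repeatable migrations, and the class `MigrationNameSketch` is a hypothetical helper, not part of Flyway.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MigrationNameSketch {
    // V<version>__<description>.sql, where the version is digits separated by '.' or '_'
    private static final Pattern VERSIONED =
            Pattern.compile("V(\\d+(?:[._]\\d+)*)__(.+)\\.sql");

    /** Returns the version in dotted form, or empty if the name does not follow the convention. */
    static Optional<String> versionOf(String fileName) {
        Matcher m = VERSIONED.matcher(fileName);
        return m.matches() ? Optional.of(m.group(1).replace('_', '.')) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(versionOf("V1__Create_tables.sql")); // Optional[1]
        System.out.println(versionOf("V1_1__Add_index.sql"));   // Optional[1.1]
        System.out.println(versionOf("create_tables.sql"));     // Optional.empty
    }
}
```

A common mistake this catches is using a single underscore after the version, as in `V1_Create_tables.sql`.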

Third refactoring, v1.0.0

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.stream.Collectors;

public class Main {
    private static Connection connection;

    public static void main(String[] args) throws IOException, SQLException {
        setConnection();
        while (!selectUnHandledUrls().isEmpty()) {
            String url = removeUnHandledMaxUrl();
            if (url == null) break;
            if (selectInHandledWhereUrl(url)) continue;
            if (isWantTo(url)) {
                Document jsoup = parseHtmlToHrefAndInsertIntoLinksUnHandled(url);
                jsoupCssSelectAirtcleAndParseToInsertIntoNews(jsoup, url);
                updateUrlWithSql(url, "INSERT INTO LINKS_IN_HANDLED(link) VALUES ( ? )");
            }
        }
    }

    /**
     * Load links from databases
     *
     * @return ArrayList
     * @throws SQLException SQLException
     */
    private static ArrayList<String> selectUnHandledUrls() throws SQLException {
        ArrayList<String> list = new ArrayList<>();
        try (PreparedStatement preparedStatement = connection.prepareStatement("SELECT link FROM LINKS_UN_HANDLED");
             ResultSet resultSet = preparedStatement.executeQuery()) {
            while (resultSet.next()) {
                list.add(resultSet.getString(1));
            }
            return list;
        }
    }

    /**
     * Take the link with the largest id from LINKS_UN_HANDLED and delete it from the table
     *
     * @return the deleted link, or null if the table is empty
     * @throws SQLException SQLException
     */
    private static String removeUnHandledMaxUrl() throws SQLException {
        try (PreparedStatement preparedStatement = connection.prepareStatement(
                "SELECT * FROM LINKS_UN_HANDLED WHERE id=(SELECT max(id) FROM LINKS_UN_HANDLED)");
             ResultSet resultSet = preparedStatement.executeQuery()) {
            if (resultSet.next()) {
                String url = resultSet.getString(2);
                updateUrlWithSql(url, "DELETE FROM LINKS_UN_HANDLED WHERE link = ?");
                return url;
            }
            return null;
        }
    }

    /**
     * SELECT link FROM LINKS_IN_HANDLED WHERE link = ?
     *
     * @param url http or https link
     * @return if exists return true, else return false
     * @throws SQLException SQLException
     */
    private static boolean selectInHandledWhereUrl(String url) throws SQLException {
        try (PreparedStatement preparedStatement
                     = connection.prepareStatement("SELECT link FROM LINKS_IN_HANDLED WHERE link = ?")) {
            preparedStatement.setString(1, url);
            try (ResultSet resultSet = preparedStatement.executeQuery()) {
                return resultSet.next();
            }
        }
    }

    /**
     * parse html to a tag, and 'INSERT INTO LINKS_UN_HANDLED'
     *
     * @param url http or https link
     * @return Jsoup Document
     * @throws SQLException SQLException
     * @throws IOException  IOException
     */
    private static Document parseHtmlToHrefAndInsertIntoLinksUnHandled(String url) throws SQLException, IOException {
        Document jsoup = Jsoup.parse(httpGetHTML(url));
        for (Element a : jsoup.select("a")) {
            String href = a.attr("href");
            if (href.toLowerCase().startsWith("http")) {
                if (isWantTo(href)) {
                    updateUrlWithSql(href, "INSERT INTO LINKS_UN_HANDLED(link) VALUES ( ? )");
                }
            }
        }
        return jsoup;
    }

    /**
     * executeUpdate for link
     *
     * @param link http or https link
     * @param sql  sql
     * @throws SQLException SQLException
     */
    private static void updateUrlWithSql(String link, String sql) throws SQLException {
        try (PreparedStatement preparedStatement
                     = connection.prepareStatement(sql)) {
            preparedStatement.setString(1, link);
            preparedStatement.executeUpdate();
        }
    }


    /**
     * Get HTML content
     *
     * @param link http or https link
     * @return html body
     * @throws IOException IOException
     */
    private static String httpGetHTML(String link) throws IOException {
        if (link.startsWith("//")) {
            link = "https:" + link;
        }
        HttpGet request = new HttpGet(link);
        request.setHeader("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36");
        // Close the client together with the response so connections are not leaked
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(request)) {
            return EntityUtils.toString(response.getEntity());
        }
    }

    /**
     * Save news to database
     *
     * @param jsoup Jsoup parse
     */
    private static void jsoupCssSelectAirtcleAndParseToInsertIntoNews(Document jsoup, String url) throws SQLException {
        Elements articles = jsoup.select("article");
        if (!articles.isEmpty()) {
            for (Element article : articles) {
                String title = article.select("h1.art_tit_h1")
                        .stream().map(Element::text).collect(Collectors.joining(","));
                String content = article.select("p.art_p")
                        .stream().map(Element::text).collect(Collectors.joining("\n"));
                try (PreparedStatement preparedStatement
                             = connection.prepareStatement("INSERT INTO NEWS(title, content, url) VALUES ( ?, ?, ? )")) {
                    preparedStatement.setString(1, title);
                    preparedStatement.setString(2, content);
                    preparedStatement.setString(3, url);
                    preparedStatement.executeUpdate();
                }
                System.out.println(url);
                System.out.println(title);
            }
        }
    }

    /**
     * Determine whether it is the desired link
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isWantTo(String link) {
        return isNotLogin(link) && (isIndex(link) || isNews(link));
    }

    /**
     * Determine whether it is the home page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isIndex(String link) {
        return "https://sina.cn".equals(link);
    }

    /**
     * Determine whether the link is NOT a login page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNotLogin(String link) {
        return !link.contains("passport.sina.cn");
    }

    /**
     * Determine whether it is a news page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNews(String link) {
        return link.contains("news.sina.cn");
    }

    /**
     * Set database connection
     *
     * @throws SQLException SQLException
     */
    public static void setConnection() throws SQLException {
        File projectDir = new File(System.getProperty("basedir", System.getProperty("user.dir")));
        String jdbcUrl = "jdbc:h2:file:" + new File(projectDir, "news").getAbsolutePath();
        connection = DriverManager.getConnection(jdbcUrl, "root", "toor");
    }
}

Fourth refactoring: extracting the database operations

  • In Java, an object that encapsulates database operations is called a DAO (Data Access Object)
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.stream.Collectors;

public class Crawler {
    private final CrawlerJdbcDao dao;

    public Crawler() {
        try {
            this.dao = new CrawlerJdbcDao();
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }


    public static void main(String[] args) throws IOException, SQLException {
        new Crawler().run();
    }

    private void run() throws SQLException, IOException {
        String link;
        while ((link = removeLink()) != null) {
            if (isHandled(link)) continue;
            if (isWantTo(link)) {
                Document jsoup = parseHtmlToHrefAndInsertIntoLinksUnHandled(link);
                jsoupCssSelectAirtcleAndParseToInsertIntoNews(jsoup, link);
                dao.updateLinksInHandled(link);
            }
        }
    }

    /**
     * Get a link from database and delete it
     *
     * @return return deleted link
     * @throws SQLException SQLException
     */
    private String removeLink() throws SQLException {
        try (ResultSet resultSet = dao.selectLinkFromLinksUnHandledLimit1()) {
            if (resultSet.next()) {
                String url = resultSet.getString(1);
                dao.deleteFromLinksUnHandledWhereLinkIs(url);
                return url;
            }
        }
        return null;
    }

    /**
     * Get link is handled
     *
     * @param url http or https link
     * @return if exists return true, else return false
     * @throws SQLException SQLException
     */
    private boolean isHandled(String url) throws SQLException {
        try (ResultSet resultSet = dao.selectLinkFromLinksInHandledWhereLinkIs(url)) {
            return resultSet.next();
        }
    }

    /**
     * Parse html to a tag, and insert into links_un_handled
     *
     * @param url http or https link
     * @return Jsoup Document
     * @throws SQLException SQLException
     * @throws IOException  IOException
     */
    private Document parseHtmlToHrefAndInsertIntoLinksUnHandled(String url) throws SQLException, IOException {
        Document jsoup = Jsoup.parse(httpGetHTML(url));
        for (Element a : jsoup.select("a")) {
            String href = a.attr("href");
            if (href.toLowerCase().startsWith("http")) {
                if (isWantTo(href)) {
                    dao.updateLinksUnHandled(href);
                }
            }
        }
        return jsoup;
    }

    /**
     * Get HTML content
     *
     * @param link http or https link
     * @return html body
     * @throws IOException IOException
     */
    private static String httpGetHTML(String link) throws IOException {
        if (link.startsWith("//")) {
            link = "https:" + link;
        }
        HttpGet request = new HttpGet(link);
        request.setHeader("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36");
        // Close the client together with the response so connections are not leaked
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(request)) {
            return EntityUtils.toString(response.getEntity());
        }
    }

    /**
     * Save news to database
     *
     * @param jsoup Jsoup parse
     */
    private void jsoupCssSelectAirtcleAndParseToInsertIntoNews(Document jsoup, String url) throws SQLException {
        Elements articles = jsoup.select("article");
        if (!articles.isEmpty()) {
            for (Element article : articles) {
                String title = article.select("h1.art_tit_h1")
                        .stream().map(Element::text).collect(Collectors.joining(","));
                String content = article.select("p.art_p")
                        .stream().map(Element::text).collect(Collectors.joining("\n"));
                dao.updateNews(title, content, url);
                System.out.println(url);
                System.out.println(title);
            }
        }
    }

    /**
     * Determine whether it is the desired link
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isWantTo(String link) {
        return isNotLogin(link) && (isIndex(link) || isNews(link));
    }

    /**
     * Determine whether it is the home page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isIndex(String link) {
        return "https://sina.cn".equals(link);
    }

    /**
     * Determine whether the link is NOT a login page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNotLogin(String link) {
        return !link.contains("passport.sina.cn");
    }

    /**
     * Determine whether it is a news page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNews(String link) {
        return link.contains("news.sina.cn");
    }
}
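The filters above rely on `String.contains`, which also matches URLs that merely mention the domain in a path or query string. As a minimal sketch, assuming it is acceptable to compare the parsed host instead, the checks could look like this. `LinkFilter` and its method names are hypothetical and not part of the original project.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the original project: filters links by their
// parsed host instead of a substring match on the whole URL.
class LinkFilter {
    /** Returns the lower-cased host of the link, or null if it cannot be parsed. */
    static String hostOf(String link) {
        try {
            String host = new URI(link).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** A news page is any page served from news.sina.cn or one of its subdomains. */
    static boolean isNews(String link) {
        String host = hostOf(link);
        return host != null && (host.equals("news.sina.cn") || host.endsWith(".news.sina.cn"));
    }

    /** A login page is any page served from passport.sina.cn. */
    static boolean isLogin(String link) {
        String host = hostOf(link);
        return host != null && host.equals("passport.sina.cn");
    }

    static boolean isIndex(String link) {
        return "https://sina.cn".equals(link);
    }

    static boolean isWanted(String link) {
        return !isLogin(link) && (isIndex(link) || isNews(link));
    }

    public static void main(String[] args) {
        System.out.println(isWanted("https://news.sina.cn/gn/a.html"));
        System.out.println(isWanted("https://passport.sina.cn/signin?r=https://news.sina.cn"));
        System.out.println(isWanted("https://example.com/?from=news.sina.cn"));
    }
}
```

Note that this is stricter than the original checks: a URL on another host that only contains `news.sina.cn` somewhere in its query string is no longer accepted.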
  • DAO Interface
import java.sql.ResultSet;
import java.sql.SQLException;

public interface CrawlerDao {
    void deleteFromLinksUnHandledWhereLinkIs(String link) throws SQLException;

    void updateLinksUnHandled(String link) throws SQLException;

    void updateLinksInHandled(String link) throws SQLException;

    void updateNews(String title, String content, String url) throws SQLException;

    ResultSet selectLinkFromLinksInHandledWhereLinkIs(String url) throws SQLException;

    ResultSet selectLinkFromLinksUnHandledLimit1() throws SQLException;
}
  • CrawlerJdbcDao
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CrawlerJdbcDao implements CrawlerDao {
    private Connection connection;

    public CrawlerJdbcDao() throws SQLException {
        setConnection();
    }

    /**
     * Delete link from LINKS_UN_HANDLED
     *
     * @param link http or https link
     * @throws SQLException SQLException
     */
    @Override
    public void deleteFromLinksUnHandledWhereLinkIs(String link) throws SQLException {
        updateLink("DELETE FROM LINKS_UN_HANDLED WHERE link = ?", link);
    }

    /**
     * Insert into link to LINKS_UN_HANDLED
     *
     * @param link http or https link
     * @throws SQLException SQLException
     */
    @Override
    public void updateLinksUnHandled(String link) throws SQLException {
        updateLink("INSERT INTO LINKS_UN_HANDLED(link) VALUES ( ? )", link);
    }

    /**
     * Insert into link to LINKS_IN_HANDLED
     *
     * @param link http or https link
     * @throws SQLException SQLException
     */
    @Override
    public void updateLinksInHandled(String link) throws SQLException {
        updateLink("INSERT INTO LINKS_IN_HANDLED(link) VALUES ( ? )", link);
    }

    /**
     * Insert into news
     *
     * @param title   news title
     * @param content news content
     * @param url     news url
     * @throws SQLException SQLException
     */
    @Override
    public void updateNews(String title, String content, String url) throws SQLException {
        try (PreparedStatement preparedStatement
                     = connection.prepareStatement("INSERT INTO NEWS(title, content, url) VALUES ( ?, ?, ? )")) {
            preparedStatement.setString(1, title);
            preparedStatement.setString(2, content);
            preparedStatement.setString(3, url);
            preparedStatement.executeUpdate();
        }
    }

    /**
     * Get link from LINKS_IN_HANDLED
     *
     * @param url http or https link
     * @return ResultSet
     * @throws SQLException SQLException
     */
    @Override
    public ResultSet selectLinkFromLinksInHandledWhereLinkIs(String url) throws SQLException {
        return queryLink("SELECT link FROM LINKS_IN_HANDLED WHERE link = ?", url);
    }

    /**
     * Get a link from LINKS_UN_HANDLED
     *
     * @return ResultSet
     * @throws SQLException SQLException
     */
    @Override
    public ResultSet selectLinkFromLinksUnHandledLimit1() throws SQLException {
        return query("SELECT link FROM LINKS_UN_HANDLED LIMIT 1");
    }

    /**
     * Execute query
     *
     * @param sql sql
     * @return ResultSet
     * @throws SQLException SQLException
     */
    private ResultSet query(String sql) throws SQLException {
        return connection.prepareStatement(sql).executeQuery();
    }

    /**
     * Execute query for link
     *
     * @param sql  sql
     * @param link http or https link
     * @return ResultSet
     * @throws SQLException SQLException
     */
    private ResultSet queryLink(String sql, String link) throws SQLException {
        PreparedStatement preparedStatement = connection.prepareStatement(sql);
        preparedStatement.setString(1, link);
        return preparedStatement.executeQuery();
    }

    /**
     * Update link
     *
     * @param sql  sql
     * @param link http or https link
     * @throws SQLException SQLException
     */
    private void updateLink(String sql, String link) throws SQLException {
        try (PreparedStatement preparedStatement = connection.prepareStatement(sql)) {
            preparedStatement.setString(1, link);
            preparedStatement.executeUpdate();
        }
    }

    /**
     * Set database connection
     *
     * @throws SQLException SQLException
     */
    private void setConnection() throws SQLException {
        File projectDir = new File(System.getProperty("basedir", System.getProperty("user.dir")));
        String jdbcUrl = "jdbc:h2:file:" + new File(projectDir, "news").getAbsolutePath();
        connection = DriverManager.getConnection(jdbcUrl, "root", "toor");
    }
}

Fifth refactoring: introducing ORM (object-relational mapping)

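The main purpose of an ORM is to simplify your database operations: it maps rows to objects and back, so you no longer write that mapping code by hand.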
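To make the idea concrete, here is a rough sketch of the hand-written mapping that an ORM takes over. The `News` class and its `fromRow`/`toRow` methods are hypothetical illustrations, and a database row is represented by a plain `Map` so that the example runs without a database.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical value object mirroring the NEWS table (title, content, url).
class News {
    final String title;
    final String content;
    final String url;

    News(String title, String content, String url) {
        this.title = title;
        this.content = content;
        this.url = url;
    }

    /** Column-to-field mapping written by hand; an ORM derives this step from configuration. */
    static News fromRow(Map<String, Object> row) {
        return new News((String) row.get("title"), (String) row.get("content"), (String) row.get("url"));
    }

    /** Field-to-parameter mapping, the reverse direction used for INSERT statements. */
    Map<String, Object> toRow() {
        Map<String, Object> row = new HashMap<>();
        row.put("title", title);
        row.put("content", content);
        row.put("url", url);
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("title", "Example title");
        row.put("content", "Example body");
        row.put("url", "https://news.sina.cn/example");
        News news = News.fromRow(row);
        System.out.println(news.title + " <- " + news.url);
    }
}
```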

  • src/main/resources/db/mybatis/mybatis-config.xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration
        PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <environments default="development">
        <environment id="development">
            <transactionManager type="JDBC"/>
            <dataSource type="POOLED">
                <property name="driver" value="org.h2.Driver"/>
                <property name="url" value="jdbc:h2:file:./news"/>
                <property name="username" value="root"/>
                <property name="password" value="toor"/>
            </dataSource>
        </environment>
    </environments>
    <mappers>
        <mapper resource="db/mybatis/crawler-mapper.xml"/>
    </mappers>
</configuration>
  • src/main/resources/db/mybatis/crawler-mapper.xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper
        PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.github.wjinlei.mybatis">
    <delete id="deleteFromLinksUnHandledWhereLinkIs" parameterType="String">
        DELETE
        FROM LINKS_UN_HANDLED
        WHERE link = #{link}
    </delete>

    <update id="updateLinksUnHandled" parameterType="HashMap">
        INSERT INTO
        <choose>
            <when test="table_name == 'LINKS_IN_HANDLED'">
                LINKS_IN_HANDLED
            </when>
            <otherwise>
                LINKS_UN_HANDLED
            </otherwise>
        </choose>
        (link)
        VALUES (#{link})
    </update>

    <update id="updateNews" parameterType="HashMap">
        INSERT INTO NEWS(title, content, url)
        VALUES (#{title}, #{content}, #{url})
    </update>

    <select id="selectLinkFromLinksInHandledWhereLinkIs" parameterType="String" resultType="boolean">
        SELECT count(link)
        FROM LINKS_IN_HANDLED
        WHERE link = #{url}
    </select>

    <select id="selectLinkFromLinksUnHandledLimit1" resultType="String">
        SELECT link
        FROM LINKS_UN_HANDLED
        LIMIT 1
    </select>
</mapper>
  • src/main/java/com/github/wjinlei/dao/CrawlerDao.java
package com.github.wjinlei.dao;

import java.sql.SQLException;

public interface CrawlerDao {
    void deleteFromLinksUnHandledWhereLinkIs(String link) throws SQLException;

    void updateLinksUnHandled(String link) throws SQLException;

    void updateLinksInHandled(String link) throws SQLException;

    void updateNews(String title, String content, String url) throws SQLException;

    boolean selectLinkFromLinksInHandledWhereLinkIs(String url) throws SQLException;

    String selectLinkFromLinksUnHandledLimit1() throws SQLException;
}
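Because the interface now returns `boolean` and `String` instead of a JDBC `ResultSet`, it can be implemented without any database. The following is a minimal sketch of such an implementation, for example for unit tests. `InMemoryCrawlerDao` and `newsCount` are hypothetical names that do not exist in the original project, and the interface is repeated only so that the sketch compiles on its own.

```java
import java.sql.SQLException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Copy of the interface above so that the sketch compiles on its own.
interface CrawlerDao {
    void deleteFromLinksUnHandledWhereLinkIs(String link) throws SQLException;
    void updateLinksUnHandled(String link) throws SQLException;
    void updateLinksInHandled(String link) throws SQLException;
    void updateNews(String title, String content, String url) throws SQLException;
    boolean selectLinkFromLinksInHandledWhereLinkIs(String url) throws SQLException;
    String selectLinkFromLinksUnHandledLimit1() throws SQLException;
}

// Hypothetical in-memory implementation; not thread-safe, intended for tests only.
class InMemoryCrawlerDao implements CrawlerDao {
    private final Deque<String> unHandled = new ArrayDeque<>();
    private final Set<String> handled = new HashSet<>();
    private int newsCount = 0;

    @Override
    public void deleteFromLinksUnHandledWhereLinkIs(String link) {
        // Mirrors "DELETE ... WHERE link = ?": removes every matching entry.
        unHandled.removeIf(link::equals);
    }

    @Override
    public void updateLinksUnHandled(String link) {
        unHandled.addLast(link);
    }

    @Override
    public void updateLinksInHandled(String link) {
        handled.add(link);
    }

    @Override
    public void updateNews(String title, String content, String url) {
        newsCount++;
    }

    @Override
    public boolean selectLinkFromLinksInHandledWhereLinkIs(String url) {
        return handled.contains(url);
    }

    @Override
    public String selectLinkFromLinksUnHandledLimit1() {
        // Returns null when the pool is empty, like the database-backed versions.
        return unHandled.peekFirst();
    }

    int newsCount() {
        return newsCount;
    }

    public static void main(String[] args) {
        InMemoryCrawlerDao dao = new InMemoryCrawlerDao();
        dao.updateLinksUnHandled("https://sina.cn");
        System.out.println(dao.selectLinkFromLinksUnHandledLimit1());
    }
}
```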
  • src/main/java/com/github/wjinlei/dao/CrawlerMybatisDao.java
package com.github.wjinlei.dao;

import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;

public class CrawlerMybatisDao implements CrawlerDao {
    private final SqlSessionFactory sqlSessionFactory;
    private static final String NAMESPACE = "com.github.wjinlei.mybatis";

    public CrawlerMybatisDao() {
        String resource = "db/mybatis/mybatis-config.xml";
        try (InputStream inputStream = Resources.getResourceAsStream(resource)) {
            sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void deleteFromLinksUnHandledWhereLinkIs(String link) {
        try (SqlSession sqlSession = sqlSessionFactory.openSession(true)) {
            sqlSession.delete(NAMESPACE + ".deleteFromLinksUnHandledWhereLinkIs", link);
        }
    }

    @Override
    public void updateLinksUnHandled(String link) {
        updateLink(link, "LINKS_UN_HANDLED");
    }

    @Override
    public void updateLinksInHandled(String link) {
        updateLink(link, "LINKS_IN_HANDLED");
    }


    @Override
    public void updateNews(String title, String content, String url) {
        try (SqlSession sqlSession = sqlSessionFactory.openSession(true)) {
            HashMap<String, Object> hashMap = new HashMap<>();
            hashMap.put("title", title);
            hashMap.put("content", content);
            hashMap.put("url", url);
            sqlSession.update(NAMESPACE + ".updateNews", hashMap);
        }
    }

    @Override
    public boolean selectLinkFromLinksInHandledWhereLinkIs(String url) {
        try (SqlSession sqlSession = sqlSessionFactory.openSession()) {
            Object result = sqlSession.selectOne(NAMESPACE + ".selectLinkFromLinksInHandledWhereLinkIs", url);
            if (result != null) {
                return (boolean) result;
            }
            return false;
        }
    }

    @Override
    public String selectLinkFromLinksUnHandledLimit1() {
        try (SqlSession sqlSession = sqlSessionFactory.openSession()) {
            return sqlSession.selectOne(NAMESPACE + ".selectLinkFromLinksUnHandledLimit1");
        }
    }

    private void updateLink(String link, String tableName) {
        try (SqlSession sqlSession = sqlSessionFactory.openSession(true)) {
            HashMap<String, Object> hashMap = new HashMap<>();
            hashMap.put("table_name", tableName);
            hashMap.put("link", link);
            sqlSession.update(NAMESPACE + ".updateLinksUnHandled", hashMap);
        }
    }
}
  • src/main/java/com/github/wjinlei/dao/CrawlerJdbcDao.java
package com.github.wjinlei.dao;

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CrawlerJdbcDao implements CrawlerDao {
    private final Connection connection;

    public CrawlerJdbcDao() {
        File projectDir = new File(System.getProperty("basedir", System.getProperty("user.dir")));
        String jdbcUrl = "jdbc:h2:file:" + new File(projectDir, "news").getAbsolutePath();
        try {
            connection = DriverManager.getConnection(jdbcUrl, "root", "toor");
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Delete link from LINKS_UN_HANDLED
     *
     * @param link http or https link
     * @throws SQLException SQLException
     */
    @Override
    public void deleteFromLinksUnHandledWhereLinkIs(String link) throws SQLException {
        updateLink("DELETE FROM LINKS_UN_HANDLED WHERE link = ?", link);
    }

    /**
     * Insert into link to LINKS_UN_HANDLED
     *
     * @param link http or https link
     * @throws SQLException SQLException
     */
    @Override
    public void updateLinksUnHandled(String link) throws SQLException {
        updateLink("INSERT INTO LINKS_UN_HANDLED(link) VALUES ( ? )", link);
    }

    /**
     * Insert into link to LINKS_IN_HANDLED
     *
     * @param link http or https link
     * @throws SQLException SQLException
     */
    @Override
    public void updateLinksInHandled(String link) throws SQLException {
        updateLink("INSERT INTO LINKS_IN_HANDLED(link) VALUES ( ? )", link);
    }

    /**
     * Insert into news
     *
     * @param title   news title
     * @param content news content
     * @param url     news url
     * @throws SQLException SQLException
     */
    @Override
    public void updateNews(String title, String content, String url) throws SQLException {
        try (PreparedStatement preparedStatement
                     = connection.prepareStatement("INSERT INTO NEWS(title, content, url) VALUES ( ?, ?, ? )")) {
            preparedStatement.setString(1, title);
            preparedStatement.setString(2, content);
            preparedStatement.setString(3, url);
            preparedStatement.executeUpdate();
        }
    }

    /**
     * Check whether a link exists in LINKS_IN_HANDLED
     *
     * @param url http or https link
     * @return true if the link has already been handled
     * @throws SQLException SQLException
     */
    @Override
    public boolean selectLinkFromLinksInHandledWhereLinkIs(String url) throws SQLException {
        try (ResultSet resultSet = queryLink("SELECT link FROM LINKS_IN_HANDLED WHERE link = ?", url)) {
            return resultSet.next();
        }
    }

    /**
     * Get a link from LINKS_UN_HANDLED
     *
     * @return a link, or null if the table is empty
     * @throws SQLException SQLException
     */
    @Override
    public String selectLinkFromLinksUnHandledLimit1() throws SQLException {
        try (ResultSet resultSet = query("SELECT link FROM LINKS_UN_HANDLED LIMIT 1")) {
            if (resultSet.next()) {
                return resultSet.getString(1);
            }
        }
        return null;
    }

    /**
     * Execute query
     *
     * @param sql sql
     * @return ResultSet
     * @throws SQLException SQLException
     */
    private ResultSet query(String sql) throws SQLException {
        return connection.prepareStatement(sql).executeQuery();
    }

    /**
     * Execute query for link
     *
     * @param sql  sql
     * @param link http or https link
     * @return ResultSet
     * @throws SQLException SQLException
     */
    private ResultSet queryLink(String sql, String link) throws SQLException {
        PreparedStatement preparedStatement = connection.prepareStatement(sql);
        preparedStatement.setString(1, link);
        return preparedStatement.executeQuery();
    }

    /**
     * Update link
     *
     * @param sql  sql
     * @param link http or https link
     * @throws SQLException SQLException
     */
    private void updateLink(String sql, String link) throws SQLException {
        try (PreparedStatement preparedStatement = connection.prepareStatement(sql)) {
            preparedStatement.setString(1, link);
            preparedStatement.executeUpdate();
        }
    }
}
  • src/main/java/com/github/wjinlei/Crawler.java
package com.github.wjinlei;

import com.github.wjinlei.dao.CrawlerDao;
import com.github.wjinlei.dao.CrawlerMybatisDao;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.sql.SQLException;
import java.util.stream.Collectors;

public class Crawler {
    private final CrawlerDao dao;

    public Crawler() {
        this.dao = new CrawlerMybatisDao();
    }

    public static void main(String[] args) throws IOException, SQLException {
        new Crawler().run();
    }

    private void run() throws SQLException, IOException {
        String link;
        while ((link = removeLink()) != null) {
            if (isHandled(link)) continue;
            if (isWantTo(link)) {
                Document jsoup = parseHtmlToHrefAndInsertIntoLinksUnHandled(link);
                jsoupCssSelectAirtcleAndParseToInsertIntoNews(jsoup, link);
                dao.updateLinksInHandled(link);
            }
        }
    }

    /**
     * Get a link from database and delete it
     *
     * @return return deleted link
     * @throws SQLException SQLException
     */
    private String removeLink() throws SQLException {
        String link = dao.selectLinkFromLinksUnHandledLimit1();
        if (link != null) {
            dao.deleteFromLinksUnHandledWhereLinkIs(link);
            return link;
        }
        return null;
    }

    /**
     * Get link is handled
     *
     * @param url http or https link
     * @return if exists return true, else return false
     * @throws SQLException SQLException
     */
    private boolean isHandled(String url) throws SQLException {
        return dao.selectLinkFromLinksInHandledWhereLinkIs(url);
    }

    /**
     * Parse html to a tag, and insert into links_un_handled
     *
     * @param url http or https link
     * @return Jsoup Document
     * @throws SQLException SQLException
     * @throws IOException  IOException
     */
    private Document parseHtmlToHrefAndInsertIntoLinksUnHandled(String url) throws SQLException, IOException {
        Document jsoup = Jsoup.parse(httpGetHTML(url));
        for (Element a : jsoup.select("a")) {
            String href = a.attr("href");
            if (href.toLowerCase().startsWith("http")) {
                if (isWantTo(href)) {
                    dao.updateLinksUnHandled(href);
                }
            }
        }
        return jsoup;
    }

    /**
     * Get HTML content
     *
     * @param link http or https link
     * @return html body
     * @throws IOException IOException
     */
    private static String httpGetHTML(String link) throws IOException {
        if (link.startsWith("//")) {
            link = "https:" + link;
        }
        CloseableHttpClient document = HttpClients.createDefault();
        HttpGet request = new HttpGet(link);
        request.setHeader("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36");
        try (CloseableHttpResponse response = document.execute(request)) {
            return EntityUtils.toString(response.getEntity());
        }
    }

    /**
     * Save news to database
     *
     * @param jsoup Jsoup parse
     */
    private void jsoupCssSelectAirtcleAndParseToInsertIntoNews(Document jsoup, String url) throws SQLException {
        Elements articles = jsoup.select("article");
        if (!articles.isEmpty()) {
            for (Element article : articles) {
                String title = article.select("h1.art_tit_h1")
                        .stream().map(Element::text).collect(Collectors.joining(","));
                String content = article.select("p.art_p")
                        .stream().map(Element::text).collect(Collectors.joining("\n"));
                dao.updateNews(title, content, url);
                System.out.println(url);
                System.out.println(title);
            }
        }
    }

    /**
     * Determine whether it is the desired link
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isWantTo(String link) {
        return isNotLogin(link) && (isIndex(link) || isNews(link));
    }

    /**
     * Determine whether it is the home page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isIndex(String link) {
        return "https://sina.cn".equals(link);
    }

    /**
     * Determine whether it is a login page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNotLogin(String link) {
        return !link.contains("passport.sina.cn");
    }

    /**
     * Determine whether it is a news page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNews(String link) {
        return link.contains("news.sina.cn");
    }
}

Migrating to MySQL

  1. Add the MySQL JDBC driver
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.27</version>
</dependency>
  2. Update the Flyway connection configuration
<url>jdbc:mysql://localhost:3306/news?characterEncoding=utf-8</url>
  3. Update the MyBatis driver and connection configuration
  • Note that with newer versions of the JDBC driver, the driver class is com.mysql.cj.jdbc.Driver
<dataSource type="POOLED">
    <property name="driver" value="com.mysql.cj.jdbc.Driver"/>
    <property name="url" value="jdbc:mysql://localhost:3306/news?characterEncoding=utf-8"/>
    <property name="username" value="root"/>
    <property name="password" value="toor"/>
</dataSource>
  4. Create the news database in MySQL with the utf8mb4 character set
    • Note: the MySQL account is root and the password is toor
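The last step can also be scripted. The sketch below is only one possible approach: `MysqlSetup` and its methods are hypothetical helpers, the collation `utf8mb4_unicode_ci` is an assumed choice, and `createDatabase` needs a running MySQL server plus the connector on the classpath, so `main` only prints the statement and the URL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper for the migration steps above; not part of the original project.
class MysqlSetup {
    /** Builds the CREATE DATABASE statement. The name must be a trusted identifier, it is not escaped. */
    static String createDatabaseSql(String database) {
        return "CREATE DATABASE IF NOT EXISTS " + database
                + " CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";
    }

    /** Builds the JDBC URL in the same shape as the Flyway and MyBatis configuration. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database + "?characterEncoding=utf-8";
    }

    /** Executes the statement against a server URL such as jdbc:mysql://localhost:3306/. */
    static void createDatabase(String serverUrl, String user, String password, String database)
            throws SQLException {
        try (Connection connection = DriverManager.getConnection(serverUrl, user, password);
             Statement statement = connection.createStatement()) {
            statement.executeUpdate(createDatabaseSql(database));
        }
    }

    public static void main(String[] args) {
        System.out.println(createDatabaseSql("news"));
        System.out.println(jdbcUrl("localhost", 3306, "news"));
    }
}
```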

Sixth refactoring: multithreading to improve crawler performance

  • The Crawler class extends Thread and overrides the run method
import com.github.wjinlei.dao.CrawlerDao;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.sql.SQLException;
import java.util.stream.Collectors;

public class Crawler extends Thread {
    private final CrawlerDao dao;

    public Crawler(CrawlerDao dao) {
        this.dao = dao;
    }

    @Override
    public void run() {
        String link;
        try {
            while ((link = dao.removeLinksUnHandled()) != null) {
                if (dao.selectLinksInHandled(link)) continue;
                if (isWantTo(link)) {
                    Document jsoup = parseHtmlToHrefAndInsertIntoLinksUnHandled(link);
                    jsoupCssSelectAirtcleAndParseToInsertIntoNews(jsoup, link);
                    dao.updateLinksInHandled(link);
                }
            }
        } catch (IOException | SQLException e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Parse html to a tag, and insert into links_un_handled
     *
     * @param url http or https link
     * @return Jsoup Document
     * @throws SQLException SQLException
     * @throws IOException  IOException
     */
    private Document parseHtmlToHrefAndInsertIntoLinksUnHandled(String url) throws SQLException, IOException {
        Document jsoup = Jsoup.parse(httpGetHTML(url));
        for (Element a : jsoup.select("a")) {
            String href = a.attr("href");
            if (href.toLowerCase().startsWith("http")) {
                if (isWantTo(href)) {
                    dao.updateLinksUnHandled(href);
                }
            }
        }
        return jsoup;
    }

    /**
     * Get HTML content
     *
     * @param link http or https link
     * @return html body
     * @throws IOException IOException
     */
    private static String httpGetHTML(String link) throws IOException {
        if (link.startsWith("//")) {
            link = "https:" + link;
        }
        CloseableHttpClient document = HttpClients.createDefault();
        HttpGet request = new HttpGet(link);
        request.setHeader("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36");
        try (CloseableHttpResponse response = document.execute(request)) {
            return EntityUtils.toString(response.getEntity());
        }
    }

    /**
     * Save news to database
     *
     * @param jsoup Jsoup parse
     */
    private void jsoupCssSelectAirtcleAndParseToInsertIntoNews(Document jsoup, String url) throws SQLException {
        Elements articles = jsoup.select("article");
        if (!articles.isEmpty()) {
            for (Element article : articles) {
                String title = article.select("h1.art_tit_h1")
                        .stream().map(Element::text).collect(Collectors.joining(","));
                String content = article.select("p.art_p")
                        .stream().map(Element::text).collect(Collectors.joining("\n"));
                dao.updateNews(title, content, url);
                System.out.println(url);
                System.out.println(title);
            }
        }
    }

    /**
     * Determine whether it is the desired link
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isWantTo(String link) {
        return isNotLogin(link) && (isIndex(link) || isNews(link));
    }

    /**
     * Determine whether it is the home page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isIndex(String link) {
        return "https://sina.cn".equals(link);
    }

    /**
     * Determine whether it is a login page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNotLogin(String link) {
        return !link.contains("passport.sina.cn");
    }

    /**
     * Determine whether it is a news page
     *
     * @param link http or https link
     * @return true or false
     */
    private static boolean isNews(String link) {
        return link.contains("news.sina.cn");
    }
}
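In the `run` method above, a single `IOException` or `SQLException` is rethrown as a `RuntimeException`, which ends that crawler thread. One possible variation, sketched here under the assumption that a failed page should simply be skipped, is to catch the failure per link. `PageHandler` and `ResilientLoop` are hypothetical names, and a queue stands in for the DAO.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for "fetch, parse and store one page".
interface PageHandler {
    void handle(String link) throws IOException;
}

class ResilientLoop {
    /** Processes every queued link and returns the links whose handling failed. */
    static List<String> drain(Queue<String> links, PageHandler handler) {
        List<String> failed = new ArrayList<>();
        String link;
        while ((link = links.poll()) != null) {
            try {
                handler.handle(link);
            } catch (IOException e) {
                // Record the failure and move on instead of ending the loop.
                failed.add(link);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Queue<String> links = new ConcurrentLinkedQueue<>(Arrays.asList("a", "broken", "b"));
        List<String> failed = drain(links, link -> {
            if (link.equals("broken")) {
                throw new IOException("simulated network error");
            }
        });
        System.out.println("failed: " + failed);
    }
}
```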
  • Extract a separate Main class
package com.github.wjinlei;

import com.github.wjinlei.dao.CrawlerMybatisDao;

public class Main {
    public static void main(String[] args) {
        CrawlerMybatisDao crawlerMybatisDao = new CrawlerMybatisDao();
        for (int i = 0; i < 16; i++) {
            new Crawler(crawlerMybatisDao).start();
        }
    }
}
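`Main` starts 16 threads by hand and does not wait for them. As an alternative sketch (not what the original project does), a fixed thread pool from `java.util.concurrent` can run the workers and wait for them to finish. `PoolMain` and `runAll` are hypothetical names, and the `Runnable` passed in is a placeholder for a `Crawler`, which can be submitted as is because `Thread` implements `Runnable`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical alternative to starting threads by hand.
class PoolMain {
    /**
     * Runs the task once on each of `workers` pool threads and waits for completion.
     * Returns true if all workers finished within the timeout.
     */
    static boolean runAll(int workers, Runnable task, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(task);
        }
        pool.shutdown(); // accept no new tasks; already submitted ones keep running
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicInteger started = new AtomicInteger();
        boolean finished = runAll(16, started::incrementAndGet, 10);
        System.out.println("workers run: " + started.get() + ", finished: " + finished);
    }
}
```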
  • Streamline the DAO interface
package com.github.wjinlei.dao;

import java.sql.SQLException;

public interface CrawlerDao {

    void deleteLinksUnHandled(String link) throws SQLException;

    void updateLinksUnHandled(String link) throws SQLException;

    void updateLinksInHandled(String link) throws SQLException;

    void updateNews(String title, String content, String url) throws SQLException;

    boolean selectLinksInHandled(String link) throws SQLException;

    String removeLinksUnHandled() throws SQLException;
}
  • MybatisDao: synchronize the critical method so that a link is not processed more than once
import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;

public class CrawlerMybatisDao implements CrawlerDao {
    private final SqlSessionFactory sqlSessionFactory;
    private static final String NAMESPACE = "com.github.wjinlei.mybatis";

    public CrawlerMybatisDao() {
        String resource = "db/mybatis/mybatis-config.xml";
        try (InputStream inputStream = Resources.getResourceAsStream(resource)) {
            sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void deleteLinksUnHandled(String link) {
        try (SqlSession sqlSession = sqlSessionFactory.openSession(true)) {
            sqlSession.delete(NAMESPACE + ".deleteFromLinksUnHandledWhereLinkIs", link);
        }
    }

    @Override
    public void updateLinksUnHandled(String link) {
        updateLink(link, "LINKS_UN_HANDLED");
    }

    @Override
    public void updateLinksInHandled(String link) {
        updateLink(link, "LINKS_IN_HANDLED");
    }


    @Override
    public void updateNews(String title, String content, String url) {
        try (SqlSession sqlSession = sqlSessionFactory.openSession(true)) {
            HashMap<String, Object> hashMap = new HashMap<>();
            hashMap.put("title", title);
            hashMap.put("content", content);
            hashMap.put("url", url);
            sqlSession.update(NAMESPACE + ".updateNews", hashMap);
        }
    }

    @Override
    public boolean selectLinksInHandled(String url) {
        try (SqlSession sqlSession = sqlSessionFactory.openSession()) {
            Object result = sqlSession.selectOne(NAMESPACE + ".selectLinkFromLinksInHandledWhereLinkIs", url);
            if (result != null) {
                return (boolean) result;
            }
            return false;
        }
    }

    @Override
    public synchronized String removeLinksUnHandled() {
        try (SqlSession sqlSession = sqlSessionFactory.openSession()) {
            String link = sqlSession.selectOne(NAMESPACE + ".selectLinkFromLinksUnHandledLimit1");
            if (link != null) {
                deleteLinksUnHandled(link);
                return link;
            }
        }
        return null;
    }

    // The target table name is passed to the mapper statement as a parameter.
    private void updateLink(String link, String tableName) {
        try (SqlSession sqlSession = sqlSessionFactory.openSession(true)) {
            HashMap<String, Object> hashMap = new HashMap<>();
            hashMap.put("table_name", tableName);
            hashMap.put("link", link);
            sqlSession.update(NAMESPACE + ".updateLinksUnHandled", hashMap);
        }
    }
}

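Introducing Elasticsearch

  • Elasticsearch's core capability is search. You can think of it as a database that is particularly strong at full-text retrieval
  • For reference, see Elasticsearch: The Definitive Guide
  • An Index corresponds to a database in a traditional database system (in this project, each index is used much like a table)
  • A Document corresponds to a row in a traditional database
  • A Field corresponds to a column in a traditional database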
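
To make the mapping above concrete, here is a minimal in-memory sketch in plain Java. It does not talk to Elasticsearch, and `ToyIndex` with its methods is a hypothetical name invented for illustration only. An index holds documents keyed by id, and each document is a set of named fields. The sketch also mirrors one behavior that `CrawlerElasticsearchDao` appears to rely on: indexing a document under an id that already exists replaces the old document, so using the link itself as the id keeps links from being stored twice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the Elasticsearch vocabulary; this is NOT the real client API. */
public class ToyIndex {
    private final String name;
    // id -> document; a document is a map of field name -> field value
    private final Map<String, Map<String, Object>> documents = new LinkedHashMap<>();

    public ToyIndex(String name) {
        this.name = name;
    }

    public String name() {
        return name;
    }

    /** Stores a copy of the document under the given id, replacing any previous one. */
    public void index(String id, Map<String, Object> document) {
        documents.put(id, new LinkedHashMap<>(document));
    }

    public boolean exists(String id) {
        return documents.containsKey(id);
    }

    /** Returns the value of one field of one document, or null if either is missing. */
    public Object field(String id, String fieldName) {
        Map<String, Object> document = documents.get(id);
        return document == null ? null : document.get(fieldName);
    }

    public int size() {
        return documents.size();
    }

    public static void main(String[] args) {
        ToyIndex news = new ToyIndex("news");          // index    ~ database / table
        Map<String, Object> doc = new LinkedHashMap<>(); // document ~ row
        doc.put("title", "Hello");                      // field    ~ column
        doc.put("url", "https://sina.cn/a");
        news.index("https://sina.cn/a", doc);

        doc.put("title", "Hello again");
        news.index("https://sina.cn/a", doc);           // same id: replaces the document

        System.out.println(news.size());                               // prints 1
        System.out.println(news.field("https://sina.cn/a", "title")); // prints Hello again
    }
}
```

The real client expresses the same idea with `IndexRequest(index).id(...).source(...)`, as the DAO code shows.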
package com.github.wjinlei.dao;

import org.apache.http.HttpHost;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CrawlerElasticsearchDao implements CrawlerDao {
    private static final String unHandled = "links_un_handled";
    private static final String inHandled = "links_in_handled";
    private static final String news = "news";

    public CrawlerElasticsearchDao() {
        createIndex(unHandled);
        createIndex(inHandled);
        createIndex(news);
        updateLink("https://sina.cn", unHandled);

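        try {
            // Elasticsearch is near-real-time: a newly indexed document only becomes
            // searchable after the next index refresh, so wait before the first search.
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status for callers
        }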
    }

    private RestHighLevelClient newClient() {
        return new RestHighLevelClient(RestClient.builder(
                new HttpHost("127.0.0.1", 9200, "http")));
    }

    @Override
    public void deleteLinksUnHandled(String link) {
        try (RestHighLevelClient client = newClient()) {
            client.delete(new DeleteRequest(unHandled, link), RequestOptions.DEFAULT);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void updateLinksUnHandled(String link) {
        updateLink(link, unHandled);
    }

    @Override
    public void updateLinksInHandled(String link) {
        updateLink(link, inHandled);
    }

    private void updateLink(String link, String index) {
        Map<String, Object> document = new HashMap<>();
        document.put("link", link);
        try (RestHighLevelClient client = newClient()) {
            client.index(new IndexRequest(index).source(document).id(link), RequestOptions.DEFAULT);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void updateNews(String title, String content, String url) {
        Map<String, Object> document = new HashMap<>();
        document.put("title", title);
        document.put("content", content);
        document.put("url", url);
        try (RestHighLevelClient client = newClient()) {
            client.index(new IndexRequest(news).source(document).id(url), RequestOptions.DEFAULT);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public boolean selectLinksInHandled(String link) {
        try (RestHighLevelClient client = newClient()) {
            GetResponse response = client.get(new GetRequest(inHandled, link), RequestOptions.DEFAULT);
            return response.isExists();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public synchronized String removeLinksUnHandled() {
        try (RestHighLevelClient client = newClient()) {
            SearchRequest request = new SearchRequest(unHandled).source(new SearchSourceBuilder().size(1));
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            if (response.getHits().getTotalHits().value < 1) return null;
            String link = (String) response.getHits().getAt(0).getSourceAsMap().get("link");
            if (link == null) return null;
            deleteLinksUnHandled(link);
            return link;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private void createIndex(String index) {
        try (RestHighLevelClient client = newClient()) {
            if (!client.indices().exists(new GetIndexRequest(index), RequestOptions.DEFAULT)) {
                CreateIndexRequest request = new CreateIndexRequest(index);
                client.indices().create(request, RequestOptions.DEFAULT);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
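
Both DAO classes above implement the same `CrawlerDao` interface, which is what allows the storage backend to be swapped without touching the crawl loop. As a rough illustration, the sketch below is a third, in-memory implementation that could serve in unit tests. It is an assumption-laden sketch, not part of the project: the interface is restated here from the `@Override` signatures so the block compiles on its own, and `InMemoryCrawlerDao` and `newsCount()` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Restated from the @Override methods above; in the project it lives in its own file.
interface CrawlerDao {
    void deleteLinksUnHandled(String link);
    void updateLinksUnHandled(String link);
    void updateLinksInHandled(String link);
    void updateNews(String title, String content, String url);
    boolean selectLinksInHandled(String link);
    String removeLinksUnHandled();
}

/** Hypothetical in-memory DAO; every method is synchronized so threads can share it. */
public class InMemoryCrawlerDao implements CrawlerDao {
    private final Set<String> unHandled = new LinkedHashSet<>(); // keeps insertion order, no duplicates
    private final Set<String> inHandled = new HashSet<>();
    private final Map<String, String[]> news = new LinkedHashMap<>(); // url -> {title, content}

    @Override
    public synchronized void deleteLinksUnHandled(String link) {
        unHandled.remove(link);
    }

    @Override
    public synchronized void updateLinksUnHandled(String link) {
        unHandled.add(link);
    }

    @Override
    public synchronized void updateLinksInHandled(String link) {
        inHandled.add(link);
    }

    @Override
    public synchronized void updateNews(String title, String content, String url) {
        news.put(url, new String[] {title, content});
    }

    @Override
    public synchronized boolean selectLinksInHandled(String link) {
        return inHandled.contains(link);
    }

    /** Takes one unhandled link out of the pool, or returns null when the pool is empty. */
    @Override
    public synchronized String removeLinksUnHandled() {
        Iterator<String> iterator = unHandled.iterator();
        if (!iterator.hasNext()) {
            return null;
        }
        String link = iterator.next();
        iterator.remove();
        return link;
    }

    /** Hypothetical helper for tests: number of stored news items. */
    public synchronized int newsCount() {
        return news.size();
    }
}
```

A test could then construct the crawler with `new InMemoryCrawlerDao()` instead of a database-backed DAO, assuming the crawler accepts any `CrawlerDao`.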

Final result

[Image: final result]

Updated: 2025-03-01