bladedragon

Elegant Web Scraping with Java - Jsoup

Jsoup is a Java library for handling real HTML. It provides a very convenient API for extracting and manipulating data, using the best DOM, CSS, and jQuery-like methods.

Table of Contents#

  1. Overview of Jsoup
  2. Use Cases
     • DOM Parsing
     • CSS Selectors
     • HTML Filtering
  3. Logical Analysis
  4. Conclusion

Overview of Jsoup#

Official explanation:

Jsoup is a Java library for handling real HTML. It provides a very convenient API for extracting and manipulating data, using the best DOM, CSS, and jQuery-like methods.

I came across Jsoup when I was writing a web scraper in Java. I was frustrated with using regex extensively, as it not only reduced the readability of the code but also made it more time-consuming and laborious. That's when I stumbled upon Jsoup.

As a lightweight and powerful web scraping framework, Jsoup makes fetching web page information elegant and convenient. Although it is a Java library, its usage logic is remarkably similar to jQuery's, to the point where anyone familiar with jQuery can pick up the framework easily.

Use Cases#

DOM Parsing#

Jsoup's DOM parsing is extremely simple: create a Document object and you can retrieve the elements of a web page. Let's take parsing a web page as an example. Once the page is converted into a Document, the Element class and its subclasses can be treated as individual nodes, and the whole document can be traversed by calling the relevant methods. The getElementsByTag method of the Element class is reminiscent of its JavaScript counterpart, so it will feel familiar to anyone with a basic understanding of JS and Java. Here's an example of querying student grade information:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class JsoupTest {

    public String getGrade(String stuNum, String idNum) throws IOException {
        String testURL = "http://jwc.cqupt.edu.cn/showStuQmcj.php"; // Target web page
        Connection con = Jsoup.connect(testURL);                    // Open a connection
        con.data("xh", stuNum);                                     // Fill in the form parameters
        con.data("sfzh", idNum);
        Document document = con.post();  // Send as a POST request; the whole page is parsed into a Document

        // Get a child element by its class attribute
        Element pTable = document.body().getElementsByClass("pTable").get(0);

        Elements trs = pTable.getElementsByTag("tbody").get(0).children();
        trs.forEach(tr -> {                                         // Traverse the <tr> tags
            if (!tr.children().isEmpty()) {
                Element element = tr.getElementsByTag("td").get(0);

                if (!element.text().equals("课程类型")) {           // Skip the header row
                    GradeInfo gradeInfo = new GradeInfo();
                    gradeInfo.setProperty(tr.getElementsByTag("td").get(0).text());
                    String term = tr.getElementsByTag("td").get(1).text();

                    System.out.println(term);
                    System.out.println(tr.getElementsByTag("td").get(2).text());
                    System.out.println(tr.getElementsByTag("td").get(5).text());
                    System.out.println(tr.getElementsByTag("td").get(6).text());
                    System.out.println(tr.getElementsByTag("td").get(7).text());
                }
            }
        });
        return "";
    }
}

Result:

With just a few lines of code, basic HTML parsing can be accomplished, and all operations can be explained using JavaScript logic. It is much more practical than using native regex matching.

CSS Selectors#

Jsoup has made a determined effort to catch up with front-end development. In addition to basic DOM parsing operations, it has also incorporated CSS selectors. At first glance, this feature may seem unnecessary, but once you learn how to use it, you will realize how powerful it is. When dealing with complex matching, selectors can easily filter out exactly the elements you want, saving you a lot of code. You can use the Element.select(String selector) and Elements.select(String query) methods to achieve this.

public void getBySelect() throws IOException {
    String testHTML = "<html>" +
            "<head></head>" +
            "<body>" +
            "<span id=\"grade\">成绩</span>" + "<span id=\"subject\">课程</span>" + "<span id=\"name\">姓名</span>" + "<span id=\"stunum\">学号</span>" +
            "<span class=\"score\">85</span>" + "<span class=\"class\">语文</span>" + "<span class=\"stuname\">小明</span>" + "<span class=\"number\">201721001</span>" +
            "<span class=\"score\">80</span>" + "<span class=\"class\">数学</span>" + "<span class=\"stuname\">小明</span>" + "<span class=\"number\">2017210001</span>" +
            "</body></html>";  // Build the sample HTML by string concatenation

    Document document = Jsoup.parse(testHTML);                  // Parse the HTML into a traversable Document
    Elements elements = document.select("span:matchesOwn(^8)"); // Spans whose own text starts with "8"
    for (Element element : elements) {
        System.out.println(element.text());
    }
}

Result:

CSS selectors are similar to those used in jQuery and CSS. They allow you to easily select the elements you want by using specific selector syntax. For more information on selector filtering, I recommend reading this article: Detailed Explanation of JSoup's Select Selector Syntax.
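To make the syntax concrete, here is a small, hedged sketch of a few common selector forms. The sample HTML, class names, and URL fragment below are invented for this demo and do not come from the original post:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorDemo {
    public static void main(String[] args) {
        // Hypothetical sample markup, invented for this demo
        String html = "<div class=\"scores\">"
                + "<span class=\"score\">85</span>"
                + "<span class=\"score\">92</span>"
                + "<a href=\"/detail?id=1\">detail</a>"
                + "</div>";
        Document doc = Jsoup.parse(html);

        // Class selector: every element with class "score"
        Elements scores = doc.select("span.score");
        for (Element e : scores) {
            System.out.println(e.text());               // 85, then 92
        }

        // Attribute-prefix selector: links whose href starts with "/detail"
        Element link = doc.select("a[href^=/detail]").first();
        System.out.println(link.attr("href"));          // /detail?id=1

        // Regex pseudo-selector: spans whose own text starts with 9
        System.out.println(doc.select("span:matchesOwn(^9)").text()); // 92
    }
}
```

The same chainable select calls work unchanged on a document fetched over the network with Jsoup.connect(url).get() instead of a string.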

HTML Filtering#

This feature is something I stumbled upon, but in retrospect it is only natural for Jsoup to handle it: filtering web page information is part of Jsoup's job. XSS injection fundamentally works by inserting specific tags into HTML to change the meaning of the page, so the essence of preventing XSS attacks is recognizing and filtering out unwanted or invalid HTML tags in time. Jsoup provides a whitelist mechanism for exactly this purpose: the clean method strips every tag and attribute not permitted by the whitelist's filtering rules, while preserving the allowed ones.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

/**
 * XSS filtering
 */
public class JsoupUtil {

    /**
     * Use the built-in basicWithImages whitelist.
     * Allowed tags include a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span,
     * strike, strong, sub, sup, u, ul, and img,
     * as well as the href attribute of the a tag and the src, align, alt, height, width, and title attributes of the img tag.
     */
    private static final Whitelist whitelist = Whitelist.basicWithImages();
    /** Configure the output settings so the cleaned HTML is not re-formatted */
    private static final Document.OutputSettings outputSettings = new Document.OutputSettings().prettyPrint(false);
    static {
        // Rich text editors implement some styles with the style attribute,
        // e.g. red text via style="color:red;",
        // so the style attribute must be allowed on all tags
        whitelist.addAttributes(":all", "style");
    }

    public static String clean(String content) {
        return Jsoup.clean(content, "", whitelist, outputSettings);
    }

    public static void main(String[] args) {
        String text = "<a href=\"http://www.baidu.com/a\" onclick=\"alert(1);\">sss</a><script>alert(0);</script>sss";
        System.out.println(clean(text));
    }
}

Logical Analysis of Jsoup#

As a lightweight web scraping framework developed solely by Jonathan Hedley, the code of Jsoup is much simpler compared to other cumbersome frameworks. I have deepened my understanding of Jsoup through some blog posts analyzing its source code. Without further ado, let's take a look at the charm of Jsoup!

Borrowing an image from another blog post, you can see that the ability to store nested tags in a JavaScript-like structure comes from a custom Node abstract class. Attributes and children are kept in a tree, which not only makes DOM parsing straightforward but also simplifies traversal and improves performance.
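That tree shape is easy to observe by walking a parsed document with Jsoup's NodeTraversor and NodeVisitor. The snippet below (the sample HTML is my own) prints each node's name indented by its depth in the tree:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class TreeWalkDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><p>hi <b>there</b></p></div>");
        // Depth-first walk over the Node tree; head() fires when a node is first visited
        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Indent by depth to make the nesting visible (requires Java 11+ for String.repeat)
                System.out.println("  ".repeat(depth) + node.nodeName());
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node
            }
        }, doc.body());
    }
}
```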

Jsoup implements selectors through the Evaluator abstract class. The expression passed to Selector is compiled by the QueryParser into a corresponding Evaluator, which has many subclasses implementing the different matching rules. The logic is relatively simple, but I haven't studied the code in detail, so I won't go further here. However, the idea of composing nested objects is worth learning from.
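That compose-nested-predicates idea can be sketched in a few lines. The class names below merely mimic the shape of Jsoup's Evaluator hierarchy; they are illustrative, not Jsoup's actual source:

```java
import java.util.List;

// Illustrative sketch: each selector fragment becomes a small predicate object,
// and combinators such as And nest them. Not Jsoup's real code.
abstract class Evaluator {
    abstract boolean matches(String tagName, String className);
}

class TagIs extends Evaluator {
    private final String name;
    TagIs(String name) { this.name = name; }
    boolean matches(String tagName, String className) { return tagName.equals(name); }
}

class ClassIs extends Evaluator {
    private final String cls;
    ClassIs(String cls) { this.cls = cls; }
    boolean matches(String tagName, String className) { return className.equals(cls); }
}

class And extends Evaluator {
    private final List<Evaluator> evaluators;
    And(Evaluator... evals) { this.evaluators = List.of(evals); }
    boolean matches(String tagName, String className) {
        // A node matches only if every nested evaluator matches
        return evaluators.stream().allMatch(e -> e.matches(tagName, className));
    }
}

public class EvaluatorSketch {
    public static void main(String[] args) {
        // A query like "span.score" would compile to And(TagIs("span"), ClassIs("score"))
        Evaluator spanScore = new And(new TagIs("span"), new ClassIs("score"));
        System.out.println(spanScore.matches("span", "score")); // true
        System.out.println(spanScore.matches("div", "score"));  // false
    }
}
```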

In terms of HTML filtering, Jsoup's strategy to prevent XSS attacks is as follows:

  1. Parse the HTML string into a Document object first, so that injected script fragments cannot slip through as raw strings and change the behavior of the page.
  2. Filter tags against a whitelist of known-safe tags and attributes, so that risky markup is removed early.

Conclusion#

Jsoup has demonstrated its capabilities in terms of convenience. In terms of performance, Jsoup must parse the entire page before it can search it, so for very simple extraction a direct regex may well be the fastest option. However, when the HTML page becomes more complex, that is when Jsoup shines.

However, Jsoup still has some limitations, such as:

  1. It can only handle static pages; content rendered dynamically in the browser by JavaScript will not appear in the fetched HTML. In such cases, other tools such as HttpUnit need to be used to simulate the AJAX requests.
  2. Jsoup is ultimately a personal project, so there is some uncertainty in terms of long-term maintenance. If you need to use it in large-scale, long-term projects, careful consideration is required.
  3. Although Jsoup itself is lightweight, it still needs to parse the entire HTML document before it can search it. For simple web page parsing, using a regex directly is undoubtedly more convenient; but if the structure of the page is complex enough that a regex approach would balloon into a huge amount of code, then Jsoup is a good choice.

In conclusion, Jsoup is a lightweight web scraping framework that performs well in HTML parsing. If you want to save time and code, I highly recommend using it for your projects.
