Technical Guide to Using Jsoup in Java to Parse and Manipulate HTML

1. Overview

In modern Java development, processing HTML data is a common requirement. Whether you are crawling web pages, parsing HTML documents, or manipulating DOM trees, Jsoup is a capable tool. It is a Java HTML parsing library that can parse HTML from URLs, files, or strings, and it provides a jQuery-like API for selecting and manipulating DOM elements.

This article introduces Jsoup's basic functionality and shows how to parse and manipulate HTML with it through several detailed code examples.

2. Why choose Jsoup?

Simple and easy to use: the API is friendly and feature-rich.
Powerful selectors: supports CSS selectors and DOM traversal.
Flexible HTML operations: HTML can be modified easily.
Strong compatibility: parses HTML5 as well as loose, malformed HTML.
Efficient: you can quickly fetch and parse content from a URL.

Before using Jsoup, add it to your project. Here is Jsoup's Maven dependency:
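
The tolerance for loose HTML mentioned above can be seen in a small sketch. The class name LenientParsing and the sample markup are made up for illustration; the input has unclosed <p> tags and no <html> or <body> wrapper, yet Jsoup still builds a complete tree that the selector API can query.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only for this illustration.
class LenientParsing {
    // Returns the text of every <p> element, in document order.
    static List<String> paragraphs(String html) {
        Document document = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element p : document.select("p")) {
            texts.add(p.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Neither <p> is closed, and there is no <html>/<body> wrapper.
        String sloppy = "<p>One<p>Two";
        System.out.println(paragraphs(sloppy)); // prints [One, Two]
    }
}
```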

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.4</version>
</dependency>

3. Basic usage

Jsoup works in any Java project, including Spring Boot applications. The following examples demonstrate how to use Jsoup to parse HTML and operate on the DOM.

3.1 Crawl the content of the web page from the URL

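import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFromUrl {
    public static void main(String[] args) {
        try {
            // Fetch the web page from the URL (fill in the address of the page to fetch)
            Document document = Jsoup.connect("").get();

            // Output the web page title
            System.out.println("Title: " + document.title());

            // Output the first paragraph of the web page
            // (first() returns null when the page has no <p> element)
            System.out.println("First Paragraph: " + document.select("p").first().text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}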
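
Real sites often need a few request settings before a fetch behaves well. The following is a minimal sketch, assuming a placeholder URL (https://example.com) and a made-up user agent string; the class and method names are hypothetical. It configures the Connection first and only sends the request when get() is called.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

// Hypothetical helper class, used only for this illustration.
class JsoupFetchOptions {
    // Builds a configured connection; nothing is sent until get() or execute() is called.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)") // made-up user agent
                .timeout(10_000)                                       // milliseconds
                .followRedirects(true);
    }

    public static void main(String[] args) {
        try {
            Document document = prepare("https://example.com").get(); // placeholder URL
            System.out.println("Title: " + document.title());
        } catch (IOException e) {
            // Timeouts, HTTP error statuses, and unsupported content types all surface as IOException.
            System.err.println("Fetch failed: " + e.getMessage());
        }
    }
}
```

Catching IOException rather than Exception keeps programming errors, such as a null pointer, from being silently swallowed.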

3.2 Parsing HTML from a String

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFromString {
    public static void main(String[] args) {
        String html = "<html><head><title>Jsoup Example</title></head>" +
                      "<body><p>Hello, Jsoup!</p></body></html>";

        // Parse the HTML string
        Document document = Jsoup.parse(html);

        // Output the title and body text
        System.out.println("Title: " + document.title());
        System.out.println("Body Text: " + document.body().text());
    }
}

3.3 Extract content using CSS selector

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupCssSelector {
    public static void main(String[] args) {
        String html = "<html><body>" +
                      "<div class='content'><h1>Header</h1><p>Paragraph 1</p></div>" +
                      "<div class='footer'><p>Footer Paragraph</p></div>" +
                      "</body></html>";

        // Parse the HTML
        Document document = Jsoup.parse(html);

        // Use CSS selectors to extract content
        Elements content = document.select(".content h1");
        System.out.println("Header: " + content.text());

        Elements footer = document.select(".footer p");
        System.out.println("Footer: " + footer.text());
    }
}
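
When a selector may match nothing, calling first().text() on the result throws a NullPointerException. One possible guard is sketched here, using selectFirst, which returns null when there is no match. The helper name textOrDefault and the "(none)" fallback are illustrative choices, not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for this illustration.
class SafeSelect {
    // Returns the text of the first match, or the fallback when nothing matches.
    static String textOrDefault(Document document, String cssQuery, String fallback) {
        Element match = document.selectFirst(cssQuery); // null when there is no match
        return match != null ? match.text() : fallback;
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse(
                "<div class='content'><h1>Header</h1><p>Paragraph 1</p></div>");
        System.out.println(textOrDefault(document, ".content h1", "(none)")); // prints Header
        System.out.println(textOrDefault(document, ".footer p", "(none)"));   // prints (none)
    }
}
```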

3.4 Modify HTML content

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupModifyHtml {
    public static void main(String[] args) {
        String html = "<html><body><p>Original Paragraph</p></body></html>";

        // Parse the HTML
        Document document = Jsoup.parse(html);

        // Modify the paragraph content
        document.select("p").first().text("Updated Paragraph");

        // Output the modified HTML
        System.out.println(document.html());
    }
}
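
Replacing text is only one kind of modification. The sketch below, with a made-up class name and sample markup, shows a few other common edits on a single element: adding a class, setting an attribute, appending a child, and removing a child.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for this illustration.
class JsoupEditElements {
    // Applies several edits to the element with id "box" and returns it.
    // Assumes the input contains such an element; getElementById returns null otherwise.
    static Element edit(String html) {
        Document document = Jsoup.parse(html);
        Element box = document.getElementById("box");
        box.addClass("highlight");           // add a CSS class
        box.attr("data-state", "updated");   // set (or overwrite) an attribute
        box.appendElement("p").text("New");  // append a new <p> child with text
        box.select("span").remove();         // remove all <span> descendants
        return box;
    }

    public static void main(String[] args) {
        Element box = edit("<div id='box'><p>Old</p><span>tmp</span></div>");
        System.out.println(box.outerHtml());
    }
}
```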

3.5 Extract links and pictures from web pages

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupExtractLinks {
    public static void main(String[] args) {
        String html = "<html><body>" +
                      "<a href=''>Example</a>" +
                      "<img src='' alt='Example Image'>" +
                      "</body></html>";

        // Parse the HTML
        Document document = Jsoup.parse(html);

        // Extract links
        Elements links = document.select("a[href]");
        for (Element link : links) {
            System.out.println("Link: " + link.attr("href") + " Text: " + link.text());
        }

        // Extract images
        Elements images = document.select("img[src]");
        for (Element image : images) {
            System.out.println("Image: " + image.attr("src") + " Alt: " + image.attr("alt"));
        }
    }
}
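
Links in real pages are often relative. A minimal sketch of resolving them follows, assuming a placeholder base URI under https://example.com and a hypothetical class name. When HTML is parsed from a string, the base URI has to be passed to Jsoup.parse; when a page is fetched with Jsoup.connect, the fetched URL is used as the base automatically.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only for this illustration.
class JsoupAbsoluteLinks {
    // Returns every link target resolved against the given base URI.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            urls.add(link.absUrl("href")); // equivalent to link.attr("abs:href")
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href='/docs'>Docs</a><a href='page.html'>Page</a>";
        // Placeholder base URI for the example.
        System.out.println(absoluteLinks(html, "https://example.com/guide/"));
    }
}
```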

3.6 Processing form data

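import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFormExample {
    public static void main(String[] args) {
        try {
            // Submit the form (the URL must be absolute, including scheme and host)
            Connection.Response response = Jsoup.connect("/login")
                    .data("username", "user123")
                    .data("password", "pass123")
                    .method(Connection.Method.POST)
                    .execute();

            // Parse the response HTML
            Document document = response.parse();
            System.out.println("Response: " + document.body().text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}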
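
Login forms often carry hidden fields, such as a token, that must be sent back together with the visible inputs. The sketch below reads the fields of the first form on a page so they can be passed on with data(...). The class name, the field names, and the token value are invented for the example.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, used only for this illustration.
class JsoupFormFields {
    // Returns the submittable name/value pairs of the first form, in document order.
    static Map<String, String> fields(String html) {
        Document document = Jsoup.parse(html);
        List<FormElement> forms = document.select("form").forms();
        Map<String, String> result = new LinkedHashMap<>();
        if (forms.isEmpty()) {
            return result; // no form on the page
        }
        for (Connection.KeyVal field : forms.get(0).formData()) {
            result.put(field.key(), field.value());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<form action='/login' method='post'>" +
                      "<input type='hidden' name='token' value='abc123'>" + // made-up token
                      "<input name='username' value='user123'>" +
                      "<input type='submit' value='Log in'>" +
                      "</form>";
        System.out.println(fields(html));
    }
}
```

The submit button is skipped here because it has no name attribute; controls without a name are not part of the form data.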

4. Use scenarios

Web crawling: Extract web page content, such as titles, paragraphs, links, etc.
HTML Cleaning: Clean and format user-generated HTML.
Form Submission: Simulate user login or submit data.
DOM manipulation: Parse and modify HTML documents.
Data Extraction: Extract structured data from HTML tables.
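
Two of the scenarios above, HTML cleaning and table extraction, are not covered by the earlier examples, so here is a minimal sketch of both. The class and method names are hypothetical, and the choice of Safelist.basic() is an assumption; an application would pick whichever safelist matches the markup it wants to allow.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only for this illustration.
class JsoupScenarios {
    // Strips everything that Safelist.basic() does not allow (scripts, event handlers, ...).
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    // Reads every table row into a list of cell texts (header and data cells alike).
    static List<List<String>> tableRows(String html) {
        Document document = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        for (Element row : document.select("table tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text()); // empty cells are kept as "" so columns stay aligned
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<p onclick='x()'>Hi <b>there</b><script>alert(1)</script></p>"));
        System.out.println(tableRows(
                "<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Apple</td><td>3</td></tr></table>"));
    }
}
```

The table helper treats every tr under a table the same way, so nested tables would need extra handling.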

5. Summary

Jsoup is a powerful tool for processing HTML. It can fetch, parse, and manipulate HTML quickly, and it suits a variety of application scenarios.

Advantages:

Simple to use, with a low learning cost.
Powerful, with support for a variety of HTML operations.
Strong compatibility: it can handle HTML in many forms, including malformed markup.

Disadvantages:

Jsoup has no built-in concurrent crawling or scheduling, so fetching many pages efficiently means managing threads yourself.
Jsoup does not execute JavaScript, so for dynamically loaded web pages (such as AJAX) it needs to be used in conjunction with other tools.
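
On the throughput point: each Jsoup parse is independent of the others, so a standard thread pool can be wrapped around it. The following is a rough sketch under that assumption, with hypothetical names. It parses in-memory strings so that it runs without a network; for live crawling, Jsoup.connect(url).get() would take the place of Jsoup.parse(page).

```java
import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class, used only for this illustration.
class ParallelTitles {
    // Parses each page on a worker thread and returns the titles in input order.
    static List<String> titles(List<String> pages, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String page : pages) {
                Callable<String> task = () -> Jsoup.parse(page).title();
                pending.add(pool.submit(task));
            }
            List<String> titles = new ArrayList<>();
            for (Future<String> result : pending) {
                titles.add(result.get()); // blocks until that task has finished
            }
            return titles;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A parse task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> pages = List.of("<title>A</title>", "<title>B</title>", "<title>C</title>");
        System.out.println(titles(pages, 2)); // prints [A, B, C]
    }
}
```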
