1. Brief description
Processing HTML is a common requirement in Java development. Whether you are crawling web pages, parsing HTML documents, or manipulating DOM trees, Jsoup is a capable tool. It is a Java HTML parsing library that can parse HTML from URLs, files, or strings, and it provides a jQuery-like API for selecting and manipulating DOM elements.
This article introduces the basic functionality of Jsoup and shows how to parse and manipulate HTML with it through several detailed code examples.
2. Why choose Jsoup?
- Simple and easy to use: the API is friendly and feature-rich.
- Powerful selectors: supports CSS selectors and DOM traversal.
- Flexible HTML operations: HTML can be modified easily.
- Strong compatibility: parses HTML5 as well as loose, malformed HTML.
- Efficient: content can be fetched from a URL and parsed quickly.
Before using Jsoup, add it to your project. The Maven dependency is:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.4</version>
</dependency>
3. Basic usage
Once the dependency is on the classpath (in a Spring Boot project or any other Java project), the following examples demonstrate how to use Jsoup to parse HTML and manipulate the DOM.
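3.1 Fetch web page content from a URL
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFromUrl {
    public static void main(String[] args) {
        try {
            // Fetch the web page content from the URL.
            // The URL is blank in the source; supply the address of the page to fetch.
            Document document = Jsoup.connect("").get();

            // Output the web page title
            System.out.println("Title: " + document.title());

            // Output the first paragraph of the web page
            // (first() returns null if the page contains no <p> element)
            System.out.println("First Paragraph: " + document.select("p").first().text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}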
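Real sites often need a few connection settings before the request is sent. The following is a minimal sketch, not a recommended configuration: the user-agent string, the 10-second timeout, and the `https://example.com/` address are illustrative assumptions that do not come from the article. Building the connection sends nothing; only `get()` or `execute()` performs the request.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class JsoupConnectionSettings {

    // Builds a configured connection without sending anything yet.
    // The user-agent string and timeout value are assumptions for illustration.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)")
                .timeout(10_000)        // timeout in milliseconds
                .followRedirects(true);
    }

    public static void main(String[] args) {
        // https://example.com/ is a placeholder address, not one from the article.
        Connection connection = configure("https://example.com/");
        System.out.println("Timeout (ms): " + connection.request().timeout());
        // Calling connection.get() here would perform the request and return a Document.
    }
}
```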
3.2 Parsing HTML from a String
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFromString {
    public static void main(String[] args) {
        String html = "<html><head><title>Jsoup Example</title></head>"
                + "<body><p>Hello, Jsoup!</p></body></html>";

        // Parse the HTML string
        Document document = Jsoup.parse(html);

        // Output the title and the body text
        System.out.println("Title: " + document.title());
        System.out.println("Body Text: " + document.body().text());
    }
}
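Jsoup can also read HTML from a file, the third input source mentioned in the introduction. The sketch below assumes the file is UTF-8 encoded. The class name, the helper `parseFile`, and the temporary sample file are hypothetical and exist only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFromFile {

    // Parses an HTML file, decoding it as UTF-8 (an assumption about the input).
    static Document parseFile(Path path) {
        try {
            return Jsoup.parse(path.toFile(), "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample page to a temporary file so the example can run anywhere.
        Path path = Files.createTempFile("jsoup-example", ".html");
        String html = "<html><head><title>From File</title></head>"
                + "<body><p>Hello from a file</p></body></html>";
        Files.write(path, html.getBytes(StandardCharsets.UTF_8));

        Document document = parseFile(path);
        System.out.println("Title: " + document.title());
        System.out.println("Body Text: " + document.body().text());

        Files.delete(path);
    }
}
```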
3.3 Extract content using CSS selectors
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupCssSelector {
    public static void main(String[] args) {
        String html = "<html><body>"
                + "<div class='content'><h1>Header</h1><p>Paragraph 1</p></div>"
                + "<div class='footer'><p>Footer Paragraph</p></div>"
                + "</body></html>";

        // Parse the HTML
        Document document = Jsoup.parse(html);

        // Use CSS selectors to extract content
        Elements content = document.select(".content h1");
        System.out.println("Header: " + content.text());

        Elements footer = document.select(".footer p");
        System.out.println("Footer: " + footer.text());
    }
}
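A selector may match nothing. `select(...)` then returns an empty `Elements` whose `text()` is an empty string, while `selectFirst(...)` returns `null`. The sketch below shows one way to guard against that. The helper `textOrDefault` and the fallback value are hypothetical names introduced for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSafeSelect {

    // Returns the text of the first match, or the fallback when nothing matches.
    static String textOrDefault(Document document, String cssQuery, String fallback) {
        Element match = document.selectFirst(cssQuery); // null when there is no match
        return match != null ? match.text() : fallback;
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse("<div class='content'><h1>Header</h1></div>");

        // The first selector matches; the second does not.
        System.out.println(textOrDefault(document, ".content h1", "(none)")); // prints "Header"
        System.out.println(textOrDefault(document, ".footer p", "(none)"));   // prints "(none)"
    }
}
```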
3.4 Modify HTML content
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupModifyHtml {
    public static void main(String[] args) {
        String html = "<html><body><p>Original Paragraph</p></body></html>";

        // Parse the HTML
        Document document = Jsoup.parse(html);

        // Modify the paragraph content
        document.select("p").first().text("Updated Paragraph");

        // Output the modified HTML
        System.out.println(document.html());
    }
}
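Besides replacing text, elements can be given classes, attributes, and child markup. The following is a small sketch of those calls. The class name `lead`, the `data-id` attribute, and the helper `decorate` are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupModifyAttributes {

    // Adds a class, an attribute, and a child element to the first paragraph,
    // then returns the resulting body HTML.
    static String decorate(String html) {
        Document document = Jsoup.parse(html);
        Element paragraph = document.selectFirst("p");
        if (paragraph != null) {
            paragraph.addClass("lead")          // hypothetical class name
                    .attr("data-id", "1")       // hypothetical attribute
                    .append("<em>!</em>");      // appended as the last child
        }
        return document.body().html();
    }

    public static void main(String[] args) {
        System.out.println(decorate("<html><body><p>Original Paragraph</p></body></html>"));
    }
}
```

The mutating methods return the element itself, which is why the calls can be chained.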
3.5 Extract links and images from web pages
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupExtractLinks {
    public static void main(String[] args) {
        // The href and src values are empty in the source
        String html = "<html><body>"
                + "<a href=''>Example</a>"
                + "<img src='' alt='Example Image'>"
                + "</body></html>";

        // Parse the HTML
        Document document = Jsoup.parse(html);

        // Extract links
        Elements links = document.select("a[href]");
        for (Element link : links) {
            System.out.println("Link: " + link.attr("href") + " Text: " + link.text());
        }

        // Extract images
        Elements images = document.select("img[src]");
        for (Element image : images) {
            System.out.println("Image: " + image.attr("src") + " Alt: " + image.attr("alt"));
        }
    }
}
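Pages frequently use relative links. When the document has a base URI, `absUrl("href")` resolves them to absolute URLs. The sketch below passes the base URI explicitly to `Jsoup.parse`. The `https://example.com/` base, the sample link, and the helper `absoluteLinks` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupAbsoluteLinks {

    // Collects every link in the HTML, resolved against the given base URI.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Placeholder base URI and link, not taken from the article.
        String html = "<a href='/docs/intro.html'>Docs</a>";
        System.out.println(absoluteLinks(html, "https://example.com/"));
    }
}
```

Documents fetched with `Jsoup.connect(...).get()` already carry the page address as their base URI, so the explicit argument is only needed for strings and files.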
3.6 Processing form data
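import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFormExample {
    public static void main(String[] args) {
        try {
            // Submit the form with a POST request.
            // The host part of the URL is missing in the source;
            // Jsoup.connect needs an absolute http(s) URL.
            Connection.Response response = Jsoup.connect("/login")
                    .data("username", "user123")
                    .data("password", "pass123")
                    .method(Connection.Method.POST)
                    .execute();

            // Get the response HTML
            Document document = response.parse();
            System.out.println("Response: " + document.body().text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}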
4. Usage scenarios
- Web crawling: extract web page content such as titles, paragraphs, and links.
- HTML cleaning: clean and format user-generated HTML.
- Form submission: simulate a user login or submit data.
- DOM operations: parse and modify HTML files.
- Data extraction: extract structured data from HTML tables.
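Two of these scenarios, HTML cleaning and table extraction, have no example earlier in the article. The sketch below shows one possible approach to each, assuming Jsoup 1.15.x where the allow-list class is `org.jsoup.safety.Safelist`. The helper names `clean` and `tableRows` and the sample inputs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupScenarios {

    // HTML cleaning: keep only the simple text-formatting tags allowed by Safelist.basic().
    static String clean(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    // Data extraction: turn each table row into a list of its cell texts.
    static List<List<String>> tableRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        for (Element row : Jsoup.parse(html).select("table tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // The script element is not on the basic allow-list, so it is removed.
        System.out.println(clean("<p>Hi <script>alert(1)</script><b>there</b></p>"));

        String table = "<table>"
                + "<tr><th>Name</th><th>Age</th></tr>"
                + "<tr><td>Ann</td><td>30</td></tr>"
                + "</table>";
        System.out.println(tableRows(table)); // prints [[Name, Age], [Ann, 30]]
    }
}
```

`Safelist.basic()` is only one of the predefined allow-lists. Which tags to permit depends on where the cleaned HTML will be displayed.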
5. Summary
Jsoup is a powerful tool for processing HTML. It can fetch, parse, and manipulate HTML quickly, and it suits a variety of application scenarios.
Advantages:
- Simple to use, with a low learning cost.
- Powerful, with support for a wide range of HTML operations.
- Strong compatibility: it handles well-formed and malformed HTML alike.
Disadvantages:
- No built-in concurrency: each request is a blocking call, so crawling many pages efficiently means managing threads yourself.
- Jsoup does not execute JavaScript, so dynamically loaded pages (such as AJAX-driven content) require other tools alongside it.
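Crawling throughput can be raised by running several blocking fetches on a thread pool that the caller manages. The sketch below keeps the fetch step as a function parameter so it runs without network access. In real use that function might wrap something like `Jsoup.connect(url).get().title()` with its own exception handling. The class `ParallelFetch`, the helper `fetchAll`, and the pool size are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelFetch {

    // Runs the fetch function for every URL on a fixed-size pool
    // and returns the results in the same order as the input.
    static List<String> fetchAll(List<String> urls, Function<String, String> fetch, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetch.apply(url);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in fetch function; the URLs are placeholders and are never contacted.
        List<String> urls = Arrays.asList("https://example.com/a", "https://example.com/b");
        List<String> results = fetchAll(urls, url -> "fetched " + url, 2);
        System.out.println(results);
    }
}
```

A fixed pool keeps the number of simultaneous requests bounded, which also helps avoid overloading the target site.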