Install PHP and Selenium
Selenium is a web automation testing tool that simulates user actions on web pages. It provides client bindings for multiple languages, including PHP.
Integrate Selenium in PHP
Install the PHP Selenium library. It can be installed through Composer:
composer require facebook/webdriver
Define your web driver
Chrome is used here, but Selenium supports other browsers as well. The following code can be saved as a separate file:
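<?php
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

// NOTE: this path is truncated in the source; Composer's autoloader is normally vendor/autoload.php
require_once('vendor/');

// Address of the Selenium server
$host = 'http://localhost:4444/wd/hub';

// Run Chrome in headless mode
$capabilities = DesiredCapabilities::chrome();
$capabilities->setCapability('goog:chromeOptions', ['args' => ['--headless']]);

// Connect to the driver
$driver = RemoteWebDriver::create($host, $capabilities);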
- Import the necessary classes and files
- Define the driver's address and the Chrome browser options
- Create a connection to the driver through the RemoteWebDriver class
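The Chrome options above are a plain PHP array, so they can be assembled separately before being passed to setCapability. The following is a minimal sketch, not part of the original setup: the helper name buildChromeOptions is hypothetical, and the extra flags (--disable-gpu, --window-size) are assumed examples of Chrome command-line switches you might want.

```php
<?php
// Hypothetical helper: builds the array passed as 'goog:chromeOptions'.
function buildChromeOptions(bool $headless = true, array $extraArgs = []): array
{
    $args = $headless ? ['--headless'] : [];
    foreach ($extraArgs as $arg) {
        // Skip duplicates so each flag appears only once.
        if (!in_array($arg, $args, true)) {
            $args[] = $arg;
        }
    }
    return ['args' => $args];
}

// Assumed example flags; adjust them to your environment.
$options = buildChromeOptions(true, ['--disable-gpu', '--window-size=1920,1080', '--headless']);
```

The result can then be used in place of the inline array: $capabilities->setCapability('goog:chromeOptions', $options);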
Simulate user operations
For example, visit a website:
$driver->get('');
This opens Baidu News. The following code then gets all the news links:
// Requires "use Facebook\WebDriver\WebDriverBy;" at the top of the file
$news_links = $driver->findElements(WebDriverBy::cssSelector('.c-title a'));
$links = [];
foreach ($news_links as $news_link) {
    $links[] = $news_link->getAttribute('href');
}
- Use WebDriverBy::cssSelector to get all news links through a CSS selector
- Iterate through the links and get the URL of each one
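The collected href values may contain duplicates, empty values, or non-HTTP links, so it can help to clean the list before crawling it. The following is a minimal sketch under that assumption; the helper name cleanLinks and the example.com URLs are hypothetical and do not come from the original code.

```php
<?php
// Hypothetical helper: cleans the raw href values collected from the page.
function cleanLinks(array $hrefs): array
{
    $seen = [];
    foreach ($hrefs as $href) {
        // Skip values that are not strings (for example null for a missing attribute).
        if (!is_string($href)) {
            continue;
        }
        $href = trim($href);
        // Keep only absolute http(s) URLs; this drops "javascript:" links and empty strings.
        if (!preg_match('#^https?://#i', $href)) {
            continue;
        }
        // Drop the fragment so "page#a" and "page#b" count as one URL.
        $url = explode('#', $href, 2)[0];
        $seen[$url] = true;
    }
    // Array keys are unique, so this returns each URL once, in first-seen order.
    return array_keys($seen);
}

$cleaned = cleanLinks([
    'https://example.com/a',
    ' https://example.com/a#top ',
    null,
    'javascript:void(0)',
    'http://example.com/b',
]);
```

In the crawler, this would be applied as $links = cleanLinks($links); before the list is iterated.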
Now that you have all the news links, you can iterate through them to crawl the content of each link in turn:
foreach ($links as $link) {
    $driver->get($link);
    $news_title = $driver->findElement(WebDriverBy::cssSelector('.article-title'))->getText();
    $news_content = $driver->findElement(WebDriverBy::cssSelector('.article-content'))->getText();
    // Save news titles and content to the database
}
- Use WebDriverBy::cssSelector to locate the specified elements and get their text content
- Store the news titles and content in a database
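The original code leaves the database step as a comment. One possible way to fill it in is sketched below, assuming SQLite through PDO (which requires the pdo_sqlite extension). The function names openNewsStore and saveNews, the table name news, and its columns are all hypothetical; any other database supported by PDO would work along the same lines.

```php
<?php
// Assumption: SQLite via PDO; the "news" table and its columns are hypothetical.
function openNewsStore(string $dsn = 'sqlite::memory:'): PDO
{
    $pdo = new PDO($dsn);
    // Throw exceptions on SQL errors instead of failing silently.
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $pdo->exec(
        'CREATE TABLE IF NOT EXISTS news (
            url TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            content TEXT NOT NULL
        )'
    );
    return $pdo;
}

function saveNews(PDO $pdo, string $url, string $title, string $content): void
{
    // INSERT OR REPLACE keeps one row per URL if the same page is crawled twice.
    $stmt = $pdo->prepare(
        'INSERT OR REPLACE INTO news (url, title, content) VALUES (:url, :title, :content)'
    );
    $stmt->execute([':url' => $url, ':title' => $title, ':content' => $content]);
}

// Example with placeholder data; an in-memory database is used for illustration.
$pdo = openNewsStore();
saveNews($pdo, 'https://example.com/a', 'First title', 'First body');
saveNews($pdo, 'https://example.com/a', 'Updated title', 'Updated body');
```

Inside the crawl loop, the comment could then be replaced by a call such as saveNews($pdo, $link, $news_title, $news_content); with a file-based DSN like sqlite:news.db if the data should persist.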
The above covers the basics of building efficient web crawlers with PHP and Selenium. For further optimization, you can combine other tools and techniques, such as multi-threading to improve efficiency, or font de-obfuscation to handle websites that obfuscate their text with custom fonts. The world of crawlers is full of surprises, and I hope you find the methods and tools that suit you best!