PHP 浏览器自动化

MeteorCat2026-03-062026-03-06

对于一些要求比较高的数据采集流程, 会采用模拟正常浏览器处理方式来避免被风控, 这种技术就是 浏览器模拟/浏览器渲染型爬虫

核心是用真实浏览器内核执行 JS、渲染页面, 完全模拟真人访问行为来绕过风控

比较老牌的就是 Python 的 Selenium 处理, 但如果采用 PHP 做底层采集框架(接口服务)就感觉没必要引入额外的开发语言

而 PHP 之中第三方也是具有这方面组件库, 也就是这篇文章的主角 php-webdriver

利用 php-webdriver 可以实现模拟玩家浏览器行为来采集数据, 并且一步到位来暴露自身 API 服务接口功能

注: 这里采用 PHP 的 Composer 来做包管理功能

环境安装

首先需要知道模拟玩家浏览器行为是需要用到底层浏览器驱动的, 目前底层浏览器驱动有以下方案:

ChromeDriver: https://sites.google.com/chromium.org/driver
EdgeDriver: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver
GeckoDriver: https://github.com/mozilla/geckodriver/releases

这部分比较常用的是 ChromeDriver 驱动, 需要在 LINUX 系统用命令行安装:

sudo apt install -y chromium-chromedriver # 有的发行版是 chromium-driver

# 安装完成之后确认版本是否完成
chromedriver --version

# 查看安装路径, 一般都在 /usr/bin/chromedriver
# 这个路径比较重要后面需要记住
whereis chromedriver

之后在自己依赖的 composer.json 项目之中引入 php-webdriver 依赖:

1 2	# 声明项目依赖 php-webdriver composer require php-webdriver/webdriver

这样前置准备就处理好, 之后就是实现唤起浏览器的采集功能, 这部分需要参照官方文档来开发

功能实现

这部分可以先模拟人类点击 bing.com 网址并且跳转浏览, 首先这里先定义个 getChromeDriver 函数:

use Facebook\WebDriver\Chrome\ChromeDriver;
use Facebook\WebDriver\Chrome\ChromeDriverService;
use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

/**
 * 获取浏览器驱动
 * @return RemoteWebDriver
 */
function getChromeDriver(): RemoteWebDriver
{
    // 构建浏览器配置
    $desiredCapabilities = DesiredCapabilities::chrome();

    // chrome 配置
    $chromeOptions = new ChromeOptions();
    $chromeOptions->addArguments([
        '--headless=new', # 无界面模式
        '--no-sandbox', # 解决沙箱权限问题
        '--disable-gpu',# 禁用 GPU
        '--start-maximized', # 默认启动最大化
        '--disable-blink-features=AutomationControlled', # 隐藏 navigator.webdriver 避免风控
        
        # 模拟正常访问
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        '--lang=zh-CN',
        '--ignore-certificate-errors'
    ]);
    
    // 核心风控配置
    $chromeOptions->setExperimentalOption('excludeSwitches', ['enable-automation']);
    $chromeOptions->setExperimentalOption('useAutomationExtension', false);
    $desiredCapabilities->setCapability(ChromeOptions::CAPABILITY, $chromeOptions);

    // chrome 服务
    // 注意: /usr/bin/chromedriver 就是调用浏览器驱动程序在本地的地址
    // 而其中的端口号则是会采用远程方式做交互处理
    $chromeServicePort = 9515;
    $chromeService = new ChromeDriverService(
        "/usr/bin/chromedriver",
        $chromeServicePort,
        ["--port={$chromeServicePort}"]
    );

    // 构建驱动
    return ChromeDriver::start(
        $desiredCapabilities,
        $chromeService
    );
}

具体的配置项参照(需要工具访问): https://sites.google.com/chromium.org/driver/capabilities

这里罗列比较重要配置用于提供给 ChromeOptions 做参考

--headless=new: 2023之前的老版本采用 --headless 用于声明采用简化版 Chrome 内核处理
--no-sandbox: 关闭安全沙箱机制, 有的时候服务器环境中权限可能导致操作受限, 避免分配给 www-data 等系统用户导致响应错误
--disable-gpu: 服务器一般都不会专门配置显卡, 所以需要关闭显卡硬件渲染
窗口尺寸配置, 这个不是必须
- --screen-info={1024x768}: 界面运行模式的窗口设置, 这里设置为 1024x768 窗口化处理
- --window-size=1024,768: 无界面运行模式的窗口设置, 模拟成 1024x768 窗口避免被风控检测
- --start-maximized: 设置窗口最大化
--user-agent=Mozilla/5.0...: 设置访问的请求客户端代理信息
--lang=zh-CN: 设置默认本地语言
--ignore-certificate-errors: 忽略 https 证书异常问题
--disable-blink-features=AutomationControlled: 访问浏览器不设置 navigator.webdriver 来说明自己是模拟启动

这里其实还有个代理访问的问题, 一般有些数据是基于外网来读取, 要额外配置 Socks5 代理访问:

/**
 * 获取浏览器驱动
 * @return RemoteWebDriver
 */
function getChromeDriver(): RemoteWebDriver
{
    // 构建浏览器配置
    $desiredCapabilities = DesiredCapabilities::chrome();

    // chrome 配置
    $chromeOptions = new ChromeOptions();
    // 其他略
    
    // 代理设置
    $chromeOptions->addArguments([
        '--proxy-server=socks5://127.0.0.1:1080', # Socks5 代理地址
        '--proxy-bypass-list=localhost;127.0.0.1;192.168.*;10.*' # 跳过内网相关地址
    ]);

    // 其他略   
}

这里的主要设置就是 --proxy-server 和 --proxy-bypass-list, 之后就是模拟正常人浏览网页处理:

$driver = getChromeDriver(); // 获取浏览器驱动
try {

    // 请求地址
    $driver->get('https://cn.bing.com/');

    // 最大化窗口
    $driver->manage()->window()->maximize();

    // 等待页面加载, 最多等待 10s, 验证标准就是是否存在指定 id
    $driver->wait(10)->until(
        WebDriverExpectedCondition::presenceOfElementLocated(
            WebDriverBy::id('sb_form_q')
        )
    );

    // 模拟在输入框输入特定字符串
    $searchBox = $driver->findElement(WebDriverBy::id('sb_form_q'));
    $searchBox->sendKeys('PHP 模拟测试');
    sleep(3); # 暂停下 3s

    // 全选输入框文本
    $searchBox->sendKeys([WebDriverKeys::CONTROL, "a"]);

    // 按下删除键清空文本
    $searchBox->sendKeys(WebDriverKeys::BACKSPACE);

    // 重新输入
    $searchBox->sendKeys("Selenium 模拟输入和按键");

    // 确保搜索框处于激活状态并直接在搜索框触发回车提交
    $searchBox->click();
    $searchBox->sendKeys(WebDriverKeys::ENTER);


    // 等待页面加载, 最多等待 5s
    $driver->wait(5)->until(
        WebDriverExpectedCondition::presenceOfElementLocated(
            WebDriverBy::id('b_content')
        )
    );

    sleep(5); // 退出之前暂停 5s
} catch (\Exception $exception) {
    echo $exception->getMessage().PHP_EOL;
} finally {
    $driver->quit();
}

注意: 写死 sleep 定时停止太过于机械化, 建议采用 sleep(mt_rand(2,4)) 来随机停止执行

这样运行下测试看看是否能够自动化启动, 如果想要看到具体流程可以先把 --headless=new 注释掉看看行为是否正常.

数据采集

现在行为模拟已经完成, 之后就是机械化的提取页面信息, 依靠 PHP 生态来保存成 JSON 文件或者写入数据库

我这边抓取凤凰网卫视的财经新闻信息(https://www.ifeng.com/)并将新闻保存到本地 JSON 文件

这里因为我集成的是 codeigniter4 框架, 需要创建自己的 Spark 命令

创建 Spark 命令: https://codeigniter.org.cn/user_guide/cli/cli_commands.html

所以这部分拉取数据功能都是写入下 App\Commands 命名空间之下, 具体内容如下:

<?php

namespace App\Commands;

use CodeIgniter\CLI\BaseCommand;
use CodeIgniter\CLI\CLI;
use Facebook\WebDriver\Chrome\ChromeDriver;
use Facebook\WebDriver\Chrome\ChromeDriverService;
use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

/**
 * 加载凤凰网新闻数据为 JSON
 * 依赖 php-webdriver
 */
class LoadIFengNews extends BaseCommand
{

    protected $group = 'News';
    protected $name = 'news:iFeng';
    protected $description = 'Load iFeng News to json file.';


    /**
     * 获取浏览器驱动
     * @return RemoteWebDriver
     */
    function getChromeDriver(): RemoteWebDriver
    {
        // 构建浏览器配置
        $desiredCapabilities = DesiredCapabilities::chrome();

        // chrome 配置
        $chromeOptions = new ChromeOptions();
        $chromeOptions->addArguments([
            '--headless=new', # 无界面模式
            '--no-sandbox', # 解决沙箱权限问题
            '--disable-gpu',# 禁用 GPU
            '--start-maximized', # 默认启动最大化
            '--disable-blink-features=AutomationControlled', # 隐藏 navigator.webdriver 避免风控

            # 模拟正常访问
            '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
            '--lang=zh-CN',
            '--ignore-certificate-errors'
        ]);

        // 核心风控配置
        $chromeOptions->setExperimentalOption('excludeSwitches', ['enable-automation']);
        $chromeOptions->setExperimentalOption('useAutomationExtension', false);
        $desiredCapabilities->setCapability(ChromeOptions::CAPABILITY, $chromeOptions);

        // chrome 服务
        // 注意: /usr/bin/chromedriver 就是调用浏览器驱动程序在本地的地址
        // 而其中的端口号则是会采用远程方式做交互处理
        $chromeServicePort = 9515;
        $chromeService = new ChromeDriverService(
            "/usr/bin/chromedriver",
            $chromeServicePort,
            ["--port={$chromeServicePort}"]
        );

        // 构建驱动
        return ChromeDriver::start(
            $desiredCapabilities,
            $chromeService
        );
    }


    /**
     * 请求入口
     * @param array $params
     * @return void
     */
    public function run(array $params): void
    {
        // 获取浏览器驱动
        $driver = $this->getChromeDriver();

        // 核心执行命令
        try {

            // 设置请求地址
            $driver->get("https://news.ifeng.com");

            // --start-maximized 默认最大化窗口, 而如果需要模拟手动点击最大化窗口就是以下命令
            $driver->manage()->window()->maximize();

            // 等待页面加载内容, 可以锁定检索页面是否存在某个元素
            // 凤凰网新闻节点是 class="news-stream-basic-news-list", 所以要等待这个节点渲染完成
            // 这里页面只等待 10s 渲染完成
            $driver->wait(10)->until(
                WebDriverExpectedCondition::presenceOfElementLocated(
                    WebDriverBy::className('news-stream-basic-news-list')
                )
            );

            // 之后完成完成之后就是开始提取这些节点内容
            // news-stream-basic-news-list 节点下面新闻节点是以下多重子节点
            // 1. news-stream-newsStream-news-item-has-image: 内部多个子节点
            // 2. news-stream-newsStream-news-item-title: 内部标题节点
            // 3. a 标签: 内容和 title 属性为新闻标题, href 为新闻地址
            $elements = $driver->findElements(WebDriverBy::className('news-stream-basic-news-list'));
            CLI::write(CLI::color('news node total = ' . count($elements), 'green'));

            // 初始化新闻节点
            $rows = [];

            // 遍历内容
            foreach ($elements as $id => $element) {
                // 再次提取 news-stream-newsStream-news-item-has-image 列表
                $items = $element->findElements(WebDriverBy::className(
                    'news-stream-newsStream-news-item-has-image'
                ));
                CLI::write(CLI::color("news node-{$id} total = " . count($items), 'green'));

                // 遍历内部子新闻列表
                foreach ($items as $item) {
                    // 提取内部的新闻节点 news-stream-newsStream-news-item-title.a 标签
                    $active = $item->findElement(WebDriverBy::className('news-stream-newsStream-news-item-title'));
                    if (empty($active)) continue; // 跳过空节点
                    $news = $active->findElement(WebDriverBy::tagName('a'));
                    if (empty($news)) continue; // 跳过空节点

                    // 提取内部的 a 标签的 title 和 ref 属性
                    $title = $news->getAttribute('title');
                    $url = $news->getAttribute('href');
                    CLI::write(CLI::color("{$url} - {$title}", 'green'));

                    $rows[] = [
                        'url' => $url,
                        'title' => $title,
                    ];
                }
            }

            // 保存到本地 JSON
            if (!empty($rows)) file_put_contents(WRITEPATH . 'uploads/news.json', json_encode($rows, JSON_UNESCAPED_UNICODE));
        } catch (\Exception $e) {
            // 异常信息记录
            $this->logger->error($e->getMessage());
            CLI::write(CLI::color("{$e->getMessage()}", 'red'));
        } finally {
            // 退出的时候关闭浏览器
            if ($driver instanceof ChromeDriver) {
                $driver->quit();
            }
        }

    }
}

这里执行 php -f spark news:iFeng 就会开始模拟人类打开页面复制内容, 最后保存在 writable/uploads/news.json 之中

这就是依靠 PHP 实现无头浏览器采集数据的流程, 不过在目前主流的前后端分离的情况下都是直接拦截接口获取数据;
只有风控比较严格的情况才会需要用到模拟人类采集, 一般集中于购物网站的历史价格采集或者抢购物品的监控下单等.