Puppeteer

基于 Puppeteer 的编程控制

Puppeteer 是 Google

通过 headless 参数来指定是否启用 Headless 模式,默认情况下是启用的。此外,在我们使用 npm 安装 Puppeteer 的时候其会自动下载指定版本的 Chromium 从而保证接口的开箱即用性,也可以通过 executablePath 参数指定启动版本:

const browser = await puppeteer.launch({ headless: false }); // default is true

const browser = await puppeteer.launch({ executablePath: "/path/to/Chrome" });

在大规模部署的情况下,我们需要控制 Puppeteer 连接到远端的服务化方式部署的 Headless Chrome 集群,此时就可以使用 connect 函数连接到 Headless Chrome 实例:

puppeteer.connect({
  browserWSEndpoint:
    "ws://{remoteip}:9222/devtools/browser/fa60c034-422d-4f2c-bbeb-17a2cfd690f2"
});
import { launch } from "puppeteer";
(async () => {
  const browser = await launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://example.com", { waitUntil: "networkidle" });
  await page.addScriptTag({
    url: "https://code.jquery.com/jquery-3.2.1.min.js"
  });
  await page.close();
  await browser.close();
})();

动态渲染

动态代理

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({
    // Launch chromium using a proxy server on port 9876.
    // More on proxying:
    //    https://www.chromium.org/developers/design-documents/network-settings
    args: ["--proxy-server=127.0.0.1:9876"]
  });

  //加隧道代理 加headers头即可
  await page.setExtraHTTPHeaders({
    "Proxy-Authorization":
      "Basic " + Buffer.from(`${username}:${password}`).toString("base64")
  });

  const page = await browser.newPage();
  await page.goto("https://google.com");
  await browser.close();
})();

页面操作

脚本执行

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();

  const page = await browser.newPage();

  await page.goto("https://example.com"); // Get the "viewport" of the page, as reported by the page.

  const dimensions = await page.evaluate(() => {
    return {
      width: document.documentElement.clientWidth,

      height: document.documentElement.clientHeight,

      deviceScaleFactor: window.devicePixelRatio
    };
  });

  console.log("Dimensions:", dimensions);

  await browser.close();
})();

如果需要传递参数,则在 evaluate 的后续参数传入需要传入的参数:

const links = await page.evaluate(evalVar => {
  console.log(evalVar); // should be defined now
  // ...
}, evalVar);

在 Puppeteer 中我们还可以添加外部的脚本执行操作:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://google.com");
  await page.addScriptTag({
    url: "https://rawgithub.com/marmelab/gremlins.js/master/gremlins.min.js"
  });
  await page.evaluate(() => {
    window.gremlins.createHorde().unleash();
  });
})();

页面保存

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();

  const page = await browser.newPage();

  await page.goto("https://news.ycombinator.com", { waitUntil: "networkidle" });

  await page.pdf({ path: "hn.pdf", format: "A4" });

  await browser.close();
})();

监听网页请求

const puppeteer = require("puppeteer");

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on("request", interceptedRequest => {
    if (
      interceptedRequest.url().endsWith(".png") ||
      interceptedRequest.url().endsWith(".jpg")
    )
      interceptedRequest.abort();
    else interceptedRequest.continue();
  });
  await page.goto("https://example.com");
  await browser.close();
});

端到端测试

上一页