Puppeteer
Programmatic control with Puppeteer
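Puppeteer is a Node.js library from Google's Chrome team that drives Chrome or Chromium through the DevTools Protocol.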
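The headless option controls whether Headless mode is used; it is enabled by default. When Puppeteer is installed through npm, it also downloads a pinned version of Chromium automatically, so the API works out of the box. A different browser binary can be selected with the executablePath option: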
const browser = await puppeteer.launch({ headless: false }); // default is true
const browser = await puppeteer.launch({ executablePath: "/path/to/Chrome" });
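As a minimal sketch of how the two options can be combined, the helper below builds a launch-options object from environment variables. The function name `buildLaunchOptions` and the variable names `HEADLESS` and `CHROME_PATH` are assumptions made for illustration; they are not part of Puppeteer.

```javascript
// Hypothetical helper: derive puppeteer.launch() options from environment variables.
// HEADLESS and CHROME_PATH are illustrative names, not variables Puppeteer reads itself.
function buildLaunchOptions(env = process.env) {
  const options = {
    // Headless stays on unless HEADLESS is explicitly set to "false"
    headless: env.HEADLESS !== "false"
  };
  if (env.CHROME_PATH) {
    // Use a specific browser binary instead of the bundled download
    options.executablePath = env.CHROME_PATH;
  }
  return options;
}

// Usage (requires the puppeteer package):
// const browser = await require("puppeteer").launch(buildLaunchOptions());
```

Keeping the options in one plain object makes it easy to switch between a visible browser during development and a headless one on a server.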
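In large-scale deployments, Puppeteer has to attach to a remote Headless Chrome cluster that is deployed as a service. In that case, use the connect function to connect to an existing Headless Chrome instance: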
const browser = await puppeteer.connect({
  browserWSEndpoint:
    "ws://{remoteip}:9222/devtools/browser/fa60c034-422d-4f2c-bbeb-17a2cfd690f2"
});
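The browser id at the end of the endpoint changes each time the browser starts, so it is usually looked up rather than hard-coded. A Chrome instance started with remote debugging reports its endpoint as `webSocketDebuggerUrl` at `http://{remoteip}:9222/json/version`, but the host in that value may not be the address the client should use (for example behind port forwarding). The sketch below, with the hypothetical helper `toRemoteEndpoint`, rewrites the host of such a URL before it is passed to `connect`; fetching the URL is left out.

```javascript
// Hypothetical helper: point a DevTools WebSocket URL at a different host.
function toRemoteEndpoint(webSocketDebuggerUrl, remoteHost) {
  const url = new URL(webSocketDebuggerUrl);
  // Replace only the host; keep the port and /devtools/browser/<id> path as reported
  url.hostname = remoteHost;
  return url.href;
}

// Usage (requires the puppeteer package and a reachable Chrome instance):
// const browser = await puppeteer.connect({
//   browserWSEndpoint: toRemoteEndpoint(reportedUrl, "10.0.0.5")
// });
```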
import { launch } from "puppeteer";
(async () => {
  const browser = await launch({ headless: false });
  const page = await browser.newPage();
  // networkidle2: navigation is done once at most 2 connections remain open for 500 ms
  await page.goto("https://example.com", { waitUntil: "networkidle2" });
  await page.addScriptTag({
    url: "https://code.jquery.com/jquery-3.2.1.min.js"
  });
  await page.close();
  await browser.close();
})();
Dynamic rendering
Dynamic proxy
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch({
    // Launch chromium using a proxy server on port 9876.
    // More on proxying:
    // https://www.chromium.org/developers/design-documents/network-settings
    args: ["--proxy-server=127.0.0.1:9876"]
  });
  // Create the page first so the header can be set on it
  const page = await browser.newPage();
  // For a tunnel proxy, adding the authorization header is enough.
  // username and password are the proxy credentials, defined elsewhere.
  await page.setExtraHTTPHeaders({
    "Proxy-Authorization":
      "Basic " + Buffer.from(`${username}:${password}`).toString("base64")
  });
  await page.goto("https://google.com");
  await browser.close();
})();
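The header value used above follows the HTTP Basic scheme: the credentials are joined with a colon and then Base64-encoded. A small sketch of that step in isolation, where the helper name `proxyAuthHeaders` and the credentials in the usage line are placeholders chosen for illustration:

```javascript
// Hypothetical helper: build the extra headers for a proxy that uses Basic authentication.
function proxyAuthHeaders(username, password) {
  // Basic scheme: base64("username:password")
  const token = Buffer.from(`${username}:${password}`).toString("base64");
  return { "Proxy-Authorization": `Basic ${token}` };
}

// Usage, given a Puppeteer page:
// await page.setExtraHTTPHeaders(proxyAuthHeaders("user", "pass"));
```

Puppeteer also provides `page.authenticate({ username, password })`, which may be a better fit when the proxy answers with an authentication challenge.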
Page operations
Script execution
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  // Get the "viewport" of the page, as reported by the page.
  const dimensions = await page.evaluate(() => {
    return {
      width: document.documentElement.clientWidth,
      height: document.documentElement.clientHeight,
      deviceScaleFactor: window.devicePixelRatio
    };
  });
  console.log("Dimensions:", dimensions);
  await browser.close();
})();
To pass values into the page, supply them as additional arguments to evaluate after the function:
const links = await page.evaluate(evalVar => {
  console.log(evalVar); // should be defined now
  // ...
}, evalVar);
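The function passed to `evaluate` runs inside the page, not in Node.js, so it cannot see variables from the surrounding script; anything it needs has to arrive as an argument. The sketch below keeps the page function self-contained for that reason. The name `hrefsMatching` and the selector in the usage line are assumptions for illustration.

```javascript
// Hypothetical page function: runs in the browser, where `document` exists.
// It only uses its own parameter, never variables from the Node.js scope.
function hrefsMatching(selector) {
  return Array.from(document.querySelectorAll(selector), el => el.href);
}

// Usage, given a Puppeteer page:
// const links = await page.evaluate(hrefsMatching, "a[href]");
```

Returning plain strings keeps the result serializable, which matters because the return value has to travel back from the page to the Node.js process.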
Puppeteer can also inject an external script into the page and then run it:
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://google.com");
  await page.addScriptTag({
    url: "https://rawgithub.com/marmelab/gremlins.js/master/gremlins.min.js"
  });
  await page.evaluate(() => {
    window.gremlins.createHorde().unleash();
  });
})();
Saving pages
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // networkidle2: navigation is done once at most 2 connections remain open for 500 ms
  await page.goto("https://news.ycombinator.com", { waitUntil: "networkidle2" });
  await page.pdf({ path: "hn.pdf", format: "A4" });
  await browser.close();
})();
Intercepting page requests
const puppeteer = require("puppeteer");
puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on("request", interceptedRequest => {
    // Abort image requests and let everything else through
    if (
      interceptedRequest.url().endsWith(".png") ||
      interceptedRequest.url().endsWith(".jpg")
    ) {
      interceptedRequest.abort();
    } else {
      interceptedRequest.continue();
    }
  });
  await page.goto("https://example.com");
  await browser.close();
});
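Checking the whole URL with `endsWith` misses images whose address carries a query string, such as `photo.png?v=2`. A possible refinement, sketched here with the hypothetical helper `isBlockedImage`, parses the URL and tests only its path:

```javascript
// Hypothetical helper: decide whether a request URL points to an image that should be blocked.
const BLOCKED_EXTENSIONS = [".png", ".jpg"];

function isBlockedImage(requestUrl) {
  let pathname;
  try {
    // Ignore query string and fragment; compare case-insensitively
    pathname = new URL(requestUrl).pathname.toLowerCase();
  } catch (err) {
    return false; // not a parseable absolute URL: let it through
  }
  return BLOCKED_EXTENSIONS.some(ext => pathname.endsWith(ext));
}

// Usage inside the request handler:
// page.on("request", req => (isBlockedImage(req.url()) ? req.abort() : req.continue()));
```

Keeping the decision in a pure function lets the blocking rule be tested without launching a browser.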