Puppeteer is a Node.js library from the Google Chrome team that lets you programmatically control a headless Chrome instance. This essentially means you can write a script to open up a browser, navigate to pages and perform actions on those pages, just as a user would.
This opens up a lot of possibilities, but in this post I'll be explaining how it can be used to crawl HTML pages and create structured data.
Setting up
If you want to follow along you'll need Node.js and npm installed, and familiarity with JavaScript and the command line is essential to get the most out of this article.
The Puppeteer docs will help you with installation, and they are definitely worth a read if you want to know more about the library itself.
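For reference, adding Puppeteer to a fresh project looks like this; the install also downloads a version of Chromium that's guaranteed to work with the library:
npm init -y
npm install puppeteer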
I've used Visual Studio Code to write and run this; its integrated terminal makes it easy to run Node.js scripts.
Retrieving Data from Web Pages
Let's pretend that there's no backend data structure to my recipes section, and it's all hardcoded HTML. Being a front end developer in 2019, I want to build the recipe section out using a static site generator such as 11ty or Jekyll.
We'll be using Puppeteer's eval methods to grab data from the page and build a JSON object that we can then write to a file. JSON is a popular data format supported by just about every programming language and widely used as a data source for static site generators.
They come in two flavours, $eval and $$eval, which are effectively wrappers around document.querySelector and document.querySelectorAll. Both let us execute JavaScript in the browser context, as opposed to the Node.js context.
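As a quick illustration of the difference (the selector here is just for illustration, and these lines would live inside the async function we're about to write): $eval runs your function against the first matching element, while $$eval hands it every match.
// First match only - the function receives a single element
const firstTitle = await page.$eval('.recipe-item h2', el => el.textContent);

// All matches - the function receives an array of elements
const allTitles = await page.$$eval('.recipe-item h2', els => els.map(el => el.textContent));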
Let's demonstrate this with some code. I'll start by simply grabbing the h2 tag from each recipe item and logging it to the console.
If you want to follow along, copy the code and save as puppeteer.js.
const puppeteer = require('puppeteer');

const url = 'https://www.tgwilkins.co.uk/recipes.html';
const options = {
  headless: false
};
const selector = '.recipe-item';

(async function(){
  const browser = await puppeteer.launch(options);
  const page = await browser.newPage();
  await page.goto(url);

  const recipeNames = await page.$$eval(selector, nodes => {
    return nodes.map(node => {
      const h2 = node.querySelector('h2');
      return h2.textContent;
    });
  });

  console.log(recipeNames);
  await browser.close();
})();
Then in your terminal, navigate to the directory where the script is and run:
node puppeteer.js
So what are we doing here?
First we are importing the Puppeteer module using require, and setting up our configuration. We have the URL we'll be crawling, the query selector to use in the eval method, and our options for launching Puppeteer.
I've set the headless flag to false for this first step. This will open a full Chrome window when Puppeteer launches, so you can see everything it is doing. If you don't pass any options, headless defaults to true.
Puppeteer is being launched inside an asynchronous function. The reason for this is that the methods we will be using all return promises, and to make it as readable as possible we want to use the await keyword, which can only be used inside async functions.
We then open a new page and navigate to our URL. We're using page.$$eval to select all nodes in the document matching our query selector. This method takes two arguments - the query selector, and a function. Our function is where we will execute JavaScript in the browser context.
The function we pass to page.$$eval takes a single parameter, which is the array of elements matched by the query selector; I've named it nodes. We want to grab the h2 from each of these, so I'm using the map array method to loop over the nodes, select the h2 and return its textContent.
Notice how node.querySelector is used to do this. Each node is just an HTML element, so we can use the browser's querySelector method to find elements inside it. This works because we are inside page.$$eval, running in the browser context. This is where Puppeteer becomes a powerful tool.
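If it helps, you can think of the function passed to page.$$eval as code you could paste straight into the browser's DevTools console. Roughly the same thing in plain browser JavaScript would look like this (Puppeteer passes the matched elements in for you; here I'm querying them manually):
// The browser-side equivalent of what our $$eval call is doing
const nodes = Array.from(document.querySelectorAll('.recipe-item'));
const recipeNames = nodes.map(node => node.querySelector('h2').textContent);
console.log(recipeNames);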
Getting more data
So let's take this a step further. Grabbing the recipe name is a good start, but there's a lot of data in the page that we could scrape and put into a data structure.
Let's do this with a bit more code. Copy the below and run the script again:
const puppeteer = require('puppeteer');

const url = 'https://www.tgwilkins.co.uk/recipes.html';
const selector = '.recipe-item';

(async function(){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  const recipes = await page.$$eval(selector, nodes => {
    return nodes.map(node => {
      // Grab the headline pieces of each recipe card
      const title = node.querySelector('h2').textContent;
      const description = node.querySelector('p').textContent;
      const img = node.querySelector('img').getAttribute('src');

      // Turn the details table into a nested object of key-value pairs
      const detailTable = node.querySelector('.details');
      const detailRows = Array.from(detailTable.querySelectorAll('tr'));
      const details = detailRows.reduce((object, row) => {
        const columns = row.querySelectorAll('td');
        const { textContent: key } = columns[0];
        const { textContent: value } = columns[1];
        object[key] = value;
        return object;
      }, {});

      return {
        title,
        description,
        img,
        details
      };
    });
  });

  console.log(recipes);
  await browser.close();
})();
With a few additional query selectors, we've taken all the important information for each recipe and built up an array of JSON-ready objects, in just over 30 lines of code!
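For a single recipe, each object in the array comes out looking something like this (the values here are made up for illustration):
{
  "title": "Example recipe",
  "description": "A short description pulled from the paragraph tag.",
  "img": "/images/example-recipe.jpg",
  "details": {
    "Serves": "4",
    "Prep time": "20 minutes"
  }
}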
I've opted to collect the details from the HTML table into a nested object. reduce is one of my favourite array methods and is perfect for this: it loops over each row and uses the textContent of the two columns to create key-value pairs.
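Outside of the browser, the same pattern looks like this; a minimal sketch where each inner array stands in for the two td cells of a row:
// Made-up row data standing in for the table cells
const rows = [['Serves', '4'], ['Prep time', '20 minutes']];

// Build up an object of key-value pairs from the rows
const details = rows.reduce((object, [key, value]) => {
  object[key] = value;
  return object;
}, {});

console.log(details); // { Serves: '4', 'Prep time': '20 minutes' }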
All that's left is to write this to a JSON file. We'll do that using the native fs (file system) module in Node.js.
We'll replace the console.log with these two lines:
const fs = require('fs');
fs.writeFile('./recipes.json', JSON.stringify(recipes), err => err ? console.log(err) : null);
And there you have it! If you have followed along you should now have a JSON file in the same directory as your script. We've used fs.writeFile, an asynchronous method that takes the path of the file we want to create, the data to use as the file's contents, and a callback function. In the callback, we're just checking for any errors and logging them to the console.
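One optional tweak: JSON.stringify takes an indentation argument if you'd like the file to be human-readable, and from Node 10 onwards fs also exposes a promise-based API that fits neatly into the async function we're already inside. A sketch of both, replacing the callback version above:
const { writeFile } = require('fs').promises;

// Pretty-print with two-space indentation and await the write like our other calls
await writeFile('./recipes.json', JSON.stringify(recipes, null, 2));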