We wanted to kick off this Puppeteer Tutorial by breaking a general assumption that Puppeteer is primarily a testing tool, because in reality it is primarily an automation tool. That doesn't take away from the fact that Puppeteer is incredibly popular for use cases such as scraping, generating PDFs, and much more, all of which we will be exploring in this blog. Loading a full browser requires a lot of resources, as it has to render UI elements like the toolbar, buttons, and so on. When everything is controlled through code, those UI elements are not needed at all. Fortunately, there are better solutions, like making use of headless browsers.
You can find many blog articles and YouTube videos that explain the Puppeteer setup. In this Puppeteer Tutorial, however, we will walk through the setup process and also explore how easy it is to perform web scraping (web automation) in a somewhat non-traditional way that uses a headless browser. This method has often helped us provide the best Automation Testing services to our clients, so let's find out how you can benefit from it too.
An Introduction to the Puppeteer Tutorial
Browsers are usually executed without a graphical user interface when they are being used for automated testing, and Puppeteer is what makes this possible. The question here is: how do we do it? The answer is a headless browser, as it is a great tool for performing automated testing in server environments where there is no need for a visible UI shell.
Puppeteer is made by the team behind Google Chrome, so we can trust it to be well maintained. It lets us perform common actions on the Chromium browser programmatically through JavaScript, via a simple and easy-to-use API. Nowadays, JavaScript rules the web, and pretty much everything you interact with on websites uses it. An added advantage is that Puppeteer can be used to safely automate even potentially malicious pages, as it operates off-process with respect to Chromium. Before we proceed further, let's cover the Puppeteer installation process just in case you are unfamiliar with it.
Node Installation:
One simply cannot install Puppeteer without having Node. Installing Node also gives you npm (the Node Package Manager), which you will use to install packages; on macOS, for example, you can install Node with the 'brew install node' command. Once Node and npm are installed, you can verify the installation using the commands below.
node -v
npm -v
Packages Installation:
Now that Node and npm have been installed, create a project folder, navigate into it, and run the initialization command given below.
npm init -y
This will create a package.json file in the directory. This package.json includes the Puppeteer dependency and test scripts, such as a runner. If you want to run a program, you should add the name of the file you want to run to the scripts section of package.json, as shown below.
"Dependencies": {"puppeteer": "^9.0.0"} "Scripts": {"test": "node filename.js"}
Puppeteer Installation:
Now, to install Puppeteer, you would have to execute the command below from the terminal. Note that the working directory should be the one that contains the package.json file.
npm install --save puppeteer
The above command installs both Puppeteer and a version of Chromium that the Puppeteer team knows will work with their API, making the process very simple.
All you need here is the require keyword, as it makes the Puppeteer library available in the file. The asynchronous function will be executed once it is created.
const puppeteer = require('puppeteer');
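For context, here is a minimal sketch of how that usually looks in practice: the require call at the top, followed by an async IIFE (immediately invoked function expression) so that await can be used.

const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser, open a new tab, visit a page, and shut down
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.codoid.com/');
  await browser.close();
})();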
Puppeteer-core:
The puppeteer-core package is a version of Puppeteer that not everyone might need, as it doesn't download any browser by default. If you are looking to use a pre-existing browser or connect to a remote one, this option will come in handy. Since puppeteer-core doesn't download Chromium when installed, we have to pass an executablePath option that contains the path to the Chrome or Chromium browser.
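A minimal sketch of launching through puppeteer-core might look like the following; the executablePath shown is just an example for Linux, so point it at your own Chrome or Chromium binary.

const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.launch({
    // Example path only; replace with the actual browser binary on your machine
    executablePath: '/usr/bin/google-chrome',
  });
  const page = await browser.newPage();
  await page.goto('https://www.codoid.com/');
  await browser.close();
})();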
Environment variables:
If you would like to specify a version of Chromium for Puppeteer to use, or skip downloading the Chromium browser altogether, you will need to set two environment variables:
PUPPETEER_SKIP_CHROMIUM_DOWNLOAD – You can skip the Chromium download by setting this to true.
PUPPETEER_EXECUTABLE_PATH – To customize the browser as per your need, set this to the path of the Chrome browser on your system or CI image.
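For example, in a Linux or macOS shell, the two variables could be set like this (the browser path is illustrative):

PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true npm install puppeteer
PUPPETEER_EXECUTABLE_PATH=/usr/bin/google-chrome node filename.js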
Now that we have prepped everything, let’s go ahead and find out how we can launch the headless browser and use all its functionalities.
Puppeteer Tutorial for each functionality
Browser launch:
Finally, you will be able to open the browser using Puppeteer's launch() method, as shown below.
const browser = await puppeteer.launch({});
By default, the browser that is launched will be in headless mode.
Headless mode
The above line can be modified to include an object as a parameter, and instead of launching in headless mode, you can launch a full version of the browser by setting headless: false, as shown below.
const browser = await puppeteer.launch({ headless: false });
Browser size
Once the browser has been launched, if you want it to open in full screen or in a maximized window, you can make use of the launch arguments below.
args: ["--start-fullscreen"]
args: ["--start-maximized"]
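For instance, the maximized flag could sit inside the launch call like this; note that defaultViewport: null is our own addition here, an assumption that lets the page follow the window size instead of Puppeteer's fixed default viewport.

const browser = await puppeteer.launch({
  headless: false,
  defaultViewport: null, // assumption: let the page track the maximized window size
  args: ["--start-maximized"],
});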
The reason we are including this in our Puppeteer tutorial is that Puppeteer sets the initial page size to its default of 800×600px. This value can be changed before taking a screenshot by setting the viewport, as shown in the code.
await page.setViewport({ width: 1920, height: 1080 });
Slow it down
The slowMo option is a pretty useful feature in specific situations, as it slows down Puppeteer operations by the specified number of milliseconds. As per our need, we used the code given below to slow down Puppeteer operations by 250 milliseconds.
const browser = await puppeteer.launch({ headless: false, slowMo: 250 });
Chrome DevTools
When the browser is running, you may need to open DevTools in Chrome to debug the application's browser code inside evaluate(). We managed to get it working by creating a new page instance and navigating to the DevTools URL, after which we were able to query the DOM and interact with the panels. Launching with the devtools option, as shown below, opens DevTools automatically.
const browser = await puppeteer.launch({ devtools: true });
URL launch
Now that a page, or in other words a tab, is available, any website can be loaded by simply calling the goto() function. This is the basic step in this Puppeteer tutorial, as actions like scraping elements can be done only after a website has been launched.
Here is the code that we used to launch our own website using the goto() function.
const page = await browser.newPage();
await page.goto('https://www.codoid.com/');
const title = await page.title();
await page.reload();
await page.goBack();
await page.goForward();
If needed, we can also run automation test scripts in incognito mode in Puppeteer.
const context = await browser.createIncognitoBrowserContext();
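Pages then have to be opened from that context rather than from the browser itself; a minimal sketch:

const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage(); // this page shares no cookies or cache with other contexts
await page.goto('https://www.codoid.com/');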
Scraping an element
Now that we have seen how to launch a given website, let's find out how we can scrape various elements from that page. Once we start the execution, the browser is launched in headless mode, sends a GET request to the web page, and receives the HTML content that we require, as explained in the steps below.
1. Sending the HTTP request
2. Parsing the HTTP response and extracting desired data
3. Saving the data in some persistent storage, e.g. file, database, and similar
Using the below code, we have retrieved the main header info from our Home Page.
await page.goto("https://codoid.com/"); title = await page.evaluate(() => { return document.querySelector("#main-header").textContent.trim();}); console.log(title);
Scraping multiple elements
You will definitely need to scrape more than one element from a webpage at some point, and you can get it done as follows: use querySelectorAll to get all the elements matching the selector, then create an array from the result, since the heading elements come back as a NodeList.
await page.goto("https://en.wikipedia.org/wiki/Web_scraping"); headings = await page.evaluate(() => { headings_elements = document.querySelectorAll("h2 .mw-headline"); headings_array = Array.from(headings_elements); return headings_array.map(heading => heading.textContent); }); console.log(headings);
Debugger
We can easily set a debugger in the automation process and inspect the current page's DOM in Chrome DevTools by using the code below. Note that the debugger statement only pauses execution when DevTools is open, so the browser should be launched with headless: false and devtools: true.
await page.evaluate(() => { debugger; });
Screenshot
Another useful feature is the ability to take screenshots while the browser is running. These screenshots are taken through the Puppeteer Node library, which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. After running the code below, you will see a png file named 'codoid.png' inside your working folder.
await page.screenshot({ path: 'codoid.png'})
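If you want the entire scrollable page rather than just the visible viewport, the screenshot API also accepts a fullPage flag; the file name below is illustrative.

await page.screenshot({ path: 'codoid-full.png', fullPage: true });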
Getting PDF
We can easily convert HTML text to a PDF page; in our case, it was a report of results for patients, with heavy data visualization containing a lot of SVG. Furthermore, we can make special adjustments to manipulate the layout and rearrange the HTML elements.
If you need to generate documents as PDFs with a defined styling, you can use the command below. In it, we have set the format to A4.
const pdf = await page.pdf({ format: 'A4' });
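The snippet above returns the PDF as a buffer; if you also pass a path option, Puppeteer writes the file straight to disk (the file name below is illustrative). Keep in mind that page.pdf() works only in headless mode.

const pdf = await page.pdf({ path: 'report.pdf', format: 'A4' });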
Switch to New tab
Many people encounter difficulties when their work demands several tabs. So we thought the code to switch between tabs in Puppeteer would come in handy and added it to this Puppeteer tutorial.
await page.bringToFront();
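Note that bringToFront() only switches focus to an existing tab; to actually open a link in a new tab, you would first create the page, roughly like this sketch:

const newPage = await browser.newPage(); // opens a fresh tab
await newPage.goto('https://www.codoid.com/'); // URL is illustrative
await newPage.bringToFront(); // make the new tab the active one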
Type
An input field is something that pretty much every website has, and we can define what input has to be entered by using Puppeteer's page.type() method, which takes a CSS selector to spot the element you want to type in and a string you wish to type into the field.
const elements3 = await page.$x("//input[@id='contactname']");
await elements3[0].type("codoid");
Click
We can also click on any element or button in Puppeteer, but the only challenging aspect here is finding the element. Once you have found the element, you can just fire the click() function, as shown below.
const elements = await page.$x("//a[.='Resources']");
await elements[0].click();
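Since finding the element is the hard part, it can help to wait for it explicitly before clicking; page.waitForXPath() is available in the Puppeteer 9.x line used in this tutorial.

await page.waitForXPath("//a[.='Resources']"); // wait until the link exists in the DOM
const elements = await page.$x("//a[.='Resources']");
await elements[0].click();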
Checkbox
A checkbox is another element that we can handle by passing two inputs, as shown in the code. Here, the first input is the selector for the option we want to select, and the second input is the click count.
const ele2 = await page.$x("//input[@id='tried-test-cafe']");
await ele2[0].click({ clickCount: 1 });
Dropdown
Puppeteer has a select(selector, value) function to pick a value from a dropdown, and it takes two arguments as input. The first one is the selector, and the second argument is the value, which is similar to what we saw in the case of the checkbox.
const ele3 = await page.$x("//select[@id='preferred-interface']");
await ele3[0].select("Both");
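The same selection can also be written with the page-level select(selector, value) API described above, using a CSS selector; this is a sketch assuming 'Both' is the option's value attribute.

await page.select('#preferred-interface', 'Both');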
Element value
This method is used to get an element's value using the $eval() function. In the code shown below, we obtain the text of the heading element. The function takes two parameters as arguments: the first parameter is the selector, and the second is a callback such as element => element.textContent.
const Message = await page.$eval('p', ele => ele.textContent);
console.log('Heading text:', Message);
Element count
It's pretty simple to get the count of elements on a particular webpage. The $$eval() function can be employed to get the count of elements matching the same selector, as shown below.
const count = await page.$$eval('p', ele => ele.length);
console.log("Count of p tags in the page: " + count);
Headless Chrome Crawler
Once we start the execution, Google Chrome runs in headless mode, which is awesome for web crawling. Since Chrome executes JavaScript, it yields more URLs to crawl than simple requests to HTML files do, although those plain requests are generally faster. Anybody looking for ways to help their webpage rank better would know the importance of crawling, as it helps pages get indexed. The code required to execute crawling is given below.
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    evaluatePage: (() => ({
      title: $('title').text(),
    })),
    onSuccess: (result => {
      console.log(result);
    }),
  });
  await crawler.queue('https://codoid.com/');
  await crawler.onIdle(); // Resolved when no queue is left
  await crawler.close(); // Close the crawler
})();
Conclusion
As a leading software testing company that provides the best automation testing services, our favorite "feature" of this approach is that we get to improve the loading performance and the indexing ability of a webpage without significant code changes! We hope you enjoyed reading this Puppeteer Tutorial blog, and if you did, subscribe to our blog so you never miss any upcoming posts.