Evolving web scraping: How Pageripper API handles JavaScript-heavy sites
Introduction
Pageripper is a commercial API that extracts data from webpages, even if they're rendered with Javascript.
Challenges in Scraping JavaScript-Heavy Sites
Scraping data and content from sites that rely on client-side Javascript for rendering requires a different approach from scraping sites that render their content completely on the server.
These two flow diagrams illustrate the difference: when you're scraping a server-side rendered site, the initial page you fetch tends to already include all rendered content.
Scraping a server-side rendered site
Here's an example of what we might see if we fetch a server-side rendered (SSR) page:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>SSR Example</title>
</head>
<body>
<div id="root">
<h1>Server-Side Rendered Page</h1>
<p>This is an example of a server-side rendered page.</p>
<ul>
<li><a href="/link1">Link 1</a></li>
<li><a href="/link2">Link 2</a></li>
<li><a href="/link3">Link 3</a></li>
</ul>
</div>
<!-- All content is already in the HTML -->
</body>
</html>
There's not much to do after we receive the requested page, since we can access whatever CSS selectors or elements we want - they're all already rendered.
Scraping a client-side rendered site
However, when we're dealing with a client-side rendered (CSR) site, we need to employ a browser or Javascript engine to run the code bundle responsible for rendering elements.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>CSR Example</title>
<script type="module" src="app.js"></script> <!-- Modern JavaScript module -->
</head>
<body>
<div id="root">
<h1>Client-Side Rendered Page</h1>
<div id="content">
<!-- Content will be loaded and rendered by JavaScript -->
</div>
</div>
</body>
</html>
That's because the HTML that we're sent is an intentionally minimal shell that includes the Single Page Application's Javascript bundle:
Pageripper’s Approach
Pageripper handles both the client-side and server-side rendered sites by employing Puppeteer under the hood. Puppeteer is project that simplifies running a headless browser in Node.js.
This means that Pageripper can browse the web the same way you do - and see the same results for any given site that you do when loading it in a browser - even those sites that render their content after pageload via Javascript.
In addition, Pageripper allows users to tweak the behavior of Puppeteer on a per-request basis. This means that you can specify which network event to wait on when scraping a given site.
For example, if you're using Pageripper to scrape a server-side rendered (SSR) site, you can likely get away with passing a waitUntilEvent
parameter of domcontentloaded
, meaning that Puppeteer will only
wait until the DOM content is loaded before scraping. This is noticeably faster most of the time:
curl -X POST \
-H 'Content-type: application/json' \
-d '{"options": {"includeUrls": true, "waitUntil":"domcontentloaded"}, "url": "https://nytimes.com"}' \
<pageripper-api-root>/extracts
However, when you're dealing with a Javascript-heavy or Client-side rendered site, you could pass waitUntilEvent
as networkidle0
to wait until the network is idle (there are no resources left to download) before rendering and scraping the content.
curl -X POST \
-H 'Content-type: application/json' \
-d '{"options": {"includeUrls": true, "waitUntil":"networkidle0"}, "url": "https://zackproser.com"}' \
<pageripper-api-root>/extracts
Future Roadmap
Today, Pageripper automatically categorizes links (internal vs external), ecommerce links, asset download links, etc. It also extracts email addresses and Twitter handles.
Over time, Pageripper will be expanded to extract even more data from a given page