Crawling a website (that is, programmatically pulling data from a URL) has countless applications. Sometimes you’ll need to aggregate data to make decisions. Other times, you’ll need to monitor information constantly. You also might be looking for cheap flights.
The number of software libraries that can help you scrape is vast; to narrow things down, it helps to consider the three types of websites / applications you might need to crawl.
1. Crawling a [Public] API
Ideally, the web application you’re interested in exposes a public API; in this case, pulling data down is pretty trivial. Using the stack of your choice and its corresponding HTTP library, build requests as specified by the API. You’ll typically dump the data into some sort of database to be cleaned up later, or you may process it on the fly.
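As a sketch of that request-then-store shape, suppose a hypothetical API returns paginated JSON. The endpoint, parameters, and field names below are assumptions, not a real service, and the fetch uses only Python’s standard library:

```python
import json
import sqlite3
import urllib.parse
import urllib.request

def build_url(base, params):
    """Attach query parameters to a base API URL."""
    return base + "?" + urllib.parse.urlencode(params)

def fetch_json(url):
    """Pull one page of results down and parse it as JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def store_records(conn, records):
    """Dump the raw records into SQLite to be cleaned up later."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flights (origin TEXT, dest TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO flights VALUES (?, ?, ?)",
        [(r["origin"], r["dest"], r["price"]) for r in records],
    )
    conn.commit()

# Hypothetical usage -- the endpoint and response shape are assumptions:
#   url = build_url("https://api.example.com/v1/flights", {"origin": "MDW", "page": 1})
#   store_records(sqlite3.connect("flights.db"), fetch_json(url)["results"])
```

Splitting the fetch from the storage step also makes it easy to swap in a real HTTP library (like requests) or a different database later.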
2. Crawling HTML
Unfortunately, the majority of websites / applications you use will not expose an explicit API. Often, pages come pre-rendered into HTML from the server. In this format, the information is easy for a human to read, but much less convenient for a machine.
Take the Baseball Reference website. No AJAX requests appear to hit a JSON API on page load; instead, you get tables upon tables of HTML-formatted data.
Though these instances are less ideal, it’s still possible to write a useful data scraper from this HTML.
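For instance, turning an HTML table into rows of data takes only a little parsing code. The sketch below uses Python’s standard-library html.parser (a real project would likely reach for one of the libraries mentioned next), and the sample stats table is made up:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every cell in an HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows of cell text
        self._row = None      # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Made-up sample markup standing in for a scraped stats page:
parser = TableParser()
parser.feed("<table><tr><th>Player</th><th>HR</th></tr>"
            "<tr><td>Aaron</td><td>755</td></tr></table>")
# parser.rows is now [['Player', 'HR'], ['Aaron', '755']]
```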
A good approach for these types of sites is to use a library like Scrapy, Beautiful Soup, or Cheerio in conjunction with whatever language / library you want to use to make the requests (just like in step 1).
For example, here’s a crawler we wrote to extract emails from HTML web pages. You’ll notice requests are made first, and then the returned HTML is parsed for the relevant data (i.e., email addresses).
This repository hasn’t been recently maintained, but feel free to contribute / fork.
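The core of that request-then-parse pattern fits in a few lines. This sketch strips tags with Python’s standard-library html.parser and runs a simplified email regex over the remaining text; it is an illustration, not the repository’s actual code:

```python
import re
from html.parser import HTMLParser

# Simplified email pattern -- good enough for a sketch, not RFC-complete.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

class TextExtractor(HTMLParser):
    """Strip tags, keeping only the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_emails(html):
    """Return the unique email addresses found in an HTML document."""
    extractor = TextExtractor()
    extractor.feed(html)
    return sorted(set(EMAIL_RE.findall(" ".join(extractor.chunks))))

# Made-up page standing in for a fetched response body:
page = ("<html><body>Contact <a href='#'>us</a> at "
        "info@example.com or sales@example.com.</body></html>")
# extract_emails(page) → ['info@example.com', 'sales@example.com']
```

In a full crawler, extract_emails would run on each response body, and any links found the same way would feed the queue of pages to fetch next.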
3. Scraping over the browser
In some cases, the logic of a web application is so baked into the browser that it isn’t feasible to collect data without effectively pretending to be a human user. Enter a suite of tools that let you run your scraper through a real browser. This means that a browser engine, like Chromium, obeys your scripted commands just as it would a user’s clicks and keystrokes.
When you might need to go over the browser:
1. When the application you’re scraping requires authentication (and there’s no auth API)
2. When the application uses CSRF tokens to ensure that requests come from its own pages
3. When you want to observe or debug the whole scraping process visually
Selenium is a “web browser automation tool” that developers often use to test websites. It provides a lot of utility in web scraping; you may use its API to navigate around pages, wait for items to load, and submit forms (among other things). A big advantage of Selenium is that its library is available in many different languages.
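A minimal Selenium sketch in Python might look like the following. The URL, form field names, and CSS selectors are placeholders rather than a real airline’s markup, and running it requires `pip install selenium` plus a matching browser driver (the import lives inside the function so the pure price-parsing helper works without Selenium installed):

```python
def parse_price(text):
    """Pure helper: turn a displayed fare like '$1,299' into a float."""
    return float(text.strip().lstrip("$").replace(",", ""))

def scrape_fares(origin, dest):
    """Drive a real browser through a search form and read back the fares.

    The URL and CSS selectors below are placeholders -- inspect the actual
    page you're targeting to find the right ones.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.example-airline.com/search")  # placeholder URL
        driver.find_element(By.NAME, "origin").send_keys(origin)
        driver.find_element(By.NAME, "destination").send_keys(dest)
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
        # Wait for the results to render before reading them.
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".fare-price"))
        )
        return [parse_price(el.text)
                for el in driver.find_elements(By.CSS_SELECTOR, ".fare-price")]
    finally:
        driver.quit()
```

The explicit wait is the important part: over the browser, the data often isn’t in the page until some JavaScript has finished running.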
Puppeteer, on the other hand, is a Node.js library written by the Google Chrome team. It exposes a wonderful API that lets you do complex things like:
- Take screenshots of web pages
- Wait for AJAX network requests to finish, indicating the page has fully loaded
So without further ado, here is a simple project we developed that grabs some flight data from Southwest Airlines, using Puppeteer. It includes a Dockerfile, as we are considering extending this guide to include batch scraping LOTS of flight data in many different environments (comment if interested).
Instructions on how to run can be found in the README. Right now, it finds one-way flights from Chicago to Austin and logs them. Feel free to edit the location parameters, or make it more customizable!