Web Scraping: The Simplest Tools are Sometimes Best

October 23, 2017

We explore using JavaScript (Node.js) as a simple web scraping tool.

A recent project that I am working on depends on gathering publicly available data from a partner's website; unfortunately, the partner is not at this point prepared to provide that data in another format (an API, a downloadable spreadsheet, etc.).

Our team sat down and brainstormed the tools we would use to automate the data collection (web scraping). Two solutions were proposed:

  • Mozenda: "Our licensed software is intuitive and allows anyone to extract data from the web, without the need of a programmer or developer."
  • ParseHub: "ParseHub is a free web scraping tool. With our advanced web scraper, extracting data is as easy as clicking the data you need."

Both of these solutions are designed to be used by folks who are not developers; they are simple to use and require little prerequisite knowledge.

I, on the other hand, am a web developer with a preference for using the fewest tools feasible. I start with the premise that I will not use any tools beyond a programming language (say, JavaScript on Node.js) and add tools only as needed. The question I had was: how hard a problem is web scraping, and would I be better off using one of the aforementioned solutions?

It is important to note that the website we were scraping did not require any authentication and all the data was available in the HTML itself; so the problem simply amounted to repeatedly parsing HTML pages.

Retrieving the HTML

While Node.js supports making HTTP requests out of the box, the syntax is sufficiently clumsy that many folks use a third-party library for this. In my case, I like the promise-based node-fetch library.

const fetch = require('node-fetch');

const URL = '[URL OMITTED]';

fetch(URL)
  .then(res => res.text())
  .then((body) => {
    console.log(body);
  });

That was easy…

Parsing the HTML

Back in the day, parsing text files was an exercise in writing regular expressions, which can get fairly complicated fast. Luckily, there is a commonly used tool for parsing HTML text that is easy to use, especially if you already know jQuery.

jQuery is a popular, albeit a bit dated, tool for front-end development. At its core, it provides the developer an easy way to traverse the document object model (DOM) of a rendered HTML page. The key to its simplicity is that the selection process mirrors CSS selectors.
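
As a quick reminder of that pattern (the class name here is purely illustrative), jQuery in the browser selects elements from the live DOM using plain CSS selector syntax:

// jQuery in the browser: grab every element with the class "search-result"
// (illustrative class name) and log its text content.
$('.search-result').each(function (index, element) {
  console.log($(element).text());
});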

The tool we are going to use is Cheerio. Because it borrows heavily from jQuery's syntax, it lets us traverse HTML text in much the same way jQuery traverses the DOM. For example, on the page that I am parsing, I know that I am looking for elements with the class search-result.

const fetch = require('node-fetch');
const cheerio = require('cheerio');

const URL = '[URL OMITTED]';
let $;

fetch(URL)
  .then(res => res.text())
  .then((body) => {
    $ = cheerio.load(body);
    $('.search-result').each((index, element) => {
      console.log(index);
    });
  });

In this case, there are 31 such elements. For my project, however, I really need to extract the URL of an a tag contained in a child element (with the class product-title) of each search-result element.

...
.then((body) => {
  $ = cheerio.load(body);
  $('.search-result').each((index, element) => {
    const resultUrl = $(element).find('.product-title a').attr('href');
    console.log(resultUrl);
  });
});

note: The found element appears to be a DOM-like element and needs to be wrapped using $ to obtain a Cheerio object.
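
As a small illustration of that point (using the same selectors as above), the raw node only exposes parsed properties, e.g., element.attribs; wrapping it with $ restores methods like find and attr:

$('.search-result').each((index, element) => {
  // element is a raw parsed node, not a Cheerio object...
  const wrapped = $(element);
  // ...wrapping it gives back find(), attr(), text(), etc.
  console.log(wrapped.find('.product-title a').attr('href'));
});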

This was pretty easy too…

Walking the Website

The data that we actually want is on the pages at the extracted URLs, so we just repeat the pattern. But we do not want to process those pages until all of them are downloaded; for this we use Promise.all. In this case, we are interested in extracting the text contained in elements with the class page-title.

.then((body) => {
  const fetches = [];
  $ = cheerio.load(body);
  $('.search-result').each((index, element) => {
    const resultUrl = $(element).find('.product-title a').attr('href');
    fetches.push(
      fetch(resultUrl)
        .then(res => res.text())
    );
  });
  Promise.all(fetches)
    .then((results) => {
      for (let i = 0; i < results.length; i += 1) {
        $ = cheerio.load(results[i]);
        const series = $('.page-title').text();
        console.log(series);
      }
    });
});

Still Room for Regular Expressions

On the result pages, we are interested in extracting the URLs of certain download links (those that end in ies). In this case, a regular expression is the answer.

const fetch = require('node-fetch');
const cheerio = require('cheerio');

const URL = '[URL OMITTED]';
const MATCHIES = /ies$/i;
let $;
...
.then((results) => {
  for (let i = 0; i < results.length; i += 1) {
    $ = cheerio.load(results[i]);
    const series = $('.page-title').text();
    $('#product-documents .download-wrapper').each((index, element) => {
      const download$ = $(element);
      const downloadUrl = download$.attr('nodepath');
      if (!MATCHIES.test(downloadUrl)) return;
      console.log(`${series}, ${downloadUrl}`);
    });
  }
});

Conclusion

In the end, I was able to get the data that I needed in about 30 lines of code. And in particular, since I am using JavaScript, I can easily apply logic to the web scraper as needed, e.g., only getting URLs to files that end in ies.
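
For reference, here is a minimal sketch pulling the pieces above together; the URL remains omitted, and the selectors (and the nodepath attribute) are specific to the partner's site:

const fetch = require('node-fetch');
const cheerio = require('cheerio');

const URL = '[URL OMITTED]';
const MATCHIES = /ies$/i;
let $;

fetch(URL)
  .then(res => res.text())
  .then((body) => {
    // Collect the detail-page URLs from the search results.
    const fetches = [];
    $ = cheerio.load(body);
    $('.search-result').each((index, element) => {
      const resultUrl = $(element).find('.product-title a').attr('href');
      fetches.push(fetch(resultUrl).then(res => res.text()));
    });
    return Promise.all(fetches);
  })
  .then((results) => {
    // Extract the title and matching download links from each page.
    for (let i = 0; i < results.length; i += 1) {
      $ = cheerio.load(results[i]);
      const series = $('.page-title').text();
      $('#product-documents .download-wrapper').each((index, element) => {
        const downloadUrl = $(element).attr('nodepath');
        if (!MATCHIES.test(downloadUrl)) return;
        console.log(`${series}, ${downloadUrl}`);
      });
    }
  });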

Assuming that you know JavaScript, it does not get any easier than this.

