Web Scraping In JavaScript With NanoPipe

April 29, 2018

With the ever-increasing amount of content on the Web, even if you are focused on just a few select sites, web scraping can require a lot of processing power. As a result, it needs to take advantage of parallel or asynchronous tasks. And, by its nature, web scraping generally uses asynchronous processing just to get its inputs using fetch. Put on top of this the complexities of parsing for the data you want, and things can get gnarly fast as you descend into callback caves, Promise purgatory, or hard-to-debug generator outages. The tiny (less than 450 bytes gzipped) NanoPipe library can help.

NanoPipe lets you create asynchronous chainable functions for use in processing pipelines in three easy steps while minimizing the direct use of Promises, callbacks, or async/await.

  1. Define your functions and declare them to NanoPipe.
  2. Create a pipeline using your functions.
  3. Feed input to your pipeline.

Below we walk you through creating a simple scraping pipeline. When you are done, you will be able to use it like this:

const scraper = NanoPipe()
  .getUrl()
  .toDOM()
  .logTitle()
  .splitAnalysis();
scraper.pipe(["https://www.cnn.com","https://www.msn.com"]);
scraper.pipe(["https://www.foxnews.com"]);

For our web scraper we will need NanoPipe, so run:

npm install nano-pipe

We will also need functions to get the content at URLs, convert the content into DOMs, provide us with processing feedback, actually analyze the content, and save the results of the analysis.

There are two great NodeJS libraries that will make our life simpler:

  1. node-fetch, which gives us a browser-style fetch for retrieving the content at a URL.
  2. jsdom, which turns HTML text into a DOM we can query.

Both of these are installed as dev dependencies for NanoPipe because the example provided here is also in the examples directory for NanoPipe.
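If you are working outside of the NanoPipe repository, you can install them yourself (as regular or dev dependencies, whichever suits your project):

npm install node-fetch jsdom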

For the example we will also mock up two data stores. You could use Redis, Mongo, KVVS, or some other store.

To get started, create a file scrape.js with the following contents:

const NanoPipe = require("../index.js"), // or require("nano-pipe") if installed from npm
  fetch = require("node-fetch"),
  JSDOM = require("jsdom").JSDOM;
const db1 = {
    put(key,value) {
      console.log(`Saving ${value} under ${key} in db1`);
    }
  },
  db2 = {
    put(key,value) {
      console.log(`Saving ${value} under ${key} in db2`);
    }
  };

Add a function to get the text at a URL:

async function getUrl() {
  const response = await fetch(this);
  return response.text();
}

Note that getUrl does not take an argument. It gets its value from the pipeline in which it is used, and that value is always bound to this.
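To make the this binding concrete, here is a hypothetical toUpper step (not part of the scraper, just a sketch using the pipeable declaration covered below) that uppercases whatever value is flowing through the pipe:

function toUpper() {
  return this.toUpperCase(); // `this` is the value produced by the previous stage
}
NanoPipe.pipeable(toUpper);
NanoPipe().toUpper().pipe(["hello"]); // the string "hello" arrives as `this`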

Add a function to convert text into a DOM:

function toDOM() {
  return new JSDOM(this);
}

Add a function we can use to see what is happening. Note that we return this to keep passing data down the pipeline:

function logTitle() {
  console.log(this.window.document.title);
  return this;
}

Under the assumption that we will be processing a lot of URLs with a lot of associated data, and that we want to put info about URL content heads in one data store and bodies in another, create a function that splits the processing into multiple pipes and saves the analysis. Note how you can pass arguments to the save function defined later.

function splitAnalysis() {
  NanoPipe()
    .analyzeHead()
    .save(db1)
    .pipe([this.window.document.head]);
  NanoPipe()
    .analyzeBody()
    .save(db2)
    .pipe([this.window.document.body]);
}

Define the functions for analyzing the head and the body. For our example we just simulate an asynchronous process and capture the length in a resolved Promise. NanoPipe will automatically handle functions that return Promises.

function analyzeHead() {
  // this could invoke processing on another server or thread
  // return <results of analysis>
  // `this` is the head element piped in by splitAnalysis
  return Promise.resolve({title: this.ownerDocument.title,
    length: this.innerHTML.length});
}
function analyzeBody() {
  // this could invoke processing on another server or thread
  // return <results of analysis>
  // `this` is the body element piped in by splitAnalysis
  return Promise.resolve({title: this.ownerDocument.title,
    length: this.innerHTML.length});
}

Finally, create a function to save the results:

function save(db) {
  // `db` comes from the pipe definition, e.g. save(db1); `this` is the analysis result
  db.put(this.title,this.length);
}

We can now declare all the functions to NanoPipe:

NanoPipe
  .pipeable(getUrl)
  .pipeable(toDOM)
  .pipeable(logTitle)
  .pipeable(splitAnalysis)
  .pipeable(analyzeHead)
  .pipeable(analyzeBody)
  .pipeable(save);

To define a pipe, just call NanoPipe and chain together the calls you want!

const scraper = NanoPipe()
  .getUrl()
  .toDOM()
  .logTitle()
  .splitAnalysis();

Now you can use the pipe multiple times. Add the following lines at the end of your file:

scraper.pipe(["https://www.cnn.com","https://www.msn.com"]);
scraper.pipe(["https://www.foxnews.com"]);

Save the file and, using NodeJS v9 or v10, run the command:

node --harmony scrape.js

Note that the results may not be logged in the order the URLs were specified above, which is evidence that asynchronous processing is happening.

If you are wondering how NanoPipe works, watch for a follow-up article on the wonders of async generators.
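In the meantime, here is a minimal sketch of the general idea, assuming a NanoPipe-style this binding (this is illustrative only, not NanoPipe's actual source):

async function* step(iterable, fn) {
  // apply one pipeline stage to each value, binding the value to `this`
  for await (const value of iterable) {
    yield fn.call(value);
  }
}

(async () => {
  const lengths = step(["https://example.com"], function() { return this.length; });
  for await (const length of lengths) {
    console.log(length); // 19
  }
})();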

It is not quite like riding a half-pipe, but I hope you have fun with NanoPipe.

If you liked this article, don’t forget to give it a clap!

