How to scrape hashtags from Instagram using nodeJS

April 10, 2018


Instagram is one of the most active social networking platforms today, and it is especially popular with millennials. It was originally intended for sharing photos with friends and family, but it is now used in many other ways. Many brands collaborate with popular pages to promote their products, and some people maintain an Instagram page for a living. In short, there are a lot of opportunities in this space. In this post, we will see how to scrape the top hashtags for a given keyword.

The Concept

The concept is pretty simple. Go to the homepage of Instagram and search for something. You will see a grid of results. These are the top results on Instagram, so the hashtags they use are probably well thought out. These are the hashtags that we will scrape using Node.js code.

To do this, we will start a project, install the dependencies, write the code, and test it. The dependencies of this project are the npm packages request and request-promise.

The code

Let us look at the code step by step.

  1. Let’s install and include the dependencies. In the terminal:

    npm install request request-promise --save

    In the js file:

    const rp = require('request-promise');
  2. Now, let’s divide the whole script into three parts: the main logic and two helper functions. The two functions are scrapeHashtags(), which extracts all the hashtags from the html code, and removeDuplicates(), which removes the duplicate hashtags that were scraped.
    // Logic here
    const scrapeHashtags = (inputText) => { }
    const removeDuplicates = (arr) => { }
  3. In the logic part, we need a variable URL to hold the url of the page we want to scrape. Copy the result page’s url and assign it to URL, then replace the keyword you typed earlier with ${keyWord}. The string must be wrapped in backticks for ${keyWord} to be substituted.
    let URL = `https://www.instagram.com/explore/tags/${keyWord}/`;
  4. Now, we make the actual request to get some real data. For this, we will use the rp function and pass in URL as the argument. This function returns a promise. A promise is an object that represents the eventual result of an asynchronous operation: the callback passed to .then() runs once the operation succeeds, and the callback passed to .catch() runs if it fails. In our case, the promise resolves with the html code of the result page. We will come back to this after writing the helper functions.
    rp(URL)
      .then((html) => {
        console.log(html);
      })
      .catch((err) => {
        console.log(err);
      });
  5. Let’s write the two helper functions one by one.
    • scrapeHashtags() – In this function, we will use a regex to find the hashtags in the html code. The regex pattern for a hashtag is /(?:^|\s)(?:#)([a-zA-Z\d]+)/gm: a # at the start of a line or after whitespace, followed by one or more letters or digits. We will push all the hashtags into a list and finally return the list.
      const scrapeHashtags = (html) => {
        const regex = /(?:^|\s)(?:#)([a-zA-Z\d]+)/gm;
        const matches = [];
        let match;
        // exec() returns the next match on each call, or null when there are none left
        while ((match = regex.exec(html))) {
          matches.push(match[1]); // group 1 is the tag without the leading #
        }
        return matches;
      }
    • removeDuplicates() – In this function, we will remove duplicate elements from a list. We will use a generic algorithm that should work for any list. Make an empty array and start pushing elements to it from the array given as the argument. If an element already exists in the new array, skip it. This way, the new array will only have unique elements.
      const removeDuplicates = (arr) => {
        let newArr = [];
        arr.forEach(ele => {
          // indexOf returns -1 when ele is not yet in newArr
          if (newArr.indexOf(ele) === -1) {
            newArr.push(ele);
          }
        });
        return newArr;
      }
  6. We are done with the helper functions. It’s time to use them. Back to the promise returned by the rp function. Call the scrapeHashtags() function on the html code and store the result in a variable hashtags. Now, call the removeDuplicates() function on hashtags and store the result back in hashtags. Use the map function to add the # sign to every hashtag. Finally, log hashtags to the console.
    rp(URL)
      .then((html) => {
        let hashtags = scrapeHashtags(html);
        hashtags = removeDuplicates(hashtags);
        hashtags = hashtags.map(ele => "#" + ele);
        console.log(hashtags);
      })
      .catch((err) => {
        console.log(err);
      });
  7. Last but not least, make a variable called keyWord, which is expected by URL. Declare it before URL and assign any word to it.
    let keyWord = "developers";
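To see how the two helpers from step 5 behave without touching the network, here is a minimal sketch that runs them on a made-up piece of text. The sample string is an assumption for illustration and is not real Instagram markup. The Set-based uniqueTags is a swapped-in alternative to the loop in removeDuplicates(), not the approach used in the steps above.

```javascript
// Same pattern as in step 5: a '#' at the start of a line or after whitespace,
// followed by one or more ASCII letters or digits.
const scrapeHashtags = (text) => {
  const regex = /(?:^|\s)(?:#)([a-zA-Z\d]+)/gm;
  const matches = [];
  let match;
  while ((match = regex.exec(text)) !== null) {
    matches.push(match[1]); // capture group 1 is the tag without '#'
  }
  return matches;
};

// Alternative to the indexOf loop: a Set keeps the first occurrence of each value.
const uniqueTags = (arr) => [...new Set(arr)];

// Hypothetical sample text, not real Instagram HTML.
const sample = 'Morning commit #code #developers\n#nodejs is fun #code url.com/#anchor';

const tags = uniqueTags(scrapeHashtags(sample));
console.log(tags.map((tag) => '#' + tag));
```

In this sample, #anchor is skipped because it follows a slash rather than whitespace. The character class only covers ASCII letters and digits, so tags containing underscores or accented letters would be cut short at the first character outside that class.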

Your final code should look something like this:

    const rp = require('request-promise');

    let keyWord = "developers";
    let URL = `https://www.instagram.com/explore/tags/${keyWord}/`;

    rp(URL)
      .then((html) => {
        let hashtags = scrapeHashtags(html);
        hashtags = removeDuplicates(hashtags);
        hashtags = hashtags.map(ele => "#" + ele);
        console.log(hashtags);
      })
      .catch((err) => {
        console.log(err);
      });

    // The helpers below are initialized before the callback above runs,
    // because the callback only fires after the response arrives.
    const scrapeHashtags = (html) => {
      const regex = /(?:^|\s)(?:#)([a-zA-Z\d]+)/gm;
      const matches = [];
      let match;
      while ((match = regex.exec(html))) {
        matches.push(match[1]);
      }
      return matches;
    }

    const removeDuplicates = (arr) => {
      let newArr = [];
      arr.forEach(ele => {
        if (newArr.indexOf(ele) === -1) {
          newArr.push(ele);
        }
      });
      return newArr;
    }
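Because the request step is asynchronous, it can help to see the same flow with the network call swapped out. The sketch below is only an illustration under assumptions: fetchPage is a hypothetical stand-in for rp that resolves with a canned string, extractTags and run are illustrative names, and both async/await and encodeURIComponent are additions that the script above does not use.

```javascript
// Hypothetical stand-in for rp(URL): returns a promise that resolves
// with a canned string instead of making a real HTTP request.
const fetchPage = (url) =>
  Promise.resolve(`page for ${url}: #developers #code #developers`);

// Extract, de-duplicate, and prefix tags, mirroring the steps in the post.
const extractTags = (html) => {
  const regex = /(?:^|\s)(?:#)([a-zA-Z\d]+)/gm;
  const seen = [];
  let match;
  while ((match = regex.exec(html)) !== null) {
    if (seen.indexOf(match[1]) === -1) {
      seen.push(match[1]);
    }
  }
  return seen.map((tag) => '#' + tag);
};

const keyWord = 'developers';
// encodeURIComponent escapes characters that are not safe inside a URL path segment.
const pageUrl = `https://www.instagram.com/explore/tags/${encodeURIComponent(keyWord)}/`;

// async/await is an alternative spelling of the .then/.catch chain.
const run = async (fetcher) => {
  try {
    const html = await fetcher(pageUrl);
    return extractTags(html);
  } catch (err) {
    console.log(err);
    return [];
  }
};

const result = run(fetchPage);
result.then((hashtags) => console.log(hashtags));
```

Passing the fetcher in as an argument means the real rp function could be handed to run in place of fetchPage, while the parsing logic stays checkable offline.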

Testing it

Assign any keyword you want to search for to keyWord. Now, run node filename.js. This should take a few seconds and then print a long list of hashtags. Congratulations, you have successfully written a script to scrape hashtags from Instagram.
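If editing the file for every search gets tedious, one option is to read the keyword from the command line instead. This is a sketch under assumptions: keywordFromArgs, tagUrl, and the "developers" fallback are hypothetical choices and are not part of the script above.

```javascript
// Pick the keyword from command-line arguments, e.g. `node filename.js travel`.
// process.argv[0] is the node binary and process.argv[1] is the script path,
// so user-supplied arguments start at index 2.
const keywordFromArgs = (argv, fallback = 'developers') => {
  const raw = (argv[2] || fallback).trim();
  // Drop a leading '#' and any whitespace so "#Travel Photos" becomes "TravelPhotos".
  return raw.replace(/^#/, '').replace(/\s+/g, '');
};

const keyWord = keywordFromArgs(process.argv);
const tagUrl = `https://www.instagram.com/explore/tags/${encodeURIComponent(keyWord)}/`;
console.log(tagUrl);
```

The resulting tagUrl could then be passed to rp in place of the hard-coded URL.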