Web Scraping with Google Sheets: The Definitive Guide

June 04, 2019 0 Comments

Web Scraping with Google Sheets: The Definitive Guide

 

 

In this tutorial, we learn how to scrape data from web pages and display the results in Google Sheets. We start with the standard built-in Sheets functions and move on to more advanced methods using custom JavaScript functions.

What is web scraping, you say? Web scraping is a technique of automatically extracting unstructured data from web pages using a computer rather than manually copying and pasting data. It is necessary when the data to extract is not available in a well-defined format such as JSON or XML.

Article contents

IMPORTHTML

We start with the built-in IMPORTHTML Google Sheets function. This powerful function performs most of the heavy lifting for you—provided the web page data of interest is contained in an HTML table or list.

Here's the syntax straight from the official documentation:

IMPORTHTML(url, query, index)

  • url - The URL of the page to examine, including protocol (e.g. http://). The value for url must either be enclosed in quotation marks or be a reference to a cell containing the appropriate text.
  • query - Either "list" or "table" depending on what type of structure contains the desired data.
  • index - The index, starting at 1, which identifies which table or list as defined in the HTML source should be returned. The indices for lists and tables are maintained separately, so there may be both a list and a table with index 1 if both types of elements exist on the HTML page.

HTML Tables

Let's scrape the Ubuntu version history page on Wikipedia to fetch the various releases of Ubuntu Linux over the years.

  A B
1 URL https://en.wikipedia.org/wiki/Ubuntuversionhistory
2    
3 =IMPORTHTML(B1,"table", 1)  

In cell B1, we specify the Wikipedia URL and ask for the first HTML table that is found on the page. Here are the results (with several rows removed to save space and keep it simple):

  A B C D E F
1 URL https://en.wikipedia.org/wiki/Ubuntuversionhistory        
2            
3 Version Code name Release date Public support until   Initial kernel version
4       Desktops Servers  
5 4.1 Warty Warthog 10/20/2004 Old version, no longer supported: 2006-04-30   2.6.8
6 18.1 Cosmic Cuttlefish 2018-10-18[345] Older version, yet still supported: 2019-07   4.18[346]
7 19.04 Disco Dingo 2019-04-18[317] Current stable version: 2020-01   5
8 19.1 Eoan EANIMAL 2019-10-18[347] Future release: TBA   TBA
9 Legend: Old version Older version, still supported Latest version Latest preview version

Future release

         

Wow - that's a lot of data we extracted without a whole lot of effort!

HTML Lists

Let's next extract an HTML list of data from the same Ubuntu Wikipedia page to get the names of the Ubuntu releases.

First, we'll view the HTML source in the browser to understand the structure of the list we are seeking to extract. I removed several rows and formatted the HTML so it will be easier to read for our purposes:

<ul> <li class="toclevel-1 tocsection-1"> <a href="#Namingconvention" ><span class="tocnumber">1</span> <span class="toctext">Naming convention</span></a > </li> <li class="toclevel-1 tocsection-2"> <a href="#Releasehistory" ><span class="tocnumber">2</span> <span class="toctext">Release history</span></a > <ul> <li class="toclevel-2 tocsection-3"> <a href="#Ubuntu4.10(WartyWarthog)" ><span class="tocnumber">2.1</span> <span class="toctext" ><span>Ubuntu 4.10 (Warty Warthog)</span></span ></a > </li> <li class="toclevel-2 tocsection-31"> <a href="#Ubuntu18.10(CosmicCuttlefish)" ><span class="tocnumber">2.29</span> <span class="toctext">Ubuntu 18.10 (Cosmic Cuttlefish)</span></a > </li> <li class="toclevel-2 tocsection-32"> <a href="#Ubuntu19.04(DiscoDingo)" ><span class="tocnumber">2.30</span> <span class="toctext">Ubuntu 19.04 (Disco Dingo)</span></a > </li> </ul> </li> <li class="toclevel-1 tocsection-33"> <a href="#Tableofversions" ><span class="tocnumber">3</span> <span class="toctext">Table of versions</span></a > </li> <li class="toclevel-1 tocsection-34"> <a href="#Versiontimeline" ><span class="tocnumber">4</span> <span class="toctext">Version timeline</span></a > </li> <li class="toclevel-1 tocsection-38"> <a href="#Externallinks" ><span class="tocnumber">8</span> <span class="toctext">External links</span></a > </li> </ul> 

We are interested in extracting the Ubuntu release names contained in the embedded list beginning in the 4th line. Through trial and error, I learned that the IMPORTHTML function identifies this embedded list as list #2 in the HTML page where list #1 encompasses list #2; therefore, our IMPORTHTML command in cell B2 will use a value of 2 for the index parameter and will look like this:

  A
1 https://en.wikipedia.org/wiki/Ubuntuversionhistory
2
3 =IMPORTHTML(A1,"list",2)

After pressing the Enter key, Google Sheets fetches the Wikipedia URL, grabs the second HTML list in the results, and displays it for us. Here are the abridged results:

  A
1 https://en.wikipedia.org/wiki/Ubuntuversionhistory
2  
3 2.1 Ubuntu 4.10 (Warty Warthog)
4 2.2 Ubuntu 5.04 (Hoary Hedgehog)
5 2.27 Ubuntu 17.10 (Artful Aardvark)
6 2.28 Ubuntu 18.04 LTS (Bionic Beaver)
7 2.29 Ubuntu 18.10 (Cosmic Cuttlefish)
8 2.30 Ubuntu 19.04 (Disco Dingo)

How awesome is that? Furthermore, we can pinpoint the most recent Ubuntu release name with the help of a couple built-in Sheets functions:

  A B
1 https://en.wikipedia.org/wiki/Ubuntuversionhistory =IMPORTHTML(A1,"list",2)
2    
3 =INDEX(B1:B, COUNTA(B1:B),1)  

I rearranged the cells to ensure the scraped list of Ubuntu releases appears on its own in column B. We then use the INDEX function in conjunction with the COUNTA function to return the last non-blank cell in the list of Ubuntu releases contained in column B. The rendered results look this this:

  A B
1 https://en.wikipedia.org/wiki/Ubuntuversionhistory 2.1 Ubuntu 4.10 (Warty Warthog)
2   2.2 Ubuntu 5.04 (Hoary Hedgehog)
3   2.27 Ubuntu 17.10 (Artful Aardvark)
4   2.28 Ubuntu 18.04 LTS (Bionic Beaver)
5   2.29 Ubuntu 18.10 (Cosmic Cuttlefish)
6   2.30 Ubuntu 19.04 (Disco Dingo)

Fantastic! Disco Dingo, the latest Ubuntu release at the time of this writing, appears in cell A3. We could also hide column B if we want to omit the intermediate results and only view the final results.

IMPORTXML

Let's next work with the IMPORTXML function which is also built into Google Sheets. Given that XML is structured data, we are not technically scraping web data; nonetheless, this is an important tool in our toolkit for fetching data from external web-based sources.

Here is the syntax from the official documentation:

IMPORTXML(url, xpathquery)

  • url - The URL of the page to examine, including protocol (e.g. http://). The value for url must either be enclosed in quotation marks or be a reference to a cell containing the appropriate text.
  • xpathquery - The XPath query to run on the structured data. For more information on XPath, see http://www.w3schools.com/xml/xpathintro.asp.

Let's fetch the current weather conditions in San Diego from the following URL: https://w1.weather.gov/xml/currentobs/KSAN.xml

After invoking the URL and choosing "view page source" from our web browser, we see the following XML data:

<?xml version="1.0" encoding="ISO-8859-1"?> <?xml-stylesheet href="latestob.xsl" type="text/xsl"?> <currentobservation version="1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/currentobservation.xsd"> <credit>NOAA's National Weather Service</credit> <creditURL>http://weather.gov/</creditURL> <image> <url>http://weather.gov/images/xmllogo.gif</url> <title>NOAA's National Weather Service</title> <link>http://weather.gov</link> </image> <suggestedpickup>15 minutes after the hour</suggestedpickup> <suggestedpickupperiod>60</suggestedpickupperiod> <location>San Diego, San Diego International-Lindbergh Field, CA</location> <stationid>KSAN</stationid> <latitude>32.73361</latitude> <longitude>-117.18306</longitude> <observationtime>Last Updated on Apr 26 2019, 11:51 am PDT</observationtime> <observationtimerfc822>Fri, 26 Apr 2019 11:51:00 -0700</observationtimerfc822> <weather>A Few Clouds</weather> <temperaturestring>67.0 F (19.4 C)</temperaturestring> <tempf>67.0</tempf> <tempc>19.4</tempc> <relativehumidity>63</relativehumidity> <windstring>from the West at 11.5 gusting to 18.4 MPH (10 gusting to 16 KT)</windstring> <winddir>West</winddir> <winddegrees>280</winddegrees> <windmph>11.5</windmph> <windgustmph>18.4</windgustmph> <windkt>10</windkt> <windgustkt>16</windgustkt> <pressurestring>1016.4 mb</pressurestring> <pressuremb>1016.4</pressuremb> <pressurein>30.02</pressurein> <dewpointstring>54.0 F (12.2 C)</dewpointstring> <dewpointf>54.0</dewpointf> <dewpointc>12.2</dewpointc> <visibilitymi>10.00</visibilitymi> <iconurlbase>http://forecast.weather.gov/images/wtf/small/</iconurlbase> <twodayhistoryurl>http://www.weather.gov/data/obhistory/KSAN.html</twodayhistoryurl> <iconurlname>few.png</iconurlname> <oburl>http://www.weather.gov/data/METAR/KSAN.1.txt</oburl> <disclaimerurl>http://weather.gov/disclaimer.html</disclaimerurl> <copyrighturl>http://weather.gov/disclaimer.html</copyrighturl> <privacypolicyurl>http://weather.gov/notice.html</privacypolicyurl> </currentobservation> 

In this context, we need to view the page source to display the raw XML since the weather URL includes an XML style sheet to render the data.

Get One Field

To retrieve the current temperature (the "<tempf>" element), we use an XPath query of "//tempf". XPath (XML Path Language) is a query language for navigating through XML documents and selecting elements and attributes. As described on this XPath syntax page, the "//" matches the element path that follows no matter where it is located. In the XML returned in this example, the "<tempf>" element is a child element of <currentobservation> and thus we must use "//tempf since "tempf" is not a root XML element. We could also retrieve the "<tempf>" element using an XPath query of "currentobservation/tempf", but we opt for the simpler, less verbose syntax. XPath is powerful, but it takes time to understand if you have not worked with it before.

  A B
1 URL https://w1.weather.gov/xml/currentobs/KSAN.xml
2 xpathquery //tempf
3    
4 Temperature (°F) =IMPORTXML(B1,B2)

The current temperature is successfully retrieved:

  A B
1 URL https://w1.weather.gov/xml/currentobs/KSAN.xml
2 xpathquery //tempf
3    
4 Temperature (°F) 67

Get Multiple Fields

To fetch multiple fields, IMPORTXML supports the xpath separator "|" which allows us to include multiple XPath queries so we can retrieve as many fields as desired. Let's fetch three fields from the incoming XML this time around:

  A B
1 URL https://w1.weather.gov/xml/currentobs/KSAN.xml
2 xpathquery //observationtime | //tempf | //windmph
3    
4 Observation time =IMPORTXML(B1,B2)
5 Temperature (°F)  
6 Wind speed (mph)  

Google Sheets invokes the XML and extracts the data for observation time, temperature, and wind speed. Hooray - it works!

  A B
1 URL https://w1.weather.gov/xml/currentobs/KSAN.xml
2 xpathquery //observationtime | //tempf | //windmph
3    
4 Observation time Last Updated on Apr 26 2019, 11:51 am PDT
5 Temperature (°F) 67
6 Wind speed (mph) 11.5

IMPORTDATA Function

For the sake of completeness, I'll mention that Google Sheets also provides the IMPORTDATA function for working with .csv (comma-separated value) or .tsv (tab-separated value) formatted data.

Here is the syntax from the documentation:

IMPORTDATA(url)

url - The url from which to fetch the .csv or .tsv-formatted data, including protocol (e.g. http://). The value for url must either be enclosed in quotation marks or be a reference to a cell containing the appropriate text.

Create Custom Functions for Maximum Flexibility

We can create our own custom functions in Google Sheets to scrape web pages and get precisely the data of interest for a given web page.

Example 1: Get current gasoline prices

Let's scrape the excellent Gas Buddy page for current gas prices in various status in the United States.

First, go to Tools > Script editor using the Google Sheets menu.

Next, enter the following JavaScript code in the code editor that appears:

function GASPRICE(state) { state = state.trim(); const url = "https://www.gasbuddy.com/USA"; var response = UrlFetchApp.fetch(url); const s = response.getContentText(); const re = new RegExp( '<a id="' + state + '"[\s\S]+?<div class="col-sm-2 col-xs-3 text-right">[\s]+([\d.]+)', "i" ); var match = re.exec(s); var price = "not found"; if (match != null) { price = match[1]; } return price; } 

In this function, we:

  • Utilize the built-in UrlFetchApp JavaScript function to invoke the URL and fetch the raw content text.
  • Leverage regular expressions (regex) to extract the exact text of interest from the content returned.
    • Regular expressions are a very useful tool for web scraping because they allow us to precisely define a search pattern and include the text of interest to capture in parentheses. For more information see this tutorial. I also recommend https://regexr.com/ as an excellent tool to learn, build, and test regular expressions.
    • Our regular expression search pattern includes the value of the "state" parameter supplied to the JavaScript function. The "state" parameter will be one of 50 possible values representing the states in the USA. For example, "CA" = "California". This regex search pattern ensures we retrieve the gas price for just one state since all 50 states are included on the Gas Buddy page.
  • Return the gas price retrieved from the regular expression which is contained between the left and right parentheses in the regex pattern.

The function we created is called "GASPRICE" so let's jump back into our spreadsheet and see about those current gas prices in California. Create a spreadsheet that looks like this:

  A B
1 State Current Avg ($)
2 CA =GASPRICE(A2)

After entering the formula in cell B2 and pressing Enter, our custom function invokes the Gas Buddy URL for California (CA), and scrapes through the text using our regular expression, and retrieves the current gas price in California. How cool is that?

  A B
1 State Current Avg ($)
2 CA 4.051

As a second more sophisticated example, let's retrieve the birthdates of famous people (and death dates if they are no longer alive) with the help of some Wikipedia scraping. Again, we'll create a custom JavaScript function in Google Sheets to accomplish the goal.

Click on Tools > Script editor to launch the JavaScript code editor if it is not already open in another browser tab. Add the following code below the code you added from the first example:

function BIRTHDATE(name) { const url = "https://en.wikipedia.org/wiki/" + name; var response = UrlFetchApp.fetch(url); const s = response.getContentText(); var match = /<span class="bday">(.+?)<\/span>/.exec(s); var bday = "not found"; if (match != null) { bday = ParseDate(match[1]); } var matchd = /Died<\/th><td>.+?<span style="display:none">((.+?))<\/span> (aged&#160;/.exec( s ); var dday = bday === "not found" ? "not found" : "still alive"; if (matchd != null) { dday = ParseDate(matchd[1]); } return [[bday, dday]]; } function ParseDate(s) { // Assume invalid date until proven otherwise var result = "invalid date"; var parts = s.split("-"); if (parts.length === 3) { var dt = new Date(parts[0], parts[1] - 1, parts[2]); if (dt instanceof Date && isFinite(dt)) { result = dt; } } return result; } 

In the JavaScript code above, we follow a similar approach to fetch the text and use regular expressions to construct search patterns and extract the data of interest. This time around, we search for two patterns and return a row of data containing two columns with the results. We also define and use a "ParseDate" function to ensure the date we receive back is valid so we can calculate people's lifespan as well. Here is the final result showing the BIRTHDATE function in action for multiple people:

  A B C D
1 Name Birth Death Life Span (years)
2 Isaac Newton =BIRTHDATE(A2)   =DATEDIF(B2,if(C2="still alive",now(), C2),"Y")
3 Donald Knuth =BIRTHDATE(A3)   =DATEDIF(B3,if(C3="still alive",now(), C3),"Y")
4 Grace Hopper =BIRTHDATE(A4)   =DATEDIF(B4,if(C4="still alive",now(), C4),"Y")
5 Hank Aaron =BIRTHDATE(A5)   =DATEDIF(B5,if(C5="still alive",now(), C5),"Y")

The formulas above yield the following results:

  A B C D
1 Name Birth Death Life Span (years)
2 Isaac Newton 1/4/1643 3/30/1727 84
3 Donald Knuth 1/10/1938 still alive 81
4 Grace Hopper 12/9/1906 1/1/1992 85
5 Hank Aaron 2/5/1934 still alive 85

These custom web scraping functions are very powerful!

Please be aware that these functions can be brittle since you are at the mercy of the web developer who created the page you are scraping. If the developer updates the structure of the page, you may need to update your regular expression accordingly and provide an updated search pattern string.

GETJSON Function

As a final note, please see my article on Fetching, Filtering, and Sorting JSON APIs in Google Sheets: The Missing Functions if you are interested in bringing JSON web API data into Google Sheets. JSON has become the lingua franca for exchanging data on the web, and there is much JSON data available!

Conclusion

My goal was to teach you how to scrape data from web pages and display the results in Google Sheets. We started with the standard built-in Sheets functions and moved on to more advanced methods using custom JavaScript functions. I hope I have equipped you! Go out and have some fun scraping the web!

Follow @thisDaveJ (Dave Johnson) on Twitter to stay up to date with the latest tutorials and tech articles.

Additional articles

Fetching, Filtering, and Sorting JSON APIs in Google Sheets: The Missing Functions How to Count Unique Items in JavaScript Arrays
Add Emoji to your Node Projects with node-emoji ✨
Learn Handlebars with Node.js and Help Freddy's Fish Too

Last updated Jun 04 2019

Share

Tag cloud