Crawling Websites in React-Native

November 07, 2017 0 Comments


Coming from years of web development, React-Native feels like a fresh start to me. You get better access to native functionality AND fewer rules are imposed on your app. For example, fetch() isn't bound by the browser's same-origin policy, so you can use it to get any website you want. What this enables is client-side web crawling.

Why

Maybe you need data from a service, but they don't expose an API, or the API doesn't give you all the data you need, or the API is simply bad. Normally you would have to set up a server that crawls the target website and turns it into an API you can use. But when you can access any website's data directly inside your client, you can save that time.

Let's take the Amazon website as an example. You want to show all products of a results page and offer a way to load the next page, but you want the data in your own data structure, so you can build your own UI around it.

How

  1. Get the HTML from the server
  2. Extract the needed data from the HTML
  3. Reshape the data for our use

1 Get the HTML from the Server

That's the easy part.

async function loadGraphicCards(page = 1) {
  const searchUrl = `https://www.amazon.de/s/?page=${page}&keywords=graphic+card`;
  const response = await fetch(searchUrl); // fetch page
  const htmlString = await response.text(); // get response text
  ...
}

Fetching a URL with a search pattern returns an HTML page with some items.
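One thing worth knowing: fetch() only rejects on network failures, so an error page with a 404 or 503 status would still be handed to the parser. Here is a minimal sketch of a more defensive fetch step. The helper names `fetchHtml` and `buildSearchUrl` and the injectable `fetchImpl` parameter are my own assumptions for illustration, they are not part of React-Native or of the code above.

```javascript
// Hypothetical helper: fetch a page and fail loudly on non-2xx responses.
// `fetchImpl` defaults to the global fetch and can be swapped out in tests.
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    // response.ok is false for any status outside 200-299
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Hypothetical helper: build the search URL used in this post,
// escaping the keywords and joining words with "+".
function buildSearchUrl(page, keywords) {
  const query = encodeURIComponent(keywords).replace(/%20/g, "+");
  return `https://www.amazon.de/s/?page=${page}&keywords=${query}`;
}
```

With these, the first lines of loadGraphicCards could become `const htmlString = await fetchHtml(buildSearchUrl(page, "graphic card"));`.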

2 Extract the Needed Data from the HTML

This is a bit trickier. The data is inside the HTML, but it's a string.

The naive approach would be to use a regular expression to parse the string and get the data, but HTML isn't a regular language (elements can nest arbitrarily deep), so that breaks down quickly.
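To make the problem concrete, here is a small sketch of the regex approach failing on nested markup. The sample HTML and the function name `naiveListItems` are made up for this illustration.

```javascript
// Made-up sample: the first <li> contains a nested list.
const sampleHtml =
  "<ul><li>Card A<ul><li>8 GB</li></ul></li><li>Card B</li></ul>";

// Naive attempt: grab everything between <li> and the next </li>.
function naiveListItems(html) {
  const pattern = /<li>(.*?)<\/li>/g;
  return [...html.matchAll(pattern)].map(match => match[1]);
}

// The pattern stops at the first closing tag it sees, so the outer item
// is cut off in the middle of its nested list.
console.log(naiveListItems(sampleHtml));
// → [ 'Card A<ul><li>8 GB', 'Card B' ]
```

The regex has no notion of which </li> belongs to which <li>, and that is exactly the bookkeeping a real parser does for you.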

The better way is to use an HTML parser and CSS selectors.

Cheerio is this solution. It comes with an HTML parser and a re-implementation of jQuery's core functionality, so you can use it on Node.js.

The problem is that React-Native is missing most Node.js packages, so Cheerio doesn't work there out of the box.

I searched for quite some time to find a re-implementation of Cheerio that works on React-Native; the naming of the package was a bit strange, haha.

But with this, the extraction of the data is now child's play too.

async function loadGraphicCards(page = 1) {
  const searchUrl = `https://www.amazon.de/s/?page=${page}&keywords=graphic+card`;
  const response = await fetch(searchUrl); // fetch page
  const htmlString = await response.text(); // get response text
  const $ = cheerio.load(htmlString); // parse HTML string
  const liList = $("#s-results-list-atf > li"); // select result <li>s
  ...
}

3 Reshape the Data for further Use

After the data has been extracted from the HTML, we can start to reshape it for our use cases. The line between extraction and reshaping is a bit blurry here: the <li>s we selected are full of markup, and getting the right data out of them is extraction too. Often these two steps go hand in hand.

async function loadGraphicCards(page = 1) {
  const searchUrl = `https://www.amazon.de/s/?page=${page}&keywords=graphic+card`;
  const response = await fetch(searchUrl); // fetch page
  const htmlString = await response.text(); // get response text
  const $ = cheerio.load(htmlString); // parse HTML string
  return $("#s-results-list-atf > li") // select result <li>s
    .map((_, li) => ({ // map to a list of objects
      asin: $(li).data("asin"),
      title: $("h2", li).text(),
      price: $("span.a-color-price", li).text(),
      rating: $("span.a-icon-alt", li).text(),
      imageUrl: $("img.s-access-image", li).attr("src"), // look up the image inside this <li> only
    }))
    .get(); // turn the Cheerio collection into a plain array
}

This is not a robust example, but I think you get the idea. We can now use the new list of objects in our app to make our own UI for the Amazon results.
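The extracted fields are still raw strings, so a second reshaping pass can turn them into values the UI can sort or filter by. The following is only a sketch under assumptions: I'm assuming prices look like "EUR 1.249,99" and ratings like "4,5 von 5 Sternen", and the helper names `parsePrice`, `parseRating` and `reshapeItem` are made up for this example.

```javascript
// Parse a German-formatted price such as "EUR 1.249,99" into a number.
// Returns null when no price is found.
function parsePrice(text) {
  const match = text.match(/(\d{1,3}(?:\.\d{3})*|\d+),(\d{2})/);
  if (!match) return null;
  const euros = match[1].replace(/\./g, ""); // drop thousands separators
  return Number(`${euros}.${match[2]}`);
}

// Take the first number in a rating string such as "4,5 von 5 Sternen".
// Returns null when the string contains no number.
function parseRating(text) {
  const match = text.match(/\d+(?:[.,]\d+)?/);
  return match ? Number(match[0].replace(",", ".")) : null;
}

// Reshape one scraped item; asin and imageUrl are passed through untouched.
function reshapeItem(raw) {
  return {
    ...raw,
    title: raw.title.trim(),
    price: parsePrice(raw.price),
    rating: parseRating(raw.rating),
  };
}
```

Running the scraped list through `items.map(reshapeItem)` keeps the same keys, so the component below works with either the raw or the reshaped objects.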

import React from "react";
import { Image, ScrollView, Text, TouchableOpacity } from "react-native";

class App extends React.Component {
  state = {
    page: 0,
    items: [],
  };

  componentDidMount = () => this.loadNextPage();

  // setState updaters must be synchronous, so await the data first
  loadNextPage = async () => {
    const page = this.state.page + 1;
    const items = await loadGraphicCards(page);
    this.setState({ items, page });
  };

  render = () => (
    <ScrollView>
      {this.state.items.map(item => <Item {...item} key={item.asin} />)}
    </ScrollView>
  );
}

const Item = props => (
  <TouchableOpacity onPress={() => alert("ASIN:" + props.asin)}>
    <Text>{props.title}</Text>
    {/* remote images need explicit dimensions to render */}
    <Image source={{ uri: props.imageUrl }} style={{ width: 100, height: 100 }} />
    <Text>{props.price}</Text>
    <Text>{props.rating}</Text>
  </TouchableOpacity>
);
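The component above replaces `items` on every load. If you'd rather keep the earlier pages on screen, a small pure function can merge a new page into the existing state. This is a sketch, not part of the original app: `appendPage` is a made-up name, and I'm assuming the ASIN is unique per product, so it also doubles as a guard against duplicate React keys.

```javascript
// Merge a freshly loaded page into the previous state, skipping items
// whose ASIN is already present (or repeated within the new page).
function appendPage(state, page, newItems) {
  const seen = new Set(state.items.map(item => item.asin));
  const fresh = newItems.filter(item => {
    if (seen.has(item.asin)) return false;
    seen.add(item.asin);
    return true;
  });
  return { page, items: [...state.items, ...fresh] };
}
```

Inside loadNextPage this could be wired up as `this.setState(state => appendPage(state, page, items))`, which keeps the updater synchronous because the data has already been awaited.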

Conclusion

As with most problems, if you have the right tools, solutions can become simple. Often the problem is more about finding these tools :D

This client-side crawling approach can be used to build quick prototypes without the need for an API. Amazon is nice enough to deliver okay-ish static HTML, so it works rather well on their sites.

