
Web Scraping With Node.js

Web scraping is the process of programmatically retrieving information from the Internet. As the volume of data on the web has increased, this practice has become increasingly widespread, and a number of powerful services have emerged to simplify it. Unfortunately, the majority of them are costly, limited or come with other disadvantages. Instead of turning to one of these third-party resources, you can use Node.js to create a powerful web scraper that is both extremely versatile and completely free.

In this article, I’ll be covering the following:

  • two Node.js modules, Request and Cheerio, that simplify web scraping;
  • an introductory application that fetches and displays some sample data;
  • a more advanced application that finds keywords related to Google searches.

A few things worth noting before we go on: a basic understanding of Node.js is recommended for this article, so if you haven’t worked with it yet, get comfortable with the basics before continuing. Also, web scraping may violate the terms of service of some websites, so just make sure you’re in the clear there before doing any heavy scraping.

Modules

To bring in the Node.js modules I mentioned earlier, we’ll be using NPM, the Node Package Manager (if you’ve heard of Bower, it’s like that, except you use NPM to install Bower). NPM is a package management utility that is automatically installed alongside Node.js to make the process of using modules as painless as possible. By default, NPM installs the modules in a folder named node_modules in the directory where you invoke it, so make sure to call it in your project folder.
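
For example, once you’ve run npm install for the two modules introduced below (Request and Cheerio) in your project folder, pulling them into a script is a one-liner each; a minimal sketch:

// After `npm install request cheerio`, both modules resolve
// from the local node_modules folder that NPM created.
var request = require("request"),
  cheerio = require("cheerio");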

And without further ado, here are the modules we’ll be using.

Request

While Node.js does provide simple methods of downloading data from the Internet via HTTP and HTTPS interfaces, you have to handle them separately, to say nothing of redirects and other issues that appear when you start working with web scraping. The Request module merges these methods, abstracts away the difficulties and presents you with a single unified interface for making requests. We’ll use this module to download web pages directly into memory. To install it, run npm install request from your terminal in the directory where your main Node.js file will be located.
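
As a rough sketch of that unified interface (the URL below is just a placeholder), a single request call handles both HTTP and HTTPS and follows redirects for you:

var request = require("request");

// The same call works for http:// and https:// URLs alike;
// redirects are followed automatically by default.
request({ url: "https://example.com/", maxRedirects: 10 }, function (error, response, body) {
  if (error) {
    return console.log("Request failed: " + error);
  }
  console.log("Fetched " + body.length + " characters with status " + response.statusCode);
});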

Cheerio

Cheerio enables you to work with downloaded web data using the same syntax that jQuery employs. To quote the copy on its home page, “Cheerio is a fast, flexible and lean implementation of jQuery designed specifically for the server.” Bringing in Cheerio enables us to focus on the data we download directly, rather than on parsing it. To install it, run npm install cheerio from your terminal in the directory where your main Node.js file will be located.
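
To get a feel for that syntax before we download anything real, here’s a tiny sketch that loads a made-up snippet of HTML and queries it the jQuery way:

var cheerio = require("cheerio");

// Load an HTML string and query it with familiar jQuery-style selectors.
var $ = cheerio.load("<ul><li class='item'>one</li><li class='item'>two</li></ul>");

console.log($(".item").length);         // 2
console.log($(".item").first().text()); // "one"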

Implementation

The code below is a quick little application to nab the temperature from a weather website. The ZIP code goes on the end of the URL we’re downloading (the base URL is left as a placeholder in the snippet, so point it at the weather page of your choice), and if you want to try it out, you can put your own ZIP code in there. Just make sure to install the two modules we’re attempting to require first; the installation commands are given in the sections above.

var request = require("request"),
  cheerio = require("cheerio"),
  url = "" + 02888;
request(url, function (error, response, body) {
  if (!error) {
    var $ = cheerio.load(body),
      temperature = $("[data-variable='temperature'] .wx-value").html();
    console.log("It’s " + temperature + " degrees Fahrenheit.");
  } else {
    console.log("We’ve encountered an error: " + error);

So, what are we doing here? First, we’re requiring our modules so that we can access them later on. Then, we’re defining the URL we want to download in a variable.

Then, we use the Request module to download the page at the URL specified above via the request function. We pass in the URL that we want to download and a callback that will handle the results of our request. When that data is returned, that callback is invoked and passed three variables: error, response and body. If Request encounters a problem downloading the web page and can’t retrieve the data, it will pass a valid error object to the function, and the body variable will be null. Before we begin working with our data, we’ll check that there aren’t any errors; if there are, we’ll just log them so we can see what went wrong.
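
Incidentally, if you want to be a little stricter than the example above, you can also treat non-200 responses as failures before handing the body off to Cheerio. A minimal sketch of that defensive pattern (not part of the original example) might look like this:

request(url, function (error, response, body) {
  // Bail out on transport errors and on anything other than a 200 response.
  if (error || (response && response.statusCode !== 200)) {
    return console.log("Couldn’t fetch " + url + ": " + (error || response.statusCode));
  }
  var $ = cheerio.load(body);
  // ...scrape as usual
});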

If all is well, we pass our data off to Cheerio. Then, we’ll be able to handle the data like we would any other web page, using standard jQuery syntax. To find the data we want, we’ll have to build a selector that grabs the element(s) we’re interested in from the page. If you navigate to the URL I’ve used for this example in your browser and start exploring the page with developer tools, you’ll notice that the big green temperature element is the one I’ve constructed a selector for. Finally, now that we’ve got ahold of our element, it’s a simple matter of grabbing that data and logging it to the console.

We can take this in plenty of directions from here. I encourage you to play around; I’ve summarized the key steps for you below, followed by a generic skeleton that puts them together.

In Your Browser

  1. Visit the page you want to scrape in your browser, being sure to record its URL.
  2. Find the element(s) you want data from, and figure out a jQuery selector for them.

In Your Code

  1. Use request to download the page at your URL.
  2. Pass the returned data into Cheerio so you can get your jQuery-like interface.
  3. Use the selector you wrote earlier to scrape your data from the page.
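
Put together, those steps form a small reusable skeleton. Here’s a minimal sketch, with the URL and selector left as placeholders for whatever page you’re after:

var request = require("request"),
  cheerio = require("cheerio");

// The URL you recorded in your browser and the selector you worked out there.
var pageUrl = "http://example.com/some-page",
  selector = ".some-element";

// Download the page, hand it to Cheerio, then scrape with your selector.
request(pageUrl, function (error, response, body) {
  if (error) {
    return console.log("Couldn’t download the page: " + error);
  }
  var $ = cheerio.load(body);
  console.log($(selector).text());
});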

Going Further: Data Mining

More advanced uses of web scraping can often be categorized as data mining, the process of downloading a lot of web pages and generating reports based on the data extracted from them. Node.js scales well for applications of this nature.

I’ve written a small data-mining app in Node.js, less than a hundred lines, to show how we’d use the two libraries that I mentioned above in a more complicated implementation. The app finds the most popular terms associated with a specific Google search by analyzing the text of each of the pages linked to on the first page of Google results.

There are three main phases in this app:

  1. Examine the Google search.
  2. Download all of the pages and parse out all the text on each page.
  3. Analyze the text and present the most popular words.

We’ll take a quick look at the code that’s required to make each of these things happen — as you might guess, not a lot.

The first thing we’ll need to do is find out which pages we’re going to analyze. Because we’re going to be looking at pages pulled from a Google search, we simply find the URL for the search we want, download it and parse the results to find the URLs we need.

To download the page, we use Request, just as in the example above, and to parse it, we’ll use Cheerio again. Here’s what the code looks like:

request(url, function (error, response, body) {
  if (error) {
    console.log("Couldn’t get page because of error: " + error);
    return;
  }
  // load the body of the page into Cheerio so we can traverse the DOM
  var $ = cheerio.load(body),
    links = $(".r a"),
    results = [];

  links.each(function (i, link) {
    // get the href attribute of each link
    var linkUrl = $(link).attr("href");
    // strip out unnecessary junk
    linkUrl = linkUrl.replace("/url?q=", "").split("&")[0];
    if (linkUrl.charAt(0) === "/") {
      // internal Google link rather than a search result; skip it
      return;
    }
    // this link counts as a result, so keep it
    results.push(linkUrl);
  });
});

In this case, the URL variable we’re passing in is a Google search for the term “data mining.”
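
If you’d like to build that URL yourself, a simple way is to escape the search term with encodeURIComponent (the exact query-string format here is an assumption about how Google structured its search URLs at the time):

var searchTerm = "data mining",
  url = "https://www.google.com/search?q=" + encodeURIComponent(searchTerm);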

As you can see, we first make a request to get the contents of the page. Then, we load the contents of the page into Cheerio so that we can query the DOM for the elements that hold the links to the pertinent results. Then, we loop through the links and strip out some extra URL parameters that Google inserts for its own usage — when we’re downloading the pages with the Request module, we don’t want any of those extra parameters.

Finally, once we’ve done all that, we make sure the URL doesn’t start with a / — if so, it’s an internal link to something else of Google’s, and we don’t want to try to download it, because either the URL is malformed for our purposes or, even if it isn’t malformed, it wouldn’t be relevant.

Pulling the Words From Each Page

Now that we have the URLs of our pages, we need to pull the words from each page. This step consists of doing much the same thing we did just above — only, in this case, the URL variable refers to the URL of the page that we found and processed in the loop above.

request(url, function (error, response, body) {
  // load the page into Cheerio
  var $page = cheerio.load(body),
    text = $page("body").text();
});

Again, we use Request and Cheerio to download the page and get access to its DOM. Here, we use that access to get just the text from the page.

Next, we’ll need to clean up the text from the page — it’ll have all sorts of garbage that we don’t want on it, like a lot of extra white space, styling, occasionally even the odd bit of JSON data. This is what we’ll need to do:

  1. Compress all white space to single spaces.
  2. Throw away any characters that aren’t letters or spaces.
  3. Convert everything to lowercase.

Once we’ve done that, we can simply split our text on the spaces, and we’re left with an array that contains all of the rendered words on the page. We can then loop through them and add them to our corpus.

The code to do all that looks like this:

// Throw away extra white space and anything that isn't a letter or a space,
// then convert everything to lowercase.
text = text.replace(/\s+/g, " ")
           .replace(/[^a-zA-Z ]/g, "")
           .toLowerCase();

// Split on spaces for a list of all the words on that page and
// loop through that list.
text.split(" ").forEach(function (word) {
  // We don't want to include very short (under four characters) or very
  // long (over 20 characters) words because they're probably bad data.
  if (word.length > 3 && word.length < 20) {
    if (corpus[word]) {
      // If this word is already in our corpus, our collection of terms,
      // increase the count for appearances of that word by one.
      corpus[word]++;
    } else {
      // Otherwise, say that we've found one of that word so far.
      corpus[word] = 1;
    }
  }
});

Analyzing Our Words

Once we’ve got all of our words in our corpus, we can loop through it and sort the words by popularity. First, though, we’ll need to stick them in an array, because the corpus is an object.

// stick all the words in an array
var words = [];
for (var prop in corpus) {
  words.push({
    word: prop,
    count: corpus[prop]
  });
}

// sort the array based on how often each word occurs
words.sort(function (a, b) {
  return b.count - a.count;
});

The result will be a sorted array representing exactly how often each word in it has been used on all of the websites from the first page of results of the Google search. Below is a sample set of results for the term “data mining.” (Coincidentally, I used this list to generate the word cloud at the top of this article.)

[ { word: 'data', count: 981 },
  { word: 'mining', count: 531 },
  { word: 'that', count: 187 },
  { word: 'analysis', count: 120 },
  { word: 'information', count: 113 },
  { word: 'from', count: 102 },
  { word: 'this', count: 97 },
  { word: 'with', count: 92 },
  { word: 'software', count: 81 },
  { word: 'knowledge', count: 79 },
  { word: 'used', count: 78 },
  { word: 'patterns', count: 72 },
  { word: 'learning', count: 70 },
  { word: 'example', count: 70 },
  { word: 'which', count: 69 },
  { word: 'more', count: 68 },
  { word: 'discovery', count: 67 },
  { word: 'such', count: 67 },
  { word: 'techniques', count: 66 },
  { word: 'process', count: 59 } ]

If you’re interested in seeing the rest of the code, check out the fully commented source.

A good exercise going forward would be to take this application to the next level. You could optimize the text parsing, extend the search to multiple pages of Google results, or even strip out common words that aren’t really key terms (like “that” and “from”). More error handling could also be added to make the app even more robust; when you’re mining data, you want as many layers of redundancy as you can reasonably afford. The variety of content that you’ll be pulling in is such that you’ll inevitably come across an unexpected piece of text that, if unhandled, would throw an error and promptly crash your application.
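
As a starting point for that last suggestion, filtering out common words can be as simple as checking each one against a small stop-word list before it ever reaches the corpus; the list below is just a made-up starting set:

// A few common words that we don't want cluttering the results.
var stopWords = ["that", "from", "this", "with", "which", "such", "more"];

text.split(" ").forEach(function (word) {
  if (stopWords.indexOf(word) !== -1) {
    return; // skip stop words entirely
  }
  // ...count the word in the corpus as before
});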

In Conclusion

As always, if you find anything related to web scraping with Node.js that you think is helpful or just have questions or thoughts you want to share, be sure to let us know via the comments below. Also, follow me on Twitter @bovenille and check out my blog for more on Node.js, web scraping and JavaScript in general.




Elliot is a freelance JavaScript developer located somewhere in the general vicinity of Providence, Rhode Island. When not working, he enjoys programming, writing, and listening to music. He is a coffee drinker and a cat person. He is the owner of and writer for Hey, JavaScript!

  1. 1

    Nice Tut :-) But you should also point out that a major downside of scraping is that you’re doomed if the markup of your target changes significantly.

  2. 2

    David BOGDAN

    April 8, 2015 2:48 pm

    Very nice and handy article! Thanks for it!

    • 3

      Elliot Bonneville

      April 8, 2015 7:11 pm

      Glad you appreciated it! Thanks for stopping by.

      • 4

        steven kauyedauty

        June 14, 2015 12:41 pm

        Hey Elliot,

I am working with Cheerio and Request. I am wondering if you know a workaround for scraping the data tables library? I cannot access the classes in the scrape.

  3. 5

    This is fascinating. You could create a cron job and dump this data into a Mongo database and figure out later how to present the data. Thanks for the post!

    • 6

      I’ve been using basically the same methodology but with Firebase. Allows for a lot of cool stuff to be constructed quickly and painlessly.

  4. 7

    WOW! Thank you SO MUCH for this! I’ve recently been researching crawling/scraping for 2 new side projects of mine and I was unsure if I wanted to rely on PHP or .NET to do all the heavy lifting for me, but now that I know that I can use node.js for all this, I think the entire development process will be infinitely faster and easier in a programming language that I’m most familiar with. I’m beyond stoked to implement this.

    THANK YOU! :)

  5. 9

    Great tutorial. But I guess the temperature on the page is in Celsius.

  6. 12

    Data mining he says. :D Did you actually read the wikipedia (lol) link you provided?

    • 13

      Elliot Bonneville

      April 8, 2015 8:43 pm

      Heh. Admittedly, my colloquial employment and definition of the term “data mining” does differ from the definition given in the article (that is, in fact, precisely why I linked to that article), but given the context and the frequent employment of the term within that larger context, I feel it’s appropriate enough.

  7. 14

    TL;DR: Use a (CSS) selector engine, server-side.

If you do not have a fancy Node.js environment right at hand, try PHP and the trusty simple_html_dom class instead (aka “PHP Simple HTML DOM Parser”). Been using it for ages.

    Works perfectly well. A bit limited, but then I prefer document.querySelectorAll over some ültra-brütal jQuery stuff in nearly all cases, anyway.

    cu, w0lf.

    • 15

      Jonas Jancarik

      April 13, 2015 5:21 pm

      I am a bit surprised no one has mentioned Artoo.js yet – if I get it right, going with node.js can help you with additional requests etc., but for quick scraping Artoo can be a delight to use and all you need is a JS console – though it seems to be working the best in Chrome, e.g. saving CSV files.

      (Plus I am quite sure you can also use it for scraping a series of pages, passing the scraped data along the way, although I have not yet done that myself.)

      Doing scraping on the client side helps with authentication issues etc. To quote the Artoo website:

      “Nowadays, websites are not just plain html. […] So, by the days, to cope with this harsh reality, our scraping programs became complex monsters being able to execute JavaScript, authenticate on websites and mimic human behaviour. But, if you sit back and try to find other programs able to perform all those things, you’ll quickly come to this observation: Aren’t we trying to rebuild web browsers?”

      That said, I have used PHP Simple HTML DOM for a good number of projects and it is a great library.

  8. 16

Sometimes pages require the user to click on a link to load more content (Ajax); does this approach work well in that case? And what about page sessions?

    I usually use Greasemonkey to do the dirty work to overcome these issues.

  9. 19

This looks like a nifty replacement for wget or httrack for deploying static versions of node sites. It’d be cool to build this as a Ghost app so when a site is updated it’s crawled and a statically generated… Sounds like a summer project ;)

  10. 20

    Possible next steps: have a file full of conjunctions & less significant words, like “have”, “from”, “with”, “this”, “would” and so on which you check the words against & eliminate from the results if they appear. Might be interesting to change the URL to point at Google News or maybe Trends & generate word clouds from these.

  11. 21

    Great tutorial. Thank you SO MUCH for this article!

  12. 22

    Hey. I already built what you suggested in the end. I want to learn more scraping. What would be good to try next? Any further Tuts would also be great.


    • 23

      Elliot Bonneville

      April 9, 2015 4:59 pm

I would say it probably depends on what you want or need to scrape. The basics are all here, although you could look into alternative modules to either of the ones I’ve mentioned (I’ve linked to Zombie.js and PhantomJS as potential replacements for Request), and Cheerio isn’t the only DOM manipulation library, either.

Going forward, you might be interested in storing the results of your scraping, in which case I’d suggest checking out some Node.js databases (I wrote an article a while back on MongoDB, which pairs nicely with Node).

I would also say, going forward, another one of your concerns will be making sure there aren’t any memory leaks in your code (as those tend to pile up very quickly when you’re downloading lots of data) and getting some really solid error handling set up. Once you’ve done that and you’re looking to make further improvements, feel free to post your code over at Stack Exchange’s Code Review site; they can usually give you some tips.

      If you have more questions feel free to shoot me an email (my name with a period instead of a space, at Gmail) or a tweet! Always happy to chat.

      Hope this helps,

  13. 25

    This is a very good introduction to web scraping, it is very easy to understand.

  14. 26

    Herlon Aguiar

    April 9, 2015 7:43 pm

    I think that is better :D
    You can also fork the project if you want.
I say this because I already did some web scraping by myself and it was a little hard to understand.

  15. 28

Thanks for sharing this informative article on Node.js. In recent years, this technology has been shaping up to be the future of web design and development. I am going to include Node.js in my web design training syllabus when educating my students.

  16. 29

    Web Hosting Nepal

    April 10, 2015 1:58 pm

Great tutorial, thanks for sharing web scraping using Node.js.

  17. 30

    Hello Elliot,

    First, thanks for the article.

    My question is: what is the best framework you recommend to test this application?


    • 31

      Elliot Bonneville

      April 14, 2015 3:12 am

      Hey Diego, thanks for stopping by!

      Are you referring to application testing, or just running the app in general? If the latter, you’ll have to install Node.js. Find out more here:

      • 32


I refer to application testing (Mocha, Jasmine…). Any suggestions?


        • 33

          Elliot Bonneville

          April 14, 2015 9:36 pm

          That’s a pretty general question. There are lots of options, each with their own advantages and disadvantages. I haven’t written any formal tests for this app, so I’d say use whatever you can make work, if you’d like to do that yourself.

    • 34

      Elliot Bonneville

      April 14, 2015 3:12 am

      Diogo, my apologies. :-)

  18. 35

    Where does web scraping stand from a legal point of view, specifically copyright/intellectual property?

    • 36

      Elliot Bonneville

      April 14, 2015 9:33 pm

      Hey Brian, tough question. Essentially it boils down to what a website specifies in its Terms of Service; many websites don’t like web scraping and say as much. Legally, you shouldn’t touch their content. Realistically, though, they are limited in terms of preventative action, so unless you’ve done something like create a service which relies on heavily scraping a single site, you probably won’t suffer any legal consequences if you do.

      Further reading here (don’t know how much I would trust it directly, but there are lots of interesting links in the References section):

  19. 37

    Thanks for this simple yet powerful article on nodejs.

  20. 38

Good article, however I do think it is important to encourage people to be good net citizens. No one (especially smaller operations) wants their site being hammered by web scrapers. If you are going to be harvesting full sites, you should take note of the robots.txt and give yourself a suitable user agent so that site owners can identify you.

