When it comes to extracting data from an HTML document on the server side, Node.js is a fantastic option. Not so much because of any particular feature of Node itself, but because running JavaScript on the server also means that you can run jQuery on the server. Many of the techniques you’ve learned on the client side then apply on the server side as well.

Typically, using jQuery on the server with Node is accomplished via a handy module called jsdom. Unfortunately, later versions of jsdom took a dependency on a module called contextify, and that choice has made jsdom not-so-friendly to those of us running Node.exe on Windows.

As I was jumping through the hoops to build contextify in Visual Studio, copy that binary back into my Node.exe project, and cross my fingers, I realized that maybe I’d become too fixated on jsdom as the only solution to this problem. After all, one of Node’s strengths is its thriving ecosystem of third-party libraries (reminiscent of the community that sprang up around jQuery itself several years ago).

Surely, there’s a Windows-friendly alternative to jsdom out there, right?

In this post, I’ll show you why jQuery on the server is useful, the alternative to jsdom that I found (Cheerio), and how to use Cheerio’s jQuery syntax to request and parse a remote HTML document.

Note: You don’t need to be a Windows user to benefit from Cheerio. Though jsdom impressively emulates how a browser might interpret an HTML document, at a much deeper level than most simple HTML parsers, that’s massive overkill in a lot of scenarios. In those simpler situations, jsdom carries a needless performance tax for that highly realistic simulation of the DOM.

Why use jQuery on the server?

One of my favorite things about using jQuery to interact with web pages is that you can rapidly prototype and iterate on ideas within a browser’s development tools. Anything that shortens the distance between an experiment and its result is a big win, and using your browser as a live REPL environment accomplishes just that.

For example, say we wanted to extract the titles of the posts appearing on the index page of my blog. It’s incredibly easy to load the page up in Chrome and experiment at the console a bit:

Screenshot of jQuery experimentation at the Chrome console.

Using console.log to echo those post titles isn’t very useful, but it’s helpful for getting quick confirmation that a particular combination of selectors and traversals does what you expected.

Now imagine that this browser-based experimentation was something that you could port directly to server-side code. Suddenly, your browser essentially becomes an IDE tool for developing the screen scraping portion of your server-side code. If you’ve ever worked on traditional HTML parsing code before, using something like XPath (or, heaven forbid, regular expressions), you know what a huge improvement this workflow offers.

Finding a pure JavaScript alternative to jsdom

While working on the first phase of the Juice UI project, we found the need for an API to jQuery UI’s properties, methods, and events, similar to what jQuery core provides. Since no official API was available, I decided to screen scrape the jQuery UI documentation and build one myself: http://jqueryuiapi.com.

Originally, I built that API using Node and jsdom, and hosted it on a dedicated Ubuntu VPS. As iisnode matured, I thought it would be nice to migrate the API site to unused capacity on one of my Windows servers. Unfortunately, jsdom’s dependence on contextify threw a wrench in that plan, and the migration went on the back burner for a while.

Eventually, it occurred to me that jsdom couldn’t possibly be the only game in town. After a bit of searching, I arrived at a promising GitHub repository for a project named Cheerio.

Cheerio vs. jsdom

Cheerio is a relatively simple library with a clear mission statement:

Fast, flexible, and lean implementation of core jQuery designed specifically for the server.

Cheerio doesn’t try to emulate a full implementation of the DOM. It specifically focuses on the scenario where you want to manipulate an HTML document using jQuery-like syntax. As such, it compares to jsdom favorably in some cases, but not in every situation.

  • Cheerio is faster – Since Cheerio does not attempt to create an accurate representation of the DOM, it is much faster than jsdom. In fact, the author suggests that Cheerio is roughly eight times faster than jsdom.
  • Cheerio is more flexible – Malformed HTML markup (i.e. the kind of markup that’s painfully plentiful online) can give jsdom trouble. Since Cheerio doesn’t build as accurate a representation of the document in memory, it’s more flexible when it parses the markup you throw its way.
  • jsdom is more powerful – I can’t stress enough that Cheerio doesn’t even attempt to compete with jsdom’s full feature set. For example, content generated on a page by JavaScript won’t be available when you parse that page with Cheerio, but would be rendered and accessible through jsdom (e.g. you couldn’t use Cheerio to read the content of my sidebar ad).

The bottom line? If all you need is simple jQuery-esque functionality against a static document, Cheerio is faster, easier, and more flexible than jsdom. If you need anything more, jsdom is the better choice.

Installing Cheerio and Request

To give Cheerio a try, the first step is to install the Cheerio module in your project.

If you’ve already installed a recent version of Node on your machine, you’ll also have npm installed. Using npm, you can download Cheerio by opening a command window, navigating to the top-level directory of the project you want to use Cheerio in, and executing npm install cheerio.

Cheerio itself doesn’t include a mechanism for making HTTP requests, and that’s something that can be tedious to handle manually. It’s a bit easier to use a module called request to facilitate requesting remote HTML documents. Request handles common tasks like caching cookies between multiple requests and setting the content length on POSTs, and it generally makes life easier.

Installing both Cheerio and request will look something like this:

Installing Cheerio and request from the command line with npm.

Using the browser code on the server

With Cheerio and request both installed, now it’s easy to bring our browser-based experimentation directly to the server without modification:

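var request = require('request'),
    cheerio = require('cheerio');

request('http://encosia.com', function(error, response, body) {
  // Stop early if the request failed or returned a non-200 status,
  //  since there would be no usable HTML to parse.
  if (error || response.statusCode !== 200) {
    console.error('Request failed:', error || response.statusCode);
    return;
  }

  // Hand the HTML response off to Cheerio and assign that to
  //  a local $ variable to provide familiar jQuery syntax.
  var $ = cheerio.load(body);

  // Exactly the same code that we used in the browser before:
  $('h2').each(function() {
    console.log($(this).text());
  });
});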

Running that bit of code through Node.exe at the command line extracts the same post titles that our original experimentation at the Chrome developer tools console did.

Conclusion

While Cheerio isn’t powerful enough for every task, it has handled everything I’ve thrown at it so far (including the parsing that underlies jQueryUIAPI.com). I’ve found that the types of documents I want to parse behind the scenes are still mostly static, and that’s where Cheerio excels.

As someone who spends a lot of time developing for the web, my brain just doesn’t think in terms of XPath or XML traversals when I look at markup. Being able to use JavaScript and a jQuery-like syntax to parse HTML is a perfect match for the mental model I approach these tasks with.

Even though jsdom will probably work well on Windows in the long run (there are viable workarounds to the contextify problem, to be clear), I’m glad that my temporary hassles with it led me to Cheerio. In the applications where its approach is viable, Cheerio is easier to use and quite a bit faster than jsdom. Give it a try.