It’s been nearly two years since I wrote about using Google’s CDN to host jQuery on your public-facing sites. In that post, I recommended it due to three primary benefits that public CDNs offer: decreased latency, increased parallelism, and improved caching.

Though the post has been overwhelmingly well-received, concerns have been raised as to whether the likelihood of better caching is truly significant. Since the efficacy of that benefit depends entirely on how many other sites are using the same CDN, it takes quite a bit of research to make an objective case either way.

I’ve never been happy about responding with vague answers. Caching probability is a valid concern and deserves to be taken seriously. So, I decided to cobble together an HTTP crawler, analyze 200,000 of the most popular sites on the Internet, and determine how many of those are referencing jQuery on Google’s public CDN.


Methodology

Having an idea of how many sites reference jQuery on Google’s CDN and how popular those sites are should lead to a more objective decision, one way or the other. Just a handful of top-ranked sites can prime as many browser caches as thousands of more obscure sites. Conversely, heavy coverage across a long tail of moderately popular sites has the potential to prime just as many caches.

To measure the CDN’s coverage and how that coverage varies with site popularity, I decided to use Alexa as my source of sites to analyze. Alexa is far from perfect, but they make a free CSV file of their top-ranked sites available, and the aggregate across 200,000 sites smooths out most of Alexa’s issues.

To determine which of those sites use a public jQuery CDN, my crawler ran down Alexa’s list and downloaded the document at each site’s root. Then, I logged any script element with a src attribute that contained the word “jQuery”.
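As a rough illustration of that logging step, here is a minimal Python sketch. It is not the crawler I actually ran: it swaps the regex for Python's standard html.parser module, and the class and function names are hypothetical. It takes an already-downloaded document and returns every script src that mentions jQuery, ignoring case.

```python
from html.parser import HTMLParser


class JQueryScriptFinder(HTMLParser):
    """Collects the src of every <script> tag whose src mentions "jquery"."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        # Compare case-insensitively, since file names are usually lowercase.
        if src and "jquery" in src.lower():
            self.references.append(src)


def find_jquery_references(html):
    """Return the jQuery-related script references found in one document."""
    finder = JQueryScriptFinder()
    finder.feed(html)
    return finder.references
```

Casting the net this wide also picks up plugins and locally hosted copies, which is what makes the later ad-hoc filtering possible.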

Inaccuracies

I’ll be the first to admit that my approach is fraught with inaccuracies:

  • Alexa – Alexa itself isn’t a great ranking mechanism. It depends on toolbar-reported data and individual rankings must be taken with a grain of salt. However, I believe that aggregate trends across its top 200,000 sites represent a useful high-level view.
  • HTTP errors – About 10% of the URLs I requested were unresolvable, unreachable, or otherwise refused my connection. A big part of that is due to Alexa basing its rankings on domains, not specific hosts. Even if a site only responds to www.domain.com, Alexa lists it as domain.com and my request to domain.com went unanswered.
  • jsapi – Sites using Google’s jsapi loader and google.load() weren’t counted in my tally, even though they do eventually reference the same googleapis.com URL. Both script loading approaches do count toward the same critical mass of caching, but my crawler’s regex doesn’t catch google.load().
  • Internal usage – It’s not uncommon for sites to pare their landing pages down to the absolute bare minimum, only introducing more superfluous JavaScript references on inner pages that require them. Since I only analyzed root documents, I undercounted any sites taking that approach and using the Google CDN to host jQuery on those inner pages.
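For the jsapi gap in particular, a second pattern could look for google.load() calls in a page's inline script. The sketch below was not part of my crawl; the pattern and function name are hypothetical, and it only sees calls that appear in the downloaded document itself, not ones buried in external script files.

```python
import re

# Matches calls such as google.load("jquery", "1.4.2"), with either quote style.
GOOGLE_LOAD = re.compile(
    r"""google\.load\(\s*["'](jquery(?:ui)?)["']\s*,\s*["']([\d.]+)["']""",
    re.IGNORECASE,
)


def find_jsapi_loads(html):
    """Return (library, version) pairs requested through google.load()."""
    return [(name.lower(), version) for name, version in GOOGLE_LOAD.findall(html)]
```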

At first, that may seem like an awful lot of potential error. However, the one thing all of these inaccuracies have in common is that none of them favor the case for using a public CDN. Playing the averages, I expect that the actual usage numbers are at least 10% higher than what I found.

So, in terms of making a case for the CDN, this analysis is extremely conservative.

Analysis

By casting a wide net with the regex and logging any script reference that contained the word “jQuery”, I was able to construct ad-hoc queries to answer a variety of questions. For example, how many of the top 200,000 sites use the Google CDN to host jQuery UI for them?

SELECT COUNT(*)
FROM Results
WHERE Reference LIKE '%googleapis%jquery-ui.min.js'

Answer: 989

Want to know how many of the top 1,000 sites use the Microsoft CDN for any jQuery-related script?

SELECT COUNT(*) 
FROM Results 
WHERE Reference LIKE '%ajax.microsoft%jquery%' 
  AND Rank <= 1000

Answer: 1 (Microsoft.com)
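If you want to experiment with queries like these, the following Python sketch runs the same kind of LIKE filter against an in-memory SQLite database. The Results table with its Rank and Reference columns mirrors the queries above; SQLite, the Site column, the helper name, and every sample row are assumptions made up for illustration and are not data from the crawl.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Results (Rank INTEGER, Site TEXT, Reference TEXT)")

# Made-up rows for illustration only; they are not the crawl's data.
conn.executemany(
    "INSERT INTO Results VALUES (?, ?, ?)",
    [
        (12, "example-a.test",
         "http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"),
        (850, "example-b.test",
         "http://ajax.microsoft.com/ajax/jquery/jquery-1.4.2.min.js"),
        (4200, "example-c.test",
         "http://ajax.googleapis.com/ajax/libs/jqueryui/1.8.4/jquery-ui.min.js"),
        (90000, "example-d.test", "/scripts/jquery-1.3.2.min.js"),
    ],
)


def count_matching(pattern, max_rank=200000):
    """Count references matching a LIKE pattern among sites ranked <= max_rank."""
    row = conn.execute(
        "SELECT COUNT(*) FROM Results WHERE Reference LIKE ? AND Rank <= ?",
        (pattern, max_rank),
    ).fetchone()
    return row[0]
```

Calling count_matching('%ajax.microsoft%jquery%', 1000) reproduces the shape of the second query above, with the rank cutoff passed as a parameter.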

My findings

Without further ado, across the 200,000 sites that I analyzed, this is what I found:

  • 47 of the Alexa top 1,000 include a Google CDN reference.
  • 99 of the Alexa top 2,000 reference jQuery on the Google CDN.
  • 6,953 of the top 200,000 sites include a script reference to some version of jQuery hosted on Google’s CDN.

Just within the top thousand or so sites, I found the Google CDN being used to host jQuery for very high-traffic sites including Twitter, TwitPic, SlideShare, Break, Stack Overflow, Woot, Posterous, SitePoint, Foursquare, FAIL Blog, Stanford, and the jQuery site itself. These sites alone are priming tens, if not hundreds, of millions of browser caches with the Google CDN’s copy of jQuery.

Not only that, but popular sites using the Google jQuery CDN span a diverse range of genres and demographics. While I might theorize that a minority of my readers are also regular Break.com and Foursquare visitors, I cannot possibly make that claim for Twitter and Stack Overflow. A non-trivial amount of my traffic is referred directly from those sites, and enjoys a no-HTTP-request cache hit for the jQuery reference here on my site.

In fact, I found that for almost any genre a site falls within, there’s at least one site near the top of Alexa’s rankings that uses Google’s jQuery CDN, priming caches for all of the smaller sites in that niche.

Disproving a theory

Going into this, I expected that usage of a public CDN would become less common as a site’s popularity (and Alexa ranking) increased. This wouldn’t necessarily be desirable, since high-traffic sites referencing the CDN improve the caching situation for all of us, but I thought it likely.

Since most large-scale sites already host static assets on CDNs, I reasoned that they would be less likely to use a shared, public CDN like Google’s. On the other hand, I thought that smaller sites would be more eager to take advantage of the free CDN service, which has become easy for even non-technical site owners to implement.

However, what I found was a nearly dead-even distribution across the 200,000 sites I sampled. There were some variations, but it appears that larger sites are just as happy to use Google’s bandwidth as anyone. This is a great result for the rest of us. When popular sites like Twitter, StumbleUpon, and Stack Overflow seed their myriad users’ caches, it’s more likely that smaller sites will benefit from a no-HTTP-request jQuery load.
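One way to check for that kind of even spread is to group the matching sites into fixed-size rank buckets and compare the usage rate in each. This is a minimal sketch rather than the exact analysis I ran; the function name, the bucket size, and the assumption that the total divides evenly into buckets are all mine.

```python
from collections import Counter


def usage_by_bucket(ranks_using_cdn, bucket_size=10000, total_sites=200000):
    """Fraction of sites in each rank bucket that reference the CDN.

    Assumes total_sites is a multiple of bucket_size, and that
    ranks_using_cdn holds the 1-based Alexa ranks of the matching sites.
    """
    hits = Counter((rank - 1) // bucket_size for rank in ranks_using_cdn)
    buckets = total_sites // bucket_size
    return [hits[i] / bucket_size for i in range(buckets)]
```

For a sense of scale, the cumulative counts listed under my findings work out to 4.7% of the top 1,000, 4.95% of the top 2,000, and roughly 3.5% of the full 200,000.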

Issues

I’m optimistic about the results of my research, but the analysis did reveal some issues that can’t be ignored. I hope identifying these sore spots can raise awareness and eventually improve the situation.

Version fragmentation

One obstacle in the way of optimal caching is that sites must reference exactly the same CDN URL in order to obtain the cross-site caching benefit. Thankfully, jQuery tends to quickly settle on a stable version after each major release, and that version is relatively long-lasting.

Unfortunately, I did find a handful of sites still referencing odd versions of jQuery, such as 1.3.0 and 1.4.1. That mistake wasn’t very common, but even one popular site referencing an odd version of jQuery is one too many.

The most notable offender is Twitter. For reasons I can’t fathom, their jQuery reference is to 1.3.0. I assume that’s being updated to 1.4.2 as part of the #newtwitter revamp currently underway, but I was surprised to find that reference on a site under the stewardship of so many developers.

The takeaway here is to keep your site’s CDN reference updated to the latest compatible version of jQuery. Even if you have some legitimate reason to avoid the upgrade from 1.3 to 1.4, at least be sure that you’re referencing 1.3.2.
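To see how fragmented a set of logged references is, the version segment can be pulled out of each Google CDN URL and tallied. The sketch below assumes the URL layout shown elsewhere in this post (/ajax/libs/jquery/VERSION/jquery.min.js); the pattern and function name are hypothetical.

```python
import re
from collections import Counter

# Captures the version segment of a Google-hosted jQuery URL, minified or not.
GOOGLE_JQUERY = re.compile(
    r"ajax\.googleapis\.com/ajax/libs/jquery/([\d.]+)/jquery(?:\.min)?\.js",
    re.IGNORECASE,
)


def tally_versions(references):
    """Count how many references point at each jQuery version on Google's CDN."""
    versions = Counter()
    for ref in references:
        match = GOOGLE_JQUERY.search(ref)
        if match:
            versions[match.group(1)] += 1
    return versions
```

Every version in the tally other than the most common one represents sites whose visitors can't share a cached copy with the majority.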

Specificity is crucial

After fragmentation, the next most common mistake I found was using the “latest version” references that some CDNs offer. The “latest version” reference allows you to request either version 1 or 1.x and automatically receive the latest matching 1.x.y version.

For example, at the time of writing, this reference returns jQuery 1.4.2:

http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.min.js

And, this reference returns jQuery 1.3.2:

http://ajax.googleapis.com/ajax/libs/jquery/1.3/jquery.min.js

You should never do this in production.

In order to ensure that references to a “latest version” remain current when jQuery is updated, these requests are necessarily served with a very short expires header. Not only does this break the Internet-wide caching benefit between your site and others, the “latest version” reference’s short expires header likely even makes the CDN less optimal than serving jQuery from your own site!

Worse, you’re giving a third party permission to transparently change one of your site’s fundamental dependencies without your approval or interaction when a jQuery update occurs. This is not a good idea.
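A simple check can flag these floating references before they reach production. The sketch below assumes the Google CDN path layout shown above and treats any version segment with fewer than three parts as a “latest version” reference; the function name is hypothetical.

```python
import re

# Version segment of a Google-hosted jQuery or jQuery UI URL.
VERSION_SEGMENT = re.compile(r"/ajax/libs/jquery(?:ui)?/([\d.]+)/", re.IGNORECASE)


def is_floating_reference(url):
    """True when the URL asks for "1" or "1.x" instead of an exact x.y.z."""
    match = VERSION_SEGMENT.search(url)
    if not match:
        return False  # Not a Google CDN jQuery URL, so nothing to flag.
    return len(match.group(1).split(".")) < 3
```

Run against the two example URLs above, it flags both, while a reference to /1.4.2/ passes.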

A notable offender here is jQuery.com itself. The site currently references the Google CDN for jQuery, but unfortunately references the “latest” 1.4 version instead of 1.4.2 specifically. Not only is that slower than necessary for repeat jQuery.com visitors, but imagine how many browser caches they could be priming if they were referencing 1.4.2 specifically!

The Microsoft CDN

Since I’m a fan of public CDNs, I was happy to see Microsoft start hosting MicrosoftAjax.js and the now-defunct ASP.NET Ajax Library on their CDN. However, I’m disappointed to see them pushing it as a solution for hosting jQuery and jQuery UI for two reasons:

  • Cookies – Because Microsoft’s CDN falls under the Microsoft.com domain, every request to it needlessly includes the plethora of cookies that other Microsoft.com subdomains set. In my case, this weighs in at about 3 KB of superfluous cookie data that must be transmitted along with every request to the Microsoft CDN.
  • Popularity – Far fewer public-facing sites use the Microsoft CDN to serve jQuery for them: I found only 49 sites in the entire top 200,000 that reference Microsoft’s copy of jQuery. We can speculate endlessly about why that is, but the “why” is unimportant. The Google CDN has such a vast caching advantage at this point that using the Microsoft CDN for jQuery is a needless performance penalty.

I have friends at Microsoft and hesitated to point out these issues with using Microsoft’s CDN for jQuery and jQuery UI. Moreover, I do commend them for providing the Microsoft-specific scripts on a public CDN.

Ultimately though, I’d be remiss not to mention these drawbacks, since so many .NET developers seem eager to use Microsoft’s CDN out of misplaced brand loyalty. It’s a shame for .NET developers to unwittingly contribute to the aforementioned fragmentation issue, while simultaneously missing out on the caching benefit that Google’s more popular CDN offers.

Conclusion

If you’re using jQuery on a public-facing site, use the Google CDN to host it. This is not simply a theoretically good idea, but is objectively, quantitatively justified. Better yet, the likelihood of a cache hit is only growing.