Searchable Dynamic Content With AJAX Crawling

Google Search likes simple, easy-to-crawl websites. You like dynamic websites that show off your work and that really pop. But search engines can’t run your JavaScript. That cool AJAX routine that loads your content is hurting your SEO.

Google’s robots parse HTML with ease; they can pull apart Word documents, PDFs and even images from the far corners of your website. But as far as they’re concerned, AJAX content is invisible.

The Problem With AJAX

AJAX has revolutionized the Web, but it has also hidden its content. If you have a Twitter account, try viewing the source of your profile page. There are no tweets there — just code! Almost everything on a Twitter page is built dynamically through JavaScript, and the crawlers can’t see any of it. That’s why Google developed AJAX crawling1.

Because Google can’t get dynamic content from HTML, you will need to provide it another way. But there are two big problems: Google won’t run your JavaScript, and it doesn’t trust you.

Google indexes the entire Web, but it doesn’t run JavaScript. Modern websites are little applications that run in the browser, but running all of those applications while indexing is just too slow for Google and everyone else.

The trust problem is trickier. Every website wants to come out first in search results; your website competes with everyone else’s for the top position. Google can’t just give you an API to return your content because some websites use dirty tricks like cloaking2 to try to rank higher. Search engines can’t trust that you’ll do the right thing.

Google needs a way to let you serve AJAX content to browsers while serving simple HTML to crawlers. In other words, you need the same content in multiple formats.

Two URLs For The Same Content

Let’s start with a simple example. I’m part of an open-source project called Spiffy UI. It’s a Google Web Toolkit3 (GWT) framework for REST and rapid development. We wanted to show off our framework, so we made SpiffyUI.org4 using GWT.

GWT is a dynamic framework that puts all of our content in JavaScript. Our index.html file looks like this:

<body>
   <script type="text/javascript" language="javascript"
   src="org.spiffyui.spsample.index.nocache.js"></script>
</body>

Everything is added to the page with JavaScript, and we control our content with hash tags5 (I’ll explain why a little later). Every time you move to another page in our application, you get a new hash tag. Click on the “CSS” link and you’ll end up here:

http://www.spiffyui.org#css

The URL in the address bar will look like this in most browsers:

http://www.spiffyui.org/?css

We’ve fixed it up with HTML5. I’ll show you how later in this article.

This simple hash works well for our application and makes it bookmarkable, but it isn’t crawlable. Google doesn’t know what a hash tag means or how to get the content it points to, but it does provide an alternate way for a website to return that content. So, we let Google know that our hash is really JavaScript code instead of just an anchor on the page by adding an exclamation point (a “bang”), like this:

http://www.spiffyui.org#!css

This hash bang is the secret sauce in the whole AJAX crawling scheme. When Google sees these two characters together, it knows that more content is hidden by JavaScript. It gives us a chance to return the full content by making a second request to a special URL:

http://www.spiffyui.org?_escaped_fragment_=css

The new URL has replaced the #! with ?_escaped_fragment_=. Using a URL parameter instead of a hash tag is important, because parameters are sent to the server, whereas hash tags are available only to the browser.
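
To make that concrete, here is a rough sketch of the transformation the crawler performs; it isn’t code you need to write, but it shows how the two URLs relate:

// How a crawler maps a hash-bang URL to the escaped-fragment URL it requests.
function escapedFragmentUrl(prettyUrl) {
   var parts = prettyUrl.split('#!');
   var base  = parts[0];
   var joint = base.indexOf('?') === -1 ? '?' : '&';
   return base + joint + '_escaped_fragment_=' + encodeURIComponent(parts[1] || '');
}

// escapedFragmentUrl('http://www.spiffyui.org#!css')
//    returns 'http://www.spiffyui.org?_escaped_fragment_=css'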

That new URL lets us return the same content in HTML format when Google’s crawler requests it. Confused? Let’s look at how it works, step by step.

Snippets Of HTML

The whole page is rendered in JavaScript. We needed to get that content into HTML so that it would be accessible to Google. The first step was to separate SpiffyUI.org into snippets of HTML.

Google still thinks of a website as a set of pages, so we needed to serve our content that way. This was pretty easy with our application, because we have a set of pages, and each one is a separate logical section. The first step was to make the pages bookmarkable.

Bookmarking

Most of the time, JavaScript just changes something within the page: when you click that button or pop up that panel, the URL of the page does not change. That’s fine for simple pages, but when you’re serving content through JavaScript, you want to give users unique URLs so that they can bookmark certain areas of your application.

JavaScript applications can change the URL of the current page, so they usually support bookmarking via the addition of hash tags. Hash tags work better than any other URL mechanism because they’re not sent to the server; they’re the only part of the URL that can be changed without having to refresh the page.

The hash tag is essentially a value that makes sense in the context of your application. Choose a tag that is logical for the area of your application that it represents, and add it to the hash like this:

http://www.spiffyui.org#css

When a user accesses this URL again, we use JavaScript to read the hash tag and send the user to the page that contains the CSS.

You can choose anything you want for your hash tag, but try to keep it readable, because users will be looking at it. We give our hash tags names like css, rest and security.
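
In plain JavaScript, reading the hash and routing on it boils down to something like this minimal sketch. Spiffy UI does this through GWT, so the panel-showing functions here are hypothetical placeholders rather than our actual code:

function showPanelForHash() {
   // location.hash includes the leading "#" and, for crawlable URLs, the "!" bang.
   var tag = window.location.hash.replace(/^#!?/, '');

   if (tag === 'css') {
      showCssPanel();       // hypothetical function that renders the CSS page
   } else if (tag === 'rest') {
      showRestPanel();      // hypothetical
   } else {
      showLandingPanel();   // hypothetical default page
   }
}

// Route when the page loads and again whenever the hash changes.
window.addEventListener('hashchange', showPanelForHash);
showPanelForHash();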

Because you can name the hash tag anything you want, adding the extra bang for Google is easy. Just slide it between the hash and the tag, like this:

http://www.spiffyui.org#!css

You can manage all of your hash tags manually, but most JavaScript history frameworks will do it for you. All of the plug-ins that support HTML4 use hash tags, and many of them have options for making URLs bookmarkable. We use History.js6 by Ben Lupton7. It’s easy to use, it’s open source, and it has excellent support for HTML5 history integration. We’ll talk more about that shortly.
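
A typical History.js setup looks roughly like this; treat it as a sketch of the library’s API rather than our production code:

// Push a new state when the user navigates to the CSS section...
History.pushState({panel: 'css'}, 'Spiffy UI - CSS', '?css');

// ...and listen for state changes (including back and forward) to render it.
History.Adapter.bind(window, 'statechange', function() {
   var state = History.getState();
   showPanel(state.data.panel);   // hypothetical rendering function
});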

Serving Up Snippets

The hash tag makes an application bookmarkable, and the bang makes it crawlable. Now Google can ask for special escaped-fragment URLs like so:

[Diagram: the crawler requests the escaped-fragment URL and the server responds with an HTML snippet8]

When the crawler accesses our ugly URL, we need to return simple HTML. We can’t handle that in JavaScript because the crawler doesn’t run JavaScript. So, it all has to come from the server.

You can implement your server in PHP, Ruby or any other language, as long as it delivers HTML. SpiffyUI.org is a Java application, so we deliver our content with a Java servlet9.

The escaped fragment tells us what to serve, and the servlet gives us a place to serve it from. Now we need the actual content.

Getting the content to serve is tricky. Most applications mix the content in with the code, but we don’t want to parse the readable text out of the JavaScript. Luckily, Spiffy UI has an HTML-templating mechanism. The templates are embedded in the JavaScript but also included on the server. When the escaped-fragment request comes in with the ID css, we just have to serve CSSPanel.html.
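
Our real implementation is the Java servlet linked below, but the same lookup-table idea works in any language. Here is a rough Node.js sketch; every file name other than CSSPanel.html is made up for illustration:

var http = require('http');
var url  = require('url');
var fs   = require('fs');

// Map escaped-fragment IDs to the HTML templates we already keep on the server.
var snippets = {
   'css':      'CSSPanel.html',
   'rest':     'RESTPanel.html',      // hypothetical file name
   'security': 'SecurityPanel.html'   // hypothetical file name
};

http.createServer(function(req, res) {
   var fragment = url.parse(req.url, true).query._escaped_fragment_;

   if (fragment !== undefined && snippets[fragment]) {
      // The crawler asked for ?_escaped_fragment_=css, so return plain HTML.
      fs.readFile(snippets[fragment], 'utf8', function(err, html) {
         res.writeHead(err ? 500 : 200, {'Content-Type': 'text/html'});
         res.end(err ? '' : html);
      });
   } else {
      // A normal browser gets the JavaScript application as usual.
      fs.readFile('index.html', 'utf8', function(err, html) {
         res.writeHead(200, {'Content-Type': 'text/html'});
         res.end(html);
      });
   }
}).listen(8080);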

The template without any styling looks very plain, but Google just needs the content. Users see our page with all of the styles and dynamic features:

[Screenshot: the styled CSS page as users see it in the browser10]

Google gets only the unstyled version:

[Screenshot: the unstyled HTML snippet that Google receives11]

You can see all of the source code for our SiteMapServlet.java12 servlet. This servlet is mostly just a look-up table that takes an ID and serves the associated content from somewhere on our server. It’s called SiteMapServlet.java because this class also handles the generation of our site map.

Tying It All Together With A Site Map

Our site map13 tells the crawler what’s available in our application. Every website should have a site map; AJAX crawling doesn’t work without one.

Site maps are simple XML documents that list the URLs in an application. They can also include data about the priority and update frequency of the app’s pages. Normal entries for site maps look like this:

<url>
   <loc>http://www.spiffyui.org/</loc>
   <lastmod>2011-07-26</lastmod>
   <changefreq>daily</changefreq>
   <priority>1.0</priority>
</url>

Our AJAX-crawlable entries look like this:

<url>
   <loc>http://www.spiffyui.org/#!css</loc>
   <lastmod>2011-07-26</lastmod>
   <changefreq>daily</changefreq>
   <priority>0.8</priority>
</url>

The hash bang tells Google that this is an escaped fragment, and the rest works like any other page. You can mix and match AJAX URLs and regular URLs, and you can use only one site map for everything.

You could write your site map by hand, but there are tools that will save you a lot of time. The key is to format the site map well and submit it to Google Webmaster Tools.
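
For example, a rough sketch of generating the crawlable entries from the same list of IDs the server already knows about could look like this (our real site map is produced by SiteMapServlet.java):

// Build one <url> entry per crawlable hash-bang page.
function sitemapEntry(id) {
   return '<url>\n' +
          '   <loc>http://www.spiffyui.org/#!' + id + '</loc>\n' +
          '   <lastmod>' + new Date().toISOString().slice(0, 10) + '</lastmod>\n' +
          '   <changefreq>daily</changefreq>\n' +
          '   <priority>0.8</priority>\n' +
          '</url>';
}

var ids = ['css', 'rest', 'security'];
var sitemap = '<?xml version="1.0" encoding="UTF-8"?>\n' +
              '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
              ids.map(sitemapEntry).join('\n') + '\n' +
              '</urlset>';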

Google Webmaster Tools

Google Webmaster Tools14 gives you the chance to tell Google about your website. Log in with your Google ID, or create a new account, and then verify your website.

[Screenshot: verifying a website in Google Webmaster Tools15]

Once you’ve verified your website, you can submit your site map, and Google will start indexing your URLs.

And then you wait. This part is maddening. It took about two weeks for SpiffyUI.org to show up properly in Google Search. I posted to the help forums half a dozen times, thinking it was broken.

There’s no easy way to make sure everything is working, but there are a few tools to help you see what’s going on. The best one is Fetch as Googlebot16, which shows you exactly what Google sees when it crawls your website. You can access it in your dashboard in Google Webmaster Tools under “Diagnostics.”

[Screenshot: Fetch as Googlebot in the Webmaster Tools dashboard17]

Enter a hash bang URL from your website, and click “Fetch.” Google will tell you whether the fetch has succeeded and, if it has, will show you the content it sees.

[Screenshot: the content returned by a successful Fetch as Googlebot18]

If Fetch as Googlebot works as expected, then you’re returning the escaped URLs correctly. But you should check a few more things:

  • Validate your site map19.
  • Manually try the URLs in your site map. Make sure to try the hash-bang and escaped versions.
  • Check the Google result for your website by searching for site:www.yoursite.com.

Making Pretty URLs With HTML5

Twitter leaves the hash bang visible in its URLs, like this:

http://twitter.com/#!/ZackGrossbart

This works well for AJAX crawling, but again, it’s slightly ugly. You can make your URLs prettier by integrating HTML5 history20.

Spiffy UI uses HTML5 history integration to turn a hash-bang URL like this…

http://www.spiffyui.org#!css

… into a pretty URL like this:

http://www.spiffyui.org?css

In HTML4, the hash tag is the only part of the URL that you can change without forcing a reload; if you change anything else, the entire page reloads. HTML5 history removes that restriction: it changes the entire URL without refreshing the page, so we can make the URL look any way we want.

This nicer URL works in our application, but we still list the hash-bang version on our site map. And when browsers access the hash-bang URL, we change it to the nicer one with a little JavaScript.
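
That little bit of JavaScript might look something like the following sketch; Spiffy UI actually handles it through History.js, so treat the details as illustrative:

function prettifyUrl() {
   // Older browsers without HTML5 history keep the hash-bang URL.
   if (!(window.history && window.history.replaceState)) {
      return;
   }

   var match = window.location.hash.match(/^#!(.+)/);
   if (match) {
      // Turn http://www.spiffyui.org#!css into http://www.spiffyui.org/?css
      // without reloading the page.
      window.history.replaceState(null, document.title,
         window.location.pathname + '?' + match[1]);
   }
}

prettifyUrl();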

Cloaking

Earlier, I mentioned cloaking. It is the practice of trying to boost a website’s ranking in search results by showing one set of pages to Google and another to regular browsers. Google doesn’t like cloaking and may remove offending websites from its search index21.

AJAX-crawling applications always show different results to Google than to regular browsers, but it isn’t cloaking if the HTML snippets contain the same content that the user would see in the browser. The real mystery is how Google can tell whether a website is cloaking or not; crawlers can’t compare content programmatically because they don’t run JavaScript. It’s all part of Google’s Googley power.

Regardless of how it’s detected, cloaking is a bad idea. You might not get caught, but if you do, you’ll be removed from the search index.

Hash Bang Is A Little Ugly, But It Works

I’m an engineer, and my first response to this scheme is “Yuck!” It just feels wrong; we’re warping the purpose of URLs and relying on magic strings. But I understand where Google is coming from; the problem is extremely difficult. Search engines need to get useful information from inherently untrustworthy sources: us.

Hash bangs shouldn’t replace every URL on the Web. Some websites have had serious problems22 with hash-bang URLs because they rely on JavaScript to serve content. Simple pages don’t need hash bangs, but AJAX pages do. The URLs do look a bit ugly, but you can fix that with HTML5.

Further Reading

We’ve covered a lot in this article. Supporting AJAX crawling means that you need to change your client’s code and your server’s code. Here are some links to find out more:

  • Google’s AJAX crawling specification23
  • The HTML5 history specification24
  • Sitemaps.org25
  • Google Webmaster Tools26
  • The Spiffy UI source code27

Thanks to Kristen Riley for help with some of the images in this article.

Footnotes

  1. http://code.google.com/web/ajaxcrawling
  2. http://en.wikipedia.org/wiki/Cloaking
  3. http://code.google.com/webtoolkit/
  4. http://www.spiffyui.org
  5. http://en.wikipedia.org/wiki/Hashtag#Hashtags
  6. https://github.com/balupton/history.js
  7. http://balupton.com/
  8. http://coding.smashingmagazine.com/wp-content/uploads/2011/07/CrawlerServerDiagram3.png
  9. http://en.wikipedia.org/wiki/Java_Servlet
  10. http://coding.smashingmagazine.com/wp-content/uploads/2011/07/css_page_normal.png
  11. http://coding.smashingmagazine.com/wp-content/uploads/2011/07/css_page_escaped1.png
  12. http://spiffyui.googlecode.com/svn/trunk/spiffyui-app/src/main/java/org/spiffyui/spsample/server/SiteMapServlet.java
  13. http://www.spiffyui.org/sitemap.xml
  14. https://www.google.com/webmasters/tools
  15. http://coding.smashingmagazine.com/wp-content/uploads/2011/07/google_wmt_verification.png
  16. http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=158587
  17. http://coding.smashingmagazine.com/wp-content/uploads/2011/07/googlebot_fetch.png
  18. http://coding.smashingmagazine.com/wp-content/uploads/2011/07/googlebot_results.png
  19. http://www.validome.org/google/validate
  20. http://www.w3.org/TR/html5/history.html
  21. http://www.google.com/support/webmasters/bin/answer.py?answer=66355
  22. http://www.webmonkey.com/2011/02/gawker-learns-the-hard-way-why-hash-bang-urls-are-evil/
  23. http://code.google.com/web/ajaxcrawling/
  24. http://www.w3.org/TR/html5/history.html
  25. http://www.sitemaps.org/
  26. https://www.google.com/webmasters/tools
  27. https://code.google.com/p/spiffyui/source/checkout

Zack Grossbart is an engineer, designer, and author. He's a founding member of the Spiffy UI project, the architect of the WordPress Editorial Calendar, and a Consulting Engineer with NetIQ. Zack began loading DOS from a floppy disk when he was five years old. He first worked professionally with computers when he was 15 and started his first software company when he was 16. Zack lives in Cambridge, Massachusetts with his wife and daughter.
