URI handling, Google bot, SEO book example

Bart1 · April 7, 2011, 11:09pm

Hey guys,

I just recently realized that since virtually the entire Vaadin application UI is rendered client-side by JavaScript, I need a workaround for search engine indexing.

Apparently the Google crawler is now able to crawl AJAX applications, although certain URI conditions have to be met first (see Google Crawler - AJAX). Google seems to be the only search engine doing this for now, but knowing how quickly the internet moves, this should spread to other engines soon enough.

Summary:
Basically, what I get from the Google FAQ is that:
a) First we have to convert our “www.example.com/ajax.html#example” to “www.example.com/ajax.html#!example”. This seems simple enough once you set up URI fragment handling.

b) Second, and this part really confuses me, we have to provide an HTML snapshot of our page.
Whenever the Google bot sees a “#!” URI fragment, it converts it to “?_escaped_fragment_=”. Whenever this address is requested, our application should return an HTML page with flat content that can be indexed by the bot.

c) Finally, our server has to make sure that a request URL of the form “www.example.com/ajax.html?_escaped_fragment_=key=value” is mapped back to its original form: “www.example.com/ajax.html#!key=value”.
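
The mapping in (a) to (c) can be sketched in plain Java, independent of Vaadin. This is only an illustration under assumptions: the class and method names are hypothetical, and the escaping covers a small set of reserved characters (space, “#”, “%”, “&”, “+”, and non-ASCII bytes), which may not match the crawler's rules exactly.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: converts between "#!" URLs and "_escaped_fragment_" URLs.
class HashBangUrls {
    static final String PARAM = "_escaped_fragment_";

    // Percent-escape a small set of reserved characters (an assumption,
    // not a complete implementation of the crawler's escaping rules).
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c <= 0x20 || c >= 0x7F || c == '#' || c == '%' || c == '&' || c == '+') {
                sb.append(String.format("%%%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // "page#!state" -> "page?_escaped_fragment_=state" (the form a crawler requests)
    static String toCrawlerUrl(String prettyUrl) {
        int i = prettyUrl.indexOf("#!");
        if (i < 0) {
            return prettyUrl; // no hash-bang fragment, nothing to convert
        }
        String base = prettyUrl.substring(0, i);
        String state = prettyUrl.substring(i + 2);
        String sep = base.contains("?") ? "&" : "?";
        return base + sep + PARAM + "=" + escape(state);
    }

    // "page?_escaped_fragment_=state" -> "page#!state" (the form users see)
    static String toPrettyUrl(String crawlerUrl) {
        String marker = PARAM + "=";
        int i = crawlerUrl.indexOf(marker);
        if (i < 1) {
            return crawlerUrl; // parameter not present
        }
        String base = crawlerUrl.substring(0, i - 1); // drop the '?' or '&'
        String state = URLDecoder.decode(
                crawlerUrl.substring(i + marker.length()), StandardCharsets.UTF_8);
        return base + "#!" + state;
    }

    public static void main(String[] args) {
        // prints www.example.com/ajax.html?_escaped_fragment_=key=value
        System.out.println(toCrawlerUrl("www.example.com/ajax.html#!key=value"));
        // prints www.example.com/ajax.html#!key=value
        System.out.println(toPrettyUrl("www.example.com/ajax.html?_escaped_fragment_=key=value"));
    }
}
```

In practice the server never sees the “#!” form, since browsers and crawlers do not send fragments; the reverse mapping only tells the application which state the snapshot should represent.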

=============================================================

It seems there is already a book example on how to implement this; however, I have a hard time following it.

SEO Example

        // Set the URI Fragment when menu selection changes
        menu.addListener(new Property.ValueChangeListener() {
            public void valueChange(ValueChangeEvent event) {
                String itemid = (String) event.getProperty().getValue();
                
                // Set the fragment with the exclamation mark, which is
                // understood by the indexing engine
                urifu.setFragment("!" + itemid);
            }
        });

This part I understand: the menu has a listener that fires whenever someone selects an item. It sets the URI fragment to the item ID prefixed with “!”, so that the URL ends in “#!” followed by the item ID.

        // When the URI fragment is given, use it to set menu selection
        urifu.addListener(new FragmentChangedListener() {
            public void fragmentChanged(FragmentChangedEvent source) {
                String fragment =
                          source.getUriFragmentUtility().getFragment();
                if (fragment != null) {
                    // Skip the exclamation mark
                    if (fragment.startsWith("!"))
                        fragment = fragment.substring(1);
                    
                    // Set the menu selection
                    menu.select(fragment);
                    
                    // Display some content related to the item
                    main.addComponent(new Label(getContent(fragment)));
                }
            }
        });

This is a URI fragment listener that selects the matching menu item whenever the fragment changes. This is still just the regular Vaadin URI fragment handling.

        // Store possible parameters here
        main.addParameterHandler(new ParameterHandler() {
            public void handleParameters(Map<String, String[]> parameters) {
                // If the special escape parameter is included, store it
                if (parameters.containsKey("_escaped_fragment_"))
                    fragment = parameters.get("_escaped_fragment_")[0];
                else
                    fragment = null;
            }
        });

Here we attach a parameter handler to the main Window that catches the parameters passed to our application on the server. If Google sees a URL with a “#!” fragment, it will instead request the same URL with the “?_escaped_fragment_=” parameter. Is this correct?


        // Handle the parameters here
        main.addURIHandler(new URIHandler() {
            public DownloadStream handleURI(URL context, String relativeUri) {
                if (fragment != null) {
                    // Got the fragment earlier, provide some HTML content
                    // for the indexing engine
                    String content = getContent(fragment);
                    ByteArrayInputStream istream = new ByteArrayInputStream(
                        ("<html><body><p>" + content +
                        "</p></body></html>").getBytes());
                    return new DownloadStream(istream, "text/html", null);
                }
                return null;
            }
        });

Finally, the last part: handling of “?_escaped_fragment_”. It seems that anything after “?_escaped_fragment_=” is stored in the fragment variable, so that if our address ends in “?_escaped_fragment_=earth”, our fragment is “earth”. So here we manually build an HTML page whose content is just a textual representation of whatever the fragment refers to.

================================================

Questions:

1. How does the Google bot know what’s located on our website? Does it follow the links on its own? In the book example, we have our main page with a selectable list. If we provide the bot with the main link, i.e. “http://magi.virtuallypreinstalled.com/book-examples/indexing”, how will it see that “http://magi.virtuallypreinstalled.com/book-examples/indexing#!mars” or “http://magi.virtuallypreinstalled.com/book-examples/indexing#!earth” are even possible links, so that it can send the “?_escaped_fragment_” request? Are we supposed to specify all the possible links within the HTML snapshot of our page?
2. What happens with the HTML snapshot pages we create? Should we store them? In the book example, it seems like we are just overwriting the same istream whenever the parameter request is made.
3. Do we need to do anything on our server? In the Google FAQ it says that we should map “?_escaped_fragment_” back to “#!” (summary, part c), but I’m not sure what they mean by this.

Thanks, and sorry for such a convoluted post. Allowing an application to be indexed is a big stepping stone for me.

Marko1 · April 8, 2011, 8:24am

Hi Bart!

Google bot follows links, so if it sees one on this forum, for example, it will try to index it. So, if your application has a “main view”, the non-Ajax page should include all links to all the subviews, or at least they should be accessible recursively through such links. That’s how I think it should work anyhow.
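
As a rough sketch of that idea, the snapshot for a main view could simply list a “#!” link to every subview. The class and method names below are hypothetical, the application URL is a placeholder, and item IDs are assumed to be URL-safe strings such as “mars”.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: renders a flat HTML page that links to every subview.
class SnapshotLinks {
    // Minimal HTML escaping for text and attribute values.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds the snapshot of the main view: one "#!" link per item ID.
    static String renderMainView(String appUrl, List<String> itemIds) {
        StringBuilder html = new StringBuilder("<html><body><ul>");
        for (String id : itemIds) {
            html.append("<li><a href=\"")
                .append(escapeHtml(appUrl)).append("#!").append(escapeHtml(id))
                .append("\">")
                .append(escapeHtml(id))
                .append("</a></li>");
        }
        return html.append("</ul></body></html>").toString();
    }

    public static void main(String[] args) {
        // Placeholder URL and item IDs, for illustration only.
        System.out.println(renderMainView("http://example.com/app",
                Arrays.asList("mars", "earth")));
    }
}
```

A crawler that fetches this page can then discover each “#!” link and request its “?_escaped_fragment_=” counterpart in turn.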

The example application doesn’t do that.

In the example, the indexing pages are generated dynamically from the data model. I don’t see any reasonable purpose for storing them, unless the load caused by the indexing gets really heavy, which sounds unlikely, unless the pages are really heavy to generate.

I think the FAQ wording is just an unclear way of saying that your application should return the same content for the escaped URL as it would for the #! URL (but as HTML).

Anyhow, the example seems to work. You may try googling for “little content for mars” (use the quotes). It has probably found the link from this forum.

If you’re a Pro Account subscriber, please see article #252, “How can I make my application indexed by search engines?”, for some more details.

Bart1 · April 10, 2011, 4:46am

Thanks for the reply!

What exactly constitutes a link in Vaadin? For example, let’s say I have a page on which I place a table with a lot of items, and whenever the user selects an item I want to open a page with a particular URI. Obviously, I am going to create the URI fragments dynamically, so in other words the link doesn’t exist until an item is selected. How would that work with the HTML snapshots? Should I create a snapshot and place all the possible links within it (i.e. loop over all the items in the table, generate the URI fragment for each one, and include links for all of them in the HTML snapshot of the main page)?

Back to the previous question: if we have to generate all the links for a given page for each HTML snapshot, would it not be more efficient to, say, store the most recent snapshot and reuse it until the next item is added?
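
The store-and-reuse idea could look roughly like the sketch below. It is not taken from the book example: the class is hypothetical, the item IDs are assumed to be plain URL-safe strings, and as noted earlier in the thread, caching may well be unnecessary unless the pages are expensive to generate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cache: rebuilds the snapshot only after the item list changes.
class CachedSnapshot {
    private final String appUrl;
    private final List<String> itemIds = new ArrayList<>();
    private String cachedHtml;   // null means the snapshot must be rebuilt
    private int buildCount = 0;  // counts rebuilds, for illustration only

    CachedSnapshot(String appUrl) {
        this.appUrl = appUrl;
    }

    // Adding an item invalidates the stored snapshot.
    synchronized void addItem(String itemId) {
        itemIds.add(itemId);
        cachedHtml = null;
    }

    // Returns the stored snapshot, regenerating it only when needed.
    synchronized String getHtml() {
        if (cachedHtml == null) {
            StringBuilder html = new StringBuilder("<html><body><ul>");
            for (String id : itemIds) {
                html.append("<li><a href=\"").append(appUrl).append("#!").append(id)
                    .append("\">").append(id).append("</a></li>");
            }
            cachedHtml = html.append("</ul></body></html>").toString();
            buildCount++;
        }
        return cachedHtml;
    }

    synchronized int getBuildCount() {
        return buildCount;
    }
}
```

Repeated calls to `getHtml()` between item additions return the same string without rebuilding it; the first call after `addItem(...)` regenerates the page.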

What do you mean by content? Our whole application UI is in JavaScript, so the content shown often depends on user selection. For example, if we have a page with just a selectable list on it, there might be no content until the user selects something, aside from the items in the list.

You are right, googling for “little content for mars” does work. However, if you search for “little content for earth”, for example, nothing shows up, even though it is one of the selectable items in the list for which we create the “?_escaped_fragment_” page. Is there a specific reason for that, or is the Google crawler at fault?