Rich web apps and search engines like Google - indexing problems and tricks

A Vaadin app (and any other GWT-based app) is visible to a search engine like Google as a single HTML file with a title. All content is loaded and managed dynamically via JS, so your public webapp data (if you expose, say, a catalog built with Vaadin) will not be indexed; only the page title will be.

What tricks do you use (if any) to make indexing better?

As a static measure, it is possible to tweak the UI's starting HTML page (generated by a servlet) and add some keywords / meta information to it.

One solution we tried in our company is to create an extra JSP/servlet page, say catalog.jsp, and link to it from the main website and from the Vaadin starter page. This JSP simply lists all items from our catalog with as much information as possible, in a table, like a plain old HTML page. Each item links into the Vaadin app with a URL like “http://app.com/vaadinapp/#id”. This seems to be indexed well and works - catalog items are searchable in Google, and the search result links direct the user to the Vaadin app.
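Roughly, the listing page boils down to something like the servlet below (a JSP does the same job). The hard-coded rows are just placeholders for the real catalog data, so treat this as a sketch rather than our actual code:

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CatalogListingServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {

        resp.setContentType("text/html; charset=UTF-8");
        PrintWriter out = resp.getWriter();

        out.println("<html><head><title>Catalog</title></head><body><table>");

        // In a real application these rows come from the catalog database;
        // {id, title, description} triples are hard-coded here to keep the sketch short.
        String[][] items = {
                {"1001", "Item one", "Short description with the important keywords"},
                {"1002", "Item two", "Another description worth indexing"}
        };

        for (String[] item : items) {
            // Plain old HTML row: title and attributes as text, plus a link into the
            // Vaadin app using a URI fragment so the rich UI opens the right item.
            out.println("<tr><td><a href=\"http://app.com/vaadinapp/#" + item[0] + "\">"
                    + item[1] + "</a></td><td>" + item[2] + "</td></tr>");
        }

        out.println("</table></body></html>");
    }
}
```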

I'm wondering if there are other tricks and methods to achieve similar results? It would be interesting to discuss.

Best,
Dmitri

While researching the topic further, I found that Apache Wicket might be a good companion product to Vaadin for cases where you need to make your application indexable. Wicket also maintains a server-side component model like Vaadin, which keeps the development process similar, but it generates normal HTML for every page (yes, the page gets reloaded on each interaction such as a button or link click, unless you separately apply AJAX communication yourself), and this is exactly what search engines expect - the Wicket app's HTML page source is normal HTML with all the content data inside.

So I think the idea of building a composite webapp, where the public and indexable part (forums, wiki, web 2.0 content) is done as a Wicket sub-app and all other parts (control panels, private areas / user cabinets, etc.) are done with Vaadin, might be a good solution. What do you think?

I'm still trying to find other ways to make a Vaadin (GWT) application friendlier to search engines, however.

It is trickier than that. If your page is made up of several components which can change independently, and if you have many ways to get to the same page, etc., you can end up with a total mess. You have no control over what will be retrieved.

We ended up creating a crawler and giving it specific rules about what to crawl and where not to go. The crawler created a shadow site, and the shadow site was what got indexed.

A shadow site is OK, but how did you solve the hyperlink navigation problem?
I mean that in the case of a shadow site, your crawler builds up a plain HTML collection of web pages with static data, hyperlinked to each other. This gets indexed and becomes searchable in Google, for instance. Then, when the user clicks a search result in Google, they are navigated to the shadow site, not to the real one.

Exactly. The crawler would also know which parts of the site were inherently dynamic, and the built HTML page would keep references to the dynamic component.

But this means that when I (as a user) find something in, say, Google and click the result link, I'll be navigated to a static web page of the shadow site. How do you redirect the user to the same content on the actual website (the rich app)? By requiring them to click a separate link like “tell me more…”?

We're currently migrating our child company's online standards catalog to a Vaadin rich app. This catalog contains more than 100K records. For now, we have a pageable plain HTML catalog listing, where each row contains the title and the most important attributes for indexing. Links from each row point to the actual Vaadin application, to the particular standard's description page. This works fine, but not as elegantly as we'd like - sometimes you find the standard in Google, but the link from the search results brings the user to a particular page of the shadow listing, where the searched standard appears along with the other 10-15-20 “items per page”, so the user has to find it again and click.

What we are also experimenting with right now is embedding the shadow site into the actual rich application. There, a servlet filter, for instance, detects the search engine's user agent and bypasses the Vaadin application, switching to a plain old HTML copy of the same application. That copy contains minimal (if any) design but retains the same layout (tree navigation menus, etc.) and similar URIs, and lets the crawler walk around. When a real user visits the same URL, they are switched to the Vaadin application, and because the URIs are identical, the proper section or item description page appears.
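As a rough sketch, such a filter could look like the one below. The "/plain" target path and the list of crawler user agents are assumptions for illustration, not a finished implementation:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

public class CrawlerSwitchFilter implements Filter {

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {

        HttpServletRequest request = (HttpServletRequest) req;
        String userAgent = request.getHeader("User-Agent");

        if (userAgent != null && isCrawler(userAgent.toLowerCase())) {
            // Bypass the Vaadin servlet: forward the same path to a plain-HTML
            // rendering of that section (a servlet/JSP that prints the content only).
            request.getRequestDispatcher("/plain" + request.getServletPath())
                   .forward(req, res);
            return;
        }

        // Real users get the rich Vaadin application as usual.
        chain.doFilter(req, res);
    }

    private boolean isCrawler(String userAgent) {
        // Incomplete list, just for illustration.
        return userAgent.contains("googlebot")
            || userAgent.contains("msnbot")
            || userAgent.contains("slurp");
    }

    public void init(FilterConfig config) { }
    public void destroy() { }
}
```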

You got it – most of the links in the crawled pages point to the actual application. The crawled site is there for the indexing, and so we can get control over which links are used to get into the actual application.

Hi,

Just quickly dumping some links for the record - you guys have seen these, right?


http://googlewebtoolkit.blogspot.com/2009/10/making-ajax-crawlable.html

http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html
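As far as I understand the proposal in the second link, a “#!state” URL gets re-requested by the crawler with an “_escaped_fragment_” query parameter, and the server is expected to answer with a plain HTML snapshot of that state. A minimal server-side sketch (the “/snapshot” renderer is purely hypothetical) might look like this:

```java
import java.io.IOException;
import java.net.URLEncoder;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

public class EscapedFragmentFilter implements Filter {

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {

        HttpServletRequest request = (HttpServletRequest) req;

        // For a URL like http://app.com/vaadinapp/#!item=123 the crawler asks for
        // http://app.com/vaadinapp/?_escaped_fragment_=item=123 instead.
        String fragment = request.getParameter("_escaped_fragment_");

        if (fragment != null) {
            // Answer with a static HTML snapshot of that application state.
            // "/snapshot" is a placeholder for whatever renders the plain page.
            request.getRequestDispatcher(
                    "/snapshot?state=" + URLEncoder.encode(fragment, "UTF-8"))
                   .forward(req, res);
            return;
        }

        chain.doFilter(req, res); // browsers get the rich application as usual
    }

    public void init(FilterConfig config) { }
    public void destroy() { }
}
```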

Here’s to hoping we’ll get a standard someday…

Best Regards,
Marc

Yep, things should change soon, but I'm afraid not tomorrow, so it looks like nothing but shadow sites can help us at the moment :slight_smile:

You should always create HTML pages for search engines whenever you're developing rich internet applications.

HTML pages help your website's pages get indexed in Google and other search engines, while rich pages help improve your website's user experience.

Hope this will help you.

If you'd like to know more about this, I recommend reading this blog post:
How To Optimize Rich Internet Applications For Search Engines

Cheers!!!

This book chapter is also interesting:

http://www.infoq.com/resource/articles/progwt/en/resources/progwt.pdf

Thanks for the link! It seems to be an interesting read.

This is possibly the biggest downside of the Vaadin framework.
Take any website (aside from something running on a firm's intranet) and you will quickly realize that search engines such as Google deliver most of the visitors. If Google can't crawl the website and all it sees is your title, then even the richest website full of content will appear as nothing to the crawler, and we all know that nothing generates nothing.

There should be a way to export the textual content of the page as a readable dump, either in a hidden field or something similar, so Google can see what's on the page.