Saturday, April 19, 2008

Google Starts to Index the Invisible Web


The Google Webmaster Central Blog recently announced that Google has started to index web pages hidden behind web forms. "In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page." For now, only a small number of websites are affected by this change, and Google will only fill in forms that use GET to submit data and don't require personal information.
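To get a feel for what this means in practice, here's a minimal sketch in Python of the idea Google describes: enumerate values for a GET form's inputs and turn each combination into a crawlable URL. The form, the helper names, and the limit of ten queries are illustrative assumptions, not Google's actual crawler code.

    from itertools import product
    from urllib.parse import urlencode, urljoin

    def candidate_urls(base_url, action, inputs, limit=10):
        # `inputs` maps each field name to the values we are willing to try:
        # option values for select menus, check boxes, and radio buttons,
        # or words chosen from the site's own text for text boxes.
        names = list(inputs)
        urls = []
        for combo in product(*(inputs[n] for n in names)):
            if len(urls) >= limit:  # do only a small number of queries
                break
            query = urlencode(dict(zip(names, combo)))
            urls.append(urljoin(base_url, action) + "?" + query)
        return urls

    # A hypothetical search form on a library site.
    form_inputs = {
        "q": ["astronomy", "telescopes"],  # words picked from the page
        "type": ["book", "journal"],       # values of a select menu
    }
    for url in candidate_urls("http://example.org/", "/search", form_inputs):
        print(url)

Each generated URL (for example, http://example.org/search?q=astronomy&type=book) can then be fetched like any other page and kept only if the result is valid and new to the index.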

Many web pages are difficult to find because they're not indexed by search engines; they're only available if you know where to search and what to use as a query. Together, these pages make up the Invisible Web, which was estimated to include 550 billion documents in 2001. "Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines can not see or retrieve content in the deep Web -- those pages do not exist until they are created dynamically as the result of a specific search."

Anand Rajaraman points out that the new feature is related to a low-profile Google acquisition from 2005.

Between 1995 and 2005, Web search had become the dominant mechanism for finding information. Search engines, however, had a blind spot: the data behind HTML forms. (...) The key problems in indexing the Invisible Web are:

1. Determining which web forms are worth penetrating.
2. If we decide to crawl behind a form, how do we fill in values in the form to get at the data behind it? In the case of fields with checkboxes, radio buttons, and drop-down menus, the solution is fairly straightforward. In the case of free-text inputs, the problem is quite challenging: we need to understand the semantics of the input box to guess possible valid inputs.

Transformic's technology addressed both problems (1) and (2). It was always clear to us that Google would be a great home for Transformic, and in 2005 Google acquired Transformic. (...) The Transformic team has been working hard for the past two years perfecting the technology and integrating it into the Google crawler.
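The free-text problem Rajaraman mentions is the interesting one. A naive way to approximate "choosing words from the site that has the form" is to pick frequent content words from the hosting page, as in the rough Python sketch below; the stop-word list and scoring are assumptions for illustration, not Transformic's semantics-aware approach.

    import re
    from collections import Counter

    STOP_WORDS = {"the", "and", "for", "with", "that", "this", "from", "have"}

    def guess_query_terms(page_text, k=5):
        # Take the k most frequent words of four letters or more that are
        # not stop words; a crude stand-in for understanding the input box.
        words = re.findall(r"[a-z]{4,}", page_text.lower())
        counts = Counter(w for w in words if w not in STOP_WORDS)
        return [word for word, _ in counts.most_common(k)]

    sample = "Used cars for sale: sedans, trucks and hybrid cars in Detroit."
    print(guess_query_terms(sample))  # ['cars', 'used', 'sale', 'sedans', 'trucks']

Terms guessed this way can be fed into the text boxes of the form, with the select, checkbox, and radio values enumerated as in the earlier sketch.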

It's not clear which high-quality sites Google uses for the new feature, but this list includes some good options. Along with Google Book Search, Google Scholar, and Google News Archive, this is yet another way to bring valuable information to light.