Posted by: Marshall Sponder | April 11, 2008

Google crawler starts filling out forms

Kinda interesting that Google’s crawlers are now crawling through HTML forms and filling them out instead of just stopping there, as they have up till now. The details are in the Google Webmaster Central Blog and in a Search Engine Land post, Google Now Fills Out Forms & Crawls Results:

In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
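To make that description a bit more concrete, here is a rough Python sketch of how a crawler could turn a form into a handful of candidate URLs. This is only my illustration, not Google's code: the names `FormParser` and `candidate_urls`, the `limit` parameter, and the sample URL are all made up, and it assumes a simple GET form whose action has no query string of its own.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collects a form's action and the candidate values for each input (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.fields = {}     # field name -> list of values, or None for a text box
        self._select = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "form":
            self.action = a.get("action") or ""
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self.fields.setdefault(self._select, [])
        elif tag == "option" and self._select and a.get("value"):
            self.fields[self._select].append(a["value"])
        elif tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            if kind in ("radio", "checkbox") and a.get("value"):
                self.fields.setdefault(a["name"], []).append(a["value"])
            elif kind == "text":
                self.fields[a["name"]] = None  # filled with site words later

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def candidate_urls(page_url, html, site_words, limit=5):
    """Build at most `limit` GET URLs, one per combination of input values."""
    parser = FormParser()
    parser.feed(html)
    names = list(parser.fields)
    # Text boxes have no values in the HTML, so fall back to words from the site.
    choices = [list(site_words) if parser.fields[n] is None else parser.fields[n]
               for n in names]
    base = urljoin(page_url, parser.action)
    urls = []
    for combo in product(*choices):
        urls.append(base + "?" + urlencode(list(zip(names, combo))))
        if len(urls) >= limit:  # "a small number of queries"
            break
    return urls


if __name__ == "__main__":
    sample = ('<FORM action="/search" method="get">'
              '<input type="text" name="q">'
              '<select name="cat"><option value="books">Books</option>'
              '<option value="music">Music</option></select></FORM>')
    for url in candidate_urls("http://example.com/find", sample, ["crawler", "index"], limit=3):
        print(url)
```

Each generated URL would then be fetched and checked for whether the resulting page is valid and contains anything new, which is the part the sketch leaves out.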

The post hints at what Google’s crawlers might enter in a form, but I wonder what they will actually enter. In most cases, without a valid account and password, the crawler won’t get far.

I wonder if it would make sense to create a dummy account on your site, tell Google about it, and let it log in and crawl the protected pages – it certainly seems as if it should be an option.

Search Engine Land points out that other companies have been crawling content hidden behind forms for years, though Google is the first major search engine to do so:

The move is potentially good for searchers, in that it will open up material often referred to being part of the “deep web” or “invisible web” as it was hidden behind forms. Search Engine Land executive editor Chris Sherman actually co-authored a book on the topic. He and fellow author Gary Price didn’t coin the term invisible web but they certainly help popularize it.

It should be noted that Google’s not the first to do something like this. Companies like Quigo, BrightPlanet and WhizBang Labs were doing this type of work years ago. But it never translated over to the major search engines. Now chapter two of surfacing deep web material is opening, this time with a major search player — in that, Google is being a pioneer.
