What is Google Cooking?

Googlebot is up to something, but what could it be? This could be big.

Watching my logs, I’ve been getting random requests from Googlebot for atom.xml and index.rdf files on this site and others. It’s always in the root or in relevant subdirectories (usually /blog or similar). All of these sites run WordPress, and I can promise there is no mention of or links to atom.xml or index.rdf anywhere. This means Googlebot is guessing that these files will be there. Now I’ve come to expect random flailing for syndication files from Feedster and Kinja, but Google? Et tu, Googlebot?

I suspect this is a hint of something new coming, perhaps feed-aware search like Feedster or RSS links in search results like Yahoo. Maybe a Google aggregator? Google BlogNews? I want answers! They’ve got some room above the search box since their redesign, maybe the next item there will be a “blog” tab. (Of course since their redesign they aren’t real tabs anymore, a regression in my opinion. I think tabs are a very effective navigation metaphor and worked well for Google.)

Anyone have any clues, ideas, or notice something similar in their logs?

Update: As always, we’re a few days ahead of the curve here at photomatt.net. Dave Winer has noticed the hits on his server and is covering the issue today.

Update: I’m late to the game, but Evan Williams confirms in part what I was suspecting and also jabs at the conspiracy theorists.

Is it more likely that this is not a calculated move, but that they are experimenting with crawling feeds in general and that, if they’re going to index them, they probably want as many as possible? And that maybe (hmmm…) they started with Blogger blogs first, since they were handy, and they tended to find feeds at index.rdf and atom.xml, and they haven’t yet optimized their crawler because they’ve been working on other stuff?

39 replies on “What is Google Cooking?”

The issue isn’t that they’re indexing non-HTML files, which they’ve done for a long time, it’s that they’re requesting files that don’t exist and aren’t linked anywhere, files which happen to be named according to the popular conventions of many blog packages. Since your atom.xml file is linked, I would expect them to index it.

I noticed the exact same thing. They have been looking for mine for a few months now. I just installed new software with non-standard feed links, and the bots still managed to find the feeds – even with no links to them.

I’ve been getting a couple of hits for index.rdf every other day (roughly) from GoogleBot. (GoogleBrutus?) My first request for atom.xml came yesterday. [20/Apr/2004:18:59:49 -0700]

FWIW, I don’t have an index.rdf file either, but I have a liberal RedirectMatch rule that converts all manner of syndication-looking toplevel URI requests into requests for my actual feed, so Google has apparently been getting live feed data from me for quite some time. But, as you point out, the important thing is the URIs it requests are totally invented!
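A catch-all rule like the one described might look something like this in an Apache .htaccess file (a sketch only; the guessed filenames and the destination feed path here are assumptions, not the commenter’s actual configuration):

```apache
# Hypothetical RedirectMatch rule: send requests for commonly guessed
# toplevel syndication filenames to the site's actual feed.
RedirectMatch 301 ^/(atom\.xml|index\.rdf|index\.xml|rss\.xml)$ /feed/rss2.xml
```

With a rule like this, every invented URI Googlebot tries still ends up serving live feed data, which matches what the commenter reports.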

Another sighting here — I don’t have, nor have I ever had, any reference to or instance of index.rdf or index.xml. My visitor logs have shown repeated attempts by Googlebot to request those files in the past weeks.

Do any autodiscovery functions look for these files? You could be seeing Google crawling links that are coming from log files that have logged autodiscovery stats. Kind of the way my public stats keep resulting in google crawling for pages that show up in my “broken links” log… just a thought.

Tim, there are several links to feeds in my <head>, to four different syndication formats. None of them are named what Googlebot is asking for.
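Feed autodiscovery links of this sort sit in the page’s &lt;head&gt;; a minimal sketch covering four formats (the titles and URLs are illustrative assumptions, not Matt’s actual markup):

```html
<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/feed/" />
<link rel="alternate" type="application/rdf+xml" title="RSS 1.0" href="/feed/rdf/" />
<link rel="alternate" type="application/atom+xml" title="Atom" href="/feed/atom/" />
<link rel="alternate" type="text/xml" title="RSS 0.92" href="/feed/rss/" />
```

A crawler that honors autodiscovery can find the feeds from these links rather than guessing filenames.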

I’ve been getting requests for these and a couple of other files quite regularly in my logs.

Remember all that fuss a few years back when Internet Explorer started clogging up error logs by requesting a file called favicon.ico?

People were out in the streets burning Microsoft flags – people ACTIVELY tried to harm Microsoft by bouncing requests for the file back to Microsoft’s servers.

How come Google (and others) are getting away with it? Why aren’t we burning Google flags?

I get googled almost every day (have no idea why), and they have not searched for atom, or rdf, but they do always spider my RSS file.

here’s my log:
Host: xxx.xxx.xxx.xxx
URL: /rss.xml
HTTP code: 200
Date: Apr 22 06:54:22
HTTP version: HTTP/1.0
Size in bytes: 12073
Referer: -
Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)

Not sure what, if anything this shows, but there you go.

As a member of the homebrew-weblogging-software club, I’ve not had the files googlebot, kinjabot, or feedster have been looking for either. (Well, not by those names.) The big question is, now that I’ve added a 301 Moved Permanently response to point them to the real files, will they do the right thing and stop requesting them?
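A 301 of that sort can be a pair of one-liners in Apache (a sketch; the guessed old names and the real feed locations here are assumptions about this homebrew setup):

```apache
# Hypothetical: permanently redirect the guessed filenames to the real feeds
Redirect 301 /atom.xml /feeds/atom.xml
Redirect 301 /index.rdf /feeds/index.rdf
```

In principle, a well-behaved crawler that sees 301 Moved Permanently should update its records to the new URI and stop requesting the old one.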

I get crawled by Google often enough, but I’m not receiving any such requests. Commenters should note that Evan Williams is also just guessing (though he has more chance of being right than most 🙂

Google Is Looking for Blogs
The rumor mill is bubbling again on the other side of the pond.
Some bloggers have discovered in their statistics that Googlebot is searching for atom.xml and index.rdf files. Among other things, one possible scenario is that Google is getting into blog search à la Feedster/Day…

I’m running Geeklog on my site. I noticed Google crawling a new staticpage I made within five minutes of its creation. Even messages like “you have been logged out,” “logged in,” “message sent,” etc. were being crawled when I looked at my last-visitors log.

I’m sure I didn’t make any links to the staticpage in question. The only link to it is on the stats page, which Google didn’t crawl before crawling the staticpage.

After this I tried switching off the AdSense block, and after that I didn’t get crawled when I created another staticpage.

It looks like Google crawls every referrer it gets through AdSense that it doesn’t already know.

Hope this helps, Falconey

Maybe this use of RSS feeds allows Google’s bots to crawl irregularly updated pages more efficiently.

When crawling the entire web this sort of prompted crawling could mean big savings in computer time and server space.

Google Indexing Syndication Feeds
Google is looking for syndication feeds on weblogs. The news comes from Dirson in “¿Trabaja Google en un buscador de blogs?” (“Is Google working on a blog search engine?”). «Since April 13, in the access logs of the ‘google.dirson.com’ web server there have been…

Google crawling for RSS feeds
Matt from Photo Matt has noted Googlebot crawling for RSS and Atom feeds on his site, using feed file names commonly found on Blogger


Google bought Blogger, is crawling blogs, and is launching Gmail (which searches each email you send for keywords) for the same reason. They want to know what people are talking about. By doing this, they know what subjects are currently of interest to people all over the world. By knowing what keywords are of interest to people, they should be able to predict the tide of public interest faster than anyone else. That should give them insight into the psyche of a lot of consumers. That has to be worth something.

I’m seeing Googlebot requests for atom.xml and index.rdf, too, even though I don’t use those filenames. Googlebot *is* following the 301 redirects to the real files. Googlebot is *not* requesting those files from my non-blogging domain names, so it must be pre-filtering for blogs before guessing filenames (my blog CMS is homebrew, FWIW). Incidentally, I’m seeing the exact same behavior from AskJeeves, but nobody else has mentioned that. I’m guessing nobody cares about AskJeeves?

Incidentally, I’m also seeing AskJeeves’s bot flailing.

I use a free blog at BlogEasy for keeping track of interesting articles and such. They have a great interface for checking out your site statistics and what search bots are hitting your site. I’ve noticed that the GoogleBot has been hitting the site a lot more than usual, and it can’t just be the front page. It’s probably looking for atom.xml and index.rdf as well. I wonder if it can detect that your site is a blog, maybe based on a keyword or domain name. Once it knows you are a blog site, it’ll start looking for blog-related syndication feeds.

Help GoogleBot Index Your RSS Feed
As noted by PhotoMatt, Googlebot has been randomly requesting atom.xml, index.rdf and rss.xml from many webservers in the last few months.

Not only does this suggest that Google is actively indexing RSS news feeds because it values them as a good sou…

Semantic Web

Paul Ford uses a fictional story from five years in the future to explain the semantic web: August 2009: How Google Beat Amazon and Ebay to the Semantic Web.

For more info about the Semantic Web concept:

W3C Semantic Web
SemanticWeb.Org

Falconey, Google spiders your new pages very quickly when you have an AdSense block on them because that’s how AdSense knows what adverts to put on the page. If it didn’t spider them, you’d either get a blank block or the charity adverts (depending on the setup of your AdSense code.)

However, I have noticed that a few months ago they started using the words in the URL to supply information for the AdSense adverts if they can’t spider the actual page, because I started getting fairly relevant adverts on my test server, which is on localhost and not spiderable.

Comments are closed.