Watching my logs, I’ve been getting random requests from Googlebot for atom.xml and index.rdf files on this site and others. It’s always in the root or in relevant subdirectories (usually /blog or similar). All of these sites run WordPress, and I can promise there is no mention of or links to atom.xml or index.rdf anywhere. This means Googlebot is guessing that these files will be there. Now I’ve come to expect random flailing for syndication files from Feedster and Kinja, but Google? Et tu, Googlebot?
I suspect this is a hint of something new coming, perhaps feed-aware search like Feedster or RSS links in search results like Yahoo. Maybe a Google aggregator? Google BlogNews? I want answers! They’ve got some room above the search box since their redesign; maybe the next item there will be a “blog” tab. (Of course, since their redesign they aren’t real tabs anymore, a regression in my opinion. I think tabs are a very effective navigation metaphor and worked well for Google.)
Anyone have any clues, ideas, or notice something similar in their logs?
Update: As always, we’re a few days ahead of the curve here at photomatt.net. Dave Winer has noticed the hits on his server and is covering the issue today.
Update: I’m late to the game, but Evan Williams confirms in part what I was suspecting and also jabs at the conspiracy theorists.
Is it more likely that this is not a calculated move, but that they are experimenting with crawling feeds in general and that, if they’re going to index them, they probably want as many as possible? And that maybe (hmmm…) they started with Blogger blogs first, since they were handy, and they tended to find feeds at index.rdf and atom.xml, and they haven’t yet optimized their crawler because they’ve been working on other stuff?
Hint: Try this on your log files (with the dots escaped so they match literally):
egrep 'atom\.xml.*Googlebot' log-file-name.log
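To chase both guessed filenames at once, here’s a self-contained sketch. The log lines below are invented for illustration (IPs, dates, and the exact combined-log layout are made up, and real server log formats vary), but the pattern is the point: alternation catches either file, and the escaped dots match literal periods rather than any character.

```shell
# Fabricated sample log lines -- purely illustrative, not real traffic.
printf '%s\n' \
  '66.249.1.1 - - [20/Apr/2004:18:59:49 -0700] "GET /atom.xml HTTP/1.0" 404 - "-" "Googlebot/2.1"' \
  '10.0.0.2 - - [20/Apr/2004:19:01:03 -0700] "GET /index.html HTTP/1.0" 200 - "-" "Mozilla/4.0"' \
  '66.249.1.1 - - [21/Apr/2004:02:12:10 -0700] "GET /index.rdf HTTP/1.0" 404 - "-" "Googlebot/2.1"' \
  > sample.log

# Match Googlebot requests for either guessed feed filename in one pass.
egrep '(atom\.xml|index\.rdf).*Googlebot' sample.log
```

Run against a real access log, this prints every line where Googlebot asked for one of the two phantom feeds.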
They aren’t cooking much yet, but they’ve been indexing OPML, RDF, RSS, and XML files for quite a while now. They show up in normal searches:
http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=related:www.phaedo.cx/includes/opml/clints-feeds.opml
http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=site%3Aphaedo.cx+rdf&btnG=Search
http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=site%3Aphaedo.cx+rss&btnG=Search
http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=related:www.phaedo.cx/atom.xml
P.S. They’ve been indexing my atom.xml since July 18, 2003. About 9 months now.
The issue isn’t that they’re indexing non-HTML files, which they’ve done for a long time; it’s that they’re requesting files that don’t exist and aren’t linked anywhere, files which happen to be named according to the popular conventions of many blog packages. Since your atom.xml file is linked, I would expect them to index it.
Ah, I missed that part of your post 🙂 Ignore the previous replies 🙂
The correct French would be: “et toi, GoogleBot ?”
🙂
It’s actually an allusion to the Latin phrase uttered by Julius Caesar. More info here.
I noticed the exact same thing. They have been looking for mine for a few months now. I just installed new software with non-standard feed links and the bots still managed to find the feeds, even with no links to them.
I’ve been getting a couple of hits for index.rdf every other day (roughly) from GoogleBot. (GoogleBrutus?) My first request for atom.xml came yesterday. [20/Apr/2004:18:59:49 -0700]
FWIW, I don’t have an index.rdf file either, but I have a liberal RedirectMatch rule that converts all manner of syndication-looking toplevel URI requests into requests for my actual feed, so Google has apparently been getting live feed data from me for quite some time. But, as you point out, the important thing is the URIs it requests are totally invented!
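A liberal rule along those lines might look like the following in Apache config — the pattern and the target path are hypothetical stand-ins, not this commenter’s actual setup:

```apache
# Funnel the commonly guessed top-level feed names to the real feed.
# "/feed.xml" is a placeholder for wherever your actual feed lives.
RedirectMatch 301 ^/(atom|index|rss)\.(xml|rdf)$ /feed.xml
```

The upside is that guessing bots get live feed data instead of 404s; the downside is your logs no longer show you which phantom names they tried.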
Another sighting here — I don’t have, nor have I ever had, any reference to or instance of index.rdf or index.xml. My visitor logs have shown repeated attempts by Googlebot to request those files in the past few weeks.
Google owns Blogger. Which now only supports Atom feeds. Maybe there is a connection there?
Do any autodiscovery functions look for these files? You could be seeing Google crawling links that are coming from log files that have logged autodiscovery stats. Kind of the way my public stats keep resulting in google crawling for pages that show up in my “broken links” log… just a thought.
I believe you can aid Googlebot in autodiscovery by putting a link to the feed file in your head.
<link rel="alternate" type="application/rss+xml" title="RSS feed" href="index.xml" />
Mark Pilgrim has more on this: http://diveintomark.org/archives/2002/06/02/important_change_to_the_link_tag
Tim, there are several links to feeds in my <head>, to four different syndication formats. None of them are named what Googlebot is asking for.
I’ve been getting requests for these and a couple of other files quite regularly in my logs.
Remember all that fuss a few years back when Internet Explorer started clogging up error logs by requesting a file called favicon.ico?
People were out in the streets burning Microsoft flags; people ACTIVELY tried to harm Microsoft by bouncing requests for the file back to Microsoft’s servers.
How come Google (and others) are getting away with it? Why aren’t we burning Google flags?
I get googled almost every day (have no idea why), and they have not searched for atom, or rdf, but they do always spider my RSS file.
here’s my log:
Host: xxx.xxx.xxx.xxx
Url: /rss.xml Http Code : 200
Date: Apr 22 06:54:22
Http Version: HTTP/1.0 Size in Bytes: 12073
Referer: – Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
Not sure what, if anything this shows, but there you go.
As a member of the homebrew-weblogging-software club, I’ve not had the files googlebot, kinjabot, or feedster have been looking for either. (Well, not by those names.) The big question is, now that I’ve added a 301 Moved Permanently response to point them to the real files, will they do the right thing and stop requesting them?
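For anyone wanting to do the same under Apache, a minimal .htaccess sketch follows — the target filenames are placeholders, not this commenter’s actual files:

```apache
# Permanently redirect the guessed names to the feeds that actually exist.
Redirect permanent /atom.xml /my-atom-feed.xml
Redirect permanent /index.rdf /my-rss-feed.xml
```

A well-behaved crawler should update its records when it sees a 301; whether it ever stops guessing the old names is another matter.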
I get crawled by Google often enough, but I’m not receiving any such requests. Commentators should note that Evan Williams is also just guessing (though he has more chance of being right than most 🙂)
Googlebot must be toying with me, because I got half a dozen more requests like this today.
I’m running Geeklog on my site. I noticed Google crawling a new staticpage I made within five minutes of creating it. Even messages like “you have been logged out,” “logged in,” “message sent,” etc. were being crawled when I looked at my last-visitors log.
I’m sure I didn’t make any links to the staticpage in question. The only link there is can be found on the stats page, which Google didn’t crawl before crawling the staticpage.
After this I tried switching off the AdSense block, and with it off I didn’t get crawled after creating another staticpage.
Looks like Google crawls every referrer it gets via AdSense that it doesn’t already know.
Hope this helps, Falconey
Maybe this use of RSS feeds allows Google’s bots to crawl irregularly updated pages more efficiently.
When crawling the entire web, this sort of prompted crawling could mean big savings in computer time and server space.
Just Checking
Nice, I learnt something about Julius Caesar while reading about GoogleBot. Thanks Matt.
Google bought Blogger, is crawling blogs, and is launching Gmail (which scans each email you send for keywords) for the same reason: they want to know what people are talking about. By doing this, they know what subjects are currently of interest to people all over the world. By knowing what keywords are of interest to people, they should be able to predict the tide of public interest faster than anyone else. That should give them insight into the psyche of a lot of consumers. That has to be worth something.
I’m seeing Googlebot requests for atom.xml and index.rdf, too, despite the fact that I don’t use those filenames. Googlebot *is* following the 301 redirects to the real files. Googlebot is *not* requesting those files from my non-blogging domain names, so it must be pre-filtering for blogs before guessing filenames (my blog CMS is homebrew, FWIW). Incidentally, I’m seeing the exact same behavior from AskJeeves, but nobody else has mentioned that. I’m guessing nobody cares about AskJeeves?
I use a free blog at BlogEasy for keeping track of interesting articles and such. They have a great interface for checking out your site statistics and what search bots are hitting your site. I’ve noticed that the GoogleBot has been hitting the site a lot. More than usual and it can’t just be the front page. It’s probably looking for atom.xml and index.rdf as well. I wonder if it can detect that your site is a blog site maybe based on a keyword or domain name. Once it knows you are a blog site it’ll start looking for blog-related syndication feeds.
I’ve noticed this activity on my site as well. What is Google up to?
I ended up using ModRewrite to direct google to the correct files. Now they should be able to find them :).
Alex
http://www.tivoblog.com
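A mod_rewrite version of that might look something like this — the /feed.xml target is a generic stand-in, not necessarily what tivoblog.com actually uses:

```apache
RewriteEngine On
# Send requests for the guessed feed names to the real feed,
# using a 301 so well-behaved bots update their records.
RewriteRule ^(atom\.xml|index\.rdf)$ /feed.xml [R=301,L]
```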
I was sad to learn that Evan Williams is leaving Google/Blogger; I hope the IPO was good to him.
Falconey, Google spiders your new pages very quickly when you have an Adsense block on them because that’s how Adsense knows what adverts to put on the page. If it didn’t spider them you’d either get a blank block or the charity adverts (depending on the setup of your Adsense code.)
However, I have noticed that a few months ago they started using the words in the URL to supply information for the Adsense adverts if they can’t spider the actual page, because I started getting relatively related adverts on my test server, which is on localhost and not spiderable.