Apr 20

What is Google Cooking?


Watching my logs, I’ve been getting random requests from Googlebot for atom.xml and index.rdf files on this site and others. The requests are always for the root or for relevant subdirectories (usually /blog or similar). All of these sites run WordPress, and I can promise there are no mentions of or links to atom.xml or index.rdf anywhere. This means Googlebot is guessing that these files will be there. Now I’ve come to expect random flailing for syndication files from Feedster and Kinja, but Google? Et tu, Googlebot?

I suspect this is a hint of something new coming, perhaps feed-aware search like Feedster or RSS links in search results like Yahoo. Maybe a Google aggregator? Google BlogNews? I want answers! They’ve got some room above the search box since their redesign, so maybe the next item there will be a “blog” tab. (Of course, since their redesign they aren’t real tabs anymore, a regression in my opinion. I think tabs are a very effective navigation metaphor and worked well for Google.)

Anyone have any clues, ideas, or notice something similar in their logs?
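If you want to check your own logs, here is a minimal Python sketch. It assumes Apache "combined" log format, which may not match your server, and the function name `guessed_feed_hits` is made up for this example. It keeps only requests whose user agent mentions Googlebot and whose filename is one of the guessed feed names.

```python
import re

# Assumption: Apache "combined" log format. Adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def guessed_feed_hits(lines, feed_names=("atom.xml", "index.rdf"), agent="Googlebot"):
    """Return (time, path, status) for each request by `agent` for a guessed feed file."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like combined-format entries
        user_agent = match.group("agent") or ""
        path = match.group("path").split("?", 1)[0]  # drop any query string
        filename = path.rsplit("/", 1)[-1]
        if agent.lower() in user_agent.lower() and filename in feed_names:
            hits.append((match.group("time"), path, int(match.group("status"))))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python feed_hits.py access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for time, path, status in guessed_feed_hits(log_file):
            print(time, path, status)
```

A 404 status next to a path you never published is the interesting case: it means the bot asked for a file that is not linked anywhere.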

Update: As always, we’re a few days ahead of the curve here at photomatt.net. Dave Winer has noticed the hits on his server and is covering the issue today.

Update: I’m late to the game, but Evan Williams confirms in part what I was suspecting and also jabs at the conspiracy theorists.

Is it more likely that this is not a calculated move, but that they are experimenting with crawling feeds in general and that, if they’re going to index them, they probably want as many as possible? And that maybe (hmmm…) they started with Blogger blogs first, since they were handy, and they tended to find feeds at index.rdf and atom.xml, and they haven’t yet optimized their crawler because they’ve been working on other stuff?


39 Comments

  • astopy » Google Requesting Non-existent Feeds July 12, 2004 @ 10:48 pm


  • Matt April 20, 2004 @ 11:08 pm

    Hint: Try this on your log files: egrep 'atom\.xml.*Googlebot' log-file-name.log

  • Clint Ecker April 20, 2004 @ 11:41 pm

    P.S. They’ve been indexing my atom.xml since July 18, 2003. About 9 months now.

  • Matt April 20, 2004 @ 11:43 pm

    The issue isn’t that they’re indexing non-HTML files, which they’ve done for a long time, it’s that they’re requesting files that don’t exist and aren’t linked anywhere, files which happen to be named according to the popular conventions of many blog packages. Since your atom.xml file is linked, I would expect they index it.

  • Clint Ecker April 20, 2004 @ 11:52 pm

    Ah, I missed that part of your post ;) Ignore the previous replies :D

  • François Briatte April 21, 2004 @ 2:17 am

    Correct French words would be : “et toi, GoogleBot ?”.
    :)

  • Matt April 21, 2004 @ 2:23 am

    It’s actually an allusion to the Latin phrase uttered by Julius Caesar. More info here.

  • cindy April 21, 2004 @ 6:50 am

    i noticed the exact same thing. they have been looking for mine for a few months now. i just installed new software with non-standard feed links and the bots still managed to find the feeds – even with no links to them.

  • dsandler April 21, 2004 @ 8:41 am

    I’ve been getting a couple of hits for index.rdf every other day (roughly) from GoogleBot. (GoogleBrutus?) My first request for atom.xml came yesterday. [20/Apr/2004:18:59:49 -0700]

    FWIW, I don’t have an index.rdf file either, but I have a liberal RedirectMatch rule that converts all manner of syndication-looking toplevel URI requests into requests for my actual feed, so Google has apparently been getting live feed data from me for quite some time. But, as you point out, the important thing is the URIs it requests are totally invented!

  • andrew April 21, 2004 @ 9:49 am

    Another sighting here: I don’t have, nor have I ever had, any reference to or instance of index.rdf or index.xml. My visitor logs have shown repeated attempts by Googlebot to request those files over the past few weeks.

  • brian April 21, 2004 @ 5:58 pm

    Google owns Blogger. Which now only supports Atom feeds. Maybe there is a connection there?

  • Chris L April 21, 2004 @ 8:43 pm

    Do any autodiscovery functions look for these files? You could be seeing Google crawling links that are coming from log files that have logged autodiscovery stats. Kind of the way my public stats keep resulting in google crawling for pages that show up in my “broken links” log… just a thought.

  • Tim April 21, 2004 @ 9:08 pm

    I believe you can aid the googlebot in auto discovery by putting a link to the feed file in your head.

    <link rel="alternate" type="application/rss+xml" title="RSS feed" href="index.xml" />

    Mark Pilgrim has more on this: http://diveintomark.org/archives/2002/06/02/important_change_to_the_link_tag

  • Matt April 21, 2004 @ 9:15 pm

    Tim, there are several links to feeds in my <head>, to four different syndication formats. None of them are named what Googlebot is asking for.

  • Richard@Home April 22, 2004 @ 2:58 am

    I’ve been getting requests for these and a couple of other files quite regularly in my logs.

    Remember all that fuss a few years back when Internet Explorer started clogging up error logs by requesting a file called favicon.ico?

    People were out in the streets burning Microsoft flags, and people ACTIVELY tried to harm Microsoft by bouncing requests for the file back to Microsoft servers.

    How come Google (and others) are getting away with it? Why aren’t we burning Google flags?

  • Pat Rock April 22, 2004 @ 12:16 pm

    I get googled almost every day (have no idea why), and they have not searched for atom, or rdf, but they do always spider my RSS file.

    here’s my log:
    Host: xxx.xxx.xxx.xxx
    Url: /rss.xml Http Code : 200
    Date: Apr 22 06:54:22
    Http Version: HTTP/1.0 Size in Bytes: 12073
    Referer: - Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)

    Not sure what, if anything, this shows, but there you go.

  • Pete Prodoehl April 22, 2004 @ 12:56 pm

    As a member of the homebrew-weblogging-software club, I’ve not had the files googlebot, kinjabot, or feedster have been looking for either. (Well, not by those names.) The big question is, now that I’ve added a 301 Moved Permanently response to point them to the real files, will they do the right thing and stop requesting them?

  • Chris L April 23, 2004 @ 12:24 am

    I get crawled by google often enough, but I’m not receiving any such requests. Commentators should note that Evan Williams is also just guessing (though he has more chance of being right than most :)

  • Matt April 23, 2004 @ 8:05 am

    Googlebot must be toying with me, because I got half a dozen more requests like this today.

  • Falconey April 23, 2004 @ 11:36 am

    I’m running Geeklog on my site. I noticed Google crawling a new staticpage I made within 5 minutes of its creation. Even messages like "you have been logged out", "logged in", "message sent", etc. were being crawled when I looked at my last-visitors log.

    I’m sure I didn’t make any links to that staticpage. The only link to it is on the stats page, which Google didn’t crawl before crawling the staticpage.

    After this I tried switching off the AdSense block, and the next staticpage I created did not get crawled.

    It looks like Google crawls every referrer it gets through AdSense that it doesn’t already know.

    Hope this helps, Falconey

  • Matt Prescott April 24, 2004 @ 12:37 pm

    Maybe this use of rss feeds allows Google’s bots to crawl irregularly updated pages more efficiently.

    When crawling the entire web this sort of prompted crawling could mean big savings in computer time and server space.

  • Pingback: Google Dirson

  • Pingback: Google Blog

  • Pingback: Linotipo

  • Pingback: iChris.ws

  • william May 1, 2004 @ 5:59 pm

    Just Checking

  • Francois Briatte May 3, 2004 @ 9:53 am

    Nice, I learnt something about Julius Caesar while reading about GoogleBot. Thanks Matt.

  • Gary Nelson May 5, 2004 @ 10:05 pm

    Google bought Blogger, is crawling blogs, and is launching Gmail (which searches each email you send for keywords) for the same reason. They want to know what people are talking about. By doing this, they know what subjects are currently of interest to people all over the world. By knowing what keywords are of interest to people, they should be able to predict the tide of public interest faster than anyone else. That should give them insight into the psyche of a lot of consumers. That has to be worth something.

  • Michael Bauser June 16, 2004 @ 12:13 am

    I’m seeing Googlebot requests for atom.xml and index.rdf, too, despite the fact that I don’t use those filenames. Googlebot *is* following the 301 redirects to the real files. Googlebot is *not* requesting those files from my non-blogging domain names, so it must be pre-filtering for blogs before guessing filenames (my blog CMS is homebrew, FWIW). Incidentally, I’m seeing the exact same behavior from AskJeeves, but nobody else has mentioned that. I’m guessing nobody cares about AskJeeves?

    Incidentally, I’m also seeing AskJeeves’s bot flailing

  • Blog Writer July 10, 2004 @ 12:21 am

    I use a free blog at BlogEasy for keeping track of interesting articles and such. They have a great interface for checking out your site statistics and what search bots are hitting your site. I’ve noticed that the GoogleBot has been hitting the site a lot. More than usual and it can’t just be the front page. It’s probably looking for atom.xml and index.rdf as well. I wonder if it can detect that your site is a blog site maybe based on a keyword or domain name. Once it knows you are a blog site it’ll start looking for blog-related syndication feeds.

  • Pingback: Rambling Thoughts

  • Pingback: geek ramblings

  • Alex Raiano August 9, 2004 @ 7:31 am

    I’ve noticed this activity on my site as well. What is Google up to?

  • Alex Raiano August 30, 2004 @ 1:43 pm

    I ended up using mod_rewrite to direct Google to the correct files. Now they should be able to find them :)

    Alex
    http://www.tivoblog.com

  • graywolf October 6, 2004 @ 2:52 pm

    I was sad to learn that Evan Williams is leaving Google/Blogger. I hope the IPO was good to him.

  • Paul Silver June 17, 2005 @ 4:34 am

    Falconey, Google spiders your new pages very quickly when you have an AdSense block on them because that’s how AdSense knows what adverts to put on the page. If it didn’t spider them you’d get either a blank block or the charity adverts (depending on the setup of your AdSense code).

    However, I noticed that a few months ago they started using the words in the URL to choose the AdSense adverts when they can’t spider the actual page, because I started getting reasonably relevant adverts on my test server, which is on localhost and not spiderable.
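
Several commenters above (dsandler, Pete Prodoehl, Alex Raiano) describe redirecting the guessed feed URLs to their real feeds. The mapping they describe can be sketched in Python as follows. This is an illustration of the idea only, not anyone's actual server configuration: the function name `redirect_guessed_feed` and the real feed location `feed/` are assumptions made up for this example.

```python
import re

# Filenames the bots were reported to guess in this thread.
GUESSED_FEED = re.compile(
    r"^(?P<prefix>/(?:[^/]+/)*)(?:atom\.xml|index\.rdf|index\.xml)$"
)


def redirect_guessed_feed(path, real_feed="feed/"):
    """Return (301, new_path) for a guessed feed URL, or None for any other path.

    `real_feed` is a hypothetical location relative to the directory the bot
    asked in; substitute wherever your feed actually lives.
    """
    match = GUESSED_FEED.match(path)
    if match is None:
        return None
    return 301, match.group("prefix") + real_feed
```

For example, `redirect_guessed_feed("/blog/index.rdf")` returns `(301, "/blog/feed/")`, while a normal page such as `/about.html` returns `None` and is served as usual. A permanent (301) redirect is the choice Pete Prodoehl mentions, and Michael Bauser reports that Googlebot does follow it.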
