Watching my logs, I’ve been getting random requests from Googlebot for atom.xml and index.rdf files on this site and others. It’s always in the root or in relevant subdirectories (usually /blog or similar). All of these sites run WordPress, and I can promise there is no mention of or links to atom.xml or index.rdf anywhere. This means Googlebot is guessing that these files will be there. Now I’ve come to expect random flailing for syndication files from Feedster and Kinja, but Google? Et tu, Googlebot?
I suspect this is a hint of something new coming, perhaps feed-aware search like Feedster or RSS links in search results like Yahoo. Maybe a Google aggregator? Google BlogNews? I want answers! They’ve got some room above the search box since their redesign; maybe the next item there will be a “blog” tab. (Of course, since their redesign they aren’t real tabs anymore, a regression in my opinion. I think tabs are a very effective navigation metaphor and worked well for Google.)
Anyone have any clues, ideas, or notice something similar in their logs?
Update: As always, we’re a few days ahead of the curve here at photomatt.net. Dave Winer has noticed the hits on his server and is covering the issue today.
Update: I’m late to the game, but Evan Williams confirms in part what I was suspecting and also jabs at the conspiracy theorists.
Is it more likely that this is not a calculated move, but that they are experimenting with crawling feeds in general and that, if they’re going to index them, they probably want as many as possible? And that maybe (hmmm…) they started with Blogger blogs first, since they were handy, and they tended to find feeds at index.rdf and atom.xml, and they haven’t yet optimized their crawler because they’ve been working on other stuff?
Hint: Try this on your log files (with the dots escaped so they match literally):
egrep 'atom\.xml.*Googlebot' log-file-name.log
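To chase both guessed filenames at once, here’s a self-contained sketch. The log lines below are invented for illustration (IPs, dates, and the exact combined-log layout are made up, and real server log formats vary), but the pattern is the point: alternation catches either file, and the escaped dots match literal periods rather than any character.

```shell
# Fabricated sample log lines -- purely illustrative, not real traffic.
printf '%s\n' \
  '66.249.1.1 - - [20/Apr/2004:18:59:49 -0700] "GET /atom.xml HTTP/1.0" 404 - "-" "Googlebot/2.1"' \
  '10.0.0.2 - - [20/Apr/2004:19:01:03 -0700] "GET /index.html HTTP/1.0" 200 - "-" "Mozilla/4.0"' \
  '66.249.1.1 - - [21/Apr/2004:02:12:10 -0700] "GET /index.rdf HTTP/1.0" 404 - "-" "Googlebot/2.1"' \
  > sample.log

# Match Googlebot requests for either guessed feed filename in one pass.
egrep '(atom\.xml|index\.rdf).*Googlebot' sample.log
```

Run against a real access log, this prints every line where Googlebot asked for one of the two phantom feeds.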
They aren’t cooking much yet, but they’ve been indexing OPML, RDF, RSS, and XML files for quite a while now. They show up in normal searches:
http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=related:www.phaedo.cx/includes/opml/clints-feeds.opml
http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=site%3Aphaedo.cx+rdf&btnG=Search
http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=site%3Aphaedo.cx+rss&btnG=Search
http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=related:www.phaedo.cx/atom.xml
P.S. They’ve been indexing my atom.xml since July 18, 2003. About 9 months now.
The issue isn’t that they’re indexing non-HTML files, which they’ve done for a long time; it’s that they’re requesting files that don’t exist and aren’t linked anywhere, files which happen to be named according to the popular conventions of many blog packages. Since your atom.xml file is linked, I would expect them to index it.
Ah, I missed that part of your post 🙂 Ignore the previous replies 🙂
The correct French would be: “et toi, GoogleBot ?”
🙂
It’s actually an allusion to the Latin phrase uttered by Julius Caesar. More info here.
I noticed the exact same thing. They have been looking for mine for a few months now. I just installed new software with non-standard feed links and the bots still managed to find the feeds, even with no links to them.
I’ve been getting a couple of hits for index.rdf every other day (roughly) from GoogleBot. (GoogleBrutus?) My first request for atom.xml came yesterday. [20/Apr/2004:18:59:49 -0700]
FWIW, I don’t have an index.rdf file either, but I have a liberal RedirectMatch rule that converts all manner of syndication-looking toplevel URI requests into requests for my actual feed, so Google has apparently been getting live feed data from me for quite some time. But, as you point out, the important thing is the URIs it requests are totally invented!
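A liberal rule along those lines might look like the following in Apache config — the pattern and the target path are hypothetical stand-ins, not this commenter’s actual setup:

```apache
# Funnel the commonly guessed top-level feed names to the real feed.
# "/feed.xml" is a placeholder for wherever your actual feed lives.
RedirectMatch 301 ^/(atom|index|rss)\.(xml|rdf)$ /feed.xml
```

The upside is that guessing bots get live feed data instead of 404s; the downside is your logs no longer show you which phantom names they tried.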
Another sighting here — I don’t have, nor have I ever had, any reference to or instance of index.rdf or index.xml. My visitor logs have shown repeated attempts by Googlebot to request those files in the past few weeks.
Google owns Blogger. Which now only supports Atom feeds. Maybe there is a connection there?
Do any autodiscovery functions look for these files? You could be seeing Google crawling links that are coming from log files that have logged autodiscovery stats. Kind of the way my public stats keep resulting in google crawling for pages that show up in my “broken links” log… just a thought.
I believe you can aid Googlebot in autodiscovery by putting a link to the feed file in your head.
<link rel="alternate" type="application/rss+xml" title="RSS feed" href="index.xml" />
Mark Pilgrim has more on this: http://diveintomark.org/archives/2002/06/02/important_change_to_the_link_tag
Tim, there are several links to feeds in my <head>, to four different syndication formats. None of them are named what Googlebot is asking for.
I’ve been getting requests for these and a couple of other files quite regularly in my logs.
Remember all that fuss a few years back when Internet Explorer started clogging up error logs by requesting a file called favicon.ico?
People were out in the streets burning Microsoft flags; people ACTIVELY tried to harm Microsoft by bouncing requests for the file back to Microsoft’s servers.
How come Google (and others) are getting away with it? Why aren’t we burning Google flags?
I get googled almost every day (have no idea why), and they have not searched for atom, or rdf, but they do always spider my RSS file.
here’s my log:
Host: xxx.xxx.xxx.xxx
Url: /rss.xml Http Code : 200
Date: Apr 22 06:54:22
Http Version: HTTP/1.0 Size in Bytes: 12073
Referer: – Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
Not sure what, if anything this shows, but there you go.
As a member of the homebrew-weblogging-software club, I’ve not had the files googlebot, kinjabot, or feedster have been looking for either. (Well, not by those names.) The big question is, now that I’ve added a 301 Moved Permanently response to point them to the real files, will they do the right thing and stop requesting them?
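For anyone wanting to do the same under Apache, a minimal .htaccess sketch follows — the target filenames are placeholders, not this commenter’s actual files:

```apache
# Permanently redirect the guessed names to the feeds that actually exist.
Redirect permanent /atom.xml /my-atom-feed.xml
Redirect permanent /index.rdf /my-rss-feed.xml
```

A well-behaved crawler should update its records when it sees a 301; whether it ever stops guessing the old names is another matter.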
I get crawled by Google often enough, but I’m not receiving any such requests. Commentators should note that Evan Williams is also just guessing (though he has more chance of being right than most 🙂)
Googlebot must be toying with me, because I got half a dozen more requests like this today.
I’m running Geeklog on my site. I noticed Google crawling a new staticpage I made within five minutes of creating it. Even messages like “you have been logged out,” “logged in,” “message sent,” etc. were being crawled when I looked at my last-visitors log.
I’m sure I didn’t make any links to the staticpage in question. The only link there is can be found on the stats page, which Google didn’t crawl before crawling the staticpage.
After this I tried switching off the AdSense block, and with it off I didn’t get crawled after creating another staticpage.
Looks like Google crawls every referrer it gets via AdSense that it doesn’t already know.
Hope this helps, Falconey
Maybe this use of RSS feeds allows Google’s bots to crawl irregularly updated pages more efficiently.
When crawling the entire web, this sort of prompted crawling could mean big savings in computer time and server space.
Just Checking
Nice, I learnt something about Julius Caesar while reading about GoogleBot. Thanks Matt.
Google bought Blogger, is crawling blogs, and is launching Gmail (which scans each email you send for keywords) for the same reason: they want to know what people are talking about. By doing this, they know what subjects are currently of interest to people all over the world. By knowing what keywords are of interest to people, they should be able to predict the tide of public interest faster than anyone else. That should give them insight into the psyche of a lot of consumers. That has to be worth something.
I’m seeing Googlebot requests for atom.xml and index.rdf, too, despite the fact that I don’t use those filenames. Googlebot *is* following the 301 redirects to the real files. Googlebot is *not* requesting those files from my non-blogging domain names, so it must be pre-filtering for blogs before guessing filenames (my blog CMS is homebrew, FWIW). Incidentally, I’m seeing the exact same behavior from AskJeeves, but nobody else has mentioned that. I’m guessing nobody cares about AskJeeves?
I use a free blog at BlogEasy for keeping track of interesting articles and such. They have a great interface for checking out your site statistics and what search bots are hitting your site. I’ve noticed that the GoogleBot has been hitting the site a lot. More than usual and it can’t just be the front page. It’s probably looking for atom.xml and index.rdf as well. I wonder if it can detect that your site is a blog site maybe based on a keyword or domain name. Once it knows you are a blog site it’ll start looking for blog-related syndication feeds.
I’ve noticed this activity on my site as well. What is Google up to?
I ended up using ModRewrite to direct google to the correct files. Now they should be able to find them :).
Alex
http://www.tivoblog.com
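A mod_rewrite version of that might look something like this — the /feed.xml target is a generic stand-in, not necessarily what tivoblog.com actually uses:

```apache
RewriteEngine On
# Send requests for the guessed feed names to the real feed,
# using a 301 so well-behaved bots update their records.
RewriteRule ^(atom\.xml|index\.rdf)$ /feed.xml [R=301,L]
```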
I was sad to learn that Evan Williams is leaving Google/Blogger; I hope the IPO was good to him.
Falconey, Google spiders your new pages very quickly when you have an Adsense block on them because that’s how Adsense knows what adverts to put on the page. If it didn’t spider them you’d either get a blank block or the charity adverts (depending on the setup of your Adsense code.)
However, I have noticed that a few months ago they started using the words in the URL to supply information for the Adsense adverts if they can’t spider the actual page, because I started getting relatively related adverts on my test server, which is on localhost and not spiderable.