More Googlebot Flailing

Now I’m seeing the Googlebot request /about/ pages relative to known blogs that don’t have any links to any /about/ URI. The last time the Googlebot flailed around like this it was fun to watch for a little bit and wonder what they had cooking in the labs, but then it got annoying. I don’t know if there are rules of bot etiquette, but requesting imagined unlinked resources while spidering can’t be a best practice.

~~There is of course one blog vendor who consistently has about pages at /about/ URIs, and that’s Typepad.~~ Now my question for Google is: what should the rest of us do if we want our about pages indexed by this new system? Mine happens to be in the about subdirectory of my blog, but what about people who have about.html or about-me.php? Should I set up a permanent redirect on every blog I have, pointing to the real about page?
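For the about.html and about-me.php crowd, the redirect idea might look something like the sketch below. It is only an illustration: the `GUESSED_TO_REAL` map, the target filenames, and the `redirect_for` helper are all made up for the example, and on a real blog this would more likely live in the web server configuration than in application code.

```python
from typing import Dict, Optional, Tuple

# Hypothetical map from the URIs the bot guesses to where the pages really live.
GUESSED_TO_REAL: Dict[str, str] = {
    "/about/": "/about-me.php",
    "/contact/": "/contact.html",
}


def redirect_for(path: str,
                 redirects: Dict[str, str] = GUESSED_TO_REAL
                 ) -> Optional[Tuple[int, str]]:
    """Return (301, target) if the requested path is a guessed URI, else None."""
    # Treat /about and /about/ as the same request.
    key = path if path.endswith("/") else path + "/"
    target = redirects.get(key)
    if target is None:
        # No redirect applies; the server handles the request as usual.
        return None
    # 301 is the HTTP status code for a permanent redirect.
    return 301, target
```

With this map, a request for /about or /about/ would get a 301 to /about-me.php, while something like /stats/ falls through and gets whatever the server would normally send.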

(Note: That’s faux indignation. I don’t have any juicy conspiracy theories, and I’m not really that peeved; mostly I’m just curious what they’re up to. However, juicy conspiracy theories are welcome in the comments. [As long as they don’t make fun of me for noticing these things.])

UPDATE: It just requested a non-existent non-linked /contact/ URI.

UPDATE: It just requested a non-existent non-linked /stats/ URI.

DEVELOPING . . .

18 thoughts on “More Googlebot Flailing”

Rob Mientjes says:

July 9, 2004 at 1:49 am

Well, to kick in the obvious conspiracy theory: Maybe Google likes TypePad more than normal…? 😉

GoogleBot can be quite irritating. Having it around on your website 24/7 with about 10 instances of it is quite odd, but at least you can find my website by the oddest queries now 😀

Sushubh says:

July 9, 2004 at 2:41 am

my vote is that they are planning something biggie related to blogs. they search atom.xml and other similar stuff that is quite popular on other Blog tools and now typepad specific folders… gotta be something funky. Google lacks tools for blogs and rss feeds. something is definitely cooking 😉

Mark J says:

July 9, 2004 at 3:52 am

Google is like the Boy Scouts. Its motto is: be prepared. They’re information whores. They might not yet even know what they plan to do with all this blog-related data that is being indexed.

Randy Peterman says:

July 9, 2004 at 8:24 am

GoogleBot has been requesting non-existent RSS files for some time now. However, I changed my Apache redirect to send them to my WordPress .php files to handle all of the 404 errors being reported in my stats app.

Ryan Mack says:

July 9, 2004 at 8:48 am

“typepad specific folders”? I think that’s a bit of a stretch. /about/ and /contact/ are hardly specific to Typepad. They’re everywhere. And not much of an indication that Google’s up to something. Now, if over the next few days you start to see requests for other non-existent “about” and “contact” URIs like about.html, about.php, contact-us.html, etc. it’d be a bit safer to say that something’s going on.

waylman says:

July 9, 2004 at 9:54 am

While they may be annoying, maybe we should be looking at them as SEO clues. If that is the filename/dir where Google expects such info, is it possible that they’ll like it when we put it there and, who knows, maybe even give us a higher page rank? Not that I obsess about such things, but it could potentially make us easier to find.

It’s just a thought. But, then again, maybe I’m way off base?!

andrew says:

July 9, 2004 at 11:58 am

Ok. ‘About’, maybe. ‘Contact’, maybe. But ‘Stats’? That to me is completely unnecessary and invasive practice. There is no need whatsoever for Google to crawl people’s unprotected stats packages.

random says:

July 9, 2004 at 1:13 pm

andrew:

Stats pages normally contain lists of referrers and popular pages, and more, so it’s not like Google don’t get anything out of it. I think it’s quite a good idea.

If people don’t want something crawled, they should disallow access in robots.txt and/or password protect it. It’s not as though this is hard.

Ryan Mack says:

July 9, 2004 at 1:56 pm

I have to agree with Andrew on the /stats/ URI. /about/ and /contact/ are common URI’s for publicly available information, but /stats/ is not. That’s a bit invasive. Google expects webmasters to be “polite” by not serving up special content to googlebot. They too should be “polite” and not go poking where they haven’t been invited.

Matt says:

July 10, 2004 at 6:38 pm

There is of course one blog vendor who consistently has about pages at /about/ URIs, and that’s Typepad.

False. My Typepad sites all have the about section at ~/about.html, not ~/about/.

Matt says:

July 10, 2004 at 6:40 pm

Isn’t /about/ the default? If not, I stand corrected and I’ll happily remove that from the post. It’s not terribly relevant anyway.

Matt says:

July 10, 2004 at 6:56 pm

The other Matt is correct and I’ve updated the post (with correct markup and datetime attribute) to reflect this.

random says:

July 10, 2004 at 11:56 pm

It’s invasive to go to any unlinked page or directory, but visiting /stats/ isn’t more invasive than /about/, assuming neither are publicly available. It’s not hard to imagine a private (security through obscurity) directory with contact information. If I had one, I’d much rather it stay private than my essentially anonymous statistics.

Matt says:

July 11, 2004 at 9:17 am

Also remember that putting something as forbidden in your robots.txt file is the best way to get bad robots to visit it. Many robot honeypots work this way.

Pingback: Petroglyphs

Pingback: geek ramblings » Semantic Web

Jonathan Dingman says:

July 31, 2008 at 5:28 am

Yeah, I know this is an uber-old post, but whatever happened as the end result?

did you ever figure out what was causing googlebot to randomly crawl non-existent pages?

Matt says:

July 31, 2008 at 7:54 am

Nope.

Matt Mullenweg

Unlucky in Cards
