Category Archives: Google

Google, its products, and its role on the web.

Search Engine Markshowdown

I decided to run the web page analyzer (excellent tool) against the front pages of a few of the latest and greatest search engines and also do a little analysis of my own. Here are some of the results in one of the only tables you’ll ever see on this site:

            Feedster   Technorati   Google   Yahoo Search
HTML            6.11         3.72     1.18           7.82
Ext. CSS       11.47        11.63     0.00           1.45
Other           9.10         6.70    15.10           1.72
Total          26.70        22.05    16.27          11.00
Compressed        No           No      Yes             No

All numbers are kilobytes and may not add up exactly due to rounding. “Ext. CSS” is external, linked stylesheet files; “Other” includes images and JavaScript.

Yahoo was the surprise winner here. Their HTML is all right, but I think it could be reduced quite a bit without losing anything; you’ll note it’s the heaviest HTML of the bunch, heavier than sites that show quite a bit more on their front pages. They should probably talk to Doug. Overall, though, I think Yahoo has consistently been doing great, nearly-standards-compliant work in their new designs. Yahoo could save about 67% of their HTML size with compression. Interestingly, Yahoo was the only site to specify ISO-8859-1 encoding; all the others claimed UTF-8.

Google was optimized to the hilt, but it’s kind of silly that they put so much effort into their markup and couldn’t go the last inch to make it valid HTML 4. They could probably make it a bit smaller with more intelligent CSS usage; at least they don’t have font tags anymore. I think under normal circumstances they would have won, but they have an Olympic logo up right now that’s pretty heavy. Google was the only site that used gzip compression for its HTML, and even uncompressed it weighed in at only about 2.4 kilobytes, still the lightest of the group.

Technorati clearly had the smartest markup of the group, and was the only one that validated. (An impressive feat for any website in this day and age.) Their markup is clean as a whistle, with excellent structure and logic, and their numbers aren’t bad when you consider how much they have on their front page. This isn’t too surprising, since Tantek did it. Their CSS, however, is pretty heavy. It’s strange because it’s very optimized in some ways but bloated in others; I think they could cut a few K from it pretty easily. One smart thing they did is name the CSS file with the date, so it’s versioned by name and they can update it monthly without caching issues. All that said, they’re so far ahead of everything else they don’t need to worry much. Technorati could save about 53% of their XHTML size with compression.

Feedster has its heart in the right place, but the implementation falls far short. For example, it has an XHTML 1.1 doctype, but the needless XML declaration at the top throws IE into quirks mode. They use CSS in places, but they also have a table with 75 non-breaking spaces in it for positioning. There’s a ton of needless markup, including a full kilobyte of HTML comments. On the bright side, they have the most room to improve. Feedster could save about 61% of their XHTML size with compression.
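Since I keep noting how much each site would save with compression, here is a minimal sketch of turning it on server-side, assuming Apache 2 with mod_deflate (on Apache 1.3, mod_gzip plays the same role); the module path is whatever your build uses:

# Assumes mod_deflate is available (Apache 2.x).
LoadModule deflate_module modules/mod_deflate.so

<IfModule mod_deflate.c>
    # Compress markup, CSS, and plain text on the fly. Images are
    # already compressed, so they are left alone.
    AddOutputFilterByType DEFLATE text/html text/css text/plain

    # Netscape 4.x chokes on compressed CSS and JS, so send it gzipped
    # HTML only, and skip a few point releases that can't even do that.
    BrowserMatch ^Mozilla/4 gzip-only-text/html
    BrowserMatch ^Mozilla/4\.0[678] no-gzip

    # MSIE identifies itself as Mozilla/4 but handles gzip fine.
    BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
</IfModule>

Nothing breaks for older clients, since Apache only compresses when the browser advertises Accept-Encoding: gzip in the request.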

WordPress.org Search

I’ve ripped out the guts and redone the search on the WordPress.org support forums in the hopes of making it something more people will use. Try it out! The new system searches the wiki (hosted on a different machine), thread titles, recent posts, and does a FULLTEXT post search for the most relevant posts. It has contextual search highlighting (like Google).

When I have some time to get back to this, every section will have a “more of this” link to take you to more (paged) results. It does this currently with the wiki search, counting the total results and linking to the wiki search directly if there are more than 5 results. There are probably still a few bugs to work out. The fulltext query was taking over two seconds to run until I tweaked the JOIN type to get the MySQL optimizer to use the proper index and join order. Everything should validate as XHTML.
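For the curious, the relevance query looks roughly like this. This is a sketch only — the table and column names below are hypothetical, not the actual forum schema — and MATCH ... AGAINST requires a FULLTEXT index on the text column (MyISAM tables only at this point):

# Hypothetical schema: posts(post_id, topic_id, post_text) and
# topics(topic_id, topic_title), with a FULLTEXT index on post_text.
SELECT STRAIGHT_JOIN t.topic_id, t.topic_title,
       MATCH (p.post_text) AGAINST ('permalinks') AS score
FROM posts AS p
JOIN topics AS t ON t.topic_id = p.topic_id
WHERE MATCH (p.post_text) AGAINST ('permalinks')
ORDER BY score DESC
LIMIT 5;

The STRAIGHT_JOIN modifier is the kind of nudge I mean: it forces MySQL to read the tables in the order written, so the fulltext index on the posts is used first instead of whatever join order the optimizer happens to guess.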

A new system is also in place to inject custom results at the top of the page. We’ve been logging searches for the last few months (over 129,000 so far, about 43,000 unique searches) and I’m going to be working closely with the documentation team to identify which searches are most common and what tailored information would be best to present when people search for targeted terms, be it a blog post, an external resource, someplace on WordPress.org itself, a wiki page, or a specific thread. We can watch trends and spikes in searches to identify problems in the application itself, or features that may be insufficiently documented or hard to use.

The work is far from finished, but I think it’s a strong first step toward fully integrating search as a support mechanism and bringing the WordPress team even closer to the pulse of the users.

Humble Googling

It’s always humbling when you’re searching Google for an answer to something and you come across a page from your own site. It’s even worse when that page has the answer. I want Technorati or Feedster to show three types of hits: those from my own blog, those from other blogs, and those from Google.

Weeds in the Garden

Under the Iron has an old interview with Scott Johnson that is a good read. Now scroll down to the comments. Dozens and dozens of spam comments. I see this over and over again on MT and s9y sites. What’s terrible is these pages are just as dangerous as dedicated spam blogs. Think about it: I shouldn’t even be linking to it now.

Alex told me the other day about a new type of comment spam he’s been seeing: comments that link to normal blog entries on well-known blogs like Mozillazine. As advanced as tools like MT Blacklist have become, they’re pretty useless in cases like this. Are you going to blacklist Dave Sifry? Molly used to have spam comments on her site (molly.com) all the time, and even though she spent a lot of time and effort dealing with them (a daily chore), they only need to be there long enough for Googlebot to index them for the harm to be done. I’m not dogging on MT here; it’s just that there are tens of thousands of MT blogs out there without any protection, and the spammers are targeting them mercilessly. Domain blacklists don’t scale (spammers can easily have thousands of domains and hijack innocent ones), and centralized registration hasn’t been shown to be effective except against people who don’t like centralized registration, a group that doesn’t include spammers.

People used to say that WordPress doesn’t get spam comments because it’s not popular enough. I don’t think this argument holds water anymore. It’s true that MT has three to four times as many blogs as WordPress, but Serendipity has an order of magnitude fewer blogs than WP and is highly targeted by spammers. I think WordPress has, through design and luck, done a lot of things right with regard to comment management. First, we respond to the problem in the core code quickly: moderation and blacklisting have been in the core for half a year now. All of the WordPress developers are bloggers as well, so we’re pretty sensitive to new techniques in use by the spammers. When early versions of WordPress 1.0 advertised that moderation was on, spammers instantly adapted and started searching for blogs that didn’t have the phrases we used, so in the next nightly build for testers I changed how that worked so it couldn’t be targeted anymore. Then in 1.2 we expanded the already successful moderation to allow powerful regular expressions and to target not just the content but things like the number of links in a post.

Let’s say that somehow two hundred spam comments did get onto your blog, which shouldn’t happen in the first place because we’ve had throttling for over a year now; you could still delete hundreds of spam comments at once in under five clicks. We’re not sitting still either: version 1.3 will have emergent registration based on code originally written by Kitten, a type of automatic whitelisting that spammers can’t duplicate because it uses email addresses like a secret key and WordPress never reveals your email address. (So Dave and Mark, stop leaving fake ones!) The code will be flexible enough to adapt for GPG signing for the ultra-geeky in the audience.

None of these things on its own would be very effective, and each method I’ve listed has its flaws and weaknesses, and I know them. Which brings us to what I think is the real reason WordPress, despite its explosion of popularity, still doesn’t get the level of spam other tools do: it’s more trouble than it’s worth. WordPress, to spammers, is an unpredictable and moving target. We’re not resting on our laurels; we have another exciting, feature-filled release coming just a few months after the landmark version 1.2. The WordPress moderation system can be toggled to manual mode, which is 100% effective at catching spam, or triggered only when something is suspicious. We’re committed to keeping the cost high and the reward uncertain for spammers, which means you don’t have to wake up every morning to filth on your weblog as well as in your inbox. You can focus on what draws us all to this medium: writing and genuine interaction. Here’s a quote from Molly, from a comment she left on Keith’s site:

I wanted open comments. In my situation, MT, despite the wonderful Jay Allen personally helping me on an almost daily basis to deal with comment spam, I was a major target. My ISP refused to continue dealing with me because the server molly.com resided on was brought to its knees twice due to spam floods. I was spending up to two hours PER DAY to undo the spam much less post.

Since switching to WP, I’ve had exactly five emails sent to me automagically for moderation. 3 of them were spam, 2 were just enthusiastic posts with multiple links from a reader.

Either way, I had instantaneous access to accept or delete those posts.

That’s the sort of thing that is incredibly rewarding about working on WordPress. Knowing that your work makes it easy for someone else to do what they love is one of the greatest feelings in the world. No amount of money or recognition can ever match that.

More Googlebot Flailing

Now I’m seeing the Googlebot request /about/ pages relative to known blogs that don’t have any links to any /about/ URI. The last time the Googlebot flailed around like this it was fun to watch for a little bit and wonder what they had cooking in the labs, but then it got annoying. I don’t know if there are rules of bot etiquette, but requesting imagined unlinked resources while spidering can’t be a best practice.

There is of course one blog vendor that consistently has about pages at /about/ URIs, and that’s TypePad. Now my question for Google is: what should the rest of us do if we want our about pages indexed by this new system? Mine happens to be in the about subdirectory of my blog, but what about people who have about.html or about-me.php? Should I set up a permanent redirect on every blog I have, pointing to the real about page?
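If redirects are the answer, at least they’re one-liners with Apache’s mod_alias; the paths here are made up for illustration:

# Hypothetical: point the guessed-at /about/ URI to wherever the
# about page actually lives, with a permanent (301) redirect.
RedirectMatch 301 ^/about/?$ http://example.com/about-me.php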

(Note: That’s faux indignation. I don’t have any juicy conspiracy theories, and I’m not really that peeved, mostly I’m just curious what they’re up to. However juicy conspiracy theories are welcome in the comments. [As long as they don’t make fun of me for noticing these things.] )

UPDATE: It just requested a non-existent non-linked /contact/ URI.

UPDATE: It just requested a non-existent non-linked /stats/ URI.

DEVELOPING . . .

The Google Blog

As nearly everyone in the world has noticed, Google has a blog now. It’s too bad they didn’t go with the /blog/ URI, because the one they chose has extra redundant redundancy, and that doesn’t seem very Google-like. The new blog is very generic; it barely seems like a Blogger blog. On the same day Blogger releases gorgeous XHTML+CSS templates from Doug and the crew, Google releases its blog with a table-based layout and funky HTML 4 (with no doctype). Also, Blogger now uses UTF-8 encoding by default (like WordPress), while Google’s blog uses ISO-8859-1.

There isn’t a lot of information on their blog yet. The first post was signed by Ev, but after that it’s been non-entities writing (and modifying) the posts, which is very weird for a blog. Where to go for more information? Their Atom feed, of course. The first thing I noticed was the <id> element, which contained tag:big.corp.google.com,2003:blog-1720. Big corp, ha! So who’s the mysterious author of the two entries after Ev’s?

<author>
<name>A Googler</name>
</author>

Well that’s helpful. Their second post on outsourcing has a more interesting bit of metadata.

<issued>2004-05-10T15:30:52-07:00</issued>
<modified>2004-05-11T17:40:57Z</modified>
<created>2004-05-10T22:39:01Z</created>

Bloggers edit their entries all the time, but “A Googler” actually changed quite a bit, removing a paragraph on outsourcing to India. Perhaps Google is already sharing more than they had planned, but I’ll stop now before they take away my Gmail account.

New Yahoo Search

Yahoo has flipped the switch and is no longer using Google for their search. (Some technical details.) The question on everybody’s mind: Is Yahoo’s search better than Google’s? Yes. Why do I think so?

  1. Results are given as an ordered list, or <ol>, which is a good thing.
  2. It shows 20 results instead of just 10.
  3. You have an option by each result to open it in a new window.
  4. They are somehow detecting RSS feeds for sites that have them, linking to them directly, and also allowing you to add them to My Yahoo. They seem to have gotten my RDF file instead of my RSS 2.0 file, which is the one I’d prefer, but no worries. I’ve been meaning to replace that with a 301 redirect lately anyway (there’s a sketch below).
  5. It is much better designed.
  6. But the best reason to use Yahoo? I’m the #2 hit for “Matt”. Yes, even ahead of that Drudge clown.

What? Were you expecting me to check for any other search terms?
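About that RDF file: the 301 I keep meaning to set up is also a one-liner in Apache, something like this (the feed filenames are just examples):

# Hypothetical feed paths: permanently point the old RDF feed at the
# RSS 2.0 file so Yahoo and the aggregators pick up the preferred one.
Redirect permanent /index.rdf http://example.com/index.xml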

A quick trip to MyCroft and you can make Yahoo your default search engine for Firefox. Easy as pie.

It Feels So Good to be 3

A birthday present from Google: I have overtaken that other guy and I am now the #3 Matt in the world. Go look at it now, because Google can be a harsh mistress. If things continue to go well I might reach #1 and have to take down all my sites, like I promised.

For those of you commenting on the phantom ping from yesterday, I actually posted something and took it down. Don’t worry, a little editing and it’ll be back.

Calculate Age in MySQL

I just got an email from docs@mysql.com saying the following:

The user comment system in the MySQL manual is not the place to request features. You can do so using our bug-tracking system at http://bugs.mysql.com/. Thanks. (Actually, your comment is not a feature request, but it relates to another comment that is. The example you’re giving is nice, but this is a reference manual, so we have to restrict it to _a few_ useful examples.)

My original comment was:

You bring up some important issues, but dealing with ages really isn’t that hard. For example you could do something like this:

mysql> SELECT DATE_FORMAT(FROM_DAYS(TO_DAYS(NOW())-TO_DAYS(dob)), '%Y')+0 AS age FROM people;

Where ‘dob’ is obviously their date of birth. It’ll also work with pre- and post-epoch dates. Please excuse the funky formatting, as the comment system seems to insist on inserting line breaks into the code block. I ran into this problem while working on some genealogical things over at Mullenweg.com, a family site. I hope this helps!

Looking back, it’s funny that the comment is still around; I wrote it over two years ago. The date and time functions page is the part of the MySQL manual I use most, so in some sense it was always nice to have my mark on there. For Google and posterity I’ve preserved the comment here.

I’m glad they’re cleaning up the comments, as they are really bad in places and have atrocious formatting, especially when compared to, say, the PHP manual. However, there is a later comment (which is still up) that offers perhaps a better method. From Kirill Novitchenko:

The method posted by Mathew Mullenweg is good, but leap years goof it up on birthdays. (Try it. Use the current date and subtract exactly 5 years ago.)

Hopefully this will be the last ‘find age’ function. There is a simple premise to it:

  1. Subtract the current year from the birth year to get the age.
  2. If the current month and date is less than the birth month and date, subtract 1 from step 1.

Therefore, this should work with everyone who wasn’t born in the future.

SELECT DATE_FORMAT(NOW(), '%Y') - DATE_FORMAT(dob, '%Y')
     - (DATE_FORMAT(NOW(), '00-%m-%d') < DATE_FORMAT(dob, '00-%m-%d')) AS age

where dob is date of birth.

I’ve never run into any problems with my function but I see nothing wrong with the way this one works, so I may update my code to use it.
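As a sanity check, here’s the leap-year case Kirill mentions, run through his expression with a fixed “today” (a scratch test only, not production code):

# Someone born 2000-02-29 should still be 4 on 2005-02-28, not 5.
SET @dob = '2000-02-29', @today = '2005-02-28';
SELECT DATE_FORMAT(@today, '%Y') - DATE_FORMAT(@dob, '%Y')
     - (DATE_FORMAT(@today, '00-%m-%d') < DATE_FORMAT(@dob, '00-%m-%d')) AS age;
# 2005 - 2000 - ('00-02-28' < '00-02-29') = 5 - 1 = 4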

Why not just use unix timestamps and avoid all the funkiness? When I first started writing everything, I actually did, but then one day I got a call from my lovely sister saying that it was showing everyone’s birthday as January 8th, 1901 (or something like that). I had reached the negative limit of a 32-bit integer, the upper limit being sometime in 2038. Moving all the date functions into the SQL is probably bad from a programming point of view, but it works great for this application. Of course, I have no clue how it deals with the 10 days Pope Gregory removed from the calendar in 1582. Hopefully that won’t come up. 🙂
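For the curious, here’s what that 32-bit wall looks like from MySQL’s side (results as of the versions I’ve used):

SELECT FROM_UNIXTIME(2147483647); # 2038-01-19 03:14:07 in UTC, the top of the range
SELECT FROM_UNIXTIME(-1);         # NULL: pre-epoch dates can't be represented at all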

RSS Requests and Browsers

Out of curiosity I ran some stats on the different RSS versions I offer here. The results were pretty much what I expected:

  1. 47% — RSS 2
  2. 39% — RSS .92
  3. 14% — RDF 1.0

Also as an update to a previous look, I don’t know what it was about Mozilla that month (August). Here’s what is happening currently:

  1. 38% — Internet Explorer
  2. 36% — Netscape/Mozilla
  3. 6% — Googlebot
  4. 3% — Safari

That’s pretty much par for the course. One interesting note is that IE6 users seem to spend the most time on the site, for whatever that’s worth. As before, let me know how these things stack up in your neck of the woods.

Wildcard DNS and Sub Domains

What follows is what I consider to be best practice for my personal sites, and a guide for those who wish to do the same. Months ago I dropped the www. prefix from my domain, in part because I think it’s redundant and also because I wanted to experiment with how Google treated valid HTTP redirect codes. The experiment has been a great success: Google seems to fully respect 301 Permanent Redirects, my previously split PageRank has been combined, and now I am at 7. There are other factors that have contributed to this, of course, and people still continue to link to my site and posts with a www. (or worse) in front, but overall it just feels so much cleaner to have one URI for one resource, all the time. I’m sure that’s the wrong way to say that, but the feeling is there nonetheless.

Now for the meat. What’s a good way to do this? Let’s look at our goals:

  • No links should break.
  • Visitors should be redirected using a permanent redirect, HTTP code 301, meaning that the address bar should update and intelligent user agents may change a stored URI.
  • It should be transparent to the user.
  • It should also work for mistyped “sub domains” such as ww. or wwww. (I still get hits from Carrie’s bad link).

So we need a little magic in DNS and in our web server; in my case those are Bind and Apache. I am writing about this because at some point the code I put in to catch any subdomain stopped working, and while reimplementing it I decided to write up what I was doing. This method also works with virtual hosts on shared IPs, where my previous method did not.

In Bind you need to set up a wildcard entry to catch anything a misguided user or bad typist might enter in front of your domain name. Just as an asterisk (or splat) matches any number of any characters when searching or in regular expressions, the same applies in Bind. So at the end of my zone DB file (/var/named/photomatt.net.db) I added the following line:

*.photomatt.net. 14400 IN A 64.246.62.114

Note the period after my domain. The IP is my shared IP address. That’s all you need; now restart Bind. (For me, that’s /etc/init.d/named restart.)

Now you need to set up Apache to respond to requests on any hostname under photomatt.net. Before, I just relied on the convenience of having a dedicated IP for this site and putting the redirect VirtualHost entry first in my httpd.conf file. That works, but I have a better solution now. We want to tell Apache to respond to any request on any subdomain (that does not already have an existing subdomain entry) and redirect it to photomatt.net. Here’s what I have:

<VirtualHost 64.246.62.114>
DocumentRoot /home/photomat/public_html
BytesLog domlogs/photomatt.net-bytes_log
ServerAlias *.photomatt.net
ServerName www.photomatt.net
CustomLog domlogs/photomatt.net combined
RedirectMatch 301 (.*) http://photomatt.net$1
</VirtualHost>

The two magic lines are the ServerAlias directive, which is self-explanatory, and the RedirectMatch line, which permanently redirects all requests to photomatt.net.

There is a catch, though. The redirecting VirtualHost entry must come after any valid subdomain VirtualHost entries you may have. For example, I have one for cvs.photomatt.net and had to move that entry up in httpd.conf, because Apache works down the file and uses the first matching entry it comes to, so the wildcard should be last.

That’s it. I’m open to comments and suggestions for improvement.