Monthly Archives: August 2004

Search Engine Markshowdown

I decided to run the web page analyzer (excellent tool) against the front pages of a few of the latest and greatest search engines and also do a little analysis of my own. Here are some of the results in one of the only tables you’ll ever see on this site:

             Feedster  Technorati  Google  Yahoo Search
HTML             6.11        3.72    1.18          7.82
Ext. CSS        11.47       11.63    0             1.45
Other            9.10        6.70   15.10          1.72
Total           26.70       22.05   16.27         11.00
Compressed         No          No     Yes            No

Numbers are in kilobytes and may not add up exactly due to rounding. “Ext. CSS” covers external, linked stylesheets. “Other” includes images and JavaScript.

Yahoo was the surprise winner here. Their HTML was alright, but I think it could be reduced quite a bit without losing anything. You’ll note they have the heaviest HTML of the bunch, heavier than other sites showing quite a bit more on their front page. They should probably talk to Doug. Overall though, I think Yahoo has consistently been doing great nearly-standards-compliant work in their new designs. Yahoo could save about 67% of their HTML size with compression. Interestingly, Yahoo was the only site to specify ISO-8859-1 encoding; all the others claimed UTF-8.

Google was optimized to the hilt, but it’s kind of silly that they put so much effort into their markup but couldn’t go the last inch and make it valid HTML 4. They could probably make it a bit smaller with some more intelligent CSS usage. At least they don’t have font tags anymore. I think under normal circumstances they would have won, but they have an Olympic logo right now that’s pretty heavy. Google was the only site that used gzip compression for their HTML, but even uncompressed they only weighed in at about 2.4 kilobytes, still the lightest of the group.

Technorati clearly had the smartest markup of the group, and was the only one that validated. (An impressive feat for any website in this day and age.) Their markup is clean as a whistle with excellent structure and logic, and their numbers aren’t bad when you consider that they have a lot of stuff on their front page. This isn’t too surprising since Tantek did it. Their CSS, however, is pretty heavy. It’s strange because it’s very optimized in some ways but bloated in others; I think they could cut a few K from it pretty easily. One smart thing they did is name the CSS file with the date, so it’s versioned by name and they can update it monthly without caching issues. All that said, they’re so far ahead of everything else they don’t need to worry about much. Technorati could save about 53% of their XHTML size with compression.
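As a rough sketch of the date-in-the-filename trick: I don’t know the exact naming pattern Technorati uses, so the `main-YYYY-MM.css` pattern and the `versioned_css_name` helper below are purely my own assumptions. The point is only that a new month produces a new URL, so browsers fetch the updated stylesheet instead of reusing a cached one.

```python
from datetime import date


def versioned_css_name(base: str, when: date) -> str:
    """Build a stylesheet file name that changes once per month.

    The 'base-YYYY-MM.css' pattern is a hypothetical choice for illustration.
    """
    return f"{base}-{when:%Y-%m}.css"


# Any day in August 2004 maps to the same file name.
print(versioned_css_name("main", date(2004, 8, 15)))  # main-2004-08.css
```

Because the name only changes when the month does, the file can be cached aggressively in between.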

Feedster has its heart in the right place, but the implementation falls far short. For example, it has an XHTML 1.1 doctype but then has the needless XML declaration at the top, throwing IE into quirks mode. They use CSS in places, but then they have a table with 75 non-breaking spaces in it for positioning. There’s a ton of needless markup, including a full kilobyte of HTML comments. On the bright side, they have the most room to improve. Feedster could save about 61% of their XHTML size with compression.
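If you want to estimate compression savings like the percentages quoted for each site, here is a minimal sketch using Python’s standard `gzip` module. The `gzip_savings` function name and the sample markup are made up for illustration; the figures in this post came from the analyzer tool, not from this code, and real servers may use different compression levels.

```python
import gzip


def gzip_savings(html: bytes) -> float:
    """Return the percentage of bytes saved by gzip-compressing the input."""
    if not html:
        return 0.0
    compressed = gzip.compress(html)
    return 100.0 * (1 - len(compressed) / len(html))


# Made-up, highly repetitive markup, similar in spirit to spacer-filled tables.
sample = b"<table><tr><td>&nbsp;</td></tr></table>\n" * 200
print(f"{gzip_savings(sample):.0f}% saved")
```

Repetitive markup such as rows of non-breaking spaces compresses very well, which is part of why the bloated pages show the biggest potential savings.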

Adsense Idea

Wouldn’t it be neat if, instead of requiring a crawler bot to visit the page, the AdSense JavaScript could actually scrape the page text that the user sees (the way a screen reader such as JAWS does) and use that to formulate the ad in real time? That would prevent people from cloaking pages for the AdSense bot and would allow it to be used on pages whose content you must be logged in to view. I know it’s probably impossible, but it’d still be neat.

Blog Appeal


“I’ll have you know, that WordPress is very sexy. Just ask any WP site owner, they’ll say that their sex appeal has increased by a factor of 2 since they moved to WordPress. And you’ve never been moved until you’ve been moved by someone like me.”
I can attest to the first part. Shelley is doing WordPress/blog consulting to raise money for a new camera. If you’re in the market for that sort of thing, you may want to drop her a note.

Link Thanks

I just wanted to take a moment to thank those people who give proper attribution (aka a hat tip) when they post about something they found here. More and more lately I’m seeing things that I know started here show up on blogs of people I know and respect with nary a note or link back. Taking the time to properly attribute things can be a drag sometimes, but I think it’s important to maintain the credibility of weblogging as a medium and to reward those who bring new things to light. If you are someone who does properly credit things, please know that I appreciate it quite a bit, and I hold you in higher esteem than more “professional” blogs that are sloppy at best with their attribution.

WordPress.org Search

I’ve ripped out the guts and redone the search on the WordPress.org support forums in the hopes of making it something more people will use. Try it out! The new system searches the wiki (hosted on a different machine), thread titles, recent posts, and does a FULLTEXT post search for the most relevant posts. It has contextual search highlighting (like Google).
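To give a feel for what contextual highlighting involves, here is a small sketch in Python. This is not the code running on WordPress.org; the `highlight_snippet` name, the 40-character radius, and the `<strong>` wrapper are assumptions picked for illustration. It cuts an excerpt around the first match and marks every match inside that excerpt.

```python
import html
import re


def highlight_snippet(text: str, term: str, radius: int = 40) -> str:
    """Return an HTML-escaped excerpt around the first match of `term`,
    with each match inside the excerpt wrapped in <strong> tags."""
    if not term:
        return html.escape(text[: 2 * radius])
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    first = pattern.search(text)
    if first is None:
        # No match: fall back to the start of the text.
        return html.escape(text[: 2 * radius])

    start = max(0, first.start() - radius)
    end = min(len(text), first.end() + radius)
    excerpt = text[start:end]

    # Escape the pieces between matches separately so the tags survive.
    parts = []
    last = 0
    for m in pattern.finditer(excerpt):
        parts.append(html.escape(excerpt[last:m.start()]))
        parts.append(f"<strong>{html.escape(m.group(0))}</strong>")
        last = m.end()
    parts.append(html.escape(excerpt[last:]))

    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + "".join(parts) + suffix


print(highlight_snippet("How do I change my Permalink structure?", "permalink"))
```

Escaping before wrapping matters here, since forum posts are full of pasted markup that should show up as text rather than be rendered.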

When I have some time to get back to this, every section will have a “more of this” link to take you to more (paged) results. It does this currently with the wiki search, counting the total results and linking to the wiki search directly if there are more than 5 results. There are probably still a few bugs to work out. The fulltext query was taking over two seconds to run until I tweaked the JOIN type to get the MySQL optimizer to use the proper index and join order. Everything should validate as XHTML.

A new system is also in place to inject custom results at the top of the page. We’ve been logging searches for the last few months (over 129,000 so far, about 43,000 unique searches), and I’m going to be working closely with the documentation team to identify which searches are most common and what tailored information would be best to present to users when they search for targeted terms, be it a blog post, an external resource, someplace on WordPress.org itself, a wiki page, or a specific thread. We can watch trends and spikes in searches to identify any problems in the application itself or features that may be insufficiently documented or hard to use.
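Finding the most common searches in a log is mostly a matter of normalizing and counting. The sketch below assumes the log is just a list of raw query strings; the sample queries, the `summarize_searches` name, and the normalization rules (lowercasing and collapsing whitespace) are all made up for illustration, and the real log lives in a database.

```python
from collections import Counter


def summarize_searches(queries):
    """Return (total searches, unique searches, Counter of normalized queries).

    Normalization here is an assumption: lowercase and collapse whitespace,
    so that 'Permalinks ' and 'permalinks' count as the same search.
    """
    normalized = [" ".join(q.lower().split()) for q in queries if q.strip()]
    counts = Counter(normalized)
    return len(normalized), len(counts), counts


# Hypothetical log entries, not real data from the forums.
log = ["permalinks", "Permalinks ", "import  blogger", "themes", "permalinks"]
total, unique, counts = summarize_searches(log)
print(total, unique, counts.most_common(1))  # 5 3 [('permalinks', 3)]
```

The top entries from a count like this are the natural candidates for hand-picked results at the top of the page.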

The work is far from finished, but I think it’s a strong first step into fully integrating search as a support mechanism and bringing the WordPress team even closer to the pulse of the users.