At the last IRC meetup the WordPress community asked for better search that included both the forums and the Codex and was integrated with the look and feel of the rest of the site. When I did this before it was horribly slow and it involved several queries across several different programs and MySQL hosts to get the results from the wiki, the forums, the blog, and then splice them together somehow. Later we switched to a plain Google site-search but they didn’t like the HTML we used for the search form so we took it down. Well after the meeting I remembered Yahoo Developer Network which had some sort of API for their search with a much higher limit than Google’s.
I went to the site to see how much of a pain it would be so I could start properly procrastinating, but I was taken aback by how incredibly easy it was to get an application ID and start getting the results back as simple XML. I began hacking on it right then. It was about 5 minutes to set up a search form with URIs the way I wanted, 7 minutes to get the XML and parse it out, 5 minutes to write in some paging, and then about 20 minutes tweaking the search page to make it look a little better. The result is the new search.wordpress.org WordPress Search.
It still needs some more work. There seems to be a dupe problem, which is actually a problem with our site, not Yahoo Search. I’d like to tweak the results to highlight newer topics more, or at leats allow for a date-based weighting. Finally I think it would be nice to include some WP-related blogs like Blogging Pro and Weblog Tools Collection in the results. Most importantly we now have a clean URI structure and home for searches which is abstracted from any piece of software or particular service provider. Yahoo deserves major kudos for opening up their information in such a free way and making it so easy that it’s taken me longer to write this post than start using their API.
I decided to run the web page analyzer (excellent tool) against the front pages of a few of the latest and greatest search engines and also do a little analysis of my own. Here are some of the results in one of the only tables you’ll ever see on this site:
Yahoo was the surprise winner here. Their HTML was alright but I think could be reduced quite a bit without losing anything. You’ll note they have the heaviest HTML of the bunch, heavier than other sites showing quite a bit more on their front page. They should probably talk to Doug. Overall though I think Yahoo has consistently been doing great nearly-standards-compliant work in their new designs. Yahoo could save about 67% of their HTML size with compression. Interestingly, Yahoo was the only site to specify ISO-8859-1 encoding, all the others claimed UTF-8.
Google was optimized to the hilt, but it’s kind of silly that they put so much effort into their markup but couldn’t go the last inch and make it valid HTML 4. They could probably make it a bit smaller with some more intelligent CSS usage. At least they don’t have font tags anymore. I think under normal circumstances they would have won but they have an olympic logo right now that’s pretty heavy. Google was the only site that used gzip compression for their HTML, but even uncompressed they only weighed in at about 2.4 kilobytes, still the lightest of the group.
Technorati clearly had the smartest markup of the group, and was the only one that validated. (An impressive feat for any website in this day and age.) Their markup is clean as a whistle with excellent structure and logic, and their numbers aren’t bad when you consider that they have a lot of stuff on their front page. This isn’t too surprising since Tantek did it. Their CSS, however, is pretty heavy. It’s strange because it’s very optimized in some ways but bloated in others, I think they could cut a few K from it pretty easily. One smart thing they did is have the CSS named with the date, so it’s name versioned and they can update it monthly without caching issues. All that said, they’re so far ahead of everything else they don’t need to worry about much. Technorati could save about 53% of their XHTML size with compression.
Feedster has its heart in the right place, but the implementation falls far short. For example it has a XHTML 1.1
doctype but then has the needless XML declaration at the top throwing IE into quirks mode. They use CSS in places, but then they have a table with 75 non-breaking spaces in it for positioning. There’s a ton needless markup, including a full kilobyte of HTML comments. On the bright side, they have the most room to improve. Feedster could save about 61% of their XHTML size with compression.