Google Doesn’t Read XHTML

Google doesn’t index pages served as application/xhtml+xml — geez, what a mess. I prefer the rigidity of XHTML syntax, but I don’t want the trouble that comes along with it. Hat tip: Petroglyphs. Update: See comments.

11 thoughts on “Google Doesn’t Read XHTML”

  1. That’s old news, now fixed.

    You can easily check that with any old Google search that yields hits on my blog. For a while, none of my pages made it into the Google cache (they were, in Google parlance, “partially indexed”). Now all of them are there. The fix came about 4 months ago, IIRC.

    There are many reasons to fear sending application/xhtml+xml.

    This isn’t one of them.

  2. Well, actually, it doesn’t really have native support, I guess (unknown format), but it does index the content a bit. (I’m not sure about it recognizing the markup and giving more weight to H1, et cetera.)

  3. I’m confused, Anne’s link shows:

    Contact
    File Format: Unrecognized – View as HTML
    Contacteer ons. Limpid. > Rozenstraat 12-B. > 3702 VN Zeist. > Tel.:
    +31 (0)6 228 362 87. > info@l&#x69 …
    annevankesteren.nl/test/ examples/css/advanced/limpid-2.xhtml – Similar pages

    Which leaves me a little worried about whether Google is properly indexing the content…
    Seeing a result like this in Google would confuse a novice user too; hell, I think I would be turned off from clicking it…

  4. You get that result because you didn’t search for anything specific. If you use keywords instead of the location, it will give back something more useful. Note also that the page in question is a bit weird.

  5. Google seems to be indexing application/xhtml+xml files, but they’re clearly handling them differently. Files on my server with the .xhtml extension are always served as application/xhtml+xml. Google extracts metadata from my site’s index.html file — including the title and a content summary — but provides no information about my .xhtml alternative. In addition, Google has properly cached the .html version, but not the .xhtml version.

  6. OK, there’s something funny going on. From request to request, Google seems to change its behaviour. Googlebot has been by the URL in my original post three times, the first time getting a 406, the second a 200, and the third a 406 again. All told, yesterday I received seven requests that resulted in a 406 and twenty that resulted in a 200, across 24 unique pages. Are any of the other application/xhtml+xml senders seeing similar behaviour?

  7. Before converting my site from HTML 4 Transitional to XHTML 1.0 Strict, Google listed dozens of my pages. Soon after I did that, it dropped me completely. It’s been a year and a half and Google STILL doesn’t crawl my site. Googlebot gets a 406 error when it requests my home page, even though the bots of many other search engines have no trouble. My code is stricter than strict, thoroughly validated, and served as text/html according to the rules of Appendix C for backwards compatibility. If interested, my site is http://www.TheBicyclingGuitarist.net
    Chris Watson a.k.a. “The Bicycling Guitarist”
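
The 406 errors discussed in the comments typically come from strict content negotiation: if a server only offers application/xhtml+xml and the requesting crawler’s Accept header doesn’t list it, the server refuses the request. A minimal sketch of the fallback approach the thread alludes to (a hypothetical helper, not code from any commenter’s site) would check the Accept header and degrade to text/html, which Appendix C-compatible XHTML may be served as, instead of answering 406:

```python
def choose_content_type(accept_header: str) -> str:
    """Pick the MIME type to serve an XHTML page as, based on the
    client's Accept header (media-range quality parameters ignored
    for simplicity in this sketch)."""
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    if "application/xhtml+xml" in accepted:
        return "application/xhtml+xml"
    # Rather than rejecting with 406 Not Acceptable, fall back to
    # text/html, which Appendix C-conformant XHTML can be served as.
    return "text/html"

# A browser that advertises XHTML support gets the XML MIME type:
print(choose_content_type("application/xhtml+xml,text/html;q=0.9"))
# → application/xhtml+xml
# A crawler that only lists text/html gets the HTML fallback:
print(choose_content_type("text/html,*/*;q=0.1"))
# → text/html
```

This ignores q-values and wildcard matching that a production negotiator would honour, but it illustrates why a text/html fallback keeps crawlers with narrow Accept headers from being turned away.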
