Robert Scoble looks at RSS bandwidth usage but unfortunately doesn’t give real numbers. There are a couple of important points in comparing HTML and RSS bandwidth usage, some of which came up in his comments, but I’ll review them here:
- RSS is not a transport mechanism, and these problems should be handled on the HTTP level. This is faster, better tested, works with caches and proxies, et cetera.
- I don’t care if an aggregator checks my feed every 5 minutes; as long as it supports HTTP properly (conditional GET with Last-Modified headers), the load is negligible for me and for them. The bandwidth used each time is around 250 bytes. (There’s a sketch of this right after the list.)
- Speaking of HTTP, gzip encoding works just as well for RSS feeds.
- Bloated HTML in full content feeds will make for bloated feeds. We’ve upped our standards, up yours.
- Adjust your number of items to match your posting schedule. If you update two or three times a week, you don’t need 20 items in your news feed; try five. If you update a lot, like me or Robert, you run the opposite risk: people who don’t check your feed for a while might actually miss content. I talked to aggregator developers a few months ago about a way to address this; perhaps we need to look at it again.
- A visit to a homepage like this generates a lot of requests. First you get the HTML, then your browser requests the CSS files, then it gets all the images, probably about a dozen of them. RSS is generally a single request, and then images embedded in posts may be requested later if that entry is viewed.
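To make the conditional GET and gzip points above concrete, here is a minimal sketch of what a well-behaved aggregator poll looks like. It is only an illustration, not any particular aggregator’s code: the poll_feed helper and the stored last_modified/etag pair are made up for the example.

```python
# A minimal sketch of a well-behaved aggregator poll; poll_feed and the stored
# (last_modified, etag) pair are invented for illustration.
import gzip
import urllib.error
import urllib.request

def poll_feed(url, last_modified=None, etag=None):
    req = urllib.request.Request(url)
    req.add_header("Accept-Encoding", "gzip")               # let the server compress the feed
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)  # conditional GET
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            if resp.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
            # Keep the validators for the next poll.
            return body, resp.headers.get("Last-Modified"), resp.headers.get("ETag")
    except urllib.error.HTTPError as e:
        if e.code == 304:
            # Nothing new: the whole exchange was a few hundred bytes of headers.
            return None, last_modified, etag
        raise
```

When the feed hasn’t changed, the only traffic is that handful of header bytes, which is where the "around 250 bytes" figure comes from.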
That out of the way, how about some real numbers? I can give the best stats for photomatt.net because my RSS feed is on a separate subdomain. Here’s bandwidth usage for August:
- photomatt.net (HTML, images, et cetera) – 56.0 GB
- xml.photomatt.net (RSS and other variants) – 1.7 GB
The ratio for July was similar, 77.6 GB to 2.0 GB; I guess I lost some readers in the summer slowdown.
My front page is an average of 17K of HTML and about 30K of images, so let’s say 50K. The images and CSS are all cached, but I don’t output the proper headers on the HTML because it would be a pain. I would have to check the time of the latest post, check the latest updated link, make sure there’s not a random photo, basically go through a lot of trouble that isn’t really worth it. (When I was caching everything with Staticize the page was stored in its entirety so I sent correct headers, but I don’t bother with that anymore because everything is so fast it doesn’t really make a difference.) All that said, I’m happy with the level of optimization; my HTML and CSS are as streamlined as I want them to be.
My feed, on the other hand, is completely self-contained. I can send headers with confidence because I know everything in that file and can say authoritatively when it was last changed. (Actually WordPress handles it all automatically, so I don’t worry about it.) Most of the aggregators that hit my site support this. I don’t think about it and they don’t think about it; HTTP just works. My feed has 25 items because of my posting frequency, more than twice the number of items most feeds have. The feed is usually around 10K, and as I already mentioned it’s only one request.
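For what it’s worth, the server side of that is just as small. This is only a rough sketch of the idea (WordPress does the real work for me; serve_feed and its arguments are invented for illustration): send Last-Modified with the feed, and answer a fresh If-Modified-Since with a 304 and no body.

```python
# Rough sketch of the server side of conditional GET; serve_feed and its
# arguments are hypothetical, not WordPress's actual code.
from email.utils import formatdate, parsedate_to_datetime

def serve_feed(environ, start_response, feed_body, last_changed):
    # last_changed: timezone-aware datetime of the newest item in the feed
    last_modified = formatdate(last_changed.timestamp(), usegmt=True)
    ims = environ.get("HTTP_IF_MODIFIED_SINCE")
    if ims:
        try:
            if parsedate_to_datetime(ims) >= last_changed:
                # Client already has the current feed: headers only, no body.
                start_response("304 Not Modified", [("Last-Modified", last_modified)])
                return [b""]
        except (TypeError, ValueError):
            pass  # unparseable date; fall through and send the full feed
    start_response("200 OK", [
        ("Content-Type", "application/rss+xml"),
        ("Last-Modified", last_modified),
    ])
    return [feed_body]
```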
Here’s the kicker: my RSS feed is requested 3 times for every time my front page is loaded. So an HTML page getting a third of the requests is using over 30 times the bandwidth. What was that about scalability again?
I imagine that the big problem with the weblogs.asp.net feed (which kicked off this whole discussion) is that it was aggregating all of the posts by all of the MS bloggers. New posts were coming in at a rate of several an hour, so even with conditional GET support the feed /always/ had something new in it (so aggregators would end up loading the whole thing every time they polled). This ends up being compounded by the length of the items and the fact that the number of items in the feed needs to be high to make sure subscribers don’t miss anything.
RSS may scale fine for feeds which aren’t being updated frequently, but it’s hard to make a case that a frequently updated, highly popular feed won’t suffer from severe bandwidth problems.
Another thing: the real problem with the above example when it comes to bandwidth is that aggregators end up pulling the whole thing even if only one new item has been added. One solution would be to dynamically generate the feed, only including entries that were posted since the Last-Modified header specified in the request. This would up the load on the server but would dramatically reduce the size of many of the outgoing feeds, saving a big chunk of bandwidth.
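A sketch of that filtering step might look like the following, assuming entries are dicts with a timezone-aware 'published' datetime; every name here is invented for illustration, and (as comes up later in the thread) the trade-off is cacheability.

```python
# Sketch of "only send entries newer than the client's If-Modified-Since";
# entries_since and the entry dicts are made up for illustration.
from email.utils import parsedate_to_datetime

def entries_since(entries, if_modified_since):
    """Return only the entries published after the client's If-Modified-Since date."""
    if not if_modified_since:
        return entries
    try:
        cutoff = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return entries  # unparseable header: send everything
    # Assumes 'published' and cutoff are both timezone-aware datetimes.
    return [e for e in entries if e["published"] > cutoff]
```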
I’ve seen on Mark Pilgrim’s site that he just applies XSL to his XML feeds and it displays almost the same as his XHTML pages. Would this be useful? (Assuming the XSL, XML, and images stay constant.)
I wonder, how much of that bandwidth is taken up by your headers being images?
While I agree somewhat that we need to get over the "if you try to grab my RSS more than once an hour, I’ll ban you!!1!" attitude that’s mostly carried over from before we knew enough to use conditional GET, I wonder: suppose bug #251842 gets fixed, the bookmark/Livemark UI comes back in Firefox 1.1, and a whole bunch of people with no exposure to RSS mores start using it; since they’ve got plenty of connection bandwidth and CPU, and no patience, they all set it to check every minute. How are you going to be faring for Apache connections the next time you get Slashdotted?
Simon: every time the "dynamic feed based on Last-Modified" idea comes up, people who really know caches point out that you break cacheability doing that (and you suddenly find out that AOL and the like have been saving you several hundred or thousand hits an hour with their cache).
Phil: I should have known there was a reason no one was doing that! Thanks for the heads up.
I tried to send a trackback, but something went horribly wrong.
Here’s the post anyway.
It’s always appealed to me, too: there they are telling you what they’ve seen, and you can’t just give them the new stuff.
What does strike me as having potential is Russ Karcher’s idea in Scoble’s comments, combined with something close to the 301 redirecting for stats you were talking about the other day. Make your public-facing feed URL excerpts only, with a sticky post saying “If your reader demonstrates that it plays well with the intarwebnet, this will turn into a full-content feed next time it refreshes.” No support for compression? It never does. An If-Modified-Since that’s too recent, based on your caching headers and in-feed things like ttl or updateFrequency? (Being sure it isn’t a cache checking freshness.) No full content for you, two weeks!
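If that idea appeals, here is a back-of-the-napkin sketch of the decision; every name in it is made up, and the ttl/updateFrequency test and the "is this just a cache checking freshness" exemption are waved away in a comment rather than implemented.

```python
# Back-of-the-napkin sketch of "earn the full-content feed"; feed_flavor and
# the rules below are hypothetical, not any existing server or aggregator API.
def feed_flavor(environ):
    supports_gzip = "gzip" in environ.get("HTTP_ACCEPT_ENCODING", "")
    sends_conditional = ("HTTP_IF_MODIFIED_SINCE" in environ
                         or "HTTP_IF_NONE_MATCH" in environ)
    # The "If-Modified-Since too recent vs. the feed's ttl/updateFrequency" test
    # and the cache-checking-freshness exemption would go here; skipped in this sketch.
    if supports_gzip and sends_conditional:
        return "full"      # reader plays well with the intarwebnet
    return "excerpt"       # no full content for you, two weeks!
```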
Trackback didn’t work here either.
http://blog.dalegroup.net/archive/blog/newsid/118
Yeah, I know I broke trackbacks; give me a day to catch up with things and I’ll fix it. Keep the manual pings coming. 🙂
Last I checked (right after Scoble’s post was published), all blogs on blogs.msdn.com returned a 200 OK status every time they were requested.
Talk about a joke. 🙂
Also, most comments on Scoble’s post were about scalability of RSS with big feeds (1MB and up). Since we’re talking about big feeds, why are we surprised that they take so much bandwidth?
In the case of big feeds updated frequently, the solution seems to be to just reduce the update frequency. And voilà, instant bandwidth bill reduction! Gee, I bet I should patent that idea!
I think you just penned a great slogan in point 4:
“We’ve upped our standards, up yours.”
Dave L, try the 2nd link on that line…it’s from the inimitable (sp?!) Joe Clark. (Matt, that’s one of my favorite quotes too.)
I tried making the bandwidth point re: RSS vs. HTML et al. to someone a while back, and I don’t know if it sank in. There’s definitely a perspective of "oh my god, your program is auto-checking the file every $X?! That must be costing us bandwidth!!!!"
I sent them Tim Bray’s entry from back in April on the topic:
Dave L: I know Matt attributed the quote to Joe Clark, but I think Pat Paulsen may have said it as well. But of course, Joe Clark used it when talking about the kind of standards Matt was talking about. Though when Joe said it a while back, he attributed it simply to “Comedy Central”. It’s still a great quote though.
Great post Matt.
I guess it’s just one of those lines that works in any situation. 🙂
Phil, Simon: If they were really sending 1MB+ feeds with constant updates, then I don’t think that proxy caches would be saving that much bandwidth. In that case, I think that setting Cache-Control to private and serving Last-Modified-based diffs would probably save net bandwidth, at least for the aggregated feeds.
Trackback still doesn’t work, so I hope you give this some of your time and tell me what you think; I’m very interested in what you have to say about it, Matt.
I used this optimization tool — which you referred to recently as well — to bring my own blog frontpage down to about 40K. Typing in your main URL, http://photomatt.com/, gives a bit of a mixed result. Maybe you could still reduce your HTML bandwidth usage a bit. You are right to point out that images may be cached, but you might lose a few new readers. Not that you’re short of them, though.
I’m maintaining a list of servers and clients that support RFC3229 delta encoding for RSS and/or Atom files. I know that James Robinson has built a first-cut extension to WordPress to support RFC3229 (it’s the thing I told CNET about..), however, I’m curious if there are any plans to make this a formal part of standard WordPress. If you’re going to be supporting it, please let me know so I can put you on the list.
see: http://bobwyman.pubsub.com/main/2004/09/implementations.html
bob wyman
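For readers who haven’t seen RFC 3229 with the "feed" instance manipulation in action, the exchange looks roughly like the sketch below. It is an illustration only (poll_feed_delta is invented, and this isn’t James Robinson’s extension): the client offers A-IM: feed plus the ETag it last saw, a supporting server answers 226 IM Used with only the entries added since then, and other servers simply fall back to a normal 200 or 304.

```python
# Rough sketch of an RFC 3229 "feed" delta request; poll_feed_delta is invented
# for illustration, and non-supporting servers just answer 200/304 as usual.
import http.client

def poll_feed_delta(host, path, etag=None):
    conn = http.client.HTTPConnection(host)
    headers = {"A-IM": "feed"}               # "I accept the feed delta encoding"
    if etag:
        headers["If-None-Match"] = etag      # which instance of the feed we already have
    conn.request("GET", path, headers=headers)
    resp = conn.getresponse()
    body = resp.read()
    if resp.status == 226:
        # 226 IM Used: the body holds only the entries added since that ETag.
        return body, resp.getheader("ETag")
    if resp.status == 304:
        return None, etag                    # nothing new at all
    return body, resp.getheader("ETag")      # plain 200: the full feed
```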