Spam Blogs

You should read “Spam and Fake Blogs.” Another problem I’ve been seeing a lot lately is entire blogs being scraped and their content being republished with ads on it. Structured formats like RSS make this easier than before. The dark side to the numbers all the blog search engines have been touting is that a LARGE percentage of these are fake blogs, so much so that I currently block over 80% of all incoming pings to Ping-O-Matic as obvious spam. This has been a huge resource burden as well. We have around 2 million legit pings per day; do the math.
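
For anyone who wants the math spelled out, here is a rough sketch of it. It assumes the 2 million figure is what is left after blocking and treats the blocked share as exactly 80%; since the real share is "over 80%", the totals are a lower bound. The function name is made up for illustration.

```python
def total_pings(legit_per_day: int, blocked_percent: int) -> int:
    """Estimate total daily pings from the legit count and the blocked share.

    legit = total * (100 - blocked_percent) / 100, solved for total.
    Integer math keeps the result exact for round percentages.
    """
    return legit_per_day * 100 // (100 - blocked_percent)


legit = 2_000_000   # legit pings per day, from the post
blocked = 80        # percent blocked as spam (the post says "over 80%")

total = total_pings(legit, blocked)
spam = total - legit

print(f"total pings per day: {total:,}")
print(f"spam pings per day:  {spam:,}")
```

So at least four out of every five pings that hit the service are thrown away.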

17 thoughts on “Spam Blogs”

  1. Is there any way to make sure that the pings from my site are going through? Obviously I don’t want my content filtered as spam.

    … I’ve just reread your post, 2 million legit pings would mean roughly 10 million pings total a day – I guess it’s only going to go up as well…

  2. Wow. That’s a LOT more than I would have assumed. Can you block it at the IP level so that your XMLRPC implementation doesn’t have to deal with it?

    Of course maybe you’re already doing that. The firewall rules alone would take up a lot of CPU.


  3. We considered blocking at the IP level for Ping-O-Matic, but there are issues that make it problematic to automate. For now, we’ve chosen to continue blocking at the application level (based on the URI sent in the ping) rather than risk false-positive ping blocking.

  4. Matt, do you have a solution to this? Can RSS be altered to thwart these rascals? (I use the nice word there). Also I concur with Ben, how do we know our pings are going through? Thanks!

  5. We err on the side of letting things through, so it’s highly unlikely you’re being blocked if you’re not a spammer. If you get a blocked message, just contact me and I’ll clear it up.

  6. I just reported one of those to Technorati. It was posting garbage every hour. To their credit, Technorati removed it from their database within minutes of my report.

    I’m so dim when it comes to the tech stuff but I feel for people like you who put out wonderful stuff only to have it exploited. Argh.

    Thanks for all your effort.

  7. That’s just insane. When I used to blog with Blogger you’d see a few here and there; now they are everywhere. It just sucks that they are using this platform to spam… I think a lot of the plugins that WP has pretty much do the job. I haven’t had too much of a problem, so that is comforting to me.

  8. Holy crap! Or, given the percentage, maybe I should say “wholly crap.” I knew there were a lot of junk “blogs” out there — I’ve seen excerpts of my own posts show up on obviously keyword-generated auto-blogs, but the scale is well beyond anything I expected.

    Now that I know Technorati is actively removing fake blogs from their index, I’ll make sure I start reporting them.

  9. Ummm, dumb question, but is this why Ping-O-Matic has thrown back an error the last 4 or 5 times I’ve posted a new entry? (About 2 weeks’ worth.)

  10. I always thought that 14 million was too high a number to consist of all ‘real’ blogs (it’s what Technorati are currently indexing), so going by the 80% number for PoM, can we assume there are only 2.8M real blogs?

    The spam-free blog search solution I am currently working on involves taking search results from sites that are linked from sites in your OPML/subscriptions file, and sites that those sites link to. You often find that the ‘third degree blogs’ (i.e. blogs linked from blogs that you link to or subscribe to) contain no spam, assuming that your base list of subscriptions is clean. It would take somebody along that chain linking to a spam blog or a fake blog to break the chain and introduce junk into the results, but at the moment this is rarely happening (with my test cases and a blog index that is only 50,000 blogs large at the moment). This works a bit like social networking sites, where you have ‘friends of friends’ etc.

  11. Nic, you should support XFN too. I would imagine Technorati gets a better set of data than PoM does, so their 14 million number might be pre-filtered or have a much much lower percentage.

  12. There’s an error in assuming that each ping = 1 blog. Correct me if I’m wrong, Matt, but wouldn’t the more proper ratio be 1 ping = 1 post? (Theoretically, of course.) Blog and ping works on the basis of putting up a lot of posts and (therefore) generating a lot of pings. On a really prolific day, I might put up 10 posts across 5 blogs, or two posts per blog/day. A fakeblog or spamblog would more likely do better than 10 times that for 1 blog. So, if we see roughly 8 million spam pings a day, that would more likely represent 80,000 spam blogs using 5 times the services of roughly 1.5 million real ones.

    Is that somewhat close to correct, Matt?

  13. Yes, I would say that’s very perceptive, Greg. I don’t have the bandwidth to check the exact numbers right now, but you’re completely right that the worst spammers ping the most.
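
The ‘friends of friends’ idea from comment 10 can be sketched as a breadth-first walk over blog links. This is only an illustration, not Nic's implementation: the function name, the `links` dictionary, and the choice of counting hops from the subscription list (subscriptions at hop 0, a limit of 2 hops) are all assumptions made for the example.

```python
from collections import deque


def trusted_blogs(subscriptions, links, max_hops=2):
    """Return {blog: hops} for every blog within max_hops links of a subscription.

    subscriptions: blogs from your OPML/subscriptions file (hop 0).
    links: hypothetical mapping of blog -> blogs it links to.
    """
    hops = {blog: 0 for blog in subscriptions}
    queue = deque(hops)
    while queue:
        blog = queue.popleft()
        if hops[blog] >= max_hops:
            continue  # do not follow links any further out than this
        for linked in links.get(blog, ()):
            if linked not in hops:  # first visit is the shortest path
                hops[linked] = hops[blog] + 1
                queue.append(linked)
    return hops


# Toy link graph: a spam blog is only reachable three hops out.
links = {
    "you-read.example": ["friend.example"],
    "friend.example": ["friend-of-friend.example"],
    "friend-of-friend.example": ["spam.example"],
}
result = trusted_blogs(["you-read.example"], links)
print(sorted(result))
```

Search results would then be limited to the blogs in `result`. As the comment notes, this only stays clean while nobody inside the chain links to a spam blog.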