Attack of the Splogs Revisited

I was reading Steve Rubel’s wonderful MicroPersuasion blog today. In it he lamented that other blogs were stealing his content. Steve, those aren’t blogs. They are splogs. They just happen to be pretty.

Some interesting points worth discussing.

In the “we aren’t mainstream media” counterculture world that we bloggers so often try to cultivate, are we hypocritical if we call out someone for stealing our work? One of Steve’s comments points out that if you live by a Creative Commons license, you have to take the good and the bad. The license says: “

You are free:

  • to copy, distribute, display, and perform the work

  • to make derivative works

  • to make commercial use of the work”

Which leads to the question: is “You” in the license the same as “It”? In other words, does the license convey to an individual the right to use an automated program to repackage and redistribute with attribution?

Steve of course has the option of adding the Non Commercial option to the license he offers, which in turn should prevent Splogs from stealing his content.

However, there is an even simpler mechanism. The thing about splogs, even the pretty ones, is that if the blog search engines like Icerocket.com do their jobs, then the splogs aren’t there. Which means they aren’t found. If a splog is hosted on a server, but no one sees it, does it exist?

Fortunately, the two sites that Steve cites were not in the icerocket.com index.

At icerocket.com, we define a splog as any hosted website that only uses redirected or copied content and doesn’t add any unique value. Aggregation is not value add. Why? Because a search on any blog engine should uncover the unique content at its original source. If a blog isn’t updated by human hands, we don’t want it in our index.
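To make “copied content” a little more concrete, here is a minimal sketch of one common way to spot near-duplicate posts: compare overlapping word sequences (“shingles”) and treat the later-published post as the presumed copy. This is an illustration only, not a description of how icerocket.com actually works. The `Post` type, the function names, the shingle size of 4 and the 0.8 threshold are all assumptions made up for the example, and publication dates can be faked, so a real system would need more signals than this.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class Post:
    # Hypothetical record for illustration; not any search engine's schema.
    url: str
    published: datetime
    text: str


def shingles(text: str, size: int = 4) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = text.lower().split()
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str, size: int = 4) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def likely_copy(a: Post, b: Post, threshold: float = 0.8) -> Post | None:
    """If the posts are near-duplicates, return the later one (the presumed copy).

    The threshold is an assumed value; returns None when the posts differ enough.
    """
    if similarity(a.text, b.text) < threshold:
        return None
    return a if a.published > b.published else b
```

Comparing every post against every other post this way gets expensive quickly, which is part of why this is a hard problem at the scale of a search index.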

If you find, as Steve did, that a site is stealing your content, feel free to email me or Blake, or go to Blake’s blog and let us know. We will check it out and remove it if it doesn’t meet our standards.

And while you are at it, feel free to check out some very cool features that we have added to icerocket that Blake details in his latest entry.

26 thoughts on “Attack of the Splogs Revisited”

  1. Totally agree! Especially on the topic of splogs. Thanks for your valuable information.

    Comment by Roy Phay -

  2. Totally agree!! Especially on the topic of splogs. Thanks for the valuable information.

    Comment by Roy Phay -

  3. Seems to me that the two key elements separating splogs from legitimate aggregators are attribution and excerpting. And excerpting is really partially the responsibility of the content producer. Offer an excerpted feed and offer it without ads or heavy copyright.

    Comment by runescape money -

  4. It’s not an easy problem to solve at the search level either, although a specially designed search engine for blogs might be able to specifically target the problem by giving favor to older blog entries with very similar content above some threshold.

    Comment by wow powerleveling -

  5. very goooooood!!!

    Comment by story -

  6. very goooooood!!!

    Comment by story -

  7. very goooooood!!!

    Comment by story -

  8. Not sure what we expect here. Why are we treating blogging like it is any different from anything else in the world? If someone has the original concept, idea, article, book, etc., everything after that becomes a copy.

    Some are out to make a quick buck, just the same as the guy or girl in the cube next to you at the office who “borrows” your idea and repeats it as his or her own.

    Others, however, simply want to play the game, but don’t have the right tools (i.e., some people just can’t write). So they have to use content already provided if they want to be in the game.

    Let’s not fool ourselves. The lure of the money to be made on the Internet is real even for the purest of heart. Questions to consider before we blast everyone replicating content:

    1. What is the real harm being done? I understand in theory the issue of content being copied/sold as original by those who aren’t the originators, but I don’t believe it is realistic for us to expect that content won’t be copied.
    2. Isn’t the true benefit in getting our message out, whatever our brand of message may be? (Of course it would be nice if links to our sites/blogs and our credits were included on the sites “stealing” content.)

    If all else fails, apply the revenue-sharing model presented in the “Print piracy…” blog dated November 6, 2005, to your website. That way you’ll at least ensure that you get something out of the deal.

    Cheers!

    Comment by M Richardson -

  9. Yes, it’s true that most splogs don’t last that long, but when you’re seeing hundreds of thousands of new splog URLs every day, it’s impossible to keep up. At A2B (http://www.a2b.cc) we have had to blacklist (see http://www.a2b.cc/pingblock/ and our blog post at http://blog.a2b.cc for more info) hundreds of web server IP addresses that the splog URLs resolve to. The traffic we use (mainly) parsing the splogs went down from 27GB per day to 6GB per day, but it’s back up now and costing us hosting money. We’re a two-man band and don’t have the time to write sophisticated new scripts every week to combat the latest splogging technique! Help!

    Comment by Sam Critchley -

  10. Mark said:
    “Steve of course has the option of adding the Non Commercial option to the license he offers, which in turn should prevent Splogs from stealing his content.”

    Legally, this will prohibit anyone from stealing the content, but in reality it won’t. People who make splogs don’t generally care about any copyright infringement. And they only get caught if someone who cares about it happens to see their site.

    Mark also said:
    “if the blog search engines like Icerocket.com do their jobs, then the splogs aren’t there. Which means they aren’t found.”

    This may only help a little since most searches aren’t conducted on blog search engines, but on search engines like Google. While Google is continually changing their algorithm to filter out sites like splogs, the black hat folks are continually coming up with new ways to make sites like this that Google doesn’t recognize.

    I think it’s the nature of the web that people will continue to steal content. The upside is that as the search engines become more sophisticated, fewer and fewer of these people make any real money from it. But since there are always new people doing it, the search engine clutter continues to amass.

    Tony Colan said:
    “It would be so easy for the major Blog sites to eradicate splog. It all starts with the sign-up process.”

    This is only true for hosted blog sites. Anyone can get some free blog software and register a domain. No one can stop them from doing whatever they want with it then.

    There are only small things you can do to prevent people from stealing your content. You can report sites you find to search engines. You can contact the site owners and tell them to remove your content from their site. But in the end, I think you have to accept, to some extent, that this is the nature of publishing on the web.

    Comment by Dave -

  11. At the University of Maryland, Baltimore County (UMBC), we’re working on the problem of identifying splogs. See http://ebiquity.umbc.edu/blogger/?p=429 for a recent update.

    The larger Memeta project (http://memeta.umbc.edu/) is developing a system to discover blogs, monitor their activity and build up a database of metadata about them. Memeta currently has information on over 6M blogs worldwide, identifies each blog’s language, and categorizes it as a legitimate blog or a splog. These modules were developed using machine learning techniques from artificial intelligence that base their judgment on a blog’s text content, but also its structure and relationships to other blogs and web sites. This approach allows these modules to be periodically retrained so that they will adapt and maintain their accuracy as blog usage changes. Memeta’s current accuracy is 99% for language identification and about 90% for splog identification.

    We analyzed all new blog posts collected using weblogs.com’s pings. Over the last four weeks, over 40 million posts from about 14 million blogs were analyzed. The study shows that 75% of these posts were from blogs judged to be splogs. See http://memeta.umbc.edu/ for live data and a link to a recent paper.

    Comment by Tim Finin -

  12. Blake,

    Thanks, I appreciate the quick response!

    Robert

    Comment by Robert Oschler -

  13. Aggregating content, even if it’s not your own, does add value. Why would I want to look up individual articles about a subject if I could simply visit a site that specializes in specific topics, all handpicked for relevance?

    Comment by chris -

  14. Mark,

    I think it’s great that you are putting into place policies to prevent sploggers from cluttering the blogosphere, but it’s also worth pointing out that there are a lot of blogs that don’t show up in your engine. On one hand I like not seeing the duplicate listings, but on the other hand, I feel like there is a lot of content I miss when I use IceRocket. My suggestion would be to have a report abuse button that internet users could use to report shady sploggers. If enough people feel that something is a fake, then you could 86 them from Ice Rocket. By taking advantage of the long tail, you could not only include smaller sites, but also still have a filter in place to eliminate the splogging.

    Comment by Davis Freeberg -

  15. Unfortunately, while IceRocket might help the problem, sites like Google might only compound it by giving the edge to highly visible splogs that steal lesser known blog content. It’s not an easy problem to solve at the search level either, although a specially designed search engine for blogs might be able to specifically target the problem by giving favor to older blog entries with very similar content above some threshold.

    Comment by Chris -

  16. Robert,

    I will make sure your blog gets added. I never got an email from you; maybe it went to the spam folder.

    Unfortunately we probably have blocked, and will block, some legitimate blogs while trying to get rid of splogs. Anyone who feels like we may have banned them can always reach us via email or phone.

    Blake Rhodes
    IceRocket.com

    Comment by Blake Rhodes (IceRocket) -

  17. I’ve seen some websites that convert their text to pseudo-images or have them coded in a way that prevents traditional text copying. That makes it very difficult to yank information away. While no public information is truly copy-proof, there are steps that can be taken to make it difficult to reproduce your information.

    Comment by Dave -

  18. The problem with IceRocket is that it still penalizes innocent Blogger blogs. I have a blog on the Robosapien V2 that can’t be found in IceRocket. I know it’s in your system because I get a status message saying that the blog feed has already been submitted.

    But if I take any easily identifiable phrase from an old post in the blog, and punch it into IceRocket’s search, I get a “not found” error.

    I contacted IceRocket with this problem using the “Contact Us” facility and never received a reply.

    Comment by Robert Oschler -

  19. So Mark, what are we to make of http://www.memeorandum.com, Digg.com and diggdot.us?

    Personally I find those, along with my own wee contribution to Canadian politics, http://canada.info-syn.com, useful in a way that a straight RSS feed would not be.

    Like porn, I know a splog when I see one; but one man’s splog is another critical information source.

    Seems to me that the two key elements separating splogs from legitimate aggregators are attribution and excerpting. And excerpting is really partially the responsibility of the content producer. Offer an excerpted feed and offer it without ads or heavy copyright.

    Then, if you offer a full feed at all, put up a big sign – FOR INDIVIDUAL PRIVATE USE ONLY – and send a note with a draft DMCA takedown notice to anyone who publicly aggregates it.

    Comment by Jay Currie -

  20. It would be so easy for the major Blog sites to eradicate splog. It all starts with the sign-up process. Our site http://www.blogster.com is the only spam free site on the Internet and we’re proud of it, but you have to be vigilant.

    Comment by Tony Colan -

  21. I take great issue with Steve’s use of the word “steal.” The first example he gives does exactly like Mark points out above: It links back with attribution to the original blog post. This is perfectly acceptable with the CC license. If Steve doesn’t like it, he should just dump the CC license and add a copyright notice on his page.

    This sounds like a lot of whining over acceptable use. Note that Mark doesn’t use the Creative Commons license, by the way.

    Comment by Jake -

  22. Thank you. I’ve been wondering who to contact at IceRocket to get rid of all the splogs aggregating my content just to make a few quick bucks.

    However, I really don’t think the onus should be on the blog search engines to keep a watch out for splogs. If Google adwords and similar companies refused to sell ads to these guys they would be gone overnight. Have you tried applying some pressure there?

    In the meantime, thanks again for attempting to make icerocket as splog free as possible.

    Comment by John Frost -

  23. I certainly agree with the concept in theory Mark … but how do you go about determining if the content is actually redirected or copied? That’s a pretty computationally intensive task. And assuming you can do so, then how do you determine who the actual owner is?

    The latter is even less of a trivial problem – as I’ve read stories about black hats with high page-rank sites scraping content from lower page-rank sites which then get banned by Google due to duplicate content – i.e. the big “G” picked the wrong site to penalize.

    Dynamic content isn’t as difficult (see an extreme example in my signature) since it’s much more difficult to replicate. But static content is exactly that … and I think it’s very hard to determine who the “rightful” owner is … although it sure would be nice for those of us that write our own words, take our own pictures, etc.

    alek
    Christmas Lights for Celiac Disease Research
    http://www.komar.org/cgi-bin/xmas_webcam

    Comment by alek -

  24. Several websites were simply stealing my content a few months ago and plastering their sites with ads. I didn’t mind so long as they properly linked my site on each post.

    Most of those splogs don’t last that long anyways. Let them make their $0.04 a day and give me more publicity.
    -Nev

    Comment by Neville -

  25. Creative Commons offers varying versions of copyright terms. We use a version that prohibits commercial re-use of our work.

    Comment by Karl -

  26. Question for Mr. Cuban (No Topic)

    Looking back at the last 3 to 4 weeks, can you remember a happiest moment or day which stands out from the rest? Or perhaps was there a particular thing which made you the happiest?

    Thanks for your time,

    Rick Fox Island, Wa

    Comment by Rick -

Comments are closed.