BackType: Scraping comments without your consent and trying to monetize them.
It is one of those ideas where you say ‘yeah right, why did nobody else think of this?’.
TC reports on BackType, a Twitter for comments:
They’re a blog-comment focused startup – founders Christopher Golda and Michael Montano are for the first time aggregating all comments from millions of blogs into a single, searchable, parsable stream. Think Twitter for all comments on the web.
A quick check shows that yes, indeed, they scraped a lot of websites. Which I don’t remember agreeing to.
I left the following comment on the article (instead of rearranging the same sentences here, I copy and paste it):
Especially as you are funded, you are in it for the money. Where is my share of you using my content without proper licensing rights?
Basically, you are scraping my content and displaying it without my consent. The fact that I comment on a blog like this or consciously add content through a site like Disqus does not make my content fair game for anybody else to scrape.
Do I see this to be useful? Yes. Is it a valuable resource once filled? Yes. Could you have made a database from all those nuts early beta testers and how they handed over their blogs? Absolutely.
Am I amused to find my content in there, displayed in a fashion I did not agree to, without being asked first? Absolutely not. Even less so as there is no way to opt out, where you should have asked in the first place.
Even more so since you yourself take protecting ‘your’ content seriously. From the T&C:
“All trademarks not owned by BackType that appear on this Site are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by BackType. BackType-originated content included on the Site, such as text, graphics, logos, data compilations and the compilation of all content on the Site, is the property of BackType and its licensors and protected by US and international copyright laws. Except as set out in these Terms, no reproduction of any BackType-originated content is permitted without written permission from BackType. User-posted content is copyrighted, and any use or reproduction of user-posted content must comply with the terms of the respective license(s) and must include a label indicating such license.”
btw this makes me rethink my idea to move my work to a more ‘scrapeable’ platform like WordPress.
The thing that probably rubs me the wrong way most is the clear statement on what may be done with content you leave on their site, while at the same time they do the opposite with ‘your’ content. Even more so ‘its licensors’.
Now I know that I am no guru in data mining, but I know that most people have no clue about it. Even worse: No imagination about what is possible. There is a built-in business model: brands and others can have a company like this scan comments left on your site for information they would like to know and compile. Which per se is not a bad thing, I am all for business.
But it is done behind your back, using your resources and, in the end, content you provide on other sites, to make money.
Second, it does expose your behaviour on the web. Are you sure you want others to see that? Do search your name for a moment and see how far back it goes. Basic rule: Everything which can be traced and data mined will be. (It is not as if it suddenly hit me as ‘oh my god, I never knew that’. Quite the opposite.)
Now, tell me, where can you see that you can delete this comment? Don’t think it is relevant? Have you ever asked somebody to change or remove a comment for any reason? Good luck with changing it on this site and the followers to come. Oh, and its licensors.
btw: If you think you have nothing to “hide” you just have no imagination.
Tags: tools
In case you missed the comment on TechCrunch:
Content owners can exclude themselves from our index the same way they exclude themselves from Google. If they don’t, we’ll drive traffic to their blogs if they get comments our users find interesting — similar to how Google drives traffic to websites that show up in search results.
I’ve written a post on our blog that tries to explain our motivation in a bit more detail:
http://blog.backtype.com/2008/08/on-republishing-comments/
CG, people do not use the database to find comments and interesting blogs. They’ll use it for the purpose of working those comments – which brings traffic but not readers.
As for the comparison with Google: there is a difference between me saying no archive to a search engine and you deriving content for a very specific purpose.
I may decide that I want real search engines to index my content, but this does not necessarily mean that I give you permission to extract that information from other sites. Does Google find my data in other blog comments? Yes. But it does not present it in this way.
In your case even worse, you are pulling answers out of context. A search engine leads towards a complete page, not just a snippet. And it does provide my answer in a context – not ripped out of it.
Opting out of having my blog indexed does NOT solve the problem, however much you would like to put it that way. You are collecting MY content from all sources, which I did not give you permission to do. Given that you are aware of how the reactions go, I find it inexcusable for you not to have set up a mechanism for opting out and, even more important, for removing comments.
CG, re excluding: aren’t you a bit hypocritical here? Firstly, how is one supposed to find out what user agent your scrapebot presents to the server? Secondly, not everyone has robots.txt access on their platform (or should bother to know what on Earth robots.txt is and is for), and they’re still entitled to being able to opt out (the lack of opt-in in the first place notwithstanding). Most blogging platforms provide ‘Exclude blog from search engines’ at most. So if I want my blog searchable, but not scrapeable, I’m stuck, right?
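To make the robots.txt point above concrete: a per-crawler exclusion only works if you know the name the crawler announces. Below is a minimal Python sketch using the standard library's urllib.robotparser. The user agent ExampleCommentBot and the example.com URL are made up for illustration; the name BackType's crawler actually sends is not given in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler, allow everyone else.
# "ExampleCommentBot" is an assumed name, not BackType's real user agent.
ROBOTS_TXT = """\
User-agent: ExampleCommentBot
Disallow: /

User-agent: *
Disallow:
"""


def can_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/post/1"  # placeholder URL
    print("ExampleCommentBot/1.0:", can_fetch("ExampleCommentBot/1.0", page))
    print("Googlebot:", can_fetch("Googlebot", page))
```

Under these assumed rules a regular search engine crawler is still allowed while the named bot is not, which is the ‘searchable but not scrapeable’ setup that stays out of reach for anyone who cannot edit robots.txt or does not know the bot's name. It also only covers the blog owner's own site, not comments left elsewhere.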
The hypocritical part is that they specifically state in the T&C that whatever is done on their platform is not to be scraped. If I were somebody with existing T&C like this, I’d search. And for every instance I found, I’d write a nice letter.
[I like search engines. I like services that provide me with value. I just despise being given glass pearls in exchange for gold and treated like I would not see the difference.]
You have raised a very important issue here that really should be addressed. A commenter really has no say in whether his or her comments appear on BackType: a blog owner can block robots from the blog, but how is a commenter supposed to do that? I have very limited knowledge about the way crawlers operate, but I don’t think I can forbid them to crawl something on a 3rd-party website I have no control over.
But if they base the service on only those users that choose to create accounts, they will have a very limited reach that will be no use for any brand managers at all so it’s kind of understandable they won’t limit the content pulled in any such manner.
Google does the same thing, right?
I like your glass pearls and gold metaphor, and I too am not too enthusiastic about these new “services”.
As for the “we drive traffic to your site” argument by CG – that’s nice, but obviously not the point of critique, and therefore not a good sales argument.
Tell me if I got this wrong: besides the “option” of blocking the crawler via robots.txt (or htaccess), which is available only to a tiny fraction of users, a user effectively has to open an account with backtype and start moderating their already-scraped and released comments so that they won’t appear in the stream on backtype. Is this correct?
Svetlana has good points, btw., as they underline that there is often a crucial misunderstanding of what Web 2.0 really is. It is not social in the first place, it’s a marketplace where everyone tries to scoop up their little share, and one of the means is to make money with other people’s content (and intellectual property).
(btw. the redirect to the spammers-suck website is a bit harsh, as it doesn’t make clear what a genuine commenter did wrong. I regularly forget that I need to re-enter captchas and all that fuss after previewing comments…)
Eh yes, sorry about that – I am not that deep a coder to easily change that mechanism without breaking it, and I tried for a long time to convince the programmer that normal people have no idea what this is about and why. Thanks for making it, though ;))
Another company in this field is http://www.commentino.com/
You know, I never thought about this before: scraping comments without your consent and trying to monetize them. If it weren’t for your article, I would never have known that this kind of thing exists.
OK found this post searching “Block Backtype” … is there a way??
My comments are mine, I retain copyright in all of them, and I choose to license them (or not) in whatever limited or unlimited way I want – and that is the de facto position granted to me by the Berne Convention for the Protection of Literary and Artistic Works, Article 6. http://www.wipo.int/treaties/en/ip/berne/trtdocs_wo001.html#P123_20726
Robots.txt is NOT a replacement for this law. I would not invest in a business which so clearly invited litigation!
I hate Backtype!
I had a very personal blog and I deleted it but the comments are still there on Backtype! I have written to them thrice but they do not reply!