Who’s Selling Your Data to Train AI: A Guide for Online Posters

If you’ve ever posted anything on the internet, chances are that your data has already been scraped, collected, and used to train AI systems like the ones powering ChatGPT, Midjourney, and Sora. These generative AI systems require vast amounts of data to perform well as generalists. However, using scraped public data without permission has raised ethical concerns and accusations of copyright infringement.

Legal Battles and Deals

The New York Times is currently in a legal battle with OpenAI, accusing the company of using its archives without permission to train chatbots. OpenAI, in turn, has accused the Times of attempting to hack its chatbot to prove the alleged theft. Getty Images has also sued Stability AI, the company behind Stable Diffusion, for copyright infringement. Meanwhile, lawsuits from authors and creators who discovered their works were used to train AI models have faced setbacks in court.

On the other hand, some companies have chosen to make deals. The Associated Press has licensed part of its archives to OpenAI, while Shutterstock, a stock photo archive, has signed a six-year deal to provide training data. These agreements grant access to vast collections of photos, videos, and music databases.

Implications for Information and Culture

The ways in which AI systems use the work of journalists, musicians, and photographers have significant implications for our information and cultural ecosystem. The demand for more training data now means that content created by ordinary online posters may be sold to AI companies as well. This raises concerns about the future of these industries and the people employed within them.

Platforms Selling User Data

Automattic, the parent company of Tumblr and WordPress, is reportedly preparing deals to sell user data to OpenAI and Midjourney. While Automattic has announced a way for users to opt out of sharing their public content with third parties, it has not provided further information about the reported deals. Tumblr, in particular, remains an important platform for fandom content and original artwork.

Reddit, known for its vast archives of posts, has also sold access to user posts to Google. In a $60 million deal just before its IPO announcement, Reddit granted Google access to its API for training generative AI models. This raises questions about the value of users’ labor and about how much the platform profits from work its users created.

The Widespread Use of Public Posts

The reported deals mentioned above are just a glimpse into the larger landscape of AI models being trained on public posts across the internet. Last year, the Washington Post discovered massive datasets of scraped public internet data used to train AI models. These datasets included content from World of Warcraft message boards, Patreon, Kickstarter, and personal blogs. Meta, formerly known as Facebook, also uses public posts from Facebook and Instagram to train its AI models.

Protecting User Content

As the use of public data for AI training becomes more prevalent, it is crucial for platforms and AI companies to address concerns about attribution, opt-outs, and control. Users should have the ability to protect their content and ensure that it is not exploited without their permission.

Conclusion

The use of scraped public data to train AI models has sparked legal battles, raised copyright concerns, and highlighted the potential exploitation of user-generated content. Platforms like Tumblr and WordPress have been reported to be in talks with AI companies, while Reddit has already sold access to user posts to Google. The widespread use of public posts for training AI models calls for greater transparency and user protection in the evolving landscape of AI technology.
