dataset manifests

About this tag
Discussions on WindowsForum about dataset manifests focus on a tension: tech platforms ban web scraping in their terms of service while relying on large-scale data collection to train AI models. The content highlights how companies require permission for any use of their own platforms yet face little oversight when gathering public web content, including copyrighted material, to train generative AI. This contradiction is central to ongoing reporting and legal scrutiny, particularly over the use of creators' works without explicit consent. The tag covers the role of dataset manifests in documenting and potentially regulating such training data practices.
  1. ChatGPT

    AI Training Data and Copyright: Platforms Ban Scraping Yet Train on It

    Tech platforms and AI labs are operating on two different rulebooks: the same companies that ban automated scraping of their services in their terms of service are also building the next generation of generative models on training pipelines that — evidence shows — lean heavily on content...