A new study from researchers at the Massachusetts Institute of Technology's (MIT) Data Provenance Initiative has highlighted a significant decline in the content available for training artificial intelligence (AI) models. The study, which examined 14,000 web domains included in commonly used AI training data sets, reveals an "emerging crisis in consent" as publishers and online platforms take measures to block data harvesting. Across three data sets (C4, RefinedWeb and Dolma), restrictions imposed through the Robots Exclusion Protocol (a file called robots.txt) now affect 5% of all data and 25% of data from the highest-quality sources. The study also found that websites' terms of service restricted 45% of the data in the C4 data set. The study's lead author, Shayne Longpre, warns that this decline in data consent will affect not only AI companies but also researchers, academics and other noncommercial entities. Training data availability is integral to today's generative AI systems, and shrinking access to web data may pose challenges for many stakeholders in the field.
£ - This article requires a subscription.
What is this page?
You are reading a summary article on the Privacy Newsfeed, a free resource for DPOs and other professionals with privacy or data protection responsibilities, helping them stay informed of industry news in one place. The information here is a brief snippet relating to a single piece of original content or to several articles on a common topic or thread. The main contributor is listed in the top left-hand corner, just beneath the article title.
The Privacy Newsfeed monitors over 300 global publications, and more than 5,750 summary articles have been posted to the online archive, dating back to the beginning of 2020. A weekly roundup is sent by email every Friday.