Can Crowdsourcing Survive the LLM Era? A Community Survey on Human Data Collection
Title: The Viability of Crowdsourcing in the Age of Large Language Models: Insights from a Community Survey on Human Data Acquisition
Abstract:
The spread of Large Language Models (LLMs) as writing aids raises concerns about the integrity of crowdsourced datasets, since human participants may increasingly delegate their tasks to these systems. To understand how the field is responding, we surveyed 155 researchers in Natural Language Processing (NLP) and adjacent fields about their practical experiences with, and perspectives on, collecting free-text responses through crowdsourcing platforms. Our findings outline the main obstacles practitioners face, the mitigation tactics currently in use, and the anticipated impact on data integrity.
Notably, 44% of participants reported encountering evidence of LLM-generated content in their crowdsourced data. Although 93% of respondents had anticipated this, 50% were unsure which safeguards would counter it. Current detection relies mainly on spotting anomalous writing styles and unusually fast submission times. The results suggest that the research community recognizes the issue and is actively implementing countermeasures, but that existing strategies are not yet robust enough to resolve it. We therefore propose a framework of considerations to guide best practices for collecting crowdsourced free-text data as LLM technology continues to evolve.
Source: arXiv






