Low-Quality Papers Based on Public Health Data Are Flooding the Scientific Literature
The appearance of thousands of formulaic biomedical studies has been linked to the rise of text-generating AI tools.

Data from five large open-access health databases are being used to generate thousands of poor-quality, formulaic papers, an analysis has found. Its authors say that the surge in publications could indicate the exploitation of these databases by people using large language models (LLMs) to mass-produce scholarly articles, or even by paper mills — companies that churn out papers to order.
The findings, posted as a preprint on medRxiv on 9 July, follow an earlier study that highlighted an explosion of such papers that used data from the US National Health and Nutrition Examination Survey (NHANES). The latest analysis flags a rising number of studies featuring data from other large health databases, including the UK Biobank and the US Food and Drug Administration’s Adverse Event Reporting System (FAERS), which documents the side effects of drugs. Between 2021 and 2024, the number of papers using data from these databases rose from around 4,000 to 11,500 — around 5,000 more papers than expected on the basis of previous publication trends.
The study’s authors warn that a large number of these papers — many of which have repetitive, template-like titles — are likely to be of low quality and could flood the scientific literature. Their analysis is intended as “an early warning system … so that peer reviewers, editors and researchers can understand where the vulnerabilities in the system lie”, says co-author Matt Spick, a biomedical scientist at the University of Surrey in Guildford, UK.
Unexpected Growth
Spick and his colleagues analysed changes in publication counts, title wording and author affiliations for papers that were based on data from 34 open-access health databases. The team used an algorithm to predict the growth in the numbers of papers expected for each data set from 2014 to 2024 — a period during which text-generating LLM tools such as ChatGPT and Gemini became mainstream.
When they compared their predictions with actual publication rates, the researchers identified six data sets whose growth significantly exceeded the rates predicted by the algorithm. All but one also showed a rise in the number of papers with ‘template-like’ titles. These data sets included NHANES, the UK Biobank, FAERS, the Global Burden of Disease (GBD) study and the Finnish genetic database FinnGen. Between 2021 and 2024, for example, the number of papers using FinnGen data grew nearly 15-fold, while those using FAERS increased nearly 4-fold and those using UK Biobank data 2.4-fold.
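The preprint does not spell out its forecasting algorithm in this article, but the general approach — extrapolate a pre-surge publication trend and flag years that far exceed it — can be sketched as follows. Everything here is illustrative: the linear-trend model, the 1.5× threshold and the yearly counts are all invented for the example.

```python
# Illustrative sketch: fit a linear trend to pre-2021 publication counts,
# then flag later years whose actual counts far exceed the extrapolation.
# The model, threshold and data are assumptions, not the preprint's method.

def fit_linear_trend(years, counts):
    """Ordinary least-squares fit: counts ~ slope * year + intercept."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts)) / \
            sum((x - mean_x) ** 2 for x in years)
    return slope, mean_y - slope * mean_x

def flag_excess(years, counts, train_until, threshold=1.5):
    """Return (year, actual, expected) for years exceeding the trend by `threshold`x."""
    train = [(x, y) for x, y in zip(years, counts) if x <= train_until]
    slope, intercept = fit_linear_trend([x for x, _ in train],
                                        [y for _, y in train])
    flagged = []
    for x, y in zip(years, counts):
        expected = slope * x + intercept
        if x > train_until and y > threshold * expected:
            flagged.append((x, y, round(expected)))
    return flagged

# Invented example series: steady growth to 2020, then a sharp surge.
years = list(range(2014, 2025))
counts = [400, 450, 510, 560, 620, 680, 740, 900, 1400, 2600, 4800]
print(flag_excess(years, counts, train_until=2020))
```

On this made-up series the 2022–2024 counts are flagged as exceeding the extrapolated trend, while the modest 2021 uptick is not — mirroring the kind of departure from expected growth the researchers describe.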
The researchers also uncovered some dubious papers, which often linked complex health conditions to a single variable. One paper used Mendelian randomization — a technique that helps to determine whether a particular health risk factor causes a disease — to study whether drinking semi-skimmed milk could protect against depression, whereas another looked into how education levels affect someone’s chances of developing a hernia after surgery.
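For readers unfamiliar with the technique, the simplest Mendelian randomization estimator is the Wald ratio: a genetic variant serves as an "instrument", and its effect on the outcome divided by its effect on the exposure estimates the causal effect of the exposure on the outcome. The sketch below uses invented effect sizes purely to show the arithmetic; it is not drawn from any of the papers discussed.

```python
# Minimal Wald-ratio sketch for single-instrument Mendelian randomization.
# All effect sizes below are invented for illustration.

def wald_ratio(beta_exposure, beta_outcome):
    """Causal effect estimate: instrument-outcome effect over
    instrument-exposure effect."""
    return beta_outcome / beta_exposure

# Hypothetical variant: raises the exposure (e.g. milk intake) by 0.2 SD
# and is associated with a -0.01 change in the outcome score.
effect = wald_ratio(beta_exposure=0.2, beta_outcome=-0.01)
print(f"estimated causal effect: {effect:.2f}")  # prints "estimated causal effect: -0.05"
```

In practice, MR studies combine many variants and test assumptions about pleiotropy; one reason reviewers scrutinize papers like those described above is that a naive single-instrument estimate can easily produce spurious "causal" findings.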
“A lot of those findings might be unsafe, and yet they’re also accessible to the public, and that really worries me,” says Spick.
“This whole thing undermines the trust in open science, which used to be a really non-controversial thing,” adds Csaba Szabó, a pharmacologist at the University of Fribourg in Switzerland.
Broad Perspective
Igor Rudan, a global-health researcher at the University of Edinburgh, UK, and a co-editor-in-chief of the Journal of Global Health, praises the study for having “systematically addressed this problem in the entirety of the scientific literature”. “We need to understand this issue better. From the perspective of a single journal, you cannot do that,” he adds.
Rudan says that, in 2022, Journal of Global Health editors noticed an unusual rise in submissions of papers that used open-access data sets, including the UK Biobank, GBD and NHANES. In 2023 and 2024, these manuscripts constituted 10% and 15%, respectively, of all submissions to the journal. That share has now risen to nearly 20%, and the journal receives manuscripts based on these databases almost daily, he adds.
In response, the journal introduced guidelines earlier this month for researchers submitting research on open-access data sets. These require authors to declare how many papers they published in the past three years that analysed such data sets, disclose the use of artificial intelligence in preparing manuscripts and explain how they rule out false positives in their results.
Spick hopes that other journals and publishers will adopt similar checks when handling “research on a data set that might be being exploited”. He hopes that the methods in the preprint offer a starting point that other researchers can build on to monitor the use of open-access health data more closely.
Source: https://www.nature.com/articles/d41586-025-02241-2