Science journalists find ChatGPT is bad at summarizing scientific papers

Summarizing complex scientific findings for a non-expert audience is one of the most important things a science journalist does from day to day. Generating summaries of complex writing has also been frequently mentioned as one of the best use cases for large language models (despite some prominent counterexamples).
With all that in mind, the team at the American Association for the Advancement of Science (AAAS) ran an informal year-long study to determine whether ChatGPT could produce the kind of “news brief” paper summaries that its “SciPak” team routinely writes for the journal Science and services like EurekAlert. These SciPak articles are designed to follow a specific and simplified format that conveys crucial information, such as the study’s premise, methods, and context, to other journalists who might want to write about it.
Now, in a new blog post and white paper discussing their findings, the AAAS journalists have concluded that ChatGPT can “passably emulate the structure of a SciPak-style brief,” but with prose that “tended to sacrifice accuracy for simplicity” and which “required rigorous fact-checking by SciPak writers.”
“These technologies may have potential as helpful tools for science writers, but they are not ready for ‘prime time,’ at this point for the SciPak team,” AAAS writer Abigail Eisenstadt said.
Where’s the human touch?
From December 2023 to December 2024, AAAS researchers selected up to two papers per week for ChatGPT to summarize using three different prompts of varying specificity. The team focused on papers with difficult elements like technical jargon, controversial insights, groundbreaking discoveries, human subjects, or non-traditional formats. The tests used the “Plus” version of the latest publicly available GPT models throughout the study period, which generally spanned the GPT-4 and GPT-4o eras.
In total, 64 papers were summarized, and those summaries were evaluated both quantitatively and qualitatively by the same SciPak writers who had written briefs on those papers for the AAAS. The researchers note that this design “could not account for human biases,” which we’d argue might be significant among journalists evaluating a tool that was threatening to take over one of their core job functions.
Still, the quantitative survey results among those journalists were pretty one-sided. On the question of whether the ChatGPT summaries “could feasibly blend into the rest of your summary lineups,” the summaries received an average score of just 2.26 on a scale of 1 (“no, not at all”) to 5 (“absolutely”). On the question of whether the summaries were “compelling,” the LLM summaries averaged just 2.14 on the same scale. Across both questions, only a single summary earned a “5” from a human evaluator, compared to 30 ratings of “1.”
Not up to standards
Writers were also asked to write out more qualitative assessments of the individual summaries they evaluated. In these, the writers complained that ChatGPT often conflated correlation and causation, failed to provide context (e.g., that soft actuators tend to be very slow), and tended to overhype results by overusing words like “groundbreaking” and “novel” (though this last behavior went away when the prompts specifically addressed it).
Overall, the researchers found that ChatGPT was usually good at “transcribing” what was written in a scientific paper, especially if that paper didn’t have much nuance to it. But the LLM was weak at “translating” those findings by digging into methodologies, limitations, or big-picture implications. Those weaknesses were especially pronounced for papers that presented multiple differing results, or when the LLM was asked to summarize two related papers into one brief.
While the tone and style of ChatGPT summaries were often a good match for human-authored content, “concerns about the factual accuracy in LLM-authored content” were prevalent, the journalists wrote. Even using ChatGPT summaries as a “starting point” for human editing “would require just as much, if not more, effort as drafting summaries themselves from scratch” due to the need for “extensive fact-checking,” they added.
These results might not be too surprising given previous studies that have shown AI search engines citing incorrect news sources a full 60 percent of the time. Still, the specific weaknesses are all the more glaring when discussing scientific papers, where accuracy and clarity of communication are paramount.
In the end, the AAAS journalists concluded that ChatGPT “does not meet the style and standards for briefs in the SciPak press package.” But the white paper did allow that it might be worth running the experiment again if ChatGPT “experiences a major update.” For what it’s worth, GPT-5 was introduced to the public in August.