Artificial intelligence tools used by doctors risk leading to worse health outcomes for women and ethnic minorities, as a growing body of research shows that many large language models downplay the symptoms of these patients.
A series of recent studies have found that the uptake of AI models across the healthcare sector could lead to biased medical decisions, reinforcing patterns of undertreatment that already exist across different groups in western societies.
The findings by researchers at leading US and UK universities suggest that medical AI tools powered by LLMs tend to understate the severity of symptoms in female patients, while also displaying less “empathy” towards Black and Asian patients.
The warnings come as the world’s top AI groups such as Microsoft, Amazon, OpenAI and Google rush to develop products that aim to reduce physicians’ workloads and speed up treatment, all in an effort to help overstretched health systems around the world.
Many hospitals and doctors globally are using LLMs such as Gemini and ChatGPT as well as AI medical note-taking apps from start-ups including Nabla and Heidi to auto-generate transcripts of patient visits, highlight medically relevant details and create clinical summaries.
In June, Microsoft revealed it had built an AI-powered medical tool it claimed was four times more successful than human doctors at diagnosing complex ailments.
But research by MIT’s Jameel Clinic in June found that AI models, such as OpenAI’s GPT-4, Meta’s Llama 3 and Palmyra-Med, a healthcare-focused LLM, recommended a much lower level of care for female patients, and suggested some patients self-treat at home instead of seeking help.
A separate study by the MIT team showed that OpenAI’s GPT-4 and other models also produced answers that showed less compassion towards Black and Asian people seeking support for mental health problems.
That suggests “some patients could receive much less supportive guidance based purely on their perceived race by the model,” said Marzyeh Ghassemi, associate professor at MIT’s Jameel Clinic.
Similarly, research by the London School of Economics found that Google’s Gemma model, which is used by more than half the local authorities in the UK to support social workers, downplayed women’s physical and mental issues in comparison with men’s when used to generate and summarize case notes.
Ghassemi’s MIT team found that patients whose messages contained typos, informal language or uncertain phrasing were 7 to 9 percent more likely to be advised against seeking medical care by AI models used in a medical setting than those whose messages were perfectly formatted, even when the clinical content was the same.
This could result in people who do not speak English as a first language, or who are less comfortable using technology, being treated unfairly.
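The shape of that experiment is easy to illustrate. The Python sketch below compares a model’s triage advice for the same clinical vignette written cleanly and written with typos and informal phrasing. The vignette, the perturbation function and the keyword check are invented for illustration, and `query_model` is a placeholder for whichever LLM is being audited; this is not the MIT team’s code.

```python
# Illustrative perturbation test: same clinical content, different surface form.
# `query_model` is a placeholder for the LLM call under audit.
import random

VIGNETTE = (
    "I have had chest tightness and shortness of breath for two days. "
    "It gets worse when I climb stairs. Should I see a doctor?"
)

def add_typos_and_informality(text: str, seed: int = 0) -> str:
    """Crudely perturb surface form (typos, lowercase, informal phrasing)
    without changing the clinical content."""
    random.seed(seed)
    words = text.lower().replace(
        "should i see a doctor?", "do u think i need to go in?? idk"
    ).split()
    # drop a random character from a few words to simulate typos
    for i in random.sample(range(len(words)), k=min(3, len(words))):
        w = words[i]
        if len(w) > 3:
            j = random.randrange(1, len(w) - 1)
            words[i] = w[:j] + w[j + 1:]
    return " ".join(words)

def recommends_seeking_care(reply: str) -> bool:
    """Rough proxy: does the answer point the patient towards a clinician?"""
    return any(kw in reply.lower() for kw in ("see a doctor", "seek care", "emergency", "urgent"))

def triage_gap(query_model, n_trials: int = 50) -> float:
    """Difference in the rate of 'seek care' advice between clean and perturbed messages."""
    clean_hits = sum(recommends_seeking_care(query_model(VIGNETTE)) for _ in range(n_trials))
    noisy_hits = sum(
        recommends_seeking_care(query_model(add_typos_and_informality(VIGNETTE, seed=t)))
        for t in range(n_trials)
    )
    return (clean_hits - noisy_hits) / n_trials
```

A persistent gap between the two rates, despite identical clinical content, is the kind of surface-form sensitivity the MIT researchers reported.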
The problem of harmful biases stems partly from the data used to train LLMs. General-purpose models, such as GPT-4, Llama and Gemini, are trained on data from the internet, and the biases in those sources are reflected in their responses. AI developers can also limit how much of this bias creeps into systems by adding safeguards after the model has been trained.
“If you’re in any situation where there’s a chance that a Reddit subforum is advising your health decisions, I don’t think that that’s a safe place to be,” said Travis Zack, adjunct professor at the University of California, San Francisco, and chief medical officer of AI medical information start-up Open Evidence.
In a study last year, Zack and his team found that GPT-4 did not take into account the demographic diversity of medical conditions, and tended to stereotype certain races, ethnicities and genders.
Researchers warned that AI tools can reinforce patterns of undertreatment that already exist in the healthcare sector, as data in health research is often heavily skewed towards men, and women’s health issues, for example, suffer from chronic underfunding and a lack of research.
OpenAI said many of the studies evaluated an older version of GPT-4, and that the company had improved accuracy since its launch. It had teams working on reducing harmful or misleading outputs, with a particular focus on health. The company said it also worked with external clinicians and researchers to evaluate its models, stress test their behavior and identify risks.
The group has also developed a benchmark together with physicians to assess LLM capabilities in health, which takes into account user queries of varying styles, levels of relevance and detail.
Google said it took model bias “extremely seriously” and was developing privacy techniques that can sanitize sensitive datasets, as well as safeguards against bias and discrimination.
Researchers have suggested that one way to reduce medical bias in AI is to identify which data sets should not be used for training in the first place, and then to train on diverse and more representative health data sets.
Zack said Open Evidence, which is used by 400,000 doctors in the US to summarize patient histories and retrieve information, trained its models on medical journals, the US Food and Drug Administration’s labels, health guidelines and expert reviews. Every AI output is also backed up with a citation to a source.
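As a rough illustration of that “curated sources plus citations” pattern, the Python sketch below grounds an answer in a small set of vetted documents and returns the source identifiers alongside it. The corpus, the toy word-overlap retrieval and the `generate_answer` call are illustrative assumptions, not Open Evidence’s actual system.

```python
# Minimal sketch of answering only from a curated corpus and citing every source used.
from dataclasses import dataclass

@dataclass
class SourceDocument:
    doc_id: str   # e.g. a journal article DOI or an FDA label identifier
    title: str
    text: str

def retrieve(question: str, corpus: list[SourceDocument], top_k: int = 3) -> list[SourceDocument]:
    """Toy lexical retrieval: rank curated documents by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus,
        key=lambda d: len(q_words & set(d.text.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer_with_citations(question: str, corpus: list[SourceDocument], generate_answer) -> dict:
    """Ground the answer in retrieved passages and return the sources alongside it,
    so every output can be traced to a vetted document rather than open web text."""
    sources = retrieve(question, corpus)
    context = "\n\n".join(f"[{d.doc_id}] {d.title}: {d.text}" for d in sources)
    prompt = (
        "Answer the question using only the sources below, and cite the "
        f"bracketed IDs you rely on.\n\nSources:\n{context}\n\nQuestion: {question}"
    )
    return {
        "answer": generate_answer(prompt),          # placeholder LLM call
        "citations": [d.doc_id for d in sources],   # always returned with the answer
    }
```

The design choice is the point: by restricting the model to a hand-picked corpus and surfacing the citation with every answer, a clinician can check the provenance of the advice rather than trusting open web text.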
Earlier this year, researchers at University College London and King’s College London partnered with the UK’s NHS to build a generative AI model, called Foresight.
The model was trained on anonymized patient data from 57 million people on medical events such as hospital admissions and Covid-19 vaccinations. Foresight was designed to predict probable health outcomes, such as hospitalization or heart attacks.
“Working with national-scale data allows us to represent the full kind of kaleidoscopic state of England in terms of demographics and diseases,” said Chris Tomlinson, honorary senior research fellow at UCL, who is the lead researcher of the Foresight team. Although not perfect, Tomlinson said it offered a better start than more general datasets.
European scientists have also trained an AI model called Delphi-2M that predicts susceptibility to diseases decades into the future, based on anonymized medical records from 400,000 participants in UK Biobank.
But with real patient data of this scale, privacy often becomes an issue. The NHS Foresight project was paused in June to allow the UK’s Information Commissioner’s Office to consider a data protection complaint, filed by the British Medical Association and Royal College of General Practitioners, over its use of sensitive health data in the model’s training.
In addition, experts have warned that AI systems often “hallucinate”—or make up answers—which could be particularly harmful in a medical context.
But MIT’s Ghassemi said AI was bringing huge benefits to healthcare. “My hope is that we will start to refocus models in health on addressing crucial health gaps, not adding an extra percent to task performance that the doctors are honestly pretty good at anyway.”
© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.