People already use AI chatbots like search engines for everyday health information. That habit now looks riskier: a new study found that half of the responses from five major bots were problematic, even when the answers sounded sophisticated and confident.
Researchers tested ChatGPT, Gemini, Grok, Meta AI and DeepSeek with 250 prompts on cancer, vaccines, stem cells, nutrition and sports performance. The prompts reflected common health questions and known misinformation topics, and the researchers then assessed whether the bots stuck to scientific evidence or veered into misleading and potentially unsafe advice.
Broad questions uncovered the biggest gaps
The weakest results came from open-ended prompts. These broader questions produced far more problematic answers than expected, while closed-ended prompts tended to yield safer responses.
This is important because real people don’t typically ask medical questions in a neat multiple-choice format. They ask whether a treatment works, whether a vaccine is safe, or what might improve athletic performance.
In the study, these open-ended prompts pushed the bots toward answers that mixed solid evidence with weaker or misleading claims.
Strong trust, shaky sourcing
The flaws didn’t stop at the answers themselves. Reference quality was poor, with an average completeness score of 40%, and none of the chatbots produced a fully accurate reference list.
This weakens one of the main reasons people trust chatbot responses. An answer can seem source-based and authoritative, but then collapse as soon as the citations are checked.
The researchers also flagged fabricated references, even as the bots answered with certainty and expressed almost no reservations.
Why this is important beyond a test
There are limits to the findings. The study covered only five chatbots, these products change quickly, and the prompts were designed to stress the models, which may overstate how often bad answers crop up in everyday use.
Nevertheless, the most important finding is difficult to dismiss. These systems were tested against evidence-based medical topics, and half of the answers were still incorrect or incomplete.
Although chatbots can currently help summarize information or suggest follow-up questions, they do not appear reliable enough to support meaningful medical decisions.