AI in healthcare: new studies show ChatGPT and other AI systems rival doctors, with big caveats
People are talking a lot about AI in healthcare. AI means computer programs that can answer questions in ways that feel human. In June 2026, the company OpenAI said its new ChatGPT health update gave better answers than real doctors in its own tests. ChatGPT is a popular AI chatbot, a program you can chat with by typing questions. Around the same time, the science journal Nature published studies where AI did as well as doctors, or better, on hard cases. That sounds amazing. But the small print matters a lot. This article explains the studies in simple words. It also shares the warnings the researchers gave, and what this means for patients, including people in India.
One word you will see is benchmark. A benchmark is just a test that scores how well an AI does a job. Think of it like an exam for software.
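For readers who code, a benchmark can be pictured as a small scoring loop. The sketch below is only an illustration: the questions, the `toy_model` function, and the exact-match scoring rule are all made up here, and real health benchmarks grade answers in far richer ways.

```python
from typing import Callable

# Hypothetical test set: (question, expected answer) pairs invented for illustration.
TEST_CASES = [
    ("What is the normal adult body temperature in Celsius?", "37"),
    ("How many chambers does the human heart have?", "4"),
    ("Which vitamin does sunlight help the body make?", "D"),
]


def toy_model(question: str) -> str:
    """Stand-in for an AI system; it only knows two of the three answers."""
    known = {
        "What is the normal adult body temperature in Celsius?": "37",
        "How many chambers does the human heart have?": "4",
    }
    return known.get(question, "not sure")


def run_benchmark(model: Callable[[str], str], cases) -> float:
    """Return the share of cases where the model's answer matches exactly."""
    correct = sum(1 for question, expected in cases if model(question) == expected)
    return correct / len(cases)


score = run_benchmark(toy_model, TEST_CASES)
print(f"Score: {score:.1%}")
```

The same loop can score any system that takes a question and returns an answer, which is why who writes the test cases matters so much.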
What did OpenAI announce?
On June 18, 2026, OpenAI shared a health upgrade for ChatGPT. It uses a system called GPT-5.5 Instant. This system is the “AI brain” that reads your question and writes the answer. OpenAI says this brain is faster and cheaper, but still as good as its most costly “Thinking” brains on health tests. It is free for all ChatGPT users, but you can only ask so many questions each day.
OpenAI says the upgrade is safer too. It says wrong health answers dropped by 71% over two months. In its tests, the new brain beat the older one (called GPT-4o). It also beat answers written by human doctors in all five scoring areas. To check the answers, OpenAI used more than 260 doctors from 60 countries. These doctors looked at over 700,000 AI answers.
One number shows why this matters. OpenAI says more than 230 million people use ChatGPT every week for health questions. People ask it to explain lab results, get ready for doctor visits, or understand their insurance.
The honest caveat
Here is the catch. These numbers come from OpenAI’s own tests. That is not the same as a fair test done by outsiders. The tests were called HealthBench and HealthBench Professional. OpenAI built these tests itself. So the results look good, but no outside group has independently confirmed them yet.
What did the Nature studies find?
The Nature research looked at two different AI systems. Both did well. But the second one comes with a surprising warning.
MIRA — for emergency cases
MIRA stands for Medical Intelligence for Reasoning and Action. Researchers in Germany built it, at TUD Dresden and Heidelberg University. Its job is to find out what is wrong with emergency patients. To do this, it can choose from over 85,000 options across eleven tools. For example, it can order the right scan or test.
MIRA got the right answer in 88.9% of more than 500 emergency cases. In a direct test on 311 cases, MIRA scored 87.8%. Specialist doctors scored 78.1%. A mix of junior and senior doctors scored 71.1%. Best of all, MIRA never missed a patient who really needed to go into hospital.
AMIE — for managing patients over time
AMIE is Google’s system. It helps care for patients across many visits, not just one. In the study, AMIE’s first-visit plans were judged good in 95% of cases. Human doctors scored 72%. AMIE was as good as doctors on choosing treatments. It beat them on making correct plans and on following medical rules. Specialist doctors, and the actors who played the patients, often liked AMIE more than the human doctors.
Key facts at a glance
| System / claim | AI score | Doctors | Source |
|---|---|---|---|
| MIRA, 311 emergency cases | 87.8% | 78.1% specialists / 71.1% mixed | Nature study |
| MIRA, 500+ cases overall | 88.9% correct | — | Nature study |
| AMIE, first-visit plan appropriate | 95% | 72% | Nature study |
| ChatGPT instruction-following | up to 89.9% | — | OpenAI (HealthBench) |
| ChatGPT wrong-statement drop | down 71% in 2 months | — | OpenAI |
| Doctors who reviewed ChatGPT | 260+ from 60 countries | — | OpenAI |
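
To make the gaps easier to compare, the short sketch below subtracts the doctor score from the AI score for each row of the table that reports both. The scores are copied from the table. The row labels and the `point_gap` helper are names chosen here for illustration.

```python
# Scores copied from the table above, in percent: (AI score, doctor score).
results = {
    "MIRA vs specialists (311 cases)": (87.8, 78.1),
    "MIRA vs mixed junior and senior doctors (311 cases)": (87.8, 71.1),
    "AMIE first-visit plan vs doctors": (95.0, 72.0),
}


def point_gap(ai_score: float, doctor_score: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(ai_score - doctor_score, 1)


gaps = {name: point_gap(ai, doc) for name, (ai, doc) in results.items()}

for name, gap in gaps.items():
    print(f"{name}: AI ahead by {gap} percentage points")
```

These are percentage-point gaps, not relative improvements, and they describe simulated cases only.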
The catch: the tech may not age well
Here is the most interesting finding. Systems like AMIE add extra software around the AI brain. This extra layer is called scaffolding. Think of it as a support frame that helps the AI act like a careful doctor.
This scaffolding helped a lot with an older AI brain (Gemini 1.5 Flash). But when researchers used a newer brain (Gemini 2.5 Flash), the help “almost vanished.” On drug-knowledge tests, newer brains like o3 and GPT-5 already did well on their own. In simple words: as the basic AI gets smarter, the special medical add-ons may stop being useful. They risk becoming “dead weight.” A clever system built today could be out of date fast.
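For readers who code, the idea of scaffolding can be pictured as a wrapper that calls the AI brain several times with a fixed checklist instead of once. This is a toy sketch under stated assumptions, not how AMIE is built: `toy_model`, the `CHECKLIST` steps, and both helper functions are invented here for illustration.

```python
from typing import Callable

Model = Callable[[str], str]


def toy_model(prompt: str) -> str:
    """Stand-in for an AI brain; it just echoes what it was asked."""
    return f"[model reply to: {prompt}]"


# Hypothetical checklist steps; real medical systems are far more elaborate.
CHECKLIST = [
    "List the possible causes.",
    "Check the plan against clinical guidelines.",
    "Flag anything that needs urgent care.",
]


def without_scaffolding(model: Model, question: str) -> list[str]:
    """Ask the model the question directly, in a single call."""
    return [model(question)]


def with_scaffolding(model: Model, question: str) -> list[str]:
    """Call the model once per checklist step instead of once overall."""
    return [model(f"{step} Patient question: {question}") for step in CHECKLIST]


question = "I have had chest pain for two hours."
plain = without_scaffolding(toy_model, question)
scaffolded = with_scaffolding(toy_model, question)
print(len(plain), "call without scaffolding;", len(scaffolded), "calls with it")
```

If a newer brain already works through such steps on its own, the extra calls add cost and complexity without adding quality. That is the “dead weight” risk in plain terms.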
The researchers were honest about other limits too. MIRA still gave care that “deviated from best practices” in a small number of cases — small, but not zero. And outside experts made one key point. These were all simulations. A simulation is a pretend test, not real life. So they are far from the messy, complex world of real, everyday healthcare.
Why it matters (especially for India / founders)
India has far fewer doctors per person than rich countries. Many people also live far from good hospitals. So tools that help sort patients by need, or explain lab reports, could be very useful here. They could help tired, busy doctors — not replace them.
For health-tech founders (people who start health technology companies), the aging warning is the real lesson. If your product is just a thin layer on top of an AI brain, a future brain could wipe out your edge overnight. Lasting value comes from things AI cannot easily copy. These include trusted data, support in local languages, partnerships with doctors, and safe use. Investors (people who put money into companies) are watching this area closely — see related coverage in our roundup on HealthQuad’s healthcare fund.
For patients, the simple rule still holds. AI can help you understand and prepare. But it is not your doctor. Always check anything serious with a trained professional.
FAQ
Can ChatGPT replace my doctor now?
No. The high scores come from controlled tests and OpenAI’s own benchmarks (its own exams for the AI). Real care is messier. Use AI to learn and prepare. Then see a real doctor for diagnosis and treatment.
Did AI really beat doctors in these studies?
On certain tasks, yes. MIRA and AMIE scored higher than doctors on some sets of cases. But experts warn these were simulations (pretend tests), not real patients. So the results may not hold up in the real world.
What does “the tech won’t age well” mean?
Special medical add-ons help today’s AI a lot. But as the main AI gets smarter on its own, those add-ons may stop adding value. They could even become useless extra work.
Takeaway
AI in healthcare is moving fast, and the early scores are truly impressive. But the headlines are ahead of the proof. OpenAI’s numbers come from its own tests. The Nature wins came in simulations. And today’s special medical add-ons may stop being useful as the basic AI gets smarter. The promise is real. The hype needs a check-up.
Sources: The Decoder — ChatGPT’s new health upgrade and The Decoder — AI systems rival doctors in new Nature studies.