It’s official: the robots know a lot about baby bumps, but they’re not quite ready to deliver the news on their own. In a brisk faceoff pitting ChatGPT-3.5, ChatGPT-4.0, and Microsoft Copilot against the mysteries of obstetric ultrasound, AI proved it’s no slouch—if you want accuracy, consistency, and perhaps a sprinkle of grammatical pizzazz. Still, if you’re expecting your chatbot to moonlight as an OB-GYN, you might need a real, living, breathing doctor on speed dial.
AI vs. Obstetric Ultrasound: The Shootout
The study in question tossed twenty tricky ultrasound queries and a stack of 110 real-world reports into the digital laps of the world’s favorite LLMs. Results? A clear win on paper for ChatGPT-3.5 and 4.0 in both accuracy and consistency across those questions, with Copilot trailing—but not stumbling face-first.

The lead shrank in the cold, statistical light: no model was significantly better than the others, at least when sample sizes were small (cue the tiny violin for overlooked P-values). It’s either a testament to the skill of all three, or a subtle hint that we need a bigger test batch—preferably with more plot twists.
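To make the small-sample point concrete, here is a minimal sketch of the kind of check involved, assuming a Fisher's exact test via SciPy. The paper does not say which test it used, and the correct/incorrect counts below are illustrative placeholders rather than the study's actual tallies; only the twenty-question setup comes from the article.

```python
# Minimal sketch: why a visible accuracy gap can fail a significance test
# when each model answers only twenty questions. The counts below are
# ILLUSTRATIVE placeholders, not the study's actual per-model tallies.
from scipy.stats import fisher_exact

n_questions = 20              # questions per model, as in the study
chatgpt_correct = 18          # hypothetical: 90% accuracy
copilot_correct = 15          # hypothetical: 75% accuracy

contingency = [
    [chatgpt_correct, n_questions - chatgpt_correct],  # ChatGPT: correct / incorrect
    [copilot_correct, n_questions - copilot_correct],  # Copilot: correct / incorrect
]
odds_ratio, p_value = fisher_exact(contingency)
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.3f}")
# With n = 20 per group, even this 15-point gap lands well above p = 0.05,
# so neither model can be declared significantly better.
```

In other words, at this sample size even a sizeable accuracy gap fails to clear p < 0.05, which is exactly why the on-paper lead evaporated under statistical scrutiny.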
Yet, when asked to parse and interpret actual ultrasound reports, ChatGPT flexed its language-model muscles. Both 3.5 and 4.0 soared above Copilot in accuracy (over 84% for both ChatGPT versions, a modest 77% for Copilot), and their consistency almost made you believe they’d been chatting with sonographers over coffee. But—and there’s always a but—even the best digital brains botched it when it came to red-flagging fetal growth abnormalities. Accuracy plummeted, suggesting AIs are more comfortable with text than with nuance (or, you know, tiny humans and centimeters).
The Double-Edged Scalpel: Coherence, Care, and Catastrophe
AI’s bedside manner? Surprisingly warm. The chatbots cranked out explanations that a layperson—or a panicked parent—might actually understand, delivering recommendations with a gentle nudge toward “consult your doctor.” It’s empathy by code, and some studies show patients actually prefer it. Maybe the real surprise is that bedside manner is now up for competition.

But the same chatbots occasionally stepped on their own shoelaces. Copilot, for instance, called out a “placenta previa” where there was none, threatening to send entire families into premature panic mode. ChatGPT-4.0, usually the Hermione Granger of the group, tripped over placental maturity classifications. And nowhere did anyone stop and ask for clarification, as a real clinician (or an internet-savvy parent) would.
Training Woes and Hallucination Blues
The study pulls no punches: blame it on the training data. LLMs feast on a never-ending buffet of internet content—think Wikipedia marathons and speculative “Dr. Google” forum posts. This means they aren’t always up-to-date with the fine print of clinical guidelines, let alone sensitive to how standards vary across countries or patient groups. The result is errors that can range from small slip-ups to what can only be described as digital daydreaming (hallucinations, if you want the clinical term).

Structure, Style, and the Quest for the Perfect AI Doctor
Digging through the generated reports was like finding three doctors with very distinct personalities: ChatGPT-3.5 cuts to the chase, offering brief, clear answers and a quick word of advice for next steps. ChatGPT-4.0 is the detail-oriented overachiever, analyzing every detail and providing neat executive summaries. Copilot, meanwhile, is the stickler for structure, walking through each report line by line before offering its final verdict.

So, who would you want reading your ultrasound? It turns out, it depends whether you value economy of words, thoroughness, or just a really well-formatted answer.
The Undercurrents of Risk: Security, Scope, and Sample Size
For all their promise, these language models aren’t exactly audit-proof medical experts. Cybersecurity concerns lurk in the shadows—a little prompt here, a stray “adversarial attack” there, and suddenly your diagnosis could take a sharp detour. The study also notes that results might vary with a bigger sample size or different types of medical data, and that responses were judged by ultrasound docs at different stages of their careers. In other words, your mileage may vary.

A Future Worth Watching — But Not Without a Human
The big takeaway? AI chatbots are already capable of lending a hand—especially if you need complicated OB concepts distilled for patient-friendly dinner conversation. But no one should be outsourcing clinical decisions to them anytime soon. They still lack the ability to incorporate nuanced histories, understand shifting clinical standards, or double-check misunderstood questions.

The bottom line: for patient education and streamlined communication, these AIs are promising assistants, but as always in medicine, let the human have the last word. And if your chatbot says the placenta is in the wrong place, maybe get a second opinion—preferably from someone with, well, a medical degree.
Source: Performance of ChatGPT and Microsoft Copilot in Bing in answering obstetric ultrasound questions and analyzing obstetric ultrasound reports - Scientific Reports (Nature Portfolio)