This comparative evaluation of five leading conversational AIs in infectious disease education — ChatGPT 3.5, Google Bard (Gemini), Perplexity AI, Microsoft Copilot, and Meta AI — presents both an encouraging and a cautionary picture for educators and clinicians. A recent peer-reviewed analysis using 160 multiple-choice questions drawn from 20 clinical case studies reports that ChatGPT 3.5 led numerically with 65.6% accuracy, closely followed by Perplexity AI (63.2%), Microsoft Copilot (60.9%), Meta AI (60.8%), and Google Bard (58.8%). The study also found clear strengths in symptom identification but consistent weaknesses in therapy and antimicrobial dosing recommendations, alongside measurable instability when identical prompts were re-submitted 24 hours later.
Background
Why this matters now
Large language models (LLMs) and retrieval-augmented systems are rapidly entering classrooms, clerkships, and clinical decision-support workflows. Their ability to generate exam-style answers, create study notes, and suggest differential diagnoses makes them compelling tools for medical education and training. However, the same probabilistic mechanisms that make LLMs versatile also expose them to factual errors, hallucinations, and inconsistent therapeutic reasoning — risks that are particularly consequential when recommendations touch on antimicrobial selection and dosing. The Frontiers study provides a focused lens on how current consumer-facing AIs perform on infectious disease case-based MCQs, quantifying accuracy, diagnostic strength, and stability.
The prior landscape of AI in medical education
Multiple independent investigations have measured LLM performance on medical MCQs and clinical prompts. Results vary by model, version, question format (MCQ vs free response), and domain specificity. For example, systematic reviews place ChatGPT's pooled accuracy on medical questions near the mid-50% range across heterogeneous studies, while targeted comparisons show substantial variance between models and by question taxonomy (recall vs problem-solving). Real-world clinical decision tasks — particularly therapeutic dosing and individualized drug choices — are consistently more challenging for LLMs than recall-style questions. These bench findings contextualize the new comparative results in infectious disease.
Overview of the Frontiers study: scope, methods, and core results
Study design (brief)
- Source material: 20 clinical case studies from "Infectious Diseases: A Case Study Approach" by Jonathan C. Cho.
- Questions: 7–10 MCQs per case → total of 160 questions spanning symptom recognition, microorganism identification, diagnostics, prevention, and therapy (including antimicrobial selection).
- Prompting: Standardized prompts — case text + MCQs — submitted to each AI platform with no additional context or system training data.
- Evaluation: Responses checked against the textbook answer key. Accuracy measured as percent correct; consistency measured by repeating the identical prompts after 24 hours and comparing results.
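To make the scoring protocol concrete, here is a minimal sketch of how accuracy and 24-hour retest agreement could be computed from recorded answers; the data layout and field names are illustrative assumptions, not the study's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class GradedRun:
    """One platform's answers to the MCQ set on a given day (illustrative layout)."""
    model: str
    answers: dict  # question_id -> chosen option, e.g. {"case07_q3": "B"}

def accuracy(run: GradedRun, answer_key: dict) -> float:
    """Percent of questions whose chosen option matches the textbook answer key."""
    correct = sum(run.answers.get(q) == a for q, a in answer_key.items())
    return 100.0 * correct / len(answer_key)

def retest_agreement(day1: GradedRun, day2: GradedRun) -> float:
    """Percent of questions where the identical prompt produced the same choice 24 hours later."""
    shared = day1.answers.keys() & day2.answers.keys()
    same = sum(day1.answers[q] == day2.answers[q] for q in shared)
    return 100.0 * same / len(shared)

# A 7.5-point accuracy drop on retest (as reported for ChatGPT 3.5) would appear as
# accuracy(day1_run, key) - accuracy(day2_run, key) == 7.5
```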
Key numerical findings
- Overall accuracy (top to bottom): ChatGPT 3.5 — 65.6%, Perplexity AI — 63.2%, Microsoft Copilot — 60.9%, Meta AI — 60.8%, Google Bard — 58.8%.
- Best-performing content domain: Symptom identification — 76.5% average accuracy.
- Weakest domain: Therapy-related questions — 57.1% average accuracy (antimicrobial selection and dosing particularly problematic).
- Diagnostic vs therapeutic split for ChatGPT 3.5: diagnostic accuracy 79.1% vs antimicrobial recommendations 56.6%.
- Stability: Microsoft Copilot demonstrated the most stable responses on repeated testing; ChatGPT showed a 7.5% drop in accuracy on retest; Perplexity and Meta AI displayed notable variability in individualized treatment suggestions.
What the numbers mean: strengths and shortfalls
Strengths — where AIs add educational value
- High performance on recognition and recall tasks. Models consistently handled symptom identification and straightforward microorganism recognition well. This aligns with prior work showing that LLMs excel at recall and other lower-order tasks in Bloom's taxonomy. Using AIs as rapid reviewers or quiz generators for fact-based study is a practical early win.
- Rapid, scalable tutoring potential. For learners who need iterative clarifications, quick explanations, or exam-style practice, these models can supply helpful scaffolding in seconds — especially when supervised by an instructor who validates content.
- Cross-model parity on many non-therapeutic tasks. Differences in overall accuracy among leading models were modest (range ~58.8%–65.6%), implying that multiple vendors can serve similar educational use-cases with appropriate guardrails.
Weaknesses — where caution is essential
- Therapeutic decision-making is brittle. Antimicrobial selection, dosing adjustments, and patient-specific pharmacotherapy require structured clinical context (weight, renal function, drug interactions, local antibiograms) that these consumer-facing systems often lack or fail to reason about reliably. The Frontiers study documents a striking gap: strong diagnostic answers but frequent errors when asked to recommend or dose antimicrobials.
- Instability and time-varying outputs. Models that perform web retrieval or are updated frequently can change answers for the same prompt over a short interval. Frontiers’ repeated-testing protocol revealed this instability, with some systems giving different antimicrobial recommendations 24 hours later — a red flag for reproducibility and trust in longitudinal learning or clinical use. Broader audits of chatbots show similar drift when systems become more web-grounded.
- Multiple-choice format can overstate capability. Recent research demonstrates that MCQs sometimes inflate LLM performance compared with open-ended free-response assessments; models may exploit answer-format cues rather than deep reasoning. The Frontiers MCQ-based design is valid for educational testing but should be interpreted alongside free-text performance metrics.
Stability, retrieval, and the web-grounding problem
Why answers change over time
Several vendors have moved from "static" models (knowledge cutoff + refusal when unsure) to web-augmented assistants that perform live retrieval. This reduces refusal rates but increases exposure to low-quality or deliberately manipulated web content, creating an attack surface that results in confident yet inaccurate outputs. External audits (e.g., large-scale misinformation monitors) have documented that increased web access correlates with higher false-claim repetition in news contexts — the same mechanism can perturb clinical recommendations when retrieval heuristics surface poor-quality or out-of-context treatment guidance.
Practical implications for educators and clinicians
- Do not assume consistency. An AI-generated treatment plan produced yesterday may not be reproducible tomorrow. For learning artifacts that students or educators will rely on, lock the prompt-and-response pair as a static teaching resource and annotate it with human verification.
- Prefer retrieval-constrained or citation-enabled modes. Where available, use AI configurations that either (a) restrict answers to a validated knowledge base (institutional guidelines, local antibiograms), or (b) provide transparent source citations so a clinician can verify recommendations quickly. Retrieval without provenance increases risk.
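A minimal sketch of that retrieval-constrained pattern follows, assuming an institution maintains a small vetted corpus (local guidelines, antibiogram); the corpus contents, retrieval function, and prompt wording are hypothetical placeholders rather than any vendor's actual API.

```python
# Hypothetical retrieval-constrained workflow: the model may answer only from an
# institution-vetted corpus, and every recommendation must cite a document id.

VETTED_CORPUS = {
    "local-antibiogram-2024": "Placeholder text for the institutional antibiogram.",
    "uti-guideline-v3": "Placeholder text for the local urinary tract infection guideline.",
}

def retrieve(query: str, corpus: dict, k: int = 2) -> list:
    """Toy keyword scoring over the vetted corpus only (stand-in for a real retriever)."""
    scored = sorted(
        corpus.items(),
        key=lambda item: -sum(word in item[1].lower() for word in query.lower().split()),
    )
    return scored[:k]

def build_prompt(question: str, corpus: dict) -> str:
    """Constrain the assistant: answer only from retrieved excerpts and cite each document id."""
    context = "\n\n".join(f"[{doc_id}]\n{text}" for doc_id, text in retrieve(question, corpus))
    return (
        "Answer using ONLY the excerpts below. Cite the document id for every recommendation. "
        "If the excerpts do not cover the question, say that they do not.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

A clinician reviewing such output can trace each recommendation back to a named, versioned document instead of an opaque web result.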
Cross-referencing with other studies: agreement and divergence
- Independent comparative work in specific specialties (e.g., autoimmune liver disease) shows a similar model ordering and the same pattern of strengths and weaknesses: good for general clinical reasoning and recall, weaker on specialized dosing and nuanced therapy decisions. Specialist panels rated models differently, but the overall trend held.
- Meta-analyses and systematic reviews of ChatGPT in medicine report wide heterogeneity but place overall medical-query accuracy in the mid-50% to mid-70% range depending on task and version, matching the Frontiers results for MCQ-style evaluation while underscoring caution about over-generalization across tasks.
- Benchmarks that compare MCQ vs free-response performance show that MCQs can inflate apparent competence. Free-response formats typically reveal larger performance gaps and less robustness, reinforcing the Frontiers study’s warning that correct MCQ answers do not equal safe clinical recommendations.
Risks, ethics, and patient safety concerns
- Therapeutic errors carry high stakes. Incorrect antimicrobial selection or dosing can lead to treatment failure, toxicity, and broader public health harms through antimicrobial resistance. Any clinical application of these tools must involve clinician validation and safeguards.
- Overreliance and deskilling. Unchecked use of AI could encourage learners to accept model answers uncritically, undermining development of clinical judgment. Educational programs must design assessments and curricula that preserve critical reasoning.
- Provenance and liability. When an AI cites a recommendation, institutions must determine whether the provenance is trustworthy and who is responsible if AI-suggested therapy contributes to harm. Legal and regulatory frameworks lag model deployment.
- Information laundering via the web. As red-team audits show, disinformation networks can manipulate the retrieval landscape; medical content is not immune. Systems that ground answers in unvetted web content risk importing flawed or maliciously crafted guidance.
Practical recommendations — for educators, clinical educators, and product teams
For medical educators and institutions
- Use AIs as adjunct study tools, not as authoritative treatment sources. Frame them as interactive study aids for recall, reasoning practice, and formative feedback.
- Integrate verification checkpoints into AI-enabled learning tasks (a sketch of such a checkpoint follows this list):
  - Require students to supply the clinical reasoning path and explicit sources when they accept an AI answer.
  - Use instructor-curated question banks or local guideline bundles to constrain AI retrieval.
- Maintain an AI-literacy module that trains students to identify hallucinations, assess provenance, and cross-check recommendations.
- Favor assessment designs that evaluate free-text reasoning and clinical decision-making — not just MCQ recall — to avoid inflating student competence via AI assistance.
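One way to operationalize the verification checkpoint above is a structured record that a course platform refuses to mark complete until the human-verification fields are filled in; the record below is an illustrative sketch, not a feature of any particular learning system.

```python
from dataclasses import dataclass, field

@dataclass
class AIAssistedAnswer:
    """Illustrative checkpoint record: an AI answer counts as 'accepted' only once the
    student's reasoning, their cited sources, and an instructor sign-off are attached."""
    question_id: str
    ai_platform: str
    ai_answer: str
    student_reasoning: str = ""
    cited_sources: list = field(default_factory=list)
    instructor_verified: bool = False

    def is_complete(self) -> bool:
        """True only when every verification field has been supplied."""
        return bool(self.student_reasoning) and bool(self.cited_sources) and self.instructor_verified
```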
For clinicians and health systems
- Treat consumer-facing AI therapeutic recommendations as hypotheses requiring clinician validation.
- Deploy enterprise-grade, retrieval-constrained, or guideline-locked LLMs for clinical decision support, with audit trails and escalation workflows for high-risk decisions.
- Monitor model outputs over time and periodically revalidate any AI-based clinical protocols because model updates or retrieval changes can alter behavior unpredictably.
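A minimal sketch of the periodic-revalidation idea, assuming the health system keeps a fixed, clinician-vetted benchmark of prompt/answer pairs and re-scores the deployed model on a schedule; the tolerance threshold and callable interface are assumptions for illustration.

```python
# Illustrative drift check: re-run a fixed, clinician-vetted benchmark against the deployed
# model on a schedule and flag revalidation when accuracy slips below the signed-off baseline.

def score_benchmark(ask_model, benchmark) -> float:
    """ask_model: any callable mapping a prompt string to an answer string.
    benchmark: list of (prompt, expected_option) pairs, e.g. ("...case text...", "B")."""
    correct = sum(
        ask_model(prompt).strip().upper().startswith(expected)
        for prompt, expected in benchmark
    )
    return 100.0 * correct / len(benchmark)

def needs_revalidation(current_accuracy: float, baseline_accuracy: float, tolerance: float = 5.0) -> bool:
    """Flag the protocol for clinician review if accuracy drops more than `tolerance` points."""
    return (baseline_accuracy - current_accuracy) > tolerance

# Example: a drop of the size seen on the study's 24-hour retest (7.5 points) would trip this check.
```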
For AI vendors and researchers
- Prioritize explainability, provenance, and stability:
  - Provide source-level citations and confidence estimates for therapeutic content.
  - Implement versioning and changelogs so clinical customers can audit behavior over time.
- Consider ensemble verification layers (verifier models or rules-based checks) specifically targeted at pharmacotherapy and dosing tasks to reduce unsafe outputs; a sketch of a rules-based dose check follows this list.
- Partner with clinical experts to curate and test domain-specific retrieval corpora (guidelines, formularies, antibiograms) to improve safety for treatment recommendations.
- Fund longitudinal, prospective clinical validation trials before positioning models as decision support tools in patient care.
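As one illustration of a rules-based verification layer aimed at dosing outputs, the sketch below checks a proposed dose against a curated reference range before the answer is surfaced; the drug table, dose ranges, and extraction logic are simplified placeholders, and a real deployment would draw its ranges from a clinically maintained formulary with renal and weight adjustments.

```python
import re

# Hypothetical curated adult dose ranges in mg per dose (normal renal function).
# In practice these would come from an institutional formulary, not a hard-coded table.
DOSE_RANGES_MG = {
    "vancomycin": (750, 2000),
    "ceftriaxone": (1000, 2000),
}

def extract_dose_mg(answer: str, drug: str):
    """Rough extraction of the first '<number> mg' that follows the drug name in a model answer."""
    match = re.search(rf"{re.escape(drug)}\D*?(\d+(?:\.\d+)?)\s*mg", answer, flags=re.IGNORECASE)
    return float(match.group(1)) if match else None

def dose_within_range(answer: str, drug: str):
    """True/False when a dose was found and checked; None when there is nothing to check."""
    dose = extract_dose_mg(answer, drug)
    if dose is None or drug not in DOSE_RANGES_MG:
        return None
    low, high = DOSE_RANGES_MG[drug]
    return low <= dose <= high

# A verifier layer would block or flag answers where dose_within_range(...) is False
# and route them to a clinician rather than surfacing them directly to the user.
```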
Methodological strengths and limitations of the Frontiers study
Strengths
- Domain-focused and case-based: Using real published case studies ensures clinical relevance beyond synthetic or isolated question banks.
- Controlled prompting: Standardized prompts across models reduce confounders attributable to prompt engineering variability.
- Repeatability test: The 24-hour retest provides data on temporal stability — an actionable and rarely assessed property.
Limitations and caveats
- MCQ format bias: As broader research indicates, multiple-choice designs may overstate LLM reasoning ability compared with free-response tasks. The study's MCQ focus should therefore be interpreted as one dimension of competence, not definitive proof of safe clinical reasoning.
- Model versions and real-time updates: The snapshot reflects specific model versions and configurations at test time. Web-grounded models or vendor updates can change performance rapidly; results may not generalize across time without re-evaluation.
- Lack of patient-specific data: Therapeutic questions often require granular patient variables (renal function, weight, allergies, concurrent meds). The standardized prompts may not capture the real-world information density clinicians use when prescribing, which exacerbates the therapeutic performance gap.
Larger trends and where research must go next
- Prospective, clinician-supervised trials that assess AI suggestions in simulated or real clinical workflows — particularly in antimicrobial stewardship — are needed to move from bench performance to safe deployment.
- Comparative evaluations should include free-response and context-rich prompts to better approximate clinical reasoning demands.
- Research must explore ensemble verification approaches (separate verifier models, curated retrieval, and rules engines) and measure their impact on therapeutic accuracy and stability.
- Longitudinal audits are crucial; vendors and institutions must track model drift and retrieval poisoning over time to maintain trust in deployed AI aids.
Conclusion
The Frontiers comparative analysis offers a careful, domain-specific snapshot of where consumer-facing AIs stand in infectious disease education: useful for recognition and formative learning, limited and unstable for therapeutic decision-making. ChatGPT 3.5 scored highest on the MCQ set, but no model demonstrated reliable, reproducible competence in antimicrobial selection and dosing — the very areas where errors could translate into patient harm. Stability concerns introduced by live retrieval and vendor updates compound the risk.
For educators and clinicians, the pragmatic path forward is clear: harness AI for low-risk educational tasks, pair AI outputs with mandatory human verification for clinical recommendations, and demand provenance, reproducibility, and guideline-anchored retrieval from vendors before adopting models as decision-support tools. Continued independent benchmarking, prospective clinical validation, and vendor transparency are essential to convert the promise of LLMs into safe, dependable medical teaching and clinical support systems.
Source: Frontiers | Evaluating AI Performance in Infectious Disease Education: A Comparative Analysis of ChatGPT, Google Bard, Perplexity AI, Microsoft Copilot, and Meta AI