AI in NAM Education: chatbot accuracy and model limits for caregivers

The recent BMC Oral Health study that asked ChatGPT-4, Google Gemini and Microsoft Copilot about nasoalveolar molding (NAM) offers the cleft-care community a clear, timely mirror of where artificial intelligence stands today: useful for background education, uneven on detail, and liable to change with the next model update.

Background

Nasoalveolar molding (NAM) is a presurgical infant‑orthodontic technique used to reshape nostrils, approximate alveolar segments, and improve nasal symmetry before primary cleft lip and palate repair. It’s usually started in the first weeks of life and combines an intraoral molding plate with nasal stents and periodic adjustments. The clinical literature — including multiple systematic reviews and meta‑analyses — finds NAM can improve short‑ and medium‑term nasal symmetry and certain nasolabial aesthetic parameters, although study quality and long‑term evidence vary.
The BMC Oral Health paper set out to measure three things simultaneously: the accuracy of AI‑generated NAM information, the reliability (consistency across models), and the comprehensibility of the answers parents and caregivers would actually read. It compared three modern conversational AIs — ChatGPT‑4, Google Gemini, and Microsoft Copilot — across a domain‑structured question set covering soft tissues, device mechanics, clinical indications, and patient management.

What the study found: the essentials

  • Across the three AIs, the study reported no statistically significant overall difference in correctness of the NAM answers; all three provided moderate accuracy overall, with “objectively true” responses in the 57–70% range.
  • When broken down by subtopic, ChatGPT‑4 outperformed the others specifically in the “soft tissues” domain, with more “Objectively True” responses than Gemini or Copilot in that category. However, ChatGPT’s answers were also longer and more complex, which may lower accessibility for non‑clinical readers.
  • Gemini tended to respond in a mixture of objectively correct statements and selective facts, while Copilot produced shorter, simpler replies but was comparatively lower on objective correctness for NAM‑specific soft‑tissue details in this sample.
  • The authors emphasize that AI outputs used for NAM‑related education should be checked by qualified professionals and that AI’s knowledge base is rapidly evolving: a response given at one point in time may change as models learn from newly indexed web content.
These findings echo broader cross‑model studies in medicine where model performance varies by task, topic and version: ChatGPT family models often score better on many clinical tasks, but results are not universal and depend heavily on dataset, question format and evaluation metrics.

Why this matters for parents, clinicians and IT teams

NAM is a specialized, time‑sensitive intervention. Parents commonly turn to web search and chatbots for quick explanations about:
  • What NAM is and how it works
  • Expected benefits (nasal symmetry, reduced cleft width)
  • Practical burdens (frequent visits, plate adjustments)
  • Risks and what to expect during surgery
Incorrect, incomplete, or poorly worded answers can create false expectations, reduce adherence to treatment plans, or delay seeking expert care. The BMC Oral Health study explicitly highlights that while AIs can give a decent baseline of information, errors and omissions are common enough that clinician verification is essential.
From an IT and product perspective, the study also illustrates how design trade‑offs (conciseness versus depth, retrieval access versus static knowledge) influence usefulness for parent education. Systems that prioritize short, plain‑language explanations may be easier for families to use but risk omitting critical nuance; conversely, verbose, citation‑lean outputs may confuse non‑specialists despite being richer.

Technical and clinical context: what systematic reviews tell us about NAM outcomes

Multiple systematic reviews and meta‑analyses show a consistent pattern: NAM improves early nasal symmetry and some aspects of vermilion/nostril form compared with no presurgical orthopedics, but the magnitude and longevity of benefits vary and high‑quality randomized evidence is limited. Effect sizes are moderate, and some outcomes converge with non‑NAM presurgical appliances over longer follow‑up. Burden of care (clinic visits, caregiver workload) is higher for NAM versus passive approaches.
This clinical nuance is exactly the sort of context that AI systems must capture when answering parent queries: what NAM tends to improve, what it does not guarantee, what trade‑offs families face, and how team experience affects outcomes. The BMC study shows models can approximate these facts, but not consistently across every subtopic.

Deep dive: strengths and weaknesses of current conversational AIs for NAM information

Strengths

  • Fast baseline education: AI chatbots can deliver an instant primer on NAM mechanics, typical treatment timelines, and common benefits — a practical starting point for families exploring options.
  • Accessibility for clinicians: clinicians can use chatbots to generate checklists, draft patient leaflets, or write clinician reminders, saving time in routine patient education tasks. Cross‑discipline evaluations show LLMs can assist effectively with background content and standardized educational material.
  • Consistency on common facts: When facts are well represented on reputable web pages (hospital or society guidance), AI responses tend to converge on the same core messages, because models often retrieve or mirror widely available content.

Weaknesses and risks

  • Hallucinations and omissions: LLMs sometimes produce plausible‑sounding but false statements, or omit crucial caveats about indications and contraindications for NAM. This is well documented across specialties.
  • Variability by domain: Performance is topic‑dependent. The BMC study found that performance differed significantly by subtopic (soft tissues versus device mechanics), meaning the model you ask may be good at one aspect of NAM and weak at another.
  • Readability vs accuracy trade‑off: Models that produce longer answers (e.g., ChatGPT‑4 in this study) can be more accurate in some domains yet harder for parents to read; simpler outputs (e.g., Copilot) may be more readable but less complete or precise.
  • Temporal instability: Systems that do live web retrieval can change answers over time as the underlying web changes. An answer produced one day may differ the next. That instability complicates reproducibility and medico‑legal traceability.
  • Overtrust by lay users: Empirical work shows non‑experts tend to rate AI responses as authoritative and may act on them without clinician verification — a behavior that raises safety concerns in medical contexts.

Practical recommendations: how to use AI with NAM information safely

The BMC Oral Health study and broader AI‑in‑medicine literature imply a pragmatic, layered approach for clinicians, hospitals and product teams designing patient‑facing tools.

For clinicians and cleft teams

  • Treat AI outputs as starting drafts — always verify and adapt to the patient’s clinical picture before sharing.
  • Provide patients with a short, clinician‑vetted FAQ or handout that addresses the most common NAM questions (what it does, timeline, burdens, risks, alternatives). Use plain language and avoid long LLM‑style paragraphs.
  • If families use chatbots, ask them to bring printed or screenshotted AI answers to appointments so the team can correct misinformation in real time.
  • Document patient education interactions: if an AI‑generated handout or summary is used, note that a clinician reviewed and approved the content.

For hospitals, digital teams and vendors

  • Build retrieval‑constrained models or RAG (retrieval‑augmented generation) systems that limit answers to trusted institutional or peer‑reviewed sources, with clear provenance labels (see the sketch below). Cite the underlying guideline or article where possible.
  • Implement a “clinician in the loop” workflow for any AI content that will be published on patient portals — require review and sign‑off before release.
  • Provide short, layered explanations: a brief one‑sentence summary, a two‑paragraph plain‑English explanation, and a separate “for clinicians” technical note. This respects different literacy and information needs.
  • Log versions and timestamps for AI responses; enable static “snapshot” exports so clinicians can see exactly what a patient read at the time of a decision.
  • Offer AI “safe modes” for health: simple, citation‑rich responses and explicit refusal behaviors for high‑risk queries (e.g., dosing, emergent triage, personalized medical advice).
These practical steps mitigate the most significant failure modes noted in the literature: hallucination, overtrust, and temporal drift.
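
To make the retrieval‑constraint and sign‑off recommendations above concrete, the sketch below shows one minimal way to wire them together in Python: retrieval is limited to a clinician‑approved corpus, every answer carries provenance and a timestamp, the output is layered (one‑line summary, plain‑language text, clinician note), and nothing is published without sign‑off. The class and function names (CuratedSource, PatientAnswer, retrieve_curated, publish) are illustrative assumptions, not the study’s method or any vendor’s API, and the keyword match stands in for a real retrieval index.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CuratedSource:
    """A clinician-approved document the assistant is allowed to cite."""
    title: str
    url: str
    last_reviewed: str  # ISO date recorded by the reviewing clinician


@dataclass
class PatientAnswer:
    """Layered answer with provenance, mirroring the recommendations above."""
    one_line_summary: str
    plain_language: str
    clinician_note: str
    sources: list[CuratedSource]
    generated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    clinician_approved: bool = False  # must be set by a named reviewer


def retrieve_curated(query: str, corpus: list[CuratedSource]) -> list[CuratedSource]:
    """Naive keyword match over the curated corpus only; a real system would
    use a vector index, but retrieval never leaves the approved document set."""
    terms = query.lower().split()
    return [doc for doc in corpus if any(t in doc.title.lower() for t in terms)]


def publish(answer: PatientAnswer) -> None:
    """Refuse to surface anything that has not been signed off."""
    if not answer.clinician_approved:
        raise PermissionError("Clinician sign-off required before publication.")
    print(answer.one_line_summary)
    for src in answer.sources:
        print(f"Source: {src.title} (last reviewed {src.last_reviewed}) - {src.url}")
```

The essential design choice is that generation can only cite documents from the curated set, and the clinician_approved flag gates anything that reaches a patient portal.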

Editorial and regulatory considerations

  • Labeling and provenance: If an AI answer cites a hospital guideline or a peer‑reviewed review, make that provenance front‑and‑center. Evidence‑anchored answers reduce hallucination risk and improve clinician trust.
  • Liability and disclaimers: Licensed content and retrieval layers do not remove responsibility. Systems must include explicit disclaimers and escalation prompts (“consult your cleft team”) and avoid giving specific clinical directives without clinician review.
  • Update cadence: Medical advice evolves. Provide a “last reviewed” and “next review due” date on any AI‑generated patient education page (a small metadata sketch follows this list). This addresses a concrete problem raised by model drift and web changes.
  • Research transparency: When deploying AI tools in patient education, publish evaluation protocols (what was tested, who reviewed it, accuracy metrics) so the community can audit safety and efficacy claims.
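
One way to operationalize the provenance and update‑cadence points above is a small review‑metadata record attached to every AI‑generated page, with a staleness check run on a schedule. The sketch below is a hypothetical structure; the field names and dates are invented for illustration, not a standard schema.

```python
from datetime import date

# Hypothetical review metadata for an AI-generated patient-education page.
# Field names and dates are illustrative only, not a standard schema.
REVIEW_METADATA = {
    "topic": "Nasoalveolar molding (NAM) overview",
    "last_reviewed": date(2024, 6, 1),
    "next_review_due": date(2025, 6, 1),
    "reviewed_by": "Named clinician reviewer",
    "sources": ["Institutional cleft-team guideline", "Peer-reviewed NAM review"],
}


def needs_review(meta: dict, today: date | None = None) -> bool:
    """Return True once the page is past its review-due date, so model drift
    and guideline changes are caught on a schedule rather than by accident."""
    today = today or date.today()
    return today >= meta["next_review_due"]
```

Rendering last_reviewed and next_review_due on the page itself gives families the dating cue the list above calls for.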

Where the evidence is thin, and what the BMC authors flagged as limitations

The BMC Oral Health authors were cautious: their sample covered a specific time window and a finite set of question prompts, and they used a small evaluator panel and Likert‑scale scoring. They noted that asking the same prompts later could yield different answers as models update, and they recommended larger, multi‑centre evaluation protocols for future work. Those are reasonable caveats; they matter because LLM performance is time‑dependent and context‑sensitive.
Beyond the study’s internal limitations, key uncertainties remain in the literature: while NAM commonly improves early nasal form, long‑term comparative benefits versus passive presurgical appliances remain debated and sensitive to surgical technique and team experience. Any AI system that gives categorical, undated claims about long‑term outcomes is therefore on shaky ground.

Concrete examples: good and bad AI outputs for NAM (what to watch for)

  • Good output: a short, accurate explanation of NAM’s mechanism, a caveated success summary (“often improves early nasal symmetry; long‑term benefit varies”), a plain‑language list of typical visits and caregiver responsibilities, and a prompt to contact the cleft team for personalised guidance. (Ideal: includes citations to institutional guideline or peer‑reviewed review.)
  • Bad output: a confident, prescriptive claim that “NAM will prevent future surgeries” or numeric success rates without citation; advice on plate manipulation or home adjustments that should only be delivered in clinic; or omission of common burdens and potential for extra clinic visits.
The BMC study found that AI responses spanned both of these patterns: often accurate, but sometimes incomplete or too technical, underlining the need for clinician oversight.

Industry perspective: product design and deployment checklist

For product teams building patient education features around NAM or other specialized treatments, use this checklist:
  • Restrict the AI’s knowledge domain to curated, clinician‑approved content for high‑risk topics.
  • Present multi‑layered explanations (one‑line summary + plain text + clinician note).
  • Surface provenance and last‑reviewed dates prominently.
  • Provide an easy “flag this answer” feature that routes questionable AI outputs back to the cleft team for review.
  • Log all queries and responses for post‑hoc audit and continuous improvement (see the logging sketch below).
  • Include human review (a named clinician reviewer) before publishing AI‑generated FAQs or handouts.
These design patterns reduce risk while retaining the time‑saving benefits of AI for patient education.
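
The audit‑log and “flag this answer” items are simple to prototype. The sketch below assumes a JSONL file and hypothetical function names; it records a timestamped snapshot of what a family actually saw (query, answer, model version) and routes a questionable response back to the cleft team.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumed log location; a real deployment would use durable, access-controlled storage.
AUDIT_LOG = Path("nam_education_audit.jsonl")


def log_interaction(query: str, answer: str, model_version: str) -> dict:
    """Append a timestamped snapshot of the query, answer and model version."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "query": query,
        "answer": answer,
        "flagged": False,
    }
    with AUDIT_LOG.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def flag_for_review(record: dict, reason: str) -> dict:
    """Mark a logged answer as questionable and route it to the cleft team."""
    record["flagged"] = True
    record["flag_reason"] = reason
    # A real system would raise a ticket or task here for a named clinician reviewer.
    return record
```

Because each record carries a timestamp and model version, the same log can serve as the static “snapshot” export recommended earlier for reconstructing exactly what a patient read.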

Final assessment: where AI helps, where it must not lead

AI chatbots today are good at baseline NAM information: explaining what the device looks like, general timing, and typical caregiver tasks. They are less reliable at nuanced clinical judgement, personalized risk‑benefit calculations, and long‑term outcome probabilities. The BMC Oral Health study demonstrates moderate accuracy for NAM queries across major LLMs but also shows meaningful variation by subtopic and model — a pattern replicated in other medical comparative studies.
In short:
  • Strength: AI can accelerate and standardize the first pass of patient education content and free clinicians’ time for higher‑value interactions.
  • Risk: Left unchecked, AI can mislead parents with incomplete or hard‑to‑interpret information, and users often confer undue authority on AI outputs.
  • Best path: Use AI as an assistant, never as a substitute for clinician judgement; deploy with provenance, clinician sign‑off and clear escalation pathways.

Conclusion

The BMC Oral Health investigation into what AI can tell the public about NAM is an important, practical contribution: it measures the real‑world performance of current LLMs on a narrow, high‑impact clinical topic and reveals a mixed but promising picture. For clinicians and product teams, the takeaways are concrete: harness AI for background education and workflow efficiency, but design systems so a qualified clinician remains the final arbiter of any medical advice. The cleft community — teams, parents and technology teams — must collaborate to turn AI’s speed into safe, evidence‑anchored support for families navigating NAM and other specialized treatments.

(Challenge and caveat: LLMs are updated continuously. The comparative performance reflected in this article is tied to model versions and the evidence base at the time of evaluation; results will shift as models, licensed datasets, and retrieval pipelines evolve — so continuous monitoring and re‑evaluation are essential.)

Source: BMC Oral Health, “What artificial intelligence (AI) can tell us about Nasoalveolar Molding (NAM)?”