CU Anschutz researchers are moving from proof‑of‑concept to practical deployment, delivering a set of validated, clinician‑centered tools designed to make Large Language Models (LLMs) and other health A.I. safer, auditable and useful at the bedside—efforts that combine peer‑reviewed measurement instruments, clinician‑facing prompt tooling, and infrastructure work that together lower the barrier to responsible clinical adoption.
Background
The last two years have seen generative A.I. – particularly LLMs – migrate out of research demos and into everyday clinical workflows: drafting notes, triaging inbox messages, summarizing longitudinal records, and assisting with patient education. These gains promise to reduce clinician administrative burden and reclaim time for direct patient care, but they also expose familiar failure modes: hallucination, over‑confidence (sycophancy), and drift as models and retrieval indices evolve. Effective, safe deployment therefore depends on measurement, tooling, and governance—not only better models. CU Anschutz’s Department of Biomedical Informatics (DBMI) has published a portfolio of interventions that address those three pillars: validated evaluation instruments for summarization, clinician‑centered prompt tooling, and efforts to make genomic references and analytic pipelines more inclusive and reproducible. These initiatives are explicitly paired with clinician‑in‑the‑loop workflows, auditable logs, and conservative defaults for patient‑facing answers.
Overview of the major developments
PDSQI‑9 — a validated instrument for evaluating LLM clinical summaries
- What it is: The Provider Documentation Summarization Quality Instrument (PDSQI‑9) is a nine‑item scoring instrument designed to evaluate LLM‑generated clinical summaries on organization, clarity, accuracy and clinical utility.
- Why it matters: Prior to PDSQI‑9, health systems lacked a short, validated tool that operational teams and procurement committees could use to measure whether a model’s output met minimal clinical standards before being permitted to enter workflows. PDSQI‑9 fills that gap by providing reproducible psychometric measures suitable for pilot testing and vendor comparisons.
- Validation evidence: In an open manuscript and its supporting materials, seven physician raters evaluated 779 summaries and answered 8,329 item‑level questions. The instrument demonstrated strong internal consistency (Cronbach’s alpha = 0.879) and high inter‑rater reliability (ICC ≈ 0.867), and factor analysis supported a four‑factor structure representing organization, clarity, accuracy and utility. Those results support PDSQI‑9’s use as a decision‑quality gate in clinical pilots.
- Practical implications: Hospitals can use PDSQI‑9 to:
- Quantify summary quality across specialties and clinical contexts.
- Set acceptance thresholds (for example, a minimum mean item score) prior to EHR integration; a minimal scoring sketch follows this list.
- Make score reporting a contractual requirement in procurement to enforce safety targets.
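To make the acceptance‑threshold idea concrete, here is a minimal sketch, assuming a hypothetical CSV of pilot rater scores with one row per (rater, summary) pair and columns item_1 through item_9; the file name, column names, rating scale, and 4.0 threshold are illustrative assumptions, not part of the published instrument. It computes per‑item means, a local Cronbach’s alpha, and a simple pass/fail gate.

```python
import pandas as pd

# Hypothetical layout: one row per (rater, summary) pair, columns item_1..item_9
# holding the PDSQI-9 item scores. The file name, column names, and the 4.0
# threshold are illustrative assumptions, not part of the published instrument.
scores = pd.read_csv("pdsqi9_pilot_scores.csv")
items = [f"item_{i}" for i in range(1, 10)]

item_means = scores[items].mean()        # per-item mean across raters and summaries
overall_mean = item_means.mean()         # single overall quality score for the pilot

# Cronbach's alpha for internal consistency among the nine items:
# alpha = k/(k-1) * (1 - sum(item variances) / variance of the item total)
k = len(items)
item_vars = scores[items].var(ddof=1)
total_var = scores[items].sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

THRESHOLD = 4.0  # example gate on a 1-5 scale; each site sets its own
print(f"Overall mean item score: {overall_mean:.2f} (gate: >= {THRESHOLD})")
print(f"Cronbach's alpha for local raters: {alpha:.3f}")
print("PASS" if overall_mean >= THRESHOLD else "FAIL: do not promote to EHR integration")
```

In practice the same gate would be reported separately by note type and specialty, per the list above.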
Cliniciprompt — reducing prompt engineering as a barrier to clinician adoption
- What it is: Cliniciprompt is a clinician‑facing prompting framework and toolkit that codifies high‑quality prompt templates, provides retrieval‑augmented generation (RAG) patterns to ground responses in curated content, and stores validated prompt examples for reuse. The design prioritizes reproducibility of prompts across users and specialties; a hypothetical retrieval‑grounded template is sketched after this list.
- Reported adoption and limits: CU Anschutz reports high uptake in internal pilots — approximately 90% nurse adoption and about 75% physician adoption in the environments described by the institution. Those figures come from internal rollout and conference presentations; independent, peer‑reviewed benchmarks of Cliniciprompt’s performance (for example, comparative reductions in hallucination rates across vendors) are not yet publicly available. Institutional reports demonstrate the tool’s practical value but should be treated as preliminary until independent evaluations appear.
- Why it matters: Prompt engineering is often the single largest usability friction when clinicians try to extract clinically usable output from LLMs. Cliniciprompt’s human‑centered approach shortens the learning curve, embeds retrieval constraints to reduce hallucination risk, and promotes re‑use of successful prompts—concrete steps to reduce brittle, ad‑hoc LLM behavior in daily practice.
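As a rough illustration of the retrieval‑anchoring pattern described in the first bullet above, here is a minimal sketch; build_grounded_prompt, the snippet fields, and the instruction wording are assumptions for illustration, not Cliniciprompt’s actual templates, which have not been published.

```python
from datetime import date

def build_grounded_prompt(question: str, snippets: list[dict]) -> str:
    """Assemble a retrieval-constrained prompt: the model is instructed to answer
    only from the supplied excerpts, cite them, and fall back conservatively
    when the excerpts do not cover the question."""
    context = "\n".join(
        f"[{i + 1}] ({s['source']}, last reviewed {s['reviewed']}): {s['text']}"
        for i, s in enumerate(snippets)
    )
    return (
        "You are drafting content for clinician review; this is not final medical advice.\n"
        "Answer the question using ONLY the numbered excerpts below and cite excerpt "
        "numbers for every claim. If the excerpts do not contain the answer, reply exactly: "
        "'Insufficient source material - escalate to a clinician.'\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        f"Date of draft: {date.today().isoformat()}\n"
    )

# Example usage with a single curated patient-education snippet
prompt = build_grounded_prompt(
    "How should a patient prepare for a fasting lipid panel?",
    [{"source": "patient education library", "reviewed": "2025-06-01",
      "text": "Fast 9-12 hours, drinking only water, before a lipid panel."}],
)
print(prompt)
```

Storing templates like this one alongside their validation results is what makes prompts reusable across users rather than ad‑hoc.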
Inclusive pangenomes and ancestry‑aware tools (genomics infrastructure)
- What it is: A major pangenome effort (sequencing 65 diverse human genomes and building 130 haplotype‑resolved assemblies) closed a large fraction of previous assembly gaps and revealed thousands of structural variants that single linear‑reference approaches miss. This work underpins better genotyping across ancestries and reduces representation bias in genomic interpretation pipelines.
- Why it matters for clinical AI: Diagnostic pipelines that rely on single linear references undercall variants in under‑sampled populations. A pangenome reference and associated tooling allow downstream AI and decision support systems to produce less biased results, improving equity in genetic diagnoses and research. CU Anschutz’s coverage emphasizes the practical translation challenge—clinical labs will need to re‑benchmark pipelines and provision compute for graph‑aware references.
Technical verification and cross‑checks
- PDSQI‑9 metrics: The instrument’s validation statistics—779 summaries, Cronbach’s alpha ≈ 0.879, ICC ≈ 0.867—are documented in the publicly posted manuscript and in the university press materials. These numbers are reproducible from the paper’s methods and results. The instrument’s validation used multiple LLMs (GPT‑4o, Mixtral 8x7b and Llama 3‑8b) and multiple clinical contexts, giving the results practical credibility for hospital pilots.
- Pangenome claims: The pangenome results are published in Nature and in an accessible PMC copy, confirming the 65‑individual / 130‑haplotype assemblies, gap‑closure numbers and resolved structural variants. Independent indexers (Nature, PubMed Central, NSF repository) corroborate these metrics, making the genomic claims high‑confidence.
- Cliniciprompt adoption: CU Anschutz pages and institutional conference coverage report the high adoption percentages for Cliniciprompt within pilot services. External third‑party confirmation beyond conference coverage and regional reporting is limited; therefore those adoption figures should be considered institution‑reported metrics rather than independently validated performance statistics. Flagging these numbers as institution‑reported is essential when procurement or regulatory decisions rely on independent validation.
Strengths: What CU Anschutz did well
- Measurement first: Publishing a validated evaluation instrument (PDSQI‑9) is a crucial step toward evidence‑based procurement and governance. It gives hospitals a concrete, reproducible way to measure LLM outputs against safety and usability targets before the tools reach patients.
- Human‑centered tooling: Cliniciprompt focuses on clinician workflows and prompt re‑use rather than model internals. That pragmatic emphasis accelerates adoption and lowers friction for busy clinicians who cannot spend hours tuning prompts. Coupled with retrieval‑anchoring, this reduces hallucination exposure in many drafting tasks.
- Broad translational focus: Coupling genomics infrastructure (inclusive pangenomes) with LLM evaluation tools addresses both upstream bias issues and downstream clinical workflow integration—an unusually broad, systems‑level approach that improves the chance of durable benefits.
- Practical implementation guidance: Public guidance and conference outputs emphasize required governance: clinician‑in‑the‑loop workflows, logging, versioning, conservative defaults on high‑risk queries, and contractual protections against opaque data use. These are the kinds of controls health systems need to manage legal and safety risk.
Risks and open gaps
- Reliance on institutional metrics and press releases: Several of the achievements (notably adoption percentages for Cliniciprompt and other pilot outcomes) are described primarily in institutional materials and conference presentations. Independent, peer‑reviewed replication or third‑party audits are not yet available for all claims—this increases procurement risk if teams rely on internal metrics alone. Where institutional claims drive purchasing, require external validation clauses in contracts.
- Scale and infrastructure costs: Pangenome adoption and graph‑aware genomic pipelines materially increase compute and storage demands. Hospitals and labs will need financial and technical plans to provision graph‑aware references, new genotyping workflows, and the long‑term storage of large reference datasets.
- Human factors and deskilling: As systems draft notes and messages, there is a risk clinicians gradually over‑trust AI drafts and stop verifying details—errors in medication names, dosages, or allergies can propagate quickly. CU’s playbook explicitly recommends design and training to keep clinicians in the loop; organizations that skip this risk patient safety.
- Equity and bias testing: LLMs and downstream analytics must be tested across languages, dialects and demographic cohorts. The pangenome work helps address genomic bias, but models that summarize or triage must likewise be stress‑tested for demographic performance disparities before scaling.
Practical checklist for hospitals, IT leaders, and procurement
- Governance board and policy
- Create a multidisciplinary governance board (clinicians, legal, privacy, IT, patient safety) to approve pilots and define acceptance thresholds.
- Measurement gates
- Require PDSQI‑9 (or equivalent validated instruments) as a pre‑deployment gate for any summarization tool; publish aggregate pilot results internally.
- Human‑in‑the‑loop workflows
- Mandate explicit clinician review and sign‑off before any AI‑generated content becomes part of the legal medical record.
- Retrieval and provenance
- Default to retrieval‑constrained answers for clinical queries and display provenance and “last reviewed” timestamps on patient‑facing materials.
- Logging and auditability
- Log prompts, model versions, responses and reviewer identities; store immutable snapshots for audits and quality improvement (see the audit‑record sketch after this checklist).
- Contractual protections
- Insist on BAAs (Business Associate Agreements), explicit prohibitions on unauthorized model retraining with patient data, rights to export logs, and exit clauses.
- Independent audits and red‑teaming
- Require third‑party safety audits and adversarial testing reports for vendors before system‑wide rollouts.
- Workforce training
- Train clinicians on common failure modes (hallucination, sycophancy, omission) and create short verification checklists.
- Equity testing
- Require vendors to provide stratified performance metrics by language, race/ethnicity, age, and other relevant subgroups.
- Pilots and scale strategy
- Run targeted pilots (90–120 days) with instrumented telemetry and a plan to publish or share de‑identified evaluation metrics.
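For the logging and auditability item above, here is a minimal sketch of what one immutable audit record could capture, assuming a hypothetical AuditRecord structure and a simple hash chain for tamper evidence; the field names and storage approach are illustrative, not a design reported by CU Anschutz.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditRecord:
    """One immutable entry per AI interaction: enough to replay and attribute it."""
    prompt: str
    model_name: str
    model_version: str
    response: str
    reviewer_id: str   # clinician who signed off, or empty while review is pending
    reviewed: bool
    timestamp: str
    prev_hash: str     # digest of the previous record, forming a tamper-evident chain

    def digest(self) -> str:
        return hashlib.sha256(json.dumps(asdict(self), sort_keys=True).encode()).hexdigest()

record = AuditRecord(
    prompt="Summarize the last three cardiology notes for patient 1234.",
    model_name="example-llm",
    model_version="2025-09-01",
    response="(draft summary text)",
    reviewer_id="clinician_789",
    reviewed=True,
    timestamp=datetime.now(timezone.utc).isoformat(),
    prev_hash="0" * 64,
)
print(record.digest())  # append record plus digest to write-once storage for audits
```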
How to interpret institution‑reported successes (practical guidance)
- Treat internal adoption metrics (e.g., Cliniciprompt uptake numbers) as useful signals of operational traction but not as standalone evidence of safety or clinical effectiveness. Those figures are best used to justify further evaluation rather than procurement by themselves. CU’s reporting is transparent about pilot context; procurement teams should ask vendors for raw or de‑identified test sets and an independent audit before production deployment.
- Require vendors to provide both worst‑case examples and distributions of error types (omissions, hallucinations, incorrect medication names) so clinical teams can prioritize mitigations for the highest‑risk failure modes. The PDSQI‑9 framework can help categorize and quantify those error modes when testing summarizers at scale.
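A minimal sketch of the kind of error‑type report worth requesting, assuming hypothetical reviewer annotations in which each flagged summary carries an error label and a patient‑language tag; the labels and data shape are illustrative.

```python
from collections import Counter

# Hypothetical reviewer annotations: (error_type, patient_language) for each flagged summary
annotations = [
    ("omission", "English"),
    ("hallucination", "Spanish"),
    ("incorrect_medication_name", "English"),
    ("omission", "Spanish"),
    ("hallucination", "English"),
]

by_type = Counter(error for error, _ in annotations)
by_type_and_language = Counter(annotations)  # stratified view surfaces equity gaps

print("Error distribution:", dict(by_type))
print("Stratified by language:", dict(by_type_and_language))
```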
Outlook: what to watch for in 2026
- Standards adoption: Look for PDSQI‑9 or related validated instruments to be referenced in procurement checklists or by professional societies as a recommended evaluation standard. That would materially raise the bar for safe summarization deployments.
- Third‑party audits become routine: Expect health systems to require red‑team results and public safety summaries from vendors as part of contract negotiations; vendors that cannot produce this evidence will face procurement friction.
- Pangenome pipeline consolidation: Watch for commercial clinical‑lab vendors to integrate graph‑aware references into their standard diagnostic pipelines; this will drive broader adoption but will also create short‑term workload spikes for lab IT.
- Independent evaluations of prompt‑tooling: Cliniciprompt‑style frameworks will benefit from head‑to‑head studies showing whether structured prompt libraries reduce hallucination rates and clinical errors when compared to ad‑hoc prompting practices. Until those studies exist, adoption should be accompanied by local evaluation.
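Pending such head‑to‑head studies, a local evaluation can be as simple as scoring the same source notes under structured and ad‑hoc prompting and comparing PDSQI‑9 means; the scores below are invented placeholders, and Welch’s t‑test is just one reasonable choice of comparison.

```python
from statistics import mean
from scipy import stats

# Invented per-summary mean PDSQI-9 scores for the same notes under two prompting conditions
structured = [4.2, 4.5, 4.1, 4.6, 4.3, 4.4]
ad_hoc = [3.8, 4.0, 3.7, 4.2, 3.9, 4.1]

t_stat, p_value = stats.ttest_ind(structured, ad_hoc, equal_var=False)  # Welch's t-test
print(f"Structured mean {mean(structured):.2f} vs ad hoc {mean(ad_hoc):.2f} (p = {p_value:.3f})")
```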
Conclusion
CU Anschutz’s recent work represents a pragmatic blueprint for translating A.I. into clinical value: publish validated measurement tools, build clinician‑focused prompt infrastructure, and invest in inclusive foundational data such as pangenomes. Those three elements—measurement, tooling, and data—are mutually reinforcing: validated instruments let organizations set objective acceptance thresholds; clinician‑friendly tooling reduces friction and mechanical error; inclusive, well‑versioned data reduces bias in downstream models. The initiative’s strengths are real and verifiable: PDSQI‑9’s psychometric results and the pangenome assembly metrics are documented in public manuscripts and journal venues. Reported operational successes such as Cliniciprompt’s adoption are promising and practically useful, but they remain institution‑reported and should be validated independently before being used as sole evidence in procurement decisions. For health systems, the immediate priority is not to race to model selection but to harden governance: adopt validated evaluation gates, require clinician‑in‑the‑loop defaults, insist on auditable logs, and demand independent safety audits. Those investments convert emerging A.I. capability into durable, trustworthy clinical benefit rather than short‑lived experimentation.
Source: CU Anschutz newsroom https://news.cuanschutz.edu/dbmi/health-ai-tools-support-clinicians?hs_amp=true