USA TODAY Tests AI NFL Predictions With Copilot Week 6 Picks

For the first time this NFL season, Microsoft Copilot’s Week 5 run of picks came up short, and USA TODAY’s experiment with the AI chatbot now turns a brighter spotlight on what happens when large language models chase live, fast-moving sports news. Copilot finished Week 5 with a 5-9 straight-up record and a season-to-date mark that USA TODAY reported as 44-33-1. On Oct. 9, the outlet published a full set of Week 6 predictions, complete with final scores for every game, as part of an ongoing experiment that highlights both the promise and the peril of letting a general-purpose AI forecast one of the most data-driven competitions in American sports.

Background

What USA TODAY did and why it matters

USA TODAY followed a simple, repeatable procedure: prompt Microsoft Copilot for a winner and score for each Week 6 matchup, collect the outputs, and publish the set of predictions with short, human-authored commentary. That format mirrors many outlets’ recent experiments with generative AI: use an LLM as a fast, explainable predictor, then layer human judgment on top for context and reality checks.
The experiment matters for three reasons:
  • It tests a mainstream, broadly deployed assistant (Microsoft Copilot) in a domain where facts change hourly (injuries, inactives, weather, coach decisions).
  • It shows how a single-model voice can influence casual readers and bettors when presented in a mainstream publication.
  • It surfaces the strengths and failure modes of LLM-led sports forecasting — speed, pattern recognition, and fluency versus staleness, hallucination risk, and reproducibility problems.

How Copilot’s picks were generated

The reporter used a straightforward repeating prompt — essentially: “Can you predict the winner and the score of the [Team A] vs. [Team B] NFL Week 6 game?” — swapping the matchup for each of the 15 Week 6 games. In some cases Copilot initially reflected outdated injury information and was re-prompted to correct those errors. The published set included picks such as Philadelphia over New York (27-16) and Detroit over Kansas City (31-27), among others.
This is an important point: the entire process depended on prompting the same assistant repeatedly and, when necessary, correcting the assistant’s incorrect claims about roster availability or injuries. That human-in-the-loop step is critical — the AI is not autonomously refreshing live data; the experiment relied on editorial oversight.
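To keep that loop reproducible and auditable, the prompting procedure can be scripted. Below is a minimal sketch, assuming a hypothetical ask_copilot() wrapper around whatever chat interface is used (it is not a real Copilot API); the template mirrors the prompt USA TODAY describes, and each raw output is logged with a timestamp for later review.
```python
from datetime import datetime, timezone

# The repeated prompt USA TODAY describes, with the matchup swapped in per game.
PROMPT = ("Can you predict the winner and the score of the "
          "{away} vs. {home} NFL Week 6 game?")

def predict_slate(matchups, ask_copilot):
    """Run one identical prompt template per game and log everything.

    ask_copilot is a hypothetical callable standing in for the chat
    interface; it is not a real Copilot API.
    """
    log = []
    for away, home in matchups:
        prompt = PROMPT.format(away=away, home=home)
        reply = ask_copilot(prompt)  # an editor still reviews this output
        log.append({
            "queried_at": datetime.now(timezone.utc).isoformat(),
            "prompt": prompt,
            "raw_output": reply,
        })
    return log
```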

Copilot’s Week 6 slate: quick summary of the picks

Below is a concise review of the 15 game predictions Copilot returned for Week 6, paired with brief human analysis of where the picks line up with conventional thinking and where they diverge.
  • Philadelphia Eagles 27, New York Giants 16
    Copilot leaned on red-zone efficiency (offense and defense) and positioned the Eagles as the clear, short-week favorite against a Giants offense hampered by personnel losses. The general assessment — back the better team on the short week — is conventional and sensible.
  • Denver Broncos 27, New York Jets 14
    The assistant highlighted Denver’s stingy scoring defense vs. the Jets’ leaky unit and questioned Justin Fields’ consistency. Defensive advantage plus Fields’ issues make Denver a plausible pick, particularly in an international (London) setting where travel factors into preparation.
  • Indianapolis Colts 28, Arizona Cardinals 17
    Copilot favored the Colts’ two-way edge and Daniel Jones’ form against a Cardinals pass defense that had been vulnerable. This is a typical model-based outcome: favor the team with superior aggregate offensive and defensive metrics.
  • Los Angeles Chargers 24, Miami Dolphins 21
    The assistant noted Justin Herbert’s interception issues and the Chargers’ lower explosiveness without a key runner, but still gave L.A. a narrow edge. That’s a classic “healthy roster > hot streak” call; the margin reflects uncertainty.
  • New England Patriots 24, New Orleans Saints 20
    Copilot praised the Patriots’ balance and Drake Maye–Stefon Diggs chemistry and viewed the Saints as a tough home opponent. The pick favors an improving visiting team; historically, this kind of road pick is riskier than the model’s language suggests.
  • Pittsburgh Steelers 24, Cleveland Browns 14
    The model judged a quarterback edge (Aaron Rodgers vs. Dillon Gabriel) and the Steelers’ rest differential coming off a bye. Rest, matchup, and experience all tilt the prediction toward Pittsburgh.
  • Dallas Cowboys 31, Carolina Panthers 21
    Copilot backed the higher-octane offense and recent dominant performances by Dallas, projecting a comfortable win despite Carolina’s improving offense.
  • Seattle Seahawks 24, Jacksonville Jaguars 20
    Labeled “evenly matched,” Copilot skewed to Seattle because of an improved run defense and Sam Darnold’s steadiness. This pick shows sensitivity to matchup specifics rather than pure record-based signals.
  • Los Angeles Rams 31, Baltimore Ravens 17
    The assistant called out Baltimore’s defensive struggles and Lamar Jackson’s uncertain status, projecting a convincing Rams win. That’s a high-variance pick driven by roster risk and injury opacity.
  • Las Vegas Raiders 23, Tennessee Titans 20
    Copilot saw both teams as flawed but favored the Raiders due to a perceived rushing mismatch. Small margins, higher volatility.
  • Green Bay Packers 27, Cincinnati Bengals 20
    After Cincinnati’s quarterback carousel, Copilot favored Green Bay — citing Joe Flacco’s experience if he were to start, but also noting Flacco’s recent mixed performance. This pick blends pedigree with skepticism.
  • Tampa Bay Buccaneers 27, San Francisco 49ers 23
    Copilot praised Baker Mayfield and Tampa Bay’s depth, but this is arguably one of the more contrarian calls given San Francisco’s underlying metrics and a recent Thursday-night win that gives the 49ers extra rest heading into this matchup.
  • Detroit Lions 31, Kansas City Chiefs 27
    A bold projection: Copilot suggested the Lions might be the league’s most complete team and gave them a narrow edge in what it called a potential Super Bowl preview. That’s high-profile and will invite postmortem attention regardless of the outcome.
  • Buffalo Bills 27, Atlanta Falcons 21
    Copilot expected Josh Allen to bounce back and flagged the Falcons’ strong passing defense and Bijan Robinson’s matchup as reasons the game could be tighter than expected.
  • Washington Commanders 27, Chicago Bears 22
    The assistant favored Jayden Daniels’ dual-threat impact and believed Washington’s emerging pieces would outscore Caleb Williams’ Bears in a high-scoring affair.

How accurate is Copilot’s reasoning — and how do we verify it?

Copilot’s picks are delivered with fluent English and plausible statistical arguments: red-zone conversion, defensive points-per-game, recent offensive form, quarterback consistency, and roster availability. Those merit-based signals are the right kind of inputs for a predictive system.
However, two caveats are essential:
  • Data freshness is non-trivial. Roster moves, late-week injuries, inactives, and even weather can flip a game’s expected outcome. LLMs that aren’t tethered to a live, single-source sports feed will either be stale or will require human prompts to refresh.
  • Statistical precision varies by provider. Numbers such as “Eagles red-zone TD rate = 92.3%” are time-sensitive and may differ across stat aggregators; rounding, definitional cutoffs (e.g., whether the red zone means inside the 20-yard line or inside the 10), and update cadence all affect these figures.
To evaluate Copilot’s reasoning objectively, one must compare its picks to:
  • The prevailing betting market (money lines and spreads).
  • Independent model-based predictors (sports betting models and other AI prediction services).
  • Recent injury reports and official team inactives.
When those components are assembled, Copilot’s output can be judged against a consensus. In many cases its logic aligned with standard model outputs; in others — especially where Copilot discounted a late injury or over-weighted pedigree — its take diverged.
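The first of those comparisons, against the betting market, is mechanical: American money lines convert directly into implied win probabilities, which puts Copilot’s picks and the market on the same scale. A minimal sketch of the standard conversion:
```python
def implied_probability(american_odds):
    """Convert an American money line into an implied win probability."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def no_vig(p_home, p_away):
    """Strip the bookmaker's margin so the two probabilities sum to 1."""
    total = p_home + p_away
    return p_home / total, p_away / total

# Example: a -200 home favorite against a +170 underdog
home, away = implied_probability(-200), implied_probability(170)
print(no_vig(home, away))  # approximately (0.643, 0.357)
```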

Strengths: what Copilot does well in sports prediction

  • Natural, explainable narratives. Copilot communicates reasons for picks in clear, human-understandable sentences (red-zone efficiency, defensive points allowed, quarterback consistency). That makes its answers easy to parse and vet.
  • Pattern recognition across seasons. The assistant can synthesize multi-season trends and recent windows of performance into a single narrative quickly.
  • Rapid, repeatable workflow. Prompt once per game and you have a complete predicted slate in minutes — a huge time-saver for editors and casual bettors who want a consistent baseline.
  • Human-in-the-loop correction. When the chatbot produced outdated injury information in the USA TODAY experiment, re-prompting allowed a corrected result; that interactivity improves final output reliability when used properly.

Risks and limitations (the hard realities)

  • Staleness and the “last-mile” problem. LLMs are typically trained on static data snapshots or sit behind refresh cycles; they’re not canonical sources for minute-to-minute injury or personnel updates. That makes them risky for prediction tasks where late-breaking news matters.
  • Hallucination risk. LLMs can invent plausible but false facts (e.g., claiming a player is “questionable” when they’re ruled out). Even if the final pick is reasonable, supporting facts can be erroneous.
  • Lack of calibration to betting lines. Copilot gives scores and winners, but it does not automatically synthesize or adjust for market odds or implied win probabilities; bettors need to translate scores into edge vs. market lines themselves.
  • Reproducibility and prompt sensitivity. Slight wording changes can yield different predictions. The experiment used an identical prompt template for every game, but prompts that differ only in phrasing can still produce divergent outputs, a concern for editorial rigor.
  • No built-in sources or timestamps. The assistant may not indicate when a stat was last updated; that opacity complicates audit trails for a news outlet or bettor trying to validate a claim.
  • Ethical and commercial hazards. Publishing model picks in a mainstream outlet can influence betting behavior and fan perception. There is regulatory and reputational risk if the output is later shown to be systematically biased or wrong due to predictable model failures.

Practical guidance for editors, bettors and fans using Copilot-style predictions

  • Always treat LLM picks as one input among several. Combine them with market lines and at least one independent model.
  • Verify roster statuses against official team injury reports and the league’s inactives list, which is released 90 minutes before kickoff.
  • Use ensemble logic: if three separate predictive models (market-implied probability, a statistical model, and Copilot) converge, you have higher confidence.
  • Implement a reproducible prompt template and log the assistant’s complete output with timestamps for auditability.
  • Avoid publishing final betting recommendations based solely on the LLM’s raw score outputs; add human editorial confirmation, especially for high-stakes content.
  • For betting use, consider stake-sizing strategies that account for model uncertainty (smaller stakes on high-variance matchups); a sketch of this and the ensemble check follows this list.
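For the ensemble and stake-sizing bullets above, here is a hedged sketch: a simple three-way agreement check plus a fractional-Kelly stake that shrinks bets to absorb model error. The 0.5 cutoffs and the 0.25 multiplier are illustrative conventions, not recommendations.
```python
def consensus(market_home_prob, model_home_prob, llm_picks_home):
    """Back a side only when all three independent signals lean the same way."""
    signals = [market_home_prob > 0.5, model_home_prob > 0.5, llm_picks_home]
    if all(signals):
        return "home"
    if not any(signals):
        return "away"
    return None  # mixed signals: pass on the game

def fractional_kelly_stake(bankroll, p_win, decimal_odds, multiplier=0.25):
    """Full Kelly is (b*p - q) / b; scale it down to absorb model error."""
    b = decimal_odds - 1.0
    full_kelly = (b * p_win - (1.0 - p_win)) / b
    return max(0.0, full_kelly * multiplier) * bankroll

# Example: a 55% win estimate at -110 (decimal 1.91) on a $1,000 bankroll
print(consensus(0.58, 0.56, True))                         # "home"
print(round(fractional_kelly_stake(1000, 0.55, 1.91), 2))  # 13.87
```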

Deeper technical concerns: why the assistant can be wrong

Copilot’s probabilities are implicit, not explicit. It returns a score and supporting language, but not an internally consistent probability distribution. Two technical points explain many mistakes:
  • Training and data latency. Many LLMs rely on periodically updated datasets and are not architected to stream live league feeds. Without integration to a low-latency sports API, last-minute changes are invisible unless a human intervenes.
  • Prompt-driven retrieval vs. knowledge grounding. Copilot responds to prompts using patterns learned during training and any context provided, but it lacks guaranteed grounding in primary sources unless explicitly connected to them. That is why re-prompting or manual verification is necessary.
These technical design choices are solvable (connect the assistant to official play-by-play and injury APIs, add model ensembling, expose probabilities), but they require engineering work and explicit product choices about latency, cost, and safety.
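As one example of what that engineering work could look like, the sketch below grounds each prompt with a freshly retrieved injury report rather than trusting the model’s training data. Both fetch_injury_report() and ask_copilot() are hypothetical stand-ins, not real APIs.
```python
def grounded_prediction(away, home, fetch_injury_report, ask_copilot):
    """Retrieval-grounded prompting: inject live context the model cannot know.

    fetch_injury_report and ask_copilot are hypothetical stand-ins for a
    live data feed and a chat interface; neither is a real API.
    """
    report = fetch_injury_report(away, home)  # e.g. pulled from official team reports
    prompt = (
        f"Using ONLY the injury report below (retrieved {report['as_of']}), "
        f"predict the winner and score of the {away} vs. {home} game.\n\n"
        f"{report['text']}"
    )
    return ask_copilot(prompt)
```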

Copilot vs. specialized sports AIs: apples and oranges

There are two broad categories of “AI predictions” in sports:
  • General-purpose LLMs (Copilot, ChatGPT) that are conversational and can synthesize reasoning.
  • Specialist models (SportsLine’s PickBot, advanced statistical simulators) that run thousands of simulations using curated, up-to-date data streams.
Copilot’s strength is conversational reasoning and editorial-ready prose. Specialist sports AIs generally win on the hard metric of predictive accuracy because they integrate live feeds, team-specific injury reports, and probabilistic simulations tuned for betting. For editorial coverage, the LLM’s prose is valuable; for wagering, specialized simulators remain the better single-source tool.

Cross-checks, verification and the need for multiple sources

Any claim based on Copilot’s output — whether a projected score, a cited red-zone percentage, or an injury status — should be cross-checked. Stat lines and percentages change as the league progresses; different stat aggregators compute red-zone metrics differently (inside-the-20 vs. inside-the-10, touchdowns vs. scoring percentage). Editors should require corroboration from at least two independent stat providers before publishing a numeric stat as fact.
When Copilot’s reasoning cites a stat (for example, “Eagles lead the league with a 92% red-zone TD rate”), flag it for verification. That exact percentage may be accurate at a snapshot in time, but the number’s precision is only useful if sources and timestamps are attached.
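That corroboration rule can be made mechanical. The check below uses an illustrative two-percentage-point tolerance; anything outside it gets flagged for manual review before publication.
```python
def corroborated(provider_a_pct, provider_b_pct, tolerance=2.0):
    """True when two independent stat providers agree within `tolerance` points."""
    return abs(provider_a_pct - provider_b_pct) <= tolerance

# e.g. a claimed 92.3% red-zone TD rate vs. a second aggregator's 90.9%
print(corroborated(92.3, 90.9))  # True: publish, citing both sources and timestamps
```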

How outlets should present LLM-driven predictions

  • Use explicit labels: “AI-assisted picks” or “Predictions generated by Microsoft Copilot and reviewed by editors.”
  • Publish the exact prompt and the time the assistant was queried.
  • Include a short “how we used the AI” note so readers understand the human oversight involved.
  • Maintain an internal log of AI outputs and edits for regulatory or reader disputes (a sample entry is sketched at the end of this section).
These practices preserve transparency and protect both the outlet and readers from misunderstanding an AI’s role and reliability.
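For the internal log in particular, a structured entry per pick keeps the audit trail machine-readable. The shape below is one possibility, with assumed field names and illustrative values rather than a required schema.
```python
import json

# One illustrative audit-log entry per published pick; field names and
# values are assumptions, not a required schema.
entry = {
    "label": "Predictions generated by Microsoft Copilot and reviewed by editors",
    "prompt": ("Can you predict the winner and the score of the "
               "Philadelphia Eagles vs. New York Giants NFL Week 6 game?"),
    "queried_at": "2025-10-09T14:30:00Z",  # illustrative timestamp
    "raw_output": "...",                   # full, unedited assistant response
    "editor_notes": "Re-prompted to correct an outdated injury claim.",
}
print(json.dumps(entry, indent=2))
```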

Final assessment: useful, but not a replacement for specialist systems or final human judgment

The USA TODAY experiment with Microsoft Copilot offers a practical look at what mainstream AI can deliver for sports coverage: rapid, readable predictions grounded in plausible reasoning. That is valuable to a newsroom that needs fast content and an audience accustomed to AI-driven prose.
But best practices are not optional. Copilot-style picks should be packaged as conversation-starting analysis, not as authoritative betting advice. The assistant cannot guarantee up-to-the-minute, verifiable roster and injury data, and that gap, combined with the risk of hallucination and prompt variability, means editorial oversight remains essential.
For readers and bettors, the smart play is to treat Copilot’s slate as a well-expressed hypothesis: interesting, often accurate on macro signals, but fragile around late-breaking items that materially affect lines. Combine the assistant’s output with specialist simulators, official injury reports, and basic market checks before placing money or publishing definitive claims.

Copilot’s Week 6 predictions are an instructive case study: they reveal how far conversational AI has come in synthesizing sports narratives and where its limitations still leave the final mile to human editors and dedicated sports models. The assistant can accelerate analysis and surface useful angles — but judgment, verification and an awareness of data freshness remain the deciding factors between a helpful headline and a hazardous claim.

Source: USA TODAY, “NFL Week 6 predictions by Microsoft Copilot AI for every game”
 
