
AI in court is hard. The coverage is harder.

Markus Brinsa · January 7, 2026 · 7 min read


The AVA court chatbot and the media’s reflex to turn pilots into parables

I came across a national news story about Alaska’s state courts building an AI chatbot for probate help. It hit all the right buttons for modern AI coverage: a high-stakes public service, a grieving user scenario, a “minimum viable product” mindset colliding with legal reality, and the inevitable cameo by hallucinations. It is exactly the kind of piece that spreads quickly because it seems to confirm what many readers already believe about generative AI: that it is powerful, unpredictable, and therefore unfit for anything that matters.

Before we get to the framing, it is worth staying neutral about the underlying project. Alaska’s courts have been working on an “Alaska Virtual Assistant” (AVA) intended to help self-represented residents navigate probate forms, procedural steps, and the court’s self-help materials. In other words, the goal is not to replace judges or attorneys; it is to reduce friction for people who are already going to court without representation by translating a complicated self-help maze into plain-language guidance. This type of access-to-justice tooling is not new, but generative AI changes the interface: instead of hunting through pages, a user can ask “what do I do next?” and get a tailored response.

Third-party reporting and court-adjacent commentary describe AVA as a retrieval-augmented system grounded in court self-help content, built with a legal-tech partner (LawDroid) and supported by work involving the National Center for State Courts. A Thomson Reuters Institute write-up, for example, describes AVA as using enhanced retrieval-augmented generation, producing citations to sources, and being in a testing phase prior to launch.  That is the appropriate baseline: a court-facing information tool, designed to be constrained, audited, and reviewed—because the cost of being wrong is not an annoyed user, but a harmed litigant.
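To make that architectural distinction concrete, here is a minimal sketch of what a retrieval-grounded assistant of this kind looks like in principle. Everything in it—the corpus, the keyword scoring, the refusal message—is an illustrative stand-in, not AVA’s actual implementation. The point is only that the system answers from a fixed body of court content, cites what it retrieved, and abstains when nothing relevant is found.

```python
from dataclasses import dataclass

@dataclass
class Page:
    title: str
    url: str
    text: str

# Hypothetical stand-ins for court self-help pages; not real Alaska content.
CORPUS = [
    Page("Starting an Informal Probate", "https://courts.example/probate-start",
         "To open an informal probate, file the petition form with the clerk of court "
         "in the judicial district where the decedent lived."),
    Page("Small Estate Affidavit", "https://courts.example/small-estate",
         "If the estate is below the statutory limit, an affidavit for collection of "
         "personal property may be used instead of probate."),
]

def retrieve(question: str, corpus: list, k: int = 2) -> list:
    """Rank pages by naive keyword overlap with the question (stand-in for real retrieval)."""
    q_terms = set(question.lower().split())
    scored = [(len(q_terms & set(p.text.lower().split())), p) for p in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]

def answer(question: str) -> str:
    """Answer only from retrieved pages, with citations; abstain when nothing matches."""
    hits = retrieve(question, CORPUS)
    if not hits:
        # The design choice that matters in a court context: refusing and routing
        # the user to a human is cheaper than inventing an answer.
        return ("I can't find this in the court's self-help materials. "
                "Please contact the self-help center.")
    citations = "; ".join(f"{p.title} ({p.url})" for p in hits)
    # In a real system, an LLM would draft a reply constrained to `hits`;
    # here we simply surface the grounded text and its sources.
    return f"From the court's self-help pages: {hits[0].text}\nSources: {citations}"

print(answer("How do I open an informal probate?"))
```

The branch worth noticing is the abstention path: a constrained system is allowed to say “I don’t know” and hand off to a person, which is precisely the behavior a general-purpose chatbot is not tuned to prefer.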

Now the part that deserves scrutiny: the way the story is told

The NBC-style narrative arc (as circulated and discussed) leans heavily on a familiar template: “AI hype meets real life; the system fails; responsible adults scramble to contain the damage.” That template can be valid. But it becomes misleading when the reporting blurs what is known, what is measured, and what is merely said in an interview. In this case, several of the most shareable “facts” are not independently evidenced in the article itself; they are presented as statements from the project team, without supporting artifacts (test logs, evaluation reports, screenshots, policy documents, or publicly posted accuracy metrics). That does not mean the interviewees are wrong. It means the story’s certainty exceeds the publicly verifiable record.

To be concrete, here are the specifics that should be treated as interview-only claims (and therefore repeated only with clear attribution to NBC and the named interviewees), not as independently established facts.

The timeline claim is a good example. The story relays that AVA “was supposed to be a three-month project” but has stretched to “well over a year and three months,” i.e., 15+ months. That framing is powerful because it implies failure or dysfunction. But without the project plan, scope-change history, procurement constraints, staffing realities, or governance gates, it is essentially an anecdote about expectations. In government technology, “three months” often means “a prototype that proves the interface,” not “a public-facing tool we are willing to stand behind.” If the team extended the timeline to do due diligence, that is not a scandal. It is what competent institutions do when they realize the last mile is accountability, not code.

The hallucination anecdote is the most viral detail: when asked where to get legal help, the chatbot allegedly suggested contacting a law school alumni network—despite there being no law school in Alaska. The punchline is real: Alaska does not currently have an in-state law school, and Alaska institutions explicitly say so.  But the reporting still leaves a material gap: we do not see the prompt, the exact response, the system configuration at the time, whether the answer came from the model or from tool glue code, whether it was pre-RAG or post-RAG, or whether it was immediately fixed. One story about one bad answer is useful as a caution. It is not proof that the system “makes matters worse,” nor proof that the project is “plagued” by hallucinations in any quantified sense. It is a single example described by a participant in an interview.

The testing change from 91 questions to 16 is another detail that reads like a retreat. But again, the only substantiation is the interview description: that 91 questions were too time-consuming to run and evaluate, given the need for human review, so the team refined to 16 questions designed to cover basic, frequent, and previously incorrect queries. Without seeing the test harness, the scoring rubric, or the acceptance criteria, it is impossible to know whether this is a weakening of validation or a normal shift from broad exploration to a release-gate regression suite. In quality engineering terms, reducing the number of test cases can be either reckless or smart, depending on coverage design and automation maturity. The article provides the number, but not the methodology.
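For readers who want to see why a smaller suite is not automatically a weaker one, here is a hypothetical sketch of a release-gate regression suite in that spirit. The categories, cases, and pass criteria are invented for illustration; nothing here is the Alaska team’s actual test harness.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    question: str
    category: str                      # "basic", "frequent", or "previously_incorrect"
    must_cite: bool = True             # the reply must point at a source page
    forbidden: list = field(default_factory=list)  # known-bad content that must never reappear

SUITE = [
    Case("How do I open an informal probate?", "basic"),
    Case("Where can I get legal help?", "previously_incorrect",
         forbidden=["law school alumni"]),
    # ...plus the remaining cases covering the highest-traffic and highest-risk queries
]

def evaluate(respond, suite):
    """Run every case through the system and collect failures for human review."""
    failures = []
    for case in suite:
        reply = respond(case.question)
        if case.must_cite and "http" not in reply:
            failures.append((case.question, "missing citation"))
        for bad in case.forbidden:
            if bad.lower() in reply.lower():
                failures.append((case.question, f"known-bad content resurfaced: {bad}"))
    return failures

# Release gate: ship only when evaluate() returns no failures for any
# respond(question) -> str callable under test, *and* a human reviewer
# has signed off on the full transcript set.
```

Whether sixteen cases are enough depends entirely on what they cover and what the gate requires; the number alone tells you nothing.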

The cost claim—“20 queries would cost only about 11 cents”—is similarly persuasive and similarly incomplete. Even if the token cost is accurate for a given model and pricing tier, it does not include what courts actually pay for: integration, hosting, monitoring, security review, legal review, prompt updates, knowledge base maintenance, staff time for ongoing evaluation, accessibility compliance, and incident response. This is not a criticism of AVA. It is a criticism of cost storytelling that treats model inference as the cost of a public service system. Courts do not procure “tokens.” They procure responsibility.
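A quick back-of-the-envelope calculation makes the gap visible. Only the “11 cents for 20 queries” figure comes from the story; every other number below is invented for illustration.

```python
# Back-of-the-envelope arithmetic with hypothetical operating figures.
inference_cost_per_query = 0.11 / 20          # ≈ $0.0055 per query, per the story
queries_per_year = 50_000                     # hypothetical traffic
inference_per_year = inference_cost_per_query * queries_per_year   # ≈ $275

# Illustrative annual line items that never appear in token pricing:
operations = {
    "hosting, monitoring, security review": 40_000,
    "legal and content review, prompt and knowledge-base maintenance": 60_000,
    "ongoing evaluation, accessibility, incident response": 30_000,
}

total = inference_per_year + sum(operations.values())
print(f"Inference: ${inference_per_year:,.0f} of ${total:,.0f} total "
      f"({inference_per_year / total:.2%})")   # a fraction of one percent
```

Even if the illustrative figures are off by a wide margin, the shape of the result holds: model inference is a rounding error next to the institutional work of running the service responsibly.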

Finally, the “late January” launch schedule should be treated as a plan mentioned in the story, not as a confirmed public commitment, unless there is a primary Alaska Court System announcement to cite. In public institutions, scheduled launches move for reasons that have nothing to do with “AI being hard” and everything to do with governance and risk. The coverage presents the date as an endpoint in a morality play; a more sober framing would treat it as tentative.

Those are the journalistic issues around evidentiary status. The bigger problem, though, is the overstatement embedded in the story’s broader claims about hallucinations and reliability.

The most notable overstatement that needs tighter wording is the suggestion that “across the AI industry, hallucinations have decreased over time and present less of a threat today than they did even several months ago.” That sentence sounds reasonable, and it is directionally true in some narrow senses: some model families have improved on certain factuality and grounding benchmarks; tool-use patterns can reduce unsupported claims; and retrieval can constrain answers when implemented well. But the industry’s own technical disclosures complicate the clean “it’s getting better fast” storyline.

OpenAI’s own research argues that hallucinations persist not because developers forgot to “add guardrails,” but because training and evaluation incentives reward guessing over admitting uncertainty.

That is a structural explanation, not a “we’ll patch it next quarter” explanation. And even within a single vendor’s lineup, hallucination behavior is not monotonic: OpenAI’s o3 and o4-mini system card reports hallucination rates on internal evaluations (including SimpleQA and PersonQA) that differ materially by model, with at least one newer model showing a worse hallucination rate than a comparator in the cited table. In parallel, third-party tracking efforts like Vectara’s hallucination leaderboard exist precisely because “hallucination” is not a solved or uniformly improving metric; models vary, and measurement is workload-dependent.
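A toy example, with invented grader judgments, illustrates the workload-dependence point: the same system can post very different “hallucination rates” depending on which question set it is graded against, which is exactly why a single benchmark number cannot settle the “it’s getting better” claim.

```python
def hallucination_rate(judgments):
    """Fraction of answers a grader marked as unsupported by the source material."""
    return sum(judgments) / len(judgments)

# Hypothetical grading of the *same* system on two different workloads:
common_probate_questions = [False, False, False, True, False]   # 1 of 5 unsupported
rare_edge_case_questions = [True, True, False, True, False]     # 3 of 5 unsupported

print(hallucination_rate(common_probate_questions))  # 0.2
print(hallucination_rate(rare_edge_case_questions))  # 0.6
```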

This matters because the NBC-style framing uses “hallucinations are getting better” as a rhetorical release valve.

It lets the story end on an optimistic note without changing the premise. The implied conclusion becomes: yes, today’s bots are risky, but the arc of progress will take care of it. That is comforting. It is also not a governance strategy—especially not in court contexts where the relevant question is not “is the model improving?” but “can this specific system produce grounded, cited, reviewable answers under predictable failure modes?”

There is also a subtle category error in how these stories blend different AI products into one narrative blob. AVA, as described by court-adjacent sources, is not a free-roaming chatbot doing web searches; it is meant to be a constrained, retrieval-grounded assistant anchored in court self-help content, ideally with citations.  That is an entirely different risk profile than a general consumer chatbot improvising from broad pretraining.

When coverage slides from “this probate assistant hallucinated once in testing” to “AI in general is unreliable,” it collapses the engineering distinctions that actually determine whether a system is safe enough to deploy.

If you want a sharper way to tell this story without cheap shots at the Alaska team, it is this: the interesting part is not that a court pilot encountered hallucinations. The interesting part is that building a responsible public-service AI system forces you to do the boring work the hype cycle keeps skipping—evaluation design, regression testing, documentation, human review workflows, and content governance. That is why timelines stretch. That is why “11 cents” is not the point. That is why “late January” is always conditional. The real headline is institutional maturity, not AI embarrassment.

And there is one more irony. The story criticizes “minimum viable product” thinking in a high-stakes domain, and it is right to do so. But it also performs a kind of journalistic MVP: it ships a clean narrative with minimal evidence. The result is a piece that is easy to share and hard to verify. In 2026, that is not just an AI problem. It is a media literacy problem.

If we want better outcomes here, the ask is simple. Not “stop building court chatbots,” and not “trust the bots because progress.” The ask is: when we cover these systems, treat them like what they are—public infrastructure experiments. Demand artifacts, not vibes. Ask for evaluation results, not just quotes. Distinguish prototypes from deployable services. And when a story relies on interview anecdotes—as this one does—label them clearly as anecdotes, because that is what they are.

That is how you build public trust without either romanticizing AI or sensationalizing it. And it is how you avoid turning every cautious pilot project into yet another parable designed to harvest clicks.

About the Author

Markus Brinsa is the Founder & CEO of SEIKOURI Inc., an international strategy firm that gives enterprises and investors human-led access to pre-market AI—then converts first looks into rights and rollouts that scale. He created "Chatbots Behaving Badly," a platform and podcast that investigates AI’s failures, risks, and governance. With over 25 years of experience bridging technology, strategy, and cross-border growth in the U.S. and Europe, Markus partners with executives, investors, and founders to turn early signals into a durable advantage.

©2026 Copyright by Markus Brinsa | SEIKOURI Inc.