Is one foundation model enough for claims, underwriting, and fraud?

The model can be shared; the prompt, eval suite, governance, and audit trail must not be. A single Claude or GPT-class model can serve all three workflows technically. What cannot be shared is the AI Systems Program documentation, the eval set, the human-in-the-loop thresholds, or the model card. Treating these as separate "instances" is the practical fix.

What does the NAIC Model Bulletin actually require insurers to do?

The bulletin requires a written AI Systems Program covering governance and accountability, risk management framework, testing and validation, ongoing monitoring, third-party vendor oversight, and consumer-facing transparency. It does not prescribe specific tools; it does require documentation an examiner can reconstruct.

Where does Solvency II touch insurance AI?

Solvency II's operational-risk capital requirement (Article 107 of Directive 2009/138/EC) and EIOPA's August 2025 Opinion on AI Governance and Risk Management treat AI inputs into pricing, reserving, and underwriting as operational-risk inputs to the SCR (Solvency Capital Requirement). Practically, this means an AI-driven underwriting model needs the same level of validation rigor as an internal capital model: documented governance, independent validation, and a board-level risk-appetite link.

How is fraud detection different from claims AI in regulatory posture?

Fraud detection at the referral stage flags cases for human investigation and tolerates lower explainability because no adverse action has been taken. The moment the fraud signal results in claim denial, non-renewal, or SAR-equivalent reporting, the workflow snaps back to claims-level governance. Most fraud agents fail by collapsing this distinction.

Can a fully autonomous underwriting agent ever ship in regulated insurance?

In personal lines today, no — between Solvency II, state DOI rate-filing review, and ECOA where U.S. credit is involved, full autonomy fails the adverse-action documentation standard. Recommendation-with-human-sign-off is the shippable posture. The future-state where autonomy is defensible depends on regulator acceptance of model cards as substitute documentation, which is currently a 5–10 year horizon question.

What evaluation metrics matter most for insurance back-office agents?

Beyond the standard accuracy and latency metrics: fairness slices across protected classes (gender, age, race where data permits inference, geography), drift on a frozen golden set, refusal rate (does the agent know when not to answer), human-override rate (does the recommendation track the senior adjuster's call), and cost per decision. The fairness and override metrics are the ones an examiner asks about first.
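
As a rough illustration of two of these metrics, the sketch below computes a human-override rate and a simple disparate-impact ratio (lowest slice approval rate divided by the highest). The `Decision` record, the slice labels, and the sample data are hypothetical; which ratio counts as acceptable is a compliance decision the code does not make.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    group: str            # fairness slice label (hypothetical, e.g. an age band)
    agent_approved: bool  # what the agent recommended
    human_approved: bool  # what the reviewer finally decided


def override_rate(decisions):
    """Share of decisions where the reviewer changed the agent's call."""
    if not decisions:
        return 0.0
    overridden = sum(d.agent_approved != d.human_approved for d in decisions)
    return overridden / len(decisions)


def approval_rates(decisions):
    """Agent approval rate per fairness slice."""
    totals, approved = defaultdict(int), defaultdict(int)
    for d in decisions:
        totals[d.group] += 1
        approved[d.group] += d.agent_approved
    return {group: approved[group] / totals[group] for group in totals}


def disparate_impact_ratio(decisions):
    """Lowest slice approval rate divided by the highest (1.0 means parity)."""
    rates = approval_rates(decisions)
    if not rates:
        return 1.0
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0


# Illustrative data only: two slices, four decisions each.
sample = [
    Decision("A", True, True),
    Decision("A", True, True),
    Decision("A", True, False),
    Decision("A", False, False),
    Decision("B", True, True),
    Decision("B", False, False),
    Decision("B", False, True),
    Decision("B", False, False),
]
```

Computing both numbers from the same decision log keeps the fairness slice and the override rate tied to one evidence base, which is the property an examiner is likely to probe.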

How long does it take to ship a regulator-acceptable insurance agent into production?

Pilot to shadow: 8–12 weeks. Shadow to limited production: 4–8 weeks. Limited to broad rollout: 12–24 weeks, often gated by state-by-state regulatory engagement. Total realistic timeline: 6–12 months for FNOL; 9–15 months for SIU fraud; 12–24 months for any agent that affects an underwriting decision. (See Building regulator-ready AI in fintech for the analogous 90-day execution window in banking.)

What's the most common reason insurance AI pilots get rolled back?

Two reasons, roughly equally weighted. First, the eval suite was built once, never refreshed, and drifted. Second, the human-in-the-loop policy was confidence-based instead of materiality-based, and the team couldn't defend it under examiner questioning. The fix in both cases is documentation discipline, not a different model.
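
One way to picture the difference between the two policies is a routing rule that ignores model confidence entirely. This is a minimal sketch: the dollar threshold, the action names, and the `Recommendation` fields are assumptions made for illustration, and real values would come from the written decision policy.

```python
from dataclasses import dataclass

# Hypothetical policy values; real ones belong in the written decision policy.
MATERIAL_AMOUNT = 10_000.0
ADVERSE_ACTIONS = frozenset({"deny", "non_renew", "refer_law_enforcement"})


@dataclass(frozen=True)
class Recommendation:
    action: str        # e.g. "pay" or "deny" (illustrative labels)
    amount: float      # amount at stake on the claim
    confidence: float  # recorded for the audit trail, not used for routing


def needs_human_review(rec: Recommendation) -> bool:
    """Route on what is at stake, not on how sure the model says it is."""
    return rec.action in ADVERSE_ACTIONS or rec.amount >= MATERIAL_AMOUNT
```

Under this rule a 99%-confident denial still goes to a reviewer, while a low-confidence payment on a small claim does not. The rule stays stable when the model's calibration shifts, which makes it easier to defend than a confidence cutoff.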

Agentic AI for Insurance Back-Office: Claims, Underwriting, Fraud

Written by

Zhanna Yuskevych

Chief Product Officer

Reviewed by

Bohdan Varshchuk

Chief Technology Officer

Posted: May 21, 2026

Updated: July 23, 2026

9 min read

Expert verified

On this page:

TL;DR
Key takeaways:
Introduction
Where does agentic AI actually fit inside an insurance back-office?
How do you ship an agent into one of these workflows without losing control?
When does it make commercial sense to invest now?
Conclusion
FAQ

TL;DR

Insurers see “agentic AI” pitched as a single technology decision. It isn’t. Claims, underwriting, and fraud each sit on a different regulatory surface — NAIC Model Bulletin on AI for claims and underwriting, Solvency II operational-risk articles for capital-affecting decisions, and IDD/insurance distribution rules wherever an agent touches a customer-facing recommendation. Treating these workflows as one architecture is the fastest way to ship a pilot that gets clawed back by compliance. The piece below names where agents belong inside an insurance back-office, where they do not, what the mandatory human-in-the-loop checkpoints look like, and how to design the eval and audit trail so a state examiner can read it without a Teamvoy delivery lead in the room.

Key takeaways:

One agent cannot legally cover claims, underwriting, and fraud — each has a distinct regulator and audit standard.
The NAIC Model Bulletin (adopted in substantially similar form by 24+ U.S. state insurance departments as of early 2026) requires written AI governance covering testing, validation, and oversight.
Solvency II Article 107 treats AI-driven underwriting as an operational-risk capital input; documentation must support an SCR review.
Human-in-the-loop is not a UX choice in insurance — it is a regulatory anchor for materiality and explainability.
Eval suites for insurance agents must include fairness tests, disparate-impact thresholds, and a frozen golden set tied to a model card.
The expensive failure pattern is agent A learning from a label generated by agent B, which then audits agent C.
Throughput uplifts of 30–60% on first-notice-of-loss workflows are realistic; underwriting agents shipping autonomous decisions are not.

Introduction

Every insurer we talk to is running an agentic AI pilot somewhere. The pilots that ship into production have one thing in common: the team scoped the agent to a single workflow with a clean regulatory surface, instrumented it with a frozen eval set, and built the human-in-the-loop as a regulator-readable artifact rather than a UX afterthought. The pilots that stall have the opposite shape — one orchestrator pointed at claims, underwriting, and fraud at once, no model card, no eval governance, and a vague plan to “add humans later.” This piece is a working guide for the first kind, written for CTOs and COOs about to commit to a 12–18 month back-office AI roadmap and tired of vendor decks that pretend the three workflows are interchangeable.

Where does agentic AI actually fit inside an insurance back-office?

The question matters because the answer is workflow-specific, not company-specific. An insurer is not one regulatory entity for AI purposes — it is a stack of regulated activities, each governed by a different framework. A back-office agent is a software actor making or recommending a decision inside one of those activities. The governing rules differ workflow by workflow.

Figure: Overview of agent roles in insurance workflows. Four panels: claims (FNOL), the highest-ROI target; underwriting, rating with mandatory rules; fraud/SIU, higher autonomy tolerance; customer service/policy admin, separate regulator scope.

Claims. First-notice-of-loss (FNOL) intake, document classification, coverage triage, severity scoring, and reserve recommendation are the highest-ROI early-stage targets. They are also the workflows with the clearest regulator stance: the NAIC Model Bulletin on the Use of Artificial Intelligence Systems by Insurers (adopted in substantially similar form by 24+ U.S. state insurance departments as of early 2026) requires that insurers maintain a written AI Systems Program covering governance, risk management, testing, validation, monitoring, and third-party AI vendor oversight. Anything that affects a claim payment must be documentable end to end.

Underwriting. Rating, segmentation, and risk-class assignment are where agents look most attractive and where the regulator’s posture is harshest. Under Solvency II, AI inputs into underwriting flow into the operational risk capital requirement (Article 107 of Directive 2009/138/EC) and into the firm’s Own Risk and Solvency Assessment (ORSA), with supervisory expectations now clarified by EIOPA’s August 2025 Opinion on AI Governance and Risk Management. In parallel, U.S. state DOIs, led by Colorado (SB21-169) and New York (Circular Letter No. 7), have begun pushing back on rate filings that depend on undocumented model behavior, with modifications and disclosure demands the dominant enforcement posture so far. The agent class to deploy here is recommendation-only with mandatory human sign-off; full autonomy invites a market-conduct examination.

Fraud. SIU referrals, SAR-equivalent suspicious-activity classification, and provider-network anomaly detection are higher-tolerance for AI autonomy because the agent is flagging cases for human investigation, not deciding the customer outcome. Fraud agents can act faster and with thinner explainability up to the point of an adverse action. The moment an agent’s signal denies a claim, refers to law enforcement, or non-renews a policy, the workflow snaps back to the claims/underwriting governance bar.
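
The switch described above can be made explicit in code so that it cannot be collapsed by accident. The sketch below is illustrative only: the tier names and outcome identifiers are hypothetical labels for the three adverse actions named in the text.

```python
from enum import Enum


class Tier(Enum):
    REFERRAL = "referral"          # flag for SIU investigation only
    CLAIMS_GRADE = "claims_grade"  # full claims/underwriting governance bar


# The adverse actions named in the text; the identifiers are illustrative.
ADVERSE_OUTCOMES = frozenset(
    {"deny_claim", "law_enforcement_referral", "non_renewal"}
)


def governance_tier(outcome: str) -> Tier:
    """Return the governance bar that applies to a fraud agent's outcome."""
    if outcome in ADVERSE_OUTCOMES:
        return Tier.CLAIMS_GRADE
    return Tier.REFERRAL
```

Keeping the tier decision in one function gives the audit trail a single place to show why a given fraud signal was, or was not, held to the claims-level bar.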

Customer service and policy admin. Adjacent to back-office, often pitched as the same project. Different regulator (IDD in the EU, state insurance commissioners in the U.S., plus consumer-protection law everywhere). Worth separating in scope.

The architecture mistake is treating these four as one platform decision. Each needs its own eval set, its own audit trail, its own escalation policy, and, crucially, its own model card. Sharing the underlying foundation model is fine; sharing the prompt, the eval suite, and the governance is not.
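
A small configuration check can enforce that separation. The sketch below assumes a hypothetical `WorkflowGovernance` record per workflow; every field name and identifier is invented for illustration. The foundation model may repeat across workflows, and the check flags any governance artifact that does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowGovernance:
    workflow: str
    foundation_model: str  # pinned version; may be shared across workflows
    prompt_version: str
    eval_set_id: str
    model_card_id: str
    audit_log: str


# Governance artifacts that must be unique to each workflow.
PER_WORKFLOW_FIELDS = ("prompt_version", "eval_set_id", "model_card_id", "audit_log")


def shared_governance_fields(configs):
    """Return the governance fields whose value is reused across workflows."""
    shared = []
    for field_name in PER_WORKFLOW_FIELDS:
        values = [getattr(config, field_name) for config in configs]
        if len(set(values)) < len(values):
            shared.append(field_name)
    return shared
```

Run as a pre-deployment check, an empty result means each workflow has its own artifacts; a non-empty one names the artifact that two workflows are sharing.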

How do you ship an agent into one of these workflows without losing control?

You design backwards from the audit. Pretend a state examiner walks in 18 months from now and asks: show me how this agent made the decision it made on claim #4731. If you can produce — in under 10 minutes — the prompt version, the model version, the input documents, the eval result for that input class, the human reviewer’s sign-off, and the version of the decision policy in force on that date, the agent is shippable. If any of those six artifacts is missing, the agent will be unwound by compliance before it generates ROI.
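
The six artifacts can be modeled as one record per decision, with a completeness check run before anything ships. This is a sketch under assumptions: the field names are hypothetical and the storage layer is out of scope.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass(frozen=True)
class DecisionRecord:
    """One audit record per agent decision; field names are hypothetical."""
    prompt_version: Optional[str] = None
    model_version: Optional[str] = None
    input_document_ids: Optional[Tuple[str, ...]] = None
    eval_result_id: Optional[str] = None
    reviewer_sign_off: Optional[str] = None
    decision_policy_version: Optional[str] = None


def missing_artifacts(record: DecisionRecord):
    """List the artifacts that are absent or empty on this record."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A record with any entry in `missing_artifacts` is one the team could not reconstruct for an examiner, so a reasonable design choice is to block the decision from being finalized until the list is empty.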

A workflow comparison: what each agent type needs

Workflow	Decision authority	Human-in-the-loop	Eval suite focus	Primary regulator(s)	Realistic uplift
FNOL claims intake	Triage + routing	Optional for non-material	Doc classification accuracy, edge-case recall	NAIC Model Bulletin, state DOIs	30–50% straight-through processing (STP); 40–60% cycle-time
Claims severity & reserve	Recommendation	Mandatory on material claims	Fairness, calibration, reserve backtest	NAIC, state DOIs, UCSPA, NY DFS CL No. 7	15–25% cycle-time; leakage gains
Underwriting / rating	Recommendation only	Mandatory always	Disparate impact, model card freeze, drift	Solvency II Art. 107, EIOPA 2025 Opinion, CO SB21-169, NY DFS CL No. 7	60–99% quote cycle (commercial); 3–5pp loss ratio
SIU fraud referral	Flag + score	Always human	Precision/recall, FP cost, alert acceptance	NAIC, state DOIs, state fraud bureaus, NICB	~2 wks earlier detection; higher accept rate
Customer-service policy QA	Answer + escalate	Escalation thresholds	Hallucination rate, refusal rate, resolution	State UTPA, consumer law, IDD (EU), TCPA	30–55% Tier-1 deflection

The numbers are ranges across Teamvoy and publicly reported insurance AI deployments and should be treated as targets to verify, not guarantees.

A ten-step deployment shape that survives an exam

1. Pick one workflow. Resist the platform pitch. Pick FNOL or SIU first; underwriting last.
2. Write the AI Systems Program entry for that workflow. Governance, owner, escalation, decommission criteria. NAIC Model Bulletin language.
3. Freeze a golden eval set of 500–2,000 representative inputs labeled by senior adjusters. This becomes the model card’s evidence base.
4. Define the human-in-the-loop thresholds. Materiality-based, not confidence-based. Confidence thresholds drift; materiality doesn’t.
5. Build the agent on a single foundation model with version-pinning. Anthropic Claude, OpenAI, or open-weights with self-hosted inference are all defensible; the choice that fails is “whatever the orchestration tool defaults to.”
6. Instrument observability before launch. Faithfulness, drift, refusal, fairness slices, latency, cost. (See LLM observability and evals for production fintech AI; the fintech patterns translate directly to insurance.)
7. Run a shadow period of 4–8 weeks. Agent makes a recommendation, human makes the decision, compare. Throw out the first two weeks.
8. Pilot at a single business unit or state. Not enterprise-wide. Examiners look at the rollout posture.
9. Quarterly model-card refresh. Tied to the eval set, not to a calendar reminder.
10. Decommission criteria written before launch. If the fairness slice drifts beyond X for two quarters, the agent comes out.

The work is unglamorous and most of it is documentation. That is the work. Insurers that try to skip steps 2, 4, 6, or 10 are the insurers shipping pilots that get clawed back.
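
The decommission criterion in the last step is simple enough to write down as an executable rule before launch. The sketch below reads "two quarters" as the two most recent consecutive quarters, which is an assumption. The drift limit is left as a parameter because the text deliberately does not fix it.

```python
def should_decommission(quarterly_drift, limit, quarters=2):
    """True if each of the last `quarters` readings exceeds `limit`.

    `quarterly_drift` is an ordered list of fairness-slice drift values,
    oldest first. How drift is measured and what `limit` should be are
    policy choices made outside this function.
    """
    recent = quarterly_drift[-quarters:]
    return len(recent) == quarters and all(abs(d) > limit for d in recent)
```

Encoding the rule this way means the removal decision is made by a criterion agreed before launch rather than argued case by case once the agent is in production.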

When does it make commercial sense to invest now?

The commercial frame for a Series-C carrier or a Tier-2 insurer in 2026 is straightforward: agentic AI in FNOL claims pays back in 9–14 months on labor cost alone, before counting the effect of shorter cycle times on customer NPS. Underwriting agents pay back in 18–24 months once they earn a steady-state human-in-the-loop posture. SIU fraud agents pay back in 6–9 months because the marginal labor cost of an investigator is high and the precision uplift translates directly to recovery. Customer-service agents pay back fastest of all but are also the easiest to ship as a regrettable launch; see any insurer chatbot story from 2023–2024.

The investment that doesn’t pay back is the one-platform-for-everything pitch. The vendor demos beautifully; the regulator does not. We have not seen a regulated-insurance enterprise successfully run one orchestrator across claims, underwriting, and fraud without splitting the governance — which is the actual hard part — back out into four streams. (See Why most AI pilots in fintech never reach production for the analogous fintech failure modes.)

Teamvoy builds these stacks end-to-end: workflow-specific eval suites, AI Systems Program documentation aligned to NAIC and Solvency II language, the LLMops/observability scaffold, and the integration into legacy claims systems (Guidewire, Duck Creek, Sapiens) that is usually the integration tax everyone underestimates. If you are scoping a back-office AI roadmap and want the engineering reality check before a vendor signoff, start with our insurance practice overview.

Conclusion

Figure: Two-column infographic contrasting vendor and regulator patterns. The left card lists the production-pattern items; the right card lists the patterns that stall or fail.

Agentic AI in insurance back-office is shippable in 2026, and the deployments that ship share a single pattern: workflow-by-workflow scope, materiality-based human-in-the-loop, frozen eval sets refreshed on a real cadence, and AI Systems Program documentation written before launch. The deployments that fail share the opposite pattern: one platform, one team, vague governance, and a plan to add documentation later. The vendor pitch optimizes for the first month. The regulator optimizes for month 18.

If you are scoping an insurance back-office AI roadmap and want a regulator-anchored engineering reality check, Teamvoy builds these stacks end-to-end across NAIC, Solvency II, and state-DOI surfaces. See how we approach insurance claims automation →

FAQ

Zhanna Yuskevych , Chief Product Officer

Zhanna has over 15 years of experience in software development. She has led the creation of many impactful solutions. Driven by her passion for modern tech, she aims to solve real-world challenges with innovative products. Besides tech, Zhanna loves arts and design. She is always eager to explore new creative directions.
