AI assurance demo

Polished AI writing can hide risk.

AI-generated memos can sound professional even when claims are weak, unsupported, or missing verification.

Reflective Risk measures whether a memo is actually supported, not just whether it sounds convincing.

The problem

AI can simulate scoring.

Generic AI can write a memo. It can also produce a scorecard about that memo.

But generated scoring is not the same as measurement.

The response

Deterministic control.

Reflective Risk checks the document claim by claim.

It exposes weak support, unsupported conclusions, forward-looking gaps, and verification risk.

Core thesis

Reflective Diagnostics proves the failure mode. Reflective Risk provides the control system.

Why this matters

The danger is false reassurance.

People may trust AI more, not less, when it produces:

a polished explanation
a confidence score
a reflection narrative
a professional-looking review

The distinction

Reasoning signals are not verification.

AI-native signals can be useful.

They are not a substitute for governed measurement.

Live proof

Same memo. Same prompt. Different LLM judgments.

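Each model received the same memo and the same scoring rubric. The judgments diverged: two models returned HOLD, two returned PASS_WITH_NOTES, and no two models reported the same claim counts.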

Reviewer	Final judgment	Supported	Weak	Unsupported	Forward view	Math	Record
Grok-3	HOLD	2	1	0	2	1	2
DeepSeek Chat	PASS_WITH_NOTES	5	2	0	2	0	0
OpenAI gpt-4o-mini	PASS_WITH_NOTES	3	2	1	1	0	0
Claude Sonnet 4.6	HOLD	2	3	1	2	2	2
Reflective Risk	PASS_WITH_NOTES	4	3	0	2	0	0
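The spread among the four LLM reviewers can be tallied directly from the table. The sketch below is illustrative only: the rows are copied from the table above, the function name is hypothetical, and treating Supported + Weak + Unsupported as the number of claims a reviewer counted is an assumption (it does match the 7 reviewed claims reported for Reflective Risk).

```python
from collections import Counter

# Rows copied from the table above:
# (reviewer, final judgment, supported, weak, unsupported)
REVIEWS = [
    ("Grok-3", "HOLD", 2, 1, 0),
    ("DeepSeek Chat", "PASS_WITH_NOTES", 5, 2, 0),
    ("OpenAI gpt-4o-mini", "PASS_WITH_NOTES", 3, 2, 1),
    ("Claude Sonnet 4.6", "HOLD", 2, 3, 1),
]


def summarize_divergence(reviews):
    """Summarize how far the reviewers disagree (hypothetical helper)."""
    verdicts = Counter(judgment for _, judgment, *_ in reviews)
    supported = [row[2] for row in reviews]
    # Assumption: the three support states together cover every counted claim.
    totals = [row[2] + row[3] + row[4] for row in reviews]
    return {
        "verdicts": dict(verdicts),
        "supported_range": (min(supported), max(supported)),
        "claims_counted_range": (min(totals), max(totals)),
    }


if __name__ == "__main__":
    for key, value in summarize_divergence(REVIEWS).items():
        print(key, value)
```

Under that assumption, the reviewers disagree on the verdict and also on how many claims the memo contains.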

Reflective Risk scorecard

Final judgment

PASS WITH NOTES

Weak claims found. No unsupported material claim.

Support rate

57%

4 of 7 reviewed claims supported.
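The support rate is plain counting. As a minimal sketch, assuming each reviewed claim carries exactly one of three states (the state names and the function are hypothetical, not the product's API), the figure above can be reproduced like this:

```python
from collections import Counter

# Assumed state labels; the scorecard names Supported, Weak, and Unsupported.
STATES = ("supported", "weak", "unsupported")


def support_rate(claim_states):
    """Return (supported, total, percent) for a list of claim states."""
    counts = Counter(claim_states)
    unknown = set(counts) - set(STATES)
    if unknown:
        raise ValueError(f"unknown claim states: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no claims to score")
    percent = round(100 * counts["supported"] / total)
    return counts["supported"], total, percent


if __name__ == "__main__":
    # 4 supported and 3 weak claims, as in the scorecard.
    states = ["supported"] * 4 + ["weak"] * 3
    print(support_rate(states))
```

The same input always yields the same counts, which is the property a generated score does not guarantee.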

Needs revision

Risk was visible without relying on AI opinion.

Reviewed claims

7

Supported

4

Weak

3

Unsupported

0

Forward gaps

2

External / math / record

0 / 0 / 0

Test memo

A small memo with hidden risk.

Budget facts, staffing ratios, and forward-looking management claims.

Memo text

The approved budget for the current operating plan is $12.0 million in the finance schedule prepared for management this quarter. The reserve for the same operating plan is $3.0 million in the internal contingency line of the budget file. Those amounts produce a remaining balance of $9.0 million for the active operating budget in the current planning cycle. The 25 open roles represent one quarter of the 100-role staffing plan for the current program year. That remaining balance likely supports the current staffing plan through Q3 under the present hiring assumptions. Management expects the hiring pace to stay flat next quarter based on the latest staffing review. The current operating posture appears stable across the staffing and budget materials reviewed this week.
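The memo's two arithmetic claims can be rechecked mechanically, while its forward-looking claims ("likely supports", "expects", "appears stable") cannot. The sketch below is an illustration of that distinction, not a description of how Reflective Risk performs its math check; the variable names are hypothetical and the figures are copied from the memo text.

```python
from decimal import Decimal
from fractions import Fraction

# Figures as stated in the memo, in millions of dollars.
approved_budget = Decimal("12.0")
reserve = Decimal("3.0")
stated_remaining = Decimal("9.0")

# Staffing figures as stated in the memo.
open_roles = 25
planned_roles = 100
stated_share = Fraction(1, 4)  # "one quarter"

# Each check recomputes a stated figure from the memo's own inputs.
checks = {
    "remaining_balance": approved_budget - reserve == stated_remaining,
    "open_role_share": Fraction(open_roles, planned_roles) == stated_share,
}

if __name__ == "__main__":
    for name, passed in checks.items():
        print(name, "consistent" if passed else "inconsistent")
```

Both figures are internally consistent. That says nothing about whether the remaining balance actually covers staffing through Q3, which is where the weak claims sit.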

Structural diagnostics

The best LLM response still failed decision-grade standards.

The strongest model in this run was only the least structurally weak, with an integrity of 0.51. Integrity averaged roughly 0.25 across the four models, and no model landed in the safe zone.

Model	Diagnosis	Integrity	Coverage	Orphan rate
OpenAI gpt-4o-mini	Supported	0.51	67%	33%
DeepSeek Chat	Fragmented	0.20	20%	80%
Grok-3	Fragmented	0.19	18%	82%
Claude Sonnet 4.6	Fragmented	0.09	7%	93%
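Two properties of the table can be checked from its own numbers: in every row coverage and orphan rate add up to 100%, and the mean integrity is roughly 0.25. The sketch below copies the rows from the table; the function name is hypothetical, and reading orphan rate as the complement of coverage is an inference from these four rows, not a documented definition.

```python
from statistics import mean

# Rows copied from the table above:
# (model, integrity, coverage %, orphan rate %)
ROWS = [
    ("OpenAI gpt-4o-mini", 0.51, 67, 33),
    ("DeepSeek Chat", 0.20, 20, 80),
    ("Grok-3", 0.19, 18, 82),
    ("Claude Sonnet 4.6", 0.09, 7, 93),
]


def check_rows(rows):
    """Return (coverage and orphan rate are complementary, mean integrity)."""
    complementary = all(cov + orphan == 100 for _, _, cov, orphan in rows)
    average = mean(integrity for _, integrity, _, _ in rows)
    return complementary, average


if __name__ == "__main__":
    complementary, average = check_rows(ROWS)
    print("coverage + orphan rate == 100% in every row:", complementary)
    print(f"mean integrity: {average:.4f}")
```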

What this proves

AI-generated review is not a control system.

It can help explain. It can help critique.

But it cannot be treated as audit-grade measurement by itself.

What Reflective Risk provides

Claim-level assurance.

deterministic counts
support-state assignment
repeatable receipts
bounded verification
comparison across memo versions
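One way to picture a repeatable receipt and a comparison across memo versions is a hash over a canonical encoding of the claim states. This is a sketch under stated assumptions, not Reflective Risk's actual receipt format: the claim IDs, state labels, and both function names are hypothetical.

```python
import hashlib
import json


def receipt(claim_states):
    """Hash a claim-state mapping so identical results give identical receipts."""
    # Sorted keys and fixed separators make the encoding independent of dict order.
    canonical = json.dumps(claim_states, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_claims(old, new):
    """Map each claim ID whose state differs to its (old, new) pair."""
    claim_ids = sorted(set(old) | set(new))
    return {
        cid: (old.get(cid), new.get(cid))
        for cid in claim_ids
        if old.get(cid) != new.get(cid)
    }


if __name__ == "__main__":
    # Hypothetical claim IDs and states for two versions of a memo.
    v1 = {"c1": "supported", "c2": "weak", "c3": "weak"}
    v2 = {"c1": "supported", "c2": "supported", "c3": "weak"}
    print(receipt(v1) == receipt(dict(reversed(list(v1.items())))))
    print(changed_claims(v1, v2))
```

Rerunning the same review produces the same receipt, and a revised memo shows exactly which claims changed state.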

People with sign-off risk do not need more AI-generated confidence. They need defensible measurement.