EnlightenedAI Research Lab
Reflective Risk
AI assurance demo

Polished AI writing can hide risk.

AI-generated memos can sound professional even when claims are weak, unsupported, or missing verification.

Reflective Risk measures whether a memo is actually supported — not just whether it sounds convincing.

The problem

AI can simulate scoring.

Generic AI can write a memo. It can also produce a scorecard about that memo.

But generated scoring is not the same as measurement.

The response

Deterministic control.

Reflective Risk checks the document claim by claim.

It exposes weak support, unsupported conclusions, forward-looking gaps, and verification risk.
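
A minimal sketch of what claim-by-claim checking can look like. The state names mirror the categories above, but the data shape and rules here are illustrative assumptions, not Reflective Risk's actual implementation:

```python
# Hypothetical sketch: assign each claim a support state deterministically.
# The rule set is illustrative; no model opinion is involved at this step.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    evidence_refs: int       # citations resolved against the record
    is_forward_looking: bool

def support_state(claim: Claim) -> str:
    """Same claim in, same state out — repeatable by construction."""
    if claim.is_forward_looking and claim.evidence_refs == 0:
        return "forward_gap"
    if claim.evidence_refs == 0:
        return "unsupported"
    if claim.evidence_refs == 1:
        return "weak"
    return "supported"

claims = [
    Claim("Approved budget is $12.0M", evidence_refs=2, is_forward_looking=False),
    Claim("Balance supports staffing through Q3", evidence_refs=0, is_forward_looking=True),
]
print([support_state(c) for c in claims])  # ['supported', 'forward_gap']
```

Because the rules are fixed, two runs over the same memo version cannot diverge the way two model judgments can.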

Core thesis

Reflective Diagnostics proves the failure mode. Reflective Risk provides the control system.

Why this matters

The danger is false reassurance.

People may trust AI more, not less, when it produces:

  • a polished explanation
  • a confidence score
  • a reflection narrative
  • a professional-looking review

The distinction

Reasoning signals are not verification.

AI-native signals can be useful.

They are not a substitute for governed measurement.

Live proof

Same memo. Same prompt. Different LLM judgments.

Each model received the same memo and the same scoring rubric. The judgments diverged.

Reviewer            | Final judgment  | Supported | Weak | Unsupported | Forward view | Math | Record
Grok-3              | HOLD            | 2         | 1    | 0           | 2            | 1    | 2
DeepSeek Chat       | PASS_WITH_NOTES | 5         | 2    | 0           | 2            | 0    | 0
OpenAI gpt-4o-mini  | PASS_WITH_NOTES | 3         | 2    | 1           | 1            | 0    | 0
Claude Sonnet 4.6   | HOLD            | 2         | 3    | 1           | 2            | 2    | 2
Reflective Risk     | PASS_WITH_NOTES | 4         | 3    | 0           | 2            | 0    | 0
Reflective Risk scorecard

Final judgment: PASS WITH NOTES
Weak claims found. No unsupported material claim.

Support rate: 57%
4 of 7 reviewed claims supported.

Needs revision: 3
Risk was visible without relying on AI opinion.

Reviewed claims: 7
Supported: 4
Weak: 3
Unsupported: 0
Forward gaps: 2
External / math / record: 0 / 0 / 0
Test memo

A small memo with hidden risk.

Budget facts, staffing ratios, and forward-looking management claims.

Memo text
The approved budget for the current operating plan is $12.0 million in the finance schedule prepared for management this quarter. The reserve for the same operating plan is $3.0 million in the internal contingency line of the budget file. Those amounts produce a remaining balance of $9.0 million for the active operating budget in the current planning cycle. The 25 open roles represent one quarter of the 100-role staffing plan for the current program year. That remaining balance likely supports the current staffing plan through Q3 under the present hiring assumptions. Management expects the hiring pace to stay flat next quarter based on the latest staffing review. The current operating posture appears stable across the staffing and budget materials reviewed this week.
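
The memo's numeric claims are mechanically checkable without any model in the loop. A hedged sketch — the values come from the memo text above, but the encoding and tolerance are illustrative, not the product's schema:

```python
# Illustrative arithmetic check of the memo's stated figures.
budget_m = 12.0          # approved budget, $ millions
reserve_m = 3.0          # contingency reserve, $ millions
stated_balance_m = 9.0   # memo's claimed remaining balance

open_roles = 25
plan_roles = 100
stated_fraction = 0.25   # "one quarter of the 100-role staffing plan"

checks = {
    "remaining_balance": abs((budget_m - reserve_m) - stated_balance_m) < 1e-9,
    "staffing_fraction": abs(open_roles / plan_roles - stated_fraction) < 1e-9,
}
print(checks)  # {'remaining_balance': True, 'staffing_fraction': True}
```

The forward-looking sentences ("likely supports … through Q3", "expects the hiring pace to stay flat") are the part arithmetic cannot settle — which is exactly why they surface as forward gaps.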
Structural diagnostics

The best LLM response still failed decision-grade standards.

The strongest model in this run was merely the least structurally weak: the pool's average integrity was low, and no model reached the safe zone.

Model              | Diagnosis  | Integrity | Coverage | Orphan rate
OpenAI gpt-4o-mini | Supported  | 0.51      | 67%      | 33%
DeepSeek Chat      | Fragmented | 0.20      | 20%      | 80%
Grok-3             | Fragmented | 0.19      | 18%      | 82%
Claude Sonnet 4.6  | Fragmented | 0.09      | 7%       | 93%
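
Coverage and orphan rate in the table are complements. A minimal illustration, assuming definitions this demo does not spell out (a claim is "orphaned" when nothing anchors it to support):

```python
# Assumed definitions for illustration: coverage is the share of claims
# anchored to support; orphan rate is the remainder.
def coverage_metrics(anchored: int, total: int) -> tuple[float, float]:
    coverage = anchored / total
    return coverage, 1.0 - coverage

cov, orphan = coverage_metrics(anchored=2, total=3)
print(f"{cov:.0%} covered, {orphan:.0%} orphaned")  # 67% covered, 33% orphaned
```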
What this proves

AI-generated review is not a control system.

It can help explain. It can help critique.

But it cannot be treated as audit-grade measurement by itself.

What Reflective Risk provides

Claim-level assurance.

  • deterministic counts
  • support-state assignment
  • repeatable receipts
  • bounded verification
  • comparison across memo versions
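
As a sketch, the deterministic roll-up above might look like the following. The thresholds and judgment labels are assumptions read off this demo's scorecard, not a published rubric:

```python
def roll_up(supported: int, weak: int, unsupported: int) -> dict:
    """Counts in, judgment out: repeatable for the same memo version."""
    reviewed = supported + weak + unsupported
    if unsupported > 0:
        judgment = "HOLD"
    elif weak > 0:
        judgment = "PASS_WITH_NOTES"
    else:
        judgment = "PASS"
    return {
        "reviewed": reviewed,
        "support_rate": round(100 * supported / reviewed),
        "judgment": judgment,
    }

print(roll_up(supported=4, weak=3, unsupported=0))
# {'reviewed': 7, 'support_rate': 57, 'judgment': 'PASS_WITH_NOTES'}
```

Running this on the demo's counts (4 supported, 3 weak, 0 unsupported) reproduces the scorecard's 57% support rate and PASS_WITH_NOTES judgment, with no model opinion in the path.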
People with sign-off risk do not need more AI-generated confidence. They need defensible measurement.