Polished AI writing can hide risk.
AI-generated memos can sound professional even when claims are weak, unsupported, or missing verification.
Reflective Risk measures whether a memo is actually supported — not just whether it sounds convincing.
AI can simulate scoring.
Generic AI can write a memo. It can also produce a scorecard about that memo.
But generated scoring is not the same as measurement.
Deterministic control.
Reflective Risk checks the document claim by claim.
It exposes weak support, unsupported conclusions, forward-looking gaps, and verification risk.
The danger is false reassurance.
People may trust AI more, not less, when it produces:
- a polished explanation
- a confidence score
- a reflection narrative
- a professional-looking review
Reasoning signals are not verification.
AI-native signals can be useful.
They are not a substitute for governed measurement.
Same memo. Same prompt. Different LLM judgments.
Each model received the same memo and the same scoring rubric. The judgments diverged.
| Reviewer | Final judgment | Supported | Weak | Unsupported | Forward view | Math | Record |
|---|---|---|---|---|---|---|---|
| Grok-3 | HOLD | 2 | 1 | 0 | 2 | 1 | 2 |
| DeepSeek Chat | PASS_WITH_NOTES | 5 | 2 | 0 | 2 | 0 | 0 |
| OpenAI gpt-4o-mini | PASS_WITH_NOTES | 3 | 2 | 1 | 1 | 0 | 0 |
| Claude Sonnet 4.6 | HOLD | 2 | 3 | 1 | 2 | 2 | 2 |
| Reflective Risk | PASS_WITH_NOTES | 4 | 3 | 0 | 2 | 0 | 0 |
WITH NOTES
Weak claims found. No unsupported material claim.
4 of 7 reviewed claims supported.
Risk was visible without relying on AI opinion.
A small memo with hidden risk.
Budget facts, staffing ratios, and forward-looking management claims.
The best LLM response still failed decision-grade standards.
The strongest model in this run was only the least structurally weak. The model pool had low average integrity and no model in the safe zone.
| Model | Diagnosis | Integrity | Coverage | Orphan rate |
|---|---|---|---|---|
| OpenAI gpt-4o-mini | Supported | 0.51 | 67% | 33% |
| DeepSeek Chat | Fragmented | 0.20 | 20% | 80% |
| Grok-3 | Fragmented | 0.19 | 18% | 82% |
| Claude Sonnet 4.6 | Fragmented | 0.09 | 7% | 93% |
AI-generated review is not a control system.
It can help explain. It can help critique.
But it cannot be treated as audit-grade measurement by itself.
Claim-level assurance.
- deterministic counts
- support-state assignment
- repeatable receipts
- bounded verification
- comparison across memo versions