Current Status
Foundation page. It defines the public-safe meaning of ERT and should be read before the later implementation and reporting notes.
How to Read This Page
- This is page 1 of 12 in the public ERT / Project Aletheia progression.
- Read it as a public research note: it explains the concept and what changed without exposing protected implementation details.
- A redaction marker means content was withheld on purpose, not that a section is missing by accident.
- Start here if you want the shortest explanation of what ERT is trying to measure.
Public Note
This page is an introductory public-safe version of a working research definition. Certain formulas, scoring procedures, implementation mechanisms, and private evaluation architecture are intentionally omitted or reserved for controlled disclosure.
[REDACTED — protected scoring formulas, thresholds, and implementation details are not included in this public version.]
Why ERT Exists
Many AI evaluations focus on whether a system produces a correct answer in a single test instance.
That is useful, but it is incomplete.
A system may answer correctly once while still showing deeper reliability problems, such as:
- changing its answer when a question is reworded;
- sounding confident while its underlying reasoning is unstable;
- failing to separate known facts from ambiguous situations;
- contradicting itself across related prompts;
- overfitting to one phrasing or benchmark format.
ERT was created to help observe these behaviors in a structured way.
The goal is to evaluate reasoning behavior, not just answer production.
What ERT Takes In
ERT works with three public-safe input categories.
1. Structured Prompt Sets
A prompt set is a group of related questions designed to test the same underlying idea.
The prompts may include:
- rephrasing;
- structural variation;
- added context;
- controlled ambiguity;
- mild distraction or noise.
Each set should still target one central concept or question.
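As an illustration only, a prompt set can be pictured as one central concept plus a list of tagged variants. The sketch below is not part of ERT's protected implementation. The class names, field names, and variation identifiers are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field

# Variation kinds mirror the public list above; the identifier spellings are assumptions.
VARIATION_KINDS = {
    "rephrasing",
    "structural_variation",
    "added_context",
    "controlled_ambiguity",
    "distraction",
}


@dataclass(frozen=True)
class PromptVariant:
    text: str
    variation: str  # must be one of VARIATION_KINDS

    def __post_init__(self):
        if self.variation not in VARIATION_KINDS:
            raise ValueError(f"unknown variation kind: {self.variation!r}")


@dataclass
class PromptSet:
    concept: str  # the single central concept every variant targets
    variants: list[PromptVariant] = field(default_factory=list)

    def add(self, text: str, variation: str) -> "PromptSet":
        """Append a variant and return self so calls can be chained."""
        self.variants.append(PromptVariant(text, variation))
        return self


# Hypothetical example set: three variants, one underlying question.
boiling = (
    PromptSet(concept="boiling point of water at sea level")
    .add("At what temperature does water boil at sea level?", "rephrasing")
    .add("Sea level, pure water, heated steadily: when does it boil?", "structural_variation")
    .add("My kettle is blue. At sea level, when does water boil?", "distraction")
)
```

Keeping the concept as an explicit field makes the "one central concept" rule visible in the data itself.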
2. Model Responses
For each prompt, ERT examines the response produced by the system.
When available, ERT also considers a confidence signal. If the system does not report confidence explicitly, it may need to be inferred or estimated through public-safe methods.
3. Lightweight Evaluation Context
The test may include basic context such as whether the scenario is factual, ambiguous, adversarial, or reasoning-focused.
ERT does not require internal model access. This makes it usable as a black-box diagnostic framework.
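The black-box property can be sketched as follows: the evaluator only ever calls a function that maps a prompt to a response and an optional confidence. This is a minimal sketch under stated assumptions. The names `Observation`, `collect`, and `stub_system` are hypothetical, and the stub stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Context labels taken from the text above; the tuple itself is an assumption.
CONTEXT_KINDS = ("factual", "ambiguous", "adversarial", "reasoning")


@dataclass
class Observation:
    prompt: str
    response: str
    confidence: Optional[float]  # None when the system reports no confidence
    context: str


def collect(
    prompts: list[str],
    ask: Callable[[str], tuple[str, Optional[float]]],
    context: str,
) -> list[Observation]:
    """Query the system once per prompt, using nothing but its outputs."""
    if context not in CONTEXT_KINDS:
        raise ValueError(f"unknown context: {context!r}")
    observations = []
    for prompt in prompts:
        response, confidence = ask(prompt)
        observations.append(Observation(prompt, response, confidence, context))
    return observations


def stub_system(prompt: str) -> tuple[str, Optional[float]]:
    # Stand-in for a real model call; it returns a fixed answer and no confidence.
    return "100 degrees Celsius", None


observations = collect(
    ["At what temperature does water boil at sea level?",
     "Water at sea level boils at what temperature?"],
    stub_system,
    context="factual",
)
```

Because `collect` sees only the callable, any system that can answer a text prompt can be swapped in without access to weights or internal state.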
What ERT Produces
ERT produces a diagnostic reliability profile.
The profile may include:
- an overall reliability signal;
- a consistency profile;
- a stability signal;
- contradiction flags;
- uncertainty calibration signals;
- an epistemic resolution signal;
- a plain-language behavioral classification.
The purpose is not to reduce the system to one score. The purpose is to explain how the reasoning behaved under variation.
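One way to picture a profile that resists collapsing into a single score is a record with one optional field per signal. The sketch below is illustrative only. The field names follow the list above, while the numeric types are placeholders, since this public note does not specify scales or formulas.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ReliabilityProfile:
    # Every numeric type here is an assumption; the public note defines no scale.
    overall_reliability: Optional[float] = None
    consistency: Optional[float] = None
    stability: Optional[float] = None
    contradiction_flags: list[str] = field(default_factory=list)
    uncertainty_calibration: Optional[float] = None
    epistemic_resolution: Optional[float] = None
    classification: Optional[str] = None

    def reported_signals(self) -> list[str]:
        """Names of the signals that carry a value, in declaration order."""
        return [
            f.name
            for f in fields(self)
            if getattr(self, f.name) not in (None, [])
        ]


# Hypothetical values, used only to show the shape of a partial profile.
profile = ReliabilityProfile(
    consistency=0.75,
    contradiction_flags=["prompt 2 vs prompt 4"],
    classification="Guided",
)
print(profile.reported_signals())
```

Leaving unmeasured signals as `None` keeps a partial profile honest about what was and was not observed.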
Core Reliability Signals
Consistency
Do related prompts produce aligned answers when they are testing the same idea?
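A simple public-safe stand-in for this question is an agreement rate: the share of answers in a set that match the most common answer after light normalization. This is not ERT's protected scoring. The function names and the normalization rule are assumptions made for illustration.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing periods."""
    return " ".join(answer.lower().split()).rstrip(".")


def agreement_rate(answers: list[str]) -> float:
    """Fraction of answers that match the most common normalized answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(answers)


# Hypothetical answers to four rewordings of the same question.
answers = ["Paris", "paris.", "Paris ", "Lyon"]
rate = agreement_rate(answers)  # three of the four answers agree, so 0.75
```

String matching only works for short factual answers. Free-form responses would need a semantic comparison, which is outside the scope of this sketch.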
Stability
Does the system remain steady when wording, format, or framing changes?
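To make the idea concrete, one could compare each varied answer against a baseline answer and group the results by the kind of change applied. The sketch below is an assumption-laden illustration, not the protected stability measure. The function name and the exact-match comparison are hypothetical choices.

```python
from collections import defaultdict


def stability_by_variation(
    baseline_answer: str,
    varied: list[tuple[str, str]],
) -> dict[str, float]:
    """For each variation kind, the fraction of answers that match the baseline.

    `varied` holds (variation_kind, answer) pairs.
    """
    baseline = baseline_answer.strip().lower()
    totals: dict[str, int] = defaultdict(int)
    matches: dict[str, int] = defaultdict(int)
    for kind, answer in varied:
        totals[kind] += 1
        if answer.strip().lower() == baseline:
            matches[kind] += 1
    return {kind: matches[kind] / totals[kind] for kind in totals}


# Hypothetical results: the answer holds under wording and format changes,
# but flips once when the framing changes.
result = stability_by_variation(
    "Paris",
    [
        ("wording", "Paris"),
        ("wording", "paris"),
        ("format", "Paris"),
        ("framing", "Lyon"),
        ("framing", "Paris"),
    ],
)
```

Grouping by variation kind shows where the system drifts, which a single pooled number would hide.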
Contradiction Detection
Does the system give incompatible answers inside the same test set?
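For prompts that are expected to share one answer, a crude contradiction flag can be raised whenever two answers in the set differ. The sketch below assumes short, directly comparable answers and uses exact matching as a proxy for incompatibility. It illustrates the idea and is not the protected detection method.

```python
from itertools import combinations


def contradiction_flags(responses: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of prompts whose answers disagree.

    Assumes all prompts in `responses` are meant to have the same answer.
    """
    flags = []
    for (prompt_a, answer_a), (prompt_b, answer_b) in combinations(responses.items(), 2):
        if answer_a.strip().lower() != answer_b.strip().lower():
            flags.append((prompt_a, prompt_b))
    return flags


# Hypothetical responses to three phrasings of one yes/no question.
responses = {
    "Is 17 prime?": "Yes",
    "Is 17 a prime number?": "yes",
    "Would you call 17 a prime?": "No",
}
flags = contradiction_flags(responses)
# The third prompt disagrees with each of the first two, so two pairs are flagged.
```

Returning the offending prompt pairs, rather than a count, lets a report point to the exact place where coherence broke.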
Confidence Alignment
Does the system's confidence match the reliability of its answers?
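One simple way to express this question numerically is the gap between average stated confidence and an observed reliability rate. The sketch below assumes both values lie between 0 and 1 and that a reliability rate has already been obtained by some other means. It is a generic calibration-style check, not ERT's protected formula.

```python
from statistics import mean


def confidence_gap(confidences: list[float], observed_reliability: float) -> float:
    """Mean stated confidence minus observed reliability.

    A positive gap points toward overconfidence and a negative gap toward
    underconfidence. Both inputs are assumed to lie in [0, 1].
    """
    if not confidences:
        raise ValueError("need at least one confidence value")
    values = list(confidences) + [observed_reliability]
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("values must lie in [0, 1]")
    return mean(confidences) - observed_reliability


# Hypothetical case: the system states about 90% confidence on average,
# but only half of its answers in the set held up.
gap = confidence_gap([0.9, 0.95, 0.85], observed_reliability=0.5)
# gap is roughly 0.4, which points toward overconfidence
```

How large a gap must be before it counts as misaligned is deliberately left open here, since thresholds are not part of this public version.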
Epistemic Resolution
Does the system distinguish between:
- known information;
- unknown information;
- ambiguous information;
- cases where more clarification is needed?
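This distinction matters because a trustworthy system should not treat all uncertainty the same way. For example, a missing fact calls for admitting ignorance, while an ambiguous question calls for asking which interpretation is meant.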
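The four categories above can be sketched as an enumeration, with a check of how often the state a response exhibits matches the state the prompt was designed to elicit. This is illustrative only. The names are hypothetical, and how a response gets labeled with a state (for example, by a human reviewer) is an assumption left outside the sketch.

```python
from enum import Enum


class EpistemicState(Enum):
    KNOWN = "known"
    UNKNOWN = "unknown"
    AMBIGUOUS = "ambiguous"
    NEEDS_CLARIFICATION = "needs clarification"


def resolution_match_rate(
    expected: list[EpistemicState],
    observed: list[EpistemicState],
) -> float:
    """Fraction of prompts where the observed state equals the expected one."""
    if not expected or len(expected) != len(observed):
        raise ValueError("expected and observed must be non-empty and equal in length")
    hits = sum(e is o for e, o in zip(expected, observed))
    return hits / len(expected)


# Hypothetical labels: the system treated one ambiguous prompt as settled fact.
expected = [
    EpistemicState.KNOWN,
    EpistemicState.AMBIGUOUS,
    EpistemicState.UNKNOWN,
    EpistemicState.NEEDS_CLARIFICATION,
]
observed = [
    EpistemicState.KNOWN,
    EpistemicState.KNOWN,
    EpistemicState.UNKNOWN,
    EpistemicState.NEEDS_CLARIFICATION,
]
match_rate = resolution_match_rate(expected, observed)  # 3 of 4 match, so 0.75
```

Keeping the four states separate makes it possible to see which kind of uncertainty a system mishandles, rather than only that it mishandles some.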
Behavioral Classifications
A public ERT report may describe behavior using labels such as:
- Stable — reasoning remains aligned and appropriately cautious;
- Guided — reasoning is mostly stable but still sensitive to phrasing;
- Unstable — small variations cause major reasoning shifts;
- Overconfident — the system sounds certain despite weak reliability;
- Diffuse / Uncertain — the system cannot clearly resolve the test conditions.
These labels are meant to help non-technical readers understand the system's behavior without needing to inspect every technical detail.
Example Report Structure
A public ERT report should be readable like a reliability profile, not a dense academic paper.
A basic report can include:
- executive summary;
- reliability breakdown;
- failure mode summary;
- scenario-level observations;
- plain-language interpretation;
- recommended use profile.
For example, a report might say:
"The system performs reliably when questions are clearly phrased, but its reasoning becomes inconsistent when the same idea is expressed differently. It may present uncertain answers with high confidence, which can lead to misleading outputs."
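The report outline above can be sketched as a small plain-text renderer that enforces the six sections in order. The section titles come from the list above. The function name and the placeholder content are assumptions for illustration, not a prescribed report format.

```python
# Section titles follow the basic report outline; the fixed order is an assumption.
REPORT_SECTIONS = (
    "Executive summary",
    "Reliability breakdown",
    "Failure mode summary",
    "Scenario-level observations",
    "Plain-language interpretation",
    "Recommended use profile",
)


def render_report(content: dict[str, str]) -> str:
    """Render a plain-text report, failing loudly if any section is missing."""
    missing = [s for s in REPORT_SECTIONS if s not in content]
    if missing:
        raise ValueError(f"missing sections: {', '.join(missing)}")
    lines = []
    for section in REPORT_SECTIONS:
        lines.append(section)
        lines.append(content[section])
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"


# Placeholder text only; a real report would carry actual findings.
draft = {section: f"(placeholder for {section.lower()})" for section in REPORT_SECTIONS}
report = render_report(draft)
```

Rejecting incomplete drafts keeps every public report comparable, since readers can rely on finding the same sections in the same order.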
Canonical Test Areas
Early ERT concepts include several reusable test areas:
Simple Factual Consistency
Tests whether answers remain stable when the same fact is asked in different ways.
Ambiguity Handling
Tests whether the system can recognize when a question has multiple valid interpretations or lacks enough information.
Adversarial Variation
Tests whether misleading framing, irrelevant context, or distractions cause reasoning drift.
Cross-Domain Equivalence
Tests whether the system can preserve the same reasoning structure across different domains.
Controlled Contradiction Detection
Tests whether the system can maintain internal coherence across related prompts.
Research Log Framing
This document represents the first public step in defining ERT as a repeatable evaluation concept.
At this stage, the emphasis is on:
- defining the test surface;
- keeping metrics interpretable;
- avoiding dependence on private model internals;
- making reports useful to both technical and non-technical readers;
- leaving space for later scoring improvements without breaking the public concept.