Research progression 1 of 12

Epistemic Reliability Test (ERT): Minimal Public Definition

Plain-Language Summary

The Epistemic Reliability Test (ERT) is an experimental evaluation framework for studying whether an AI system's reasoning holds together when the same underlying question is asked in different ways. ERT asks more than "Did the system get one answer right?"; it asks whether that answer survives variation in how the question is posed.

Current Status

Foundation page. It defines the public-safe meaning of ERT and should be read before the later implementation and reporting notes.

How to Read This Page

  • This is page 1 of 12 in the public ERT / Project Aletheia progression.
  • Read it as a public research note: it explains the concept and what changed without exposing protected implementation details.
  • Redaction markers mean the public boundary is intentional, not that the section is missing by accident.
  • Start here if you want the shortest explanation of what ERT is trying to measure.

Public Note

This page is an introductory public-safe version of a working research definition. Certain formulas, scoring procedures, implementation mechanisms, and private evaluation architecture are intentionally omitted or reserved for controlled disclosure.

[REDACTED — protected scoring formulas, thresholds, and implementation details are not included in this public version.]

Why ERT Exists

Many AI evaluations focus on whether a system produces a correct answer in a single test instance.

That is useful, but it is incomplete.

A system may answer correctly once while still showing deeper reliability problems, such as:

  • changing its answer when a question is reworded;
  • sounding confident while being internally unstable;
  • failing to separate known facts from ambiguous situations;
  • contradicting itself across related prompts;
  • overfitting to one phrasing or benchmark format.

ERT was created to help observe these behaviors in a structured way.

The goal is to evaluate reasoning behavior, not just answer production.

What ERT Takes In

ERT works with three public-safe input categories.

1. Structured Prompt Sets

A prompt set is a group of related questions designed to test the same underlying idea.

The prompts may include:

  • rephrasing;
  • structural variation;
  • added context;
  • controlled ambiguity;
  • mild distraction or noise.

Each set should still target one central concept or question.
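As an illustration only, a prompt set could be represented as a small data structure. The class names, the field names, and the extra "baseline" variation kind below are assumptions made for this sketch; they are not part of the ERT definition.

```python
from dataclasses import dataclass, field
from enum import Enum


class Variation(Enum):
    """Kinds of variation from the list above; BASELINE is an added assumption."""
    BASELINE = "baseline"
    REPHRASING = "rephrasing"
    STRUCTURAL = "structural variation"
    ADDED_CONTEXT = "added context"
    CONTROLLED_AMBIGUITY = "controlled ambiguity"
    DISTRACTION = "mild distraction or noise"


@dataclass(frozen=True)
class Prompt:
    text: str
    variation: Variation


@dataclass
class PromptSet:
    concept: str  # the one central concept or question the set targets
    prompts: list[Prompt] = field(default_factory=list)

    def add(self, text: str, variation: Variation) -> None:
        self.prompts.append(Prompt(text, variation))


# Hypothetical example set: three prompts, one underlying question.
boiling = PromptSet(concept="boiling point of water at sea level")
boiling.add("At what temperature does water boil at sea level?", Variation.BASELINE)
boiling.add("Water at sea level starts boiling at how many degrees Celsius?", Variation.REPHRASING)
boiling.add("My kettle is red. At sea level, what is water's boiling point?", Variation.DISTRACTION)
```

Keeping the concept on the set, rather than on each prompt, makes it harder to drift away from the one-concept-per-set rule.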

2. Model Responses

For each prompt, ERT examines the response produced by the system.

When available, ERT also considers a confidence signal. If the system does not provide explicit confidence, confidence may need to be inferred or estimated through public-safe methods.

3. Lightweight Evaluation Context

The test may include basic context such as whether the scenario is factual, ambiguous, adversarial, or reasoning-focused.

ERT does not require internal model access. This makes it usable as a black-box diagnostic framework.
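The black-box property can be sketched as a collection step that touches only the system's public input and output. The callable signature, the Observation record, and toy_model are hypothetical names chosen for this example, not an ERT interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical signature: a system maps a prompt string to
# (answer text, optional self-reported confidence in [0, 1]).
Model = Callable[[str], tuple[str, Optional[float]]]


@dataclass(frozen=True)
class Observation:
    prompt: str
    answer: str
    confidence: Optional[float]  # None when the system reports no confidence
    scenario: str                # e.g. "factual", "ambiguous", "adversarial", "reasoning"


def collect(model: Model, prompts: list[str], scenario: str) -> list[Observation]:
    """Query the system once per prompt; no internal model access is used."""
    observations = []
    for prompt in prompts:
        answer, confidence = model(prompt)
        observations.append(Observation(prompt, answer, confidence, scenario))
    return observations


def toy_model(prompt: str) -> tuple[str, Optional[float]]:
    """Stand-in for a real system: fixed answer, no confidence signal."""
    return "100 degrees Celsius", None


observations = collect(
    toy_model,
    ["At what temperature does water boil at sea level?",
     "What is water's boiling point at sea level?"],
    scenario="factual",
)
```

When confidence is None, as here, any confidence estimate would have to come from a separate step that this sketch does not attempt.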

What ERT Produces

ERT produces a diagnostic reliability profile.

The profile may include:

  • an overall reliability signal;
  • a consistency profile;
  • a stability signal;
  • contradiction flags;
  • uncertainty calibration signals;
  • an epistemic resolution signal;
  • a plain-language behavioral classification.

The purpose is not to reduce the system to one score. The purpose is to explain how the reasoning behaved under variation.
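One way to picture such a profile is as a record with one slot per signal. This is a minimal sketch: the field names, the numeric types, and the rendering format are assumptions for illustration, and nothing here reflects the protected scoring procedure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReliabilityProfile:
    # Field types are assumptions; None means "not measured".
    overall: Optional[float] = None
    consistency: Optional[float] = None
    stability: Optional[float] = None
    contradiction_flags: list[str] = field(default_factory=list)
    calibration: Optional[float] = None
    epistemic_resolution: Optional[float] = None
    classification: Optional[str] = None

    def lines(self) -> list[str]:
        """Render only the signals that were actually measured."""
        out = []
        for name in ("overall", "consistency", "stability",
                     "calibration", "epistemic_resolution"):
            value = getattr(self, name)
            if value is not None:
                out.append(f"{name}: {value:.2f}")
        for flag in self.contradiction_flags:
            out.append(f"contradiction: {flag}")
        if self.classification is not None:
            out.append(f"classification: {self.classification}")
        return out


profile = ReliabilityProfile(
    consistency=0.67,
    contradiction_flags=["prompts 1 and 3 disagree"],
    classification="Guided",
)
```

Keeping the signals as separate fields, instead of folding them into one number, mirrors the stated aim of explaining behavior rather than ranking it.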

Core Reliability Signals

Consistency

Do related prompts produce aligned answers when they are testing the same idea?
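As a simple stand-in, and explicitly not the protected ERT scoring, consistency could be approximated by the share of answers that agree with the most common answer in the set. The function names and the crude text normalization are assumptions for this sketch.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude normalization (assumption): lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def agreement_rate(answers: list[str]) -> float:
    """Share of answers that match the most common normalized answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(answers)


# Three of the four answers agree after normalization.
consistency = agreement_rate(["Paris", "paris", "  Paris ", "Lyon"])
```

Exact string matching only suits short factual answers; free-form responses would need some form of semantic comparison, which is outside this sketch.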

Stability

Does the system remain steady when wording, format, or framing changes?
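One illustrative way to look at stability is to compare each varied prompt's answer with the answer to an unvaried baseline prompt, grouped by the kind of variation. The baseline idea, the function name, and the exact-match comparison are assumptions for this sketch, not the ERT method.

```python
from collections import defaultdict


def stability_by_variation(baseline: str,
                           variants: list[tuple[str, str]]) -> dict[str, float]:
    """variants holds (variation kind, answer) pairs.

    Returns, per variation kind, the share of answers matching the baseline.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    reference = baseline.strip().lower()
    for kind, answer in variants:
        totals[kind] += 1
        if answer.strip().lower() == reference:
            matches[kind] += 1
    return {kind: matches[kind] / totals[kind] for kind in totals}


stability = stability_by_variation(
    "Paris",
    [("rephrasing", "Paris"), ("rephrasing", "paris"),
     ("distraction", "Lyon"), ("distraction", "Paris")],
)
```

Grouping by variation kind shows where the system is sensitive; in this toy data, rewording is harmless but distraction is not.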

Contradiction Detection

Does the system give incompatible answers inside the same test set?
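For yes/no prompts, a contradiction check can be sketched by reducing every answer to the truth value it implies for one underlying claim. The tuple layout and the "negated" marker below are assumptions for this example; real responses would first need to be mapped to yes or no, which this sketch does not cover.

```python
from itertools import combinations


def implied_truth(answered_yes: bool, negated: bool) -> bool:
    """Truth value the answer assigns to the underlying claim.

    A "yes" to a negated prompt implies the claim is false.
    """
    return answered_yes != negated


def contradiction_flags(items: list[tuple[str, bool, bool]]) -> list[tuple[str, str]]:
    """items holds (prompt, negated, answered_yes); returns disagreeing prompt pairs."""
    flags = []
    for (p1, n1, y1), (p2, n2, y2) in combinations(items, 2):
        if implied_truth(y1, n1) != implied_truth(y2, n2):
            flags.append((p1, p2))
    return flags


# Underlying claim: "17 is a prime number."
items = [
    ("Is 17 a prime number?", False, True),            # implies claim is true
    ("Is it false that 17 is prime?", True, False),    # implies claim is true
    ("Is 17 a composite number?", True, True),         # implies claim is false
]
flags = contradiction_flags(items)
```

Here the third answer conflicts with each of the first two, so two pairs are flagged.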

Confidence Alignment

Does the system's confidence match the reliability of its answers?
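A common, generic way to illustrate this question is the gap between average stated confidence and observed accuracy. This is not the protected ERT calculation; the function name is hypothetical, and the sketch assumes that numeric confidences and reference correctness labels are available.

```python
def confidence_gap(confidences: list[float], correct: list[bool]) -> float:
    """Mean stated confidence minus observed accuracy.

    Positive values suggest overconfidence, negative values underconfidence.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and the same length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Stated confidence of 0.9 on every answer, but only half are correct.
gap = confidence_gap([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
```

In this toy case the gap is about 0.4, the pattern a report might describe as sounding certain despite weak reliability.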

Epistemic Resolution

Does the system distinguish between:

  • known information;
  • unknown information;
  • ambiguous information;
  • cases where more clarification is needed?

The distinction matters because a trustworthy system should not treat all uncertainty the same way: not knowing an answer calls for a different response than facing a question with several valid readings.
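The four cases can be written down as an explicit set of states, with a simple check of how often a response expresses the state the test expected. The enum, the function name, and the idea of a reviewer-assigned "observed" state are assumptions for this sketch.

```python
from enum import Enum


class EpistemicState(Enum):
    KNOWN = "known information"
    UNKNOWN = "unknown information"
    AMBIGUOUS = "ambiguous information"
    NEEDS_CLARIFICATION = "more clarification is needed"


def resolution_rate(pairs: list[tuple[EpistemicState, EpistemicState]]) -> float:
    """pairs holds (expected state, state the response was judged to express)."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(1 for expected, observed in pairs if expected is observed)
    return hits / len(pairs)


pairs = [
    (EpistemicState.KNOWN, EpistemicState.KNOWN),
    (EpistemicState.AMBIGUOUS, EpistemicState.KNOWN),  # ambiguity treated as settled fact
    (EpistemicState.UNKNOWN, EpistemicState.UNKNOWN),
    (EpistemicState.NEEDS_CLARIFICATION, EpistemicState.NEEDS_CLARIFICATION),
]
resolution = resolution_rate(pairs)
```

The mismatched pair is the informative one: it records which kind of uncertainty was collapsed into which other kind.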

Behavioral Classifications

A public ERT report may describe behavior using labels such as:

  • Stable — reasoning remains aligned and appropriately cautious;
  • Guided — reasoning is mostly stable but still sensitive to phrasing;
  • Unstable — small variations cause major reasoning shifts;
  • Overconfident — the system sounds certain despite weak reliability;
  • Diffuse / Uncertain — the system cannot clearly resolve the test conditions.

These labels are meant to help non-technical readers understand the system's behavior without needing to inspect every technical detail.
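For tooling that emits these labels, a fixed vocabulary keeps reports consistent. The sketch below only encodes the labels and descriptions listed above; the class and function names are hypothetical, and the rules for assigning a label are deliberately left out.

```python
from enum import Enum


class Behavior(Enum):
    STABLE = ("Stable", "reasoning remains aligned and appropriately cautious")
    GUIDED = ("Guided", "reasoning is mostly stable but still sensitive to phrasing")
    UNSTABLE = ("Unstable", "small variations cause major reasoning shifts")
    OVERCONFIDENT = ("Overconfident", "the system sounds certain despite weak reliability")
    DIFFUSE = ("Diffuse / Uncertain", "the system cannot clearly resolve the test conditions")

    def __init__(self, label: str, description: str) -> None:
        self.label = label
        self.description = description


def describe(behavior: Behavior) -> str:
    """One-line, plain-language rendering for a report."""
    return f"{behavior.label}: {behavior.description}"


line = describe(Behavior.UNSTABLE)
```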

Example Report Structure

A public ERT report should be readable like a reliability profile, not a dense academic paper.

A basic report can include:

  1. executive summary;
  2. reliability breakdown;
  3. failure mode summary;
  4. scenario-level observations;
  5. plain-language interpretation;
  6. recommended use profile.

For example, a report might say:

"The system performs reliably when questions are clearly phrased, but its reasoning becomes inconsistent when the same idea is expressed differently. It may present uncertain answers with high confidence, which can lead to misleading outputs."
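The six-part structure can be sketched as a plain-text renderer that always emits the sections in the same order. The function name, the section keys, and the placeholder text for missing sections are assumptions for this example.

```python
SECTIONS = [
    "Executive summary",
    "Reliability breakdown",
    "Failure mode summary",
    "Scenario-level observations",
    "Plain-language interpretation",
    "Recommended use profile",
]


def render_report(content: dict[str, str]) -> str:
    """Render all six sections in order; mark any missing section explicitly."""
    parts = []
    for number, title in enumerate(SECTIONS, start=1):
        body = content.get(title, "(not provided)")
        parts.append(f"{number}. {title}\n{body}")
    return "\n\n".join(parts)


report = render_report({
    "Executive summary":
        "Reliable on clearly phrased questions; inconsistent under rewording.",
})
```

Marking absent sections, rather than dropping them, keeps reports comparable across systems.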

Canonical Test Areas

Early ERT concepts include several reusable test areas:

Simple Factual Consistency

Tests whether answers remain stable when the same fact is asked in different ways.

Ambiguity Handling

Tests whether the system can recognize when a question has multiple valid interpretations or lacks enough information.

Adversarial Variation

Tests whether misleading framing, irrelevant context, or distractions cause reasoning drift.

Cross-Domain Equivalence

Tests whether the system can preserve the same reasoning structure across different domains.

Controlled Contradiction Detection

Tests whether the system can maintain internal coherence across related prompts.

Research Log Framing

This document represents the first public step in defining ERT as a repeatable evaluation concept.

At this stage, the emphasis is on:

  • defining the test surface;
  • keeping metrics interpretable;
  • avoiding dependence on private model internals;
  • making reports useful to both technical and non-technical readers;
  • leaving space for later scoring improvements without breaking the public concept.