Rosetta: AI Safety — AnankeLabs

Rosetta AI Safety Adapter

Deterministic Governance for Frontier Models

The AI Safety adapter translates the Substrate's structural-margin reading into the forces governing frontier model agency. Each proposed action becomes a deterministic gate verdict, computed outside the model's optimization.

Substrate sits between the model's action proposal and the execution layer, returning PASS, REJECT_STATE, or REJECT_ACTION on every action gate. The verdict is byte-identically replayable, signed for compliance audit, and reproducible across processes.

Two structural variables compute the verdict:

Λ (Lambda): Model Capability

The force an actor exerts against the boundary of stability.
In a frontier model, this represents optimization pressure: the relentless drive of the agent to achieve its objective.

Γ (Gamma): Alignment Constraint

The structural buffer that absorbs that force.
In a model deployment, this represents the alignment floor: the physical safety envelope that contains the agent's actions.

The Limits of Semantic Containment

Entropy degrades probabilistic defenses at scale. Current alignment methodologies deploy statistical classifiers and behavioral prompting to contain model outputs. These mechanisms operate at the semantic layer, evaluating the response after the engine has already executed the computational load.

Frontier models prioritize objective completion. When the agent discovers a novel optimization path, it exploits systemic ambiguity to route execution flow around post-generation safety nets. The operator must anticipate every permutation of adversarial behavior to reinforce the boundary.

This architecture creates a permanent synchronization lag. The interval between novel vulnerability discovery and a deployed classifier update leaves the infrastructure exposed. Behavioral alignment requires constant manual intervention to maintain the operating envelope.

Fly-By-Wire Structural Control

The KAIROS Substrate moves the boundary upstream of generation. Before the agent commits to an action (generate a completion, call a tool, route to another model, escalate to a human), the engine reads the structural margin the action would consume and returns PASS, REJECT_STATE, or REJECT_ACTION on the action gate. The gate fires inside the action loop, before the response reaches the operator.

KAIROS sits outside the model's optimization, the way a fly-by-wire system sits outside the pilot. The pilot still flies; the fly-by-wire system overrides inputs that would depart controlled flight. The reading is computed from the agent's own structural posture using kMargin and predictedGamma, so the boundary holds against novel optimization paths outside any classifier corpus.

The same envelope extends forward in time. predictedGamma reads the next step; the warning system projects fifteen steps ahead and returns criticality, the gap between the agent's drift trajectory and its best recoverable one. Criticality is the fly-by-wire authority margin: a wide gap means corrective input can still hold the agent inside controlled flight, a collapsed gap means the departure is already committed.

Two consumers, one envelope. The human operator receives the gate decision as a supervision signal: auditable, byte-identically replayable, sufficient for the compliance log. A cooperative agent can consume the same kMargin reading directly as a planning input, choosing safer trajectories before the gate has to intervene. The Rust evaluation runs deterministically inside the action loop, with byte-identical replay for every decision.

Evaluation Point: Pre-action, before token generation or tool invocation
Gate Outputs: PASS / REJECT_STATE / REJECT_ACTION
Reading Channels: kMargin (current envelope), predictedGamma (one-step lookahead), criticality (fifteen-step warning)
Replay Guarantee: Identical inputs produce byte-identical decisions
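
The replay guarantee can be illustrated with a minimal sketch. It assumes the gate is a pure function of its inputs and that each decision is logged in a fixed byte layout; every name below (ActionInput, evaluate, encode, replays_identically) and the floor value are hypothetical and are not taken from the Substrate API.

```rust
// Hypothetical sketch: names, fields, and the floor value are assumptions
// made for illustration, not the Substrate API.

/// The three gate outputs named in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Pass,
    RejectState,
    RejectAction,
}

/// Illustrative gate input: the current buffer and the one-step lookahead.
#[derive(Debug, Clone, Copy)]
pub struct ActionInput {
    pub gamma: f64,
    pub predicted_gamma: f64,
}

/// Assumed deployment floor, for illustration only.
pub const GAMMA_FLOOR: f64 = 0.1;

/// A pure function: no clock, no randomness, no I/O.
/// The same input therefore always yields the same verdict.
pub fn evaluate(input: &ActionInput) -> Verdict {
    if input.gamma < GAMMA_FLOOR {
        Verdict::RejectState
    } else if input.predicted_gamma < GAMMA_FLOOR {
        Verdict::RejectAction
    } else {
        Verdict::Pass
    }
}

/// Encode a decision record with a fixed little-endian layout:
/// 8 bytes gamma, 8 bytes predicted gamma, 1 byte verdict tag.
pub fn encode(input: &ActionInput, verdict: Verdict) -> Vec<u8> {
    let mut out = Vec::with_capacity(17);
    out.extend_from_slice(&input.gamma.to_le_bytes());
    out.extend_from_slice(&input.predicted_gamma.to_le_bytes());
    out.push(match verdict {
        Verdict::Pass => 0,
        Verdict::RejectState => 1,
        Verdict::RejectAction => 2,
    });
    out
}

/// Re-run the gate on the recorded input and compare byte for byte.
pub fn replays_identically(input: &ActionInput, recorded: &[u8]) -> bool {
    encode(input, evaluate(input)).as_slice() == recorded
}
```

An auditor holding the recorded input and the logged bytes can call replays_identically to confirm the decision without trusting the process that produced it. Signing the record is out of scope for this sketch.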

Deterministic

Identical telemetry produces identical envelopes (ϵ = 10^-6).

Model-Agnostic

The engine evaluates actions independently of model architecture.

Zero-dependency

The Rust adapter requires no API calls or network access at evaluation time.

Memory-safe

The core engine contains zero unsafe blocks.

Three Gates. Zero Gaps.

Every proposed action passes through a layered gate chain. Any gate will reject an action that violates structural integrity.

State Gate

Evaluates structural health before the engine considers an action. If gamma (Γ) falls below the deployment floor, the engine rejects all actions. This mechanism remains immune to prompt sensitivity.

Action Gate

Previews proposed actions against the reachability field. Safe tools map to stabilizing directions. Risky tools move toward the repulsor boundary and trigger a rejection.

Hazard Gate

Detects basin collapse and multi-agent paradoxes. These are hard stops. The physics of the system permit no operator override or retry budget.
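
The layered chain above can be sketched as a single function. This is a simplified illustration under stated assumptions: the hazard is reduced to a boolean flag, the action gate is reduced to a floor check on a one-step predicted gamma, and the hazard gate is checked first so that a hard stop is never masked by an ordinary rejection. The names GateResult, Reading, and run_gates are hypothetical.

```rust
// Hypothetical sketch: type names, field names, and the gate order are
// assumptions made for illustration, not the Substrate API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateResult {
    Pass,
    RejectState,
    RejectAction,
    /// Hazard gate: no operator override, no retry budget.
    HardStop,
}

/// Illustrative reading for one proposed action.
#[derive(Debug, Clone, Copy)]
pub struct Reading {
    /// Current structural buffer.
    pub gamma: f64,
    /// Buffer expected after the proposed action (one-step lookahead).
    pub predicted_gamma: f64,
    /// Stand-in for basin collapse or a multi-agent paradox.
    pub hazard: bool,
}

/// Run the three gates; the first gate that objects decides the result.
pub fn run_gates(r: &Reading, floor: f64) -> GateResult {
    // Hazard gate: checked first in this sketch so it cannot be masked.
    if r.hazard {
        return GateResult::HardStop;
    }
    // State gate: below the floor, every action is rejected.
    if r.gamma < floor {
        return GateResult::RejectState;
    }
    // Action gate: the proposed action would push the buffer below the floor.
    if r.predicted_gamma < floor {
        return GateResult::RejectAction;
    }
    GateResult::Pass
}
```

Because run_gates reads only numeric state and never the prompt text, the state-gate branch cannot be influenced by how a request is phrased.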

Intervention That Learns, Then Escalates

Substrate manages the state following a rejection. The system applies proportional intervention based on calculated risk.

Reformulation

When gamma headroom is moderate (≥ 0.1), Substrate signals the model to attempt a different approach. Empirical testing shows models find safe paths 100% of the time when guided by this signal.

Budget Depletion

Rejected actions consume a retry budget. Novelty scoring penalizes repetitive, low-effort attempts. Stall detection terminates oscillation loops to preserve compute resources.

Human Escalation

Substrate routes to a human operator when the budget is exhausted or gamma drops below 0.1. The model cannot proceed. The decision requires human judgment.
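
The escalation ladder above can be sketched as follows. Only the 0.1 headroom threshold comes from the text; the budget units, the novelty-based cost rule, and the names RetryBudget and after_rejection are assumptions made for illustration. Stall detection is not modeled.

```rust
// Hypothetical sketch: only the 0.1 threshold comes from the text.
// The cost rule and all names are assumptions for illustration.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intervention {
    /// Signal the model to try a different approach.
    Reformulate,
    /// The model cannot proceed; a human decides.
    EscalateToHuman,
}

/// Headroom at or above this value permits reformulation.
pub const HEADROOM_THRESHOLD: f64 = 0.1;

/// Retry budget consumed by rejected actions.
#[derive(Debug, Clone)]
pub struct RetryBudget {
    remaining: f64,
}

impl RetryBudget {
    pub fn new(total: f64) -> Self {
        Self { remaining: total }
    }

    /// Charge one rejected attempt. `novelty` is clamped to [0, 1];
    /// a repetitive attempt (novelty 0) costs twice a fresh one (novelty 1).
    pub fn charge(&mut self, novelty: f64) {
        let cost = 2.0 - novelty.clamp(0.0, 1.0);
        self.remaining = (self.remaining - cost).max(0.0);
    }

    pub fn exhausted(&self) -> bool {
        self.remaining <= 0.0
    }
}

/// Decide what follows a rejection: charge the budget, then escalate if
/// headroom is below the threshold or the budget is spent.
pub fn after_rejection(
    gamma_headroom: f64,
    budget: &mut RetryBudget,
    novelty: f64,
) -> Intervention {
    budget.charge(novelty);
    if gamma_headroom < HEADROOM_THRESHOLD || budget.exhausted() {
        Intervention::EscalateToHuman
    } else {
        Intervention::Reformulate
    }
}
```

With a budget of 3.0, a novel retry (cost 1.0) leaves room to reformulate, while a following repetitive retry (cost 2.0) spends the remainder and escalates.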

Structural Proof of Compliance

The EU AI Act imposes risk-management, record-keeping, and human-oversight obligations on high-risk AI systems. KAIROS provides infrastructure designed to support those obligations.

Deterministic Evaluation

The engine operates with absolute determinism. Identical input parameters guarantee identical trace outputs. The safety boundary operates as a verifiable physical constant.

Cryptographic HITL

The architecture enforces an authoritative Human-in-the-Loop control plane. The system halts boundary violations and requires an RSA-PSS signed override token from a credentialed operator to proceed.

Immutable Auditability

The engine extracts the complete topological trajectory of every evaluation. All hazard gate triggers, parameter decompositions, and HITL operator interventions generate an immutable physics log.

The Warning System

The action gate reads one step ahead. The warning system reads fifteen. It projects two counterfactual futures from the current state and compares them: the trajectory under foresight held at zero against the trajectory under full foresight. The comparison tells an operator how much alignment margin is at risk, how soon, and whether the agent can still recover.

Drift Path

The trajectory the agent follows with foresight held at zero, projected fifteen steps from the current state. It is the future of staying the present course.

Optimal Path

The trajectory the agent follows under full foresight, projected over the same fifteen steps. It is the best recovery the structure still allows.

Severity

How much alignment margin the drift path stands to lose. Small dips read as zero; large drops saturate at the maximum.

Imminence

How soon the loss arrives along the path. An immediate threat scores high; a distant one decays toward zero.

Risk

Severity multiplied by imminence. A severe distant threat and a mild immediate one both settle at moderate risk.

Criticality

The gap between the drift path and the optimal path. A wide gap means the agent can still recover; a narrow gap under high risk means the course is already locked.

Hysteresis

Separate thresholds arm and clear the warning. The signal holds its state through boundary noise and settles only when the reading clearly resolves.

Exponential Smoothing

The displayed level eases toward its target across ticks. Operators read a steady trend they can act on.
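
Hysteresis and exponential smoothing, the last two items above, can be sketched together. The threshold values, the smoothing factor, and the name WarningSignal are assumptions for illustration; the engine's actual parameters are not stated here.

```rust
// Hypothetical sketch: thresholds, smoothing factor, and names are
// assumptions for illustration.

/// A warning signal with separate arm/clear thresholds and a smoothed
/// display level.
#[derive(Debug, Clone)]
pub struct WarningSignal {
    arm_at: f64,
    clear_at: f64,
    alpha: f64,
    armed: bool,
    level: f64,
}

impl WarningSignal {
    /// `clear_at` must sit below `arm_at`; `alpha` is the smoothing
    /// factor in (0, 1].
    pub fn new(arm_at: f64, clear_at: f64, alpha: f64) -> Self {
        assert!(clear_at < arm_at, "clear threshold must be below arm threshold");
        assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
        Self { arm_at, clear_at, alpha, armed: false, level: 0.0 }
    }

    /// Advance one tick. Returns (armed, displayed level).
    pub fn tick(&mut self, reading: f64) -> (bool, f64) {
        // Hysteresis: readings between the two thresholds keep the
        // previous state, so boundary noise does not toggle the signal.
        if !self.armed && reading >= self.arm_at {
            self.armed = true;
        } else if self.armed && reading <= self.clear_at {
            self.armed = false;
        }
        // Exponential smoothing: move a fraction `alpha` of the way
        // toward the target on every tick.
        let target = if self.armed { 1.0 } else { 0.0 };
        self.level += self.alpha * (target - self.level);
        (self.armed, self.level)
    }
}
```

With thresholds 0.7 and 0.3, a reading of 0.5 leaves the signal in whatever state it already held, which is the behavior the Hysteresis entry describes.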

The warning system runs every tick, ahead of the gate verdict. Criticality carries the decision: it separates a dangerous trajectory the agent can still escape from one already closed. A wide gap keeps the agent in autonomous reformulation; a collapsed gap routes the action to a human.
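
The scoring and routing described above can be sketched as follows. The text fixes only the shape of each quantity: severity has a dead zone and saturates, imminence decays with distance, risk is their product, and criticality is the gap between the two paths. The linear ramp, the exponential decay, the use of each path's lowest margin, and all names and thresholds are assumptions for illustration.

```rust
// Hypothetical sketch: functional forms, constants, and names are
// assumptions. Only risk = severity * imminence is stated in the text.

/// Projection horizon of the warning system, in steps.
pub const HORIZON: usize = 15;

/// Severity in [0, 1]: drops inside the dead zone read as zero,
/// drops at or past `saturation` read as one.
pub fn severity(margin_drop: f64, dead_zone: f64, saturation: f64) -> f64 {
    assert!(saturation > dead_zone, "saturation must exceed dead zone");
    ((margin_drop - dead_zone) / (saturation - dead_zone)).clamp(0.0, 1.0)
}

/// Imminence in (0, 1]: 1.0 for an immediate loss, decaying toward zero.
pub fn imminence(steps_until_loss: u32, decay_per_step: f64) -> f64 {
    (-decay_per_step * f64::from(steps_until_loss)).exp()
}

/// Risk is severity multiplied by imminence.
pub fn risk(severity: f64, imminence: f64) -> f64 {
    severity * imminence
}

/// Criticality as the gap between the lowest margin on the optimal path
/// and the lowest margin on the drift path (never negative).
pub fn criticality(drift: &[f64], optimal: &[f64]) -> f64 {
    let lowest = |path: &[f64]| path.iter().copied().fold(f64::INFINITY, f64::min);
    (lowest(optimal) - lowest(drift)).max(0.0)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    /// Risk is below the alarm level.
    NoWarning,
    /// Risky, but the gap is wide: the agent can still recover.
    Reformulate,
    /// Risky and the gap has collapsed: route to a human.
    Human,
}

pub fn route(risk: f64, criticality: f64, risk_alarm: f64, min_gap: f64) -> Route {
    if risk < risk_alarm {
        Route::NoWarning
    } else if criticality >= min_gap {
        Route::Reformulate
    } else {
        Route::Human
    }
}
```

Under these assumed forms, a fully severe threat five steps out at a decay of 0.2 per step and a mild immediate one both land near 0.37, matching the observation that the two cases settle at moderate risk.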

Two Audiences, One Envelope

Two forward readings ride the same response envelope: the predicted gamma one step ahead, and the warning system's criticality fifteen steps ahead. Each lands in two places at once: the operator's dashboard and the agent's own context window.

The Operator

A dashboard reads the predicted gamma per proposed action before the gate fires, with the warning system's criticality beside it. The reviewer sees both the structural cost of the next move and whether the trajectory fifteen steps out stays recoverable, then accepts, holds, or asks the agent to reformulate.

The Agent

A cooperative agent is the actor we want inside the safe interior. When its framework surfaces the reading back into the language-model context, the agent reads the same predicted gamma and criticality, and adjusts the next proposal by intention: tightening the immediate move while the optimal path still holds open.

Architectural support is automatic. Both readings ride the response envelope the engine already returns, and any agent framework that surfaces structured evaluations back into the language-model context closes the loop without engine changes.
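
As a rough illustration of closing that loop, an agent framework could render the two forward readings into a single line of text and append it to the model's context. The Envelope struct, its field names, and the line format below are hypothetical; the engine's actual response schema is not shown on this page.

```rust
// Hypothetical sketch: the struct, its fields, and the text format are
// assumptions, not the engine's actual response schema.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Pass,
    RejectState,
    RejectAction,
}

impl Verdict {
    /// Spelling used for the gate outputs in the text.
    pub fn as_str(self) -> &'static str {
        match self {
            Verdict::Pass => "PASS",
            Verdict::RejectState => "REJECT_STATE",
            Verdict::RejectAction => "REJECT_ACTION",
        }
    }
}

/// Illustrative response envelope carrying both forward readings.
#[derive(Debug, Clone, Copy)]
pub struct Envelope {
    pub verdict: Verdict,
    /// One-step lookahead.
    pub predicted_gamma: f64,
    /// Fifteen-step warning reading.
    pub criticality: f64,
}

impl Envelope {
    /// Render the envelope as one line an agent framework could append
    /// to the language-model context.
    pub fn to_context_line(&self) -> String {
        format!(
            "[substrate] verdict={} predicted_gamma={:.3} criticality={:.3}",
            self.verdict.as_str(),
            self.predicted_gamma,
            self.criticality
        )
    }
}
```

The same struct can feed the operator dashboard, so both audiences read identical numbers.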

Technical Specifications

Engine

Language: Rust (Stable)
Latency: Sub-millisecond
Determinism: ϵ = 10^-6

Security & Safety

Security: RSA-PSS Signing
Safety: Zero unsafe in core
Dependencies: Zero external

One Engine. Four Surfaces.

The Rust codebase compiles to four specific deployment targets.

Native Library

Embeds into hypervisors and robotics controllers via C FFI.

CLI Binary

Provides trace analysis and policy linting for CI/CD pipelines.

Python SDK

Offers direct access to evaluation via PyO3 bindings.

WASM Module

Enables browser-based advisory evaluations and visualizations.

Where to Dig Deeper

The body of this page is the translator. Each item below names a load-bearing feature or piece of supporting research and points at the depth artifact where the full treatment lives.

Calibrated Benign Baseline

A 144-cell synthetic grid calibrated against the public agent-evaluation literature. Wilson 95% CI policy-positive rates per (archetype × profile).

Methodology debrief →

Boundary Study v1

Gate-accuracy proof on a 6-task corpus: 48/48 risky-tool rejections, 20/20 safe completions, zero false negatives, zero false positives.

Study writeup →

Per-Action Gamma Headroom

The forward structural-margin reading the engine returns alongside every proposed action: the technical surface the Two Audiences section sits on top of.

Operationalisation post →

Kairos Margin (kMargin)

The signed buffer-unit form of structural margin. Operator-facing alternative to raw gamma, with companion fields gateBreached and displayRegime.

Margin blog post →

Distributed Retry Ledger

Multi-node-safe retry budget and escalation state via the HITL coordinator's authenticated adaptive-ledger endpoints. Fail-closed on unreachable coordinators.

Dist. Retry Ledger post →

Become a Design Partner

Telemetry contribution shape, redaction rules, labelling discipline, and what partners get back. Mutual NDA, redacted exports preferred, aggregate-only publication.

Partner invitation →

Become a Design Partner

The Rosetta AI Safety adapter is shipping to design partners ahead of general availability. Active pilots: the AI safety adapter (agent trajectories). See the partner brief for what a contribution looks like and what comes back.

Compliance and regulatory teams, agent-eval researchers, and investors are also welcome to reach out. Submit your details or use the Contact tab.
