Asymmetric Regret Calibration

Experiment not yet run

This page describes the experiment design. The harness is built but results have not been collected.

Pillar: Human Proxy

Hypothesis

The regret weight parameter (REGRET_WEIGHT) controls the tradeoff between false approvals (costly: bad work goes unreviewed) and false escalations (cheap: human reviews work that didn't need review). The current setting of 3 is hypothesized to be near-optimal for minimizing total weighted regret.

H1: Total regret is a convex function of REGRET_WEIGHT, with a minimum in the range [2, 5].

H2: REGRET_WEIGHT < 2 produces unacceptably high false approval rates (> 10%).

H3: REGRET_WEIGHT > 5 produces negligible autonomy gains over always-escalate.

Why This Matters

REGRET_WEIGHT=3 was chosen from first principles (false approvals are roughly 3x more costly than false escalations). But "roughly 3x" is a guess. If the actual cost ratio is 1.5x, we're over-escalating and wasting human time. If it's 10x, we're under-escalating and producing unreviewed bad work. This experiment maps the sensitivity surface.

Method

Conditions

REGRET_WEIGHT	Expected behavior
1 (symmetric)	No bias toward escalation. Highest false approval rate.
2	Mild bias. Moderate false approval rate.
3 (current)	Design point. Expected near-optimal.
5	Strong bias. Very few false approvals but many unnecessary escalations.
10	Extreme bias. Near-always-escalate behavior.

Task Selection

30 tasks from Medium tier (where proxy decisions are most consequential — Simple tasks rarely trigger proxy, Complex tasks usually require human review regardless).

Procedure

For each REGRET_WEIGHT value, run 30 tasks with the same proxy (reset between conditions)
Proxy warms up over first 10 tasks, measurements taken on tasks 11-30
Human provides genuine feedback at each gate
Post-hoc review of proxy-approved items to identify false approvals

Measurements

Metric	Description
False approval rate	Proxy approved, human would have rejected
False escalation rate	Proxy escalated, human would have approved
Total regret	`(false_approvals * REGRET_WEIGHT) + false_escalations`
Escalation rate	Overall fraction of gates escalated
Human time	Total time human spends on approvals
Autonomy ratio	Fraction of decisions proxy makes without escalation
Outcome quality	Final task quality score (does over-escalation improve quality?)

Analysis Plan

Plot false approval rate, false escalation rate, and total regret as functions of REGRET_WEIGHT
Identify the REGRET_WEIGHT that minimizes total regret
Sensitivity analysis: how much does total regret change per unit change in REGRET_WEIGHT near the optimum?
Practical tradeoff: plot autonomy ratio vs. false approval rate to find the "efficient frontier"

Results

Experiment not yet run.

Expected Findings

REGRET_WEIGHT=1: ~15% false approval rate, ~10% false escalation rate, high total regret from false approvals
REGRET_WEIGHT=3: ~3-5% false approval rate, ~25% false escalation rate, near-minimal total regret
REGRET_WEIGHT=10: <1% false approval rate, ~60% false escalation rate, high total regret from excessive escalation
Optimal range: REGRET_WEIGHT between 2 and 5, with a shallow minimum (the system is not highly sensitive to exact value in this range)
Quality: Marginal quality improvement from REGRET_WEIGHT=3 to 10 — over-escalation doesn't improve outcomes much because the proxy is already catching the important cases

Threats to Validity

Cost ratio is task-dependent. In safety-critical domains, false approval cost is much higher. In low-stakes creative work, the ratio may be close to 1. This experiment tests a single domain — the optimal weight likely varies.
Warm-up confound. With only 10 warm-up tasks per condition, the proxy may not have converged before measurement begins. Mitigation: analyze convergence curves per condition.
Human fatigue. Running 5 conditions * 30 tasks = 150 tasks total. If run with one human, fatigue effects may contaminate later conditions. Mitigation: counterbalance condition order, spread across sessions.