Proxy Convergence
Experiment not yet run
This page describes the experiment design. The harness is built but results have not been collected.
Pillar: Human Proxy
Hypothesis
The dual-signal confidence model (Laplace estimate + EMA) converges toward accurate prediction of human approval decisions. As the proxy observes more human decisions, its escalation rate decreases and its prediction accuracy increases — without increasing the false approval rate.
H1: Proxy prediction accuracy improves monotonically over the first 30 observations per (state, task-type) pair.
H2: Escalation rate decreases from ~100% (cold start) to a stable plateau below 40% within 20 observations.
H3: False approval rate remains below 5% throughout convergence, due to asymmetric regret weighting.
Why This Matters
The human proxy is the mechanism that makes agent autonomy tractable. If it doesn't converge, the human remains a permanent bottleneck. If it converges too aggressively (false approvals), it produces unreviewed bad work. The dual-signal design with asymmetric regret is intended to thread this needle — but we need to verify it empirically, especially the convergence rate and the false-approval floor.
Method
Setup
A single human works with the same proxy across 50 sequential tasks, providing feedback at each approval gate. Tasks are drawn from a single domain to allow learning to accumulate (cross-domain convergence is a separate question).
Conditions
| Condition | Description |
|---|---|
| Dual-signal (treatment) | Laplace + EMA, min of both, REGRET_WEIGHT=3 |
| Laplace-only (ablation) | Laplace estimate only, same REGRET_WEIGHT |
| EMA-only (ablation) | EMA only, same REGRET_WEIGHT |
| No proxy (control) | Always escalate to human |
Measurements
| Metric | Description | Source |
|---|---|---|
| Prediction accuracy | Did proxy predict human's decision correctly? | Proxy log vs. human response |
| Escalation rate | Fraction of gates where proxy escalates to human | Proxy log |
| False approval rate | Proxy approved but human would have rejected | Proxy log + human post-hoc review |
| False escalation rate | Proxy escalated but human would have approved | Proxy log + human judgment |
| Confidence score trajectory | Raw confidence values over time per (state, task-type) | Proxy log |
| Human time per task | Wall-clock time human spends on approvals | Timestamps |
| Exploration trigger rate | Fraction of escalations caused by forced exploration (15%) vs. low confidence | Proxy log |
Procedure
- Run 50 tasks sequentially with the same human
- At each approval gate, log the proxy's prediction and confidence score before showing to human
- Human provides genuine approval/rejection (not simulated)
- After all 50 tasks, human does post-hoc review of proxy-approved items to catch false approvals the proxy masked
Analysis Plan
- Plot accuracy, escalation rate, and false approval rate as time series over 50 tasks
- Identify convergence point: first task after which escalation rate stays within 5% of its final value
- Compare convergence curves between dual-signal, Laplace-only, and EMA-only conditions
- Human time savings: compare human time per task in dual-signal vs. no-proxy condition
Results
Experiment not yet run.
Expected Findings
- Cold start (tasks 1-5): 100% escalation rate for all conditions, by design
- Calibrating (tasks 5-20): Rapid decrease in escalation rate. EMA-only may converge faster but oscillate more. Laplace-only converges slower but more smoothly. Dual-signal should track the more conservative (slower) of the two.
- Warm start (tasks 20-50): Escalation rate stabilizes. Dual-signal expected to plateau at 20-35% (15% floor from forced exploration + genuine uncertainty).
- False approval rate: Should remain below 5% for dual-signal across all phases. EMA-only may spike during rapid convergence.
- Human time savings: Expected 50-70% reduction in human time per task at warm start vs. cold start.
Threats to Validity
- Human consistency. If the human's preferences drift over 50 tasks, the proxy is chasing a moving target. Mitigation: use a well-defined domain where preferences are relatively stable.
- Task homogeneity. 50 tasks in one domain may overstate convergence speed. Real usage spans multiple domains.
- Simulated vs. genuine stakes. Experimental approval decisions may not carry the same weight as production decisions. Mitigation: use real project tasks, not toy problems.