CfA Backtrack Effectiveness

Experiment not yet run

This page describes the experiment design. The harness is built but results have not been collected.

Pillar: Conversation for Action

Hypothesis

Cross-phase backtracks (e.g., Execution → Intent, Planning → Intent) reduce terminal rework — cases where the human rejects the final output and the entire task must be redone. Forward-only systems compound mid-stream errors because they have no mechanism to reconsider foundational assumptions.

H1: Tasks run with backtracks enabled produce higher final-acceptance rates than tasks run forward-only.

H2: The total cost (including backtrack overhead) is lower than the cost of terminal rework in the forward-only condition.

Why This Matters

Most agent frameworks treat execution as a forward march: plan, execute, deliver. When the plan was built on a flawed understanding of intent, the result is wasted work. The CfA state machine's seven backtrack transitions are designed to catch these errors early — but backtracks are expensive (they discard partial work). The question is whether the cost of early correction is less than the cost of late discovery.

Method

Conditions

Condition	Description
Backtrack (treatment)	Full CfA state machine with all 7 cross-phase backtrack transitions enabled
Forward-only (control)	CfA state machine with backtrack transitions disabled — phases can only move forward

Task Selection

20 tasks from the Medium and Complex tiers of the shared corpus. Tasks are selected to include ambiguous requirements (where intent misalignment is likely) and clear requirements (where backtracks should be unnecessary).

Procedure

Each task is run twice — once per condition, with order counterbalanced
Same human evaluator for both conditions (blind to condition where possible)
The human provides the same initial request for both runs
Backtrack condition: orchestrator may trigger backtracks when agents detect misalignment
Forward-only condition: orchestrator proceeds through Intent → Planning → Execution without revisiting prior phases

Measurements

Metric	Description	Source
Final acceptance rate	Binary: human accepts output without requesting rework	Human judgment
Rework count	Number of times human requests changes after final delivery	Human judgment
Quality score	1-5 composite across correctness, completeness, coherence, quality, alignment	Human judgment
Backtrack count	Number of cross-phase backtracks triggered	CfA state log
Backtrack trigger distribution	Which of the 7 backtrack types fired, and how often	CfA state log
Total tokens	Sum of all tokens consumed across all agents and phases	Token counter
Total cost	Dollar cost of LLM calls	Token counter
Wall-clock time	End-to-end duration	Timestamps

Analysis Plan

Paired comparison on final acceptance rate (McNemar's test)
Paired comparison on quality scores (Wilcoxon signed-rank)
Cost comparison: total cost in backtrack condition vs. (initial cost + rework cost) in forward-only condition
Subgroup analysis: ambiguous vs. clear tasks (backtracks should primarily help on ambiguous tasks)

Results

Experiment not yet run. This section will be populated with data tables, effect sizes, and statistical tests.

Expected Findings

Based on the design rationale, we expect:

Higher acceptance rates for the backtrack condition, primarily on ambiguous tasks
Modest token overhead for backtracks (10-30% more tokens per run), offset by elimination of full-task rework
Backtrack triggers concentrated in Planning → Intent (plan reveals intent was wrong) and Execution → Planning (execution reveals plan was wrong)
Clear tasks should show no significant difference between conditions (backtracks should not fire when requirements are unambiguous)

Threats to Validity

Evaluator bias: The human may unconsciously adjust expectations between conditions. Mitigation: blind evaluation where possible; inter-rater reliability with second evaluator on subset.
Task selection bias: If tasks are too clear, backtracks never fire and the experiment is uninformative. If too ambiguous, the forward-only condition fails trivially. Mitigation: balanced selection with pre-screening.
Order effects: Running the same task twice may prime the human. Mitigation: counterbalanced ordering, minimum 48-hour gap between runs of same task.