Experimental Results
Status: Harness built, experiments not yet run
The experimentation harness and instrumentation infrastructure described below are implemented. The individual experiment designs are complete — hypotheses, methodologies, and evaluation criteria are specified. However, no experiments have been executed yet. The results sections in the individual experiment pages are empty.
This section presents the empirical evaluation plan for TeaParty's four research pillars. Each experiment isolates a specific architectural claim, specifies the methodology for testing it, and will report results once runs complete.
Motivation
The conceptual design makes strong structural claims: that cross-phase backtracks reduce terminal rework, that hierarchical teams outperform flat ones at scale, that asymmetric regret weighting calibrates autonomy better than symmetric approaches, and that scoped retrieval beats flat retrieval for agent coordination. These claims need evidence.
The experiments below are designed to be:
- Ablative — each experiment isolates one architectural choice and compares the system with and without it
- Reproducible — tasks, prompts, and evaluation criteria are specified precisely enough to rerun
- Cost-aware — token usage and dollar cost are first-class metrics alongside quality
Experimental Infrastructure
All experiments use the TeaParty POC orchestrator with instrumented logging. Key instrumentation points:
- Token counters per agent, per hierarchy level, per CfA phase
- State transition logs from the CfA state machine (timestamps, transition type, trigger)
- Proxy decision logs from the approval gate (confidence scores, decision, human response)
- Learning retrieval logs (query, results, relevance judgments)
- Task outcome ratings (human-judged quality on a 1-5 Likert scale, plus a binary accept/rework decision)
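To make these instrumentation points concrete, the sketch below shows one way the log records could be structured as Python dataclasses. The field names and CfA phase labels are illustrative assumptions, not the POC orchestrator's actual schema.

```python
# Illustrative sketch of instrumentation records; field names and phase labels
# are assumptions for this document, not the actual TeaParty POC schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TokenUsage:
    agent_id: str
    hierarchy_level: int
    cfa_phase: str            # assumed label, e.g. "request" or "performance"
    prompt_tokens: int
    completion_tokens: int

@dataclass
class StateTransition:
    timestamp: datetime
    from_state: str
    to_state: str
    transition_type: str      # "forward" or "backtrack"
    trigger: str              # what caused the transition

@dataclass
class ProxyDecision:
    timestamp: datetime
    confidence: float         # dual-signal confidence score in [0, 1]
    decision: str             # e.g. "auto_approve" or "escalate" (assumed values)
    human_response: str | None = None   # recorded when a human confirms or overrides

@dataclass
class TaskOutcome:
    task_id: str
    quality_scores: dict[str, int] = field(default_factory=dict)  # dimension -> 1-5
    accepted: bool = False    # binary accept/rework judgment
```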
Task Corpus
Experiments draw from a shared corpus of tasks spanning three complexity tiers:
| Tier | Description | Example | Expected agents |
|---|---|---|---|
| Simple | Single-file, single-domain | Fix a bug in one module | 1-2 |
| Medium | Multi-file, single-domain | Add a feature with tests | 3-5 |
| Complex | Multi-file, cross-domain | Design and implement a new subsystem | 5-10+ |
Each tier contains 15-20 tasks with known-good reference solutions for quality comparison.
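As a rough sketch, a corpus entry could be represented as follows; the type names and fields (`Tier`, `reference_solution_path`, `expected_agents`) are assumptions for illustration rather than the harness's actual format.

```python
# Sketch of a task corpus entry; names and fields are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    SIMPLE = "simple"    # single-file, single-domain; 1-2 agents expected
    MEDIUM = "medium"    # multi-file, single-domain; 3-5 agents expected
    COMPLEX = "complex"  # multi-file, cross-domain; 5-10+ agents expected

@dataclass
class Task:
    task_id: str
    tier: Tier
    prompt: str                        # exact task prompt given to the orchestrator
    reference_solution_path: str       # known-good solution used for quality comparison
    expected_agents: tuple[int, int]   # (min, max) agents expected for the tier

def corpus_summary(tasks: list[Task]) -> dict[str, int]:
    """Count tasks per tier to check against the 15-20 tasks-per-tier target."""
    counts: dict[str, int] = {}
    for t in tasks:
        counts[t.tier.value] = counts.get(t.tier.value, 0) + 1
    return counts
```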
Evaluation Methodology
Quality scoring. Each output is rated by a human evaluator on five dimensions:
1. Correctness — does it do what was asked?
2. Completeness — are all requirements addressed?
3. Coherence — do the parts fit together?
4. Code quality — is the implementation clean and idiomatic?
5. Alignment — does it reflect the spirit of the request, not just the letter?
Statistical approach. Paired comparisons (treatment vs. control on the same task) with Wilcoxon signed-rank tests for ordinal quality scores. Bootstrap confidence intervals for cost and token metrics. Effect sizes reported as Cohen's d or rank-biserial correlation.
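A minimal sketch of this analysis using SciPy is shown below, assuming per-task quality scores and dollar costs are collected as paired arrays for the treatment and control configurations; the function names are illustrative, not part of the harness.

```python
# Sketch of the paired-comparison analysis; assumes paired per-task arrays
# for a treatment configuration and its control.
import numpy as np
from scipy import stats

def paired_quality_test(treatment: np.ndarray, control: np.ndarray):
    """Wilcoxon signed-rank test on ordinal quality scores,
    plus the matched-pairs rank-biserial effect size."""
    res = stats.wilcoxon(treatment, control)
    # Rank-biserial correlation from the signed-rank sums (zero differences dropped,
    # matching Wilcoxon's default handling).
    diffs = treatment - control
    nonzero = diffs[diffs != 0]
    ranks = stats.rankdata(np.abs(nonzero))
    r_plus = ranks[nonzero > 0].sum()
    r_minus = ranks[nonzero < 0].sum()
    rank_biserial = (r_plus - r_minus) / (r_plus + r_minus)
    return res.statistic, res.pvalue, rank_biserial

def cost_ci(treatment_cost: np.ndarray, control_cost: np.ndarray, level: float = 0.95):
    """Bootstrap confidence interval for the mean paired cost difference."""
    res = stats.bootstrap(
        (treatment_cost - control_cost,),   # single sample of paired differences
        np.mean,
        confidence_level=level,
        n_resamples=10_000,
        method="BCa",
    )
    return res.confidence_interval.low, res.confidence_interval.high
```

Computing the rank-biserial correlation directly from the signed-rank sums keeps the effect size on the same ordinal footing as the Wilcoxon test itself.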
Experiments
Pillar 1: Conversation for Action
- CfA Backtrack Effectiveness — Do cross-phase backtracks reduce terminal rework compared to forward-only execution?
Pillar 2: Hierarchical Teams
- Hierarchical vs. Flat Coordination — Does hierarchical dispatch with liaison-mediated context compression outperform flat single-team execution?
- Liaison Context Compression — How much information is preserved vs. lost at each hierarchy boundary?
Pillar 3: Human Proxy
- Proxy Convergence — Does the dual-signal confidence model converge toward accurate human preference prediction?
- Asymmetric Regret Calibration — Is REGRET_WEIGHT=3 near-optimal, and how sensitive are outcomes to this parameter?
Pillar 4: Learning System
- Scoped vs. Flat Retrieval — Does team-scoped retrieval produce more relevant context than undifferentiated global retrieval?
Cross-Cutting
- Cost-Quality Frontier — How do cost and quality scale with task complexity across architectural configurations?