
Cost-Quality Frontier

Experiment not yet run

This page describes the experiment design. The harness is built but results have not been collected.


Hypothesis

TeaParty's architectural choices (CfA protocol, hierarchical teams, human proxy, learning retrieval) have different cost-quality tradeoffs at different task complexity levels. There exists a configuration frontier — the set of architectural configurations that achieve the best quality for a given cost budget.

H1: The full TeaParty stack (all pillars enabled) dominates simpler configurations on Complex tasks but is dominated by simpler configurations on Simple tasks.

H2: Each pillar contributes positive marginal quality, but with diminishing returns — the first pillar enabled provides more lift than the fourth.

H3: Learning retrieval has the highest quality-per-dollar of any individual pillar, because it reuses prior work rather than generating new computation.

Why This Matters

Multi-agent hierarchical systems are expensive. A hiring manager's first question is: "Is it worth the cost?" This experiment answers that question directly by mapping the cost-quality surface across configurations and task complexities. It also identifies which architectural components provide the most value per dollar — critical for prioritizing implementation work and for making practical deployment recommendations.

Method

Configurations

Config         CfA  Hierarchy  Proxy  Learning  Description
Baseline       No   No         No     No        Single agent, single prompt, no protocol
+CfA           Yes  No         No     No        Three-phase protocol, flat team
+Hierarchy     Yes  Yes        No     No        Hierarchical teams, always escalate
+Proxy         Yes  Yes        Yes    No        Proxy-mediated approval
Full           Yes  Yes        Yes    Yes       All pillars enabled
Learning only  No   No         No     Yes       Baseline + learning retrieval
Proxy only     No   No         No     Yes       Baseline + proxy approval
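
As a concrete reference point, here is a minimal sketch of how the seven configurations could be encoded in a Python harness. PillarConfig and its field names are illustrative, not TeaParty's actual API.

```python
# Illustrative encoding of the seven experimental configurations.
# Class and field names are hypothetical, chosen for this sketch.
from dataclasses import dataclass


@dataclass(frozen=True)
class PillarConfig:
    """One architectural configuration: which pillars are enabled."""
    name: str
    cfa: bool        # three-phase CfA protocol
    hierarchy: bool  # hierarchical teams with escalation
    proxy: bool      # proxy-mediated human approval
    learning: bool   # learning retrieval

CONFIGS = [
    PillarConfig("Baseline",      cfa=False, hierarchy=False, proxy=False, learning=False),
    PillarConfig("+CfA",          cfa=True,  hierarchy=False, proxy=False, learning=False),
    PillarConfig("+Hierarchy",    cfa=True,  hierarchy=True,  proxy=False, learning=False),
    PillarConfig("+Proxy",        cfa=True,  hierarchy=True,  proxy=True,  learning=False),
    PillarConfig("Full",          cfa=True,  hierarchy=True,  proxy=True,  learning=True),
    PillarConfig("Learning only", cfa=False, hierarchy=False, proxy=False, learning=True),
    PillarConfig("Proxy only",    cfa=False, hierarchy=False, proxy=True,  learning=False),
]
```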

Task Selection

45 tasks: 15 per tier (Simple, Medium, Complex). Each task has a reference solution and an estimated "ideal cost": the tokens a single agent would consume if an expert human gave it direct instructions.
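
A hypothetical task record, assuming the harness stores the tier, reference solution, and ideal-cost estimate alongside each task; the field names are illustrative.

```python
# Sketch of a task record. "ideal_cost_tokens" is kept per task so the
# overhead ratio can be computed later against actual spend.
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    tier: str                # "Simple" | "Medium" | "Complex"
    reference_solution: str  # expert reference used for quality scoring
    ideal_cost_tokens: int   # expert-directed single-agent token estimate
```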

Measurements

Metric           Description
Quality score    Composite score on a 1-5 scale
Total cost ($)   Dollar cost of all LLM calls
Total tokens     Input + output tokens across all agents
Cost efficiency  Quality score / total cost
Overhead ratio   Total cost / ideal cost (how much more expensive than the expert-directed minimum)
Human time       Minutes the human spends on approvals, feedback, and rework
Combined cost    LLM cost + (human time * hourly rate)
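
The derived metrics are simple ratios over the raw measurements. A sketch in Python, assuming quality is the 1-5 composite score, costs are in dollars, and human time is in minutes; the hourly rate is an assumed placeholder, not a value fixed by the design.

```python
HOURLY_RATE = 75.0  # $/hour; assumed placeholder, not specified by the design


def cost_efficiency(quality: float, total_cost: float) -> float:
    """Quality points per dollar spent on LLM calls."""
    return quality / total_cost


def overhead_ratio(total_cost: float, ideal_cost: float) -> float:
    """How much more expensive a run is than the expert-directed minimum."""
    return total_cost / ideal_cost


def combined_cost(llm_cost: float, human_minutes: float,
                  hourly_rate: float = HOURLY_RATE) -> float:
    """LLM cost plus the dollar value of human approval/feedback/rework time."""
    return llm_cost + (human_minutes / 60.0) * hourly_rate
```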

Analysis Plan

  • Frontier construction: For each task tier, plot quality vs. cost for all configurations. Identify the Pareto frontier: configurations where no other config achieves higher quality at equal or lower cost (a code sketch follows this list).
  • Marginal contribution: For each pillar, compute the quality gain from adding it to the configuration that has everything except it. Rank pillars by marginal contribution per tier.
  • Scaling analysis: Regress cost on task complexity for each configuration. Compare scaling exponents — does hierarchy bend the cost curve?
  • Break-even analysis: At what task complexity does the full stack become cheaper than baseline + rework? (Include rework cost in the baseline.)
  • Human cost inclusion: Re-compute frontiers with combined cost (LLM + human time). Does proxy shift the frontier by reducing human time?
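
A minimal Python sketch of the frontier-identification step, assuming each configuration's results for a tier have been reduced to a single (name, cost, quality) point; the function name is illustrative.

```python
def pareto_frontier(
    points: list[tuple[str, float, float]],
) -> list[tuple[str, float, float]]:
    """Return the Pareto-optimal (name, cost, quality) points.

    A point is dominated if some other point has quality >= its quality at
    cost <= its cost, with at least one strict inequality.
    """
    frontier = []
    for name, cost, quality in points:
        dominated = any(
            other_cost <= cost and other_quality >= quality
            and (other_cost < cost or other_quality > quality)
            for other_name, other_cost, other_quality in points
            if other_name != name
        )
        if not dominated:
            frontier.append((name, cost, quality))
    return sorted(frontier, key=lambda p: p[1])  # order by ascending cost
```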

Results

Experiment not yet run.

Expected Findings

  • Simple tasks: Baseline is cheapest and sufficient. Each additional pillar adds cost without quality improvement. The overhead ratio for Full config is 5-10x.
  • Medium tasks: CfA provides the largest quality lift. Hierarchy is marginally helpful. Proxy saves human time. Overhead ratio for Full config is 2-4x.
  • Complex tasks: Full config dominates. Baseline requires extensive rework (2-3 iterations), making its effective cost higher than Full config despite lower per-run cost. Hierarchy provides the largest quality lift by enabling parallel execution and preventing context exhaustion.
  • Learning retrieval: Highest quality-per-dollar across all tiers because retrieval cost is negligible compared to generation cost. Even on Simple tasks, relevant learnings prevent known mistakes.
  • Break-even: Full stack breaks even with baseline around the Medium/Complex boundary when rework costs are included.

Threats to Validity

  • Configuration interactions. Pillars may interact non-additively (e.g., proxy is more valuable when learning is enabled). The seven-configuration design covers only part of the full 2^4 factorial (16 configurations), so some interactions go unmeasured, and even this subset at 7 configs * 45 tasks = 315 runs is already expensive.
  • Reference solution bias. Quality is rated against reference solutions, which may reflect one valid approach. Alternative correct solutions may be scored lower.
  • Cost variability. LLM costs depend on model, caching, prompt length. Results are specific to current pricing and model capabilities.
  • Human time estimation. Human time per task varies by evaluator speed and domain familiarity. Mitigation: use consistent evaluators and report inter-evaluator variance.