
Cost-Quality Frontier

Experiment not yet run

This page describes the experiment design. The harness is built but results have not been collected.


Hypothesis

TeaParty's architectural choices (CfA protocol, hierarchical teams, human proxy, learning retrieval) have different cost-quality tradeoffs at different task complexity levels. There exists a configuration frontier — the set of architectural configurations that achieve the best quality for a given cost budget.

H1: The full TeaParty stack (all pillars enabled) dominates simpler configurations on Complex tasks but is dominated by simpler configurations on Simple tasks.

H2: Each pillar contributes positive marginal quality, but with diminishing returns — the first pillar enabled provides more lift than the fourth.

H3: Learning retrieval has the highest quality-per-dollar of any individual pillar, because it reuses prior work rather than generating new computation.

Why This Matters

Multi-agent hierarchical systems are expensive. A hiring manager's first question is: "Is it worth the cost?" This experiment answers that question directly by mapping the cost-quality surface across configurations and task complexities. It also identifies which architectural components provide the most value per dollar — critical for prioritizing implementation work and for making practical deployment recommendations.

Method

Configurations

Config         CfA  Hierarchy  Proxy  Learning  Description
Baseline       No   No         No     No        Single agent, single prompt, no protocol
+CfA           Yes  No         No     No        Three-phase protocol, flat team
+Hierarchy     Yes  Yes        No     No        Hierarchical teams, always escalate
+Proxy         Yes  Yes        Yes    No        Proxy-mediated approval
Full           Yes  Yes        Yes    Yes       All pillars enabled
Learning only  No   No         No     Yes       Baseline + learning retrieval
Proxy only     No   No         No     Yes       Baseline + proxy approval
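
As a concrete reference point, here is a minimal sketch of how the seven configurations could be encoded in a Python harness. PillarConfig and its field names are illustrative, not TeaParty's actual API.

```python
# Illustrative encoding of the seven experimental configurations.
# Class and field names are hypothetical, chosen for this sketch.
from dataclasses import dataclass


@dataclass(frozen=True)
class PillarConfig:
    """One architectural configuration: which pillars are enabled."""
    name: str
    cfa: bool        # three-phase CfA protocol
    hierarchy: bool  # hierarchical teams with escalation
    proxy: bool      # proxy-mediated human approval
    learning: bool   # learning retrieval

CONFIGS = [
    PillarConfig("Baseline",      cfa=False, hierarchy=False, proxy=False, learning=False),
    PillarConfig("+CfA",          cfa=True,  hierarchy=False, proxy=False, learning=False),
    PillarConfig("+Hierarchy",    cfa=True,  hierarchy=True,  proxy=False, learning=False),
    PillarConfig("+Proxy",        cfa=True,  hierarchy=True,  proxy=True,  learning=False),
    PillarConfig("Full",          cfa=True,  hierarchy=True,  proxy=True,  learning=True),
    PillarConfig("Learning only", cfa=False, hierarchy=False, proxy=False, learning=True),
    PillarConfig("Proxy only",    cfa=False, hierarchy=False, proxy=True,  learning=False),
]
```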

Task Selection

45 tasks: 15 per tier (Simple, Medium, Complex). Each task has a reference solution and an estimated "ideal cost": the tokens a single agent would consume if an expert human gave it direct instructions.
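
A hypothetical task record, assuming the harness stores the tier, reference solution, and ideal-cost estimate alongside each task; the field names are illustrative.

```python
# Sketch of a task record. "ideal_cost_tokens" is kept per task so the
# overhead ratio can be computed later against actual spend.
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    tier: str                # "Simple" | "Medium" | "Complex"
    reference_solution: str  # expert reference used for quality scoring
    ideal_cost_tokens: int   # expert-directed single-agent token estimate
```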

Measurements

Metric           Description
Quality score    Composite score on a 1-5 scale
Total cost ($)   Dollar cost of all LLM calls
Total tokens     Input + output tokens across all agents
Cost efficiency  Quality score / total cost
Overhead ratio   Total cost / ideal cost (how much more expensive than the expert-directed minimum)
Human time       Minutes the human spends on approvals, feedback, and rework
Combined cost    LLM cost + (human time * hourly rate)
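
The derived metrics are simple ratios over the raw measurements. A sketch in Python, assuming quality is the 1-5 composite score, costs are in dollars, and human time is in minutes; the hourly rate is an assumed placeholder, not a value fixed by the design.

```python
HOURLY_RATE = 75.0  # $/hour; assumed placeholder, not specified by the design


def cost_efficiency(quality: float, total_cost: float) -> float:
    """Quality points per dollar spent on LLM calls."""
    return quality / total_cost


def overhead_ratio(total_cost: float, ideal_cost: float) -> float:
    """How much more expensive a run is than the expert-directed minimum."""
    return total_cost / ideal_cost


def combined_cost(llm_cost: float, human_minutes: float,
                  hourly_rate: float = HOURLY_RATE) -> float:
    """LLM cost plus the dollar value of human approval/feedback/rework time."""
    return llm_cost + (human_minutes / 60.0) * hourly_rate
```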

Analysis Plan

  • Frontier construction: For each task tier, plot quality vs. cost for all configurations. Identify the Pareto frontier: configurations where no other config achieves higher quality at equal or lower cost (a code sketch follows this list).
  • Marginal contribution: For each pillar, compute the quality gain from adding it to the configuration that has everything except it. Rank pillars by marginal contribution per tier.
  • Scaling analysis: Regress cost on task complexity for each configuration. Compare scaling exponents — does hierarchy bend the cost curve?
  • Break-even analysis: At what task complexity does the full stack become cheaper than baseline + rework? (Include rework cost in the baseline.)
  • Human cost inclusion: Re-compute frontiers with combined cost (LLM + human time). Does proxy shift the frontier by reducing human time?
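
A minimal Python sketch of the frontier-identification step, assuming each configuration's results for a tier have been reduced to a single (name, cost, quality) point; the function name is illustrative.

```python
def pareto_frontier(
    points: list[tuple[str, float, float]],
) -> list[tuple[str, float, float]]:
    """Return the Pareto-optimal (name, cost, quality) points.

    A point is dominated if some other point has quality >= its quality at
    cost <= its cost, with at least one strict inequality.
    """
    frontier = []
    for name, cost, quality in points:
        dominated = any(
            other_cost <= cost and other_quality >= quality
            and (other_cost < cost or other_quality > quality)
            for other_name, other_cost, other_quality in points
            if other_name != name
        )
        if not dominated:
            frontier.append((name, cost, quality))
    return sorted(frontier, key=lambda p: p[1])  # order by ascending cost
```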

Results

Experiment not yet run.

Expected Findings

  • Simple tasks: Baseline is cheapest and sufficient. Each additional pillar adds cost without quality improvement. The overhead ratio for Full config is 5-10x.
  • Medium tasks: CfA provides the largest quality lift. Hierarchy is marginally helpful. Proxy saves human time. Overhead ratio for Full config is 2-4x.
  • Complex tasks: Full config dominates. Baseline requires extensive rework (2-3 iterations), making its effective cost higher than Full config despite lower per-run cost. Hierarchy provides the largest quality lift by enabling parallel execution and preventing context exhaustion.
  • Learning retrieval: Highest quality-per-dollar across all tiers because retrieval cost is negligible compared to generation cost. Even on Simple tasks, relevant learnings prevent known mistakes.
  • Break-even: Full stack breaks even with baseline around the Medium/Complex boundary when rework costs are included.

Threats to Validity

  • Configuration interactions. Pillars may interact non-additively (e.g., proxy is more valuable when learning is enabled). The seven-configuration design covers only part of the full 2^4 factorial (16 configurations), so some interactions go unmeasured, and even this subset at 7 configs * 45 tasks = 315 runs is already expensive.
  • Reference solution bias. Quality is rated against reference solutions, which may reflect one valid approach. Alternative correct solutions may be scored lower.
  • Cost variability. LLM costs depend on model, caching, prompt length. Results are specific to current pricing and model capabilities.
  • Human time estimation. Human time per task varies by evaluator speed and domain familiarity. Mitigation: use consistent evaluators and report inter-evaluator variance.