Scoped vs. Flat Retrieval
Experiment not yet run
This page describes the experiment design. The harness is built but results have not been collected.
Pillar: Learning System
Hypothesis
Team-scoped retrieval — where learnings are weighted by proximity to the requesting agent's scope in the organizational hierarchy — produces more relevant context than undifferentiated global retrieval.
H1: Scoped retrieval produces higher precision@5 (fraction of top-5 retrieved learnings rated relevant) than flat retrieval.
H2: Agents using scoped retrieval make fewer errors attributable to irrelevant or misleading context.
H3: The proximity weighting (team > project > global) correctly reflects the actual relevance distribution.
Why This Matters
The learning system's scoped retrieval is the mechanism that bridges the scoping-blindness tradeoff. Without it, context isolation (the hierarchical team structure) prevents agents from accessing organizational knowledge that should inform their work. But naive "retrieve everything" doesn't work either — irrelevant learnings are noise. The scope multiplier is the architectural bet that structural proximity in the org hierarchy correlates with relevance. This experiment tests that bet.
Method
Conditions
| Condition | Description |
|---|---|
| Scoped (treatment) | Retrieval with proximity weighting: `prominence = importance * recency_decay * (1 + reinforcement_count) * scope_multiplier` |
| Flat (control) | Same retrieval formula, but `scope_multiplier = 1.0` for all learnings regardless of scope |
| No retrieval (baseline) | Agents receive no historical learnings |
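To make the two conditions concrete, the sketch below scores and ranks learnings both ways. The field names, the exponential recency decay, and the multiplier values (team 2.0, project 1.5, global 1.0) are illustrative assumptions, not the implemented parameters; in the real system, scope proximity would be computed relative to the requesting agent's position in the hierarchy.

```python
from dataclasses import dataclass
import time

@dataclass
class Learning:
    text: str
    importance: float          # 0-1, assigned when the learning is recorded
    created_at: float          # unix timestamp
    reinforcement_count: int   # times the learning has been re-confirmed
    scope: str                 # "team", "project", or "global", relative to the requester

# Illustrative multipliers encoding team > project > global; not the implemented values.
SCOPE_MULTIPLIERS = {"team": 2.0, "project": 1.5, "global": 1.0}

def prominence(l: Learning, *, scoped: bool, half_life_days: float = 30.0) -> float:
    """Score one learning with the formula from the table above.
    scoped=False is the flat control (scope_multiplier pinned to 1.0)."""
    age_days = (time.time() - l.created_at) / 86400
    recency_decay = 0.5 ** (age_days / half_life_days)  # assumed exponential decay
    scope_multiplier = SCOPE_MULTIPLIERS[l.scope] if scoped else 1.0
    return l.importance * recency_decay * (1 + l.reinforcement_count) * scope_multiplier

def retrieve(corpus: list[Learning], k: int, *, scoped: bool) -> list[Learning]:
    """Return the top-k learnings by prominence under the given condition."""
    return sorted(corpus, key=lambda l: prominence(l, scoped=scoped), reverse=True)[:k]
```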
Learning Corpus
Pre-populated learning store with:
- 50 institutional learnings (org-wide conventions)
- 100 task learnings across 5 workgroups (20 per workgroup)
- 30 proxy learnings (human preferences)
Learnings are seeded from actual TeaParty POC sessions to ensure realistic content and quality distribution.
Task Selection
20 tasks across 3 workgroups. Each task is chosen to have at least 3 relevant learnings in the corpus (verified by human pre-labeling) and at least 10 irrelevant learnings that a naive retriever might surface.
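A minimal eligibility check under these assumptions: `relevant_ids` and `distractor_ids` map each task to the learning IDs flagged by the human pre-labeling pass; both names are placeholders, not the harness's actual schema.

```python
def task_is_eligible(task_id: str,
                     relevant_ids: dict[str, set[str]],
                     distractor_ids: dict[str, set[str]],
                     min_relevant: int = 3,
                     min_distractors: int = 10) -> bool:
    """A task qualifies if the corpus holds enough relevant learnings for it
    and enough plausible-but-irrelevant distractors a naive retriever might surface."""
    return (len(relevant_ids.get(task_id, set())) >= min_relevant
            and len(distractor_ids.get(task_id, set())) >= min_distractors)
```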
Procedure
- For each task, run retrieval under both scoped and flat conditions
- Human judges rate each retrieved learning as: relevant, marginally relevant, or irrelevant
- For the agent-performance comparison: run tasks end-to-end with scoped, flat, and no-retrieval conditions
- Human judges rate output quality and identify errors attributable to misleading context (a harness sketch follows this list)
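A sketch of the harness loop. Here `retriever` is the scoring function sketched under Conditions, and `run_task` and `judge` stand in for the agent runner and the human-rating step; none of these are the harness's actual interfaces.

```python
def run_experiment(tasks: list[dict], corpus: list, retriever, run_task, judge,
                   k: int = 10) -> list[dict]:
    """Run every task under the scoped, flat, and no-retrieval conditions and
    collect the ratings the metrics below are computed from."""
    records = []
    for task in tasks:
        for condition in ("scoped", "flat", "none"):   # treatment, control, baseline
            if condition == "none":
                retrieved = []
            else:
                retrieved = retriever(corpus, k, scoped=(condition == "scoped"))
            output = run_task(task, retrieved)          # end-to-end agent run
            ratings = judge(task, retrieved, output)    # relevance and quality ratings
            records.append({"task": task, "condition": condition,
                            "retrieved": retrieved, "ratings": ratings})
    return records
```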
Measurements
| Metric | Description |
|---|---|
| Precision@5 | Fraction of top-5 retrieved learnings rated relevant |
| Precision@10 | Fraction of top-10 retrieved learnings rated relevant |
| Recall@10 | Fraction of all relevant learnings appearing in top-10 |
| Scope correlation | Spearman correlation between scope proximity and human relevance rating |
| Noise-induced errors | Errors in agent output attributable to irrelevant retrieved context |
| Task quality | 1-5 composite quality score |
| Context efficiency | Relevant tokens / total retrieved tokens |
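A sketch of the first three metrics, assuming each retrieved learning has been reduced to a boolean relevance label (whether "marginally relevant" counts as relevant is a rubric decision, not something this sketch settles).

```python
def precision_at_k(relevance: list[bool], k: int) -> float:
    """Fraction of the top-k retrieved learnings rated relevant."""
    return sum(relevance[:k]) / k

def recall_at_k(relevance: list[bool], total_relevant: int, k: int) -> float:
    """Fraction of all relevant learnings for the task that appear in the top-k."""
    return sum(relevance[:k]) / total_relevant if total_relevant else 0.0

def context_efficiency(token_counts: list[int], relevance: list[bool]) -> float:
    """Relevant tokens divided by total retrieved tokens."""
    total_tokens = sum(token_counts)
    relevant_tokens = sum(t for t, r in zip(token_counts, relevance) if r)
    return relevant_tokens / total_tokens if total_tokens else 0.0
```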
Analysis Plan
- Precision/recall comparison between scoped and flat retrieval, paired by task (see the sketch after this list)
- Correlation analysis: does scope proximity predict human-judged relevance?
- Error analysis: categorize agent errors as context-induced vs. independent
- Ablation: compare no-retrieval baseline to quantify the value of learning retrieval at all
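A sketch of the paired comparison and the correlation analysis using scipy. The plan only specifies that the comparison is paired by task; the Wilcoxon signed-rank test and the integer codings below are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import spearmanr, wilcoxon

def compare_paired(scoped_scores: list[float], flat_scores: list[float]) -> dict:
    """Paired-by-task comparison of a per-task metric (e.g. precision@5)."""
    diffs = np.array(scoped_scores) - np.array(flat_scores)
    stat, p = wilcoxon(scoped_scores, flat_scores)
    return {"mean_diff": float(diffs.mean()), "statistic": float(stat), "p_value": float(p)}

def scope_relevance_correlation(scope_proximity: list[int], relevance: list[int]) -> dict:
    """Spearman correlation between scope proximity (e.g. team=2, project=1, global=0)
    and the human relevance rating (irrelevant=0, marginal=1, relevant=2)."""
    rho, p = spearmanr(scope_proximity, relevance)
    return {"spearman_rho": float(rho), "p_value": float(p)}
```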
Results
Experiment not yet run.
Expected Findings
- Precision@5: Scoped retrieval expected to achieve 0.7-0.8 vs. flat at 0.4-0.5. Team-level learnings are disproportionately relevant for team-level tasks.
- Recall@10: Comparable between conditions — relevant global learnings still surface under scoped retrieval, just ranked lower.
- Scope correlation: Moderate positive correlation (r = 0.3-0.5) between scope proximity and relevance. Not perfect — some global learnings are highly relevant to specific teams, and some team learnings generalize poorly.
- No retrieval baseline: Measurably worse quality on tasks where relevant learnings exist, establishing that retrieval adds value.
- Noise-induced errors: Flat retrieval expected to produce 2-3x more noise-induced errors than scoped.
Threats to Validity
- Corpus quality. Seeded learnings may not represent natural learning accumulation. Real learnings may be noisier, more redundant, or differently distributed.
- Scope structure. The experiment assumes a specific org hierarchy. Different structures may change the optimal scope multipliers.
- Cold start. With only 180 total learnings, retrieval may not be challenging enough to differentiate conditions. Larger corpus needed for production-like evaluation.
- Human labeling. Relevance is subjective. Mitigation: measure inter-rater reliability on a double-labeled subset (sketched below) and apply a clear labeling rubric.
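One way to run the labeling check, assuming a double-labeled subset and treating the three-level rubric as ordinal; quadratic-weighted Cohen's kappa via scikit-learn is a choice of this sketch, not part of the design above.

```python
from sklearn.metrics import cohen_kappa_score

# Rubric levels in ordinal order, lowest to highest.
LABELS = ["irrelevant", "marginally relevant", "relevant"]

def inter_rater_agreement(rater_a: list[str], rater_b: list[str]) -> float:
    """Quadratic-weighted Cohen's kappa on the double-labeled subset."""
    return cohen_kappa_score(rater_a, rater_b, labels=LABELS, weights="quadratic")
```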