Memory and Learning in AI Agent Systems

The Structural Gap

A large language model, in its native form, has no memory. Each conversation begins without knowledge of previous conversations. Each session is a fresh instantiation of the same frozen weights, capable of reasoning fluently about anything that happened before training but blind to anything that has happened since. For a single-turn question-answering system, this is an acceptable constraint. For an agent that is supposed to learn from its mistakes, adapt to a particular user, and accumulate organizational knowledge across months of operation, it is a structural gap in the architecture.

The gap matters because it is not merely a limitation to work around; it is a capability to build. The research literature on agent memory is, in one sense, a sustained effort to answer a deceptively simple question: how do you make an agent that gets better over time? That question has two components. The first is representation: what form should accumulated experience take? The second is retrieval: given what the agent knows, how does it find the right knowledge at the right moment? Both components are harder than they appear, and decades of work in cognitive science, followed by several years of intensive work on LLM-based agents, have produced a rich body of findings that now shapes how systems like TeaParty think about memory.

Classical Cognitive Architectures

The problem of memory in intelligent agents was not invented with large language models. Cognitive scientists and AI researchers were building memory systems for symbolic agents long before neural networks dominated the field, and their solutions repay study.

ACT-R (Adaptive Control of Thought–Rational), developed by John Anderson and Christian Lebiere and described in their 1998 book The Atomic Components of Thought, is the most influential framework for declarative memory in cognitive science. The core insight is that retrieval probability should reflect environmental statistics. Chunks of declarative knowledge are assigned a base-level activation that rises each time the chunk is used and decays between uses according to a power law, the same power law that describes forgetting in human memory. The formula is elegant: activation is the log of a sum of power-law decay terms, one for each past use. A chunk that was used frequently in the recent past has high activation and retrieves easily. A chunk that was used once, long ago, retrieves poorly. This is not an arbitrary design choice but a model calibrated to match the regularities of how frequently things appear in human environments: if something was useful recently, it is likely to be useful again soon.
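Concretely, the base-level learning equation is B = ln(sum over j of (t - t_j)^(-d)), where the t_j are the past times the chunk was used and d is the decay exponent, conventionally 0.5. A minimal sketch in Python; the function and parameter names are illustrative, not ACT-R's:

```python
import math

def base_level_activation(use_times, now, decay=0.5):
    """ACT-R base-level activation: the log of a sum of power-law
    decay terms, one for each past use of a chunk.

    use_times: timestamps (in seconds) at which the chunk was used.
    decay: the decay exponent d; 0.5 is the conventional ACT-R default.
    """
    terms = [(now - t) ** -decay for t in use_times if now > t]
    return math.log(sum(terms)) if terms else float("-inf")

# A chunk used three times recently outranks one used once, long ago.
recent = base_level_activation([90.0, 95.0, 99.0], now=100.0)
stale = base_level_activation([1.0], now=100.0)
assert recent > stale
```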

The implications go beyond mimicking human cognition. ACT-R's activation mechanism means the memory system is adaptive without explicit programming. It gradually surfaces the knowledge that has historically been relevant and lets unused knowledge fade. The system does not need a curator deciding what to remember; relevance emerges from the history of use.

Soar, described comprehensively by John Laird in his 2012 book The Soar Cognitive Architecture, takes a different architectural approach but arrives at a complementary insight: different kinds of knowledge require different memory stores. Soar distinguishes four: working memory (current goals and active context), procedural memory (production rules encoding how to act), semantic memory (facts about the world), and episodic memory (records of past experiences). Learning in Soar happens through chunking — when an agent reaches an impasse in reasoning, it resolves the impasse through problem-solving, then compiles the solution into a new production rule so the same impasse does not recur. This is impasse-driven learning: the agent only generalizes from experience when it fails, not continuously.

Together, ACT-R and Soar establish two principles that the modern LLM memory literature has largely rediscovered. First, the retrieval mechanism matters more than the storage mechanism. ACT-R's activation-based retrieval is more important than the particular format in which chunks are stored. Soar's chunking is more important than the size of its production library. Second, multiple memory types serve different purposes, and collapsing them into a single store loses something real. These are not historical curiosities; they are load-bearing insights.

The LLM-Agent Memory Literature

The 2023-2025 literature on LLM-agent memory represents one of the fastest-moving areas in applied AI research. Several papers have become landmarks.

Park et al.'s Generative Agents (2023), published at UIST (the ACM Symposium on User Interface Software and Technology), demonstrated the first fully realized LLM-agent memory system in a running environment. The system populated a simulated town with twenty-five agents, each maintaining a memory stream: a log of observations in natural language. Retrieval from this stream used three signals combined with equal weight: recency (exponential decay since the memory was last accessed), relevance (cosine similarity between the memory's embedding and the current query), and importance (an LLM-assigned score for how significant the memory was when it occurred). The weighted combination determined which memories surfaced to influence current behavior.
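The scoring rule is a weighted sum of the three normalized signals. A sketch under those assumptions; the Memory fields, the per-hour decay factor, and the unit weights are illustrative stand-ins for the paper's implementation:

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    embedding: list[float]
    last_accessed: float  # unix timestamp
    importance: float     # LLM-assigned significance, on a 1-10 scale

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieval_score(m, query_embedding, now, decay=0.995, w=(1.0, 1.0, 1.0)):
    """Weighted sum of recency, relevance, and importance, each in [0, 1]."""
    hours = (now - m.last_accessed) / 3600.0
    recency = decay ** hours                      # exponential decay per hour
    relevance = cosine(m.embedding, query_embedding)
    importance = m.importance / 10.0              # normalize the 1-10 score
    return w[0] * recency + w[1] * relevance + w[2] * importance
```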

The critical ingredient, however, was not the retrieval mechanism but the reflection mechanism. Periodically, agents asked themselves what they had learned from their recent observations, generating higher-level abstractions: "I seem to do better when I prepare in advance" rather than "I prepared at 9am on Tuesday." These reflections were themselves stored in the memory stream and retrieved like any other memory. Without reflection, agents accumulated observations but did not learn from them. With reflection, they exhibited behavior that human evaluators, in a structured study, rated as more believably human. This finding, that the generalization step is the critical ingredient rather than the accumulation step, is one of the most important in the literature.

Reflexion (Shinn et al., 2023), published at NeurIPS, pursued a similar idea through a different mechanism. Rather than periodically synthesizing memories, Reflexion agents generated verbal self-critiques after each failed attempt at a task, storing these critiques in an episodic memory buffer that informed subsequent attempts. The result was substantial performance improvement on coding, sequential decision-making, and reasoning tasks, without any weight updates. The system achieved 91% pass@1 on HumanEval, surpassing GPT-4's then-state-of-the-art 80%, purely through accumulated verbal reflection. What Reflexion demonstrated is that the gap between "an LLM that has failed once" and "an LLM that has learned from failure" can be closed with nothing more than well-organized text.
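The control flow is a simple trial loop. In this sketch, actor, evaluator, and reflector stand in for LLM calls; the names and the buffer-in-prompt mechanism are illustrative of Reflexion's design rather than its exact implementation:

```python
def reflexion_loop(task, actor, evaluator, reflector, max_trials=5):
    """Attempt a task repeatedly; after each failure, store a verbal
    self-critique that conditions the next attempt."""
    reflections = []  # episodic memory buffer of natural-language critiques
    for _ in range(max_trials):
        trajectory = actor(task, reflections)           # act, given past lessons
        success, feedback = evaluator(task, trajectory)
        if success:
            return trajectory
        # Convert the failure signal into a lesson for the next trial.
        reflections.append(reflector(task, trajectory, feedback))
    return None  # exhausted trials without success
```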

CLIN (Majumder et al., 2023/2024) pushed further on the question of what form accumulated experience should take. Where Reflexion stores reflections as narrative text, CLIN stores causal abstractions: "When X, doing Y leads to Z." This structured form turned out to transfer across tasks in ways that unstructured narrative reflection did not. Evaluated on the ScienceWorld benchmark, CLIN outperformed Reflexion by 23 absolute points and demonstrated genuine transfer to new environments. The distinction between narrative and causal representation is consequential: the agent that knows "in environments with locked doors, having a key increases success rate" generalizes better than the agent that knows "I failed to open the door on Tuesday." CLIN also introduced meta-memory — a summary of the best memories across prior episodes — as a mechanism for abstracting from episodic to semantic knowledge, echoing Soar's distinction between those two memory types.
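The difference in representation is easy to see side by side. A CLIN-style entry might look like the sketch below; the field names are assumptions, though the polarity field echoes CLIN's distinction between relations such as "may be necessary" and "does not contribute":

```python
from dataclasses import dataclass

@dataclass
class CausalInsight:
    """One causal abstraction: when `condition`, `action` leads to `outcome`."""
    condition: str
    action: str
    outcome: str
    polarity: str  # e.g. "may be necessary" or "does not contribute"

# Narrative memory: "I failed to open the door on Tuesday."
# Causal memory, reusable in any environment with locked doors:
insight = CausalInsight(
    condition="the environment contains a locked door",
    action="acquiring the key before approaching",
    outcome="the door opens",
    polarity="may be necessary",
)
```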

ExpeL (Zhao et al., 2024), published at AAAI, added a dimension that the prior work mostly neglected: contrastive learning from both successes and failures. Rather than reflecting only on errors, ExpeL agents reviewed paired examples of similar situations with different outcomes, extracting insights about what distinguished the successful case. The intuition is straightforward — knowing why something worked is as informative as knowing why something did not — but the implementation required careful design of the experience collection and insight extraction pipeline. Empirically, the contrastive approach outperformed reflection-from-failure alone, suggesting that positive examples carry information that failure analysis cannot recover.
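The pairing step can be sketched as follows, assuming trajectories are records with task and success fields (an assumption about format, not ExpeL's actual pipeline): group runs by task, then cross successes with failures to produce the contrastive pairs from which insights are extracted.

```python
from collections import defaultdict
from itertools import product

def contrastive_pairs(trajectories):
    """Pair successful and failed runs of the same task for comparison."""
    by_task = defaultdict(list)
    for t in trajectories:
        by_task[t["task"]].append(t)
    pairs = []
    for runs in by_task.values():
        wins = [r for r in runs if r["success"]]
        losses = [r for r in runs if not r["success"]]
        pairs.extend(product(wins, losses))  # every (success, failure) pairing
    return pairs
```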

Voyager (Wang et al., 2023) addressed a different aspect of the representation problem. If declarative memory stores facts and episodic memory stores experiences, what stores skills? Voyager's answer was executable code. Deployed in Minecraft as a lifelong learning agent, Voyager maintained a skill library of JavaScript functions — each one a verified, reusable behavior that the agent had learned to execute successfully. Skills were retrieved by embedding similarity between the current task and the skill description. This is procedural memory as Soar defines it, but implemented in a form that is both interpretable and compositional: complex skills can invoke simpler ones. The result was an agent that accumulated capabilities rather than just knowledge, obtaining 3.3 times more unique items and unlocking technology milestones 15.3 times faster than prior approaches. The skill library did not just help Voyager perform better in the current world — it transferred to a new Minecraft world for novel tasks, demonstrating genuine cross-session learning.
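Retrieval over such a library reduces to nearest-neighbor search on skill descriptions. A sketch assuming a generic embed function; Voyager itself indexes skill embeddings in a vector store rather than re-embedding on every query as done here for brevity:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_skills(task_description, skill_library, embed, top_k=3):
    """Rank skills by similarity between the current task and each
    skill's natural-language description; return the top_k entries."""
    query = embed(task_description)
    return sorted(skill_library,
                  key=lambda s: cosine(query, embed(s["description"])),
                  reverse=True)[:top_k]
```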

MemGPT (Packer et al., 2023), from UC Berkeley and later productized as Letta, approached the problem from a different angle: rather than building a specific memory architecture, it gave agents tools to manage their own memory. Drawing an analogy to operating system memory hierarchies, MemGPT treats the context window as analogous to RAM, fast but small, and external storage as analogous to disk. The agent uses function calls to move information between tiers: searching external memory, loading relevant content into context, and writing new information back to storage. The insight is that memory management need not be automated. If agents are capable of reasoning about what they need to remember, they can manage the process themselves. This positions memory as a first-class agentic capability rather than a passive infrastructure concern.
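The resulting tool surface looks roughly like the list below. The function names mirror those described in the MemGPT paper (search and insert over archival storage, edits to an always-in-context core block), but the schemas shown are simplified assumptions:

```python
# Tools a MemGPT-style agent calls to manage its own memory tiers.
MEMORY_TOOLS = [
    {
        "name": "archival_memory_search",
        "description": "Search external (disk-tier) memory for relevant passages.",
        "parameters": {"query": "string", "top_k": "integer"},
    },
    {
        "name": "archival_memory_insert",
        "description": "Write a new fact or summary out to external memory.",
        "parameters": {"content": "string"},
    },
    {
        "name": "core_memory_replace",
        "description": "Edit the always-in-context (RAM-tier) memory block.",
        "parameters": {"section": "string", "old": "string", "new": "string"},
    },
]
```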

The Retrieval Problem

Storage is easy. The hard part is retrieval: finding the right knowledge at the right moment, from a potentially large and heterogeneous collection, without knowing in advance what will be relevant.

Every memory system in the literature is ultimately a bet on what retrieval signals matter. ACT-R bets on frequency-weighted recency. Generative Agents bets on a three-way combination of recency, embedding similarity, and importance. Voyager bets on embedding similarity between task description and skill description. CLIN bets on structured causal matching. These are not interchangeable choices — they make different trade-offs between precision (finding exactly the right memory) and recall (not missing relevant memories), and between computational cost and retrieval quality.

The emerging consensus is that no single retrieval signal is sufficient. Embedding similarity alone retrieves semantically related content but misses temporally relevant content and cannot account for the significance of an experience when it occurred. Recency alone would make an agent forgetful of anything it learned more than a few interactions ago. Importance scoring alone would surface only the most memorable events regardless of their current relevance. The combination that Generative Agents introduced — normalize each signal to [0, 1] and take a weighted sum — is simple but surprisingly effective, and several subsequent systems have adopted variants of it.

Recent work has also confirmed a counter-intuitive finding: selective forgetting improves retrieval. FadeMem (2025) implements biologically inspired decay across a dual-layer memory hierarchy, with retention governed by adaptive exponential decay functions modulated by semantic relevance, access frequency, and temporal patterns. The system retains 82% of critical facts while using only 55% of the storage required by full retention, and outperforms both MemGPT and Mem0 on long-context reasoning benchmarks. The intuition aligns with what ACT-R established decades earlier: a memory system that lets unused knowledge fade does not lose important information; it clears away noise that would otherwise compete with relevant signal during retrieval.
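The mechanism can be sketched as retention curves whose half-life stretches with relevance and use. The specific modulation below is an illustrative assumption, not FadeMem's published equations:

```python
import math

def retention(memory, now, base_half_life=3600.0):
    """Exponential decay whose half-life is stretched by semantic
    relevance and access frequency: stale, rarely used, low-relevance
    memories fade fastest."""
    age = now - memory["last_access"]
    half_life = (base_half_life
                 * (1.0 + memory["relevance"])               # relevance in [0, 1]
                 * (1.0 + math.log1p(memory["access_count"])))
    return 0.5 ** (age / half_life)

def prune(store, now, threshold=0.05):
    """Drop memories whose retention has decayed below the threshold."""
    return [m for m in store if retention(m, now) >= threshold]
```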

Mem0 (2025), published on arXiv, addresses retrieval at production scale through a structured extraction and update pipeline. An enhanced variant, Mem0g, adds a graph-based representation that captures relational structure among entities, but the base system uses vector-based extraction without graph structure. Both variants achieve 91% lower p95 latency and over 90% lower token cost compared to full-context approaches, which matters for deployed systems where every context-window insertion has a price. AutoRefine (2025) takes a different approach, extracting dual-form experience patterns from agent execution histories: procedural subtasks become specialized subagents with independent memory, while static knowledge becomes skill patterns as guidelines or code. A maintenance mechanism scores, prunes, and merges patterns to prevent repository degradation as experience accumulates — addressing a failure mode that simpler systems do not anticipate.

The Multi-Agent Problem

The research described above is almost entirely focused on single-agent memory. A single agent, operating across multiple sessions, accumulates experience that improves its future performance. The retrieval problem is hard enough in this setting. In a multi-agent system, the problem compounds: agents do not just need to retrieve what they themselves have learned — they need access to what their teammates have learned, under conditions where different agents may have different roles, different trust levels, and different information needs.

The dominant solution in deployed multi-agent systems has been to ignore the problem: each agent maintains its own memory, and sharing happens only through the conversation itself. This is analogous to a team of human experts who communicate only in meetings and take no notes — it works, but it does not scale.

Stigmergy offers one conceptual alternative. In biological systems, stigmergy refers to indirect coordination through shared environmental traces: ants lay pheromones that other ants follow, without any direct communication. The analog in agent systems is leaving traces in shared artifacts — work products, logs, structured summaries — that other agents can read. This happens informally in any system with shared state, but formalizing it as a coordination mechanism suggests design principles: make traces persistent, make them searchable, and design them for the reader rather than the writer.

Rezazadeh et al.'s Collaborative Memory (2025) is one of the few systems to tackle shared memory directly. It defines private and shared memory tiers, with asymmetric access controls encoded as two bipartite graphs: one linking users to agents and one linking agents to resources. Memory fragments carry provenance attributes — contributing agents, accessed resources, timestamps — supporting retrospective permission checks. The key finding is that memory sharing with even 50% overlap reduces resource usage by up to 61%, demonstrating that the infrastructure cost of shared memory can be substantially lower than maintaining independent stores. The system establishes technical feasibility for dynamic access control in multi-agent memory but leaves open questions about large-scale concurrency and real-world multi-user benchmarks.
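The access rule itself is simple once the two graphs exist. A sketch with hypothetical users, agents, and resources; the fragment fields echo the provenance attributes described above:

```python
# Bipartite edges: which users may invoke which agents, and which
# agents may touch which resources.
user_agent = {("alice", "coder"), ("alice", "reviewer"), ("bob", "coder")}
agent_resource = {("coder", "repo"), ("reviewer", "repo"), ("reviewer", "tickets")}

def can_read(user, agent, fragment):
    """A fragment is readable only if the user may invoke the agent and
    the agent may access every resource in the fragment's provenance."""
    if (user, agent) not in user_agent:
        return False
    return all((agent, r) in agent_resource for r in fragment["resources"])

fragment = {"resources": {"repo"}, "contributors": {"coder"}, "timestamp": 1735689600}
assert can_read("alice", "coder", fragment)
assert not can_read("bob", "reviewer", fragment)  # bob cannot invoke reviewer
```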

Memory as a Service (MaaS, 2025) proposes a more radical architectural shift: rather than treating memory as local state attached to specific agents, decouple memory entirely and expose it as a callable, composable service. This converts the multi-agent memory problem from a synchronization problem into a service design problem — agents do not share a memory store, they call a shared memory service with well-defined access semantics. Whether this abstraction survives contact with the messy reality of deployed multi-agent systems remains to be demonstrated, but the framing clarifies what the goal is.

Where TeaParty Sits

Most agent memory systems in the literature are single-agent and single-session. TeaParty operates across sessions, across agents, and across hierarchy levels, a combination that none of the systems above fully addresses.

TeaParty's response to this gap has two components. The first is the proxy memory system, which applies ACT-R-style activation-based retrieval to the human proxy agent. The proxy maintains a persistent store of what it has learned about a specific human participant — their preferences, their communication style, the decisions they have made and why — and retrieves this knowledge using an activation function that weights recency and frequency of use. The goal is that the proxy's model of the human becomes more accurate over time, not through fine-tuning but through structured accumulation and retrieval of episodic knowledge. This is the single-agent memory problem, applied to the specific case of modeling a person.

The second component is the hierarchical learning system, which addresses organizational knowledge. Learnings extracted from completed sessions are scoped to the level of the hierarchy that generated them — a session-level insight stays at the session level unless a team lead promotes it to the team level, and team-level insights stay at the team level unless they are promoted to the organizational level. This promotion chain is the mechanism by which validated knowledge moves from particular to general. It mirrors the meta-memory mechanism in CLIN and the reflection-to-reflection synthesis in Generative Agents, but applied across an organizational hierarchy rather than within a single agent.
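The promotion chain can be stated compactly. The scope names match the description above; the Learning fields and approval roles are illustrative assumptions, not TeaParty's actual schema:

```python
from dataclasses import dataclass

SCOPES = ["session", "team", "organization"]
PROMOTER = {"session": "team_lead", "team": "org_owner"}  # who promotes from each level

@dataclass
class Learning:
    text: str
    scope: str = "session"

def promote(learning, approver_role):
    """Move a learning one level up the hierarchy, gated on an approver
    at the level above it; organization-level learnings cannot rise further."""
    i = SCOPES.index(learning.scope)
    if i < len(SCOPES) - 1 and approver_role == PROMOTER[learning.scope]:
        learning.scope = SCOPES[i + 1]
    return learning

lesson = promote(Learning("Prefer small, reviewable diffs."), "team_lead")
assert lesson.scope == "team"
```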

The gap between these two components — an ACT-R-based proxy memory for one agent and a hierarchical promotion chain for organizational knowledge — is where the most interesting design work lies. Extending activation-based retrieval to all agents in the system, not just the proxy, would require each agent to maintain its own activation-weighted history of what knowledge it has retrieved and when. This is the cognitive architecture proposal that the current system points toward but has not yet implemented. The literature reviewed here suggests it is the right direction: the research consistently shows that frequency-weighted recency, combined with relevance and importance, outperforms simpler retrieval strategies, and that the retrieval mechanism is where the real value of a memory system is created or lost.

The field is converging on a set of principles that TeaParty's architecture largely embodies: multiple memory types with distinct retrieval mechanisms, reflection as the mechanism that converts experience into generalizable knowledge, selective forgetting as a feature rather than a bug, and organizational scoping as the answer to the multi-agent sharing problem. What remains open — both in the literature and in TeaParty — is how to extend these principles across full team-scale deployments where the social dynamics of memory access, trust, and knowledge attribution are as important as the technical retrieval mechanisms.


References

Anderson, J. R., & Lebiere, C. (1998). The Atomic Components of Thought. Lawrence Erlbaum Associates.

Laird, J. E. (2012). The Soar Cognitive Architecture. MIT Press.

Majumder, B. P., Dalvi, B., Clark, P., Sabharwal, A., & Mishra, B. D. (2023). CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization. arXiv:2310.10134.

Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560.

Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST).

Rezazadeh, A., Li, J., et al. (2025). Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control. arXiv:2505.18279.

Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS 2023).

Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291.

Zhao, A., Huang, D., Xu, Q., Lin, M., Liu, Y.-J., & Huang, G. (2024). ExpeL: LLM Agents Are Experiential Learners. Proceedings of the AAAI Conference on Artificial Intelligence.

AutoRefine: From Trajectories to Reusable Expertise for Continual LLM Agent Refinement. (2025). arXiv:2601.22758.

FadeMem: Biologically-Inspired Forgetting for Efficient Agent Memory. (2025). arXiv:2601.18642.

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. (2025). arXiv:2504.19413.

Memory as a Service (MaaS): Rethinking Contextual Memory as Service-Oriented Modules for Collaborative Agents. (2025). arXiv:2506.22815.