TL;DR
A January 2026 paper from Alibaba and Wuhan University researchers treats memory as something the agent does rather than something bolted on around it. Their system, AgeMem, exposes six memory operations (add, update, delete for long-term; retrieve, summary, filter for short-term) as tools the LLM can call, then trains the calling policy with three-stage reinforcement learning. On five long-horizon benchmarks, it beats Mem0 and A-Mem by 4.82 to 8.57 percentage points. The more interesting story sits underneath that score: a shift from frozen memory pipelines toward learned memory behavior.
The memory problem everyone is trying to solve
LLM agents have a short-term problem and a long-term problem. The short-term problem: context windows are finite, so a long conversation eventually pushes early information past the edge. The long-term problem: nothing the agent learns inside a session persists to the next one. Every fresh session starts from a blank page.
The usual patch is retrieval-augmented memory. Stash past turns in a vector database, do a similarity search when you need them, stuff the top-k chunks back into the prompt. Systems like Mem0 and A-Mem (two of the baselines in this paper) do this well. But the retrieval logic and the write logic live outside the model; a separate script decides what gets stored, when it’s updated, and what similarity threshold triggers a pull. The model is a passenger.
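The frozen-pipeline pattern can be sketched in a few lines. This toy uses bag-of-words cosine in place of a real embedding model, and the class and method names are mine for illustration, not Mem0's or A-Mem's API:

```python
from collections import Counter
import math

def _vec(text):
    # Toy embedding: bag-of-words counts (real systems use dense vectors).
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FrozenMemory:
    """Unconditional writes, similarity-only reads. The model never
    decides what to keep -- the split the paper argues against."""
    def __init__(self, top_k=2):
        self.entries = []
        self.top_k = top_k

    def write(self, text):
        self.entries.append((text, _vec(text)))  # store every turn, no policy

    def retrieve(self, query):
        qv = _vec(query)
        ranked = sorted(self.entries, key=lambda e: _cosine(qv, e[1]), reverse=True)
        return [text for text, _ in ranked[: self.top_k]]

mem = FrozenMemory()
mem.write("user prefers metric units")
mem.write("user lives in Lisbon")
mem.write("meeting moved to Thursday")
print(mem.retrieve("what units does the user like"))
# → ['user prefers metric units', 'user lives in Lisbon']
```

Note that nothing in `write` or `retrieve` is trainable; every decision is a fixed heuristic, which is exactly the property the rest of this piece is about.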
That split is where things break in practice. A frozen retrieval pipeline can’t learn from its mistakes. If the agent fails a task because the retriever pulled the wrong fact, the retriever doesn’t get updated; the agent just eats the wrong answer. And because the agent has no say in what got stored, it can’t reason about its own memory hygiene.
AgeMem (short for Agentic Memory) takes the opposite position: give the model tool calls for memory, train the policy end-to-end, and let the agent decide when to write, when to read, and when to forget.
What the paper actually did
The authors (Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu) define memory as a set of tools alongside the agent’s normal action space. Six of them:
| Memory type | Operation | What it does |
|---|---|---|
| Long-term | Add | Store a new entry in persistent memory |
| Long-term | Update | Overwrite an existing entry |
| Long-term | Delete | Remove an entry that’s no longer useful |
| Short-term | Retrieve | Pull relevant snippets into active context |
| Short-term | Summary | Compress a block of context into a summary |
| Short-term | Filter | Drop irrelevant content from active context |
The agent sees these the same way it sees any other tool, like search_web or run_python. At each step it can call a memory op, a task action, or both, and the next step sees the result. The policy itself handles what used to be a separate write pipeline and retriever script.
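As a concrete sketch of what that action space could look like, here are the six operations expressed as tool schemas in the common function-calling style, plus a dispatcher. The paper does not publish its exact interface, so every name and field here is an assumption:

```python
# Hypothetical tool schemas; names and parameters are illustrative,
# not the paper's published interface.
MEMORY_TOOLS = [
    {"name": "ltm_add",      "description": "Store a fact in long-term memory",
     "parameters": {"content": "string"}},
    {"name": "ltm_update",   "description": "Overwrite an existing entry",
     "parameters": {"entry_id": "string", "content": "string"}},
    {"name": "ltm_delete",   "description": "Remove an entry that is no longer useful",
     "parameters": {"entry_id": "string"}},
    {"name": "stm_retrieve", "description": "Pull relevant snippets into active context",
     "parameters": {"query": "string"}},
    {"name": "stm_summary",  "description": "Compress a context block into a summary",
     "parameters": {"span": "string"}},
    {"name": "stm_filter",   "description": "Drop irrelevant content from active context",
     "parameters": {"span": "string"}},
]

def dispatch(call, memory_backend):
    """Route a model-emitted tool call to the memory backend, exactly as
    a search_web or run_python call would be routed."""
    handler = getattr(memory_backend, call["name"])
    return handler(**call["arguments"])
```

The point of the sketch: once memory ops sit in the same tool list as everything else, the model's next-token distribution over tool names *is* the memory policy, and anything that trains tool choice trains memory behavior for free.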
The hard part is training that policy. Memory rewards are sparse: you can write 50 entries across a long task and only find out at the very end whether any of them paid off. Dense task signal gets smothered by the noise of memory operations. Ordinary RL setups stall.
The three-stage training recipe
The paper breaks training into three sequential stages, each with a narrower objective than the last:
Long-term memory construction. The agent is dropped into casual conversational contexts and rewarded for storing salient facts into long-term memory. This stage is all about learning what is worth remembering. Task-solving is out of scope here; the agent is just learning to be a good librarian.
Short-term control under distractors. The long-term memory from stage 1 is preserved, the short-term context is reset, and the agent is fed semantically close but irrelevant distractors. It has to learn when to trust long-term memory over noisy short-term context (the retrieve / summary / filter trio).
Integrated reasoning. Now the agent gets real queries that require both memories working together. Rewards flow from final task success, and the model has to coordinate storage, retrieval, and reasoning simultaneously.
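The curriculum reduces to a plain staged training driver. Everything below (function names, the group size of 8, the reward hooks) is an assumption for illustration, not the paper's code; the stage order and objectives follow the paper's description:

```python
# Three-stage curriculum as a training driver (illustrative sketch).
STAGES = [
    ("ltm_construction", "reward storing salient facts; no task solving"),
    ("stm_control",      "keep stage-1 LTM, reset STM, inject distractors"),
    ("integrated",       "reward final task success end-to-end"),
]

def train(policy, env_for_stage, grpo_update, epochs_per_stage=3):
    for stage, objective in STAGES:
        env = env_for_stage(stage)  # stage-specific data and reward
        for _ in range(epochs_per_stage):
            # Sample a group of rollouts, as GRPO requires for its
            # group-relative baseline.
            rollouts = [env.rollout(policy) for _ in range(8)]
            grpo_update(policy, rollouts)
    return policy
```

The design choice worth noticing is that later stages inherit the earlier stages' memory state rather than starting clean, so the agent is trained against memories its own earlier policy wrote.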
Each stage uses a step-wise variant of GRPO (Group Relative Policy Optimization). GRPO normally rewards an entire trajectory, which is brutal for memory where useful operations happen mid-trajectory. The step-wise version gives credit to individual memory actions based on their downstream effect, which the authors argue is what makes the sparse-reward problem tractable.
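To make the contrast concrete, here is a toy comparison of trajectory-level GRPO advantages against one plausible reading of the step-wise variant, normalizing within the group at each step index. The paper's exact credit-assignment formula is not reproduced here, so treat this as a sketch of the idea rather than the method:

```python
import statistics

def trajectory_grpo_advantages(rewards):
    # Vanilla GRPO: one normalized advantage per trajectory, shared by
    # every action in it -- a mid-trajectory memory write gets the same
    # credit as everything else in that rollout.
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sd for r in rewards]

def stepwise_memory_credit(step_rewards_per_traj):
    """Step-wise sketch: each step carries its own reward (e.g. the
    downstream return attributed to that memory action), normalized
    across the group at the same step index."""
    n_steps = len(step_rewards_per_traj[0])
    advantages = [[0.0] * n_steps for _ in step_rewards_per_traj]
    for t in range(n_steps):
        col = [traj[t] for traj in step_rewards_per_traj]
        mu = statistics.mean(col)
        sd = statistics.pstdev(col) or 1.0
        for i, traj in enumerate(step_rewards_per_traj):
            advantages[i][t] = (traj[t] - mu) / sd
    return advantages

# Trajectory-level: winners and losers only.
print(trajectory_grpo_advantages([1.0, 0.0, 1.0, 0.0]))
# → [1.0, -1.0, 1.0, -1.0]
```

Under the trajectory version, a good memory write inside a failed rollout is punished; the step-wise version can still reward it if its own step reward beat the group's at that position.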
The numbers
The team ran AgeMem against Mem0 and A-Mem on two Qwen backbones, Qwen2.5-7B-Instruct and Qwen3-4B-Instruct, across five long-horizon benchmarks: ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA. The smaller models are a deliberate choice. Memory management is most painful at the low end, where context windows are cramped and you can’t brute-force the problem by stuffing everything into the prompt.
| Model | Method | ALFWorld | SciWorld | PDDL | BabyAI | HotpotQA | Average |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B | AgeMem | 41.07 | 35.55 | 17.31 | 61.42 | 54.44 | 41.96 |
| Qwen2.5-7B | Mem0 | 37.49 | 26.99 | 13.96 | 60.58 | 46.66 | 37.14 |
| Qwen2.5-7B | A-Mem | 34.68 | 28.06 | 18.39 | 58.82 | 43.95 | 36.78 |
| Qwen3-4B | AgeMem | 48.97 | 59.48 | 35.07 | 72.56 | 55.49 | 54.31 |
| Qwen3-4B | A-Mem | 34.31 | 50.14 | 34.41 | 61.35 | 48.48 | 45.74 |
| Qwen3-4B | Mem0 | 41.17 | 51.38 | 31.72 | 60.05 | 39.16 | 44.70 |
The headline delta is 4.82 percentage points over the strongest baseline on Qwen2.5-7B, and 8.57 points on Qwen3-4B. A couple of things stand out:
- The 4B model with AgeMem beats the 7B model with either baseline. Learned memory buys you more than 3B parameters on these tasks, which is the headline finding if you had to pick one.
- HotpotQA shows the biggest gap. Multi-hop question answering is exactly the setting where getting the retrieval right outweighs raw model capacity, which tracks with the hypothesis.
- PDDL barely moves on Qwen3-4B (35.07 vs 34.41). Pure symbolic planning doesn’t lean on memory the same way; the bottleneck is elsewhere. The authors report it honestly rather than dropping it from the table.
Why this approach is different
The cleanest way to see what’s new here is to compare the memory stack across the three systems:
| System | Memory logic lives where | Updates from experience? | What the agent controls |
|---|---|---|---|
| Mem0 | External pipeline with LLM-driven extraction, fixed write heuristics | No, static policy | Nothing, really |
| A-Mem | External pipeline with some LLM-authored decisions | Partial; prompted, not trained | Write triggers via prompt |
| AgeMem | Inside the policy, trained end-to-end | Yes, via stage-3 RL | Every memory action |
Mem0 is the production-ready baseline. It works, it scales, and the code is clean. But the decisions about what to store and when are hard-coded or heuristic. A-Mem introduces some LLM-driven reasoning about memory but still treats it as a side-channel.
AgeMem’s claim: once memory becomes just another action, the usual RL machinery starts to work on it. You can measure how much a single update call improved a downstream task, and you can train the model to make more of the good calls and fewer of the bad ones. Think of it as a re-framing more than a new architecture, with memory treated as learned behavior rather than fixed plumbing.
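One simple way to operationalize "measure how much a single update call improved a downstream task" is a counterfactual ablation: replay the task with and without the entry and take the difference in return. This probe is my illustration of the idea, not the paper's published procedure:

```python
def memory_credit(run_task, memory, entry_id):
    """Credit for one memory entry = return with it minus return without it.
    (Illustrative counterfactual probe; names are hypothetical.)"""
    with_entry = run_task(memory)
    ablated = {k: v for k, v in memory.items() if k != entry_id}
    return with_entry - run_task(ablated)

# Toy task reward: succeeds fully only if the agent remembered the user's city.
reward = lambda mem: 1.0 if "user_city" in mem else 0.25

print(memory_credit(reward, {"user_city": "Lisbon", "noise": "x"}, "user_city"))
# → 0.75
```

In a real training loop you would not rerun tasks per entry (that scales terribly); the appeal of putting memory inside the policy is that credit assignment comes from the RL machinery instead of explicit ablations like this one.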
If you’re already familiar with the MemPalace review I wrote last week, that product is a consumer-facing memory layer with a fixed write policy. AgeMem is the research direction that, if it holds up, could eventually replace the write policy in something like MemPalace with a learned one.
What the paper doesn’t show
A few limitations worth noting, in the spirit of reading the paper rather than the press release:
Only Qwen backbones. No Llama, no Mistral, no closed models. The authors hint that the three-stage recipe generalizes, but we don’t know that yet. If you rebuild this with GPT-4o or Claude as the base, does the RL stage still move the needle, or is most of the gain already baked into a frontier model’s tool-use policy?
Evaluation is all on text-heavy, single-agent tasks. ALFWorld and BabyAI are grid-world proxies for embodiment, not real embodiment. HotpotQA is classical multi-hop QA. Nothing in the eval stack tests the messiest real-world memory setting: multi-user, multi-session, cross-device, with privacy constraints.
No latency or cost numbers. The summary and filter operations are LLM calls. At inference time, a step where the agent calls summary on a 4k-token chunk is a real GPU budget hit. The paper doesn't break down how many extra tokens AgeMem spends to earn those 4-8 percentage points. This is the number production engineers will want next.
Memory is still just text blobs under the hood. No structured schemas, no graph, no embeddings-plus-metadata. If your app needs to query memory by relationship (who knows who, when did what happen), you're back to bolting on a graph store.
None of these are fatal. They are where the follow-up work will land.
What this means if you build agents
If you’re wiring up a LangGraph or CrewAI agent today and reaching for Mem0 or A-Mem, AgeMem isn’t something you can drop in; the training code is what makes it work, and the checkpoints aren’t released as of this writing. But the lesson is portable:
- Stop treating memory as plumbing. If your agent is failing because it remembered the wrong thing, more vectors and smarter retrieval heuristics probably won’t fix it. Usually the write policy itself is what’s broken.
- Give the model explicit memory tools. Even without RL training, exposing remember_fact(key, value) and forget_fact(key) as tool calls and describing when to use them in the system prompt moves real perf on long tasks. I've tested this informally with Claude and GPT-4 class models and the effect shows up even without fine-tuning, just from better prompting.
- Log the memory trace. When your agent fails, the retrieval log is usually where the smoking gun is. If you don't have one, build one.
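A minimal version of the second bullet, with remember_fact / forget_fact treated as a hypothetical tool surface (the FACTS store, return strings, and prompt wording are all mine):

```python
# Prompt-level memory tools -- no fine-tuning required. The handlers
# below are what your agent framework would invoke on each tool call.
FACTS = {}

def remember_fact(key: str, value: str) -> str:
    FACTS[key] = value
    return f"remembered {key}"

def forget_fact(key: str) -> str:
    FACTS.pop(key, None)
    return f"forgot {key}"

SYSTEM_PROMPT = """You have two memory tools.
Call remember_fact(key, value) when the user states a durable preference,
deadline, or identity fact. Call forget_fact(key) when a stored fact is
corrected or expires. Do not store transient chit-chat."""
```

The "describing when to use them" part is doing most of the work here: without the usage guidance in the system prompt, models tend to either never call the tools or store everything indiscriminately.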
The broader trend across the memory-for-agents literature (this paper, TurboQuant’s KV-cache compression, the TriAttention work on long reasoning, and recent architecture surveys) is that “memory” is turning from a fixed module into a learnable sub-system. Fifteen years of NLP work treated parsing the same way: first rules, then statistical, then neural, then learned end-to-end inside the main model. Memory appears to be following the same trajectory.
FAQ
What is agentic memory in LLMs?
Agentic memory is memory that the agent itself manages through tool calls, instead of an external pipeline managing memory for it. The agent decides when to store a fact, when to update an old one, when to retrieve something, and when to summarize or filter active context. In AgeMem, these decisions are trained with reinforcement learning rather than scripted.
How is AgeMem different from Mem0?
Mem0 is a production memory layer for LLM agents with a mostly fixed write-and-retrieve policy: it stores everything the agent sees, retrieves by similarity, and lets the developer configure thresholds. AgeMem trains the agent to decide those things itself. Averaged over the five benchmarks, AgeMem beat Mem0 by 4.8 points on Qwen2.5-7B and 9.6 points on Qwen3-4B, though Mem0 wins on production readiness (it ships, AgeMem is research code).
What benchmarks were used?
Five long-horizon benchmarks: ALFWorld (household task planning in a grid world), SciWorld (science experiments as agents), PDDL (classical planning problems), BabyAI (instruction following in a grid world), and HotpotQA (multi-hop question answering). The first four stress action planning, HotpotQA stresses retrieval.
Will this work with GPT-4 or Claude?
Unclear. The paper only tested Qwen2.5-7B and Qwen3-4B, and the results come from reinforcement learning fine-tuning on those specific backbones. Frontier models likely have some of these memory behaviors baked in already from their own post-training, but nobody has run the full three-stage recipe on a closed model.
Is the code released?
At the time of writing, the paper is on arXiv but the trained checkpoints do not appear to be publicly released. If you want to run AgeMem yourself you’ll need to reproduce the training, which means the three-stage RL pipeline plus step-wise GRPO on top of a Qwen base.
Does this replace RAG?
Not yet, and maybe not ever. RAG retrieves from external knowledge: the corporate wiki, the API docs, the PDF store. AgeMem manages the agent’s own memory of past interactions. The two are complementary: retrieve facts from RAG, remember preferences and past context with agentic memory.
Bottom line
The AgeMem paper isn’t the first work to argue that LLM agents should own their memory, but it’s one of the first to put a concrete RL recipe behind the argument and get clean benchmark wins. The 8.57-point gap on Qwen3-4B is real, and the fact that a 4B model with learned memory outperforms a 7B model with frozen memory is the kind of result that should change how people design agent stacks this year.
The limitations are typical for this stage of research: narrow backbones, clean benchmarks, no cost accounting. Even so, the core move of training memory decisions end-to-end with the agent’s policy looks like the right direction, and the next generation of production agent frameworks will probably steal it whether they cite the paper or not.
