TL;DR

MemPalace is a free, open-source AI memory system that stores your LLM conversations locally and retrieves them with impressive accuracy. The “palace” architecture — organizing memories into wings, halls, and rooms — is a clever idea that beats flat vector search by 34% in retrieval tests. But the project’s marketing oversells it badly. The headline “100% LongMemEval” score was gamed, the README lists features the code doesn’t have, and the MCP integration ships with a stdout bug that breaks Claude Desktop. If you can look past the hype, there’s a solid local memory tool underneath. Just don’t trust the box art.

The Viral Launch

On April 5, 2026, actress Milla Jovovich pushed a Python repository to her personal GitHub account. Within 48 hours, MemPalace had 7,000 stars. By April 8, it crossed 23,000 stars and nearly 3,000 forks, making it the #1 trending repo on GitHub.

The project was actually built by Ben Sigman, a crypto CEO, with Jovovich serving as co-creator and public face. Jovovich’s entire GitHub history consisted of 7 commits across 2 days, a detail that immediately raised eyebrows in developer communities. Kotaku ran a piece calling it “Snake Oil.” Hacker News threads debated whether the benchmarks were fabricated.

I spent the last few days digging into the code, running the benchmarks, and testing it with Claude Code.

What MemPalace Actually Does

Most AI memory systems work the same way: your conversations get chunked, embedded into vectors, and thrown into a database. When you ask a question, the system does a similarity search and pulls back relevant chunks. Mem0, Zep, and Letta all follow this pattern, with varying levels of LLM-assisted summarization on top.

MemPalace does something different. It borrows from the ancient memory palace mnemonic technique and applies it to vector retrieval. Instead of a flat index, your conversations are organized into a spatial hierarchy:

  • Wings — people and projects (e.g., “Work with Sarah,” “Side project Rust CLI”)
  • Halls — types of memory (fact recall, temporal events, multi-hop reasoning, knowledge updates, synthesis)
  • Rooms — specific ideas or conversation threads
  • Tunnels — connections between rooms across wings
  • Drawers — individual memory entries stored verbatim in ChromaDB

```mermaid
flowchart TD
    P[Palace] --> W1[Wing: Work]
    P --> W2[Wing: Side Project]
    W1 --> H1[Hall: Facts]
    W1 --> H2[Hall: Decisions]
    W2 --> H3[Hall: Code]
    W2 --> H4[Hall: Bugs]
    H1 --> R1[Room: API Keys]
    H1 --> R2[Room: Team Roster]
    H3 --> R3[Room: Auth Module]
    R1 -.tunnel.-> R3
```
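The hierarchy maps naturally onto a few nested types. Here is a minimal Python sketch of the shape; the class and field names are mine, not MemPalace’s actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Drawer:
    """One memory entry, stored verbatim."""
    text: str

@dataclass
class Room:
    name: str
    drawers: list[Drawer] = field(default_factory=list)
    tunnels: list["Room"] = field(default_factory=list)  # cross-wing links

@dataclass
class Hall:
    kind: str  # e.g. "facts", "temporal", "multi-hop", "updates", "synthesis"
    rooms: list[Room] = field(default_factory=list)

@dataclass
class Wing:
    name: str  # a person or project
    halls: list[Hall] = field(default_factory=list)

@dataclass
class Palace:
    wings: list[Wing] = field(default_factory=list)

# Build a fragment of the structure shown in the diagram
work = Wing("Work", [Hall("facts", [Room("Team Roster")])])
side = Wing("Side Project", [Hall("code", [Room("Auth Module")])])
palace = Palace([work, side])
```

The key design point is that retrieval can be scoped to any node in the tree, which is what makes the two-pass search below possible.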

When a query comes in, MemPalace runs a two-pass retrieval. First pass: classify the question into one of five halls, search only that hall for high-precision results. Second pass: search the full corpus with hall-based score bonuses to catch miscategorized sessions. The result is a structured recall that consistently outperforms flat semantic search.
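In code, the two-pass scheme might look like this toy sketch. I stand in for ChromaDB with a plain keyword scorer, and the hall classifier, the score-bonus value, and every function name here are my assumptions, not MemPalace internals:

```python
MEMORIES = [
    {"hall": "facts",    "text": "Sarah's API key rotates every 90 days"},
    {"hall": "temporal", "text": "We shipped the auth module on March 3"},
    {"hall": "facts",    "text": "The team roster has six engineers"},
]

def score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words present in the memory."""
    words = query.lower().split()
    return sum(w in text.lower() for w in words) / len(words)

def classify_hall(query: str) -> str:
    """Toy classifier: date-flavored queries route to the temporal hall."""
    temporal_cues = ("when", "date", "shipped", "last")
    return "temporal" if any(c in query.lower() for c in temporal_cues) else "facts"

def retrieve(query: str, top_k: int = 2, hall_bonus: float = 0.25):
    hall = classify_hall(query)
    # Pass 1: high-precision search restricted to the predicted hall.
    in_hall = [m for m in MEMORIES if m["hall"] == hall]
    ranked = sorted(in_hall, key=lambda m: score(query, m["text"]), reverse=True)
    if ranked and score(query, ranked[0]["text"]) > 0.5:
        return ranked[:top_k]
    # Pass 2: full-corpus search with a bonus for the predicted hall,
    # to catch sessions the classifier miscategorized.
    def bonused(m):
        return score(query, m["text"]) + (hall_bonus if m["hall"] == hall else 0.0)
    return sorted(MEMORIES, key=bonused, reverse=True)[:top_k]

print(retrieve("when did we ship the auth module")[0]["text"])
```

The real system presumably uses embedding similarity instead of word overlap, but the control flow is the same: narrow first, widen with a bias second.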

The entire stack runs offline: ChromaDB for vectors, SQLite for the knowledge graph and metadata, and optionally a local Llama model for reranking. Zero cloud calls, zero API costs.

The Benchmark Mess

The benchmarks are where the story falls apart. MemPalace claims the “highest score on LongMemEval ever benchmarked.” The numbers they publish:

| Benchmark | MemPalace | Mem0 | Zep | Letta |
| --- | --- | --- | --- | --- |
| LongMemEval (raw) | 96.6% | ~85% | ~82% | ~80% |
| LongMemEval (hybrid) | 100% | N/A | N/A | N/A |
| LoCoMo | 100% | ~90% | ~87% | ~85% |

Those numbers look incredible. And if you stop there, MemPalace seems like it destroyed the competition. But each claim has a catch.

The 100% LongMemEval score was hand-tuned. A GitHub issue (#29) revealed that the team identified which specific questions the system got wrong, engineered fixes for those exact questions, and retested on the same set. They then reported a perfect score. They overfitted to the test set. To their credit, after community pushback, they revised the headline number to 96.6% (the pre-tuning score).

The 100% LoCoMo score is trivially achievable. LoCoMo conversation sessions contain 19–32 items. MemPalace ran it with top_k=50. When your retrieval window is bigger than the entire candidate pool, you retrieve everything by default. You aren’t testing the retrieval system at that point — you’re testing whether ChromaDB can return a list.
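You can verify that claim degenerates with a two-line check: for every possible LoCoMo session size, a top_k=50 window returns the entire pool, so recall is 100% no matter how the retriever ranks anything.

```python
# LoCoMo sessions contain 19-32 candidate items; MemPalace ran with top_k=50.
for pool_size in range(19, 33):
    retrieved = min(50, pool_size)   # a top-k window can't return more than exists
    recall = retrieved / pool_size   # every item comes back
    assert recall == 1.0
print("recall is 100% for every possible session size")
```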

There’s also a subtler problem with what these benchmarks actually measure. LongMemEval scores use recall_any@5, which measures whether the correct memory appears somewhere in the top 5 retrieved chunks. That’s a very different question from “did the system answer correctly.” One developer reported that when you plug MemPalace into an LLM and actually ask questions, you get the right answer about 17% of the time.
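The gap between the two metrics is easy to demonstrate. recall_any@5 counts a query as a hit if the gold memory appears anywhere in the top 5 chunks, regardless of whether a downstream LLM would actually use it to answer correctly. A minimal scorer, with invented data:

```python
def recall_any_at_k(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold memory appears among the top-k chunks."""
    hits = sum(g in r[:k] for r, g in zip(results, gold))
    return hits / len(gold)

# Invented example: 3 queries, each with 5 retrieved chunks.
retrieved = [
    ["a", "b", "c", "d", "gold1"],  # hit, but buried at rank 5
    ["gold2", "x", "y", "z", "w"],  # hit at rank 1
    ["p", "q", "r", "s", "t"],      # miss: gold3 never retrieved
]
gold = ["gold1", "gold2", "gold3"]
print(recall_any_at_k(retrieved, gold))  # 2/3
```

A memory buried at rank 5 among four distractors still counts as a full hit here, which is exactly why a high recall_any@5 score can coexist with a low end-to-end answer rate.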

What’s Genuinely Impressive

Strip away the inflated claims and the honest numbers still beat every free alternative. 96.6% raw recall on LongMemEval is the highest published score for any local-only, zero-cost memory system. Mem0 and Zep hover around 82-85%, and they charge $19-249/month and $25+/month respectively.

The palace structure itself works. Their internal tests on 22,000+ stored conversation memories show a 34% retrieval improvement over flat ChromaDB semantic search. That matters. Spatial organization helps disambiguation in ways that pure embedding similarity can’t handle.

  • 96.6% LongMemEval score (raw, honest)
  • $0 monthly cost
  • 23K+ GitHub stars in 48 hours
  • 34% retrieval gain vs. flat search

Setting It Up With Claude Code

Installation takes about two minutes:

```shell
pip install mempalace
mempalace init
```

MemPalace ships as an MCP server with 19 tools. If you’re running Claude Code, the tools are auto-discovered; they cover search, storage, knowledge graph queries, and agent diaries. You can also configure a Stop hook in ~/.claude/settings.json that triggers every 15 messages, performing a structured save and rebuilding the L1 layer (a key-facts index).
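For reference, a Stop hook entry in ~/.claude/settings.json has roughly this shape. This is a hedged sketch: the outer structure follows Claude Code’s hooks format, but the `mempalace hook-save` command is my guess at how the save would be wired, not a confirmed CLI subcommand, and the every-15-messages counting would live inside whatever that command does.

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "mempalace hook-save"
          }
        ]
      }
    ]
  }
}
```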

The stdout bug: As of April 9, MemPalace writes human-readable startup text to stdout instead of stderr. When Claude Desktop launches it as an MCP subprocess, the startup text corrupts the JSON message stream and you get parse errors. Issue #225 is tracking this. For now, Claude Code works fine — the bug only hits Claude Desktop’s MCP integration.
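The underlying rule for stdio-based MCP servers is simple: stdout belongs exclusively to the JSON-RPC message stream, and anything human-readable must go to stderr. A sketch of the distinction (MemPalace’s actual startup code will differ):

```python
import json
import sys

# Wrong: a banner on stdout interleaves with JSON-RPC messages, and the
# client's parser chokes on the non-JSON line.
# print("MemPalace starting up...")

# Right: diagnostics go to stderr, leaving stdout as a pure protocol channel.
print("MemPalace starting up...", file=sys.stderr)

# stdout should carry only serialized protocol messages like this:
msg = {"jsonrpc": "2.0", "id": 1, "result": {"capabilities": {}}}
sys.stdout.write(json.dumps(msg) + "\n")
```

Claude Code apparently tolerates the stray stdout line; Claude Desktop’s parser does not, which is why only one integration breaks.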

The AAAK compression layer deserves attention too. It’s a custom 30x compression system that packs entity names and relationships into a shorthand dialect readable by any LLM. A typical 6-month conversation history (~19.5M tokens) compresses to about 650K tokens. The trade-off: AAAK mode scores 84.2% on LongMemEval vs 96.6% in raw mode. You’re trading recall for storage efficiency, and at scale that might be worth it.
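The compression arithmetic checks out: 19.5M tokens down to roughly 650K is exactly the claimed 30x.

```python
raw_tokens = 19_500_000      # ~6 months of conversation history
compressed_tokens = 650_000  # after AAAK compression
ratio = raw_tokens / compressed_tokens
print(f"{ratio:.0f}x")  # 30x
```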

The Code vs. The README

I read through several of the open GitHub issues, and Issue #27 flagged something important: multiple README claims don’t match the actual codebase.

The README advertises “contradiction detection” that automatically flags inconsistencies against the knowledge graph. But knowledge_graph.py contains no contradiction logic; the only deduplication simply blocks exact-duplicate triples. It’s a feature that’s described but not built.
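The difference matters in practice: exact-triple deduplication and contradiction detection are different checks. A sketch of the two, assuming a plain (subject, predicate, object) triple format rather than knowledge_graph.py’s actual schema:

```python
def is_duplicate(new, existing):
    """What the code reportedly does: block only byte-identical triples."""
    return new in existing

def contradicts(new, existing):
    """What the README promises: flag same subject+predicate, different object."""
    s, p, o = new
    return any(es == s and ep == p and eo != o for es, ep, eo in existing)

graph = {("user", "lives_in", "Berlin")}
new_fact = ("user", "lives_in", "Lisbon")

print(is_duplicate(new_fact, graph))  # False: dedup lets the conflict through
print(contradicts(new_fact, graph))   # True: a real check would flag it
```

Under dedup-only logic, the graph happily stores both facts, and retrieval later has to guess which city is current.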

More concerning: the palace structure (wings, halls, rooms) isn’t actually used in the benchmark measurements. The LongMemEval scores measure ChromaDB’s default embedding model performance. The palace routing happens in the application layer above, but the published benchmarks bypass it. So when they say “MemPalace scores 96.6%,” what they mean is “ChromaDB with our embedding config scores 96.6%, and MemPalace adds spatial routing on top for real-world use.”

It’s not fraud. The retrieval layer does work well. But it’s a disconnect between what’s being marketed and what’s being measured.

MemPalace vs Mem0 vs Zep

I put MemPalace side by side with the two biggest paid alternatives:

| Feature | MemPalace | Mem0 | Zep |
| --- | --- | --- | --- |
| Price | Free | $19-249/mo | $25+/mo |
| Hosting | Local only | Cloud (managed) | Cloud (managed) |
| Storage approach | Verbatim (everything) | LLM-summarized | LLM-summarized |
| LongMemEval | 96.6% | ~85% | ~82% |
| Team/shared memory | No | Yes | Yes |
| Privacy | Full (offline) | Data leaves machine | Data leaves machine |
| Enterprise SLAs | No | Yes | Yes |
| MCP integration | 19 tools | Limited | Limited |
| Maturity | 5 days old | 2+ years | 2+ years |

Pick MemPalace if you’re a solo developer who wants maximum accuracy, full privacy, and zero cost. Pick Mem0 or Zep if you need shared team memory, enterprise SLAs, or don’t want to manage local infrastructure.

The verbatim storage approach is a real differentiator. Mem0 and Zep use LLMs to decide what information to keep and what to discard. That saves storage but also throws away the reasoning behind decisions. MemPalace keeps every word and lets the retrieval system figure out what’s relevant at query time. For long-running projects where context from months ago might suddenly matter, that retrieval strategy pays off.

Who Should Use This

Good fit:

  • Solo devs who want persistent memory across Claude Code sessions
  • Privacy-conscious users who won’t send conversations to cloud APIs
  • People building personal AI assistants that remember long-term context
  • Anyone tired of re-explaining project context to their AI tools every morning

Bad fit:

  • Teams that need shared memory across multiple users
  • Anyone who needs production uptime guarantees
  • Anyone who needs the system to work with Claude Desktop right now (stdout bug)
  • People who expect the README to match the code (give it a few weeks)

FAQ

Is MemPalace actually built by Milla Jovovich?

The core engineering was done by Ben Sigman. Jovovich co-designed the palace metaphor architecture and is listed as co-creator. She had 7 commits across 2 days of GitHub history at launch. The project is real and functional regardless of attribution questions.

Does MemPalace really score 100% on benchmarks?

No. The honest number is 96.6% on LongMemEval in raw mode. The 100% claim was achieved by hand-tuning fixes for specific failing questions and retesting on the same set. The team revised this after community pushback.

Can I use MemPalace with GPT or Gemini instead of Claude?

Yes. MemPalace stores memories in ChromaDB and the AAAK compression format is readable by any LLM. The MCP server integration is most polished for Claude Code, but the underlying memory system is model-agnostic.

How much disk space does MemPalace use?

A 6-month conversation history (~19.5M tokens) takes roughly 50-100MB with AAAK compression enabled. Without compression, expect more since MemPalace stores conversations verbatim rather than summarizing them.

Should I switch from Mem0 to MemPalace?

If you’re a solo dev and privacy matters to you, yes. MemPalace’s retrieval accuracy is higher and it’s free. But if you need team features, managed infrastructure, or enterprise support, Mem0 is the safer choice. MemPalace is 5 days old — don’t bet your production system on it yet.

My Take

MemPalace is an interesting project wrapped in bad marketing. The palace architecture works. Organizing memories spatially instead of dumping them in a flat index produces better retrieval results. The 96.6% LongMemEval score is real and beats every other free tool I’ve tested. Zero API costs, full privacy, 19 MCP tools for Claude Code.

But the benchmark manipulation was dumb and unnecessary. The honest numbers are already impressive. Claiming 100% and getting caught overfitting to a test set undermined trust before the project had a chance to build it. The README listing features that don’t exist in the codebase makes it worse. And the celebrity marketing angle, however attention-grabbing, invited exactly the kind of scrutiny that exposed these gaps.

Give it a month. Let the community fix the bugs (the stdout issue, the missing contradiction detection), let the benchmarks get independently verified, and let the hype cycle cool down. If the 96.6% holds up and the palace architecture proves out at scale, MemPalace could be the default local memory system for AI coding workflows. It just needs to stop trying so hard to convince you it already is.