Lab Experiment · Agent Memory

What agents remember.

Memory is the difference between a stateless script and an intelligent agent. We're testing every major memory architecture — in-context buffers, vector stores, structured entity memory, and hybrid routers — to find where each one breaks and what it actually costs to get recall right.

In progress · Experiment · EXP-AM-001
Key Findings · EXP_AM
F_01 · 84% · Vector recall at k=3 · With structured fact summarisation
F_02 · 61% · Recall with raw turns · No summarisation before embedding
F_03 · 0% · Entity memory hallucination rate · vs 6.2% vector-only in CRM workflows
F_04 · 120ms · Retrieve latency (vector) · Median, Pinecone + OpenAI embeddings
F_05 · 60% · Cost reduction (hybrid router) · vs always-on vector retrieval
F_06 · 10 · Break-even session length · Turns before vector memory earns its cost
Active Experiments

Four architectures under test.

AM-001 Live

In-Context Window Buffer

Naive approach: keep recent turns in context. Cheap and fast for short workflows. Degrades sharply past 8k tokens — context compression is required.

In-Context · GPT-4o · Claude 3.5
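
A minimal sketch of the buffer-plus-compression idea in AM-001, under stated assumptions: tokens are approximated by whitespace-separated words, and `truncate_compress` is a hypothetical stand-in for an LLM summariser. The class and function names are illustrative and are not taken from the harness.

```python
from collections import deque
from typing import Callable, Deque, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate_compress(summary: str, turn: str, keep: int = 3) -> str:
    """Hypothetical compressor: append the first `keep` words of the evicted turn."""
    snippet = " ".join(turn.split()[:keep])
    return f"{summary} {snippet}".strip()


class ContextBuffer:
    """Keeps recent turns verbatim and folds older ones into a running summary."""

    def __init__(self, budget: int = 8000,
                 compress: Callable[[str, str], str] = truncate_compress) -> None:
        self.budget = budget          # token budget (8k mirrors the threshold above)
        self.compress = compress      # assumed signature: (summary, evicted_turn) -> summary
        self.summary = ""
        self.turns: Deque[str] = deque()

    def size(self) -> int:
        return count_tokens(self.summary) + sum(count_tokens(t) for t in self.turns)

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        # Evict the oldest turns into the summary until the buffer fits,
        # always keeping at least the newest turn verbatim.
        while self.size() > self.budget and len(self.turns) > 1:
            oldest = self.turns.popleft()
            self.summary = self.compress(self.summary, oldest)

    def render(self) -> List[str]:
        """Return the lines that would be placed in the prompt."""
        head = [f"[summary] {self.summary}"] if self.summary else []
        return head + list(self.turns)
```

With a toy budget of 10 tokens, adding two six-word turns evicts the first into the summary and keeps the second verbatim. A real deployment would swap in the model's tokenizer and an actual summarisation call.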
AM-002 Live

Vector Store Episodic Memory

Embed conversation summaries into a vector DB. Retrieve top-k on new query. Recall accuracy: 84% at k=3, 91% at k=6. Latency adds ~120ms per retrieve.

Pinecone · OpenAI Embeddings · LangGraph
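
The embed-then-retrieve loop in AM-002 can be sketched without any external services. Everything here is an assumption for illustration: `embed` is a toy bag-of-words vector standing in for an embedding model, and `EpisodicStore` is an in-memory list standing in for the vector DB. Only the top-k ranking logic is meant to carry over.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def embed(text: str) -> Vector:
    """Toy embedding: L2-normalised bag of lowercase words (not a real model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(a: Vector, b: Vector) -> float:
    # Both vectors are unit length, so the dot product equals cosine similarity.
    return sum(v * b.get(w, 0.0) for w, v in a.items())


class EpisodicStore:
    """In-memory stand-in for a vector index."""

    def __init__(self) -> None:
        self.items: List[Tuple[str, Vector]] = []

    def upsert(self, summary: str) -> None:
        self.items.append((summary, embed(summary)))

    def top_k(self, query: str, k: int = 3) -> List[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = EpisodicStore()
store.upsert("customer prefers email contact")
store.upsert("invoice due on friday")
store.upsert("customer renewal date is march")
best = store.top_k("when is the invoice due", k=1)
```

Here `best` is `["invoice due on friday"]`, because that summary shares the most query terms. Raising `k` trades extra context tokens for a better chance that the relevant summary is included, which is the k=3 versus k=6 trade-off reported above.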
AM-003 In progress

Structured Entity Memory

Extract named entities and facts from conversations, store as typed records. Agents query memory as a structured DB. Zero hallucinated facts in 200 test runs.

Postgres · Structured Extraction · Tool Call
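
A sketch of the typed-record idea in AM-003, using Python's built-in `sqlite3` as a stand-in for Postgres. The `facts` schema and the `remember` / `recall` helpers are hypothetical, and the extraction step is assumed to have already produced (entity, attribute, value) triples.

```python
import sqlite3
from typing import Optional

# In-memory database standing in for the Postgres store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE facts (
        entity    TEXT    NOT NULL,
        attribute TEXT    NOT NULL,
        value     TEXT    NOT NULL,
        session   INTEGER NOT NULL,
        PRIMARY KEY (entity, attribute)
    )
    """
)


def remember(entity: str, attribute: str, value: str, session: int) -> None:
    """Store a fact; a newer value for the same (entity, attribute) replaces the old one."""
    conn.execute(
        "INSERT OR REPLACE INTO facts (entity, attribute, value, session) "
        "VALUES (?, ?, ?, ?)",
        (entity, attribute, value, session),
    )


def recall(entity: str, attribute: str) -> Optional[str]:
    """Exact lookup: returns the stored value, or None if nothing was recorded."""
    row = conn.execute(
        "SELECT value FROM facts WHERE entity = ? AND attribute = ?",
        (entity, attribute),
    ).fetchone()
    return row[0] if row else None


remember("Dana Lee", "email", "dana@example.com", session=1)
remember("Dana Lee", "email", "dana.lee@example.com", session=7)
```

After these two calls, `recall("Dana Lee", "email")` returns the session-7 value and `recall("Dana Lee", "phone")` returns `None`. An exact lookup either finds a recorded value or reports nothing, so the agent has no near-match to paraphrase; that may be part of why this design avoids hallucinated facts, though the source does not state the mechanism.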
AM-004 In progress

Hybrid Memory Router

Route memory reads to in-context, vector, or structured store based on query type. Classifier adds 18ms overhead but reduces retrieve cost by 60%.

LangGraph · Classifier · Hybrid
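
The routing step in AM-004 might look like the sketch below. The source does not describe the classifier, so this uses hand-written keyword rules as a hypothetical substitute; the patterns, the `Store` enum, and `route` are all assumptions for illustration.

```python
import re
from enum import Enum


class Store(Enum):
    CONTEXT = "in-context"
    VECTOR = "vector"
    STRUCTURED = "structured"


# Hypothetical rules, checked in order. A trained classifier would replace these.
RECENT = re.compile(r"\b(just|earlier|a moment ago|you said|last message)\b", re.I)
FACTUAL = re.compile(r"\b(email|phone|address|title|company|owner|renewal date)\b", re.I)


def route(query: str) -> Store:
    """Pick the cheapest store likely to answer the query."""
    if RECENT.search(query):
        return Store.CONTEXT      # already in the prompt, no retrieval needed
    if FACTUAL.search(query):
        return Store.STRUCTURED   # exact attribute lookup
    return Store.VECTOR           # open-ended recall falls back to similarity search
```

For example, "What did you just suggest?" routes to the in-context buffer, "What is Dana's email?" routes to the structured store, and "How did we resolve the onboarding issue?" falls through to vector retrieval. The saving comes from skipping the vector call whenever one of the first two branches matches.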
Signal So Far

What the data says.

Three findings with enough runs behind them to publish. More land in our weekly newsletter as we validate them.

FIND_01 · Vector memory recall degrades without explicit summarisation at handoff.

Storing raw conversation turns verbatim yields 61% recall at k=3. Summarising each turn into a structured fact bundle before embedding raises recall to 84%. The compression step costs ~200ms but pays for itself by the second retrieval.

FIND_02 · Entity memory outperforms vector retrieval for relationship-heavy workflows.

In CRM-style agents (track a contact across 20 sessions), structured entity memory had zero hallucinated facts vs. 6.2% hallucination rate from vector-only retrieval. Cost: a Postgres row vs. an embedding API call per fact.

FIND_03 · Memory architecture should match workflow length, not LLM capability.

The biggest mistake we see is over-engineering memory for short workflows. If an agent handles tasks under 10 turns and doesn't need cross-session recall, an in-context buffer plus a conversation summary at session end is sufficient and 80% cheaper than a vector store.
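
Read together, the three findings suggest a simple decision rule. The function below is one possible encoding of it, not a published recommendation: the 10-turn threshold comes from the text above, while the function name, parameters, and the ordering of the checks are assumptions.

```python
def choose_memory(expected_turns: int,
                  cross_session: bool,
                  relationship_heavy: bool = False) -> str:
    """Suggest a memory architecture from coarse workflow traits."""
    # FIND_03: short, single-session workflows do not need a retrieval store.
    if expected_turns < 10 and not cross_session:
        return "in-context buffer + end-of-session summary"
    # FIND_02: tracking entities across sessions favours typed records.
    if relationship_heavy:
        return "structured entity memory"
    # Otherwise fall back to summarised episodic recall (FIND_01).
    return "vector store episodic memory"
```

A 6-turn support chat with no cross-session recall maps to the in-context option, while a CRM-style agent following a contact across 20 sessions maps to structured entity memory.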

Under the Hood

The test harness.

memory.bench · AM-002 · vector store recall test
$ memory bench --arch vector --sessions 50 --k 3
— loading 50 historical sessions (2,400 turns) —
 
Embedding 2,400 turns → 2,400 vectors (text-embedding-3-small)
Upserted to Pinecone index: agent-memory-bench
 
— running 200 recall probes —
Recall@3 (raw turns): 61.0% ↓ below threshold
Recall@3 (fact summaries): 84.0% ✓ above threshold
Recall@6 (fact summaries): 91.0%
Median retrieve latency: 118ms
 
$ results → /data/memory/am-002-run-012.json
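
For readers reproducing the Recall@k lines above, here is one way such a metric could be computed. The harness's actual scoring is not shown in the source, so this assumes each probe has exactly one expected memory id and a ranked list of retrieved ids; `recall_at_k` and the sample ids are hypothetical.

```python
from typing import List, Sequence


def recall_at_k(expected: Sequence[str],
                retrieved: Sequence[List[str]],
                k: int) -> float:
    """Fraction of probes whose expected id appears in the top-k retrieved ids."""
    if len(expected) != len(retrieved):
        raise ValueError("expected and retrieved must have one entry per probe")
    if not expected:
        return 0.0
    hits = sum(1 for gold, ranked in zip(expected, retrieved) if gold in ranked[:k])
    return hits / len(expected)


# Four toy probes: the gold id sits at rank 1, rank 3, rank 4, and is missing.
expected = ["m1", "m2", "m3", "m4"]
retrieved = [
    ["m1", "x", "y"],
    ["x", "y", "m2"],
    ["x", "y", "z", "m3"],
    ["x", "y", "z"],
]
r3 = recall_at_k(expected, retrieved, k=3)  # 0.5
r4 = recall_at_k(expected, retrieved, k=4)  # 0.75
```

The toy data shows why recall rises with k: the third probe's memory sits at rank 4, so it only counts as a hit once k reaches 4.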
Building an agent that needs cross-session memory? We scope the right architecture in 30 minutes.
See agent services