Lab Experiment · Agent Memory

What agents remember.

Memory is the difference between a stateless script and an intelligent agent. We're testing every major memory architecture — in-context buffers, vector stores, structured entity memory, and hybrid routers — to find where each one breaks and what it actually costs to get recall right.

In progress · Experiment · EXP-AM-001

Key Findings · EXP_AM

F_01 · 84% · Vector recall at k=3 · With structured fact summarisation

F_02 · 61% · Recall with raw turns · No summarisation before embedding

F_03 · 0% · Entity memory hallucination rate · vs 6.2% vector-only in CRM workflows

F_04 · 120ms · Retrieve latency (vector) · Median, Pinecone + OpenAI embeddings

F_05 · 60% · Cost reduction (hybrid router) · vs always-on vector retrieval

F_06 · 10 · Break-even session length · Turns before vector memory earns its cost

Active Experiments

Four architectures under test.

AM-001 Live

In-Context Window Buffer

Naive approach: keep recent turns in context. Cheap and fast for short workflows. Degrades sharply past 8k tokens — context compression is required.

In-Context · GPT-4o · Claude 3.5
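
To make the buffer-plus-compression idea concrete, here is a minimal sketch of a token-budgeted buffer that folds the oldest turns into a rolling summary once the budget is exceeded. It is an illustration, not the harness code: `ContextBuffer`, `estimate_tokens`, the 4-characters-per-token estimate, and the truncation-based `compress` step are all assumptions. A real agent would use a proper tokenizer and an LLM summarisation call.

```python
from collections import deque
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token; swap in a real tokenizer.
    return max(1, len(text) // 4)


@dataclass
class ContextBuffer:
    max_tokens: int = 8000      # budget mirrors the 8k figure above
    summary_chars: int = 2000   # hypothetical cap on the rolling summary
    summary: str = ""
    turns: deque = field(default_factory=deque)

    def used_tokens(self) -> int:
        total = sum(estimate_tokens(t) for t in self.turns)
        return total + (estimate_tokens(self.summary) if self.summary else 0)

    def compress(self, turn: str) -> None:
        # Placeholder compression: keep the first 80 characters of the
        # evicted turn. A real system would summarise with an LLM here.
        snippet = turn[:80]
        combined = f"{self.summary} | {snippet}" if self.summary else snippet
        self.summary = combined[-self.summary_chars:]

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        # Evict oldest turns into the summary until we are back under budget.
        while self.used_tokens() > self.max_tokens and len(self.turns) > 1:
            self.compress(self.turns.popleft())

    def render(self) -> str:
        parts = [f"[summary] {self.summary}"] if self.summary else []
        return "\n".join(parts + list(self.turns))
```

The most recent turns stay verbatim while older ones survive only in compressed form, which is the trade-off the degradation past 8k tokens forces.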

AM-002 Live

Vector Store Episodic Memory

Embed conversation summaries into a vector DB and retrieve the top-k matches on each new query. Recall accuracy: 84% at k=3, 91% at k=6. Each retrieval adds ~120ms of latency.

Pinecone · OpenAI Embeddings · LangGraph
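
The embed-then-retrieve-top-k loop can be sketched without any external services. The example below is an in-memory stand-in, not the Pinecone setup used in the experiment: `EpisodicStore` is a hypothetical class, and `embed` is a toy bag-of-words vector standing in for a real embedding model. Only the shape of the flow (upsert summaries, rank by cosine similarity, return the k best) is meant to carry over.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a stand-in for a real embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EpisodicStore:
    """Hypothetical in-memory replacement for a vector DB index."""

    def __init__(self) -> None:
        self._items = []  # list of (summary, vector) pairs

    def upsert(self, summary: str) -> None:
        self._items.append((summary, embed(summary)))

    def query(self, text: str, k: int = 3) -> list:
        q = embed(text)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [summary for summary, _ in ranked[:k]]


store = EpisodicStore()
store.upsert("customer prefers email contact")
store.upsert("order 1182 shipped on monday")
store.upsert("customer asked about refund policy")
store.upsert("meeting scheduled for friday")

top = store.query("what is the refund policy", k=1)
```

Raising `k` widens the net at the cost of more retrieved text in the prompt, which is the same trade-off behind the k=3 versus k=6 numbers above.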

AM-003 In progress

Structured Entity Memory

Extract named entities and facts from conversations, store as typed records. Agents query memory as a structured DB. Zero hallucinated facts in 200 test runs.

Postgres · Structured Extraction · Tool Call
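
A rough sketch of what "typed records queried as a structured DB" could look like follows. It uses Python's built-in `sqlite3` as a stand-in for Postgres so the example runs on its own. The `facts` table, its columns, and the `remember` / `recall` helpers are assumptions for illustration, not the schema used in the experiment.

```python
import sqlite3
from typing import Optional

# In-memory SQLite database standing in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE facts (
        entity     TEXT    NOT NULL,
        attribute  TEXT    NOT NULL,
        value      TEXT    NOT NULL,
        session_id INTEGER NOT NULL,
        PRIMARY KEY (entity, attribute)
    )
    """
)


def remember(entity: str, attribute: str, value: str, session_id: int) -> None:
    # One row per (entity, attribute); a newer fact overwrites the older one.
    conn.execute(
        "INSERT OR REPLACE INTO facts VALUES (?, ?, ?, ?)",
        (entity, attribute, value, session_id),
    )


def recall(entity: str, attribute: str) -> Optional[str]:
    # Exact lookup: returns the stored value or None, never a near match.
    row = conn.execute(
        "SELECT value FROM facts WHERE entity = ? AND attribute = ?",
        (entity, attribute),
    ).fetchone()
    return row[0] if row else None


remember("Dana Ruiz", "role", "VP Sales", session_id=1)
remember("Dana Ruiz", "role", "CRO", session_id=7)  # later session updates the fact
```

Because a lookup either hits an exact row or returns `None`, the agent gets an explicit miss instead of a plausible-looking neighbour, which is one plausible reason this style of memory avoids fabricated facts.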

AM-004 In progress

Hybrid Memory Router

Route memory reads to the in-context, vector, or structured store based on query type. The classifier adds 18ms of overhead but reduces retrieval cost by 60%.

LangGraph · Classifier · Hybrid
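
The routing step can be illustrated with a deliberately simple rule-based classifier. This is a sketch under assumptions: the experiment's actual classifier is not described here, and the `Store` enum, the keyword patterns, and the `route` function are hypothetical names chosen for the example.

```python
import re
from enum import Enum


class Store(Enum):
    IN_CONTEXT = "in_context"
    VECTOR = "vector"
    STRUCTURED = "structured"


# Hypothetical keyword patterns; a production router would likely use a
# trained classifier rather than regular expressions.
RECENT = re.compile(r"\b(just|earlier|above|you said|last message)\b", re.IGNORECASE)
FACT = re.compile(r"\b(who|email|phone|role|title|company|address)\b", re.IGNORECASE)


def route(query: str) -> Store:
    if RECENT.search(query):
        return Store.IN_CONTEXT   # answerable from the live context window
    if FACT.search(query):
        return Store.STRUCTURED   # exact attribute lookup on an entity
    return Store.VECTOR           # fuzzy recall over past sessions
```

For example, `route("What is Dana's email?")` returns `Store.STRUCTURED`, so no embedding call is made for that read. Skipping the vector store whenever a cheaper store can answer is where a cost reduction of this kind would come from.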

Signal So Far

What the data says.

Three findings with enough runs behind them to publish. More land in our weekly newsletter as we validate them.

FIND_01 · Vector memory recall degrades without explicit summarisation at handoff.

Storing raw conversation turns verbatim produces recall rates of 61% at k=3. Summarising each turn into a structured fact bundle before embedding raises recall to 84%. The compression step costs ~200ms but pays off within the second retrieval.
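
As an illustration of what a "structured fact bundle" might look like before embedding, the sketch below flattens extracted facts into one compact string. The `FactBundle` fields and the `to_embedding_text` format are assumptions. The extraction step itself (typically an LLM call) is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FactBundle:
    # Hypothetical shape for one summarised turn.
    session_id: int
    turn_index: int
    subject: str
    facts: List[str]

    def to_embedding_text(self) -> str:
        # Embed the distilled facts instead of the raw, chatty turn text.
        return f"{self.subject}: " + "; ".join(self.facts)


bundle = FactBundle(
    session_id=3,
    turn_index=12,
    subject="Dana Ruiz",
    facts=["prefers email", "renewal due in March"],
)
text_to_embed = bundle.to_embedding_text()
```

The string that gets embedded carries only the subject and its facts, so filler words in the original turn no longer compete with them at retrieval time.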

FIND_02 · Entity memory outperforms vector retrieval for relationship-heavy workflows.

In CRM-style agents (tracking a contact across 20 sessions), structured entity memory produced zero hallucinated facts, versus a 6.2% hallucination rate for vector-only retrieval. Cost: a Postgres row vs. an embedding API call per fact.

FIND_03 · Memory architecture should match workflow length, not LLM capability.

The biggest mistake we see is over-engineering memory for short workflows. If an agent handles tasks under 10 turns and doesn't need cross-session recall, an in-context buffer plus a conversation summary at session end is sufficient and 80% cheaper than a vector store.

Under the Hood

The test harness.

memory.bench · AM-002 · vector store recall test

$ memory bench --arch vector --sessions 50 --k 3

— loading 50 historical sessions (2,400 turns) —

✓ Embedding 2,400 turns → 2,400 vectors (text-embedding-3-small)

✓ Upserted to Pinecone index: agent-memory-bench

— running 200 recall probes —

✓ Recall@3 (raw turns): 61.0% ↓ below threshold

✓ Recall@3 (fact summaries): 84.0% ✓ above threshold

✓ Recall@6 (fact summaries): 91.0% ✓

✓ Median retrieve latency: 118ms

$ results → /data/memory/am-002-run-012.json
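
For readers who want to reproduce the Recall@k figures in the log, one common way to score recall probes is sketched below. It assumes each probe has exactly one expected memory and counts a hit when that memory appears among the top k retrieved IDs. The harness may score probes differently, and `recall_at_k` and the toy probe data are hypothetical.

```python
from typing import List, Sequence, Tuple


def recall_at_k(probes: Sequence[Tuple[str, List[str]]], k: int) -> float:
    """Fraction of probes whose expected ID appears in the top-k results.

    Each probe is (expected_id, ranked_ids), with ranked_ids best-first.
    """
    if not probes:
        return 0.0
    hits = sum(1 for expected, ranked in probes if expected in ranked[:k])
    return hits / len(probes)


# Toy data for illustration only; not results from the experiment.
toy_probes = [
    ("a", ["a", "b", "c", "d"]),  # hit at rank 1
    ("b", ["x", "y", "z", "b"]),  # hit only at rank 4
    ("c", ["x", "c", "y", "z"]),  # hit at rank 2
    ("d", ["x", "y", "z", "w"]),  # miss
]

r3 = recall_at_k(toy_probes, k=3)
r4 = recall_at_k(toy_probes, k=4)
```

With this definition recall can only stay flat or rise as k grows, which matches the direction of the k=3 and k=6 results in the log.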

Building an agent that needs cross-session memory? We scope the right architecture in 30 minutes.
