Retrieval-Augmented Structured Memory for Test-Time Learning

Dynamic Ledger

Two extensions to the Dynamic Cheatsheet framework, Strategic Chunk Retrieval and Dynamic Ledger, replace monolithic memory with structured, chunk-level stores and selective curation, achieving accuracy gains of up to 10 pp on math reasoning benchmarks.

Key Contributions

Two new memory architectures extending the Dynamic Cheatsheet framework, evaluated on GPT-4o and GPT-5 across four benchmarks.

Strategic Chunk Retrieval

Replaces the monolithic cheatsheet with a structured, chunk-level memory store of self-contained strategy units, retrieved by content similarity and updated selectively — leaving unrelated strategies untouched.
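The idea of a chunk-level store can be illustrated with a minimal sketch. This is not the project's implementation: the `ChunkStore` class, the toy bag-of-words `embed` function (standing in for a real embedding model), and the example strategies are all hypothetical. It shows retrieval by content similarity and an update that touches exactly one chunk.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ChunkStore:
    """Hypothetical store of self-contained strategy chunks keyed by id."""

    def __init__(self):
        self.chunks = {}  # chunk_id -> strategy text

    def upsert(self, chunk_id, text):
        # Selective update: only the named chunk is written.
        self.chunks[chunk_id] = text

    def retrieve(self, query, k=2):
        # Rank all chunks by similarity to the query and keep the top k.
        q = embed(query)
        ranked = sorted(
            self.chunks.items(),
            key=lambda item: cosine(q, embed(item[1])),
            reverse=True,
        )
        return ranked[:k]


store = ChunkStore()
store.upsert("am-gm", "for inequality with positive reals try the am-gm inequality")
store.upsert("modular", "for remainder questions reduce with modular arithmetic")
store.upsert("casework", "for counting problems split into disjoint cases")

top = store.retrieve("prove an inequality over positive reals", k=1)

# Refine one strategy; the other two chunks are left untouched.
store.upsert("am-gm", "for inequality with positive reals try am-gm and check the equality case")
```

Keying chunks by id is one simple way to make updates local; a monolithic cheatsheet would instead be rewritten as a whole on every update.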

Dynamic Ledger

Reframes the memory store as a lightweight database with explicit CRUD operations and dual-embedding retrieval over both strategy text and source-problem embeddings, recovering relevant entries that strategy-only retrieval misses.
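A minimal sketch of this design follows, under stated assumptions: the `Ledger` and `Entry` names, the toy bag-of-words `embed` function, and the choice to combine the two similarities with `max` are all hypothetical and not taken from the project. It shows the four CRUD operations and a case where matching on the source problem recovers an entry that strategy-only retrieval misses.

```python
import math
from collections import Counter
from dataclasses import dataclass


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Entry:
    strategy: str  # the reusable strategy text
    problem: str   # the problem the strategy was learned from


class Ledger:
    """Hypothetical ledger with explicit create/read/update/delete operations."""

    def __init__(self):
        self.rows = {}
        self.next_id = 0

    def create(self, strategy, problem):
        entry_id = self.next_id
        self.rows[entry_id] = Entry(strategy, problem)
        self.next_id += 1
        return entry_id

    def read(self, entry_id):
        return self.rows[entry_id]

    def update(self, entry_id, strategy):
        self.rows[entry_id].strategy = strategy

    def delete(self, entry_id):
        del self.rows[entry_id]

    def search(self, query, k=1, dual=True):
        # Score each entry by its best match against the strategy text and,
        # when dual=True, the source problem (max is an assumed combination rule).
        q = embed(query)

        def score(entry):
            s = cosine(q, embed(entry.strategy))
            p = cosine(q, embed(entry.problem)) if dual else 0.0
            return max(s, p)

        ranked = sorted(self.rows, key=lambda i: score(self.rows[i]), reverse=True)
        return ranked[:k]


ledger = Ledger()
a = ledger.create(
    "use vieta formulas to relate roots and coefficients",
    "find the sum of the squares of the roots of a cubic polynomial",
)
b = ledger.create(
    "apply inclusion-exclusion over the divisor sets",
    "how many integers below 1000 are divisible by 3 or 5",
)

query = "how many integers below 500 are divisible by 7 or 11"
strategy_only = ledger.search(query, dual=False)  # no strategy text overlaps the query
dual_hit = ledger.search(query, dual=True)        # the source problem of b matches
```

In this toy example the query shares no words with either strategy, so strategy-only search cannot distinguish the entries, while the source-problem embedding surfaces entry `b`.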

Confidence-Weighted Retrieval

Re-ranks retrieved examples by an ensemble-based trust score derived without ground-truth labels, filtering low-confidence examples from the candidate pool.
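One way such a label-free trust score could work is sketched below. The agreement-based score, the `min_trust` threshold, and the product of similarity and trust are assumptions made for illustration; the project's actual scoring rule is not specified here.

```python
from collections import Counter


def trust_score(ensemble_answers):
    """Share of ensemble answers that agree with the majority answer (no labels needed)."""
    counts = Counter(ensemble_answers)
    return counts.most_common(1)[0][1] / len(ensemble_answers)


def rerank(candidates, min_trust=0.5):
    """Filter low-trust examples, then sort the rest by similarity * trust.

    candidates: list of (example_id, similarity, ensemble_answers) tuples.
    """
    scored = []
    for example_id, similarity, answers in candidates:
        trust = trust_score(answers)
        if trust >= min_trust:
            scored.append((example_id, similarity * trust))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = [
    ("ex1", 0.90, ["12", "15", "7", "12", "9"]),    # trust 0.4, filtered out
    ("ex2", 0.80, ["42", "42", "42", "42", "41"]),  # trust 0.8
    ("ex3", 0.70, ["5", "5", "5", "5", "5"]),       # trust 1.0
]
ranked = rerank(candidates)
ranked_ids = [example_id for example_id, _ in ranked]
```

Here the most similar example, `ex1`, is dropped because its ensemble answers disagree, and `ex3` overtakes `ex2` once trust is factored in.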

New Benchmarks

Added IneqMath (competition-style inequality proofs) and DataSIR (sensitive information recognition with 1.6M+ examples) to the Dynamic Cheatsheet (DC) evaluation suite.

Sensitivity Analysis

Oracle-memory experiments confirm that reasoning performance degrades by up to 29.8% under strategy dilution, empirically validating the necessity of selective curation.

Probabilistic Framework

Formal information-theoretic analysis of all DC variants using Fano bounds, rate-distortion theory, and mutual information to characterize when and why structured memory helps.
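For reference, the standard form of Fano's inequality is given below; the report's own formulation is not reproduced here. The mapping of symbols to the memory setting is an assumption made for illustration: $S$ is the strategy a problem requires, drawn uniformly from a finite set $\mathcal{S}$, $M$ is the memory content shown to the model, and $\hat{S}$ is the strategy the model selects from $M$, so that $S \to M \to \hat{S}$ forms a Markov chain.

```latex
P_e \;=\; \Pr\bigl[\hat{S} \neq S\bigr]
\;\ge\; 1 - \frac{I(S; M) + \log 2}{\log |\mathcal{S}|}
```

Under these assumptions, the bound says the error probability cannot be small unless the memory carries enough mutual information about the needed strategy, which is one way to read the claim that diluted memory hurts reasoning.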

Results

Accuracy comparison across benchmarks

AIME 2020–2024 (GPT-4o): 30.8% (+6.0 pp)

Team

Jerry Gu

Stanford University

"Led implementation of SCR and Dynamic Ledger, and project fundraising. Also contributed to experiments, result analysis, dataset adaptation, and writing."

Shurui Liu

Stanford University

"Led experiments, result analysis, codebase management, website, writing, and the probabilistic formalization. Also led IneqMath/DataSIR adaptation and contributed to DL implementation."

Sabrina Yen-Ko

Stanford University

"Led ablation/sensitivity analysis, qualitative result analysis, and confidence score implementation via ensembling. Also contributed to experiments, result analysis, and writing."

Mirac Suzgun

Stanford University