Dynamic Ledger
Two extensions to the Dynamic Cheatsheet (DC) framework, Strategic Chunk Retrieval and Dynamic Ledger, replace monolithic memory with structured, chunk-level stores and selective curation, achieving accuracy gains of up to 10 percentage points on math reasoning benchmarks.
Key Contributions
Two new memory architectures extending the Dynamic Cheatsheet framework, evaluated on GPT-4o and GPT-5 across four benchmarks
Strategic Chunk Retrieval
Replaces the monolithic cheatsheet with a structured, chunk-level memory store of self-contained strategy units, retrieved by content similarity and updated selectively — leaving unrelated strategies untouched.
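To make the chunk-level idea concrete, here is a minimal sketch of a store that retrieves strategy units by content similarity and updates one unit without touching the rest. Everything in it is an illustrative assumption rather than the project's implementation: the `ChunkStore` class, the chunk ids, and the toy bag-of-words `embed` function, which stands in for whatever embedding model is actually used.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ChunkStore:
    """Hypothetical chunk-level memory: one self-contained strategy per chunk id."""

    def __init__(self, chunks=None):
        self.chunks = dict(chunks or {})  # chunk_id -> strategy text

    def retrieve(self, query, k=2):
        """Return the ids of the k chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(
            self.chunks,
            key=lambda cid: cosine(q, embed(self.chunks[cid])),
            reverse=True,
        )
        return ranked[:k]

    def update(self, chunk_id, new_text):
        """Rewrite a single chunk; every other chunk is left as it was."""
        self.chunks[chunk_id] = new_text

store = ChunkStore({
    "am_gm": "apply the am-gm inequality to bound a product by a sum",
    "modular": "reduce the expression modulo a small prime to test divisibility",
    "telescoping": "rewrite the sum so consecutive terms cancel telescoping",
})
top = store.retrieve("bound the product using an inequality", k=1)
store.update("am_gm", "apply am-gm after normalizing the variables to sum to one")
```

In this toy run the query retrieves only the `am_gm` chunk, and the later update rewrites that chunk alone, which is the contrast with regenerating a single monolithic cheatsheet.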
Dynamic Ledger
Reframes the memory store as a lightweight database with explicit CRUD operations and dual-embedding retrieval over both strategy text and source-problem embeddings, recovering relevant entries that strategy-only retrieval misses.
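The sketch below illustrates the ledger idea under stated assumptions: a small in-memory table with create, read, update, and delete operations, plus a search that scores each entry against both its strategy text and its source problem. The `Ledger` and `Entry` names, the toy `embed` function, and the choice to combine the two similarities with `max` are all hypothetical; the source does not say how the two scores are combined.

```python
import math
from collections import Counter
from dataclasses import dataclass

def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class Entry:
    strategy: str  # the reusable strategy text
    problem: str   # the problem the strategy was learned from

class Ledger:
    """Hypothetical ledger: explicit CRUD plus dual-embedding search."""

    def __init__(self):
        self._rows = {}
        self._next_id = 0

    def create(self, strategy, problem):
        eid = self._next_id
        self._rows[eid] = Entry(strategy, problem)
        self._next_id += 1
        return eid

    def read(self, eid):
        return self._rows[eid]

    def update(self, eid, strategy):
        self._rows[eid].strategy = strategy

    def delete(self, eid):
        del self._rows[eid]

    def search(self, query, k=1, dual=True):
        """Rank entries by strategy similarity, optionally also by source-problem similarity."""
        q = embed(query)

        def score(entry):
            s = cosine(q, embed(entry.strategy))
            if dual:
                # Assumed combination rule: take the better of the two similarities.
                s = max(s, cosine(q, embed(entry.problem)))
            return s

        ranked = sorted(self._rows, key=lambda eid: score(self._rows[eid]), reverse=True)
        return ranked[:k]

ledger = Ledger()
square_id = ledger.create(
    strategy="complete square and compare coefficients",
    problem="find the minimum of x^2 + 6x + 10",
)
count_id = ledger.create(
    strategy="count the arrangements with stars and bars",
    problem="distribute 7 identical candies among 3 children",
)
query = "find the minimum of x^2 - 4x + 7"
strategy_only = ledger.search(query, dual=False)
with_problem = ledger.search(query, dual=True)
```

Here the query shares almost no vocabulary with the completing-the-square strategy text, so strategy-only search returns the counting entry. Adding the source-problem similarity recovers the relevant entry, because the new problem closely resembles the problem that strategy came from.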
Confidence-Weighted Retrieval
Re-ranks retrieved examples by an ensemble-based trust score derived without ground-truth labels, filtering low-confidence examples from the candidate pool.
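One label-free way to obtain such a trust score is to measure how strongly an ensemble of sampled answers agrees with itself. The sketch below assumes exactly that: `trust_score` is the fraction of ensemble answers matching the majority answer, and `rerank` drops candidates below a threshold and orders the rest by similarity times trust. The function names, the candidate fields, the 0.5 threshold, and the multiplicative weighting are illustrative assumptions, not the project's documented scoring rule.

```python
from collections import Counter

def trust_score(ensemble_answers):
    """Fraction of ensemble answers that agree with the majority answer (no labels needed)."""
    if not ensemble_answers:
        return 0.0
    _, count = Counter(ensemble_answers).most_common(1)[0]
    return count / len(ensemble_answers)

def rerank(candidates, min_trust=0.5):
    """Drop low-trust candidates, then sort the rest by similarity * trust (assumed weighting)."""
    scored = []
    for cand in candidates:
        trust = trust_score(cand["ensemble_answers"])
        if trust >= min_trust:
            scored.append((cand["similarity"] * trust, cand["id"]))
    return [cid for _, cid in sorted(scored, reverse=True)]

# Hypothetical retrieved examples, each with answers from five ensemble runs.
candidates = [
    {"id": "ex1", "similarity": 0.90, "ensemble_answers": ["12", "15", "9", "12", "7"]},
    {"id": "ex2", "similarity": 0.80, "ensemble_answers": ["4", "4", "4", "4", "5"]},
    {"id": "ex3", "similarity": 0.70, "ensemble_answers": ["3", "3", "3", "3", "3"]},
]
ranked = rerank(candidates)
```

In this example `ex1` is the most similar candidate but its ensemble disagrees with itself (trust 0.4), so it is filtered out, and the fully consistent `ex3` ends up ranked ahead of `ex2`.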
New Benchmarks
Added IneqMath (competition-style inequality proofs) and DataSIR (sensitive information recognition with 1.6M+ examples) to the DC evaluation suite.
Sensitivity Analysis
Oracle-memory experiments confirm that reasoning performance degrades by up to 29.8% under strategy dilution, empirically validating the necessity of selective curation.
Probabilistic Framework
Formal information-theoretic analysis of all DC variants using Fano bounds, rate-distortion theory, and mutual information to characterize when and why structured memory helps.
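For orientation, the standard form of Fano's inequality is given below; it is the textbook statement, not a reproduction of the project's own derivation. The symbols are assumptions made for illustration: $S$ is the item the model must identify (for example, the strategy appropriate to a problem) drawn from a finite set $\mathcal{S}$, $M$ is the memory content supplied to the model, $\hat{S}$ is any estimate of $S$ computed from $M$, and $P_e = \Pr[\hat{S} \neq S]$.

```latex
\begin{align}
H(S \mid M) &\le H_b(P_e) + P_e \log\bigl(|\mathcal{S}| - 1\bigr), \\
P_e &\ge \frac{H(S) - I(S; M) - \log 2}{\log |\mathcal{S}|}.
\end{align}
```

The second line follows from the first using $H(S \mid M) = H(S) - I(S; M)$, $H_b(P_e) \le \log 2$, and $\log(|\mathcal{S}| - 1) \le \log |\mathcal{S}|$. Read this way, the error floor drops only as the mutual information $I(S; M)$ between the memory and the target grows, which is one way to see why memory diluted with irrelevant strategies would be expected to hurt.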
Results
Accuracy comparison across benchmarks
Team
Stanford University
"Led implementation of SCR and Dynamic Ledger, and project fundraising. Also contributed to experiments, result analysis, dataset adaptation, and writing."
Stanford University
"Led experiments, result analysis, codebase management, website, writing, and the probabilistic formalization. Also led IneqMath/DataSIR adaptation and contributed to DL implementation."
Stanford University
"Led ablation/sensitivity analysis, qualitative result analysis, and confidence score implementation via ensembling. Also contributed to experiments, result analysis, and writing."
Stanford University