Dynamic Ledger: Retrieval-Augmented Structured Memory for Test-Time Learning

Our work extends the Dynamic Cheatsheet (DC) framework by Suzgun et al. with two new memory architectures, Strategic Chunk Retrieval (SCR) and Dynamic Ledger (DL), which replace monolithic memory with structured, chunk-level stores and selective curation. Evaluated on GPT-4o and GPT-5 across six benchmarks, Dynamic Ledger achieves accuracy gains of up to 10 percentage points (pp) on math reasoning tasks. See the final report for full details.

Team

Jerry Gu*

Stanford University

Shurui Liu*

Stanford University

Sabrina Yen-Ko*

Stanford University

Mirac Suzgun

Stanford University

*Equal contribution. Mentored by Mirac Suzgun.


Installation

git clone https://github.com/srliu3264/dynamic_ledger.git
cd dynamic_ledger
pip install -r requirements.txt
cp config.env.example config.env  # add your API keys
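
The last step above copies a template and asks you to fill in API keys. As a rough illustration of how such a file could be read, here is a minimal sketch that assumes `config.env` holds plain `KEY=VALUE` lines with optional `#` comments. The file layout, the helper name `load_env_file`, and any key names are assumptions for illustration, not taken from the repository.

```python
def load_env_file(path: str) -> dict:
    """Parse KEY=VALUE lines from a file, skipping blank lines and # comments.

    Hypothetical helper: the repository may load config.env differently.
    """
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Ignore blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

Calling `load_env_file("config.env")` returns a dictionary you can inspect to confirm that the keys you need are present before launching a run.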

Supported Approaches

| Approach | Description |
| --- | --- |
| default | No cheatsheet; single-pass generation |
| DynamicCheatsheet_Cumulative | Append-only flat text cheatsheet (original DC) |
| DynamicCheatsheet_RetrievalSynthesis | Retrieve past examples, synthesize a query-specific cheatsheet |
| Dynamic_Retrieval | Retrieve top-k chunks, no curation step |
| FullHistoryAppending | Full conversation history appended as context |
| DynamicCheatsheet_StrategicChunkRetrieval | [NEW] Retrieve top-k strategy chunks; curator refines only those chunks |
| DynamicCheatsheet_DynamicLedger | [NEW] Dynamic Ledger: structured JSON store with per-entry CRUD updates |
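
To make the two new approaches more concrete, the sketch below shows one way a JSON-serializable store with per-entry create, read, update, and delete operations could look, together with a top-k lookup. It is an illustration under stated assumptions, not the repository's implementation: the class name `Ledger`, the entry id format, and the Jaccard word-overlap score (a simple stand-in for whatever similarity measure the real retriever uses) are all hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical in-memory ledger mapping entry ids to strategy text."""
    entries: dict = field(default_factory=dict)
    next_id: int = 1

    def create(self, text: str) -> str:
        entry_id = f"e{self.next_id}"
        self.next_id += 1
        self.entries[entry_id] = text
        return entry_id

    def read(self, entry_id: str) -> str:
        return self.entries[entry_id]

    def update(self, entry_id: str, text: str) -> None:
        if entry_id not in self.entries:
            raise KeyError(entry_id)
        self.entries[entry_id] = text

    def delete(self, entry_id: str) -> None:
        del self.entries[entry_id]

    def top_k(self, query: str, k: int = 2) -> list:
        """Return ids of the k entries with the highest word overlap with the query."""
        query_words = set(query.lower().split())

        def score(item):
            words = set(item[1].lower().split())
            union = query_words | words
            # Jaccard similarity: assumed stand-in for an embedding-based score.
            return len(query_words & words) / len(union) if union else 0.0

        ranked = sorted(self.entries.items(), key=score, reverse=True)
        return [entry_id for entry_id, _ in ranked[:k]]

    def to_json(self) -> str:
        return json.dumps(self.entries, indent=2, sort_keys=True)


ledger = Ledger()
first = ledger.create("apply AM-GM to bound a product by a sum")
second = ledger.create("balance chemical equations by counting atoms")
third = ledger.create("use Cauchy-Schwarz for sums of squares")
ledger.update(first, "apply AM-GM to bound a product of positive terms by a sum")
ledger.delete(second)
print(ledger.top_k("bound a product with AM-GM", k=1))  # ['e1']
```

The point of per-entry operations is that a curator can revise or drop a single entry without rewriting the whole memory, which is the contrast with the append-only flat cheatsheet in the table above.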

Supported Models

openai/gpt-4o, openai/gpt-4o-mini, openai/gpt-3.5-turbo
openai/gpt-5-2025-08-07
openai/o1, openai/o3-mini
anthropic/claude-3-5-sonnet-latest, anthropic/claude-3-7-sonnet-latest
anthropic/claude-3-5-haiku-latest
xai/grok-3, xai/grok-3-mini, xai/grok-4-fast-non-reasoning
together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo
together_ai/deepseek-ai/DeepSeek-R1, together_ai/Qwen/QwQ-32B
gemini/gemini-2.0-flash
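
The identifiers above share a `provider/model` shape, and some model names contain further slashes (for example `together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo`). A minimal sketch of splitting such a string on the first slash only is shown below. The helper name `split_model_id` is hypothetical, and the repository may resolve model names in a different way.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split 'provider/model' on the first slash; the model part may contain slashes.

    Hypothetical helper for illustration only.
    """
    provider, separator, name = model_id.partition("/")
    if not separator or not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model_id!r}")
    return provider, name


print(split_model_id("together_ai/deepseek-ai/DeepSeek-R1"))
# ('together_ai', 'deepseek-ai/DeepSeek-R1')
```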

Supported Benchmarks

| Task | Description | Size |
| --- | --- | --- |
| IneqMath_all | Competition-style inequality problems (train + dev merged) | 1,352 |
| IneqMath | IneqMath dev split only | 100 |
| DataSIR | Full sensitive information recognition dataset | 1,647,501 |
| DataSIR400 | DataSIR 400-problem subset used in our evaluation | 400 |
| AIME_2025 | AIME 2025 problems | varies |
| AIME_2024 | AIME 2024 problems | varies |
| AIME_2020_2024 | AIME 2020–2024 problems | varies |
| GPQA_Diamond | Graduate-level science QA | varies |
| MMLU_Pro_Physics | MMLU-Pro Physics subset | 1,299 |
| MMLU_Pro_Engineering | MMLU-Pro Engineering subset | 969 |
| MathEquationBalancer | Equation balancing task | varies |
| GameOf24 | Game of 24 | varies |