Dynamic Ledger: Retrieval-Augmented Structured Memory for Test-Time Learning
Our work extends the Dynamic Cheatsheet (DC) framework by Suzgun et al. with two new memory architectures, Strategic Chunk Retrieval (SCR) and Dynamic Ledger (DL). Both replace DC's monolithic memory with structured, chunk-level stores and selective curation. Evaluated with GPT-4o and GPT-5 on 6 benchmarks, Dynamic Ledger improves accuracy on math reasoning tasks by up to 10 percentage points (pp). See the final report for full details.
Team
*Equal contribution. Mentored by Mirac Suzgun.
Installation
git clone https://github.com/srliu3264/dynamic_ledger.git
cd dynamic_ledger
pip install -r requirements.txt
cp config.env.example config.env # add your API keys
Supported Approaches
| Approach | Description |
|---|---|
| default | No cheatsheet; single-pass generation |
| DynamicCheatsheet_Cumulative | Append-only flat text cheatsheet (original DC) |
| DynamicCheatsheet_RetrievalSynthesis | Retrieve past examples, synthesize a query-specific cheatsheet |
| Dynamic_Retrieval | Retrieve top-k chunks, no curation step |
| FullHistoryAppending | Full conversation history appended as context |
| DynamicCheatsheet_StrategicChunkRetrieval | [NEW] Retrieve top-k strategy chunks; curator refines only those chunks |
| DynamicCheatsheet_DynamicLedger | [NEW] Dynamic Ledger: structured JSON store with per-entry CRUD updates |
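To make the "structured JSON store with per-entry CRUD updates" and "retrieve top-k chunks" ideas concrete, here is a minimal Python sketch. It is an illustration only, not the repository's implementation: the `Ledger` class, its method names, the entry fields, and the word-overlap scoring are all hypothetical, and a real retriever would more likely rank chunks by embedding similarity.

```python
import json


class Ledger:
    """Hypothetical in-memory ledger mapping an entry id to a JSON-serializable record."""

    def __init__(self):
        self.entries = {}
        self._next_id = 1

    def create(self, text):
        """Add a new entry and return its id."""
        entry_id = f"e{self._next_id}"
        self._next_id += 1
        self.entries[entry_id] = {"text": text}
        return entry_id

    def read(self, entry_id):
        """Return the record stored under entry_id."""
        return self.entries[entry_id]

    def update(self, entry_id, text):
        """Rewrite a single entry without touching the others."""
        self.entries[entry_id]["text"] = text

    def delete(self, entry_id):
        """Drop a single entry."""
        del self.entries[entry_id]

    def top_k(self, query, k=2):
        """Return the ids of the k entries sharing the most words with the query.

        Word overlap is an assumed stand-in for a real similarity measure.
        """
        query_words = set(query.lower().split())

        def overlap(item):
            return len(query_words & set(item[1]["text"].lower().split()))

        ranked = sorted(self.entries.items(), key=overlap, reverse=True)
        return [entry_id for entry_id, _ in ranked[:k]]

    def to_json(self):
        """Serialize the whole ledger as JSON text."""
        return json.dumps(self.entries, indent=2, sort_keys=True)


ledger = Ledger()
a = ledger.create("Use AM-GM for symmetric inequality bounds")
b = ledger.create("Check units before comparing physical quantities")
c = ledger.create("Try Cauchy-Schwarz when sums of squares appear in an inequality")

# A curator step could then edit individual entries instead of rewriting everything.
ledger.update(a, "Use AM-GM for symmetric inequality bounds; check the equality case")
ledger.delete(b)

hits = ledger.top_k("prove this symmetric inequality", k=1)
print(hits)
print(ledger.to_json())
```

The point of the per-entry operations is that a curation step can touch only the retrieved entries, in contrast to the append-only flat text of `DynamicCheatsheet_Cumulative`.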
Supported Models
openai/gpt-4o, openai/gpt-4o-mini, openai/gpt-3.5-turbo
openai/gpt-5-2025-08-07
openai/o1, openai/o3-mini
anthropic/claude-3-5-sonnet-latest, anthropic/claude-3-7-sonnet-latest
anthropic/claude-3-5-haiku-latest
xai/grok-3, xai/grok-3-mini, xai/grok-4-fast-non-reasoning
together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo
together_ai/deepseek-ai/DeepSeek-R1, together_ai/Qwen/QwQ-32B
gemini/gemini-2.0-flash
Supported Benchmarks
| Task | Description | Size |
|---|---|---|
| IneqMath_all | Competition-style inequality problems (train + dev merged) | 1,352 |
| IneqMath | IneqMath dev split only | 100 |
| DataSIR | Full sensitive information recognition dataset | 1,647,501 |
| DataSIR400 | DataSIR 400-problem subset used in our evaluation | 400 |
| AIME_2025 | AIME 2025 problems | varies |
| AIME_2024 | AIME 2024 problems | varies |
| AIME_2020_2024 | AIME 2020–2024 problems | varies |
| GPQA_Diamond | Graduate-level science QA | varies |
| MMLU_Pro_Physics | MMLU-Pro Physics subset | 1,299 |
| MMLU_Pro_Engineering | MMLU-Pro Engineering subset | 969 |
| MathEquationBalancer | Equation balancing task | varies |
| GameOf24 | Game of 24 | varies |