Results
Experimental Results
System Illustration
Main Results
We evaluate on four benchmarks spanning different reasoning modalities. The first three benchmarks use GPT-4o; DataSIR uses GPT-5. Dynamic Ledger (DC-DL) achieves the highest accuracy on all four tasks.
| Method | AIME (133, 4o) | IneqMath (100, 4o) | MathEqBal (250, 4o) | DataSIR400 (400, 5) |
|---|---|---|---|---|
| Baseline / Default | 9.8% | 48.0% | 47.2% | 87.0% |
| EmptyCheatsheet | 24.1% | — | 83.2% | — |
| DC-Cu | 18.0% | 47.0% | 94.4% | 84.0% |
| FullHistoryAppend | — | — | — | 88.0% |
| Dynamic Retrieval | 24.1% | — | 94.4% | — |
| DC-RS | 24.8% | 47.0% | 94.0% | 87.0% |
| DC-SCR (ours) | 28.2% | 53.0% | 100.0% | 75.0% |
| DC-SCR p=0.8 (ours) | 20.6% | 49.0% | — | — |
| DC-DL (ours) | 30.8% | 58.0% | 100.0% | 91.0% |
Benchmark Summaries
AIME 2020–2024 (GPT-4o)
DC-DL achieves 30.8%, a 3.1× improvement over the stateless Baseline (9.8%) and a 6.0 pp gain over DC-RS (24.8%). DC-SCR (28.2%) also surpasses all baselines. The probability-threshold variant DC-SCR p=0.8 (20.6%) underperforms the top-k variant, suggesting that adaptive cardinality introduces too many marginally relevant chunks.
IneqMath (GPT-4o)
DC-DL leads at 58.0%, a 10 pp gain over the Default baseline (48.0%). DC-Cu and DC-RS both underperform the Default (47.0% each), indicating that monolithic curation offers no benefit on inequality proof tasks. The strong performance of chunk-based methods suggests that fine-grained strategy retrieval is particularly advantageous when problems require diverse, specialized proof techniques.
MathEquationBalancer (GPT-4o)
Both DC-SCR and DC-DL achieve perfect accuracy (100.0%), a dramatic improvement over the 47.2% Baseline. The large jump from Baseline to EmptyCheatsheet (83.2%) indicates that even a minimal structured prompt substantially helps. Prior DC variants plateau around 94%, unable to close the remaining gap.
DataSIR (GPT-5)
DC-DL achieves 91.0%, outperforming all methods. However, DC-SCR drops to 75.0%, well below the Default (87.0%). This reversal highlights the importance of dual-embedding retrieval: when strategy text alone does not capture structural similarity between problems, the problem-embedding channel in DC-DL compensates.
Cumulative Learning Curves
These plots show cumulative accuracy as a function of problems seen, revealing the test-time learning dynamics of each method. DC-DL consistently maintains the highest cumulative accuracy after an initial learning phase.
AIME 2020–2024
The Baseline flatlines near 10% while DC-DL and DC-SCR separate from DC baselines around problem 40 and maintain a widening gap.
IneqMath
DC-Cu starts strong but gradually converges toward the Default as curation noise accumulates, while DC-DL maintains its advantage throughout the sequence.
MathEquationBalancer
DC-SCR and DC-DL reach near-perfect accuracy within the first 20 problems, while DC-Cu and DC-RS plateau near ~94% and the Baseline stalls at ~47%.
DataSIR
DC-SCR remains below all methods throughout, while DC-DL climbs to the top by problem 30, confirming that dual-embedding retrieval is decisive when strategy text alone is a poor retrieval signal.
Sensitivity Analysis: Strategy Dilution
To isolate the impact of contextual interference, we conduct a controlled sensitivity analysis on AIME 2021–2025 using an oracle retrieval protocol: the generator is provided with exactly one "gold" memory item (the correct strategy), plus a varying number of distractor strategies from unrelated problems.
| Configuration | Distractors (n) | Accuracy (%) | Relative Decay |
|---|---|---|---|
| Oracle Strategy | 0 | 22.5% | — |
| Low Distraction | 10 | 19.2% | −14.7% |
| High Distraction | 50 | 15.8% | −29.8% |
Accuracy exhibits consistent decay as distractors increase, establishing a concrete empirical upper bound on distractor tolerance and validating the necessity of selective curation.
Limitation Analysis: MMLU-Pro
To probe the limits of our approach, we evaluate on two MMLU-Pro subsets — Engineering and Physics — each containing 250 multiple-choice questions drawn from graduate-level professional exams. Unlike AIME or IneqMath, these questions primarily test domain knowledge and factual recall rather than transferable problem-solving strategies.
| Method | Eng. (GPT-4o) | Eng. (GPT-5) | Phys. (GPT-4o) | Phys. (GPT-5) |
|---|---|---|---|---|
| Default / Baseline | 53.6% | 64.8% | 76.0% | 83.2% |
| EmptyCheatsheet | 52.8% | — | 75.2% | — |
| DC-Cu | 46.1% | 63.6% | 76.0% | 85.2% |
| FullHistoryAppend | — | 72.0% | — | 89.6% |
| Dynamic Retrieval | 48.8% | 72.0% | — | 89.6% |
| DC-RS | 51.6% | 72.4% | 75.6% | 89.2% |
| DC-SCR (ours) | 53.6% | 66.8% | 73.2%† | 82.8%† |
| DC-DL (ours) | 51.6% | 67.2% | 72.0%† | 85.2% |
† Below the Default baseline.
MMLU-Pro Engineering (GPT-5)
MMLU-Pro Physics (GPT-5)
Key Takeaways
- Our methods do not outperform DC baselines on knowledge-recall tasks. DC-RS reaches 72.4% on GPT-5 Engineering while DC-DL achieves only 67.2%.
- Simple retrieval methods dominate. FullHistoryAppend and Dynamic Retrieval (72.0% / 89.6% on GPT-5) outperform all structured-memory methods, suggesting that raw past examples help more than distilled strategies on factual tasks.
- Memory overhead without accuracy gains. DC-SCR accumulates over 1,200 KB of memory and DC-DL over 700 KB, while DC baselines remain below 50 KB — all for no accuracy benefit.
These results confirm that Dynamic Ledger is most effective when the task distribution contains recurring, transferable problem-solving patterns. Adapting the framework to knowledge-centric tasks remains an open direction.