Experimental Results

System Illustration

[Figure: Dynamic Ledger pipeline overview]

Main Results

We evaluate on four benchmarks spanning different reasoning modalities. The first three benchmarks use GPT-4o; DataSIR uses GPT-5. Dynamic Ledger (DC-DL) achieves the highest accuracy on all four tasks.

| Method | AIME 2020–2024 (n=133, GPT-4o) | IneqMath (n=100, GPT-4o) | MathEqBal (n=250, GPT-4o) | DataSIR400 (n=400, GPT-5) |
|---|---|---|---|---|
| Baseline / Default | 9.8% | 48.0% | 47.2% | 87.0% |
| EmptyCheatsheet | 24.1% | – | 83.2% | – |
| DC-Cu | 18.0% | 47.0% | 94.4% | 84.0% |
| FullHistoryAppend | – | – | – | 88.0% |
| Dynamic Retrieval | 24.1% | – | 94.4% | – |
| DC-RS | 24.8% | 47.0% | 94.0% | 87.0% |
| DC-SCR (ours) | 28.2% | 53.0% | 100.0% | 75.0% |
| DC-SCR p=0.8 (ours) | 20.6% | 49.0% | – | – |
| DC-DL (ours) | 30.8% | 58.0% | 100.0% | 91.0% |

Benchmark Summaries

AIME 2020–2024 (GPT-4o)

DC-DL achieves 30.8%, a 3.1× improvement over the stateless Baseline (9.8%) and a 6.0 pp gain over DC-RS (24.8%). DC-SCR (28.2%) also surpasses all baselines. The probability-threshold variant DC-SCR p=0.8 (20.6%) underperforms the top-k variant, suggesting that adaptive cardinality introduces too many marginally relevant chunks.
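
The contrast between fixed top-k retrieval and the p=0.8 variant can be sketched as follows. This is a minimal illustration: the softmax normalization and the cumulative-mass cutoff in `select_top_p` are assumptions about how an adaptive-cardinality selector would work, not the paper's implementation.

```python
import math

def softmax(scores):
    """Convert raw relevance scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def select_top_k(scores, k=3):
    """Fixed-cardinality retrieval: always take the k highest-scoring chunks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:k]

def select_top_p(scores, p=0.8):
    """Adaptive-cardinality retrieval: take chunks (best first) until their
    softmax mass reaches p. Flat score distributions pull in many
    marginally relevant chunks."""
    probs = softmax(scores)
    order = sorted(range(len(scores)), key=lambda i: -probs[i])
    picked, mass = [], 0.0
    for i in order:
        picked.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return picked

peaked = [5.0, 1.0, 0.5, 0.2, 0.1]   # one clearly relevant chunk
flat = [1.0, 0.9, 0.95, 0.85, 0.8]   # many marginally relevant chunks
print(select_top_p(peaked))           # selects only the dominant chunk
print(len(select_top_p(flat)))        # keeps adding chunks to reach mass p
```

On a flat score distribution the cumulative-mass rule keeps adding chunks until it reaches p, which mirrors the dilution effect hypothesized above, whereas top-k caps the context at a fixed size regardless of how relevant the tail is.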

[Figure: AIME 2020–2024 GPT-4o summary]

IneqMath (GPT-4o)

DC-DL leads at 58.0%, a 10 pp gain over the Default baseline (48.0%). DC-Cu and DC-RS both underperform the Default (47.0% each), indicating that monolithic curation offers no benefit on inequality proof tasks. The strong performance of chunk-based methods suggests that fine-grained strategy retrieval is particularly advantageous when problems require diverse, specialized proof techniques.
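
The distinction between monolithic curation and chunk-based retrieval can be sketched as follows. This is a toy illustration: the token-overlap relevance function stands in for whatever embedding similarity a real system would use, and the strategy texts are invented examples.

```python
def monolithic_context(cheatsheet: str) -> str:
    """DC-Cu-style: inject the entire curated cheatsheet for every problem."""
    return cheatsheet

def chunked_context(chunks, relevance, query, k=2):
    """Chunk-based: score each stored strategy against the query and
    inject only the k most relevant ones."""
    ranked = sorted(chunks, key=lambda c: relevance(query, c), reverse=True)
    return "\n\n".join(ranked[:k])

def overlap(query, chunk):
    """Toy relevance: shared lowercase tokens (an embedding stand-in)."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c)

chunks = [
    "AM-GM: bound products of positives by their arithmetic mean.",
    "Cauchy-Schwarz: bound sums of products by norms.",
    "Telescoping: collapse sums whose terms cancel pairwise.",
]
# Only the single most relevant proof technique reaches the prompt.
print(chunked_context(chunks, overlap, "bound sums of products", k=1))
```

The point of the sketch is the interface: the monolithic variant pays a fixed context cost for every stored strategy, while the chunked variant's cost scales with k, not with the size of the memory.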

[Figure: IneqMath GPT-4o summary]

MathEquationBalancer (GPT-4o)

Both DC-SCR and DC-DL achieve perfect accuracy (100.0%), a dramatic improvement over the 47.2% Baseline. The large jump from Baseline to EmptyCheatsheet (83.2%) indicates that even a minimal structured prompt substantially helps. Prior DC variants plateau around 94%, unable to close the remaining gap.

[Figure: MathEquationBalancer GPT-4o summary]

DataSIR (GPT-5)

DC-DL achieves 91.0%, outperforming all methods. However, DC-SCR drops to 75.0%, well below the Default (87.0%). This reversal highlights the importance of dual-embedding retrieval: when strategy text alone does not capture structural similarity between problems, the problem-embedding channel in DC-DL compensates.
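
A minimal sketch of the dual-embedding idea, assuming each memory item stores both a problem embedding and a strategy embedding, and that a hypothetical weight `alpha` blends the two similarity channels (the embeddings and weight here are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dual_score(query_emb, item, alpha=0.5):
    """Blend similarity to the stored problem embedding with similarity
    to the stored strategy embedding; alpha is a hypothetical mixing weight."""
    s_problem = cosine(query_emb, item["problem_emb"])
    s_strategy = cosine(query_emb, item["strategy_emb"])
    return alpha * s_problem + (1 - alpha) * s_strategy

memory = [
    {"name": "ledger-A", "problem_emb": [1.0, 0.0], "strategy_emb": [0.0, 1.0]},
    {"name": "ledger-B", "problem_emb": [0.6, 0.8], "strategy_emb": [0.8, 0.6]},
]
query = [1.0, 0.2]
best = max(memory, key=lambda it: dual_score(query, it))
print(best["name"])
```

When strategy text is a weak signal, the problem-embedding term can still surface a structurally similar item, which is the compensation mechanism described above.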

[Figure: DataSIR GPT-5 summary]

Cumulative Learning Curves

These plots show cumulative accuracy as a function of problems seen, revealing the test-time learning dynamics of each method. DC-DL consistently maintains the highest cumulative accuracy after an initial learning phase.
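
Concretely, the quantity plotted on these curves can be computed as follows (a straightforward sketch):

```python
def cumulative_accuracy(outcomes):
    """outcomes[i] is 1 if problem i was solved, else 0.
    Returns the cumulative accuracy after each problem, i.e. the
    value plotted on the learning curves."""
    curve, correct = [], 0
    for i, ok in enumerate(outcomes, start=1):
        correct += ok
        curve.append(correct / i)
    return curve

print(cumulative_accuracy([0, 1, 1, 0, 1]))
```

Because each point averages over all problems seen so far, early mistakes weigh heavily at the start and the curves stabilize as the denominator grows, which is why the "initial learning phase" appears as high variance in the first few dozen problems.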

AIME 2020–2024

The Baseline flatlines near 10% while DC-DL and DC-SCR separate from DC baselines around problem 40 and maintain a widening gap.

[Figure: AIME 2020–2024 accuracy curve]

IneqMath

DC-Cu starts strong but gradually converges toward the Default as curation noise accumulates, while DC-DL maintains its advantage throughout the sequence.

[Figure: IneqMath accuracy curve]

MathEquationBalancer

DC-SCR and DC-DL reach near-perfect accuracy within the first 20 problems, while DC-Cu and DC-RS plateau at ~94% and the Baseline stalls at ~47%.

[Figure: MathEquationBalancer accuracy curve]

DataSIR

DC-SCR trails all other methods throughout, while DC-DL climbs to the top by problem 30, confirming that dual-embedding retrieval is decisive when strategy text alone is a poor retrieval signal.

[Figure: DataSIR accuracy curve]

Sensitivity Analysis: Strategy Dilution

To isolate the impact of contextual interference, we conduct a controlled sensitivity analysis on AIME 2021–2025 using an oracle retrieval protocol: the generator is provided with exactly one "gold" memory item (the correct strategy), plus a varying number of distractor strategies from unrelated problems.

| Configuration | Distractors (n) | Accuracy (%) | Relative Decay |
|---|---|---|---|
| Oracle Strategy | 0 | 22.5% | – |
| Low Distraction | 10 | 19.2% | −14.7% |
| High Distraction | 50 | 15.8% | −29.8% |

Accuracy decays steadily as distractors increase, quantifying how quickly contextual interference erodes performance even when the correct strategy is guaranteed to be present, and validating the necessity of selective curation.
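
The oracle protocol can be sketched as follows; `build_oracle_context`, the seeding, and the shuffle are illustrative assumptions, not the exact experimental harness:

```python
import random

def build_oracle_context(gold_item, distractor_pool, n_distractors, seed=0):
    """Oracle retrieval protocol: exactly one gold strategy plus
    n_distractors strategies from unrelated problems, shuffled so that
    position carries no signal about which item is gold."""
    rng = random.Random(seed)
    distractors = rng.sample(distractor_pool, n_distractors)
    items = [gold_item] + distractors
    rng.shuffle(items)
    return items

pool = [f"distractor strategy {i}" for i in range(100)]
ctx = build_oracle_context("gold strategy", pool, n_distractors=10)
print(len(ctx), "gold strategy" in ctx)
```

Sweeping `n_distractors` over {0, 10, 50} while holding the gold item fixed isolates dilution from retrieval quality, since retrieval is perfect by construction.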


Limitation Analysis: MMLU-Pro

To probe the limits of our approach, we evaluate on two MMLU-Pro subsets — Engineering and Physics — each containing 250 multiple-choice questions drawn from graduate-level professional exams. Unlike AIME or IneqMath, these questions primarily test domain knowledge and factual recall rather than transferable problem-solving strategies.

| Method | Eng. (GPT-4o) | Eng. (GPT-5) | Phys. (GPT-4o) | Phys. (GPT-5) |
|---|---|---|---|---|
| Default / Baseline | 53.6% | 64.8% | 76.0% | 83.2% |
| EmptyCheatsheet | 52.8% | – | 75.2% | – |
| DC-Cu | 46.1% | 63.6% | 76.0% | 85.2% |
| FullHistoryAppend | – | 72.0% | – | 89.6% |
| Dynamic Retrieval | 48.8% | 72.0% | – | 89.6% |
| DC-RS | 51.6% | 72.4% | 75.6% | 89.2% |
| DC-SCR (ours) | 53.6% | 66.8% | 73.2%† | 82.8%† |
| DC-DL (ours) | 51.6% | 67.2% | 72.0%† | 85.2% |

† Below the Default baseline.

MMLU-Pro Engineering (GPT-5)

[Figures: MMLU-Pro Engineering summary, accuracy curve, and memory cost]

MMLU-Pro Physics (GPT-5)

[Figures: MMLU-Pro Physics summary, accuracy curve, and memory cost]

Key Takeaways

  • Our methods do not outperform DC baselines on knowledge-recall tasks. DC-RS reaches 72.4% on GPT-5 Engineering while DC-DL achieves only 67.2%.
  • Simple retrieval methods dominate. FullHistoryAppend and Dynamic Retrieval (72.0% / 89.6% on GPT-5) outperform all structured-memory methods, suggesting that raw past examples help more than distilled strategies on factual tasks.
  • Memory overhead without accuracy gains. DC-SCR accumulates over 1,200 KB of memory and DC-DL over 700 KB, while DC baselines remain below 50 KB — all for no accuracy benefit.
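
For contrast with the distilled-strategy methods, the FullHistoryAppend baseline's context construction can be approximated as follows (a sketch under the assumption that it concatenates raw past question/answer pairs, not the authors' code):

```python
def full_history_context(history, max_items=None):
    """FullHistoryAppend-style baseline: append raw past Q/A pairs to the
    prompt instead of distilled strategies. max_items optionally caps the
    window to the most recent items."""
    items = history if max_items is None else history[-max_items:]
    return "\n\n".join(f"Q: {q}\nA: {a}" for q, a in items)

history = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
print(full_history_context(history))
```

On knowledge-recall tasks a past Q/A pair may literally contain the needed fact, which is consistent with raw examples beating distilled strategies here; on AIME-style tasks the same pairs are rarely reusable verbatim.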

These results confirm that Dynamic Ledger is most effective when the task distribution contains recurring, transferable problem-solving patterns. Adapting the framework to knowledge-centric tasks remains an open direction.