Kuditipudi, Huang, Zhu et al. (2025) provide an extremely effective test for whether one language model M1 is a derivative of another language model M2. The basic form of the test works like this: measure the correlation between the training data order for M1 and the likelihood (logprobs) that M2 assigns to those training examples.
Strikingly, you don’t need to use the entire training data order (which would be expensive); the test works even if M2 has been modified in numerous ways (e.g., fine-tuning, model souping); you don’t need the exact logprobs from M2 (estimating them from text samples suffices); and a variant of the test uses only texts generated by M2 (no requirement that we can run M2 ourselves). These tests do not require you to keep any information about M1 private, or to mess with M1 or its training data. Overall, then, this is a powerful, lightweight way of tracking model provenance, detecting tampering, and spotting theft.
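Before retracing how the authors got there, here is a toy sketch of the core statistical idea (mine, with synthetic numbers, not the paper's actual procedure): if a model's logprobs for training examples carry even a faint trace of how recently each example was seen, a rank correlation will detect it.

```python
# Toy illustration with synthetic numbers: a faint "recency" signal in the
# logprobs is enough for a rank-correlation test to recover the training order.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 5000
order = np.arange(n)  # position of each example in M1's training order

# Hypothetical logprobs from M2: mostly noise, plus a tiny boost for later examples.
logprobs = -50.0 + 0.0005 * order + rng.normal(scale=5.0, size=n)

rho, p = spearmanr(order, logprobs)
print(f"Spearman rho = {rho:.3f}, p = {p:.2e}")  # tiny p-value despite the weak signal
```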
Fascinating… but how on earth did the authors come up with this? You just correlate the training data order with the logprobs, and this tells you about model provenance? Okay, but why?
I am an author on Kuditipudi, Huang, Zhu et al. 2025. The test was initially presented to me as a finding that the lead authors (Rohith, Jing, and Sally; henceforth RJS) had already basically achieved. They assured me that this was all a natural consequence of prior work (theirs and others’) on how LMs memorize, but they always reviewed this very quickly. I am flattered that they thought I could follow their exegesis.
The goal of this post is to unpack their argument – to show that the necessary ingredients for proposing this test were basically present in the existing literature already, and to identify the places where RJS must still have taken a leap of faith. This is meant to be a rational reconstruction of whatever combination of deep reading and cosmic inspiration actually led RJS to their proposal.
I’ll refer to the family of tests from Kuditipudi, Huang, Zhu et al. 2025 as “palimpsestic tests”. A palimpsest is a piece of material that has been written on and erased many times, with the effect that the earlier texts remain visible underneath the later ones. It turns out that LMs are like palimpsests with respect to their training data examples – each example is etched into the model, with earlier examples fainter than later ones. The evidence reviewed below is in large part evidence for this observation.
The relationship between memorization and logprobs
Carlini et al. 2019 is a pioneering study of LM memorization. The authors observe, among many other things, that memorized sequences have lower log perplexity, i.e., higher average logprobs – see their Table 1.
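To keep the terminology straight: log perplexity is just the negative of the average per-token logprob, so “lower log perplexity” and “higher average logprobs” are two descriptions of the same quantity. Here is a minimal sketch of the bookkeeping, assuming the per-token logprobs are already in hand (the numbers are made up for illustration):

```python
import math

def avg_logprob(token_logprobs):
    """Average per-token log probability of a sequence."""
    return sum(token_logprobs) / len(token_logprobs)

def log_perplexity(token_logprobs):
    """Log perplexity = mean negative log-likelihood = -(average logprob)."""
    return -avg_logprob(token_logprobs)

# Hypothetical per-token logprobs for a memorized vs. a non-memorized sequence:
memorized = [-0.10, -0.20, -0.05, -0.10]
novel     = [-2.30, -3.10, -1.80, -2.60]
print(math.exp(log_perplexity(memorized)))  # perplexity ~1.1: low, as for memorized text
print(math.exp(log_perplexity(novel)))      # perplexity ~11.6: much higher
```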
For better or worse, almost all subsequent work on memorization has adopted discrete notions of memorization, many of them based on the definitions in Carlini et al. 2023. For example, we can ask whether a given prefix string p produces a specific continuation s using our chosen sampling procedure, where ps is in the training data.
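For concreteness, here is a sketch of one such discrete check (my own minimal version, not the exact protocol from any of these papers), assuming an off-the-shelf causal LM and greedy decoding:

```python
# Does greedy decoding from a training-data prefix reproduce the true continuation?
# The model choice and tokenization details here are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # any causal LM will do for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def is_memorized(prefix: str, continuation: str) -> bool:
    """True if greedy decoding from `prefix` exactly reproduces `continuation`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(continuation, add_special_tokens=False).input_ids
    with torch.no_grad():
        out = model.generate(
            prefix_ids,
            max_new_tokens=len(target_ids),
            do_sample=False,  # greedy decoding; a different sampler might disagree
        )
    generated = out[0, prefix_ids.shape[1]:].tolist()
    return generated == target_ids
```

Note that the verdict depends on the decoding choice hard-coded here, which is exactly the kind of brittleness discussed next.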
Such measures relate directly to intuitive ideas people have about memorization, and so they feel natural. However, they will depend on our sampling procedure (greedy decoding could say “memorized” while top-k sampling says “not memorized”; see Hayes et al. 2025). More seriously, they will hide parts of the evidence, by treating near misses and complete misses identically. In addition, these measures introduce some gaps in the path to the palimpsestic tests, since we have to guess about the nature of the logprobs behind the discrete measures we see in the literature.
Luckily, Prashanth et al. 2024 did not take this connection for granted. In their Figure 2, they enhance Carlini et al.’s (2019) evidence by showing that memorized strings have lower prompt/prefix perplexity, lower continuation perplexity, lower overall perplexity, and lower loss:
This evidence shows that memorization generally implies higher logprobs, which helps with the step/leap from discrete measures to logprobs. Note, though, that the evidence does not show the converse, i.e., that higher logprobs implies memorization. Intuitively, it seems like many non-memorized strings could have high logprobs, and Prashanth et al.’s (2024) results just above indicate that this is indeed the case (the “Memorized” and “Not memorized” areas overlap a lot).
In the next section, I will encourage us to think expansively about what counts as memorization, which makes this logical nicety less important. However, you needn’t be as relaxed as I am about this. All we need to continue our journey to the palimpsestic tests is the following potentially very loose heuristic:
Memorization correlates with higher logprobs.
There are many kinds of memorization
We are accustomed to thinking of memorization as verbatim memorization: you can either perfectly recite the lyrics to “Sparks fly” by Waxahatchee, or you can’t; the LM either reproduces Section 3 of Walt Whitman’s “Song of the open road”, or it doesn’t.
I would argue that our standards for memorization should be more relaxed. Even people focused on issues of copyright should accept lower standards, since minor mistakes or perturbations to a text will not necessarily get you out of hot water in the context of copyright law.
There are also different extremes when it comes to (perfect or imperfect) memorization. Rote memorization requires a brute-force effort. If you know the first 10 digits of pi by heart, it is probably because you simply learned the sequence as a primitive – no shortcuts, no higher-level patterns to leverage.
Memorization can also involve what Prashanth et al. 2024 call reconstruction: if you are memorizing rhyming song lyrics in a language you speak, many parts of the string will be more or less fully determined for you by the preceding context, so that you may only have to rote-memorize a few keywords and other cues to appear to have rote-memorized the entire piece. For LMs, this kind of reconstruction is so pervasive and powerful that it can give rise to memorization illusions of the sort documented in Huang et al. 2024, Section 4.2: what looks for all the world like a memorized string is in fact a natural consequence of prior things the model has learned.
Memorization can also be more semantic. If you learn the names of all the bones in the human body, you aren’t memorizing strings per se, but rather pieces of content. For an LM, this knowledge will be acquired via strings, and such acquisition will in turn impact the logprobs it assigns to the relevant strings. This will be more diffuse than for verbatim memorization, but it could still lead to strong string-level effects.
There are probably other notions that are worth identifying here. The above suffices to show that there is a complex landscape for memorization, and that we would do well to think in terms of degrees of memorization and semantic memorization. Once again, though, it is fine if you want to be more conservative about what counts as memorization. All we need in order to continue confidently ahead is the following heuristic:
Results for verbatim memorization represent an extreme of a phenomenon that is pervasive in how LMs process text.
Memorization profiles
The next body of evidence is the most critical. Whatever your views about the above two sections, you now have to be willing to look at each one of the plots in this section, with its own particular discrete memorization measures defined over very specific sets of strings, and think, “That same thing will play out if we use logprobs defined over all strings”. If you do that, you will emerge from this segment of the journey at most one step away from the discovery of the palimpsestic tests.
Tirumala et al. (2022) provide an early glimpse of the memorization profiles we are working toward. In their Figure 9, they summarize the results of three separate training runs in which a special batch of examples (a validation batch, presumably from the same distribution as the training examples) was injected at epochs 14, 39, and 63:
M(f) is the memorization rate: the rate at which the model correctly predicts the next token to complete sequences from the special batch. Tirumala et al. refer to the orange, green, and red lines as “forgetting curves” because they show that the model partially memorizes, and then rapidly forgets, the special batch sequences.
In Tirumala et al.’s assessment, this figure shows that the precise injection point for the special batch doesn’t matter. However, there is a noteworthy downward trend – earlier points start higher. In addition, though the effect appears to be small in the plot, it is clear that injection site correlates with M(f); at least for a while, the orange line fits neatly under the green line, which fits neatly under the red line. Similar trends are discernible in the findings of Jagielski et al. (2023); see especially their Figure 8c. This begins to suggest a complex interplay between memorization and data ordering, though it is hard to make out precisely what the relationship is, possibly because the models studied here are small and thus not very capable memorizers. In retrospect, we can see that these patterns are essentially the palimpsest.
The original Pythia paper (Biderman et al. 2023) is noteworthy in this context, because it suggests that we are unlikely to see such trends. The authors report that “memorized sequences are not spaced more densely toward the beginning or end of training, and that between each checkpoint roughly the same number of memorized sequences can be found”. This would seem to entail that palimpsestic tests will not work, and indeed my understanding is that at least one member of RJS was discouraged by this.
I initially thought we could explain this away by noting that Biderman et al. use only the first 64 tokens of every training sequence, to simplify their statistical analysis. I assumed such sequences would tend to be frequent and repetitive, leading them to be memorized better but in ways that would be independent of training data ordering. Jing informed me that this is incorrect; Figure 5 of Kuditipudi, Huang, Zhu et al. 2025 shows that the palimpsest is stronger for example prefixes. Thus, the more likely explanation for Biderman et al.’s negative result is that their very strict verbatim memorization criteria hide the effects of training data order.
In Lesci et al. 2024, a rich picture of the relationship between training data order and memorization comes into view. They seem to have coined the phrase “memorization profile”. Here is their Figure 1, which is basically the entire palimpsestic picture through a discrete lens:
This figure shows data for Pythia 6.9B. The diagonal in the top panel tracks instantaneous memorization: the ability of the model to memorize the examples from the current batch. Instantaneous memorization is strongest for early batches (not just for this model but for the entire Pythia series, from 70M to 12B; see their Figure 2). This is very likely the same observation as the downward trend we see in Tirumala et al.’s (2022) forgetting curves above.
The persistent memorization trends are evident in the off-diagonal elements to the lower right (the space where the measurements take place after the exposure). The memorization appears to stop abruptly because of Lesci et al.’s filters for statistical significance.
So: imagine we were seeing all the raw logprobs. It’s a safe bet that the persistent memorization trends would fade gradually. You might even be able to guess where a given memorized sequence occurred in the training regime by comparing its logprobs with the logprobs of other sequences…
Another clue: Chang et al. (2024) study how models acquire factual knowledge. This is a kind of memorization, and they pose the question in terms of whether the model generates the correct next token under greedy decoding, which approximates factual recall in terms of specific strings. Here is their Figure 1:
The start of the green section is where the model was exposed to the factual knowledge for the first time. The plot measures various aspects of discrete memorization, but the blue line is the logprob of the target word, so we can focus on that, since we know we are headed towards the palimpsestic tests. This logprob reaches its maximum at about 30 steps after exposure (the red line) and then drops down. If you imagine doing this same experiment with different knowledge at different time steps, you would likely end up with blue lines that had this same shape, but earlier checkpoints would reach higher maxima and the logprobs would decay at different rates. This is what the Lesci et al. 2024 profiles suggest.
An aside: The offset between exposure and peak memorization is worth thinking about. Huang et al. (2024) observe the same thing and attribute it to the momentum term of the Adam optimizer used to train the Pythia models. The same optimizer was used by Tirumala et al. (2022), but their Figure 9 (given above as Figure B) likely hides this offset because of its epoch-level reporting. Lesci et al.’s (2024) findings do not show the offset either (see Figure C above), probably for the same reason: they relied on existing Pythia checkpoints, which exist at 1K intervals for all but the earliest iterations. This is not fine-grained enough to show the offset.
Huang et al. 2024 is historically important for the development of the palimpsestic tests in part because its lead author is the J of RJS. The paper is focused on understanding where verbatim memorization occurs and what its underlying mechanisms are like, so memorization profiles are not really in the spotlight. However, we nonetheless get two glimpses of these profiles.
First, in their Figure 3, they observe that better models – as measured by size and checkpoint – memorize more. The checkpoint-based metric reflects the persistent memorization patterns from Lesci et al. 2024.
Second, and more intriguing from my perspective, is their Figure 2, shown here:
This figure was intended to show that single-shot memorization is possible only in settings that are unrealistic for frontier LM training, since only tiny batches show any evidence of it. The authors also note the consistent delay in peak memorization that I mentioned above. I myself don’t see much else happening here. However, for RJS, this apparently also showed a secondary trend: single-shot memorization decays gradually. I myself don’t really see steady decay, but I now realize that this is probably an artifact of the discrete measure of verbatim memorization used on the y-axis. For RJS, all those wavy lines were enough to indicate that the underlying logprobs were on a downward trajectory.
When I expressed my wonder/concern/bewilderment about this to Jing, she said something like, “Of course the loss will shoot down and then rise gradually”, and she reminded me that she used loss-based metrics in her earliest explorations of LM memorization and so had seen the pattern many times before. She then sent me the following new supporting data for single-shot memorization, using OLMo-2 7B:
This plot tracks the logprobs for a single-shot string (“Injected”). The y-axis measures the loss for the model, which is the negative of the logprobs. As Jing predicted, the loss drops to its lowest point about 25 steps after exposure, and then it trends upward gradually, regressing towards the mean loss for the model. The pattern is virtually the same as the blue line from Chang et al. (2024), reproduced in Figure D above, and (modulo the offset in the peak rate) the same as the memorization profile in the bottom panel of Lesci et al. (2024), reproduced in Figure C above. The overarching heuristic can be given as follows:
A model’s logprobs for a sequence it is trained on will peak soon after the relevant training step and then decay predictably over subsequent steps.
Seeing the palimpsest
The first major step toward the palimpsestic tests is simply checking the above heuristic rigorously. Given LM checkpoints C1…Cn and aligned batches B1…Bn, will the average logprobs that Cn assigns to B1…Bn be correlated with the order 1…n? The following diagram (produced by RJS but oddly cut from the final paper) shows that the answer is yes:
The y-axis of this plot is the loss (the negative of the logprobs). To see the significance of this plot, let’s zoom in on the column of points above checkpoint 80. The lowest blue dot (lowest loss, highest logprobs) is for the set of examples that come from checkpoint 75. The next lowest is for batch 50, then 25, then 5, and then 100:

In other words, for the data this checkpoint has seen, there is a perfect correlation between loss and batch order. This is true for every checkpoint. For example, at checkpoint 25, the checkpoint 25 data is by far the lowest. The next lowest is 5. The rest (the ones from the future) are clustered together. By the time we get to checkpoint 140, we have seen all the batches under consideration here, and the loss (logprobs) perfectly mirrors the order of the batches. In each case, we get a regression to the mean (the points cluster back together), but the ordering is preserved.
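Here is a rough sketch of how one could compute the quantities behind this kind of plot, assuming access to saved checkpoints and the token-id batches seen at the corresponding steps (the function and its inputs are my own framing, not the authors' code):

```python
import torch
from transformers import AutoModelForCausalLM

def loss_grid(checkpoint_paths: dict[int, str],
              batches: dict[int, torch.Tensor]) -> dict[int, dict[int, float]]:
    """For each checkpoint (keyed by training step), compute its mean next-token loss
    (negative average logprob) on each batch (keyed by the step at which it was seen)."""
    grid = {}
    for ckpt_step, path in checkpoint_paths.items():
        model = AutoModelForCausalLM.from_pretrained(path).eval()
        row = {}
        for batch_step, batch_ids in batches.items():
            with torch.no_grad():
                out = model(input_ids=batch_ids, labels=batch_ids)
            row[batch_step] = out.loss.item()
        grid[ckpt_step] = row
    return grid

# Expected pattern from the plot above: within a checkpoint's row, batches already
# seen get lower loss the more recently they were seen (the palimpsest), while
# batches from the future cluster together near the model's typical loss.
```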
As a control: what happens if we run the same experiment but using a model that was trained on a different sequence of batches? That is shown here, and we see there is no separation effect – no correlation between loss and batch index:
We now have everything we need for the palimpsestic tests. The only remaining ingredient is the creative spark to see that the above can support model provenance tests.
Palimpsestic tests
I’ll just briefly review the tests at a high level here, since the details are given in Kuditipudi, Huang, Zhu et al. (2025), Section 3, and the code for running the tests is also available.
For the Query setting, we measure the correlation between the training data order of model M1 and the logprobs assigned to those training data examples by M2. This is a direct application of what we see in Figure G and Figure H, but we now work at the example level rather than aggregating over batches. To enhance the power of the test, the authors subtract out the logprobs from a reference model, which helps control for general variation in how likely specific texts are. If it is not possible to get logprobs directly from M2, they can be estimated via text samples and the test still works well (see Appendix A.6 of the paper).
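As a minimal sketch of the Query-style statistic (simplified relative to the paper's exact scoring pipeline), assuming we already have per-example logprobs from the suspect model and from a reference model:

```python
import numpy as np
from scipy.stats import spearmanr

def query_test(train_order: np.ndarray,
               suspect_logprobs: np.ndarray,
               reference_logprobs: np.ndarray) -> tuple[float, float]:
    """Correlate M1's training order with M2's logprobs on those examples, after
    subtracting a reference model's logprobs to control for how likely each text
    is in general."""
    adjusted = suspect_logprobs - reference_logprobs
    rho, p_value = spearmanr(train_order, adjusted)
    return rho, p_value

# A small p_value is evidence against the null that M2's likelihoods are
# independent of M1's training order, i.e., evidence that M2 derives from M1.
```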
For the Observational setting, we assume we only have a text sample T from M2, so we train an ordered series of n-gram models L1…Lk on contiguous batches of data from M1’s training order, and then we compare the likelihoods assigned to T by each Li with the order 1…k. The paper considers using both the probabilities from the n-gram models and simple counts of overlapping n-grams. The experiments primarily use simple counts, which seems like a throwback to traditional notions of verbatim memorization.
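Here is a minimal sketch of the count-based variant (my simplification; the tokenization and the exact overlap statistic differ in the paper), assuming M1's training data comes pre-split into ordered slices and we have a token sequence sampled from M2:

```python
from collections import Counter

import numpy as np
from scipy.stats import spearmanr

def ngrams(tokens, n=4):
    """All contiguous n-grams of a token sequence, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def observational_test(train_slices, sample_tokens, n=4):
    """For each ordered slice of M1's training data, count how often its n-grams
    occur in the text sampled from M2, then correlate those counts with slice order."""
    sample_counts = Counter(ngrams(sample_tokens, n))
    overlaps = [
        sum(sample_counts[g] for g in set(ngrams(slice_tokens, n)))
        for slice_tokens in train_slices
    ]
    rho, p_value = spearmanr(np.arange(len(train_slices)), overlaps)
    return rho, p_value

# If M2 derives from M1, later slices should share more n-grams with M2's output,
# giving a positive rank correlation with the slice order.
```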
The paper uses the Spearman correlation coefficient, which reduces the comparison to one between the rank of the data indices and the rank of the likelihoods (logprobs, probabilities, or counts). The null hypothesis is that the training data ordering and the likelihood ranks are independent of each other, and the p-value from the correlation test estimates the probability of observing a correlation at least as strong as the one we see, assuming that null hypothesis holds.
The paper shows that the tests are robust to lots of ways in which one might mess with M2 to try to hide its origins as a copy of M1: fine-tuning, model souping, and continued training on shuffled versions of the original data. Basically, in order to successfully hide from the palimpsestic tests, you need to do so much additional work that it makes the original model theft pretty pointless.
This raises the issue of how exactly to set up the test to avoid false positives. For the Query setting, we can ask how many samples we need. For the Observational setting, we can ask how many n-gram models we need and how long the text sample from M2 needs to be. In both settings, we have to choose a p-value threshold for rejecting the null hypothesis. The precise answers will depend on the scenario, and the paper offers a lot of detailed guidance. Both tests benefit from larger text samples, and the Observational setting is considerably more demanding, as one might expect given how little access one has to M2 in that setting.
Looking ahead
I have tried to methodically assemble precedents and rationally reconstruct the path to the palimpsestic tests. Our odyssey is complete. I feel I can map the route well now, but I am still surprised and delighted by where it leads. I have a persistent worry that I would not have gotten here on my own. Well, I am fortunate to have brilliant students.
The tests we have developed so far seem not to help with the highly salient question of whether M2 was post-trained on examples distilled from M1. RJS conducted pilot experiments on this question and found that they could get signal only with truly massive distillation sets – possibly as large as the training data for M1. This seems not to have much practical utility, but there may be variants of the Query and Observational settings that are less demanding in the right ways.
Are there inexpensive ways to cheat the palimpsestic tests and thereby hide one’s model theft? The weight editing methods of Merullo et al. (2025) might point the way to some camouflage. Do all models memorize in the same way? The findings in Bonnaire et al. (2025) suggest not. What other surprising metadata about their nature and origins do LMs acquire during the course of training? Clearly, we have merely reached an interim stop on the much larger expedition of mapping out LM memorization and understanding its implications.
Thanks
A huge thanks to Jing Huang for extensive discussion and detailed feedback on this chronicle. My thanks also to Rohith Kuditipudi, Jing Huang, Sally Zhu, Diyi Yang, and Percy Liang for contributing so many critical epistrata to the palimpsest. Any mistakes are my own.