Kuditipudi, Huang, Zhu et al. (2025) provide an extremely effective test for whether one language model M1 is a derivative of another language model M2. The basic form of the test works like this: measure the correlation between the training data order for M1 and the likelihood (logprobs) that M2 assigns to those training examples.
Strikingly, you don’t need to use the entire training data order (which would be expensive); the test works even if M2 has been modified in numerous ways (e.g., fine-tuning, model souping); you don’t need the exact logprobs from M2 (estimating them from text samples suffices); and a variant of the test uses only texts generated by M2 (no requirement that we can run M2 ourselves). These tests do not require you to keep any information about M1 private, or to mess with M1 or its training data. Overall, then, this is a powerful, lightweight way of tracking model provenance, detecting tampering, and spotting theft.
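Before retracing how the authors got there, here is a toy sketch of the core statistical idea (mine, with synthetic numbers, not the paper's actual procedure): if a model's logprobs for training examples carry even a faint trace of how recently each example was seen, a rank correlation will detect it.

```python
# Toy illustration with synthetic numbers: a faint "recency" signal in the
# logprobs is enough for a rank-correlation test to recover the training order.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 5000
order = np.arange(n)  # position of each example in M1's training order

# Hypothetical logprobs from M2: mostly noise, plus a tiny boost for later examples.
logprobs = -50.0 + 0.0005 * order + rng.normal(scale=5.0, size=n)

rho, p = spearmanr(order, logprobs)
print(f"Spearman rho = {rho:.3f}, p = {p:.2e}")  # tiny p-value despite the weak signal
```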
Fascinating… but how on earth did the authors come up with this? You just correlate the training data order with the logprobs, and this tells you about model provenance? Okay, but why?
I am an author on Kuditipudi, Huang, Zhu et al. 2025. The test was initially presented to me as a finding that the lead authors (Rohith, Jing, and Sally; henceforth RJS) had already basically achieved. They assured me that this was all a natural consequence of prior work (theirs and others’) on how LMs memorize, but they always reviewed this very quickly. I am flattered that they thought I could follow their exegesis.
The goal of this post is to unpack their argument – to show that the necessary ingredients for proposing this test were basically present in the existing literature already, and to identify the places where RJS must still have taken a leap of faith. This is meant to be a rational reconstruction of whatever combination of deep reading and cosmic inspiration actually led RJS to their proposal.
I’ll refer to the family of tests from Kuditipudi, Huang, Zhu et al. 2025 as “palimpsestic tests”. A palimpsest is a piece of material that has been written on and erased many times, with the effect that the earlier texts remain visible underneath the later ones. It turns out that LMs are like palimpsests with respect to their training data examples – each example is etched into the model, with earlier examples fainter than later ones. The evidence reviewed below is in large part evidence for this observation.
The relationship between memorization and logprobs
Carlini et al. 2019 is a pioneering study of LM memorization. The authors observe, among many other things, that memorized sequences have lower log perplexity, i.e., higher average logprobs – see their Table 1.
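To keep the terminology straight: log perplexity is just the negative of the average per-token logprob, so “lower log perplexity” and “higher average logprobs” are two descriptions of the same quantity. Here is a minimal sketch of the bookkeeping, assuming the per-token logprobs are already in hand (the numbers are made up for illustration):

```python
import math

def avg_logprob(token_logprobs):
    """Average per-token log probability of a sequence."""
    return sum(token_logprobs) / len(token_logprobs)

def log_perplexity(token_logprobs):
    """Log perplexity = mean negative log-likelihood = -(average logprob)."""
    return -avg_logprob(token_logprobs)

# Hypothetical per-token logprobs for a memorized vs. a non-memorized sequence:
memorized = [-0.10, -0.20, -0.05, -0.10]
novel     = [-2.30, -3.10, -1.80, -2.60]
print(math.exp(log_perplexity(memorized)))  # perplexity ~1.1: low, as for memorized text
print(math.exp(log_perplexity(novel)))      # perplexity ~11.6: much higher
```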
For better or worse, almost all subsequent work on memorization has adopted discrete notions of memorization, many of them based on the definitions in Carlini et al. 2023. For example, we can ask whether a given prefix string p produces a specific continuation s using our chosen sampling procedure, where ps is in the training data.
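For concreteness, here is a sketch of one such discrete check (my own minimal version, not the exact protocol from any of these papers), assuming an off-the-shelf causal LM and greedy decoding:

```python
# Does greedy decoding from a training-data prefix reproduce the true continuation?
# The model choice and tokenization details here are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # any causal LM will do for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def is_memorized(prefix: str, continuation: str) -> bool:
    """True if greedy decoding from `prefix` exactly reproduces `continuation`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(continuation, add_special_tokens=False).input_ids
    with torch.no_grad():
        out = model.generate(
            prefix_ids,
            max_new_tokens=len(target_ids),
            do_sample=False,  # greedy decoding; a different sampler might disagree
        )
    generated = out[0, prefix_ids.shape[1]:].tolist()
    return generated == target_ids
```

Note that the verdict depends on the decoding choice hard-coded here, which is exactly the kind of brittleness discussed next.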
Such measures relate directly to intuitive ideas people have about memorization, and so they feel natural. However, they will depend on our sampling procedure (greedy decoding could say “memorized” while top-k sampling says “not memorized”; see Hayes et al. 2025). More seriously, they will hide parts of the evidence, by treating near misses and complete misses identically. In addition, these measures introduce some gaps in the path to the palimpsestic tests, since we have to guess about the nature of the logprobs behind the discrete measures we see in the literature.
Luckily, Prashanth et al. 2024 did not take this connection for granted. In their Figure 2, they enhance Carlini et al.’s (2019) evidence by showing that memorized strings have lower prompt/prefix perplexity, lower continuation perplexity, lower overall perplexity, and lower loss:
This evidence shows that memorization generally implies higher logprobs, which helps with the step/leap from discrete measures to logprobs. Note, though, that the evidence does not show the converse, i.e., that higher logprobs implies memorization. Intuitively, it seems like many non-memorized strings could have high logprobs, and Prashanth et al.’s (2024) results just above indicate that this is indeed the case (the “Memorized” and “Not memorized” areas overlap a lot).
In the next section, I will encourage us to think expansively about what counts as memorization, which makes this logical nicety less important. However, you needn’t be as relaxed as I am about this. All we need to continue our journey to the palimpsestic tests is the following potentially very loose heuristic:
Memorization correlates with higher logprobs.
There are many kinds of memorization
We are accustomed to thinking of memorization as verbatim memorization: you can either perfectly recite the lyrics to “Sparks fly” by Waxahatchee, or you can’t; the LM either reproduces Section 3 of Walt Whitman’s “Song of the open road”, or it doesn’t.
I would argue that our standards for memorization should be more relaxed. Even people focused on issues of copyright should accept lower standards, since minor mistakes or perturbations to a text will not necessarily get you out of hot water in the context of copyright law.
There are also different extremes when it comes to (perfect or imperfect) memorization. Rote memorization requires a brute-force effort. If you know the first 10 digits of pi by heart, it is probably because you simply learned the sequence as a primitive – no shortcuts, no higher-level patterns to leverage.
Memorization can also involve what Prashanth et al. 2024 call reconstruction: if you are memorizing rhyming song lyrics in a language you speak, many parts of the string will be more or less fully determined for you by the preceding context, so that you may only have to rote-memorize a few keywords and other cues to appear to have rote-memorized the entire piece. For LMs, this kind of reconstruction is so pervasive and powerful that it can give rise to memorization illusions of the sort documented in Huang et al. 2024, Section 4.2: what looks for all the world like a memorized string is in fact a natural consequence of prior things the model has learned.
Memorization can also be more semantic. If you learn the names of all the bones in the human body, you aren’t memorizing strings per se, but rather pieces of content. For an LM, this knowledge will be acquired via strings, and such acquisition will in turn impact the logprobs it assigns to the relevant strings. This will be more diffuse than for verbatim memorization, but it could still lead to strong string-level effects.
There are probably other notions that are worth identifying here. The above suffices to show that there is a complex landscape for memorization, and that we would do well to think in terms of degrees of memorization and semantic memorization. Once again, though, it is fine if you want to be more conservative about what counts as memorization. All we need in order to continue confidently ahead is the following heuristic:
Results for verbatim memorization represent an extreme of a phenomenon that is pervasive in how LMs process text.
Memorization profiles
The next body of evidence is the most critical. Whatever your views about the above two sections, you now have to be willing to look at each one of the plots in this section, with its own particular discrete memorization measures defined over very specific sets of strings, and think, “That same thing will play out if we use logprobs defined over all strings”. If you do that, you will emerge from this segment of the journey at most one step away from the discovery of the palimpsestic tests.
Tirumala et al. (2022) provide an early glimpse of the memorization profiles we are working toward. In their Figure 9, they summarize the results of three separate training runs in which a special batch of examples (a validation batch, presumably from the same distribution as the training examples) was injected at epochs 14, 39, and 63:
M(f) is the memorization rate: the rate at which the model correctly predicts the next token to complete sequences from the special batch. Tirumala et al. refer to the orange, green, and red lines as “forgetting curves” because they show that the model partially memorizes, and then rapidly forgets, the special batch sequences.
In Tirumala et al.’s assessment, this figure shows that the precise injection point for the special batch doesn’t matter. However, there is a noteworthy downward trend – earlier points start higher. In addition, though the effect appears to be small in the plot, it is clear that injection site correlates with M(f); at least for a while, the orange line fits neatly under the green line, which fits neatly under the red line. Similar trends are discernible in the findings of Jagielski et al. (2023); see especially their Figure 8c. This begins to suggest a complex interplay between memorization and data ordering, though it is hard to make out precisely what the relationship is, possibly because the models studied here are small and thus not very capable memorizers. In retrospect, we can see that these patterns are essentially the palimpsest.
The original Pythia paper (Biderman et al. 2023) is noteworthy in this context, because it suggests that we are unlikely to see such trends. The authors report that “memorized sequences are not spaced more densely toward the beginning or end of training, and that between each checkpoint roughly the same number of memorized sequences can be found”. This would seem to entail that palimpsestic tests will not work, and indeed my understanding is that at least one member of RJS was discouraged by this.
I initially thought we could explain this away by noting that Biderman et al. use only the first 64 tokens of every training sequence, to simplify their statistical analysis. I assumed such sequences would tend to be frequent and repetitive, leading them to be memorized better but in ways that would be independent of training data ordering. Jing informed me that this is incorrect; Figure 5 of Kuditipudi, Huang, Zhu et al. 2025 shows that the palimpsest is stronger for example prefixes. Thus, the more likely explanation for Biderman et al.’s negative result is that their very strict verbatim memorization criteria hide the effects of training data order.
In Lesci et al. 2024, a rich picture of the relationship between training data order and memorization comes into view. They seem to have coined the phrase “memorization profile”. Here is their Figure 1, which is basically the entire palimpsestic picture through a discrete lens:
This figure shows data for Pythia 6.9B. The diagonal in the top panel tracks instantaneous memorization: the ability of the model to memorize the examples from the current batch. Instantaneous memorization is strongest for early batches (not just for this model but for the entire Pythia series, from 70M to 12B; see their Figure 2). This is very likely the same observation as the downward trend we see in Tirumala et al.’s (2022) forgetting curves above.
The persistent memorization trends are evident in the off-diagonal elements to the lower right (the space where the measurements take place after the exposure). The memorization appears to stop abruptly because of Lesci et al.’s filters for statistical significance.
So: imagine we were seeing all the raw logprobs. It’s a safe bet that the persistent memorization trends would fade gradually. You might even be able to guess where a given memorized sequence occurred in the training regime by comparing its logprobs with the logprobs of other sequences…
Another clue: Chang et al. (2024) study how models acquire factual knowledge. This is a kind of memorization, and they pose the question in terms of whether the model generates the correct next token under greedy decoding, which approximates factual recall in terms of specific strings. Here is their Figure 1:
The start of the green section is where the model was exposed to the factual knowledge for the first time. The plot measures various aspects of discrete memorization, but the blue line is the logprob of the target word, so we can focus on that, since we know we are headed towards the palimpsestic tests. This logprob reaches its maximum at about 30 steps after exposure (the red line) and then drops down. If you imagine doing this same experiment with different knowledge at different time steps, you would likely end up with blue lines that had this same shape, but earlier checkpoints would reach higher maxima and the logprobs would decay at different rates. This is what the Lesci et al. 2024 profiles suggest.
An aside: The offset between exposure and peak memorization is worth thinking about. Huang et al. (2024) observe the same thing and attribute it to the momentum term of the Adam optimizer used to train the Pythia models. The same optimizer was used by Tirumala et al. (2022), but their Figure 9 (given above as Figure B) likely hides this offset because of its epoch-level reporting. Lesci et al.’s (2024) findings do not show the offset either (see Figure C above), probably for the same reason: they relied on existing Pythia checkpoints, which exist at 1K intervals for all but the earliest iterations. This is not fine-grained enough to show the offset.
Huang et al. 2024 is historically important for the development of the palimpsestic tests in part because its lead author is the J of RJS. The paper is focused on understanding where verbatim memorization occurs and what its underlying mechanisms are like, so memorization profiles are not really in the spotlight. However, we nonetheless get two glimpses of these profiles.
First, in their Figure 3, they observe that better models – as measured by size and checkpoint – memorize more. The checkpoint-based metric reflects the persistent memorization patterns from Lesci et al. 2024.
Second, and more intriguing from my perspective, is their Figure 2, shown here:
This figure was intended to show that single-shot memorization is possible only in settings that are unrealistic for frontier LM training, since only tiny batches show any evidence of it. The authors also note the consistent delay in peak memorization that I mentioned above. I myself don’t see much else happening here. However, for RJS, this apparently also showed a secondary trend: single-shot memorization decays gradually. I myself don’t really see steady decay, but I now realize that this is probably an artifact of the discrete measure of verbatim memorization used on the y-axis. For RJS, all those wavy lines were enough to indicate that the underlying logprobs were on a downward trajectory.
When I expressed my wonder/concern/bewilderment about this to Jing, she said something like, “Of course the loss will shoot down and then rise gradually”, and she reminded me that she used loss-based metrics in her earliest explorations of LM memorization and so had seen the pattern many times before. She then sent me the following new supporting data for single-shot memorization, using OLMo-2 7B:
This plot tracks the logprobs for a single-shot string (“Injected”). The y-axis measures the loss for the model, which is the negative of the logprobs. As Jing predicted, the loss drops to its lowest point about 25 steps after exposure, and then it trends upward gradually, regressing towards the mean loss for the model. The pattern is virtually the same as the blue line from Chang et al. (2024), reproduced in Figure D above, and (modulo the offset in the peak rate) the same as the memorization profile in the bottom panel of Lesci et al. (2024), reproduced in Figure C above. The overarching heuristic can be given as follows:
A model’s logprobs for a sequence it is trained on will peak soon after the relevant training step and then decay predictably over subsequent steps.
Seeing the palimpsest
The first major step toward the palimpsestic tests is simply checking the above heuristic rigorously. Given LM checkpoints C1…Cn and aligned batches B1…Bn, will the average logprobs that Cn assigns to B1…Bn be correlated with the order 1…n? The following diagram (produced by RJS but oddly cut from the final paper) shows that the answer is yes:
The y-axis of this plot is the loss (the negative of the logprobs). To see the significance of this plot, let’s zoom in on the column of points above checkpoint 80. The lowest blue dot (lowest loss, highest logprobs) is for the set of examples that come from checkpoint 75. The next lowest is for batch 50, then 25, then 5, and then 100:

In other words, for the data this checkpoint has seen, there is a perfect correlation between loss and batch order. This is true for every checkpoint. For example, at checkpoint 25, the checkpoint 25 data is by far the lowest. The next lowest is 5. The rest (the ones from the future) are clustered together. By the time we get to checkpoint 140, we have seen all the batches under consideration here, and the loss (logprobs) perfectly mirrors the order of the batches. In each case, we get a regression to the mean (the points cluster back together), but the ordering is preserved.
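Here is a rough sketch of how one could compute the quantities behind this kind of plot, assuming access to saved checkpoints and the token-id batches seen at the corresponding steps (the function and its inputs are my own framing, not the authors' code):

```python
import torch
from transformers import AutoModelForCausalLM

def loss_grid(checkpoint_paths: dict[int, str],
              batches: dict[int, torch.Tensor]) -> dict[int, dict[int, float]]:
    """For each checkpoint (keyed by training step), compute its mean next-token loss
    (negative average logprob) on each batch (keyed by the step at which it was seen)."""
    grid = {}
    for ckpt_step, path in checkpoint_paths.items():
        model = AutoModelForCausalLM.from_pretrained(path).eval()
        row = {}
        for batch_step, batch_ids in batches.items():
            with torch.no_grad():
                out = model(input_ids=batch_ids, labels=batch_ids)
            row[batch_step] = out.loss.item()
        grid[ckpt_step] = row
    return grid

# Expected pattern from the plot above: within a checkpoint's row, batches already
# seen get lower loss the more recently they were seen (the palimpsest), while
# batches from the future cluster together near the model's typical loss.
```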
As a control: what happens if we run the same experiment but using a model that was trained on a different sequence of batches? That is shown here, and we see there is no separation effect – no correlation between loss and batch index:
We now have everything we need for the palimpsestic tests. The only remaining ingredient is the creative spark to see that the above can support model provenance tests.
Palimpsestic tests
I’ll just briefly review the tests at a high level here, since the details are given in Kuditipudi, Huang, Zhu et al. (2025), Section 3, and the code for running the tests is also available.
For the Query setting, we measure the correlation between the training data order of model M1 and the logprobs assigned to those training data examples by M2. This is a direct application of what we see in Figure G and Figure H, but we now work at the example level rather than aggregating over batches. To enhance the power of the test, the authors subtract out the logprobs from a reference model, which helps control for general variation in how likely specific texts are. If it is not possible to get logprobs directly from M2, they can be estimated via text samples and the test still works well (see Appendix A.6 of the paper).
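As a minimal sketch of the Query-style statistic (simplified relative to the paper's exact scoring pipeline), assuming we already have per-example logprobs from the suspect model and from a reference model:

```python
import numpy as np
from scipy.stats import spearmanr

def query_test(train_order: np.ndarray,
               suspect_logprobs: np.ndarray,
               reference_logprobs: np.ndarray) -> tuple[float, float]:
    """Correlate M1's training order with M2's logprobs on those examples, after
    subtracting a reference model's logprobs to control for how likely each text
    is in general."""
    adjusted = suspect_logprobs - reference_logprobs
    rho, p_value = spearmanr(train_order, adjusted)
    return rho, p_value

# A small p_value is evidence against the null that M2's likelihoods are
# independent of M1's training order, i.e., evidence that M2 derives from M1.
```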
For the Observational setting, we assume we only have a text sample T from M2, so we train an ordered series of n-gram models L1…Lk on contiguous batches of data from M1’s training order, and then we compare the likelihoods assigned to T by each Li with the order 1…k. The paper considers using both the probabilities from the n-gram models and simple counts of overlapping n-grams. The experiments primarily use simple counts, which seems like a throwback to traditional notions of verbatim memorization.
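Here is a minimal sketch of the count-based variant (my simplification; the tokenization and the exact overlap statistic differ in the paper), assuming M1's training data comes pre-split into ordered slices and we have a token sequence sampled from M2:

```python
from collections import Counter

import numpy as np
from scipy.stats import spearmanr

def ngrams(tokens, n=4):
    """All contiguous n-grams of a token sequence, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def observational_test(train_slices, sample_tokens, n=4):
    """For each ordered slice of M1's training data, count how often its n-grams
    occur in the text sampled from M2, then correlate those counts with slice order."""
    sample_counts = Counter(ngrams(sample_tokens, n))
    overlaps = [
        sum(sample_counts[g] for g in set(ngrams(slice_tokens, n)))
        for slice_tokens in train_slices
    ]
    rho, p_value = spearmanr(np.arange(len(train_slices)), overlaps)
    return rho, p_value

# If M2 derives from M1, later slices should share more n-grams with M2's output,
# giving a positive rank correlation with the slice order.
```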
The paper uses the Spearman correlation coefficient, which reduces the comparison to one between the rank of the data indices and the rank of the likelihoods (logprobs, probabilities, or counts). The null hypothesis is that the training data ordering and the likelihood ranks are independent of each other, and the p-value from the correlation test estimates the probability of observing a correlation at least as strong as the one we see, assuming that null hypothesis holds.
The paper shows that the tests are robust to lots of ways in which one might mess with M2 to try to hide its origins as a copy of M1: fine-tuning, model souping, and continued training on shuffled versions of the original data. Basically, in order to successfully hide from the palimpsestic tests, you need to do so much additional work that it makes the original model theft pretty pointless.
This raises the issue of how exactly to set up the test to avoid false positives. For the Query setting, we can ask how many samples we need. For the Observational setting, we can ask how many n-gram models we need and how long the text sample from M2 needs to be. In both settings, we have to choose a p-value threshold for rejecting the null hypothesis. The precise answers will depend on the scenario, and the paper offers a lot of detailed guidance. Both tests benefit from larger text samples, and the Observational setting is considerably more demanding, as one might expect given how little access one has to M2 in that setting.
Looking ahead
I have tried to methodically assemble precedents and rationally reconstruct the path to the palimpsestic tests. Our odyssey is complete. I feel I can map the route well now, but I am still surprised and delighted by where it leads. I have a persistent worry that I would not have gotten here on my own. Well, I am fortunate to have brilliant students.
The tests we have developed so far seem not to help with the highly salient question of whether M2 was post-trained on examples distilled from M1. RJS conducted pilot experiments on this question and found that they could get signal only with truly massive distillation sets – possibly as large as the training data for M1. This seems not to have much practical utility, but there may be variants of the Query and Observational settings that are less demanding in the right ways.
Are there inexpensive ways to cheat the palimpsestic tests and thereby hide one’s model theft? The weight editing methods of Merullo et al. (2025) might point the way to some camouflage. Do all models memorize in the same way? The findings in Bonnaire et al. (2025) suggest not. What other surprising metadata about their nature and origins do LMs acquire during the course of training? Clearly, we have merely reached an interim stop on the much larger expedition of mapping out LM memorization and understanding its implications.
Thanks
A huge thanks to Jing Huang for extensive discussion and detailed feedback on this chronicle. My thanks also to Rohith Kuditipudi, Jing Huang, Sally Zhu, Diyi Yang, and Percy Liang for contributing so many critical epistrata to the palimpsest. Any mistakes are my own.