From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence Marc Finzi∗1Shikai Qiu∗2Yiding Jiang∗1Pavel Izmailov2J. Zico Kolter1 Andrew Gordon Wilson2 1Carnegie Mellon University2New York University Abstract Can we learn more from data than existed in the generating process itself? Can new and useful information be constructed from merely applying deterministic transformations to existing data? Can the learnable content in data be evaluated without considering a downstream task? On these questions, Shannon information and Kolmogorov complexity come up nearly empty-handed, in part because they assume observers with unlimited computational capacity and do not target the useful information content. In this work, we identify and exemplify three seeming paradoxes in information theory: (1) information cannot be increased by deterministic transformations; (2) information is independent of the order of data; (3) likelihood modeling is merely distribution matching. To shed light on the tension between these results and modern practice, and to quantify the value of data, we introduceepiplexity†, a formalization of information capturing what computationally bounded observers can learn from data. Epiplexity captures the structural content in data while excluding time-bounded entropy, the random unpredictable content exemplified by pseudorandom number generators and chaotic dynamical systems. With these concepts, we demonstrate how information can be created with computation, how it depends on the ordering of the data, and how likelihood modeling can produce more complex programs than present in the data generating process itself. We also present practical procedures to estimate epiplexity which we show capture differences across data sources, track with downstream performance, and highlight dataset interventions that improve out-of-distribution generalization. In contrast to principles of model selection, epiplexity provides a theoretical foundation fordata selection, guiding how to select, generate, or transform data for learning systems. 1 Introduction As AI research progresses towards more general-purpose intelligent systems, cracks are beginning to show in mechanisms for grounding mathematical intuitions.Much of learning theory is built around controlling generalization error with respect to a given distribution, treating the training distribution as fixed and focusing optimization effort on the choice of model. Yet modern systems are expected to transfer across tasks, domains, and objectives that were not specified at training time, often after large-scale pretraining on diverse and heterogeneous data. In this regime, success or failure frequently hinges less on architectural choices than on what data the model was exposed to in the first place. Pursuing broad generalization to diverse out-of-distribution tasks forces a shift in perspective: instead of treating data as given and optimizing for in-distribution performance, we need to choose and curate data to facilitate generalization to unseen tasks. This shift makes the value of data itself a central question—how much usable, transferable information can a model acquire from training? In other words, instead of model selection, how do we performdata selection? On this question, existing theory offers little guidance and often naively contradicts empirical observations. *Equal contribution. †Code available athttps://github.com/shikaiqiu/epiplexity. 1 arXiv:2601.03220v2 [cs.LG] 16 Mar 2026
Entropy Epiplexity Computation Information can be created by computationRandom vs structural information Structural informationOOD generalization Entropy Epiplexity Entropy Epiplexity OOD performance Initial ConditionEmergent Structure Apparent Randomness Deterministic Dynamics API_KEY="sk_7aF2jK1ycP9LmvYzz34" USER_ID="usr_4f8a2c1e9b7d3065" BUCKET="s3://data-8a3f1b-west-prod" SAVE_DIR="/mnt/marc/exp_7f2a/ckpts" SAVE_CKPT=True DEBUG=False SEED=9284715 ... High random info, low structural info Low random info, low structural info defis_even(n): ifn==0:returnTrue elifn==1:returnFalse elifn==2:returnTrue elifn==3:returnFalse elifn==4:returnTrue elifn==5:returnFalse ... Reuse sharedcircuits, subprograms, ... Moderate random info, high structural info defdijkstra(g,s): D=defaultdict(lambda:float('inf')) D[s]=0;q=[(0,s)] whileq: d,u=pop(q) ifd==D[u]: forv,wing.get(u,[]): if(nd:=d+w)<D[v]: D[v]=nd;push(q,(nd,v)) returnDFigure 1:Random vs structural information for computationally bounded observers.(Left) Illustration of random vs structural information of different data for computationally bounded observers, which we formalize with time-bounded entropy and epiplexity (Section 3) and can be estimated from loss curves of neural networks trained on that data (Section 4). (Top Right) Unlike other forms of information, time-bounded entropy and epiplexity can be increased through computational processes, such as simulating dynamical systems (cellular automation, Lorenz equations) and interventions like changing the data ordering, which can produce apparent randomness but also learnable, emergent structures like gliders and the Lorenz attractor invariant measure (Section 5). (Bottom Right) Whereas time-bounded entropy captures the in-distribution randomness and unpredictability, epiplexity measures the amount of structural information the model extracts from the data to its weights, which can be useful for OOD tasks such as by reusing learned circuits shared between the in-distribution and OOD tasks. Consider synthetic data, crucial for further developing model capabilities (Abdin et al., 2024; Maini et al., 2024) when existing natural data are exhausted.Existing concepts in information theory like the data processing inequality appear to suggest that synthetic data adds no additional value. Questions about what information is transferred to a given model seem naturally within the purview of information theory, yet, quantifying this information with existing tools proves to be elusive. Even basic questions, such as the source of the information in the weights of an AlphaZero game-playing model (Silver et al., 2018), are surprisingly tricky to answer. AlphaZero takes in zero human data, learning merely from the deterministic rules of the game and the AlphaZero RL algorithm, both of which are simple to describe. Yet the resulting models achieve superhuman performance and are large in size. To assert that AlphaZero has learned little to no information in this process is clearly missing the mark, and yet both Shannon and algorithmic information theory appear to say so. In this paper, we argue that the amount of structural information acomputationally bounded observer can extract from a dataset is a fundamental concept that underlies many observed empirical phenomena. As we will show, existing notions from Shannon and algorithmic information theory are inadequate when forced to quantify this type of information. These frameworks often lend intuitive or mathematical support to beliefs that, in fact, obscure important aspects of empirical phenomena. To highlight the limitations of classical frameworks and motivate the role of computational constraints in quantifying information, we identify and demonstrate threeapparent paradoxes: statements which can be justified mathematically by Shannon and algorithmic information theory, and yet are in tension with intuitions and empirical phenomena. 2
Paradox 1:Information cannot be increased by deterministic processes.For both Shannon entropy and Kolmogorov complexity, deterministic transformations cannot meaningfully increase the information content of an object.And yet, we use pseudorandom number generators to produce randomness, synthetic data improves model capabilities, mathematicians can derive new knowledge by reasoning from axioms without external information, dynamical systems produce emergent phenomena, and self-play loops like AlphaZero learn sophisticated strategies from games (Silver et al., 2018). Paradox 2:Information is independent of factorization order.A property of both Shannon entropy and Kolmogorov complexity is that total information content is invariant to factoriza- tion: the information from observing firstXand thenYis the same as observingYfollowed byX. On the other hand, LLMs learn better on English text ordered left-to-right than reverse ordered text, picking out an “arrow of time” (Papadopoulos et al., 2024; Bengio et al., 2019), and we have cryptography built on the existence of functions that are computationally hard to predict in one direction and easy in another. Paradox 3:Likelihood modeling is merely distribution matching.Maximizing the likelihood is often equated with matching the training data generating process: the true data-generating process is a perfect model of itself, and no model can achieve a higher expected likelihood. As a consequence, it is often assumed that a model trained on a dataset cannot extract more structure or learn useful features that were not used in generating the data. However, we show that a computationally-limited observer can in fact uncover much more structure than is in the data generating process. For example, in Conway’s game of life the data are generated via simple programmatic rules that operate on two-dimensional arrays of bits. Applying these simple rules sequentially, we see emergent structures, such as different species of objects that move and interact in a predictable way. While an unbounded observer can simply simulate the evolution of the environment exactly, a computationally bounded observer would make use of the emergent structures and learn the different types of objects and their behaviors. The tension between these theoretical statements and empirical phenomena can be resolved by impos- ing computational constraints on the observer and separating the random content from the structural content. Drawing on ideas from cryptography, algorithmic information theory, and these unexplained empirical phenomena, we define a new information measure,epiplexity(epistemic complexity), which formally defines the amount of structural information that a computationally bounded observer can extract from the data (Section 3, Definition 8). Briefly, epiplexity is the information in the model that minimizes the description length of data under computational constraints. A simple heuristic measurement is the area under the loss curve above the final loss, while a more rigorous approach uses the cumulative KL divergence between a teacher and student model (Section 4, Figure 2). Our definitions capture the intuition that an object contains both random, inherently unpredictable information (entropy), and predictable structured information that enables observers to generalize by identifying patterns (epiplexity). In Figure 1 (left) we illustrate this divide. In the top row, we have highly redundant and repetitive code and simple color gradients, which have little information content, be it structural or random. In the middle row, we have the inner workings of an algorithm and pictures of animals, showing complex, long-range interdependencies between the elements from which a model can learn complex features and subcircuits that are helpful even for different tasks. In contrast, on the bottom, we have random data with little structure: configuration files with randomly generated API keys, file paths, hashes, arbitrary boolean flags have negligible learnable content and no long-range dependencies or complex circuits that result from learning on this task. Similarly, uniformly shuffled pixels from the animal pictures have high entropy but are fundamentally unpredictable, and no complex features or circuits arise from training on these data. 3
An essential property of our formulation is that information isobserver dependent:the same object may appear random or structured depending on the computational resources of the observer. For instance, the output of a strong pseudorandom generator appears indistinguishable from true randomness to any polynomial-time observer lacking the secret key (seed), regardless of the algorithm or function class. In other situations, such as chaotic dynamical systems, both apparently random behavior is produced along with structure: the state of the system cannot be predicted precisely over long time-scales, but such observers may still learn meaningful predictive distributions, as shown by the invariant measure in Figure 1 (top right). Models trained to represent these distributions are computer programs, and substructures within these programs, like circuits for performing specific tasks, or induction heads (Olsson et al., 2022), can be reused even for seemingly unrelated data. This view motivates selecting high epiplexity data that induces more structural information in the model, since these structures can then be reused for unseen out-of-distribution (OOD) tasks, as illustrated in Figure 1 (bottom right). We emphasize, however, that epiplexity is a measure of information,nota guarantee of OOD generalization to specific tasks. Epiplexity quantifies the amount of structural information a model extracts, while being agnostic to whether these structures are relevant to aspecificdownstream task. To build intuition, we explore a range of phenomena and provide experimental evidence for behaviours that are poorly accounted for by existing information-theoretic tools, yet naturally accommodated by epiplexity. We show that informationcanbe created purely through computation, giving insights into synthetic data (subsection 5.1). We examine how certain factorizations of the same data can increase structural information and downstream OOD performance—even as they result in worse training loss (subsection 5.2). We show why likelihood modeling is more than distribution matching, identifying induction and emergence as two settings where the observer can learn more information than was present in the data generating process (subsection 5.3). By measuring epiplexity, we can better understand why pre-training on text data transfers more broadly than image data, and why certain data selection strategies for LLMs are empirically successful (Section 6).Together, our results provide clarity on the motivating questions: the information content of data can be compared independently of a specific task, new information can be created by computation, and models can learn more information than their generating processes contain. In short, we identify a disparity between existing concepts in information theory and modern practice, embodied by three apparent paradoxes, and introduce epiplexity as a measurement of structural information acquired by a computationally bounded observer to help resolve them. We formally define epiplexity in Section 3 (Definition 8) and present measurement procedures in Section 4. In Section 5, we show how epiplexity and time-bounded entropy shed light on these paradoxes, including induction and emergent phenomena. Finally, in Section 6, we demonstrate that epiplexity correlates with OOD generalization, helping explain why certain data enable broader generalization than others. 2 Background In order to define the interesting, structural, and predictive component of information, we must separate it out from random information—that which is fundamentally unpredictable given the computational constraints of the observer. Along the way, we will review algorithmic randomness as developed in algorithmic information theory as well as notions of pseudo-randomness used in cryptography, and how these concepts crucially depend on the observer. 4
2.1 What Does it Mean for An Object to Be Random? Random Variables and Shannon Information.Many common intuitions about randomness start from random variables and Shannon information. A random variable defines a map from a given measurable probability space to different outcomes, with probabilities corresponding to the measure of the space that lead to a certain outcome. Shannon information assigns to each outcome xa self-information (or surprisal)log 1/P(x)based on the probabilityP, and an entropy for the random variableH(X) =E[log 1/P(X)], which provides a lower bound on the average code length needed tocommunicatesamples to another party (Shannon, 1948). In Shannon’s theory, information comes only from distributions and random variables—objects that are not random must contain no information. As a result, non-random information is seemingly contradictory, and thus we must draw from a broader mathematical perspective to describe such concepts. In the mid 1900s, mathematicians were interested in formalizing precisely what it means for a given sample to be a random draw from a given distribution, to ground the theory of probability and random variables (Shafer and Vovk, 2006). A central consideration involves a uniformly sampled binary sequenceu1:∞from which other distributions of interest can be constructed. This sequence can also be interpreted as the binary expression of a number[0,1). Intuitively, one might think that all sequences should be regarded as equally random, as they are all equally likely according to the probability distribution:1111111. . .has the same probability mass as10011101. . .and also the same self-information. However, looking at statistics on these sequences reveals something missing from this perspective; from the law of large numbers, for example, it must be thatlimN→∞1 N PN i=1ui= 0.5, which is clearly not satisfied by the first sequence of1s. Martin-Löf Randomness: No algorithm exists to predict the sequence.Initial attempts were made to formalize randomness as sequences which pass all statistical tests for randomness, such as the law of large numbers for selected substrings. However, under such definitions all sequences fail to be random since tests likeu1:∞̸=y1:∞for any particular sequenceymust also be included (Downey and Hirschfeldt, 2019). The solution to these issues was found by defining random sequences not as those that pass all tests of randomness, but those that pass allcomputabletests of randomness, in a formalization known as Martin-Löf randomness (Martin-Löf, 1966).As it turned out, this definition is equivalent to a number of seemingly distinct definitions, such as the inability for any gambler to exploit properties of the sequence to make a profit, or that all prefixes of the random sequence should be nearly incompressible (Terwijn, 2016). For this last definition, we must invoke Kolmogorov complexity, a notion of compressibility and a key concept in this paper. Definition 1 (Prefix Kolmogorov complexity (Kolmogorov, 1968; Chaitin, 1975))Fix a universal prefix-free Turing machineU. The (prefix) Kolmogorov complexity of a finite binary string xisK(x) = min{ |p|:U(p) =x}. That is,K(x)is the length of the shortest self-delimiting program (a program which also encodes its length) that outputsxand halts. The conditional complexityK(x|y) is the length of the shortest program that outputsxand halts when providedyas input. Due to the universality of Turing machines, the Kolmogorov complexity for two Turing machines (or programming languages)U1andU2differ by at most a constant,|KU1(x)−KU2(x)|≤C, where the constantCdepends only onU1,U2, but not onx(Li et al., 2008). Definition 2 (Martin–Löf random sequence (Martin-Löf, 1966))An infinite sequence x1:∞∈ {0,1}Nis Martin–Löf random iff there exists a constantcsuch that for alln,K(x1:n)≥n−c. Using this criterion, all computable randomness tests are condensed into a single incomputable randomness test concerning Kolmogorov complexity. 5
Highlight & Ask
Select any part of the paper to ask specific questions
Add Context
Type @ to reference other papers and expand the discussion
Additional
• Explore the paper's Github implementation
• See how others cite this work
• Literature reviews
• Community context
Try asking "What's the intuition behind section 3.2?"