We're hiring
PaperBlogResources
/ 65

From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence

From Entropy to Epiplexity: Rethinking Information for
Computationally Bounded Intelligence
Marc Finzi∗1 Shikai Qiu∗2 Yiding Jiang∗1 Pavel Izmailov2 J. Zico Kolter1
Andrew Gordon Wilson2
1Carnegie Mellon University 2New York University
Abstract
Can we learn more from data than existed in the generating process itself? Can new and useful
information be constructed from merely applying deterministic transformations to existing data?
Can the learnable content in data be evaluated without considering a downstream task? On these
questions, Shannon information and Kolmogorov complexity come up nearly empty-handed, in part
because they assume observers with unlimited computational capacity and do not target the useful
information content. In this work, we identify and exemplify three seeming paradoxes in information
theory: (1) information cannot be increased by deterministic transformations; (2) information is
independent of the order of data; (3) likelihood modeling is merely distribution matching. To shed
light on the tension between these results and modern practice, and to quantify the value of data,
we introduce epiplexity†, a formalization of information capturing what computationally bounded
observers can learn from data. Epiplexity captures the structural content in data while excluding
time-bounded entropy, the random unpredictable content exemplified by pseudorandom number
generators and chaotic dynamical systems. With these concepts, we demonstrate how information
can be created with computation, how it depends on the ordering of the data, and how likelihood
modeling can produce more complex programs than present in the data generating process itself.
We also present practical procedures to estimate epiplexity which we show capture differences
across data sources, track with downstream performance, and highlight dataset interventions that
improve out-of-distribution generalization. In contrast to principles of model selection, epiplexity
provides a theoretical foundation for data selection, guiding how to select, generate, or transform
data for learning systems.
1 Introduction
As AI research progresses towards more general-purpose intelligent systems, cracks are beginning
to show in mechanisms for grounding mathematical intuitions. Much of learning theory is built
around controlling generalization error with respect to a given distribution, treating the training
distribution as fixed and focusing optimization effort on the choice of model. Yet modern systems
are expected to transfer across tasks, domains, and objectives that were not specified at training
time, often after large-scale pretraining on diverse and heterogeneous data. In this regime, success or
failure frequently hinges less on architectural choices than on what data the model was exposed to in
the first place. Pursuing broad generalization to diverse out-of-distribution tasks forces a shift in
perspective: instead of treating data as given and optimizing for in-distribution performance, we need
to choose and curate data to facilitate generalization to unseen tasks. This shift makes the value of
data itself a central question—how much usable, transferable information can a model acquire from
training? In other words, instead of model selection, how do we perform data selection? On this
question, existing theory offers little guidance and often naively contradicts empirical observations.
*Equal contribution.
†Code available at https://github.com/shikaiqiu/epiplexity.
1
arXiv:2601.03220v2 [cs.LG] 16 Mar 2026
Entropy
Epiplexity
Computation
Information can be created by computationRandom vs structural information
Structural information OOD generalization
Entropy
Epiplexity
Entropy
Epiplexity
OOD performance
Initial
Condition Emergent
Structure
Apparent
Randomness
Deterministic
Dynamics
API_KEY = "sk_7aF2jK1ycP9LmvYzz34"
USER_ID = "usr_4f8a2c1e9b7d3065"
BUCKET = "s3://data-8a3f1b-west-prod"
SAVE_DIR = "/mnt/marc/exp_7f2a/ckpts"
SAVE_CKPT = True
DEBUG = False
SEED = 9284715
...
High random info, low structural info
Low random info, low structural info
def is_even(n):
if n == 0: return True
elif n == 1: return False
elif n == 2: return True
elif n == 3: return False
elif n == 4: return True
elif n == 5: return False
...
Reuse
shared circuits,
subprograms,
...
Moderate random info, high structural info
def dijkstra(g, s):
D = defaultdict(lambda:float('inf'))
D[s] = 0; q = [(0, s)]
while q:
d, u = pop(q)
if d == D[u]:
for v, w in g.get(u, []):
if (nd := d + w) < D[v]:
D[v] = nd; push(q, (nd, v))
return DFigure 1: Random vs structural information for computationally bounded observers. (Left)
Illustration of random vs structural information of different data for computationally bounded observers,
which we formalize with time-bounded entropy and epiplexity (Section 3) and can be estimated from loss
curves of neural networks trained on that data (Section 4). (Top Right) Unlike other forms of information,
time-bounded entropy and epiplexity can be increased through computational processes, such as simulating
dynamical systems (cellular automation, Lorenz equations) and interventions like changing the data ordering,
which can produce apparent randomness but also learnable, emergent structures like gliders and the Lorenz
attractor invariant measure (Section 5). (Bottom Right) Whereas time-bounded entropy captures the
in-distribution randomness and unpredictability, epiplexity measures the amount of structural information
the model extracts from the data to its weights, which can be useful for OOD tasks such as by reusing learned
circuits shared between the in-distribution and OOD tasks.
Consider synthetic data, crucial for further developing model capabilities (Abdin et al., 2024; Maini
et al., 2024) when existing natural data are exhausted. Existing concepts in information theory
like the data processing inequality appear to suggest that synthetic data adds no additional value.
Questions about what information is transferred to a given model seem naturally within the purview
of information theory, yet, quantifying this information with existing tools proves to be elusive. Even
basic questions, such as the source of the information in the weights of an AlphaZero game-playing
model (Silver et al., 2018), are surprisingly tricky to answer. AlphaZero takes in zero human data,
learning merely from the deterministic rules of the game and the AlphaZero RL algorithm, both of
which are simple to describe. Yet the resulting models achieve superhuman performance and are
large in size. To assert that AlphaZero has learned little to no information in this process is clearly
missing the mark, and yet both Shannon and algorithmic information theory appear to say so.
In this paper, we argue that the amount of structural information a computationally bounded
observer can extract from a dataset is a fundamental concept that underlies many observed empirical
phenomena. As we will show, existing notions from Shannon and algorithmic information theory are
inadequate when forced to quantify this type of information. These frameworks often lend intuitive or
mathematical support to beliefs that, in fact, obscure important aspects of empirical phenomena. To
highlight the limitations of classical frameworks and motivate the role of computational constraints
in quantifying information, we identify and demonstrate three apparent paradoxes: statements which
can be justified mathematically by Shannon and algorithmic information theory, and yet are in
tension with intuitions and empirical phenomena.
2
Paradox 1: Information cannot be increased by deterministic processes. For both Shannon
entropy and Kolmogorov complexity, deterministic transformations cannot meaningfully
increase the information content of an object. And yet, we use pseudorandom number
generators to produce randomness, synthetic data improves model capabilities, mathematicians
can derive new knowledge by reasoning from axioms without external information, dynamical
systems produce emergent phenomena, and self-play loops like AlphaZero learn sophisticated
strategies from games (Silver et al., 2018).
Paradox 2: Information is independent of factorization order. A property of both Shannon
entropy and Kolmogorov complexity is that total information content is invariant to factoriza-
tion: the information from observing first X and then Y is the same as observing Y followed
by X. On the other hand, LLMs learn better on English text ordered left-to-right than reverse
ordered text, picking out an “arrow of time” (Papadopoulos et al., 2024; Bengio et al., 2019),
and we have cryptography built on the existence of functions that are computationally hard
to predict in one direction and easy in another.
Paradox 3: Likelihood modeling is merely distribution matching. Maximizing the likelihood
is often equated with matching the training data generating process: the true data-generating
process is a perfect model of itself, and no model can achieve a higher expected likelihood. As
a consequence, it is often assumed that a model trained on a dataset cannot extract more
structure or learn useful features that were not used in generating the data. However, we
show that a computationally-limited observer can in fact uncover much more structure than is
in the data generating process. For example, in Conway’s game of life the data are generated
via simple programmatic rules that operate on two-dimensional arrays of bits. Applying these
simple rules sequentially, we see emergent structures, such as different species of objects that
move and interact in a predictable way. While an unbounded observer can simply simulate
the evolution of the environment exactly, a computationally bounded observer would make
use of the emergent structures and learn the different types of objects and their behaviors.
The tension between these theoretical statements and empirical phenomena can be resolved by impos-
ing computational constraints on the observer and separating the random content from the structural
content. Drawing on ideas from cryptography, algorithmic information theory, and these unexplained
empirical phenomena, we define a new information measure, epiplexity (epistemic complexity),
which formally defines the amount of structural information that a computationally bounded observer
can extract from the data (Section 3, Definition 8). Briefly, epiplexity is the information in the model
that minimizes the description length of data under computational constraints. A simple heuristic
measurement is the area under the loss curve above the final loss, while a more rigorous approach
uses the cumulative KL divergence between a teacher and student model (Section 4, Figure 2).
Our definitions capture the intuition that an object contains both random, inherently unpredictable
information (entropy), and predictable structured information that enables observers to generalize
by identifying patterns (epiplexity). In Figure 1 (left) we illustrate this divide. In the top row, we
have highly redundant and repetitive code and simple color gradients, which have little information
content, be it structural or random. In the middle row, we have the inner workings of an algorithm
and pictures of animals, showing complex, long-range interdependencies between the elements from
which a model can learn complex features and subcircuits that are helpful even for different tasks.
In contrast, on the bottom, we have random data with little structure: configuration files with
randomly generated API keys, file paths, hashes, arbitrary boolean flags have negligible learnable
content and no long-range dependencies or complex circuits that result from learning on this task.
Similarly, uniformly shuffled pixels from the animal pictures have high entropy but are fundamentally
unpredictable, and no complex features or circuits arise from training on these data.
3
An essential property of our formulation is that information is observer dependent: the same
object may appear random or structured depending on the computational resources of the observer.
For instance, the output of a strong pseudorandom generator appears indistinguishable from true
randomness to any polynomial-time observer lacking the secret key (seed), regardless of the algorithm
or function class. In other situations, such as chaotic dynamical systems, both apparently random
behavior is produced along with structure: the state of the system cannot be predicted precisely over
long time-scales, but such observers may still learn meaningful predictive distributions, as shown by
the invariant measure in Figure 1 (top right).
Models trained to represent these distributions are computer programs, and substructures within
these programs, like circuits for performing specific tasks, or induction heads (Olsson et al., 2022),
can be reused even for seemingly unrelated data. This view motivates selecting high epiplexity data
that induces more structural information in the model, since these structures can then be reused for
unseen out-of-distribution (OOD) tasks, as illustrated in Figure 1 (bottom right). We emphasize,
however, that epiplexity is a measure of information, not a guarantee of OOD generalization to
specific tasks. Epiplexity quantifies the amount of structural information a model extracts, while
being agnostic to whether these structures are relevant to a specific downstream task.
To build intuition, we explore a range of phenomena and provide experimental evidence for behaviours
that are poorly accounted for by existing information-theoretic tools, yet naturally accommodated
by epiplexity. We show that information can be created purely through computation, giving insights
into synthetic data (subsection 5.1). We examine how certain factorizations of the same data can
increase structural information and downstream OOD performance—even as they result in worse
training loss (subsection 5.2). We show why likelihood modeling is more than distribution matching,
identifying induction and emergence as two settings where the observer can learn more information
than was present in the data generating process (subsection 5.3). By measuring epiplexity, we can
better understand why pre-training on text data transfers more broadly than image data, and why
certain data selection strategies for LLMs are empirically successful (Section 6). Together, our
results provide clarity on the motivating questions: the information content of data can be compared
independently of a specific task, new information can be created by computation, and models can
learn more information than their generating processes contain.
In short, we identify a disparity between existing concepts in information theory and modern practice,
embodied by three apparent paradoxes, and introduce epiplexity as a measurement of structural
information acquired by a computationally bounded observer to help resolve them. We formally
define epiplexity in Section 3 (Definition 8) and present measurement procedures in Section 4. In
Section 5, we show how epiplexity and time-bounded entropy shed light on these paradoxes, including
induction and emergent phenomena. Finally, in Section 6, we demonstrate that epiplexity correlates
with OOD generalization, helping explain why certain data enable broader generalization than others.
2 Background
In order to define the interesting, structural, and predictive component of information, we must
separate it out from random information—that which is fundamentally unpredictable given the
computational constraints of the observer. Along the way, we will review algorithmic randomness
as developed in algorithmic information theory as well as notions of pseudo-randomness used in
cryptography, and how these concepts crucially depend on the observer.
4
2.1 What Does it Mean for An Object to Be Random?
Random Variables and Shannon Information. Many common intuitions about randomness
start from random variables and Shannon information. A random variable defines a map from a
given measurable probability space to different outcomes, with probabilities corresponding to the
measure of the space that lead to a certain outcome. Shannon information assigns to each outcome
x a self-information (or surprisal) log 1/P (x) based on the probability P , and an entropy for the
random variable H(X) = E[log 1/P (X)], which provides a lower bound on the average code length
needed to communicate samples to another party (Shannon, 1948). In Shannon’s theory, information
comes only from distributions and random variables—objects that are not random must contain no
information. As a result, non-random information is seemingly contradictory, and thus we must draw
from a broader mathematical perspective to describe such concepts.
In the mid 1900s, mathematicians were interested in formalizing precisely what it means for a given
sample to be a random draw from a given distribution, to ground the theory of probability and
random variables (Shafer and Vovk, 2006). A central consideration involves a uniformly sampled
binary sequence u1:∞ from which other distributions of interest can be constructed. This sequence
can also be interpreted as the binary expression of a number [0, 1). Intuitively, one might think that
all sequences should be regarded as equally random, as they are all equally likely according to the
probability distribution: 1111111 . . . has the same probability mass as 10011101 . . . and also the same
self-information. However, looking at statistics on these sequences reveals something missing from this
perspective; from the law of large numbers, for example, it must be that limN →∞ 1
N
PN
i=1 ui = 0.5,
which is clearly not satisfied by the first sequence of 1s.
Martin-Löf Randomness: No algorithm exists to predict the sequence. Initial attempts
were made to formalize randomness as sequences which pass all statistical tests for randomness, such
as the law of large numbers for selected substrings. However, under such definitions all sequences
fail to be random since tests like u1:∞̸ = y1:∞ for any particular sequence y must also be included
(Downey and Hirschfeldt, 2019). The solution to these issues was found by defining random sequences
not as those that pass all tests of randomness, but those that pass all computable tests of randomness,
in a formalization known as Martin-Löf randomness (Martin-Löf, 1966). As it turned out, this
definition is equivalent to a number of seemingly distinct definitions, such as the inability for any
gambler to exploit properties of the sequence to make a profit, or that all prefixes of the random
sequence should be nearly incompressible (Terwijn, 2016). For this last definition, we must invoke
Kolmogorov complexity, a notion of compressibility and a key concept in this paper.
Definition 1 (Prefix Kolmogorov complexity (Kolmogorov, 1968; Chaitin, 1975)) Fix a
universal prefix-free Turing machine U. The (prefix) Kolmogorov complexity of a finite binary string
x is K(x) = min{ |p|: U(p) = x }. That is, K(x) is the length of the shortest self-delimiting program
(a program which also encodes its length) that outputs x and halts. The conditional complexity K(x|y)
is the length of the shortest program that outputs x and halts when provided y as input.
Due to the universality of Turing machines, the Kolmogorov complexity for two Turing machines (or
programming languages) U1 and U2 differ by at most a constant, |KU1 (x) − KU2 (x)|≤ C, where the
constant C depends only on U1, U2, but not on x (Li et al., 2008).
Definition 2 (Martin–Löf random sequence (Martin-Löf, 1966)) An infinite sequence
x1:∞ ∈ {0, 1}N is Martin–Löf random iff there exists a constant c such that for all n, K(x1:n) ≥ n−c.
Using this criterion, all computable randomness tests are condensed into a single incomputable
randomness test concerning Kolmogorov complexity.
5
Highlight & Ask

Select any part of the paper to ask specific questions

Add Context

Type @ to reference other papers and expand the discussion

Additional
• Explore the paper's Github implementation

• See how others cite this work

• Literature reviews

• Community context

Try asking "What's the intuition behind section 3.2?"