Your Transformer is Secretly an EOT Solver
Elon Litman
Nine months ago, or was it ten? I've learned not to trust my sense of time when deep in a problem. Anyone faced with a major deadline or enveloped in the mire of one particular question for an extended period intuitively understands that Time becomes something else then. Not the measured progression of hours, but rather a kind of thickness, a medium through which one moves with increasing difficulty, like a swimmer who has ventured too far from shore and finds the water has become viscous, resistant.
I was doing what I had seen others do, what perhaps I had been doing all along without quite admitting it to myself: trying to invent yet another subquadratic attention mechanism. Yes, another one. As if the world needed more of these thingsThis is perhaps too cynical. If you are the future inventor of FlashAttention 4, thank you. The world does need efficient implementations, and even asymptotic improvements in memory complexity, but this is best measured in reducing constant factors rather than shaving
But mine, of course, was going to be different. Mine was going to be new and quirky. Mine was going to reflect practical considerations; real constants and implementation details rather than mere Big-O algorithmic obscurantism.
The idea involved an approximation. This was to be an approximation that became asymptotically accurate as
Why Asymptotically Accurate Subquadratic Attention is Impossible
Let me explain the nature of that unanticipated problem I mentioned. It is mathematically impossible for any subquadratic attention mechanism to be asymptotically accurate on all inputs.
The intuition is a
More formally, define a subquadratic algorithm as one computing
By subquadraticity, the average number of keys inspected per query is
The contradiction arises from the global normalization inherent in softmax: computing accurate attention weights requires complete information about all query-key similarities in each row. A subquadratic algorithm cannot provide this complete information for all rows simultaneously, making asymptotic accuracy impossible.
So I walked away. I came back. I walked away again. And then, I had a thought. The thought was this: perhaps there is no way forward through brute force. Perhaps the only way through is to find some hidden mathematical structure, some symmetry or equivalence that's been sitting there all along, waiting for someone to notice it.
That thought, as it happens, was correct. Which is how I ended up writing this paperThis blog post is a companion to my arXiv paper "Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport" (arXiv:2508.08369). That paper extends this analysis to show that the backward pass gradient is mathematically identical to an advantage-based policy gradient, and demonstrates how the EOT formulation induces a specific information geometry characterized by the Fisher Information Matrix. The unified view reveals attention as a mechanism where the forward pass performs optimal inference and the backward pass implements a manifold-aware learning update.. The paper, to be clear, has nothing to do with subquadratic attention. So I'm telling the story here: how trying to build something uninteresting led, through impossibility, to discovering something genuinely interesting: an exact equivalence between scaled dot-product attention and entropic optimal transport.
Attention as Entropic Optimal Transport
We will demonstrate the capstone result of the paper: that the scaled dot-product attention weights for a collection of queries are equivalent to the unique solution of a particular one-sided entropic optimal transport problem. Specifically, one where the costs are given by the negative key-query dot products, the source space is equipped with the counting measure, and the target space is left unconstrained.
We begin by constructing this optimal transport problem. Consider the standard setup for computing attention weights for a set of
These attend over a collection of
Our objective is to formulate an optimal transport problem whose unique solution is exactly the attention weight matrix
Each row of
The formulation proceeds by specifying the source and target index sets, the measures on them, the feasible couplings, the cost, and the entropic regularizationThe entropic regularization of optimal transport, which makes the problem computationally tractable, was introduced by Marco Cuturi in his seminal 2013 paper "Sinkhorn Distances: Lightspeed Computation of Optimal Transport." This regularization trades exact optimality for numerical stability and scalability..
Query source space and measure
The source index set is
Key target space
The target index set is
A critical choice is how to treat the target marginal
Transport plans and feasibility
A transport plan is a nonnegative matrix
for all
This row-sum constraint applies for all
Cost and entropy
We need a transport cost
The total cost of a plan
so minimizing cost is the same as maximizing the similarity-weighted sum.
To entropically regularize the problem, we use the matrix Shannon entropy
with the convention
Objective and constrained problem
Combine cost and negative entropy to form
Substituting the expressions for cost and entropy gives
The generalized, one-sided entropic optimal transport problem that matches full scaled dot-product attention is
Existence and Uniqueness (theorem only)
Let the source be
Here too, the row-sum constraint holds for all
For any
attains a unique minimum
EOT solution equals the attention matrix
Having the existence and uniqueness of the minimizer for the generalized entropic optimal transport problem over all queries and keys, we now derive its analytical form and show it equals the scaled dot-product attention matrix.
Let
Consider the one-sided entropic OT problem
with
Proof
We minimize
Take the partial derivative with respect to an arbitrary
Solve for
Let
Enforce the row-sum constraint to determine
Substitute back:
Set
So the unique minimizer
The Discovery
I should also mention that this is not, in the strictest sense, a full transport problem. In classical optimal transport, you fix both where the mass starts and where it ends up. But here, we only constrain the sources. Each query emits exactly one unit (a Dirac unit impulse) of attention, and we let the optimization decide how that mass gets distributed across the keys. There are no constraints on the target side. This makes it a one-sided transport problem, which turns out to be exactly what attention needs. If we had fixed both marginals, we would have gotten something else entirely, something that wouldn't match the softmax structure at all.
Memory is unreliable, especially when it comes to invention. I don't remember exactly the precise
Apparently this is not uncommon. It reminds me of Unix. Brian Kernighan admits he coined "Unics" as a pun during the early Bell Labs days, but even he and his collaborators couldn't determine exactly who decided on the final spelling "Unix." Ritchie and McIlroy credited Kernighan; Peter Neumann recalled that Bell Labs PR likely influenced the change to avoid the "eunuchs" joke. No one remembers. (Wikipedia)
What happens is that thinking is computationally expensive: you explore dead ends, trace connections that lead nowhere, hold contradictory possibilities simultaneously. When you arrive at a solution, you amortize the cost by compressing the search path into the final answer and discarding intermediate states. This is your mind pruning its search history after convergence, like AlphaGo discarding its tree search after the move is chosen. The moment you see it, the path to seeing it disappears. Once the connection is made, the confusion that preceded it becomes incomprehensible because the information-theoretic cost of preserving that confusion exceeds any utility it might have held.