The "Mismanaged Geniuses" Hypothesis
tldr; AI models are already good enough for the next leap in capabilities.
By: Alex Zhang (@a1zhang), Zhening (Zed) Li (@zli11010), Omar Khattab (@lateinteraction).
For the last decade, scaling the size and data of AI models has led to groundbreaking, super-human achievements in the capabilities of these systems. The recent success of RL and reasoning in particular implies that models can be trained to generalize on tasks we have never even solved ourselves. It is natural to believe that continuing this trend of scaling across a single neural model will be the recipe that gets us to the next jump in AI capabilities.
We have an alternate hypothesis on what will take us to the next inflection point of AI systems.
It can be said that frontier language models (LMs) are “geniuses” at solving the broad range of tasks they’ve been trained on. Nowadays, this covers virtually all the advanced subjects and content we learn throughout higher education to prepare ourselves for researching unsolved problems. Yet despite the fact that these models outperform even the brightest humans at the hardest competitions, like the IMO and IOI, and are super-human at general software engineering, they oddly also struggle to reliably tackle long-horizon and iterative reasoning problems that may seem “easy” to us. It is an interesting thought experiment to consider whether this is an inherent limitation of the LMs or of the way in which we use them.
The mismanaged geniuses hypothesis (MGH) posits that existing frontier language models are severely underutilized due to sub-optimal use of individual language model calls. We believe that the next leap in “language model” capabilities will come not from continued scaling of existing LMs, but from enabling language models to “manage” themselves, i.e. natively decompose tasks and act on these decompositions. In particular, we believe that existing systems that let LMs decompose tasks are the limiting bottleneck, and the first step would be to define the space of decompositions the LM has access to. Once this space of decompositions is figured out, the “bitter-lesson”-pilled allocation of compute would go towards training models to perform the correct decompositions.

You and I are not good managers.

It is worth articulating the “mismanagement” of language models.
Nearly all modern agent scaffolds are human-engineered, task-specific decomposition strategies that use language models. These systems rely on our intuition about how individual language model calls can be used together to solve a larger problem, and are often brittle with respect to different models and different problems. The outcome is a diverse set of agent scaffolds that can only solve narrow problems and must frequently be updated, leading to a misrepresentation of how good language models “actually are” at any given time. As an example, is it really true that frontier language models cannot play certain video games at a human level, or is it just that we haven’t put in the effort to build a good scaffold around them?
Coding agents like Claude Code are a first step in enabling the language model itself to decompose a problem into sub-tasks, then launch subagents to solve each sub-task. In these “orchestrator-subagent” systems, the orchestrator LM outputs a rough plan of how it's going to go about solving a task, and then executes this plan using subagents. They have been shown to work extremely well for general human-like workflows (e.g. for software engineering). Furthermore, it turns out that the plans these models generate tend to be intuitive and easy to describe: the model does not need to know the exact solution to a problem to outline how it may go about decomposing it!
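The orchestrator-subagent loop described above can be sketched in a few lines. This is only a minimal illustration and not how Claude Code or any real agent is implemented: the names `LM`, `orchestrate`, and `fake_lm` are hypothetical, the `PLAN:` / `MERGE:` prompt prefixes are made up for this example, and the "model" is a deterministic stub so the sketch runs without any API.

```python
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LM = Callable[[str], str]


def orchestrate(task: str, orchestrator: LM, subagent: LM) -> str:
    """Ask the orchestrator for a plan, run one subagent per step, then merge."""
    # 1. The orchestrator outputs a rough plan, one sub-task per line.
    plan: List[str] = [
        line.strip()
        for line in orchestrator(f"PLAN: {task}").splitlines()
        if line.strip()
    ]
    # 2. Each sub-task goes to a separate subagent call with its own prompt.
    results = [subagent(step) for step in plan]
    # 3. The orchestrator combines the sub-results into a final answer.
    return orchestrator("MERGE: " + " | ".join(results))


def fake_lm(prompt: str) -> str:
    """Deterministic stand-in for a real model (an assumption for illustration)."""
    if prompt.startswith("PLAN: "):
        return "write tests\nimplement feature\nupdate docs"
    if prompt.startswith("MERGE: "):
        return "done: " + prompt[len("MERGE: "):]
    return f"[{prompt}] ok"


if __name__ == "__main__":
    print(orchestrate("add a login page", fake_lm, fake_lm))
```

Note that the plan here is a flat list that a human-written loop executes; the orchestrator cannot express anything richer than "these steps, once each".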
The success of these more general scaffolds like Claude Code, OpenClaw, Hermes Agent, etc. suggests that LMs are perfectly capable of managing other LMs to solve longer-horizon tasks. It is natural to ask, however, whether the “orchestrator-subagent” scaffold is sufficient for longer-running tasks. Recent works like Recursive Language Models (RLMs) propose a more expressive mechanism for describing “plans”: code execution with recursive sub-calls / tools as functions, which enables fully recursive task decomposition. In particular, RLMs show how expanding the space of decompositions used to manage LM sub-calls beyond API-based tool calling unlocks length generalization capabilities for LMs.
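To give a feel for what "fully recursive task decomposition" can look like, here is a toy sketch. It is not the RLM implementation or its API: in an RLM the model itself writes the decomposition as code, whereas here the recursion is hard-coded by hand. `WINDOW`, `leaf_call`, and `recursive_lookup` are hypothetical names, and the leaf "LM call" is a stub that simply refuses any input longer than its window.

```python
from typing import List, Optional

WINDOW = 200  # assumed per-call budget, in characters (illustrative only)


def leaf_call(key: str, records: List[str]) -> Optional[str]:
    """Stub for a single LM call: it only ever accepts inputs within WINDOW."""
    text = ";".join(records)
    assert len(text) <= WINDOW, "leaf call saw an out-of-window input"
    for record in text.split(";"):
        name, _, value = record.partition("=")
        if name == key:
            return value
    return None


def recursive_lookup(key: str, records: List[str]) -> Optional[str]:
    """Split the input until each piece fits one call, then combine answers."""
    if len(";".join(records)) <= WINDOW:
        return leaf_call(key, records)  # base case: fits in one call
    if len(records) == 1:
        raise ValueError("a single record exceeds the window")
    mid = len(records) // 2
    left = recursive_lookup(key, records[:mid])   # recursive sub-call
    right = recursive_lookup(key, records[mid:])  # recursive sub-call
    return left if left is not None else right


if __name__ == "__main__":
    haystack = [f"k{i}=v{i}" for i in range(10_000)]  # far larger than WINDOW
    print(recursive_lookup("k7777", haystack))
```

The input as a whole is hundreds of times larger than the window, yet no individual call ever sees more than `WINDOW` characters.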
Whether it be RLMs, coding agents, or undiscovered systems, a key unknown is the right general scaffold to train over that fully enables LMs to properly manage LMs.

Using composition to get around the out-of-distribution (OOD) problem.

So where do we go from here, and how can we fix the “mismanagement” issue?
To preface, it is well known that neural network language models have a generalization problem. Rather unsurprisingly, they naturally struggle to generalize to longer lengths (i.e. context rot) and low-resource tasks (e.g. as of the time of writing, writing GPU kernels on Blackwell).
One interpretation of the mismanaged geniuses hypothesis is that within the bounds of what is considered “in-distribution” for frontier language models, there already exists a powerful general “language model” system that can solve OOD problems in which its individual LM calls only see in-distribution inputs. Based on our intuition for scaffolds that currently work (e.g. Claude Code, RLMs, etc.), this loosely involves decomposing tasks into sub-tasks that the LM can solve, where the act of “decomposing the task” itself must also be an “in-distribution” task for the LM!
More generally, composition is an efficient way to solve OOD tasks in a learning-based system that is sufficiently capable. To be specific, the MGH posits that modern LMs are so good, yet so expensive to train further, that directly learning the operator that composes LMs is a significantly more efficient strategy for reaching these OOD tasks than continuing to scale current LMs.
Assuming the MGH is actually true, we believe there are two main research / engineering directions in creating these systems:
  1. Defining “decomposition”. Defining the space of decompositions the LM is allowed to express is important for ensuring the individual LM calls stay “in-distribution”. How we define “decomposition” has an exponentially large impact (with respect to depth) on the tasks solvable via decomposition. In long-context tasks, for example, tool-call-style subagents prevent the root LM from decomposing the context into arbitrarily many chunks, inhibiting its ability to scale. In RLMs, the space of decompositions is expanded so as to allow an efficient representation of decomposition into arbitrarily many subtasks (e.g. using a for loop), which suddenly enables the system to handle near-infinite context. Similarly, simple expansions to the space of decompositions, compounded by the effect of recursion, may suddenly unlock generalization to near-infinite long-horizon tasks, self-improvement through near-infinite in-context learning, and more.
  2. Training and scaling the ability to compose. LMs need to be trained to correctly decompose tasks under any given scaffold, but the correct decompositions are likely already within the distribution of what LMs can generate. As an example, we examine MRCRv2 with 1M-token context and 8 needles, a commonly reported long-context benchmark for frontier models. We find that while RLM(Qwen3-4B-Instruct) solves nearly 0% of the tasks, it reaches 100% after RL training on only a significantly simpler setting (32k context, 1 needle). Despite being a small model, it learns the correct, generalizing decomposition purely through its own rollouts.
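The "exponential with respect to depth" remark in the first point can be made concrete with some back-of-the-envelope arithmetic. The model below is an illustrative assumption, not a measurement: each call reads at most `window` tokens and may spawn at most `fan_out` sub-calls, so `depth` levels of nesting cover `window * fan_out ** depth` tokens. The function names and the specific fan-out values are hypothetical; the 32k and 1M figures are borrowed from the benchmark settings above only as convenient numbers.

```python
def max_context_tokens(window: int, fan_out: int, depth: int) -> int:
    """Tokens coverable when leaf calls each read `window` tokens and every
    non-leaf call spawns `fan_out` sub-calls, nested `depth` levels deep."""
    return window * fan_out ** depth


def depth_needed(total: int, window: int, fan_out: int) -> int:
    """Smallest nesting depth whose coverage reaches `total` tokens."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2 for coverage to grow")
    depth = 0
    while max_context_tokens(window, fan_out, depth) < total:
        depth += 1
    return depth


if __name__ == "__main__":
    for fan_out in (2, 8, 32):
        d = depth_needed(1_000_000, 32_000, fan_out)
        print(f"fan_out={fan_out:>2}: depth {d} covers "
              f"{max_context_tokens(32_000, fan_out, d):,} tokens")
```

Under these assumptions, a scaffold capped at 2 sub-calls per step needs 5 levels of nesting to cover 1M tokens, one capped at 8 needs 2, and one that can loop over 32 chunks needs a single level. This is the sense in which the allowed fan-out, and not just the model, determines what is reachable.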
An exciting corollary of this hypothesis is that most of the behavior the model needs to learn during pre-training and mid-training is likely already there. Given a sufficiently well-designed scaffold that supports composition (e.g. RLMs), training such a system through bootstrapping may be enough to draw out a general task-solving system.
Language models have gotten to the point where they’re ridiculously powerful, and the bottlenecks to creating fancy things like long-horizon solvers or self-improving systems seem sort of silly (e.g. is length generalization really a bottleneck?). Should the MGH be true, the problem that remains is managing the geniuses (with guardrails, of course).
Acknowledgements. We thank Armando Solar-Lezama and Matthew Ho for helpful feedback.

Yeah ever since i started training and evaluating autoregressive Transformers, I've been telling folks "the current models are already extremely good. It's all in there. But we still suck at getting it out of them" I like your phrasing here a lot more than mine though :)
Yeah I think genuinely there's so much we don't squeeze out of models, and while I'm def not a believer that they solve everything and all the world's problems rn, we often clearly underestimate how good they actually are
Sounds interesting but without clear criteria for what makes a “good manager” (beyond final metrics, otherwise it would be just a proxy of the benchmarks), the idea is unfalsifiable. You can always claim failures come from poor “management” for any model
There really isn't a particular criteria (we basically care about the raw performance), in some sense the claim is that the way in which we have done "management" (i.e. designing our own agents around models and then claiming these agents cannot solve XYZ tasks ==> these models
funnily enough gemini plays pokemon and the progress of the scaffolds was an example I originally wanted to include in the blog but felt it would take too much time to explain to someone who wasn't familiar with it probably one of my favorite examples of how models actually can
excellent article and strong agree I’m curious, do you think task-decomposition focused training is a necessary pre-req here or mostly an efficiency / optimization thing getting the sense that the new frontier of models will be naturally much stronger at this out of the box
It’s sort of necessary beyond efficiency in that LMs are super sensitive to their inputs wrt what they were trained on I suspect though that the amount of training needed is a lot less than if you were to naively scale though!
This idea resonates with me a lot! One question that comes up when talking about decomposition is how fine-grained it needs to go. For example, on a task that may take days, do you have multiple depths of decomposition? What are your thoughts on that?
I think the key is that the decomposition doesn't need to be that fine-grained (although this doesn't mean it's not long or deep). This wasn't true in the past, but is more true now. I think when people think of "composition", they naturally think of very strict, symbolic
This is a great article. I'm not technical but I've always considered talking with an llm as like shining a torch around inside a vast, hyperdimensional knowledge cave. The quality of what the model can know depends on the torch illuminating the right things. Your proposal sounds
Frontier LLMs can function as effective orchestrators, but they still need initial guidance or creative sparks to operate efficiently. Maybe they can surpass us in raw idea generation and recombination.
A good manager should be effective at managing LMs and humans to cover the full space of real-world tasks it can solve through humans working with intelligence at its current capabilities.
Great read, thanks for sharing. It inspired me to resume a project I had started when DSPy first integrated RLM into the framework.
Very interesting idea. I’ve often wondered if we need models to get more capable through scaling or if we can get powerful AI by learning to use the models we have.

New must-read blog by on the future of language models. Buried nugget: doing GRPO for RLM-Qwen3-4B on short (32k token) and easy (single-needle) MRCRv2 long-context tasks generalizes *automatically* and with perfect (100%) reliability to 1M-token, 8-needle tasks!!
Another blogpost, probably my favorite so far! AI is really good, to the point where we think it’s severely underutilized. Give it a read and let me know what you think :) Website version (which personally I prefer): alexzhang13.github.io/blog/2026/mgh/