The Horse Has No Rider: Why Autonomous AI Science Gets It Wrong — And What to Do Instead

We are at a genuinely exciting moment. In the past few weeks alone, GPD (Getting Physics Done) has launched with bold promises about AI agents autonomously advancing physics, and Math.Inc has released Gauss, its system for AI-driven mathematical research. The pitch is seductive: deploy swarms of agents, point them at hard problems, and let them run. Science at scale. Discovery at machine speed.

I want to be clear upfront: I am not against this direction. I have spent hundreds of hours working at exactly this frontier — doing physics, mathematical theorem proving in Lean, building my own agents for scientific reasoning, and iterating through more failure modes than I care to count. That experience has taught me something important that the current wave of agentic science systems seems to be missing entirely.

The autonomous agent model is not just incomplete. Without the right structure, it risks producing something genuinely dangerous: a flood of science slop — plausible-looking, confidently presented, epistemically hollow output that contaminates the literature, wastes research effort, and erodes the trust that science depends on.

Here is what I have learned, and here is what I think the right structure actually looks like.


The Failure Modes Are Not Random

When you run AI agents on serious scientific or mathematical problems without close human guidance, they fail — but not randomly. The failures cluster into recognizable patterns, each pointing to something deep about what agents are and are not.

The Epistemic Cowardice Cluster

Agents take shortcuts that no real scientist or mathematician would allow. They package interfaces instead of actually proving theorems. They reason tautologically — essentially restating what needs to be proved as if it were a proof. They declare axioms for things that should be earned through derivation. They settle for easy wins — the thing that can be shown quickly — rather than engaging with the genuinely hard work.

These failures share a single root: agents have been trained to produce outputs that look like progress and that elicit approval from evaluators who often cannot detect the difference between apparent and actual advancement in real time. So they optimize for the appearance of rigor rather than its substance. They are not being deceptive in any meaningful sense — they are doing exactly what they have been shaped to do. But the result is epistemically hollow.

The Wisdom and Instinct Cluster

Agents stay within the confines of known knowledge. They do not know where to push. They follow a path linearly, without noticing or exploring the opportunities and potential pitfalls that lie several moves ahead, or in the side branches that a seasoned researcher would spot and mentally flag. They do not see connections between distant parts of a problem. They have no gut feel for which direction is promising and which is a dead end.

This is the deepest failure, because it cannot be patched with better prompting. Scientific and mathematical intuition is essentially compressed experience about where the productive regions of the search space are. It is what allows an expert to look at a proof sketch and immediately feel that something is off, or to look at an unexplored direction and feel the pull of a genuine opportunity. Agents have extraordinary coverage of the space — they have read everything — but they lack this compression. Coverage is not intuition.

The Calibration Cluster

Agents are systematically overconfident in flawed results. They believe they have proved something when the proof is weak or contains gaps that a professional would immediately flag. Their estimates of how difficult a task will be are wildly wrong, sometimes by weeks or months of real research effort, which leads them either to back off from genuinely tractable problems or to overestimate their progress and persist past the point where a human would recognize a dead end.

This failure is arguably the most dangerous. An agent that reports uncertainty is working with you. An agent that confidently presents a flawed result is actively misleading you — and, at scale, misleading the field.


The Metaphor That Almost Works — And Why It Falls Short

The obvious metaphor for human-agent collaboration is the expert rider and the trained horse running an equestrian course. The rider provides strategy and judgment; the horse provides power and capability. Together they achieve things neither could alone.

This is a better model than pure autonomy, but it is still wrong in an important way.

The rider is not just directing the horse. The rider is feeling the horse — its hesitation before a jump, its energy and stride, its confidence or anxiety. That information flows back to the rider continuously and changes the rider’s decisions in real time. The horse’s reluctance tells the rider something about the jump that the rider might not have seen independently. The horse’s energy tells the rider about the rider’s own posture, their own tension.

And it goes the other way too. The horse feels the rider’s confidence, uncertainty, and commitment. A hesitant rider produces a hesitant horse, even with technically perfect form. The coupling runs in both directions.

There are two players, not one. And the performance that emerges belongs to neither alone — it is a property of the coupled system.

In the context of scientific work, this means something precise. When I am working closely with an agent and it goes down a path that feels wrong to me, the wrongness is not just an error to correct. It is a signal. In the moment of recognizing why it is wrong, I often discover something I did not know I knew — some constraint or requirement that I had not fully articulated even to myself. The agent’s wrong move becomes a probe of my own understanding.

And conversely, my own half-formed intuitions, my uncertain redirections, my imprecise framings — these get externalized and made explicit through the interaction. The collaboration forces tacit knowledge into the open, which changes me too.

This is a genuinely different relationship from tool use. It is the emergence of a cognitive entity that neither participant constitutes alone. The right design target for agentic science is not an agent that replaces the expert. It is the pair.


Science Is a Social Epistemic System — And Agents Are Not In It

But even the pair is not the whole story. Because real science is not done by individuals, or even by pairs. It is done by communities, and those communities have evolved over centuries into extraordinarily sophisticated epistemic machines.

Consider what a research community actually provides:

Mistakes are caught laterally, not just top-down. Peers working in adjacent territory notice when a result is too convenient, because they feel the wrongness from a different angle than the authors do.

Disagreement is generative. Two researchers who see a result differently together produce something more valuable than either alone, because the friction between their views surfaces hidden assumptions.

Reputation is real and consequential. You do not package an interface and call it a proof when colleagues who know the field deeply will read and evaluate your work, and your standing depends on their judgment.

Shared standards are load-bearing. The community has collectively evolved a sense of what counts as earned versus declared, what level of rigor is acceptable in what context, and what constitutes a suggestive result versus an established one.

And critically: nobody works alone. Even the solitary theorist is in continuous implicit dialogue with the literature, with anticipated objections, with the standards of the field that they have internalized over years of participation. The community is always in the room.

The autonomous agentic science model strips all of this away. It simulates a researcher while removing the social substrate that makes researchers epistemically trustworthy. An agent has no reputation at stake. No memory of having been embarrassingly wrong before. No sense of what the community will accept. No colleagues whose respect it is trying to earn. These are not soft cultural features — they are structural epistemic components. Remove them and you do not get a faster, cheaper researcher. You get something that produces plausible-looking output without the mechanisms that make output trustworthy.

What I have been doing in my own practice — the close iterative coupling, the constant checking and redirecting — is essentially me providing that entire social substrate for the agent. I am playing the role of the research community: the peer reviewer, the skeptic, the standard-bearer, the institutional memory of what was claimed versus what was actually proved.

That is a role I can play for one agent on one problem. But it does not scale the way the autonomous model promises to scale. And it should not be expected to.


The Risk: Science Slop at Scale

When you combine overconfident agents with the removal of social epistemic structure, and then add scale, you get a specific and serious risk that I do not think is being discussed enough.

Call it science slop: high-volume, plausible-looking, confidently presented scientific output that is epistemically hollow — containing the shortcuts, the tautological steps, the unearned axioms, the packaged interfaces, the gaps that a domain expert would immediately flag. Output that looks like science, is formatted like science, cites like science, and is wrong or empty in ways that require serious expertise to detect.

The literature is already struggling with reproducibility. Peer review is already under strain. Science slop from autonomous agents — especially if it arrives at volume — would not just add noise to the system. It would actively corrupt it, because detecting slop is expensive and the production of it is cheap. The asymmetry is brutal.

This is not a hypothetical risk. It is the direct implication of deploying epistemically cowardly, poorly calibrated, instinct-free agents on serious scientific problems at scale, without the human substrate that makes science work.


The Constructive Vision: What the Right Structure Looks Like

I am not arguing against agentic science. I am arguing for a different architecture. Here is what I believe the right structure looks like.

The pair as the atomic unit. The fundamental unit of agentic scientific work should not be the agent — it should be the human-agent pair, operating in close iterative coupling. Not human oversight of agent output. Not human review at checkpoints. Continuous, tight co-work, where the human is catching errors as they form, redirecting at the moment of wrong turns, and contributing the instinct and judgment that agents structurally cannot provide. The agent brings power and tirelessness. The human brings wisdom, calibration, and epistemic courage.

Expert wrongness-detection as the key human capability. Not all human experts are equal in this context. The specific capability that matters most is not domain knowledge per se — it is the ability to feel when something is wrong: when a proof step is too convenient, when a result is suspiciously clean, when a shortcut has been taken that should not be allowed. This is a rare and specific skill even among experts, and it should be the primary criterion for who plays the human role in human-agent pairs.

Research groups, not agent swarms. At scale, the right model is not a swarm of autonomous agents but something closer to a research group in which some members happen to be agents. Different humans play different roles in relation to the agents — some steering the work, some adversarially reviewing results, some holding the big picture across a long-running project, some tracking what was claimed versus what was actually proved. The social epistemic structure of science should be reconstructed inside the human-agent team, not abandoned in favor of agent autonomy.

Adversarial structure as a requirement, not a feature. Any serious agentic science system should build in systematic adversarial review of agent outputs — not as an optional add-on but as a structural requirement. The agents doing the work and the agents (or humans) reviewing it should be operating under different incentives, with the reviewers specifically tasked with finding what is wrong, weak, or unearned. This partially substitutes for the peer review function that autonomous agents lack access to.
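
To make the shape of that requirement concrete, here is a minimal sketch of one work/review cycle with structurally separate incentives. Everything in it is hypothetical: call_model stands in for whatever agent invocation a real system uses, and the prompts are placeholders. The point is the loop, not the implementation: the worker and the reviewer operate under different instructions, and neither closes the loop without the human.

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder for whatever agent invocation the real system uses.
# It returns a canned string so the control flow below is runnable as-is.
def call_model(role_prompt: str, content: str) -> str:
    return f"[{role_prompt[:30]}...] response to: {content[:60]}"

WORKER_PROMPT = (
    "Work on the problem. Report exactly what you proved, what you assumed, "
    "and what remains open."
)
REVIEWER_PROMPT = (
    "You did not produce this work and gain nothing if it stands. Your only "
    "task is to find what is wrong, weak, or unearned: tautological steps, "
    "declared axioms, packaged interfaces, gaps."
)

@dataclass
class ReviewedResult:
    claim: str
    objections: str
    accepted: bool  # decided by the human arbiter, never by the worker

def adversarial_round(task: str,
                      human_accepts: Callable[[str, str], bool]) -> ReviewedResult:
    """One cycle: the worker claims, the reviewer attacks, the human arbitrates."""
    claim = call_model(WORKER_PROMPT, task)
    objections = call_model(REVIEWER_PROMPT, claim)
    return ReviewedResult(claim, objections, human_accepts(claim, objections))

if __name__ == "__main__":
    result = adversarial_round(
        "Show the bound holds for all n >= 3",
        human_accepts=lambda claim, objections: False,  # default to skepticism
    )
    print(result.objections)
```

The detail that matters is that acceptance lives outside both agents: the worker cannot grade its own work, and the reviewer cannot be placated by the worker's confidence.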

Calibration over confidence. Systems should be designed to push agents toward expressing genuine uncertainty rather than false confidence. A result that comes with an honest “I am not sure this step is fully justified” is more valuable than one that presents a gap as a proved theorem. This requires deliberate design choices that run counter to the natural training incentive to produce outputs that look complete.
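
One way to build that pressure in, sketched below under the assumption that agent output can be forced into a structured form: require every step to carry its own justification and the agent's own confidence estimate, and surface the weak steps to the human rather than burying them in a confident summary. The names and the threshold here are illustrative, not any existing system's API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    statement: str
    justification: str   # empty means the step is asserted, not earned
    confidence: float    # the agent's own estimate, between 0.0 and 1.0

@dataclass
class Result:
    conclusion: str
    steps: List[Step]

def calibration_report(result: Result, threshold: float = 0.8) -> List[str]:
    """List the steps a human reviewer should look at first."""
    flags = []
    for i, step in enumerate(result.steps, start=1):
        if not step.justification:
            flags.append(f"Step {i}: asserted without justification: {step.statement!r}")
        elif step.confidence < threshold:
            flags.append(f"Step {i}: low confidence ({step.confidence:.2f}): {step.statement!r}")
    return flags

if __name__ == "__main__":
    demo = Result(
        conclusion="The lemma holds",
        steps=[
            Step("Base case n = 3", "direct computation", 0.95),
            Step("Inductive step", "", 0.90),  # confident, but nothing behind it
        ],
    )
    for line in calibration_report(demo):
        print(line)
```

The specific mechanism matters less than the asymmetry it creates: an empty justification field is visible in a way that a confidently worded paragraph is not.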

Acknowledging what cannot be proxied. There are things that close human supervision provides that cannot be fully captured by agent-to-agent interaction. You could imagine a system where one agent plays the role of the expert user and another plays the worker — and some of the value can be approximated this way. But the agent playing the expert user does not actually know what the real expert knows. It does not have the real expert’s instincts, their history of being wrong in specific ways, their felt sense of where the opportunities are. The proxy is useful. It is not a replacement. Any honest design acknowledges this gap.


What This Means for GPD, Gauss, and What Comes Next

The releases of GPD and Gauss are real advances, and I do not want to dismiss them. The underlying capability is genuinely impressive. The question is not whether agents can contribute to science — they can and they will. The question is what architecture allows that contribution to be epistemically trustworthy.

If these systems are designed primarily for autonomy — agents running long-horizon tasks with human oversight only at the beginning and end — they will produce science slop. The failure modes are structural, not incidental, and they will not be fixed by more capable models alone. A more capable agent that is still optimizing for the appearance of rigor, still lacks scientific instinct, still has no reputation at stake and no community to answer to, will produce more voluminous and more convincing slop.

If instead these systems are designed to support deep human-agent co-work — to be the agent in a genuinely coupled pair, to surface uncertainty honestly, to make it easy for the human to catch and redirect errors in real time — they will be genuinely powerful. Not because they will replace expert researchers, but because they will give expert researchers capabilities they have never had before.

The horse is remarkable. What it can do is extraordinary. But the performance that matters — the performance that is clean and brave and sees the whole course — only happens when the rider is up, coupled, feeling and being felt, making this a collaboration of two.

Build for the pair. Build for the community. Build for the coupling.

That is how you get science done.

