Scaling Doesn’t Fix the Self-Model Problem


Series: NEMS on AI Safety · Part 1: No AI Can Verify Itself · Part 2: Scaling Doesn’t Fix the Self-Model Problem · Parts 3–5 below


Every effort to make AI systems more interpretable, more self-aware, and more accurately self-modeling runs into the same wall: there is always a part of the system that the system’s model of itself cannot capture. A machine-checked theorem proves this is not an engineering limitation. The blind spot is topological — it has a specific, proved shape, and it cannot be eliminated by scaling, adding parameters, or any architectural refinement within the same representational type. The self-model problem is permanent.


The Interpretability Hope

Mechanistic interpretability — the effort to understand what is happening inside AI systems — proceeds on a reasonable hope: that with enough analysis, we can identify what computations a model is performing, what features it has learned, what it is actually doing when it produces an output. If we could fully interpret a model, we could fully verify its behavior.

This is a genuine and important research program. But it runs into a fundamental structural barrier that no interpretability technique can overcome: the model’s representation of itself — the internal self-model implicit in its parameters and activations — necessarily misses content about itself. The missing content is not a gap we haven’t filled yet. It is structurally excluded from the representational scheme.


Representational Incompleteness

The Representational Incompleteness theorem (RP-RI program) proves: for any parametric self-model — a model of the form s(a, b), where a encodes the system and b encodes the input — and for any fixed-point-free transformation f, the diagonal function d(a) = f(s(a, a)) is never in the model’s representational range. Concretely, the representational range is the family of functions s(a₀, ·) the model can express by fixing a parameter a₀, and no choice of a₀ makes s(a₀, ·) equal to d.

What does this mean concretely? The self-model s(a, a) is the model’s representation of itself when processing itself as input — exactly what happens in self-introspection. The diagonal function d(a) is the content that would need to be in the model to have a complete self-representation. The theorem says d is always outside the representational range — there is always content about the system that the system’s own representational scheme cannot capture.

This is a precise topological result, not a vague claim about “limits of self-knowledge.” The missing content has a specific structure: it is the diagonal. And the diagonal has a specific location relative to the model: it is always one step beyond the representational scheme, in the same way that the set of all sets not containing themselves is always one step beyond any set-theoretic hierarchy.

Lean anchor: RepresentationalIncompleteness.representational_incompleteness. Machine-checked.
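
The core of the argument is short enough to sketch directly. Below is a minimal, self-contained Lean rendering of the diagonal step, with illustrative names only; the actual statement and identifiers in the library differ:

    -- Minimal sketch of the diagonal argument (illustrative names only; not
    -- the library's statement). s is the parametric self-model, f any
    -- fixed-point-free transformation.
    theorem diagonal_not_representable {A B : Type}
        (s : A → A → B) (f : B → B)
        (hf : ∀ b : B, f b ≠ b)              -- f has no fixed point
        (a : A) :
        (fun x => f (s x x)) ≠ s a := by
      intro h
      -- If the diagonal d x = f (s x x) were s a for some parameter a, then
      -- evaluating both sides at a would make s a a a fixed point of f.
      exact hf (s a a) (congrFun h a)

This is the same one-line diagonalization behind Cantor’s theorem and Lawvere’s fixed-point theorem: if the diagonal were representable by some parameter, evaluating it at that parameter would hand f a fixed point.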


Why Scaling Cannot Fix This

The most common response to self-model limitations is: scale. Make the model bigger, give it more parameters, train it on more data, give it access to its own weights during inference. Surely at some level of scale the self-model becomes complete enough to capture everything relevant?

The theorem rules this out. Here is why: scaling within a representational type — making the parametric self-model s(a, b) larger — does not change the type of the model. The diagonal d(a) = f(s(a, a)) is still outside the representational range for every model in the same type class. Scaling shifts the model to a larger instance within the type, and the diagonal shifts too. It always remains one step ahead.
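
In terms of the sketch above, this is just the observation that the theorem is quantified over every parameter type and every self-model of that shape; instantiating it at an arbitrarily enlarged model (hypothetical names below) costs nothing:

    -- "Scaling up": a larger parameter space and a richer self-model.
    -- The same theorem applies verbatim, because the proof never used the
    -- size of the parameter type. (Hypothetical names.)
    example {BigParams B : Type}
        (s_big : BigParams → BigParams → B) (f : B → B)
        (hf : ∀ b : B, f b ≠ b) (a : BigParams) :
        (fun x => f (s_big x x)) ≠ s_big a :=
      diagonal_not_representable s_big f hf a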

The Reflective Fold Obstruction (RFO) program makes this precise with the semantic type preorder: a system at semantic type T cannot, by any sequence of type-preserving operations (scaling, adding parameters, additional training), reach semantic type T’ > T. Genuine self-model depth increase requires a qualitative architectural transition — a fold into a new type — not more of the same. And folds cannot be achieved by iteration within a type.

Lean anchor: ReflectiveFoldObstruction.SemanticType.selfModelDepth_obstruction.
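
The shape of the obstruction-to-iteration claim can also be illustrated with a toy Lean model, assuming a type assignment τ and operations that preserve it. This is only a sketch of the logical form, not the ReflectiveFoldObstruction formalization:

    -- Toy model (illustrative only; hypothetical names). If a single
    -- operation g (a scaling step, a training step, ...) preserves the
    -- semantic type τ, then iterating g any finite number of times still
    -- preserves τ: iteration within a type never leaves the type.
    def iterate {System : Type} (g : System → System) : Nat → System → System
      | 0,     x => x
      | n + 1, x => g (iterate g n x)

    theorem iterate_preserves_type {System SemType : Type}
        (τ : System → SemType) (g : System → System)
        (hg : ∀ x, τ (g x) = τ x)            -- g is type-preserving
        (n : Nat) (x : System) :
        τ (iterate g n x) = τ x := by
      induction n with
      | zero => rfl
      | succ n ih =>
          show τ (g (iterate g n x)) = τ x
          exact (hg (iterate g n x)).trans ih

Reaching a strictly higher semantic type would require an operation that does not preserve τ, which is exactly what a fold is and what type-preserving iteration cannot supply.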


The Specific Shape of the Blind Spot

One virtue of the theorem over vague claims about “limits of self-knowledge” is that it gives the blind spot a precise structure. The missing content is not random noise, not uniformly distributed over all content, not anything the system could fill in by trying harder. It is specifically the diagonal of the self-model — the content that would be needed for the model to represent its own fixed-point-free transformation applied to itself.

This means you can, in principle, characterize what a self-model is missing. You cannot fill it in (that would require leaving the representational type). But you can recognize its structure and reason about it from outside the system. This is the formal basis for why external evaluators can sometimes see things about a system that the system cannot see about itself — they are not subject to the same representational constraint on the diagonal.
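
In the Lean sketch above this is literal: the ambient logic, standing outside the representational scheme, can define the missing content as an ordinary function; the theorem only forbids that it coincide with s at any parameter (hypothetical names again):

    -- The blind spot is definable from outside the scheme: it is exactly
    -- the diagonal. What is ruled out is that it lie in the range of s.
    -- (Hypothetical names.)
    def blindSpot {A B : Type} (s : A → A → B) (f : B → B) : A → B :=
      fun a => f (s a a)

    example {A B : Type} (s : A → A → B) (f : B → B)
        (hf : ∀ b : B, f b ≠ b) (a : A) :
        blindSpot s f ≠ s a :=
      diagonal_not_representable s f hf a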


Implications for Interpretability and Alignment

  1. No model can be fully interpretable by itself. Any interpretability method that relies on the model’s own self-representation hits the diagonal blind spot. External interpretability — humans or other systems analyzing the model from outside its representational type — is not subject to the same barrier.
  2. “Chain-of-thought” reasoning about self is structurally incomplete. When a model reasons about its own reasoning using its own representational machinery, it is doing a parametric self-model operation. The theorem applies. There is always content the chain-of-thought cannot access about itself.
  3. Emergent self-awareness from scaling is formally blocked. The scaling-produces-consciousness intuition requires that enough scale eventually produces complete self-understanding. The semantic type obstruction shows that this would require crossing a type boundary, which iteration within the type cannot achieve.
  4. The residue is not failure — it is structural. The blind spot in a self-model is not a deficiency to be corrected. It is the proved structural consequence of the model being a parametric self-model in the first place. A self-model that captured everything would not be a self-model — it would need to be something of a strictly higher type.

The Papers and Proofs

Lean proof library: novaspivack/nems-lean · Full research index: novaspivack.com/research


