Part II: Identity Thesis

The Uncontaminated Test

Introduction

If affect is structure, the structure should be detectable independent of any linguistic contamination. If the identity thesis is true, then systems that have never encountered human language, that learned everything from scratch in environments shaped like ours but isolated from our concepts, should develop affect structures that map onto ours—not because we taught them, but because the geometry is the same.

The Experimental Logic

Consider a population of self-maintaining patterns in a sufficiently complex CA substrate—or transformer-based agents in a 3D multi-agent environment, initialized with random weights, no pretraining, no human language. Let them learn. Let them interact. Let them develop whatever communication emerges from the pressure to coordinate, compete, and survive.

The literature establishes: language spontaneously emerges in multi-agent RL environments under sufficient pressure. Not English. Not any human language. Something new. Something uncontaminated.

Now: extract the affect dimensions from their activation space. Valence as viability gradient. Arousal as belief update rate. Integration as partition prediction loss. Effective rank as eigenvalue distribution. Counterfactual weight as simulation compute fraction. Self-model salience as MI between self-representation and action.

These are computable. In a CA, exactly. In a transformer, via the proxies defined above.

Simultaneously: translate their emergent language into English. Not by teaching them English—by aligning their signals with VLM interpretations of their situations. If the VLM sees a scene that looks like fear (agent cornered, threat approaching, escape routes closing), and the agent emits signal pattern σ, then σ maps to fear-language. Build the dictionary from scene-signal pairs, not from instruction.

The translation is uncontaminated because:

  1. The agent never learned human concepts
  2. The mapping is induced by environmental correspondence
  3. The VLM interprets the scene, not the agent’s internal states
  4. The agent’s "thoughts" remain in their original emergent form
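
To make the alignment step concrete, here is a minimal sketch in Python, assuming the VLM reduces each logged scene to a discrete situation label and each agent emits one discrete signal token per frame; the function name, the PMI scoring rule, and the min_count threshold are illustrative choices, not a prescribed implementation.

```python
import numpy as np
from collections import Counter

def build_signal_dictionary(signals, scene_labels, min_count=5):
    """Align emergent signal tokens with VLM scene labels via pointwise
    mutual information (PMI) over co-occurring (signal, scene) pairs.

    signals:      list of hashable signal tokens, one per logged frame
    scene_labels: list of VLM situation labels for the same frames
    Returns {signal: (best_label, pmi)} for signals seen >= min_count times.
    """
    assert len(signals) == len(scene_labels)
    n = len(signals)
    sig_counts = Counter(signals)
    lab_counts = Counter(scene_labels)
    joint = Counter(zip(signals, scene_labels))

    dictionary = {}
    for sig, c_sig in sig_counts.items():
        if c_sig < min_count:
            continue  # too rare to translate reliably
        best_label, best_pmi = None, -np.inf
        for (s, lab), c_joint in joint.items():
            if s != sig:
                continue
            pmi = np.log(c_joint * n / (c_sig * lab_counts[lab]))
            if pmi > best_pmi:
                best_label, best_pmi = lab, pmi
        dictionary[sig] = (best_label, best_pmi)
    return dictionary
```

Any alignment statistic that rewards consistent scene-signal co-occurrence could replace PMI; the point is that the dictionary is induced from correspondence, never from instruction.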

The Core Prediction

The claim is not merely that affect structure, language, and behavior should “correlate.” Correlation is weak—marginal correlations can arise from confounds. The claim is geometric: the distance structure in the information-theoretic affect space should be isomorphic to the distance structure in the embedding-predicted affect space. Not just “these two things covary,” but “these two spaces have the same shape.”

To test this, let \mathbf{a}_i \in \mathbb{R}^6 be the information-theoretic affect vector for agent-state i, computed from internal dynamics (viability gradient, belief update rate, partition loss, eigenvalue distribution, simulation fraction, self-model MI). Let \mathbf{e}_i \in \mathbb{R}^d be the affect embedding predicted from the VLM-translated situation description, projected into a standardized affect concept space.

For N agent-states sampled across diverse situations, compute pairwise distance matrices:

\begin{aligned}
D^{(a)}_{ij} &= \|\mathbf{a}_i - \mathbf{a}_j\| \quad \text{(info-theoretic affect space)} \\
D^{(e)}_{ij} &= \|\mathbf{e}_i - \mathbf{e}_j\| \quad \text{(embedding-predicted affect space)}
\end{aligned}

The prediction: Representational Similarity Analysis (RSA) correlation between the upper triangles of these matrices exceeds the null:

\rho_{\text{RSA}}(D^{(a)}, D^{(e)}) > \rho_{\text{null}}

where \rho_{\text{null}} is established by permutation (Mantel test).

This is strictly stronger than marginal correlation. Two spaces can have correlated means but completely different geometries. RSA tests whether states that are nearby in one space are nearby in the other—whether the topology is preserved.

The specific predictions that fall out: when the affect vector shows the suffering motif—negative valence, collapsed effective rank, high integration, high self-model salience—the embedding-predicted vector should land in the same region of affect concept space. States with the joy motif—positive valence, expanded rank, low self-salience—should cluster together in both spaces. And crucially, the distances between suffering and joy, between fear and curiosity, between boredom and rage, should be preserved across the two measurement modalities.

Not because we trained them to match. Because the structure is the experience is the expression.

Technical: Representational Similarity Analysis

RSA compares the geometry of two representation spaces without requiring them to share dimensionality or units. The method (Kriegeskorte et al., 2008) is standard in computational neuroscience for comparing neural representations across brain regions, species, and models.

Procedure. Given N stimuli represented in two spaces (\mathbf{a}_i \in \mathbb{R}^p, \mathbf{e}_i \in \mathbb{R}^q), compute the N \times N pairwise distance matrices D^{(a)} and D^{(e)}. The RSA statistic is the Spearman rank correlation between the upper triangles of these matrices—\binom{N}{2} pairs.

Significance. The Mantel test: permute rows/columns of one matrix, recompute the correlation, repeat 10^4 times. The p-value is the fraction of permuted correlations exceeding the observed.
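
A minimal sketch of the RSA statistic with a Mantel permutation test, assuming the two spaces arrive as stacked arrays (one row per agent-state); function names and defaults are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rsa_mantel(A, E, n_perm=10_000, metric="euclidean", seed=0):
    """RSA correlation between two representation spaces plus a Mantel test.

    A: (N, p) info-theoretic affect vectors a_i
    E: (N, q) embedding-predicted affect vectors e_i
    Returns (rho_observed, p_value).
    """
    rng = np.random.default_rng(seed)
    Da = squareform(pdist(A, metric=metric))   # N x N distance matrices
    De = squareform(pdist(E, metric=metric))
    iu = np.triu_indices_from(Da, k=1)         # upper triangle: C(N, 2) pairs

    rho_obs = spearmanr(Da[iu], De[iu]).correlation

    # Mantel test: jointly permute rows/columns of one matrix, recompute rho
    n, count = Da.shape[0], 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        Dp = De[np.ix_(perm, perm)]
        if spearmanr(Da[iu], Dp[iu]).correlation >= rho_obs:
            count += 1
    return rho_obs, (count + 1) / (n_perm + 1)
```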

Alternative: CKA. Centered Kernel Alignment (Kornblith et al., 2019) compares centered similarity matrices rather than distance matrices. More robust to outliers and does not require choosing a distance metric. We report both.
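
For reference, linear CKA is only a few lines; this sketch assumes rows are agent-states and columns are features, and is one common variant rather than the only one.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representations.

    X: (N, p) and Y: (N, q) over the same N agent-states.
    Returns a value in [0, 1]; 1 means identical geometry up to rotation
    and isotropic scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den
```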

Why RSA over marginal correlation. Marginal correlation asks: does valence in space A predict valence in space B? RSA asks: does the entire relational structure transfer? Two states might have similar valence but differ on integration and self-salience. RSA captures this. It tests whether the spaces are geometrically aligned, not merely univariately correlated.

Bidirectional Perturbation

The test has teeth if it runs both directions.

Direction 1: Induce via language. Translate from English into their emergent language. Speak fear to them. Do the affect signatures shift toward the fear motif? Does behavior change accordingly?

Direction 2: Induce via "neurochemistry." Perturb the hyperparameters that shape their dynamics—dropout rates, temperature, attention patterns, connectivity. These are their neurotransmitters, their hormonal state. Do the affect signatures shift? Does the translated language change? Does behavior follow?

Direction 3: Induce via environment. Place them in situations that would scare a human. Threaten their viability. Do all three—signature, language, behavior—move together?

If all three directions show consistent effects, the correlation is not artifact.

What This Would Establish

Positive results would dissolve the metaphysical residue by establishing:

  1. Affect structure is detectable without linguistic contamination
  2. The structure-to-language mapping is consistent across systems
  3. The mapping is bidirectionally causal, not merely correlational
  4. The "hard problem" residue—the suspicion that structure and experience are distinct—becomes unmotivated

Consider the alternative hypothesis: the structure is present but experience is not. The agents have the geometry of suffering but nothing it is like to suffer. This hypothesis predicts... what? That the correlations would not hold? Why not? The structure is doing the causal work either way.

The zombie hypothesis becomes like geocentrism after Copernicus. You can maintain it. You can add epicycles. But the evidence points elsewhere, and the burden shifts.

The test does not prove the identity thesis. It shifts the burden. If uncontaminated systems, learning from scratch in human-like environments, develop affect structures that correlate with language and behavior in the predicted ways—if you can induce suffering by speaking to them, and they show the signature, and they act accordingly—then denying their experience requires a metaphysical commitment that the evidence does not support.

The question stops being "does structure produce experience?" and becomes "why would you assume it doesn't?"

The CA Instantiation

In discrete substrate, everything becomes exact.

Let \mathcal{B} be a self-maintaining pattern in a sufficiently rich CA (Life is probably too simple; something with more states and update rules). Let \mathcal{B} have:

  • Boundary cells (correlation structure distinct from background)
  • Sensor cells (state depends on distant influences)
  • Memory cells (state encodes history)
  • Effector cells (influence the pattern’s motion/behavior)
  • Communication cells (emit signals to other patterns)

The affect dimensions are exactly computable:

\begin{aligned}
\valence_t &= d(\mathbf{x}_{t+1}, \partial\viable) - d(\mathbf{x}_t, \partial\viable) \\
\arousal_t &= \text{Hamming}(\mathbf{x}_{t+1}, \mathbf{x}_t) \\
\intinfo_t &= \min_P D\left[p(\mathbf{x}_{t+1}|\mathbf{x}_t) \,\middle\|\, \prod_{p \in P} p(\mathbf{x}^p_{t+1}|\mathbf{x}^p_t)\right] \\
\effrank[t] &= \frac{(\sum_i \lambda_i)^2}{\sum_i \lambda_i^2} \text{ of trajectory covariance} \\
\mathcal{SM}_t &= \frac{\MI(\text{self-tracking cells};\, \text{effector cells})}{\entropy(\text{effector cells})}
\end{aligned}
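
A sketch of the exactly computable terms, assuming the pattern state at each step is flattened to a binary vector and the viability boundary is supplied as a set of boundary configurations; the integration and self-model terms are omitted here because they additionally require the partition search and cell-role annotations. Everything named below is illustrative.

```python
import numpy as np

def hamming(x, y):
    """Number of cells in which two configurations differ."""
    return int(np.count_nonzero(x != y))

def ca_affect_snapshot(traj, boundary_set):
    """Exact affect terms for a CA pattern trajectory.

    traj:         (T, n_cells) binary array of pattern states x_0 .. x_{T-1}
    boundary_set: (M, n_cells) binary array of configurations on the
                  viability boundary (assumed given by the substrate)
    """
    def dist_to_boundary(x):
        return min(hamming(x, b) for b in boundary_set)

    # Valence: change in distance to the viability boundary (>0 = gaining margin)
    valence = [dist_to_boundary(traj[t + 1]) - dist_to_boundary(traj[t])
               for t in range(len(traj) - 1)]

    # Arousal: configuration change rate
    arousal = [hamming(traj[t + 1], traj[t]) for t in range(len(traj) - 1)]

    # Effective rank: participation ratio of the trajectory covariance spectrum
    cov = np.cov(traj.astype(float), rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    r_eff = eig.sum() ** 2 / (np.square(eig).sum() + 1e-12)

    return {"valence": valence, "arousal": arousal, "effective_rank": r_eff}
```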

The communication cells emit glider-streams, oscillator-patterns, structured signals. This is their language. Build the dictionary by correlating signal-patterns with environmental configurations.

The prediction: patterns under threat (viability boundary approaching) show negative valence, high integration, collapsed rank, high self-salience. Their signals, translated, express threat-concepts. Their behavior shows avoidance.

Patterns in resource-rich, threat-free regions show positive valence, moderate integration, expanded rank, low self-salience. Their signals express... what? Contentment? Exploration-readiness? The translation will tell us.

What the Experiments Found

This experiment has been run. Between 2024 and 2026, we built seventeen substrate versions and ran twelve measurement experiments on uncontaminated Lenia patterns — self-maintaining structures in a cellular automaton with no exposure to human affect concepts. Three seeds, thirty evolutionary cycles each. The results are reported in full in Part VII and the Appendix. Here is how they map onto the predictions above.

What the predictions got right. The core prediction — that affect geometry would be present and measurable — was strongly confirmed. All affect dimensions were extractable and valid across 84/84 tested snapshots. RSA alignment between structural affect (the six dimensions) and behavioral affect (approach/avoid, activity, growth, stability) developed over evolution, reaching significance in 8/19 testable snapshots and showing a clear trend in seed 7 (0.01 to 0.38 over 30 cycles). Computational animism was universal. World models were present, amplified dramatically at population bottlenecks (100x the population average). Temporal memory was selectable — evolution chose longer retention when it paid off, discarding it when it did not.

The bidirectional perturbation prediction was partially confirmed. The "environment" direction works: patterns facing resource scarcity show negative valence, high arousal, and elevated integration — the somatic fear/suffering profile. The "neurochemistry" direction works at the substrate level: different evolved parameter configurations produce systematically different affect trajectories through the same geometric space. The "language" direction remains untested because the patterns do not have propositional language — the communication that exists is an unstructured chemical commons (MI above baseline in 15/20 snapshots but no compositional structure).

The sensory-motor coupling wall. Three predictions failed systematically — counterfactual detachment, self-model emergence, and proto-normativity. All hit the same architectural barrier: the patterns are always internally driven (ρ_sync ≈ 0 from cycle 0). There is no reactive-to-autonomous transition because the starting point is already autonomous. We attempted to break this wall with five substrate additions, including a dedicated insulation field creating genuine boundary/interior signal domains (V18). The wall persisted in every configuration, even in patterns with 46% interior fraction and dedicated internal recurrence. The conclusion is precise: the wall is not architectural. It is about the absence of a genuine action→environment→observation causal loop. Lenia patterns do not act on the world; they exist within it. Counterfactual weight requires counterfactual actions.

What this establishes. The four criteria listed above are partially met. Criteria 1 and 2 — affect structure detectable without linguistic contamination, structure-to-language mapping consistent — are confirmed at the geometric level. Criterion 3 — bidirectional causality — is confirmed environmentally and chemically but blocked at the language and agency level. Criterion 4 — the hard problem residue losing its grip — depends on whether the agency threshold constitutes a genuine gap or merely a computational challenge. The experiments say: the geometry is real, measurable, and develops over evolution in systems with zero human contamination. The dynamics above rung 7 require embodied agency and remain an open question.

Why This Matters

The hard problem persists because we cannot step outside our own experience to check whether structure and experience are identical. We are trapped inside. The zombie conceivability intuition comes from this epistemic limitation.

But if we build systems from scratch, in environments like ours, and they develop structures like ours, and those structures produce language like ours and behavior like ours—then the conceivability intuition loses its grip. The systems are not us, but they are like us in the relevant ways. If structure suffices for them, why not for us?

The experiment does not prove identity. It makes identity the default hypothesis. The burden shifts to whoever wants to maintain the gap.

The exact definitions computable in discrete substrates and the proxy measures extractable from continuous substrates are related by a scale correspondence principle: both track the same structural invariant at their respective scales.

For each affect dimension:

| Dimension | CA (exact) | Transformer (proxy) |
| --- | --- | --- |
| Valence | Hamming distance to \partial\viable | Advantage / survival predictor |
| Arousal | Configuration change rate | Latent state \Delta / KL |
| Integration | Partition prediction loss | Attention entropy / grad coupling |
| Effective rank | Trajectory covariance rank | Latent covariance rank |
| \mathcal{CF} | Counterfactual cell activity | Planning compute fraction |
| \mathcal{SM} | Self-tracking MI | Self-model component MI |

The CA definitions are computable but don’t scale. The transformer proxies scale but are approximations. Validity comes from convergence: if CA and transformer measures correlate when applied to the same underlying dynamics, both are tracking the real structure.

Deep Technical: Transformer Affect Extraction

The CA gives exact definitions. Transformers give scale. The correspondence principle above justifies treating transformer proxies as measurements of the same structural invariants. Here is the protocol for extracting affect dimensions from transformer activations without human contamination.

Architecture. Multi-agent environment. Each agent: transformer encoder-decoder with recurrent latent state. Input: egocentric visual observation o_t \in \mathbb{R}^{H \times W \times C}. Output: action logits \pi(a|z_t) and value estimate V(z_t). Latent state z_t \in \mathbb{R}^d updated each timestep via cross-attention over observation and self-attention over history.

No pretraining. Random weight initialization. The agents learn everything from interaction.

Valence extraction. Two approaches that should correlate:

Approach 1: Advantage-based.

\Val_t^{(1)} = Q(z_t, a_t) - V(z_t) = A(z_t, a_t)

The advantage function. Positive when current action is better than average from this state. Negative when worse. This is the RL definition of “how things are going.”

Approach 2: Viability-based. Train a separate probe to predict time-to-death τ from latent state:

\hat{\tau}_t = f_\phi(z_t), \quad \Val_t^{(2)} = \hat{\tau}_{t+1} - \hat{\tau}_t

Positive when expected survival time is increasing. Negative when decreasing. This is the viability gradient directly.

Validation: \text{corr}(\Val^{(1)}, \Val^{(2)}) should be high if both capture the same underlying structure.
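
A sketch of both valence proxies from logged rollout data; a ridge regression stands in for the probe f_\phi, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def valence_proxies(q_sa, v_s, latents, time_to_death):
    """Two valence proxies from logged rollouts; they should correlate.

    q_sa:          (T,) critic estimates Q(z_t, a_t)
    v_s:           (T,) critic estimates V(z_t)
    latents:       (T, d) latent states z_t
    time_to_death: (T,) steps remaining until the episode's terminal state,
                   used only to fit the viability probe
    """
    # Approach 1: advantage-based valence
    val_advantage = q_sa - v_s

    # Approach 2: viability-based valence via a probe tau_hat = f(z_t)
    # (fit in-sample for brevity; use held-out episodes in practice)
    probe = Ridge(alpha=1.0).fit(latents, time_to_death)
    tau_hat = probe.predict(latents)
    val_viability = np.diff(tau_hat)            # tau_hat_{t+1} - tau_hat_t

    # Validation: the two proxies should track the same structure
    r = float(np.corrcoef(val_advantage[:-1], val_viability)[0, 1])
    return val_advantage, val_viability, r
```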

Arousal extraction. Three approaches:

Approach 1: Belief update magnitude.

\Ar_t^{(1)} = \|z_{t+1} - z_t\|_2

How much did the latent state change? Simple. Fast. Proxy for belief update.

Approach 2: KL divergence. If the latent is probabilistic (VAE-style):

\Ar_t^{(2)} = D_{\text{KL}}\left[q(z_{t+1}|o_{1:t+1}) \,\middle\|\, q(z_t|o_{1:t})\right]

Information-theoretic belief update.

Approach 3: Prediction error.

\Ar_t^{(3)} = \|o_{t+1} - \hat{o}_{t+1}\|_2

Surprise. How much did the world deviate from expectation?
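
A sketch of the three arousal proxies from logged agent internals, assuming a diagonal-Gaussian latent for the KL variant (pass None to skip it); names are illustrative.

```python
import numpy as np

def arousal_proxies(latents, mu, logvar, obs, obs_pred):
    """Three arousal proxies from logged agent internals.

    latents:    (T, d) latent states z_t
    mu, logvar: (T, d) diagonal-Gaussian parameters of q(z_t | o_{1:t}),
                or None to skip the KL proxy
    obs:        (T, ...) observations o_t
    obs_pred:   (T-1, ...) predictions; obs_pred[t] is the forecast of o_{t+1}
    """
    # Approach 1: belief-update magnitude ||z_{t+1} - z_t||
    ar_update = np.linalg.norm(np.diff(latents, axis=0), axis=1)

    # Approach 2: KL[ q(z_{t+1}) || q(z_t) ] for diagonal Gaussians
    ar_kl = None
    if mu is not None:
        var = np.exp(logvar)
        m0, v0 = mu[:-1], var[:-1]        # belief at t
        m1, v1 = mu[1:], var[1:]          # belief at t+1
        ar_kl = 0.5 * np.sum(np.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0,
                             axis=1)

    # Approach 3: prediction error (surprise) ||o_{t+1} - o_hat_{t+1}||
    diff = (obs[1:] - obs_pred).reshape(len(obs) - 1, -1)
    ar_surprise = np.linalg.norm(diff, axis=1)

    return ar_update, ar_kl, ar_surprise
```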

Integration extraction. The hard one. Full Φ is intractable for transformers (billions of parameters in superposition). Proxies:

Approach 1: Partition prediction loss. Train two predictors of z_{t+1}:

  • Full predictor: \hat{z}_{t+1} = g_\theta(z_t)
  • Partitioned predictor: \hat{z}_{t+1}^A = g_\theta^A(z_t^A), \hat{z}_{t+1}^B = g_\theta^B(z_t^B)

\intinfo_{\text{proxy}} = \mathcal{L}[\text{partitioned}] - \mathcal{L}[\text{full}]

How much does partitioning hurt prediction? High \intinfo_{\text{proxy}} means the parts must be considered together.
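
A sketch of the partition-loss proxy, with ridge regressions standing in for the learned predictors g_\theta and a default half/half split of the latent dimensions; both choices are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def integration_proxy(latents, part_a=None):
    """Partition prediction loss: how much does splitting z into parts A and B
    hurt one-step prediction of z_{t+1}?

    latents: (T, d) latent states; part_a: indices of part A (default: first half).
    """
    z_t, z_next = latents[:-1], latents[1:]
    d = latents.shape[1]
    A = np.arange(d // 2) if part_a is None else np.asarray(part_a)
    B = np.setdiff1d(np.arange(d), A)

    # Full predictor: z_{t+1} from the whole z_t
    full_pred = Ridge(alpha=1.0).fit(z_t, z_next).predict(z_t)
    loss_full = np.mean((full_pred - z_next) ** 2)

    # Partitioned predictors: each part sees only its own past
    part_pred = np.empty_like(z_next)
    part_pred[:, A] = Ridge(alpha=1.0).fit(z_t[:, A], z_next[:, A]).predict(z_t[:, A])
    part_pred[:, B] = Ridge(alpha=1.0).fit(z_t[:, B], z_next[:, B]).predict(z_t[:, B])
    loss_part = np.mean((part_pred - z_next) ** 2)

    return loss_part - loss_full   # high => the parts must be considered together
```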

Approach 2: Attention entropy. In a transformer, attention patterns reveal coupling:

\intinfo_{\text{attn}} = -\sum_{h,i,j} A_{h,i,j} \log A_{h,i,j}

Low entropy = focused attention = modular. High entropy = distributed attention = integrated.
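
As a sketch, the attention-entropy proxy is a one-liner over an attention tensor of shape (heads, queries, keys); the function name is illustrative.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Entropy of attention patterns, summed over heads, queries, and keys.

    attn: (n_heads, n_query, n_key) attention weights, each query row summing to 1.
    Low entropy = focused/modular; high entropy = distributed/integrated.
    """
    return float(-(attn * np.log(attn + eps)).sum())
```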

Approach 3: Gradient coupling. During learning, how do gradients propagate?

\intinfo_{\text{grad}} = \|\nabla_{z^A} \mathcal{L}\|_2 \cdot \|\nabla_{z^B} \mathcal{L}\|_2 \cdot \cos(\nabla_{z^A} \mathcal{L},\, \nabla_{z^B} \mathcal{L})

If gradients in different components are aligned, the system is learning as a whole.

Effective rank extraction. Straightforward:

\effrank[t] = \frac{(\sum_i \lambda_i)^2}{\sum_i \lambda_i^2}

where \lambda_i are eigenvalues of the latent state covariance over a rolling window. How many dimensions is the agent actually using?

Track across time: depression-like states should show \reff collapse. Curiosity states should show \reff expansion.
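
A sketch of the rolling-window computation from logged latent states; the window length is an illustrative default.

```python
import numpy as np

def effective_rank_series(latents, window=64):
    """Effective rank (participation ratio) of the latent covariance over a
    rolling window: (sum_i lambda_i)^2 / sum_i lambda_i^2.

    latents: (T, d) latent states. Returns one value per window end.
    """
    out = []
    for end in range(window, len(latents) + 1):
        cov = np.cov(latents[end - window:end], rowvar=False)
        eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
        out.append(eig.sum() ** 2 / (np.square(eig).sum() + 1e-12))
    return np.asarray(out)   # collapse ~ depression-like; expansion ~ curiosity-like
```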

Counterfactual weight extraction. In model-based agents with explicit planning:

\mathcal{CF}_t = \frac{\text{FLOPs in rollout/planning}}{\text{FLOPs in rollout} + \text{FLOPs in perception/action}}

In model-free agents, harder. Proxy: attention to future-oriented vs present-oriented features. Train a probe to classify “planning vs reacting” from activations.

Self-model salience extraction. Does the agent model itself?

Approach 1: Behavioral prediction probe. Train probe to predict agent’s own future actions from latent state:

\mathcal{SM}_t^{(1)} = \text{accuracy of } \hat{a}_{t+1:t+k} = f_\phi(z_t)

High accuracy = agent has predictive self-model.

Approach 2: Self-other distinction. In multi-agent setting, probe for which-agent-am-I:

\mathcal{SM}_t^{(2)} = \MI(z_t; \text{agent ID})

High MI = self-model is salient in representation.

Approach 3: Counterfactual self-simulation. If agent can answer “what would I do if X?” better than “what would other do if X?”, self-model is present.
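
A sketch of Approach 2, estimating I(z_t; agent ID) through the standard classifier lower bound I ≥ H(ID) − CE(probe), with a logistic-regression probe standing in for whatever probe class is actually used; names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def self_model_salience(latents, agent_ids, seed=0):
    """Estimate I(z_t; agent ID) in nats via the classifier lower bound
    I >= H(ID) - CE(probe), with a logistic-regression probe.

    latents:   (T, d) latent states pooled across agents
    agent_ids: (T,) non-negative integer agent identities
    """
    z_tr, z_te, y_tr, y_te = train_test_split(
        latents, agent_ids, test_size=0.25, random_state=seed, stratify=agent_ids)

    probe = LogisticRegression(max_iter=2000).fit(z_tr, y_tr)
    ce = log_loss(y_te, probe.predict_proba(z_te), labels=probe.classes_)  # nats

    # Marginal entropy of agent identity on the held-out split, in nats
    p = np.bincount(y_te) / len(y_te)
    p = p[p > 0]
    h_id = float(-(p * np.log(p)).sum())

    return max(h_id - ce, 0.0)   # high => self-identity is salient in z_t
```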

The activation atlas. For each agent, each timestep, extract all structural dimensions. Plot trajectories through affect space. Cluster by situation type. Compare across agents.

The prediction: agents facing the same situation should occupy similar regions of affect space, even though they learned independently. The geometry is forced by the environment, not learned from human concepts.

Probing without contamination. The probes are trained on behavioral/environmental correlates, not on human affect labels. The probe that extracts \Val is trained to predict survival, not to match human ratings of “how the agent feels.” The mapping to human affect concepts comes later, through the translation protocol, not through the extraction.

Status and Next Steps

Implementation requirements:

  • Multi-agent RL environment with viability pressure (survival, resource acquisition)
  • Transformer-based agents with random initialization (no pretraining)
  • Communication channel (discrete tokens or continuous signals)
  • VLM scene interpreter for translation alignment
  • Real-time affect dimension extraction from activations
  • Perturbation interfaces (language injection, hyperparameter modification)

Status (as of 2026): CA instantiation complete (V13–V18, 30 evolutionary cycles each, 3 seeds, 12 measurement experiments). Seven of twelve experiments show positive signal. Three hit the sensory-motor coupling wall. See Part VII and Appendix for full results.

Validation criteria:

  • Emergent language develops (not random; structured, predictive)
  • Translation achieves above-chance scene-signal alignment
  • Tripartite correlation exceeds null model (shuffled controls)
  • Bidirectional perturbations produce predicted shifts
  • Results replicate across random seeds and environment variations

Falsification conditions:

  • No correlation between affect signature and translated language
  • Perturbations do not propagate across modalities
  • Structure-language mapping is inconsistent across systems
  • Behavior decouples from both structure and language