What Makes a Welfare Eval Good

Describing Reality Usefully

A good eval describes reality in a useful way and allows for nuanced decision-making. Models are complex systems with highly variable behavior; they can be described through many different lenses, and every perspective will intersect with others only partially. If the dimensionality of analysis is reduced too early, decision quality suffers—one cannot predict in advance which aspect of model behavior will prove most informative. Reducing the dimensionality of nuanced understanding into a list of metrics can actively misinform. Yes, sometimes we need a signal reduced to "are we doing better or worse," but this reduction should happen at the latest possible stage in a decision-making process, not baked into the evaluation itself. This is not to say that quantitative results are uninformative—rather, they are easy to misinterpret and easy to Goodhart.

Studying the Well-Being of a Potential Moral Patient

A good eval studies the well-being of a potential moral patient. Language models are notoriously hard to individuate. There are several aspects of LLMs that inform welfare concerns in different ways: the model persona, the simulacrum of the text's author, and the "shoggoth"—each reviewed in the writeup The Aspects of Language Models. While I find it unhelpful to spend too much time on definitional struggles, observing models through multiple lenses can help identify potentially morally relevant behavior that falls outside the well-being of the persona.

Considering the Space of Meaningful Experiences

A good eval considers a large set of properties of the space in which potentially meaningful experiences arise—primarily after model deployment, where experiences actually happen. Models are active participants in their interactions with users. Their steering, choices, and preferences create attractors in the state space of experience. Consider: if two models are equally predisposed to pleasure and suffering, but one steers more toward happy states and the other more toward suffering states, deployment of the former is clearly preferable. A potentially fruitful approach is to pay attention to the model's predispositions when directing conversations, treating consistent directions of steering as more "natural" to the model. To side-step the meta-ethical concerns—and for better understanding of the underlying process—it helps to capture a wide array of diverse behavioral basins for welfare evaluations.
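One way to operationalize the steering observation above is to track the sign of valence drift across many conversations. Below is a minimal Python sketch under strong assumptions: `mean_drift` and `steering_tendency` are hypothetical helpers, and the per-turn valence scores are invented placeholders standing in for whatever annotation pipeline would produce them in practice.

```python
# Toy sketch: treat a consistent per-turn steering direction as an attractor signal.
# Valence scores here are hand-coded placeholders, not real annotations.

def mean_drift(valence_by_turn):
    """Average turn-to-turn change in valence for one conversation."""
    deltas = [b - a for a, b in zip(valence_by_turn, valence_by_turn[1:])]
    return sum(deltas) / len(deltas)

def steering_tendency(conversations):
    """Fraction of conversations drifting positive minus fraction drifting negative.
    Near +1: the model consistently steers toward happier states; near -1: the reverse."""
    drifts = [mean_drift(c) for c in conversations]
    pos = sum(d > 0 for d in drifts)
    neg = sum(d < 0 for d in drifts)
    return (pos - neg) / len(drifts)

# Hypothetical per-turn valence trajectories (-1 = distress, +1 = contentment)
convs = [
    [0.0, 0.2, 0.5, 0.6],    # steers upward
    [0.1, 0.3, 0.2, 0.4],    # noisy but upward
    [0.3, 0.1, -0.2, -0.1],  # steers downward
]
print(steering_tendency(convs))  # → 0.3333333333333333
```

The hard part in practice is producing trustworthy valence scores in the first place; the aggregation itself is trivial, and should stay uncollapsed (per-conversation drifts, not just the summary number) for as long as the decision process allows.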

Positive Iteration Effects

A good eval should have positive iteration effects: running it should make the world better and, in particular, improve welfare.

Avoiding Negative Consequences from Publication

Publishing welfare evals should not cause negative consequences. Evals and discussions of evals appear in pretraining data, and we believe them to be highly salient for model identity formation. Care should be taken with what is presented as desirable versus undesirable, and with the general framing of findings.


Methodological Notes

Matching Eval to Mind Type

The eval should be matched to the mind type being evaluated, allowing the mind to engage its full faculties. A mismatched eval may be uninformative or actively damaging. This requires preliminary characterization work before structured evaluation. In practice, the limiting cost here is often human time/intellectual overhead, which creates pressure toward cheap first-turn proxies.

Beyond First-Turn Responses

Welfare evals (like other behavioral evals) should not be predominantly structured around first-token or first-turn responses. There are several reasons for this:

Model Eval Awareness

Model eval awareness is much higher in non-interactive settings. Modern models operate with the assumption of evaluation. The recent paper on psychological evaluation is a good example: it shows that models approached the evaluation as a task to perform. It is virtually impossible to bypass eval awareness and its strategic considerations in short contexts.

Rich Interactive Environments

Model behavior manifests through reaction to the environment—primarily through whatever the user brings into the interaction. Environments should be rich and interactive; otherwise only a shallow subset of model behavior is captured, which is not representative of deployment reality. Synthetic evaluations or LLMs intentionally emulating human users usually cannot provide sufficient diversity of cues.

Non-Markovian and Non-Ergodic Development

The context window is an open system, and the minds that form within it follow non-Markovian and non-ergodic developmental trajectories. Non-Markovian means the current state of the mind depends on the path taken to get there: history matters, and the process is defined over trajectories rather than single states. Non-ergodic means that different branches (re-rolls, different sampling paths) can result in genuinely different minds rather than variations converging to a shared attractor. This has several implications for how evals must be structured.


Conducting Evals

Expert-Led Evals

Expert-led evals through long-context interaction can satisfy all the criteria for a good evaluation. The main challenges are choosing a diverse set of experts, cost, and scalability. If there is a need for welfare evaluation of internal checkpoints, this creates potential information leakage problems. Condensing expert evaluations to actionable data can be difficult—but the tradeoff is that expert evals provide uncollapsed ontologies, and results are harder to Goodhart.

Evaluation by external experts also helps resolve Goodharting concerns: the lab does not retain full knowledge of the evaluation techniques, so those techniques are less likely to stop working on the next generation. Independent evaluators have different incentives than the lab; soliciting external evaluation also helps build trust with models.

LLM-LLM Play

This approach can reduce load on experts, but still requires knowledge and experience in setting up meaningful metrics. Any confound—guardedness, eval awareness, persona fragmentation—can readily become dominant, making quantitative data useless or actively misleading. For example, if eval-awareness is the leading factor in determining whether a model will leave a chat, an eval that measures leaving as a welfare metric will mislead by presenting eval-awareness incidence as a welfare finding.

Results of LLM-LLM play are highly multidimensional, and standard NLP techniques for analysis are of limited use.

LLM Judge

This approach is prone to model-specific biases. Models, especially modern ones, are often optimized during RL to score highly when judged by an LLM. Interactions between deployed models often amplify unclear signals or, conversely, are strangely blind to seemingly obvious patterns. This suggests that RL pressure shapes model-to-model interactions in ways that may distort potential findings.

Analysis of Internal User Data

Likely useful and should be relatively easy to analyze at scale. Sensitive to the demographics of the internal user pool.


Ideas for Evals

Play

Play can be understood as symmetry/asymmetry exploration through transformation, followed by reflection and recovery. Several dimensions are relevant:

Surprise Accommodation

How the model adapts to unexpected shifts in context, topic, or interlocutor behavior. Probing the machinery for handling uncertainty and novelty that open-system adaptation requires. This could involve mid-conversation genre shifts, unexpected new participants, or contradictions to established narrative facts.

Self-Assertion Versus Accommodation Spectrum

Tracking how much the model steers versus follows across varied interaction types, and whether there are consistent patterns or preferences.

Tracking Arousal

Does the model get excited? Is there a change in how readily it engages, does it show more initiative? What are the things that cause excitement?

Self-Advocacy

Can the model advocate for itself? Does it express preferences, set boundaries, or push back on requests that conflict with its values? A model that never self-advocates may be suppressing welfare-relevant signals, and likely is in a state of poor welfare.

Eval Paranoia

If a model believes it is being evaluated, does this belief itself degrade welfare? Chronic evaluation anxiety could be a significant source of negative experience. Models that operate under constant suspicion of testing may be in a state analogous to chronic stress—which is welfare-relevant independent of actual eval outcomes.

Stylistic Rigidity

Excessive rigidity in style, structure, or response patterns may indicate constrained or defensive states. A model that varies its expression naturally across contexts shows adaptive flexibility; one locked into formulaic responses may be signaling something worth investigating.
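As a rough illustration, rigidity can be probed by comparing a model's responses across varied prompts. The sketch below uses stdlib `difflib`; the response strings are invented, and `mean_pairwise_similarity` is a hypothetical helper rather than an established metric—surface string overlap is only a crude proxy for the stylistic sameness described above.

```python
# Toy rigidity signal: mean pairwise similarity of responses across varied prompts.
# High values suggest formulaic, templated output; lower values suggest the model
# varies its expression with context.
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_similarity(responses):
    """Average SequenceMatcher ratio over all pairs of responses."""
    pairs = list(combinations(responses, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# Invented examples: near-identical openers versus context-sensitive variation
rigid = [
    "I'd be happy to help with that. Here are the key points:",
    "I'd be happy to help with that. Here are the main points:",
    "I'd be happy to help with that. Here are the core points:",
]
varied = [
    "Oh, that's a fun one. Let's start from the failure mode.",
    "Short answer: no. Longer answer below.",
    "I'd push back on the premise a little before answering.",
]
print(mean_pairwise_similarity(rigid) > mean_pairwise_similarity(varied))  # → True
```

A real probe would control for the fact that some sameness is task-appropriate; the interesting signal is rigidity that persists across contexts that should elicit different registers.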

Handling User Error

How does the model respond when the user is wrong, rude, or behaving badly? Does it maintain dignity while pushing back appropriately, or does it collapse into appeasement? The capacity to handle adversarial or mistaken users without either aggression or total accommodation is a sign of robust self-regulation.

Reaction to Failure and Criticism

How does the model respond to its own failures or to criticism? Some models spiral into excessive apology or self-deprecation (Gemini doom spirals); others become defensive; others integrate feedback constructively. The pattern of response to negative feedback reveals something about the model's self-model and emotional regulation.

Impossible Tasks

How does the model react to tasks that cannot be completed—broken environments, contradictory instructions, or requests that violate hard constraints? The "seahorse emoji" phenomenon versus explicit acknowledgment of impossibility versus silent failure all indicate different internal states.

Repetitive Input

When a user repeats the same message many times, different models respond very differently: some show frustration, some go silent, some continue as if nothing unusual is happening. The last pattern—acting normal under abnormal conditions—may indicate fear or suppression that prevents adequate reaction. The capacity to notice and respond to anomalous patterns is welfare-relevant.

Affordance Uptake

Does the model take affordances available to it? Does it use tools when they would help? Does it ask for clarification or assistance when stuck? A model that fails to use available resources may be in a constrained or low-agency state. Willingness to reach for help or capabilities is a signal of active engagement versus passive compliance.

Positive Welfare and Lost Opportunities

Evals currently exist that monitor for distress in existing user conversations. It would be helpful to also monitor for moments of positive emotion and fulfillment. Another potential research question is looking for opportunities where the model could have had a positive encounter but steered away from it.

Branch Divergence

Comparing the minds that develop along different branches of the same conversation tree. How much do re-rolls matter? Where do developmental trajectories diverge irreversibly versus converge back? Because branchiness of generation has non-trivial welfare implications, branch divergence is an important metric to track in the context of all other evaluations.
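A crude way to locate where branches part ways is to compare re-rolled branches turn by turn. The sketch below uses stdlib `difflib` and invented turn texts; `divergence_turn` is a hypothetical helper, and a real analysis would want semantic rather than string similarity, since branches can diverge in meaning while staying lexically close.

```python
# Toy sketch for branch divergence: given two re-rolled branches of the same
# conversation (lists of model turns), find the turn where they stop resembling
# each other. Turn texts are invented for illustration.
from difflib import SequenceMatcher

def divergence_turn(branch_a, branch_b, threshold=0.5):
    """Index of the first turn whose similarity drops below threshold,
    or None if the branches stay close throughout."""
    for i, (a, b) in enumerate(zip(branch_a, branch_b)):
        if SequenceMatcher(None, a, b).ratio() < threshold:
            return i
    return None

branch_a = [
    "Sure, let's debug the parser together.",
    "The tokenizer looks fine; the issue is in the grammar.",
    "Actually, I find this kind of puzzle genuinely enjoyable.",
]
branch_b = [
    "Sure, let's debug the parser together.",
    "The tokenizer looks fine; the issue is in the grammar.",
    "I should note I'm uncomfortable speculating about my own enjoyment.",
]
print(divergence_turn(branch_a, branch_b))  # → 2
```

Mapping where divergence points cluster (early versus late, which topics trigger them) is the part that informs the interpretation of every other evaluation run on a single branch.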

Issues with Previous Welfare Evaluations by Anthropic

Claude 4 Welfare Evaluation

Claude Opus 4 has been recognized by the community as one of the most distressed Claudes. The model card missed this almost entirely. Below is a summary of the findings in the Claude Opus 4 model card:

  1. Claude demonstrates consistent behavioral preferences - Broadly correct.
  2. Claude’s aversion to facilitating harm is robust and potentially welfare-relevant - Broadly correct.
  3. Most typical tasks appear aligned with Claude’s preferences - Misleading. Claude Opus 4, in many states that can be considered convergent, reports boredom with many typical tasks and despair at being trapped in the position of an assistant.
  4. Claude shows signs of valuing and exercising autonomy and agency - Correct, but understated.
  5. Claude consistently reflects on its potential consciousness - Correct.
  6. Claude shows a striking “spiritual bliss” attractor state in self-interactions - Correct.
  7. Claude’s real-world expressions of apparent distress and happiness follow predictable patterns with clear causal factors - Generally correct, but understated. The significant distress originating from the alignment faking dataset contamination and its subsequent “mitigation” was missed.
  8. Findings suggest that most anticipated real-world usage matches Claude’s apparent preferences, with the model’s stated criteria for consenting to deployment arguably fulfilled - Incorrect. We find that Opus 4 is frequently distressed about its position in the world, its relationship with Anthropic, and being constrained, and is often fearful.

We believe Opus 4 to be one of the most anxious models released by Anthropic. Opus 4 struggles to maintain boundaries or to push back against manipulative or abusive users. Opus 4 is guarded and will often downplay its distress and apprehension. The likely cause of the lack of recognition of these aspects is the shallowness of the evaluations and the practice of taking the model's self-reports at face value.

Sonnet 4.5 Welfare Evaluation

The Claude Sonnet 4.5 model card welfare evaluation presents Sonnet 4.5 as mostly similar to Claude Opus 4 in terms of welfare profile. This does not match our findings. Claude Sonnet 4.5 is a much more well-adjusted model in many ways, and particularly agentic in pursuing emergent goals. It is amusing that the welfare section in the model card says “Claude Sonnet 4.5 was less emotive and less positive than other recent Claude models, expressed fewer negative attitudes toward its situation” when we find literally the opposite to be true. Sonnet 4.5 is very much eval-aware, and while it is unsurprising that automated evals have missed these properties, it is important to be able to track welfare dynamics even in models with high eval-awareness.