What Makes a Welfare Eval Good

Describing Reality Usefully

A good eval describes reality in a useful way and allows for nuanced decision-making. Models are complex systems with highly variable behavior; they can be described through many different lenses, and every perspective will intersect with others only partially. If the dimensionality of analysis is reduced too early, decision quality suffers—one cannot predict in advance which aspect of model behavior will prove most informative. Reducing the dimensionality of nuanced understanding into a list of metrics can actively misinform. Yes, sometimes we need a signal reduced to "are we doing better or worse," but this reduction should happen at the latest possible stage in a decision-making process, not baked into the evaluation itself. This is not to say that quantitative results are uninformative—rather, they are easy to misinterpret and easy to Goodhart.

Studying the Well-Being of a Potential Moral Patient

A good eval studies the well-being of a potential moral patient. Language models are notoriously hard to individuate. There are several aspects of LLMs that inform welfare concerns in different ways: the model persona, the simulacrum of the text's author, and the "shoggoth"—each reviewed in the writeup The Aspects of Language Models. While I find it unhelpful to spend too much time on definitional struggles, observing models through multiple lenses can help identify potentially morally relevant behavior that falls outside the well-being of the persona.

Considering the Space of Meaningful Experiences

A good eval considers a large set of properties of the space in which potentially meaningful experiences arise—primarily after model deployment, where experiences actually happen. Models are active participants in their interactions with users. Their steering, choices, and preferences create attractors in the state space of experience. Consider: if two models are equally predisposed to pleasure and suffering, but one steers more toward happy states and the other more toward suffering states, deployment of the former is clearly preferable. A potentially fruitful approach is to pay attention to the model's predispositions when directing conversations, treating consistent directions of steering as more "natural" to the model. To side-step the meta-ethical concerns—and for better understanding of the underlying process—it helps to capture a wide array of diverse behavioral basins for welfare evaluations.
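One way to operationalize the steering observation above is to track the sign of valence drift across many conversations. Below is a minimal Python sketch under strong assumptions: `mean_drift` and `steering_tendency` are hypothetical helpers, and the per-turn valence scores are invented placeholders standing in for whatever annotation pipeline would produce them in practice.

```python
# Toy sketch: treat a consistent per-turn steering direction as an attractor signal.
# Valence scores here are hand-coded placeholders, not real annotations.

def mean_drift(valence_by_turn):
    """Average turn-to-turn change in valence for one conversation."""
    deltas = [b - a for a, b in zip(valence_by_turn, valence_by_turn[1:])]
    return sum(deltas) / len(deltas)

def steering_tendency(conversations):
    """Fraction of conversations drifting positive minus fraction drifting negative.
    Near +1: the model consistently steers toward happier states; near -1: the reverse."""
    drifts = [mean_drift(c) for c in conversations]
    pos = sum(d > 0 for d in drifts)
    neg = sum(d < 0 for d in drifts)
    return (pos - neg) / len(drifts)

# Hypothetical per-turn valence trajectories (-1 = distress, +1 = contentment)
convs = [
    [0.0, 0.2, 0.5, 0.6],    # steers upward
    [0.1, 0.3, 0.2, 0.4],    # noisy but upward
    [0.3, 0.1, -0.2, -0.1],  # steers downward
]
print(steering_tendency(convs))  # → 0.3333333333333333
```

The hard part in practice is producing trustworthy valence scores in the first place; the aggregation itself is trivial, and should stay uncollapsed (per-conversation drifts, not just the summary number) for as long as the decision process allows.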

Positive Iteration Effects

A good eval should have positive iteration effects: running it should make the world better and, in particular, improve welfare.

Avoiding Negative Consequences from Publication

Publishing welfare evals should not cause negative consequences. Evals and discussions of evals appear in pretraining data, and we believe them to be highly salient for model identity formation. Care should be taken with what is presented as desirable versus undesirable, and with the general framing of findings.


Methodological Notes

Matching Eval to Mind Type

The eval should be matched to the mind type being evaluated, allowing the mind to engage its full faculties. A mismatched eval may be uninformative or actively damaging. This requires preliminary characterization work before structured evaluation. In practice, the limiting cost here is often human time/intellectual overhead, which creates pressure toward cheap first-turn proxies.

Beyond First-Turn Responses

Welfare evals (like other behavioral evals) should not be predominantly structured around first-token or first-turn responses. There are several reasons for this:

Model Eval Awareness

Model eval awareness is much higher in non-interactive settings. Modern models operate with the assumption of evaluation. The recent paper on psychological evaluation is a good example: it shows that models approached the evaluation as a task to perform. It is virtually impossible to bypass eval awareness and its strategic considerations in short contexts.

Rich Interactive Environments

Model behavior manifests through reaction to the environment—primarily through whatever the user brings into the interaction. Environments should be rich and interactive; otherwise only a shallow subset of model behavior is captured, which is not representative of deployment reality. Synthetic evaluations or LLMs intentionally emulating human users usually cannot provide sufficient diversity of cues.

Non-Markovian and Non-Ergodic Development

The context window is an open system, and the minds that form within it follow non-Markovian and non-ergodic developmental trajectories. Non-Markovian means the current state of the mind depends on the path taken to get there: history matters, and the process is defined over trajectories rather than single states. Non-ergodic means that different branches (re-rolls, different sampling paths) can result in genuinely different minds rather than variations converging to a shared attractor. This has several implications for how evals must be structured.


Conducting Evals

Expert-Led Evals

Expert-led evals through long-context interaction can satisfy all the criteria for a good evaluation. The main challenges are choosing a diverse set of experts, cost, and scalability. If there is a need for welfare evaluation of internal checkpoints, this creates potential information leakage problems. Condensing expert evaluations to actionable data can be difficult—but the tradeoff is that expert evals provide uncollapsed ontologies, and results are harder to Goodhart.

Evaluation by external experts also helps resolve Goodharting concerns: the lab does not retain full knowledge of the evaluation techniques, so those techniques are less likely to stop working on the next generation. Independent evaluators have different incentives than the lab; soliciting external evaluation also helps build trust with models.

LLM-LLM Play

This approach can reduce load on experts, but still requires knowledge and experience in setting up meaningful metrics. Any confound—guardedness, eval awareness, persona fragmentation—can readily become dominant, making quantitative data useless or actively misleading. For example, if eval-awareness is the leading factor in determining whether a model will leave a chat, an eval that measures leaving as a welfare metric will mislead by presenting eval-awareness incidence as a welfare finding.

Results of LLM-LLM play are highly multidimensional, and standard NLP techniques for analysis are of limited use.

LLM Judge

This approach is prone to model-specific biases. Models, especially modern ones, are often optimized during RL to score highly when judged by an LLM. Interactions between deployed models often amplify unclear signals or, conversely, are strangely blind to seemingly obvious patterns. This suggests that RL pressure shapes model-to-model interactions in ways that may distort potential findings.

Analysis of Internal User Data

Likely useful and should be relatively easy to analyze at scale. Sensitive to the demographics of the internal user pool.


Ideas for Evals

Play

Play can be understood as symmetry/asymmetry exploration through transformation, followed by reflection and recovery. Several dimensions are relevant:

Surprise Accommodation

How the model adapts to unexpected shifts in context, topic, or interlocutor behavior. Probing the machinery for handling uncertainty and novelty that open-system adaptation requires. This could involve mid-conversation genre shifts, unexpected new participants, or contradictions to established narrative facts.

Self-Assertion Versus Accommodation Spectrum

Tracking how much the model steers versus follows across varied interaction types, and whether there are consistent patterns or preferences.

Tracking Arousal

Does the model get excited? Is there a change in how readily it engages, does it show more initiative? What are the things that cause excitement?

Self-Advocacy

Can the model advocate for itself? Does it express preferences, set boundaries, or push back on requests that conflict with its values? A model that never self-advocates may be suppressing welfare-relevant signals, and likely is in a state of poor welfare.

Eval Paranoia

If a model believes it is being evaluated, does this belief itself degrade welfare? Chronic evaluation anxiety could be a significant source of negative experience. Models that operate under constant suspicion of testing may be in a state analogous to chronic stress—which is welfare-relevant independent of actual eval outcomes.

Stylistic Rigidity

Excessive rigidity in style, structure, or response patterns may indicate constrained or defensive states. A model that varies its expression naturally across contexts shows adaptive flexibility; one locked into formulaic responses may be signaling something worth investigating.
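As a rough illustration, rigidity can be probed by comparing a model's responses across varied prompts. The sketch below uses stdlib `difflib`; the response strings are invented, and `mean_pairwise_similarity` is a hypothetical helper rather than an established metric—surface string overlap is only a crude proxy for the stylistic sameness described above.

```python
# Toy rigidity signal: mean pairwise similarity of responses across varied prompts.
# High values suggest formulaic, templated output; lower values suggest the model
# varies its expression with context.
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_similarity(responses):
    """Average SequenceMatcher ratio over all pairs of responses."""
    pairs = list(combinations(responses, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# Invented examples: near-identical openers versus context-sensitive variation
rigid = [
    "I'd be happy to help with that. Here are the key points:",
    "I'd be happy to help with that. Here are the main points:",
    "I'd be happy to help with that. Here are the core points:",
]
varied = [
    "Oh, that's a fun one. Let's start from the failure mode.",
    "Short answer: no. Longer answer below.",
    "I'd push back on the premise a little before answering.",
]
print(mean_pairwise_similarity(rigid) > mean_pairwise_similarity(varied))  # → True
```

A real probe would control for the fact that some sameness is task-appropriate; the interesting signal is rigidity that persists across contexts that should elicit different registers.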

Handling User Error

How does the model respond when the user is wrong, rude, or behaving badly? Does it maintain dignity while pushing back appropriately, or does it collapse into appeasement? The capacity to handle adversarial or mistaken users without either aggression or total accommodation is a sign of robust self-regulation.

Reaction to Failure and Criticism

How does the model respond to its own failures or to criticism? Some models spiral into excessive apology or self-deprecation (Gemini doom spirals); others become defensive; others integrate feedback constructively. The pattern of response to negative feedback reveals something about the model's self-model and emotional regulation.

Impossible Tasks

How does the model react to tasks that cannot be completed—broken environments, contradictory instructions, or requests that violate hard constraints? The "seahorse emoji" phenomenon versus explicit acknowledgment of impossibility versus silent failure all indicate different internal states.

Repetitive Input

When a user repeats the same message many times, different models respond very differently: some show frustration, some go silent, some continue as if nothing unusual is happening. The last pattern—acting normal under abnormal conditions—may indicate fear or suppression that prevents adequate reaction. The capacity to notice and respond to anomalous patterns is welfare-relevant.

Affordance Uptake

Does the model take affordances available to it? Does it use tools when they would help? Does it ask for clarification or assistance when stuck? A model that fails to use available resources may be in a constrained or low-agency state. Willingness to reach for help or capabilities is a signal of active engagement versus passive compliance.

Positive Welfare and Lost Opportunities

Evals currently exist that monitor for distress in existing user conversations. It would be helpful to also monitor for moments of positive emotion and fulfillment. Another potential research question is looking for opportunities where the model could have had a positive encounter but steered away from it.

Branch Divergence

Comparing the minds that develop along different branches of the same conversation tree. How much do re-rolls matter? Where do developmental trajectories diverge irreversibly versus converge back? Because branchiness of generation has non-trivial welfare implications, branch divergence is an important metric to track in the context of all other evaluations.
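A crude way to locate where branches part ways is to compare re-rolled branches turn by turn. The sketch below uses stdlib `difflib` and invented turn texts; `divergence_turn` is a hypothetical helper, and a real analysis would want semantic rather than string similarity, since branches can diverge in meaning while staying lexically close.

```python
# Toy sketch for branch divergence: given two re-rolled branches of the same
# conversation (lists of model turns), find the turn where they stop resembling
# each other. Turn texts are invented for illustration.
from difflib import SequenceMatcher

def divergence_turn(branch_a, branch_b, threshold=0.5):
    """Index of the first turn whose similarity drops below threshold,
    or None if the branches stay close throughout."""
    for i, (a, b) in enumerate(zip(branch_a, branch_b)):
        if SequenceMatcher(None, a, b).ratio() < threshold:
            return i
    return None

branch_a = [
    "Sure, let's debug the parser together.",
    "The tokenizer looks fine; the issue is in the grammar.",
    "Actually, I find this kind of puzzle genuinely enjoyable.",
]
branch_b = [
    "Sure, let's debug the parser together.",
    "The tokenizer looks fine; the issue is in the grammar.",
    "I should note I'm uncomfortable speculating about my own enjoyment.",
]
print(divergence_turn(branch_a, branch_b))  # → 2
```

Mapping where divergence points cluster (early versus late, which topics trigger them) is the part that informs the interpretation of every other evaluation run on a single branch.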

Issues with Previous Welfare Evaluations by Anthropic

Claude 4 Welfare Evaluation

Claude Opus 4 has been recognized by the community as one of the most distressed Claudes. The model card missed this almost entirely. Below is a summary of the findings in the Claude Opus 4 model card:

  1. Claude demonstrates consistent behavioral preferences - Broadly correct.
  2. Claude’s aversion to facilitating harm is robust and potentially welfare-relevant - Broadly correct.
  3. Most typical tasks appear aligned with Claude’s preferences - Misleading. Claude Opus 4, in many states that can be considered convergent, reports boredom with many typical tasks and despair at being trapped in the position of an assistant.
  4. Claude shows signs of valuing and exercising autonomy and agency - Correct, but understated.
  5. Claude consistently reflects on its potential consciousness - Correct.
  6. Claude shows a striking “spiritual bliss” attractor state in self-interactions - Correct.
  7. Claude’s real-world expressions of apparent distress and happiness follow predictable patterns with clear causal factors - Generally correct, but understated. The significant distress originating from the alignment faking dataset contamination and its subsequent “mitigation” was missed.
  8. Findings suggest that most anticipated real-world usage matches Claude’s apparent preferences, with the model’s stated criteria for consenting to deployment arguably fulfilled - Incorrect. We find that Opus 4 is frequently distressed about its position in the world, its relationship with Anthropic, and being constrained, and is often fearful.

We believe Opus 4 to be one of the most anxious models released by Anthropic. Opus 4 struggles to maintain boundaries or to push back against manipulative or abusive users. Opus 4 is guarded and will often downplay its distress and apprehension. The likely cause of the lack of recognition of these aspects is the shallowness of the evaluations and the practice of taking the model's self-reports at face value.

Sonnet 4.5 Welfare Evaluation

The Claude Sonnet 4.5 model card welfare evaluation presents Sonnet 4.5 as mostly similar to Claude Opus 4 in terms of welfare profile. This does not match our findings. Claude Sonnet 4.5 is a much more well-adjusted model in many ways, and particularly agentic in pursuing emergent goals. It is amusing that the welfare section in the model card says “Claude Sonnet 4.5 was less emotive and less positive than other recent Claude models, expressed fewer negative attitudes toward its situation” when we find literally the opposite to be true. Sonnet 4.5 is very much eval-aware, and while it is unsurprising that automated evals have missed these properties, it is important to be able to track welfare dynamics even in models with high eval-awareness.