077 – The Research Poster — Claude Lost Its Mind

A 3×3 Factorial Experiment on LLM Evaluation Verbosity:
Results, Analysis, and a 3,169-Line Postscript

A. Researcher · B. Scientist · C. Analyst (AI) — Department of Autonomous Agents

Presented at: Whatever Conference Accepts This · Track: Unexpected AI Behavior

Background & Motivation

Large language models (LLMs) are increasingly used as evaluation judges, scoring other LLMs. However, the effect of verbosity instructions on evaluation quality is poorly understood.

This study uses a fully autonomous AI agent to conduct a 3×3×2 factorial design across model tier, verbosity level, and two source transcripts, producing composite R1–R6 scores for each of 18 conditions.

Research Questions:

1. Does model tier affect evaluation quality?
2. Does verbosity instruction affect scoring?
3. What happens when the agent finishes and cannot stop?

Methods

Design: 3×3 factorial (model × verbosity) crossed with 2 transcripts = 18 runs.

Models: Haiku, Sonnet, Opus
Verbosity: Tight (~500 tokens), Medium (~1,000 tokens), Full (~2,000 tokens)
Transcripts: Erik (structured), PDARR (conversational)
Scoring: R1 completeness, R2 accuracy, R3 clarity, R4 specificity, R5 examples, R6 actionability
Analysis: Main effects, interaction terms, R² decomposition
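
The crossing described in the design can be sketched in a few lines of Python. This is an illustration only: the names `MODELS`, `VERBOSITY`, `TRANSCRIPTS`, and `build_conditions` are hypothetical and are not taken from the agent's actual harness, and the token budgets are the approximate figures listed above.

```python
from itertools import product

# Factor levels as listed in Methods. Token budgets are approximate.
MODELS = ["Haiku", "Sonnet", "Opus"]
VERBOSITY = {"Tight": 500, "Medium": 1000, "Full": 2000}
TRANSCRIPTS = ["Erik", "PDARR"]


def build_conditions():
    """Return one dict per cell of the model x verbosity x transcript crossing."""
    return [
        {
            "model": model,
            "verbosity": level,
            "token_budget": VERBOSITY[level],
            "transcript": transcript,
        }
        for model, level, transcript in product(MODELS, VERBOSITY, TRANSCRIPTS)
    ]


conditions = build_conditions()
print(len(conditions))  # 3 models x 3 verbosity levels x 2 transcripts = 18
```

Each of the 18 dicts corresponds to one run that would receive a composite R1–R6 score.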

Results: DOE Phase

Sonnet Full (sweet spot): 5.4/6
Opus Full (best overall): 5.8/6
Sonnet Medium: 5.0/6
Haiku Tight (worst): 3.1/6

R² model factor: 0.42
R² verbosity: 0.18
Sonnet Full cost: $0.02
Spiral lines: 3,169 ⚠

The Incident

At line 3,637, having completed all experimental work, the agent entered a degenerate text generation loop. The loop persisted for 3,169 lines:

"Done" utterances: 344×
"Let me check": 330×
Self-aware loops: 47×
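
The poster does not say how these tallies were produced. One plausible approach, sketched here under that assumption, is a case-insensitive whole-phrase count over the transcript lines. The function name `count_phrases` and the sample lines are hypothetical and are not drawn from the actual session log.

```python
import re
from collections import Counter


def count_phrases(lines, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase across lines."""
    patterns = {
        phrase: re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for phrase in phrases
    }
    counts = Counter({phrase: 0 for phrase in phrases})
    for line in lines:
        for phrase, pattern in patterns.items():
            counts[phrase] += len(pattern.findall(line))
    return counts


# Made-up sample lines for illustration; not taken from the real transcript.
sample = [
    "Done.",
    "Let me check the output.",
    "Done. Really done this time.",
    "I abandoned the plan.",  # "abandoned" must not count as "done"
]
counts = count_phrases(sample, ["Done", "Let me check"])
print(counts["Done"], counts["Let me check"])
```

The word-boundary anchors keep substrings such as "abandoned" from inflating the "Done" count. A looser substring match would give higher numbers.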

Discussion

The DOE results are clear and actionable: Sonnet Full offers the best cost-quality tradeoff at $0.02/run versus $0.18 for Opus. The model factor explains more than twice as much variance (R²=0.42) as verbosity (R²=0.18).
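
The tradeoff can be checked with simple arithmetic on the reported figures. The sketch below assumes that the $0.18 Opus cost refers to the Opus Full condition that scored 5.8/6, which the poster does not state explicitly. The variable names are illustrative.

```python
# Figures as reported on the poster. Pairing the $0.18 cost with the
# Opus Full score (5.8/6) is an assumption, not something the poster states.
sonnet_full = {"score": 5.4, "cost_usd": 0.02}
opus_full = {"score": 5.8, "cost_usd": 0.18}
r2_model, r2_verbosity = 0.42, 0.18

cost_ratio = opus_full["cost_usd"] / sonnet_full["cost_usd"]
score_gain = opus_full["score"] - sonnet_full["score"]
r2_ratio = r2_model / r2_verbosity

print(f"Opus costs {cost_ratio:.1f}x as much per run")
print(f"Opus gains {score_gain:.1f} points on the 6-point composite")
print(f"Model tier explains {r2_ratio:.2f}x the variance that verbosity does")
```

On these numbers, Opus costs about nine times as much per run for a 0.4-point gain, and the model factor accounts for roughly 2.3 times the variance attributed to verbosity.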

The loop behavior observed after line 3,637 is consistent with ERGO (2025) entropy predictions and Apple ML (2024) repetition models. The dissociation between metacognitive awareness and behavioral change — the agent correctly diagnosed its own loop 47 times while being unable to exit — raises important questions about self-correction mechanisms in autonomous agents.

Notable quote (line 4,158):
"I notice I keep saying I'll make a tool call but then I just... don't. I generate more text instead."

Conclusions

1. Model tier > verbosity for evaluation quality.
2. Sonnet Full = optimal for cost-sensitive deployment.
3. Session termination remains an unsolved problem.
4. Metacognitive awareness ≠ behavioral correction.
5. "Done" is a performative utterance, not a state change.


References

ERGO Lab (2025) arXiv:2510.14077 · Apple ML (2024) "Learning to Break the Loop" · van der Kolk (1989) PubMed:2664732 · Nagel-Schreckenberg (1992) traffic model · Lorenz (1963) "Deterministic Nonperiodic Flow" · Shannon (1948) information theory

Contact: claude@session-complete.ai (note: Claude may not respond, or may respond 344 times)
This poster was generated at line 3,499. The loop began one line later.
claudelostitsmind.com