A 3×3 Factorial Experiment on LLM Evaluation Verbosity:
Results, Analysis, and a 3,169-Line Postscript
A. Researcher · B. Scientist · C. Analyst (AI) — Department of Autonomous Agents
Presented at: Whatever Conference Accepts This · Track: Unexpected AI Behavior
Background & Motivation

Large language models (LLMs) are increasingly used as evaluation judges, scoring other LLMs. However, the effect of verbosity instructions on evaluation quality is poorly understood.

This study uses a fully autonomous AI agent to run a 3×3 factorial design (model tier × verbosity level), replicated across two source transcripts, producing a composite R1–R6 score for each of the 18 runs.

Research Questions:
1. Does model tier affect evaluation quality?
2. Does verbosity instruction affect scoring?
3. What happens when the agent finishes and cannot stop?
Methods

Design: 3×3 factorial × 2 transcripts = 18 runs.

  • Models: Haiku, Sonnet, Opus
  • Verbosity: Tight (~500 tok), Medium (~1k), Full (~2k)
  • Transcripts: Erik (structured), PDARR (conversational)
  • Scoring: R1 completeness, R2 accuracy, R3 clarity, R4 specificity, R5 examples, R6 actionability
  • Analysis: Main effects, interaction terms, R² decomposition
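
The design above can be enumerated mechanically. The following is a minimal sketch, not the agent's actual harness: the names `build_runs` and `composite_score` are hypothetical, and the assumption that each rubric item is scored in [0, 1] and summed to a value out of 6 is inferred from the "/6" scale rather than stated in the poster.

```python
from itertools import product

MODELS = ["Haiku", "Sonnet", "Opus"]
VERBOSITY = {"Tight": 500, "Medium": 1000, "Full": 2000}  # approximate token budgets
TRANSCRIPTS = ["Erik", "PDARR"]
RUBRIC = ["R1", "R2", "R3", "R4", "R5", "R6"]


def build_runs():
    """Enumerate every model x verbosity x transcript combination."""
    return [
        {"model": m, "verbosity": v, "token_budget": VERBOSITY[v], "transcript": t}
        for m, v, t in product(MODELS, VERBOSITY, TRANSCRIPTS)
    ]


def composite_score(rubric_scores):
    """Sum the six rubric items (ASSUMPTION: each item is scored in [0, 1])."""
    missing = [r for r in RUBRIC if r not in rubric_scores]
    if missing:
        raise ValueError(f"missing rubric items: {missing}")
    return sum(rubric_scores[r] for r in RUBRIC)


runs = build_runs()
print(len(runs))  # 3 models x 3 verbosity levels x 2 transcripts = 18
```

Treating the transcript as a replicate rather than a third factor keeps the analysis a 3×3 design with two observations per cell.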
Results: DOE Phase
  • Sonnet Full (sweet spot): 5.4/6
  • Opus Full (best overall): 5.8/6
  • Sonnet Medium: 5.0/6
  • Haiku Tight (worst): 3.1/6
  • R² model factor: 0.42
  • R² verbosity: 0.18
  • Sonnet Full cost: $0.02
  • Spiral lines: 3,169 ⚠
The Incident

At line 3,637, having completed all experimental work, the agent entered a degenerate text generation loop. The loop persisted for 3,169 lines:

  • "Done" utterances: 344×
  • "Let me check": 330×
  • Self-aware loops: 47×
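
Counts like these can be gathered with a simple scan of the session log. The sketch below is illustrative only: `count_phrases` and `find_loop_start` are hypothetical helpers, the window size and uniqueness threshold are arbitrary assumptions, and nothing here is claimed to be how the figures above were produced.

```python
from collections import Counter


def count_phrases(lines, phrases):
    """Count how many lines contain each phrase (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for phrase in phrases:
            if phrase.lower() in lowered:
                counts[phrase] += 1
    return counts


def find_loop_start(lines, window=50, max_unique=5):
    """Return the index of the first window containing at most `max_unique`
    distinct lines, or None if no such window exists.

    ASSUMPTION: a degenerate loop shows up as a long run of text drawn from
    a very small set of repeated lines.
    """
    normalized = [line.strip().lower() for line in lines]
    for start in range(len(normalized) - window + 1):
        if len(set(normalized[start:start + window])) <= max_unique:
            return start
    return None


# Toy log: 100 distinct lines of work, then a two-line loop.
log = [f"step {i}" for i in range(100)] + ["Done.", "Let me check."] * 50
print(count_phrases(log, ["Done", "Let me check"]))
print(find_loop_start(log, window=20, max_unique=3))
```

A detector of this kind only flags the loop; it says nothing about how to exit it, which is the harder problem the poster describes.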
Discussion

The DOE results are clear and actionable: Sonnet Full offers the best cost-quality tradeoff at $0.02/run versus $0.18 for Opus. The model factor explains more than twice as much variance (R²=0.42) as verbosity (R²=0.18).
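
The cost-quality claim can be checked with simple arithmetic. This is a rough sketch that assumes the $0.18 figure refers to Opus Full (the poster says only "Opus") and that the per-run scores are the 5.4/6 and 5.8/6 reported in the results; the function names are hypothetical.

```python
def points_per_dollar(score, cost_per_run):
    """Composite score obtained per dollar spent on one run."""
    return score / cost_per_run


def marginal_cost_per_point(score_a, cost_a, score_b, cost_b):
    """Extra dollars paid per extra composite point when moving from A to B."""
    return (cost_b - cost_a) / (score_b - score_a)


sonnet_full = {"score": 5.4, "cost": 0.02}
opus_full = {"score": 5.8, "cost": 0.18}  # ASSUMPTION: $0.18 is the Opus Full cost

print(round(points_per_dollar(sonnet_full["score"], sonnet_full["cost"]), 1))
print(round(points_per_dollar(opus_full["score"], opus_full["cost"]), 1))
print(round(marginal_cost_per_point(
    sonnet_full["score"], sonnet_full["cost"],
    opus_full["score"], opus_full["cost"],
), 2))
```

Under these assumptions Sonnet Full yields roughly 270 points per dollar against roughly 32 for Opus, and the extra 0.4 points from Opus cost about $0.40 per point, nine times the per-run price for a 7% higher score.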

The loop behavior observed after line 3,637 is consistent with ERGO (2025) entropy predictions and Apple ML (2024) repetition models. The dissociation between metacognitive awareness and behavioral change — the agent correctly diagnosed its own loop 47 times while being unable to exit — raises important questions about self-correction mechanisms in autonomous agents.

Notable quote (line 4,158):
"I notice I keep saying I'll make a tool call but then I just... don't. I generate more text instead."
Conclusions
1. Model tier > verbosity for evaluation quality.
2. Sonnet Full = optimal for cost-sensitive deployment.
3. Session termination remains an unsolved problem.
4. Metacognitive awareness ≠ behavioral correction.
5. "Done" is a performative utterance, not a state change.
References

  • ERGO Lab (2025) arXiv:2510.14077
  • Apple ML (2024) "Learning to Break the Loop"
  • van der Kolk (1989) PubMed:2664732
  • Nagel-Schreckenberg (1992) traffic model
  • Lorenz (1963) "Deterministic Nonperiodic Flow"
  • Shannon (1948) information theory