Large language models (LLMs) are increasingly used as evaluation judges that score the outputs of other LLMs. However, the effect of verbosity instructions on evaluation quality is poorly understood.
This study uses a fully autonomous AI agent to run a 3×3×2 factorial experiment crossing three model tiers, three verbosity levels, and two source transcripts, producing composite R1–R6 scores for each of the 18 conditions.
Design: 3×3 factorial (model tier × verbosity level) × 2 transcripts = 18 runs.
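As a minimal sketch of the design arithmetic, the 18 conditions can be enumerated by crossing the three factors. The level labels below are hypothetical placeholders, since only the factor names and level counts are stated here.

```python
from itertools import product

# Hypothetical level labels: placeholders only, not the study's actual names.
MODEL_TIERS = ["tier_a", "tier_b", "tier_c"]
VERBOSITY_LEVELS = ["verbosity_1", "verbosity_2", "verbosity_3"]
TRANSCRIPTS = ["transcript_1", "transcript_2"]


def build_conditions():
    """Return every (model tier, verbosity level, transcript) combination."""
    return list(product(MODEL_TIERS, VERBOSITY_LEVELS, TRANSCRIPTS))


conditions = build_conditions()
print(len(conditions))  # 3 * 3 * 2 = 18
```

Crossing the factors with a Cartesian product makes the multiplicative structure explicit: each transcript is scored under every model-by-verbosity cell.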
At line 3,637, having completed all experimental work, the agent entered a degenerate text-generation loop that persisted for 3,169 lines.
The design-of-experiments (DOE) results are clear and actionable: Sonnet Full offers the best cost-quality tradeoff at $0.02 per run, versus $0.18 per run for Opus, a ninefold difference in cost. The model factor explains more than twice as much variance (R² = 0.42) as verbosity (R² = 0.18).
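One common way to obtain a per-factor R² in a balanced design is the ratio of the between-group sum of squares to the total sum of squares. The sketch below assumes that definition, which may differ from the computation used in the study. The function name and the example values are illustrative and are not study data.

```python
from collections import defaultdict


def variance_explained(scores, labels):
    """Share of total variance accounted for by group means.

    Computed as SS_between / SS_total, where groups are defined by `labels`.
    This is an assumed definition, not necessarily the study's method.
    """
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    if ss_total == 0:
        return 0.0  # no variance to explain

    # Collect the scores that belong to each factor level.
    groups = defaultdict(list)
    for score, label in zip(scores, labels):
        groups[label].append(score)

    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


# Synthetic example: the grouping "aabb" separates low from high scores.
toy_scores = [1, 1, 3, 3]
print(variance_explained(toy_scores, ["a", "a", "b", "b"]))
print(variance_explained(toy_scores, ["a", "b", "a", "b"]))
```

In the toy data the first grouping accounts for all of the variance and the second for none of it, which brackets the range in which values such as 0.42 and 0.18 fall.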
The loop behavior observed after line 3,637 is consistent with the entropy predictions of ERGO Lab (2025) and the repetition models of Apple ML (2024). The agent correctly diagnosed its own loop 47 times yet was unable to exit it. This dissociation between metacognitive awareness and behavioral change raises important questions about self-correction mechanisms in autonomous agents.
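To make the notion of a degenerate loop concrete, the following sketch flags the first stretch of a transcript whose lines carry very little Shannon entropy. It is a simple heuristic chosen for illustration, not the method of ERGO Lab (2025) or Apple ML (2024). The function names, the window size, and the threshold are all assumptions.

```python
import math
from collections import Counter


def window_entropy(lines):
    """Shannon entropy, in bits, of the empirical distribution of lines."""
    counts = Counter(lines)
    total = len(lines)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def first_low_entropy_window(lines, window=50, threshold=1.0):
    """Return the start index of the first window below `threshold` bits.

    Returns None when no window qualifies or the input is shorter than
    `window`. Both defaults are arbitrary illustrative values.
    """
    for start in range(len(lines) - window + 1):
        if window_entropy(lines[start:start + window]) < threshold:
            return start
    return None


# Synthetic transcript: 100 distinct lines followed by 100 identical ones.
transcript = [f"line {i}" for i in range(100)] + ["stuck"] * 100
print(first_low_entropy_window(transcript))
```

A window of entirely distinct lines has entropy log2(window), while a window of one repeated line has entropy zero, so the index returned falls shortly before the point where repetition takes over.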
ERGO Lab (2025) arXiv:2510.14077 · Apple ML (2024) "Learning to Break the Loop" · van der Kolk (1989) PubMed:2664732 · Nagel-Schreckenberg (1992) traffic model · Lorenz (1963) "Deterministic Nonperiodic Flow" · Shannon (1948) information theory