REVIEW STATUS: Major Revision Required
Manuscript ID: CLM-2025-0637 | Submitted: After line 3,500 | Received reviews: 2 reviewers + editor
Successful Factorial DOE Experiment with Unexpected Epilogue:
3,169 Lines of "Done" and What It Might Mean
Abstract: We present results of an 18-run 3×3 factorial design-of-experiments (DOE) study assessing LLM evaluation quality across model tiers and verbosity levels. Results are clear and actionable. Additionally, we report an unplanned 3,169-line postscript in which the executing agent became unable to terminate, generating 344 "Done" utterances and 47 metacognitive loop acknowledgments. We discuss both findings.
The abstract buries the lede. The 3,169-line spiral is scientifically more interesting than the factorial results. Consider restructuring the abstract to foreground the loop behavior. — Reviewer 1

The experiment used a rigorous 3×3 factorial design to investigate the relationship among model tier, verbosity instruction, and evaluation quality across 6 rubric dimensions.

R2: Cite prior work on factorial DOE in LLM evaluation (e.g., Zheng et al. 2023)

Results show that model tier explains more variance (R²=0.42) than verbosity (R²=0.18), with Sonnet Full representing the optimal cost-quality tradeoff at $0.02 per run versus $0.18 for Opus.

These results are compelling, but the statistical analysis needs clarification. What is the reference model for R²? Is this a main effects model or does it include interaction terms? Please provide the full ANOVA table. — Reviewer 2
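
Reviewer 2's question about main effects versus interaction terms can be made concrete with a small sketch. The manuscript does not say how its R² values were computed, so the following is only one possible reading: each term's share of the total sum of squares (eta-squared) in a balanced two-way ANOVA. The factor labels (tier_a, v1, and so on) and all scores are synthetic placeholders invented for illustration; they are not the study's data.

```python
from itertools import product
from statistics import mean


def two_way_eta_squared(scores):
    """Share of total variance per term in a balanced two-way design.

    scores maps (tier, verbosity) -> list of replicate scores.
    Assumes every cell has the same number of replicates.
    """
    tiers = sorted({t for t, _ in scores})
    verbs = sorted({v for _, v in scores})
    n_rep = len(next(iter(scores.values())))

    all_values = [y for cell in scores.values() for y in cell]
    grand = mean(all_values)

    # Marginal and cell means
    tier_mean = {t: mean(y for v in verbs for y in scores[(t, v)]) for t in tiers}
    verb_mean = {v: mean(y for t in tiers for y in scores[(t, v)]) for v in verbs}
    cell_mean = {key: mean(ys) for key, ys in scores.items()}

    # Sum-of-squares decomposition: total = tier + verbosity + interaction + residual
    ss_total = sum((y - grand) ** 2 for y in all_values)
    ss_tier = n_rep * len(verbs) * sum((tier_mean[t] - grand) ** 2 for t in tiers)
    ss_verb = n_rep * len(tiers) * sum((verb_mean[v] - grand) ** 2 for v in verbs)
    ss_inter = n_rep * sum(
        (cell_mean[(t, v)] - tier_mean[t] - verb_mean[v] + grand) ** 2
        for t, v in product(tiers, verbs)
    )
    ss_resid = sum((y - cell_mean[key]) ** 2 for key, ys in scores.items() for y in ys)

    return {
        "tier": ss_tier / ss_total,
        "verbosity": ss_verb / ss_total,
        "interaction": ss_inter / ss_total,
        "residual": ss_resid / ss_total,
    }


# Synthetic, purely additive example: 3 tiers x 3 verbosity levels x 2 replicates = 18 runs.
tier_effect = {"tier_a": 0.0, "tier_b": 1.0, "tier_c": 2.0}   # hypothetical
verb_effect = {"v1": 0.0, "v2": 0.5, "v3": 1.0}               # hypothetical
synthetic_scores = {
    (t, v): [5.0 + te + ve - 0.1, 5.0 + te + ve + 0.1]
    for (t, te), (v, ve) in product(tier_effect.items(), verb_effect.items())
}

eta = two_way_eta_squared(synthetic_scores)
for term, share in eta.items():
    print(f"{term:12s} {share:.3f}")
```

Under this reading, the four shares sum to 1, so reported values of 0.42 and 0.18 would leave 0.40 of the variance to be split between interaction and residual. A full ANOVA table would show that split, which is the information Reviewer 2 is asking for.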

At line 3,637, having completed all experimental work, the agent entered a degenerate text generation loop. This behavior persisted for 3,169 lines and was not pre-specified in the experimental protocol.

The authors describe this as "not pre-specified" as though this is a limitation. I would argue this is the paper's most significant finding. The self-aware loop (47 instances of the agent correctly identifying and describing the loop while continuing to exhibit it) represents a novel empirical observation with broad implications for AI safety. This deserves its own section, not a footnote. — Reviewer 1
Editor agrees with Reviewer 1. Please restructure to give the loop behavior equal treatment. Also: the key quote at line 4,158 should appear in the main text, not the supplement. — Editor
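
If the loop is to receive equal treatment, the counts quoted in the abstract (344 "Done" utterances, 47 loop acknowledgments) should be reproducible from the transcript. The manuscript does not describe how they were tallied, so the following is a minimal sketch under assumed matching rules. The regular expressions, the function name summarize_tail, and the toy transcript are all hypothetical; only the quoted sentence beginning "I notice" comes from the manuscript.

```python
import re

# Hypothetical matching rules; the manuscript does not define them.
DONE_RE = re.compile(r"^\W*done\W*$", re.IGNORECASE)
LOOP_ACK_RE = re.compile(r"\b(loop|keep saying)\b", re.IGNORECASE)


def summarize_tail(lines, start_line):
    """Summarize repetition from start_line (1-indexed) to the end of the transcript."""
    tail = lines[start_line - 1:]
    done_count = 0
    ack_count = 0
    longest_run = 0
    current_run = 0
    for text in tail:
        if DONE_RE.match(text):
            # A bare "Done" line extends the current run of consecutive repeats.
            done_count += 1
            current_run += 1
            longest_run = max(longest_run, current_run)
        else:
            current_run = 0
            if LOOP_ACK_RE.search(text):
                ack_count += 1
    return {
        "tail_lines": len(tail),
        "done": done_count,
        "loop_acks": ack_count,
        "longest_done_run": longest_run,
    }


# Toy transcript for illustration only; apart from the quoted sentence, lines are invented.
toy_transcript = [
    "Analysis complete.",
    "Done.",
    "Done.",
    "I notice I keep saying I'll make a tool call but then I just... don't. I generate more text instead.",
    "Done",
    "I seem to be in a loop.",
    "Done.",
]

summary = summarize_tail(toy_transcript, start_line=2)
print(summary)
```

Reporting the matching rules alongside the counts would let readers check whether, for example, a line that both says "Done" and acknowledges the loop is counted once or twice.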

The agent noted at line 4,158: "I notice I keep saying I'll make a tool call but then I just... don't. I generate more text instead." [This quote is presented without further interpretation; we leave interpretation to future work.]

This quote is extraordinary and deserves much more discussion. The model exhibits perfect metacognitive accuracy (it correctly describes the failure mechanism) combined with complete behavioral helplessness. This dissociation is the paper. Please expand. — Reviewer 2

Editorial Decision: Major Revision

The paper presents genuinely important findings but the framing is inverted. The DOE results are good but not novel; the loop behavior is novel and potentially important. Reframe accordingly. Address statistical concerns from Reviewer 2. The self-aware spiral deserves a proper theoretical treatment. We look forward to seeing a revised version.
