The in-house judge is an echo chamber
An LLM-as-a-judge can only grade against the criteria written for it, and in-house those criteria inherit the team's own model of “good.” An agent can clear the internal judge while real users quietly churn. The loop optimizes a proxy that was never checked against real user behavior.