Quality evaluation: Each debate is scored to measure how well the models converged.
Testing: We run automated tests that use score variations to validate prompt improvements.
System improvement: Score trends guide our decisions on prompts, models, and stop conditions.
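
As a rough illustration of how score variations might be used to validate a prompt change, the sketch below compares the mean debate score of a baseline prompt against a candidate prompt. The function names, the 0-to-1 score scale, and the `min_delta` threshold are all assumptions made for this example; the actual scoring and test setup are not described here.

```python
from statistics import mean


def score_delta(baseline_scores, candidate_scores):
    """Change in mean debate score from the baseline prompt to the candidate.

    Both arguments are lists of per-debate scores (hypothetical 0-to-1 scale).
    """
    return mean(candidate_scores) - mean(baseline_scores)


def is_improvement(baseline_scores, candidate_scores, min_delta=0.05):
    """True when the candidate's mean score beats the baseline by min_delta.

    min_delta is an assumed threshold, chosen only for illustration.
    """
    return score_delta(baseline_scores, candidate_scores) >= min_delta
```

A test could then assert `is_improvement(old_scores, new_scores)` before a prompt change is accepted. A real setup would likely also account for run-to-run noise, which this sketch ignores.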