Scoring matrix
Score · Definition
0
1
2
3
4
5
Evaluator panel
2
3
4
Specialised profiles
Neutral / rational
CL
The Analyst
claude-opus · Anthropic
GP
The Pragmatist
gpt-4o · OpenAI
GM
The Strategist
gemini-2.5-flash · Google
MI
The Challenger
mistral-large · Mistral AI
1 — Question
2 — Specification
3 — Response quality
Tender question
Generating content calls the API and may incur a charge. You can also write your own question.
0 / 200 words
↑ Complete the question above to unlock this section.
Specification extract
Generation uses your question above to produce a relevant and coherent specification extract.
0 / 200 words
↑ Complete the specification above to unlock this section.
Response quality profile
Mostly strong
Good response with a couple of minor weaknesses seeded in
Mostly weak
Poor response with a couple of redeeming qualities
Random
Quality profile chosen randomly on each run
Deliberation method
Baseline
Modified Consensus
Open facilitated discussion to consensus. Closest to current practice. Evaluator identities visible throughout.
Iterative
Delphi
Anonymous iterative rounds. Evaluators see aggregated score summary only — no direct debate.
Structured
Nominal Group (NGT)
Individual score → simultaneous share → meta-discussion on evaluation consistency → independent rescore.
Adversarial
Structured Argumentative
Rotating devil's advocate. Each evaluator shares; one peer challenges their reasoning. Final independent rescore.
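The four deliberation methods above differ mainly in what each evaluator sees between rounds. As a rough illustration of the Delphi option (hedged sketch: the function and evaluator names here are hypothetical and not taken from the app), evaluators could rescore against an aggregated summary only, with no direct debate:

```python
import statistics

def delphi_loop(evaluators, score_fn, max_rounds=5, tolerance=0.5):
    """Hypothetical Delphi sketch: between rounds, evaluators see only
    an aggregated score summary, never each other's identities or text."""
    summary = None
    scores = {}
    for _ in range(max_rounds):
        # Each evaluator scores independently, given only the aggregate.
        scores = {e: score_fn(e, summary) for e in evaluators}
        summary = {
            "median": statistics.median(scores.values()),
            "spread": max(scores.values()) - min(scores.values()),
        }
        # Stop early once scores have converged within tolerance.
        if summary["spread"] <= tolerance:
            break
    return scores, summary
```

A real run would replace `score_fn` with a model call per evaluator; the convergence check is why a Blind round limit can still terminate early.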
Run parameters
Max rounds
5
Number of runs
1
Round limit visibility
Blind — evaluators unaware of limit
Live status
No experiment running. Configure settings above and press Begin.
Running an experiment makes multiple API calls across up to 4 models. Total cost scales with evaluator count × round count × number of runs.
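The cost note above can be made concrete. Assuming one API call per evaluator per round per run (a simplification; any facilitation or synthesis steps would add calls on top), the worst-case call volume is:

```python
def max_api_calls(evaluators: int, max_rounds: int, runs: int) -> int:
    # Upper bound: one call per evaluator per round per run.
    # Methods that converge early (e.g. Delphi) will use fewer.
    return evaluators * max_rounds * runs

# Worst case for the defaults shown: 4 evaluators, 5 rounds, 1 run.
print(max_api_calls(4, 5, 1))  # → 20
```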
Experiment summary
No runs completed yet.
Current run log
No run in progress.