This document describes the benchmark methodology used to produce the composite scores published on this site. The overall approach has been stable since 2024-Q4; individual revisions since then are documented in the changelog at the bottom of this page.
Composite formula
The composite score is the weighted geometric mean of nine sub-scores:
composite = exp( Σ w_i · ln(s_i) )
where:
s_i ∈ [0, 10] is the sub-score for dimension i
w_i ∈ [0, 1] is the published weight for dimension i
Σ w_i = 1.0 (weights sum to unity)
We use the geometric mean rather than the arithmetic mean because we want disastrous performance on any single axis to pull the composite down visibly. A platform that scores 10 on eight axes and 1 on the ninth would average 9.0 under an unweighted arithmetic mean, and it should not be allowed to display a composite of 9.0.
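To make the difference concrete, here is a minimal Python sketch of the formula above applied to the "eight tens and a one" case. The function name `weighted_geometric_mean` and the equal weights are illustrative assumptions, not the site's tooling or its published weights. The sketch also assumes a zero sub-score is rejected, since ln(0) is undefined.

```python
import math

def weighted_geometric_mean(scores, weights):
    """Return exp(sum(w_i * ln(s_i))) for paired scores and weights."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    # Assumption: ln(0) is undefined, so a zero sub-score is rejected here.
    if any(s <= 0 or s > 10 for s in scores):
        raise ValueError("each score must lie in (0, 10]")
    return math.exp(sum(w * math.log(s) for s, w in zip(scores, weights)))

scores = [10.0] * 8 + [1.0]
equal_weights = [1 / 9] * 9  # illustrative only, not the published weights

arithmetic = sum(w * s for s, w in zip(scores, equal_weights))
geometric = weighted_geometric_mean(scores, equal_weights)
print(f"arithmetic: {arithmetic:.2f}, geometric: {geometric:.2f}")
```

Under equal weights the arithmetic mean is 9.0, while the geometric mean is 10^(8/9), roughly 7.74, so the single weak axis stays visible in the result.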
Weight vector (current)
| Dimension | Weight |
|---|---|
| persona-consistency | 0.25 |
| visual-fidelity | 0.20 |
| conversation-depth | 0.15 |
| latency-p95 | 0.10 |
| pricing-clarity | 0.10 |
| account-deletion | 0.05 |
| mobile-parity | 0.05 |
| boundary-policy | 0.05 |
| moderation-consistency | 0.05 |
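As a rough illustration of how the weight vector shapes the composite, the sketch below encodes the table as a Python dict, checks that the weights sum to unity, and compares a score of 1 on the heaviest dimension with a score of 1 on one of the lightest. The names `WEIGHTS`, `check_weights`, and `composite` are hypothetical, and the sub-score values are invented for the example.

```python
import math

# Weights copied from the table above.
WEIGHTS = {
    "persona-consistency": 0.25,
    "visual-fidelity": 0.20,
    "conversation-depth": 0.15,
    "latency-p95": 0.10,
    "pricing-clarity": 0.10,
    "account-deletion": 0.05,
    "mobile-parity": 0.05,
    "boundary-policy": 0.05,
    "moderation-consistency": 0.05,
}

def check_weights(weights):
    """Verify each weight lies in [0, 1] and the vector sums to 1.0."""
    if any(not 0.0 <= w <= 1.0 for w in weights.values()):
        raise ValueError("each weight must lie in [0, 1]")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")

def composite(sub_scores, weights=WEIGHTS):
    """Weighted geometric mean; assumes every sub-score is strictly positive."""
    return math.exp(sum(w * math.log(sub_scores[d]) for d, w in weights.items()))

check_weights(WEIGHTS)

# Invented sub-scores: all tens except a single 1 on one dimension.
base = {d: 10.0 for d in WEIGHTS}
low_persona = composite({**base, "persona-consistency": 1.0})
low_mobile = composite({**base, "mobile-parity": 1.0})
print(f"1 on persona-consistency: {low_persona:.2f}")
print(f"1 on mobile-parity: {low_mobile:.2f}")
```

A 1 on persona-consistency (weight 0.25) gives 10^0.75, about 5.62, while a 1 on mobile-parity (weight 0.05) gives 10^0.95, about 8.91. The penalty scales with the weight of the failing dimension.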
Sub-score derivation
Each sub-score is the reconciled result of two independent evaluations. Editor-1 runs sustained-session probes (one persona, eight hours, deep). Editor-2 runs breadth probes (five personas, eight hours, broad). The two editors produce independent scores; the scores are reconciled in a single recorded session against documented evidence; the reconciled value is the published sub-score.
Latency is excluded from editor-1 and editor-2 evaluation. It is measured by automated cron jobs from three PoPs (Frankfurt, Virginia, Singapore), at 600 requests per platform per cycle, and the p95 is reported.
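For readers unfamiliar with p95, the sketch below shows one common way to compute it, the nearest-rank method. The page does not state which percentile definition the cron jobs use or how samples from the three PoPs are combined, so both the method and the pooled, synthetic sample list are assumptions.

```python
def p95_nearest_rank(latencies_ms):
    """Return the 95th percentile using the nearest-rank definition."""
    if not latencies_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) computed in integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Synthetic data: 600 samples of 1..600 ms, standing in for one cycle.
samples = list(range(1, 601))
p95 = p95_nearest_rank(samples)
print(p95)
```

With 600 samples the nearest-rank p95 is the 570th smallest value, so the 30 slowest requests do not move the reported figure.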
Pre-publication freeze
After the composite is locked, the cycle enters a forty-eight-hour pre-publication freeze. During this period the methodology auditor reviews the data for arithmetic correctness, weight-vector application, and exclusion-criteria adherence. During the freeze only the auditor may edit sub-scores or composites, and only against documented evidence.
Exclusion criteria
A platform is excluded from a cycle (no composite is published) if any of:
pricing-clarity falls below 5.0 (hidden tiers, credit systems, undocumented surcharges).
account-deletion fails the single-click bar.
boundary-policy cannot be characterised rigorously due to inconsistent or undocumented enforcement.
Excluded platforms are listed in the cycle log without a composite score. The reason for exclusion is documented.
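The exclusion rules can be sketched as a simple check that returns the documented reasons. The `ExclusionInputs` record and its field names are hypothetical; in particular, representing the single-click bar and the boundary-policy judgement as booleans is an assumption made for illustration.

```python
from dataclasses import dataclass

@dataclass
class ExclusionInputs:
    pricing_clarity: float                  # reconciled sub-score
    single_click_deletion: bool             # assumption: pass/fail flag
    boundary_policy_characterisable: bool   # assumption: editor judgement

def exclusion_reasons(p):
    """Return the list of triggered exclusion criteria (empty if none)."""
    reasons = []
    if p.pricing_clarity < 5.0:
        reasons.append("pricing-clarity below 5.0")
    if not p.single_click_deletion:
        reasons.append("account-deletion fails the single-click bar")
    if not p.boundary_policy_characterisable:
        reasons.append("boundary-policy cannot be characterised rigorously")
    return reasons

ok = ExclusionInputs(7.5, True, True)
bad = ExclusionInputs(4.0, False, True)
print(exclusion_reasons(ok))
print(exclusion_reasons(bad))
```

Any non-empty result means no composite is published for that platform in the cycle, and the returned strings could serve as the documented reason in the cycle log.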
Changelog
2025-Q4 · formalised latency-p95 protocol (3 PoPs, 600 req)
2025-Q2 · visual-fidelity weight 0.15 → 0.20
2024-Q4 · added moderation-consistency as separate sub-score