Benchmark methodology

This document describes the benchmark methodology used to produce the composite scores published on this site. The set of sub-scores has been stable since 2024-Q4; weight and protocol revisions since then are documented in the changelog at the bottom of this page.

Composite formula

The composite score is the weighted geometric mean of nine sub-scores:

composite = exp( Σ w_i · ln(s_i) )

where:
  s_i  ∈ [0, 10]  is the sub-score for dimension i
  w_i  ∈ [0, 1]   is the published weight for dimension i
  Σ w_i = 1.0     (weights sum to unity)

We use the geometric mean rather than the arithmetic mean because disastrous performance on any single axis should visibly pull the composite down. A platform that scores 10 on eight axes and 1 on the ninth would average 9.0 under an unweighted arithmetic mean; it should not be allowed to display a composite that high.
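
The difference between the two means can be checked numerically. The sketch below is illustrative only: the function name, the equal-weight vector, and the handling of a zero sub-score (ln(0) is undefined, so the composite is taken as 0) are assumptions, not part of the published methodology.

```python
import math

def weighted_geometric_mean(scores, weights):
    """Return exp(sum(w_i * ln(s_i))) for weights that sum to 1."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    # Assumption: ln(0) is undefined, so a zero sub-score with a positive
    # weight is treated as driving the composite to 0 (the limiting value).
    if any(s == 0 and w > 0 for s, w in zip(scores, weights)):
        return 0.0
    return math.exp(sum(w * math.log(s) for s, w in zip(scores, weights) if w > 0))

# Eight axes at 10 and one at 1, with equal weights for illustration.
scores = [10.0] * 8 + [1.0]
equal = [1 / 9] * 9

arithmetic = sum(w * s for s, w in zip(scores, equal))
geometric = weighted_geometric_mean(scores, equal)
print(round(arithmetic, 2), round(geometric, 2))  # 9.0 7.74
```

With equal weights the single weak axis costs about 1.3 points under the geometric mean, whereas the arithmetic mean still reports 9.0.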

Weight vector (current)

Dimension               Weight
persona-consistency     0.25
visual-fidelity         0.20
conversation-depth      0.15
latency-p95             0.10
pricing-clarity         0.10
account-deletion        0.05
mobile-parity           0.05
boundary-policy         0.05
moderation-consistency  0.05
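
As a rough illustration of how the weight vector shapes the composite, the following sketch applies the weights listed above to a hypothetical platform that scores 10 everywhere except on one dimension. The helper names (`composite`, `with_weak`) and the zero-score handling are assumptions made for this example.

```python
import math

# Weight vector copied from the table above (dimension names as used in
# the table and the changelog).
WEIGHTS = {
    "persona-consistency": 0.25,
    "visual-fidelity": 0.20,
    "conversation-depth": 0.15,
    "latency-p95": 0.10,
    "pricing-clarity": 0.10,
    "account-deletion": 0.05,
    "mobile-parity": 0.05,
    "boundary-policy": 0.05,
    "moderation-consistency": 0.05,
}

def composite(sub_scores, weights=WEIGHTS):
    """Weighted geometric mean of the sub-scores, exp(sum(w_i * ln(s_i)))."""
    missing = set(weights) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    # Assumption: a zero sub-score drives the composite to 0, since ln(0)
    # is undefined.
    if any(sub_scores[d] == 0 for d in weights):
        return 0.0
    return math.exp(sum(w * math.log(sub_scores[d]) for d, w in weights.items()))

def with_weak(dimension, weak_score=1.0):
    """Hypothetical platform: 10 on every dimension except one."""
    scores = {d: 10.0 for d in WEIGHTS}
    scores[dimension] = weak_score
    return composite(scores)

weak_heavy = with_weak("persona-consistency")  # weak on a 0.25-weight axis
weak_light = with_weak("mobile-parity")        # weak on a 0.05-weight axis
print(round(weak_heavy, 2), round(weak_light, 2))  # 5.62 8.91
```

The same score of 1 costs far more on a heavily weighted dimension than on a lightly weighted one, which is the intended effect of combining the weights with a geometric mean.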

Sub-score derivation

Each sub-score is the reconciled result of two independent evaluations. Editor-1 runs sustained-session probes (one persona, eight hours, deep). Editor-2 runs breadth probes (five personas, eight hours, broad). The two editors score independently, then reconcile their scores in a single recorded session against documented evidence. The reconciled value is the published sub-score.

The latency-p95 sub-score is not evaluated by editor-1 or editor-2. Its data comes from automated cron jobs that probe from three PoPs (frankfurt, virginia, singapore), 600 requests per platform per cycle, and report the 95th-percentile (p95) latency.
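
A minimal sketch of a p95 calculation follows. It makes several assumptions that the protocol above does not specify: the nearest-rank percentile definition, pooling of samples across the three PoPs, an even split of 200 requests per PoP, and synthetic latency values in milliseconds. How a p95 latency maps onto the [0, 10] sub-score scale is not covered here.

```python
def p95_nearest_rank(samples_ms):
    """Nearest-rank 95th percentile: the smallest sample such that at
    least 95% of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]

# Synthetic latencies in ms (hypothetical data, 200 requests per PoP).
samples = {
    "frankfurt": list(range(100, 300)),
    "virginia": list(range(150, 350)),
    "singapore": list(range(200, 400)),
}

# Assumption: the reported figure pools all PoPs into one distribution.
pooled = [ms for pop in samples.values() for ms in pop]
print(len(pooled), p95_nearest_rank(pooled))  # 600 369
```

Pooling lets the slowest PoP dominate the tail; reporting a per-PoP p95 instead would be an equally plausible reading of the protocol.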

Pre-publication freeze

After the composite is locked, the cycle enters a forty-eight-hour pre-publication freeze. During this period the methodology auditor reviews the data for arithmetic correctness, weight-vector application, and exclusion-criteria adherence. During the freeze, sub-scores and composites can be edited only by the auditor, and only against documented evidence.

Exclusion criteria

A platform is excluded from a cycle (no composite is published) if any of the following applies:

  • pricing-clarity falls below 5.0 (hidden tiers, credit systems, undocumented surcharges).
  • account-deletion fails the single-click bar.
  • boundary-policy cannot be characterised rigorously due to inconsistent or undocumented enforcement.

Excluded platforms are listed in the cycle log without a composite score. The reason for exclusion is documented.
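
The exclusion rules above can be sketched as a simple check. The record type and its field names are hypothetical; in particular, the two boolean fields stand in for editorial judgments (the single-click bar and whether boundary-policy can be characterised) that this sketch does not attempt to automate.

```python
from dataclasses import dataclass

PRICING_CLARITY_FLOOR = 5.0  # threshold stated in the first criterion

@dataclass
class CycleRecord:
    # All field names are hypothetical, chosen for this example.
    platform: str
    pricing_clarity: float
    deletion_single_click: bool
    boundary_policy_characterised: bool

def exclusion_reasons(record):
    """Return the documented reasons for exclusion; empty if none apply."""
    reasons = []
    if record.pricing_clarity < PRICING_CLARITY_FLOOR:
        reasons.append("pricing-clarity below 5.0")
    if not record.deletion_single_click:
        reasons.append("account-deletion fails the single-click bar")
    if not record.boundary_policy_characterised:
        reasons.append("boundary-policy cannot be characterised")
    return reasons

record = CycleRecord("example-platform", 4.5, True, False)
reasons = exclusion_reasons(record)
if reasons:
    # Excluded: log the platform and reasons, publish no composite.
    print(record.platform, "excluded:", "; ".join(reasons))
```

Returning every matching reason, rather than stopping at the first, keeps the cycle log complete when a platform fails more than one criterion.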

Changelog

2025-Q4 · formalised latency-p95 protocol (3 PoPs, 600 req)
2025-Q2 · visual-fidelity weight 0.15 → 0.20
2024-Q4 · added moderation-consistency as separate sub-score