Methodology

Calculator Scoring Accuracy

This page documents the deterministic scoring methodology and regression test harness for the 236 clinical calculators built into Scribeable. It is not a peer-reviewed study; we publish methodology so anyone can verify the math.

What this is

A pre-registered test harness that runs in CI on every commit. The 854-fixture regression suite covers all 236 calculators and their activation triggers. Every scoring function is deterministic — same inputs, same output, every time. No LLM is involved in the scoring step; the LLM only decides when a calculator should activate, and that decision is also tested.

Calculator inventory

Point-sum rules

Examples: qSOFA, PHQ-9, GAD-7, HEART, Wells, PERC, CIWA-Ar, NIHSS, GCS, CURB-65, RCRI

Test method: Byte-for-byte match against published guideline

Count: 135+ calculators
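As an illustration of what an exact-match check for a point-sum rule could look like, here is a minimal sketch using qSOFA (one point each for respiratory rate ≥ 22/min, systolic blood pressure ≤ 100 mmHg, and GCS < 15). The names `QsofaInput`, `qsofa_score`, `FIXTURES`, and `failing_fixtures` are hypothetical and are not taken from the Scribeable codebase; the fixture values are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QsofaInput:
    respiratory_rate: int  # breaths per minute
    systolic_bp: int       # mmHg
    gcs: int               # Glasgow Coma Scale total, 3 to 15


def qsofa_score(x: QsofaInput) -> int:
    """One point per criterion met. Pure: no I/O, no randomness, no model call."""
    return (
        int(x.respiratory_rate >= 22)
        + int(x.systolic_bp <= 100)
        + int(x.gcs < 15)
    )


# Hypothetical golden fixtures: (input, expected score).
FIXTURES = [
    (QsofaInput(respiratory_rate=18, systolic_bp=120, gcs=15), 0),
    (QsofaInput(respiratory_rate=22, systolic_bp=120, gcs=15), 1),
    (QsofaInput(respiratory_rate=24, systolic_bp=100, gcs=15), 2),
    (QsofaInput(respiratory_rate=30, systolic_bp=85, gcs=13), 3),
]


def failing_fixtures(fixtures):
    """Return (input, expected, actual) for every fixture that does not match exactly."""
    return [
        (x, want, qsofa_score(x))
        for x, want in fixtures
        if qsofa_score(x) != want
    ]
```

Because the score is an integer, the comparison is plain equality with no tolerance; an empty list from `failing_fixtures(FIXTURES)` means every fixture matched.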

Lab / formula

Examples: CKD-EPI eGFR, FIB-4, FENa, MELD-Na, ASCVD 10-Year Risk, Anion Gap

Test method: Numerical tolerance ≤0.1% vs reference implementation

Count: 60+ calculators
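A relative-error tolerance of 0.1% can be checked with a few lines of arithmetic. The sketch below uses the serum anion gap, Na − (Cl + HCO3), as the formula under test. The function names, the `REL_TOL` constant, and the handling of a zero reference value are assumptions made for illustration, not the harness's actual code.

```python
REL_TOL = 1e-3  # 0.1% relative error


def anion_gap(sodium: float, chloride: float, bicarbonate: float) -> float:
    """Serum anion gap in mEq/L: Na - (Cl + HCO3)."""
    return sodium - (chloride + bicarbonate)


def within_tolerance(actual: float, reference: float, rel_tol: float = REL_TOL) -> bool:
    """True when |actual - reference| <= rel_tol * |reference|."""
    if reference == 0.0:
        # Relative error is undefined at zero; this sketch requires an exact match.
        return actual == 0.0
    return abs(actual - reference) <= rel_tol * abs(reference)
```

For example, `anion_gap(140, 104, 24)` evaluates to 12; against a hypothetical reference value of 12.01 the check passes, while against 12.02 it fails because the difference exceeds 0.1% of the reference.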

Decision rules

Examples: Alvarado, ABCD2, Duke Criteria, Child-Pugh, Caprini, CHA2DS2-VASc, HAS-BLED

Test method: Full category partition coverage in golden fixtures

Count: 40+ calculators
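One way to read "full category partition coverage" is that the fixtures must land in every output category of a rule at least once. The sketch below assumes that reading and uses the Child-Pugh classes (A: 5 to 6 points, B: 7 to 9, C: 10 to 15) as the partition. `child_pugh_class` and `uncovered_categories` are hypothetical helper names.

```python
def child_pugh_class(score: int) -> str:
    """Map a Child-Pugh total (5 to 15) to its class."""
    if not 5 <= score <= 15:
        raise ValueError(f"Child-Pugh score out of range: {score}")
    if score <= 6:
        return "A"
    if score <= 9:
        return "B"
    return "C"


def uncovered_categories(fixture_scores, classify, categories):
    """Return the categories that no fixture reaches."""
    seen = {classify(s) for s in fixture_scores}
    return set(categories) - seen


# Hypothetical fixture scores chosen to sit on each class boundary.
BOUNDARY_SCORES = [5, 6, 7, 9, 10, 15]
```

An empty set from `uncovered_categories(BOUNDARY_SCORES, child_pugh_class, ("A", "B", "C"))` indicates that every class is exercised; placing fixtures on the boundaries also guards against off-by-one errors in the cut-offs.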

The test harness

854 regression fixtures in CI

236 scored calculators

100% exact match required for point-sum calculators

  • Deterministic scoring. Scoring functions are pure: same inputs, same outputs, zero randomness, no model invocation.
  • Golden test cases. Each fixture specifies the expected score and the provenance of that expectation (guideline PDF page, reference implementation commit).
  • Mismatch resolution loop. When a regression fires, the diff is surfaced to maintainers with the original source citation. Nothing ships until the diff is resolved.
  • Activation engine tested separately. A second harness covers the LLM-driven decision of when to activate each calculator — tested for both false positives and false negatives against a hand-labeled fixture set.
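To make the golden-fixture idea concrete, the sketch below shows one possible shape for a fixture that carries its own provenance, plus a helper that turns a mismatch into a one-line diff with the citation attached. The `GoldenFixture` fields, `mismatch_report`, and the placeholder provenance string are assumptions for illustration; the real fixture schema is not described on this page. The Glasgow Coma Scale total (eye + verbal + motor) stands in as the scoring function.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class GoldenFixture:
    calculator: str
    inputs: Mapping[str, int]
    expected: int
    provenance: str  # where the expected value comes from


def gcs_total(eye: int, verbal: int, motor: int) -> int:
    """Glasgow Coma Scale total: sum of the three component scores."""
    return eye + verbal + motor


def mismatch_report(fx: GoldenFixture, score_fn: Callable[..., int]) -> Optional[str]:
    """Return None on a match, else a one-line diff that carries the citation."""
    actual = score_fn(**fx.inputs)
    if actual == fx.expected:
        return None
    return (
        f"{fx.calculator}: expected {fx.expected}, got {actual} "
        f"(source: {fx.provenance})"
    )


fixture = GoldenFixture(
    calculator="GCS",
    inputs={"eye": 4, "verbal": 5, "motor": 6},
    expected=15,
    # Placeholder only; a real fixture would name the guideline page or commit.
    provenance="PLACEHOLDER: guideline PDF page or reference commit",
)
```

Keeping the citation inside the fixture means a failing test can point a maintainer straight at the source that justifies the expected value.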

Pre-registered analysis plan

  • Point-sum calculators: 100% exact match against the published guideline — no numerical tolerance allowed
  • Lab-based and formula calculators: ≤0.1% relative error against the reference implementation
  • Activation engine: no false positives on non-triggering fixtures; no false negatives on triggering fixtures
  • Release gate: any regression fails CI and blocks the release
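The activation criterion above amounts to requiring zero false positives and zero false negatives over a labeled set. A minimal sketch of that check follows. The real activation decision is LLM-driven, so `keyword_stub` is only a stand-in that makes the example runnable, and the two labeled cases are invented.

```python
def activation_errors(cases, should_activate):
    """Split misclassified cases into false positives and false negatives.

    cases: iterable of (note_text, label), where label is True when the
    calculator should activate for that note.
    """
    false_pos, false_neg = [], []
    for text, label in cases:
        predicted = should_activate(text)
        if predicted and not label:
            false_pos.append(text)
        elif label and not predicted:
            false_neg.append(text)
    return false_pos, false_neg


def gate_passes(cases, should_activate) -> bool:
    """Release gate for activation: no false positives and no false negatives."""
    false_pos, false_neg = activation_errors(cases, should_activate)
    return not false_pos and not false_neg


def keyword_stub(text: str) -> bool:
    """Stand-in for the LLM-driven decision; NOT how activation really works."""
    return "phq-9" in text.lower()


# Hypothetical hand-labeled cases for a PHQ-9 trigger.
CASES = [
    ("Patient reports low mood; PHQ-9 screening requested.", True),
    ("Follow-up visit for ankle sprain.", False),
]
```

In CI, a `False` result from `gate_passes` would map to a non-zero exit code so that the release is blocked.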

Published fixture-set hash

This SHA-256 hash covers the published Tier 1 reference fixture bundle used in the methodology brief for independent spot-checking. The broader 854-case release-gate suite remains internal, but we share it under NDA with reviewers who submit a formal review protocol.

sha256:ba72aaa63dfd4987b79b66f7be6f38d959810a328977a246348284aa8d3df84f

Last updated: 2026-04-12
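A reviewer who has obtained the fixture bundle could compare it against the published hash along the following lines. This sketch assumes the hash was computed over a single bundle file; the file name in the usage note is hypothetical, and the exact packaging of the bundle is not specified here.

```python
import hashlib
import hmac

PUBLISHED = "sha256:ba72aaa63dfd4987b79b66f7be6f38d959810a328977a246348284aa8d3df84f"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large bundles do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published: str = PUBLISHED) -> bool:
    """Compare a file's SHA-256 to a 'sha256:<hex>' string (prefix optional)."""
    expected = published.split(":", 1)[-1].strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

Usage would look like `matches_published("tier1_fixtures.bundle")`, where the path is whatever file name the bundle is distributed under.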

What we don’t claim

  • We do not claim this is a peer-reviewed study.
  • We do not claim patient-level outcomes or cost savings from calculator accuracy.
  • We do not claim a comparison against other AI scribes' calculator engines (they do not have one).
  • We do not claim regulatory clearance. These are clinical decision-support tools, not diagnostic instruments.

For independent verification

The public methodology brief includes the reference-fixture hash and source-path notes. For the full internal regression suite and activation fixtures, email [email protected] with your review protocol and we’ll share the broader harness under a mutual NDA.