Calculator Scoring Accuracy
This page documents the deterministic scoring methodology and regression test harness for the 236 clinical calculators built into Scribeable. It is not a peer-reviewed study; we publish methodology so anyone can verify the math.
What this is
A pre-registered test harness that runs in CI on every commit. The 854-fixture regression suite covers all 236 calculators and their activation triggers. Every scoring function is deterministic — same inputs, same output, every time. No LLM is involved in the scoring step; the LLM only decides when a calculator should activate, and that decision is also tested.
Calculator inventory
Point-sum rules
Examples: qSOFA, PHQ-9, GAD-7, HEART, Wells, PERC, CIWA-Ar, NIHSS, GCS, CURB-65, RCRI
Test method: Byte-for-byte match against published guideline
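As an illustration of what a point-sum rule with an exact-match check can look like, here is a minimal sketch using qSOFA (one point each for respiratory rate ≥ 22/min, systolic blood pressure ≤ 100 mmHg, and GCS < 15). The names `QsofaInputs`, `qsofa_score`, `FIXTURES`, and `mismatches` are hypothetical and chosen for this example; they are not Scribeable's actual code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class QsofaInputs:
    respiratory_rate: int  # breaths per minute
    systolic_bp: int       # mmHg
    gcs: int               # Glasgow Coma Scale total, 3 to 15


def qsofa_score(inputs: QsofaInputs) -> int:
    """Pure function: one point per criterion met, result in 0..3."""
    return (
        int(inputs.respiratory_rate >= 22)
        + int(inputs.systolic_bp <= 100)
        + int(inputs.gcs < 15)
    )


# Hypothetical fixtures for illustration: (inputs, expected score).
FIXTURES: List[Tuple[QsofaInputs, int]] = [
    (QsofaInputs(18, 120, 15), 0),
    (QsofaInputs(22, 120, 15), 1),
    (QsofaInputs(24, 100, 15), 2),
    (QsofaInputs(24, 90, 14), 3),
]


def mismatches(fixtures: List[Tuple[QsofaInputs, int]]) -> list:
    """Return every fixture whose computed score differs from the expected one."""
    return [
        (inputs, expected, qsofa_score(inputs))
        for inputs, expected in fixtures
        if qsofa_score(inputs) != expected
    ]
```

Because the score is an integer, the comparison is plain equality with no tolerance; an empty list from `mismatches(FIXTURES)` means every fixture matched exactly.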
Lab / formula
Examples: CKD-EPI eGFR, FIB-4, FENa, MELD-Na, ASCVD 10-Year Risk, Anion Gap
Test method: Numerical tolerance ≤0.1% vs reference implementation
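The tolerance check for formula calculators can be sketched as follows, using FIB-4, (age × AST) / (platelets × √ALT), as the example. The function names and the treatment of a zero reference value are assumptions made for this sketch, not a description of the internal harness.

```python
import math


def fib4(age_years: float, ast_u_l: float, alt_u_l: float, platelets_10e9_l: float) -> float:
    """FIB-4 index: (age x AST) / (platelets x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_10e9_l * math.sqrt(alt_u_l))


def relative_error(actual: float, reference: float) -> float:
    """Relative error measured against the reference value."""
    if reference == 0:
        # Assumption for this sketch: only an exact zero matches a zero reference.
        return 0.0 if actual == 0 else math.inf
    return abs(actual - reference) / abs(reference)


def within_tolerance(actual: float, reference: float, tolerance: float = 1e-3) -> bool:
    """True when the relative error is at most 0.1% (the default tolerance)."""
    return relative_error(actual, reference) <= tolerance
```

For example, `fib4(50, 40, 25, 200)` evaluates to 2.0, and a reference value of 2.0015 would pass the 0.1% check while 2.01 would not.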
Decision rules
Examples: Alvarado, ABCD2, Duke Criteria, Child-Pugh, Caprini, CHA2DS2-VASc, HAS-BLED
Test method: Full category partition coverage in golden fixtures
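One way to read "full category partition coverage" is that the fixtures must exercise every output category of a rule. The sketch below checks that for Child-Pugh classes (A: 5 to 6 points, B: 7 to 9, C: 10 to 15). Requiring a fixture on each side of every class boundary is an extra assumption of this example, and `child_pugh_class`, `BOUNDARY_TOTALS`, and `uncovered` are hypothetical names.

```python
from typing import Iterable, Set, Tuple

ALL_CLASSES = {"A", "B", "C"}
# Totals on either side of each class boundary (6|7 and 9|10).
BOUNDARY_TOTALS = {6, 7, 9, 10}


def child_pugh_class(total_points: int) -> str:
    """Map a Child-Pugh total (5 to 15) to its class."""
    if not 5 <= total_points <= 15:
        raise ValueError(f"Child-Pugh total out of range: {total_points}")
    if total_points <= 6:
        return "A"
    if total_points <= 9:
        return "B"
    return "C"


def uncovered(fixture_totals: Iterable[int]) -> Tuple[Set[str], Set[int]]:
    """Return (classes with no fixture, boundary totals with no fixture)."""
    totals = set(fixture_totals)
    seen_classes = {child_pugh_class(t) for t in totals}
    return ALL_CLASSES - seen_classes, BOUNDARY_TOTALS - totals
```

A fixture set passes this check when both returned sets are empty, for instance `uncovered([5, 6, 7, 9, 10, 15])`.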
The test harness
854 regression fixtures in CI
236 scored calculators
Point-sum calculators: exact match required
- Deterministic scoring. Scoring functions are pure: same inputs, same outputs, zero randomness, no model invocation.
- Golden test cases. Each fixture specifies the expected score and the provenance of that expectation (guideline PDF page, reference implementation commit).
- Mismatch auto-correction loop. When a regression fires, the diff is surfaced to maintainers with the original source citation. Nothing ships until the diff is resolved.
- Activation engine tested separately. A second harness covers the LLM-driven decision of when to activate each calculator, tested for both false positives and false negatives against a hand-labeled fixture set.
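To make the golden-fixture idea concrete, here is a minimal sketch of a fixture record that carries its provenance, plus a runner that reports each mismatch alongside its source. The field names, the `REGISTRY` mapping, and the provenance string format are illustrative assumptions; the GCS total (eye + verbal + motor) is used only as a simple scoring function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping


@dataclass(frozen=True)
class GoldenFixture:
    calculator: str             # key into the scoring-function registry
    inputs: Mapping[str, int]   # keyword arguments for the scoring function
    expected: int               # expected score
    provenance: str             # where the expectation comes from (placeholder text here)


def gcs_total(eye: int, verbal: int, motor: int) -> int:
    """Glasgow Coma Scale total: sum of the three component scores."""
    return eye + verbal + motor


# Hypothetical registry mapping calculator names to pure scoring functions.
REGISTRY: Dict[str, Callable[..., int]] = {"gcs": gcs_total}


def run_fixtures(fixtures: List[GoldenFixture],
                 registry: Dict[str, Callable[..., int]]) -> List[str]:
    """Return one human-readable diff line per failing fixture."""
    report = []
    for fixture in fixtures:
        actual = registry[fixture.calculator](**fixture.inputs)
        if actual != fixture.expected:
            report.append(
                f"{fixture.calculator}: expected {fixture.expected}, "
                f"got {actual} (source: {fixture.provenance})"
            )
    return report
```

An empty report means no regression; a non-empty one gives maintainers the diff and the citation in the same line.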
Pre-registered analysis plan
- Point-sum calculators: 100% exact match against the published guideline — no numerical tolerance allowed
- Lab-based and formula calculators: ≤0.1% relative error against the reference implementation
- Activation engine: no false positives on non-triggering fixtures; no false negatives on triggering fixtures
- Release gate: any regression fails CI and blocks the release
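The four criteria above can be combined into a single pass/fail check. The sketch below is one hypothetical way to express that gate; the function names, the argument shapes, and the `(should_trigger, did_trigger)` pair encoding for activation cases are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

FORMULA_TOLERANCE = 1e-3  # 0.1% relative error


def activation_errors(cases: Iterable[Tuple[bool, bool]]) -> Tuple[int, int]:
    """Count (false positives, false negatives) over (should_trigger, did_trigger) pairs."""
    false_positives = 0
    false_negatives = 0
    for should_trigger, did_trigger in cases:
        if did_trigger and not should_trigger:
            false_positives += 1
        elif should_trigger and not did_trigger:
            false_negatives += 1
    return false_positives, false_negatives


def release_gate(point_sum_mismatches: int,
                 worst_formula_rel_error: float,
                 activation_cases: Iterable[Tuple[bool, bool]]) -> List[str]:
    """Return the list of violated criteria; an empty list means the gate passes."""
    failures = []
    if point_sum_mismatches != 0:
        failures.append(f"{point_sum_mismatches} point-sum mismatch(es)")
    if worst_formula_rel_error > FORMULA_TOLERANCE:
        failures.append(f"formula relative error {worst_formula_rel_error:.4%} exceeds 0.1%")
    false_positives, false_negatives = activation_errors(activation_cases)
    if false_positives:
        failures.append(f"{false_positives} activation false positive(s)")
    if false_negatives:
        failures.append(f"{false_negatives} activation false negative(s)")
    return failures
```

In a CI job, a non-empty result would translate into a non-zero exit code, which is what blocks the release.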
Published fixture-set hash
This SHA-256 hash covers the published Tier 1 reference fixture bundle used in the methodology brief for independent spot-checking. The broader 854-case release-gate suite remains internal, but we share it under NDA for formal review protocols.
Last updated: 2026-04-12
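A reviewer who has the reference fixture bundle can check it against the published hash with Python's standard library. This sketch assumes the bundle is distributed as a single file and that the hash is published as a hexadecimal string; `sha256_of_file` and `matches_published_hash` are names chosen for this example.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, published_hex: str) -> bool:
    """Compare the file's digest with the published value, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of_file(path), published_hex.strip().lower())
```

A mismatch means the local file is not byte-identical to the bundle the hash was computed from.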
What we don’t claim
- We do not claim this is a peer-reviewed study.
- We do not claim patient-level outcomes or cost savings from calculator accuracy.
- We do not claim a comparison against other AI scribes' calculator engines (they do not have one).
- We do not claim regulatory clearance. These are clinical decision-support tools, not diagnostic instruments.
For independent verification
The public methodology brief includes the reference-fixture hash and source-path notes. For the full internal regression suite and activation fixtures, email [email protected] with your review protocol and we’ll share the broader harness under a mutual NDA.