Methodology

Calculator Scoring Accuracy

This page documents the deterministic scoring methodology and regression test harness for the 236 clinical calculators built into Scribeable. It is not a peer-reviewed study; we publish methodology so anyone can verify the math.

What this is

A pre-registered test harness that runs in CI on every commit. The 854-fixture regression suite covers all 236 calculators and their activation triggers. Every scoring function is deterministic — same inputs, same output, every time. No LLM is involved in the scoring step; the LLM only decides when a calculator should activate, and that decision is also tested.
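As a minimal sketch of what a deterministic point-sum scorer can look like, here is qSOFA written as a pure Python function. The function name and signature are illustrative assumptions, not Scribeable's actual code; the three criteria and their cutoffs are the standard published qSOFA criteria.

```python
def qsofa_score(respiratory_rate: int, systolic_bp: int, gcs: int) -> int:
    """Return the qSOFA score (0-3). Hypothetical helper for illustration.

    One point each for:
      - respiratory rate >= 22 breaths/min
      - systolic blood pressure <= 100 mmHg
      - altered mentation (Glasgow Coma Scale < 15)
    """
    # No I/O, no randomness, no model call: the result depends only on the inputs.
    return (
        int(respiratory_rate >= 22)
        + int(systolic_bp <= 100)
        + int(gcs < 15)
    )


if __name__ == "__main__":
    print(qsofa_score(respiratory_rate=24, systolic_bp=95, gcs=14))
```

Because the function has no hidden state, calling it twice with the same arguments always yields the same score, which is what makes fixture-based regression testing meaningful.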

Calculator inventory

Point-sum rules

Examples: qSOFA, PHQ-9, GAD-7, HEART, Wells, PERC, CIWA-Ar, NIHSS, GCS, CURB-65, RCRI

Test method: Exact match against the published guideline (no tolerance)

135+ calculators

Lab / formula

Examples: CKD-EPI eGFR, FIB-4, FENA, MELD-Na, ASCVD 10-Year Risk, Anion Gap

Test method: Numerical tolerance ≤0.1% vs reference implementation

60+ calculators

Decision rules

Examples: Alvarado, ABCD2, Duke Criteria, Child-Pugh, Caprini, CHA2DS2-VASc, HAS-BLED

Test method: Full category partition coverage in golden fixtures

40+ calculators
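One way to read "full category partition coverage" is that the golden fixtures must exercise every output category of a rule at least once. The sketch below checks this for Child-Pugh, using the standard class boundaries (A: 5-6, B: 7-9, C: 10-15). The function names are hypothetical, and the real harness may define coverage differently.

```python
from typing import Dict, Iterable, Set

# Standard Child-Pugh classes keyed by total score range.
CHILD_PUGH_CLASSES: Dict[str, range] = {
    "A": range(5, 7),    # totals 5-6
    "B": range(7, 10),   # totals 7-9
    "C": range(10, 16),  # totals 10-15
}


def child_pugh_class(total: int) -> str:
    """Map a Child-Pugh total score to its class label."""
    for label, scores in CHILD_PUGH_CLASSES.items():
        if total in scores:
            return label
    raise ValueError(f"Child-Pugh total out of range: {total}")


def uncovered_categories(fixture_totals: Iterable[int]) -> Set[str]:
    """Return the classes that no fixture lands in (empty set = full coverage)."""
    covered = {child_pugh_class(total) for total in fixture_totals}
    return set(CHILD_PUGH_CLASSES) - covered


if __name__ == "__main__":
    print(uncovered_categories([5, 8]))
```

A CI check built this way would fail whenever the returned set is non-empty, so a fixture set that never reaches class C could not pass unnoticed.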

The test harness

854 regression fixtures in CI

236 scored calculators

100% exact match required for point-sum calculators

• Deterministic scoring. Scoring functions are pure: same inputs, same outputs, zero randomness, no model invocation.
• Golden test cases. Each fixture specifies the expected score and the provenance of that expectation (guideline PDF page, reference implementation commit).
• Mismatch auto-correction loop. When a regression fires, the diff is surfaced to maintainers with the original source citation. Nothing ships until the diff is resolved.
• Activation engine tested separately. A second harness covers the LLM-driven decision of when to activate each calculator, tested for both false positives and false negatives against a hand-labeled fixture set.
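The golden-fixture idea above can be sketched as a small record that carries both the expected score and its provenance, plus a checker that reports the source when a mismatch occurs. Everything here (the `GoldenFixture` fields, the `REGISTRY` lookup, the provenance string) is an assumed shape for illustration, not the actual fixture format; the Glasgow Coma Scale total is used because it is a plain sum of its eye, verbal, and motor components.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Optional


@dataclass(frozen=True)
class GoldenFixture:
    calculator: str            # key into the calculator registry
    inputs: Mapping[str, int]  # keyword arguments for the scoring function
    expected: int              # expected score taken from the source
    provenance: str            # where the expectation comes from


def gcs_total(eye: int, verbal: int, motor: int) -> int:
    """Glasgow Coma Scale total: eye (1-4) + verbal (1-5) + motor (1-6)."""
    return eye + verbal + motor


# Hypothetical registry mapping calculator names to pure scoring functions.
REGISTRY: Dict[str, Callable[..., int]] = {"gcs": gcs_total}


def check_fixture(fx: GoldenFixture) -> Optional[str]:
    """Return None on an exact match, else a diff message citing the source."""
    actual = REGISTRY[fx.calculator](**fx.inputs)
    if actual == fx.expected:
        return None
    return (
        f"{fx.calculator}: expected {fx.expected}, got {actual} "
        f"(source: {fx.provenance})"
    )


# The provenance below is a placeholder, not a real citation.
FIXTURE = GoldenFixture(
    calculator="gcs",
    inputs={"eye": 3, "verbal": 4, "motor": 5},
    expected=12,
    provenance="placeholder guideline reference",
)

if __name__ == "__main__":
    print(check_fixture(FIXTURE))
```

Keeping the provenance on the fixture itself means the diff shown to a maintainer already names the document that decides which side is wrong.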

Pre-registered analysis plan

Point-sum calculators: 100% exact match against the published guideline — no numerical tolerance allowed
Lab-based and formula calculators: ≤0.1% relative error against the reference implementation
Activation engine: no false positives on non-triggering fixtures; no false negatives on triggering fixtures
Release gate: any regression fails CI and blocks the release
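The pass criteria above can be expressed as a few small predicates. This is a sketch under stated assumptions: the category names, function names, and the treatment of a zero reference value are illustrative choices, not the harness's actual rules. Only the thresholds (exact match, 0.1% relative error, zero false positives and false negatives) come from the plan itself.

```python
from typing import Iterable, Tuple

REL_TOLERANCE = 0.001  # 0.1% relative error for lab and formula calculators


def relative_error(actual: float, reference: float) -> float:
    """Relative error measured against the reference value."""
    if reference == 0:
        # Assumption: with a zero reference, only an exact zero counts as a match.
        return 0.0 if actual == 0 else float("inf")
    return abs(actual - reference) / abs(reference)


def score_passes(kind: str, actual: float, reference: float) -> bool:
    """Apply the criterion for the given calculator category."""
    if kind == "point_sum":
        return actual == reference  # exact match, no tolerance
    if kind == "formula":
        return relative_error(actual, reference) <= REL_TOLERANCE
    raise ValueError(f"unknown calculator kind: {kind}")


def activation_errors(results: Iterable[Tuple[bool, bool]]) -> Tuple[int, int]:
    """Count (false positives, false negatives) from (should_fire, did_fire) pairs."""
    false_pos = false_neg = 0
    for should_fire, did_fire in results:
        if did_fire and not should_fire:
            false_pos += 1
        elif should_fire and not did_fire:
            false_neg += 1
    return false_pos, false_neg


def release_allowed(score_checks: Iterable[bool], activation: Tuple[int, int]) -> bool:
    """Release gate: every score check passes and activation has no errors."""
    return all(score_checks) and activation == (0, 0)
```

For example, a formula result of 100.05 against a reference of 100.0 has a relative error of about 0.05% and passes, while 100.2 (0.2%) fails and would block the release.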

Published fixture-set hash

This SHA-256 hash covers the published Tier 1 reference fixture bundle used in the methodology brief for independent spot-checking. The broader 854-case release-gate suite remains internal, but we share it under NDA for formal review protocols.

sha256:ba72aaa63dfd4987b79b66f7be6f38d959810a328977a246348284aa8d3df84f
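A reviewer who has obtained the fixture bundle could compare its digest with the published value along the following lines. This sketch assumes the hash was computed over the raw bytes of a single bundle file; the page does not say how the bundle is packaged or canonicalized, so the file path and that assumption would need to be confirmed against the methodology brief.

```python
import hashlib
import hmac

PUBLISHED_HASH = (
    "sha256:ba72aaa63dfd4987b79b66f7be6f38d959810a328977a246348284aa8d3df84f"
)


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Stream a file through SHA-256 and return 'sha256:<hex digest>'."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large bundles do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def matches_published(path: str, published: str = PUBLISHED_HASH) -> bool:
    """True if the file's digest equals the published fixture-set hash."""
    return hmac.compare_digest(sha256_of_file(path), published)
```

A mismatch means either the bundle differs from the one the hash was computed over or the packaging assumption above does not hold; it does not by itself say which.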

Last updated: 2026-04-12

What we don’t claim

We do not claim this is a peer-reviewed study.
We do not claim patient-level outcomes or cost savings from calculator accuracy.
We do not claim a comparison against other AI scribes' calculator engines (to our knowledge, they do not have one).
We do not claim regulatory clearance. These are clinical decision-support tools, not diagnostic instruments.

For independent verification

The public methodology brief includes the reference-fixture hash and source-path notes. For the full internal regression suite and activation fixtures, email [email protected] with your review protocol and we’ll share the broader harness under a mutual NDA.
