External validation and comparisons

Quality and validation

No loud percentages, just verifiable material: an external benchmark on state-exam questions, open comparisons with other AI systems, and a clear pilot procedure that lets a partner verify results on their own data.

The final clinical decision is always made by a doctor. Wizey works as an assistant: it speeds up review and reduces routine, but does not replace a specialist.

External validation

92% on a General Medicine state-exam benchmark

A blind test on 675 questions from an official state final-certification collection, with no training on this dataset and a strict Exact Match metric. This is not a promise of clinical infallibility: it is a check of the model's ability to work consistently with medical wording.

Benchmark result
92%
620 correct answers
out of 675 exam tasks

A blind test on a state curriculum

We tested the model on material from the final certification in the specialty General Medicine (31.05.01). The model did not see this dataset during training, and partially correct answers were counted as wrong.

Metric
Exact Match: an answer is correct only on a full match with the reference
Mode
Blind testing, with no fine-tuning on the test set
Volume
675 tasks across 3 specialty blocks
Scoring
620 of 675 correct: strict, no partial credit
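
The scoring rule above can be sketched in a few lines. This is a minimal illustration, not Wizey's evaluation code: it assumes each answer is a set of selected option labels, and the function names and toy data are hypothetical. Only the totals 620 and 675 come from the benchmark.

```python
def exact_match(predicted, reference):
    """Strict Exact Match: correct only if the selected options equal the reference exactly."""
    return set(predicted) == set(reference)


def score(answers):
    """answers: iterable of (predicted, reference) pairs. Returns (correct, total, accuracy)."""
    answers = list(answers)
    correct = sum(exact_match(p, r) for p, r in answers)
    return correct, len(answers), correct / len(answers)


# Toy data (hypothetical, not taken from the benchmark).
toy = [
    ({"A"}, {"A"}),                  # full match: correct
    ({"A", "C"}, {"A", "B", "C"}),   # partial overlap: counted as wrong
    ({"B"}, {"D"}),                  # no overlap: wrong
]
toy_correct, toy_total, toy_accuracy = score(toy)

# The reported totals give the headline figure.
reported_accuracy = 620 / 675
print(f"{reported_accuracy:.1%}")  # 91.9%, shown as 92% after rounding
```

Under this rule a partially correct answer contributes nothing to the score, which is why the metric is stricter than per-option accuracy.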

Sample structure

Breakdown of the 675 exam tasks by block. Per-specialty accuracy is reviewed during the pilot on your scenario.

Therapy (cardio, gastro, endo, pulmo, nephro, rheuma, hema): 370
Fundamental (anatomy, biology, biochemistry): 210
Surgery (hospital surgery): 95
The scale is proportional to each block's share of the sample (675 tasks total).
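
For reference, the shares behind that scale follow directly from the block counts. A short sketch, with the dictionary keys used only as labels:

```python
# Task counts per block, as listed in the sample structure.
blocks = {"Therapy": 370, "Fundamental": 210, "Surgery": 95}

total = sum(blocks.values())  # 675
shares = {name: count / total for name, count in blocks.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# Therapy: 54.8%
# Fundamental: 31.1%
# Surgery: 14.1%
```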

The benchmark confirms the model works confidently with medical terminology and logic. But for a B2B rollout this is only a starting point: final suitability is best verified on your scenario and an agreed set of examples.

Method

How we control quality

Four recurring practices the product relies on. This is not marketing but a working process: sources, review, audit, and learning from mistakes.

1

Sources of truth

Recognized clinical guidelines, current nosological protocols, up-to-date reference ranges for specific labs, and topical handbooks. We do not substitute guidelines with the model's generic answers.

More on the approach: see the B2B use cases.
2

Expert review

The medical correctness of wording is checked by Wizey's Chief Medical Officer (Internal Medicine) and medical team. Experts review phrasing, disputed interpretations, and edge cases, and correct answer templates.

For B2B, a separate template audit under your brand and clinic protocols.
3

Regular audit

Continuous checking: a sampled audit of answers on typical cases, and a regular review of templates when new guidelines appear or partner-lab reference ranges change.

For B2B, a dedicated template audit for your protocols.
4

Learning from errors

A spotted error is logged, reviewed by an expert, and turned into a template fix or a regression case. This keeps the model from repeating the same mistake on new analyses.

In a B2B pilot, the partner sees the correction workflow in their dashboard.
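
One way to picture a regression case is as a stored input plus a phrase the expert-approved answer must contain. The sketch below is an assumption for illustration only: `RegressionCase`, `run_regression`, and the stub answer function are hypothetical and do not describe Wizey's internal tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    input_text: str
    must_contain: str  # phrase the expert-approved answer must include


def run_regression(cases, answer_fn):
    """Return the IDs of cases whose answer lacks the required phrase."""
    failed = []
    for case in cases:
        answer = answer_fn(case.input_text)
        if case.must_contain.lower() not in answer.lower():
            failed.append(case.case_id)
    return failed


# Hypothetical stand-in for the system under test.
def stub_answer(text):
    return "Please discuss this result with your doctor."


cases = [
    RegressionCase("case-001", "example lab result", "discuss this result with your doctor"),
    RegressionCase("case-002", "example lab result", "repeat the test"),
]
failed_ids = run_regression(cases, stub_answer)  # ["case-002"]
```

Running such cases after every template change is what lets a fixed error stay fixed.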
Pilot

How validation on your data works

A pilot is meant to assess not an abstract model but concrete value inside a clinic, laboratory, or digital product. Three predictable steps and a clear result.

Data

An agreed set of examples

We check performance on text-based medical data that resembles the partner's real flow: lab results, clinical conclusions, discharge summaries. Data is transferred through a secure perimeter.

Step 1 · Input
Assessment

Quality control by the partner

The partner's team checks the clarity, completeness, structure, and fitness of the result for the chosen role: lab tech, doctor, or patient. You can use your own evaluation criteria.

Step 2 · Review
Decision

A plan for refinements and integration

After the pilot we record what to change in answer format, roles, and templates. Next comes integration via API and terms under your plan.

Step 3 · Next
Pilot metrics

What we measure during the pilot

Metrics differ across B2B scenarios. Labs look at add-on uptake and the quality of explanations to patients; clinics at how fast summaries are prepared and how much routine the doctor sheds; digital products at conversion into use and repeat visits. We agree on metrics before launch, so each side knows how to read the result.

Result clarity, completeness of analysis, processing speed, expert feedback, conversion and retention, readiness to scale.

Want to verify quality on your scenario?

Describe the process, the document types, and the user role. We'll propose a pilot-validation format, a set of materials, and evaluation criteria for your team.

Discuss a quality check