External validation and comparisons

Quality and validation

Instead of headline percentages, verifiable material: an external benchmark on state-exam questions, open comparisons with other AI models, and a clear pilot procedure so that a partner can check the results on their own data.

The final clinical decision is always made by a doctor. Wizey works as an assistant: it speeds up review and reduces routine, but does not replace a specialist.

External validation

92% on a General Medicine state-exam benchmark

A blind test on 675 questions from an official state final-certification collection — with no training on this dataset and a strict Exact Match metric. This is not a promise of clinical infallibility: it is a check of the model's ability to work consistently with medical wording.

Benchmark result

92%: 620 correct answers out of 675 exam tasks

A blind test on a state curriculum

We tested the model on material from the final certification in the specialty General Medicine (31.05.01). The model did not see this dataset during training, and partially correct answers were counted as wrong.

Metric: Exact Match, where an answer is correct only on a full match with the reference

Mode: blind testing, with no fine-tuning on the test set

Volume: 675 tasks across 3 specialty blocks

Scoring: 620 of 675 correct, strict, with no partial credit
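The strict scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not Wizey's evaluation code: the function names, the normalization step (trimming whitespace and lowercasing), and the sample answers are assumptions made for the example. Only the totals 620 and 675 come from the benchmark.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Return True only when the answer fully matches the reference.

    Assumption: answers are compared after trimming whitespace and
    lowercasing; the source does not describe its normalization.
    """
    return prediction.strip().lower() == reference.strip().lower()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of answers that match exactly; partial matches score zero."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not references:
        raise ValueError("empty test set")
    correct = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical multi-answer item: a partially correct answer counts as wrong.
assert exact_match("A, C", "a, c")
assert not exact_match("A", "A, C")

# Reported totals: 620 correct out of 675 tasks.
reported = 620 / 675
print(f"{reported:.1%}")  # 91.9%, shown on the page rounded to 92%
```

Under Exact Match, an answer that names only one of two required options scores the same as a fully wrong one, which is why the metric gives a conservative estimate.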

Sample structure

Breakdown of the 675 exam tasks by block. Per-specialty accuracy is reviewed during the pilot on your scenario.

Therapy (cardio, gastro, endo, pulmo, nephro, rheuma, hema): 370

Fundamental (anatomy, biology, biochemistry): 210

Surgery (hospital surgery): 95

The scale is proportional to each block's share of the sample (675 tasks total).
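The block shares behind that scale can be reproduced with simple arithmetic. In the sketch below, the Therapy and Fundamental counts and the 675 total come from the page. The Surgery count is taken as the remainder of the total, an assumption that holds only if the three blocks do not overlap. Variable names are illustrative.

```python
TOTAL_TASKS = 675

# Counts stated in the sample structure.
blocks = {"Therapy": 370, "Fundamental": 210}

# Assumption: Surgery is whatever remains of the 675 tasks,
# i.e. the three blocks are disjoint and cover the whole sample.
blocks["Surgery"] = TOTAL_TASKS - sum(blocks.values())

shares = {name: count / TOTAL_TASKS for name, count in blocks.items()}

for name, share in shares.items():
    print(f"{name}: {blocks[name]} tasks, {share:.1%}")
# Therapy: 370 tasks, 54.8%
# Fundamental: 210 tasks, 31.1%
# Surgery: 95 tasks, 14.1%
```

These are shares of the sample only. They say nothing about accuracy within each block, which the page defers to the pilot.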

The benchmark confirms the model works confidently with medical terminology and logic. But for a B2B rollout this is only a starting point — final suitability is best verified on your scenario and an agreed set of examples.

Comparisons

Wizey vs general-purpose AI

Honest breakdowns on real clinical panels: where general models hallucinate, confuse units of measurement, or give potentially unsafe advice — and where Wizey differs through specialization, expert review, and medical sources. Each comparison is evergreen and built for independent verification.

Wizey vs ChatGPT

Medical AI vs a general chatbot

A general LLM versus a purpose-built medical assistant: where ChatGPT drifts into generic phrasing, where it invents reference ranges, and which lab-analysis tasks belong only in a specialized service.

accuracy · privacy · OCR

Wizey vs Claude

Constitutional AI and medicine

Claude hallucinates less and more readily declines medical questions. Is that enough for interpreting lab results? Its strengths and clear limits, side by side with a specialized tool.

refusals · answer safety · reasoning

Wizey vs Gemini

Multimodality and medical documents

Gemini can process photos and PDFs. We look at whether multimodality helps when interpreting lab results, and where specialized OCR plus medical context beats a general multimodal model.

multimodality · PDF / photo · OCR

Want the same breakdown for your task? Send 3–5 anonymized cases — during the pilot we'll compare Wizey with the model you use today. See use cases, integration, and the data perimeter.

Method

How we control quality

Four recurring practices the product relies on. This is not marketing but a working process: sources, review, audit, and learning from mistakes.

Sources of truth

Recognized clinical guidelines, current nosological protocols, up-to-date reference ranges for specific labs, and topical handbooks. We do not replace guidelines with the model's generic answers.

More on the approach — in the B2B use cases.

Expert review

The medical correctness of wording is checked by Wizey's Chief Medical Officer (Internal Medicine) and medical team. Experts review phrasing, disputed interpretations, and edge cases, and correct answer templates.

For B2B, a separate template audit under your brand and clinic protocols.

Regular audit

Continuous checking: a sampled audit of answers on typical cases, and a regular review of templates when new guidelines appear or partner-lab reference ranges change.

For B2B — a dedicated template audit for your protocols.

Learning from errors

A spotted error is logged, reviewed by an expert, and turned into a template fix or a regression case. This keeps the model from repeating the same mistake on new analyses.

In a B2B pilot, the partner sees the correction workflow in their dashboard.
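The error loop described above can be pictured as a small regression suite. The sketch below is hypothetical: `RegressionCase`, `run_regressions`, the two template functions, and the sample case are illustrative names and data, not Wizey's API or templates. It shows only the idea that a reviewed error becomes a fixed input with an expert-approved expectation that is re-checked after every template change.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    input_text: str
    must_contain: str      # expert-approved wording the answer has to include
    must_not_contain: str  # wording taken from the original error


def run_regressions(
    interpret: Callable[[str], str], cases: list[RegressionCase]
) -> list[str]:
    """Return the ids of cases where the answer repeats a known error."""
    failed = []
    for case in cases:
        answer = interpret(case.input_text).lower()
        ok = (
            case.must_contain.lower() in answer
            and case.must_not_contain.lower() not in answer
        )
        if not ok:
            failed.append(case.case_id)
    return failed


# Hypothetical logged error: units were switched in the explanation.
cases = [RegressionCase("units-001", "Glucose 5.4 mmol/L", "mmol/L", "mg/dL")]


def fixed_template(text: str) -> str:
    # Stand-in for a corrected answer template.
    return f"Reported value: {text}. Units are kept as in the lab report."


def broken_template(text: str) -> str:
    # Stand-in for the template before the fix.
    return "Reported value: glucose 5.4 mg/dL."


print(run_regressions(fixed_template, cases))   # []
print(run_regressions(broken_template, cases))  # ['units-001']
```

Substring checks are a deliberate simplification. A real suite would more likely compare structured fields or rely on expert-reviewed reference answers.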

Pilot

How validation on your data works

A pilot assesses concrete value inside a clinic, laboratory, or digital product, not an abstract model. It runs in three predictable steps and ends with a clear result.

Data

An agreed set of examples

We check performance on text-based medical data that resembles the partner's real workflow: lab results, conclusions, and discharge summaries. Data is transferred through a secure perimeter.

Step 1 · Input

Assessment

Quality control by the partner

The partner's team checks the clarity, completeness, structure, and fitness of the result for the chosen role — lab tech, doctor, patient. You can use your own evaluation criteria.

Step 2 · Review

Decision

A plan for refinements and integration

After the pilot we record what to change in answer format, roles, and templates. The next steps are integration via API and terms under your plan.

Step 3 · Next

Pilot metrics

What we measure during the pilot

Metrics differ across B2B scenarios. Labs look at add-on uptake and the quality of explanations to patients; clinics at how fast summaries are prepared and how much routine the doctor sheds; digital products at conversion into use and repeat visits. We agree on metrics before launch, so each side knows how to read the result.

result clarity · completeness of analysis · processing speed · expert feedback · conversion and retention · readiness to scale

Want to verify quality on your scenario?

Describe the process, the document types, and the user role. We'll propose a pilot-validation format, a set of materials, and evaluation criteria for your team.

Discuss a quality check