Your EU AI Act Risk Assessment Is a Story. Conformity Assessment Needs Math.
Your conformity assessment is due and the question on the table is deceptively simple: is this high-risk AI system safe enough to deploy? You have a risk management file, a stack of test reports, and a narrative that says you mitigated foreseeable harms. Then the auditor asks the one thing your documentation cannot answer: what level of failure probability is acceptable here, and what statistical evidence shows you meet it?
That gap between legal language like "acceptable risk" and engineering-grade verification is where EU AI Act compliance will stall for a lot of otherwise serious teams.
The problem: "acceptable risk" is not an engineering specification
The EU AI Act requires accuracy, robustness, cybersecurity, human oversight, post-market monitoring, and quality management for high-risk systems. But it mostly does not hand you a numeric target like "failure probability must be below 1 in 1,000,000 per decision" or "false negative rate must be below 0.1% in operating condition X."
Instead, organizations produce what shows up in almost every early compliance package: a qualitative risk register that says "harm severity: high, likelihood: low," a set of model metrics on a benchmark dataset that is not tied to the real operating domain, and a narrative argument that mitigations exist. Two organizations can ship very different systems with very different real-world failure rates and both claim acceptable residual risk, because the term is not quantitatively pinned down.
A position paper from Nessler, Hochreiter, and Doms at TÜV AUSTRIA and JKU Linz makes this case directly. They argue that the EU AI Act requires extensive documentation but fails to define testable quality requirements for automated decisions, and that the difference between a trustworthy AI system and a non-trustworthy one lies in the precision of the application domain definition and whether the system was statistically tested on that domain. That framing changes what compliance evidence looks like. It stops being a story about intentions and becomes a set of measurable claims with confidence bounds, test design, and clear validity limits.
The two-stage model that makes this workable
The approach separates what should be a policy decision from what should be an engineering task.
Stage one is policy. A regulator, notified body, or competent authority specifies two things: the acceptable failure probability for specific failure modes (for example, the probability of a harmful decision must be below 1 in 10,000 decisions, or false negatives for condition X must be below 0.5%), and the operating domain under which the claim must hold. Operating domain is not just geography or language. It is the distribution of inputs and contexts the system will face: device types, user populations, environmental conditions, workflow constraints, adversarial exposure, and the boundaries of intended use.
This maps closely to how safety engineering works in aviation and medical devices. Aviation does not say "acceptable risk." It defines failure probabilities for specific hazards and specifies operating conditions and maintenance assumptions. Medical devices define intended use and performance claims tied to specific populations.
Stage two is engineering. Once targets and domains are defined, engineers design tests and generate evidence. Define failure modes precisely. Run system-level evaluations that reflect the operating domain. Compute estimates and confidence bounds on failure probability. Document assumptions, sampling methods, and validity limits. The output is not "we believe risk is acceptable." The output is "under operating domain D, with confidence level 95%, the failure probability is below threshold T, based on N samples and test protocol P." That is an artifact an auditor can interrogate, reproduce, and compare across systems.
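To make that artifact concrete, here is a minimal sketch of how such a claim could be captured as a structured record rather than prose. The field names are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field

# Illustrative structure only; field names are assumptions, not a mandated schema.
@dataclass
class ConformityClaim:
    operating_domain: str        # reference to the documented domain definition D
    failure_mode: str            # precisely defined failure mode being bounded
    threshold: float             # acceptable failure probability T set at the policy stage
    confidence_level: float      # e.g. 0.95
    sample_size: int             # N independent test cases drawn from the domain
    observed_failures: int       # failures observed under test protocol P
    upper_bound: float           # statistical upper bound on the failure probability
    test_protocol: str           # reference to the protocol document P
    assumptions: list = field(default_factory=list)  # sampling method, validity limits

    def meets_target(self) -> bool:
        # The claim holds only if the upper bound, not the point estimate, is below the target.
        return self.upper_bound <= self.threshold
```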
The math that changes everything
Here is where this stops being abstract and starts disrupting compliance planning.
Suppose your acceptable harmful failure probability is p at or below 0.001, which is 0.1%. You run a test and observe zero harmful failures. How many independent samples do you need to claim, with 95% confidence, that p is at or below 0.001?
A standard result from binomial confidence bounds, often called the rule of three: with zero observed failures in N independent samples, the 95% upper confidence bound on the failure probability is approximately 3 divided by N. So you need roughly 3,000 samples to push the bound down to 0.001.
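A quick sanity check of that calculation, comparing the exact binomial requirement with the rule-of-three approximation (the function name is just for illustration):

```python
import math

def samples_for_zero_failure_claim(p_max: float, confidence: float = 0.95) -> int:
    """Smallest N such that zero failures in N independent trials gives an exact
    binomial upper bound on the failure probability at or below p_max."""
    # Zero failures in N trials supports the claim at the given confidence
    # when (1 - p_max) ** N <= 1 - confidence.
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_max))

p_max = 0.001
print(samples_for_zero_failure_claim(p_max))   # 2995 from the exact bound
print(3 / p_max)                               # 3000 from the rule-of-three approximation
```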
That single calculation changes planning immediately. You cannot quickly test your way into strong guarantees. If the acceptable failure probability is very low, your evaluation effort must scale accordingly. If testing at that scale is impossible, you need to reduce the claim, narrow the operating domain, or add operational controls that reduce exposure. This is why a quantitative definition of acceptable risk is disruptive: it forces alignment between the claim and the evidence budget.
And the math gets harder for real systems. High-risk AI systems rarely fail in a single way. They fail differently across populations, contexts, and decision types. "Accuracy equals 94%" is almost never a meaningful safety claim. You need failure modes that map to harm. A recruitment screening model: false negatives that systematically exclude qualified candidates in a protected group. A creditworthiness model: false positives that deny credit incorrectly. A medical triage model: false negatives that delay urgent care. A biometric identification system: false matches leading to wrongful identification.
For each failure mode, you need an operational definition. If two reviewers cannot agree on whether an output is a failure, you cannot measure it. That forces you to formalize labels, rubrics, and adjudication procedures, exactly the engineering hygiene whose absence conformity assessments tend to expose.
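One way to make "two reviewers can agree" measurable is to have two reviewers independently label the same outputs and compute chance-corrected agreement. A minimal sketch, assuming boolean failure labels:

```python
def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Agreement between two reviewers labeling the same outputs as failure / not failure,
    corrected for chance. Low kappa means the failure definition is not operational yet."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0 if observed == 1.0 else 0.0
    return (observed - expected) / (1 - expected)
```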
Test design that auditors can actually use
Three principles separate a defensible test package from a checkbox exercise.
First, the operating domain must be a testable object, not a prose description. Write down input types and ranges, user populations and segmentation, languages and dialects, workflow constraints, environmental conditions, threat model assumptions, and data freshness patterns. Then translate that into a sampling plan with explicit coverage goals. Where do test cases come from? Historical production data, synthetic generation with stated coverage goals, third-party datasets with justified domain match, and targeted corner case suites for rare but high-severity conditions.
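As a sketch of what explicit coverage goals can look like in practice, assuming the domain definition has been broken into named strata with minimum sample counts (the stratum names below are invented):

```python
def coverage_gaps(plan: dict, test_set_counts: dict) -> dict:
    """Compare achieved test coverage against the sampling plan's per-stratum minimums.
    Strata and minimum counts come from the documented operating-domain definition."""
    return {
        stratum: {"required": required, "achieved": test_set_counts.get(stratum, 0)}
        for stratum, required in plan.items()
        if test_set_counts.get(stratum, 0) < required
    }

# Hypothetical plan: minimum samples per language/device stratum plus a corner-case suite.
plan = {"lang=de,device=mobile": 300, "lang=fr,device=desktop": 300, "rare_corner_cases": 50}
achieved = {"lang=de,device=mobile": 412, "lang=fr,device=desktop": 180}
print(coverage_gaps(plan, achieved))   # reports the under-covered strata
```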
Second, use black-box evaluation when model internals do not matter to the claim. For conformity assessment, what matters is system behavior: inputs, outputs, decisions, and impacts. Black-box evaluation works across vendor models you do not control, complex pipelines with retrieval and rules and human-in-the-loop, and agentic workflows where the model is not a single component. You define the system boundary, then test the system as deployed. This matters because high-risk failures often come from integration, not the base model. A perfectly fine classifier can become unsafe when embedded in a workflow with bad thresholds, missing escalation, or overly broad automation.
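A minimal sketch of that system-level loop, assuming the deployed system is callable behind its boundary and the failure definition is already operational; every name here is a placeholder:

```python
from typing import Callable, Iterable

def run_blackbox_evaluation(
    system: Callable[[dict], dict],            # the system as deployed: input -> decision
    test_cases: Iterable[dict],                # cases sampled per the operating-domain plan
    is_failure: Callable[[dict, dict], bool],  # operational definition of one failure mode
) -> tuple[int, int]:
    """Exercise the whole pipeline (model, thresholds, rules, escalation) and count failures."""
    n, failures = 0, 0
    for case in test_cases:
        decision = system(case)   # no access to model internals required
        n += 1
        failures += int(is_failure(case, decision))
    return failures, n
```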
Third, produce confidence bounds, not point estimates. A conformity assessment should not hinge on "we observed zero failures in our test set." That statement is meaningless without sample size and confidence. With 50 test cases and zero failures, you have not shown the failure probability is below 0.1%. You have shown you did not observe failures in a small sample. Auditors and regulators need a bound: at a stated confidence level, the failure probability is below some number. That bound, tied to a specific operating domain and test protocol, is the core artifact.
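A sketch of turning test results into that bound, using the exact one-sided (Clopper-Pearson) upper bound via scipy; the numbers reproduce the 50-sample point above:

```python
from scipy.stats import beta

def exact_upper_bound(failures: int, n: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper bound on the failure probability
    from `failures` observed in `n` independent test cases."""
    if failures >= n:
        return 1.0
    return float(beta.ppf(confidence, failures + 1, n - failures))

# Zero failures in 50 cases does not support a 0.1% claim:
print(exact_upper_bound(0, 50))      # ~0.058, i.e. about 5.8%
print(exact_upper_bound(0, 3000))    # ~0.000998, just under 0.1%
```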
Thresholds are part of the system, not a tuning detail
Many high-risk systems are AI-assisted, meaning they output a score and a workflow consumes it. The threshold that triggers an automated action is where risk becomes real.
Quantitative acceptable risk pushes you to verify the whole decision rule: score distribution in the operating domain, threshold selection rationale, tradeoffs between false positives and false negatives by subgroup, and stability of those tradeoffs under drift. Teams often get caught here: they validate the model, but the deployed threshold changes later for business reasons. Under an engineering-grade approach, that threshold change must be governed, tested, and documented as part of the conformity evidence.
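A sketch of what verifying the deployed decision rule, rather than the model alone, can look like; the record format and the subgroup field are assumptions about how evaluation data is logged:

```python
def subgroup_error_rates(records, threshold):
    """Error rates of the deployed decision rule (score >= threshold triggers the action),
    broken out by subgroup. `records` is assumed to be an iterable of
    (score, is_positive, subgroup) tuples from domain-representative data."""
    by_group = {}
    for score, is_positive, group in records:
        stats = by_group.setdefault(group, {"fp": 0, "fn": 0, "pos": 0, "neg": 0})
        predicted_positive = score >= threshold
        if is_positive:
            stats["pos"] += 1
            stats["fn"] += int(not predicted_positive)
        else:
            stats["neg"] += 1
            stats["fp"] += int(predicted_positive)
    return {
        g: {
            "false_negative_rate": s["fn"] / s["pos"] if s["pos"] else None,
            "false_positive_rate": s["fp"] / s["neg"] if s["neg"] else None,
        }
        for g, s in by_group.items()
    }
```

Running this for the validated threshold and the currently deployed threshold side by side is a cheap way to surface ungoverned threshold drift.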
Why guarantees decay and what to do about it
Even if you produce strong statistical evidence at time zero, the real world does not stay still. EU AI Act compliance is not a one-time event. High-risk obligations include monitoring, logging, and corrective actions. A quantitative approach makes those obligations sharper by giving you a measurable claim that can be invalidated.
Non-stationary data breaks operating domain assumptions. Seasonality, product changes, demographic shifts, and adversarial adaptation all shift the input distribution away from what you tested. A probabilistic guarantee is only as good as the assumption that future inputs resemble the tested domain. That is not a reason to abandon quantification. It is a reason to pair it with domain shift detection and revalidation triggers.
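One common way to make domain shift detection concrete is a population stability index on key input features. A sketch for a single continuous feature; the ten-bin layout and the 0.2 alert level are conventional choices, not anything the Act prescribes:

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between the tested (reference) input distribution and live traffic for one
    continuous feature. Values above ~0.2 are commonly treated as a revalidation trigger."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf              # cover the full range of live data
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)           # avoid log(0) in empty bins
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```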
Model and system updates invalidate prior evidence. If you update the base model, the prompt, the retrieval corpus, the tool set, the threshold policy, or upstream preprocessing, you changed the system under assessment. Your old confidence bound is now evidence for a system that no longer exists. This is where EU AI Act quality management and change control become the enforcement mechanism that keeps quantitative verification meaningful.
Monitoring must be tied to quantified claims. If your claim is "harmful failure probability at or below 0.1% in operating domain D," your monitoring should detect when you leave domain D, when failure indicators rise, when new failure modes appear, and when incident rates exceed thresholds. Quantification turns monitoring into a control loop: detect drift, assess impact on the bound, decide whether to roll back, retrain, narrow scope, or add oversight.
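A sketch of that control loop as a single decision function; the thresholds and action strings are illustrative policy choices, not fixed rules:

```python
def monitoring_decision(psi_by_feature: dict, incident_upper_bound: float,
                        claim_threshold: float, psi_alert: float = 0.2) -> str:
    """Turn monitoring signals into an action against the quantified claim.
    `incident_upper_bound` is the current upper bound on the harmful failure rate
    estimated from production incidents; `claim_threshold` is the target from the claim."""
    drifted = [f for f, psi in psi_by_feature.items() if psi > psi_alert]
    if incident_upper_bound > claim_threshold:
        return "claim invalidated: roll back, narrow scope, or add oversight, then revalidate"
    if drifted:
        return f"operating domain drift on {drifted}: trigger revalidation"
    return "claim still supported: continue monitoring"
```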
Where probabilistic verification works and where it does not
Probabilistic verification is strongest when the system makes discrete decisions with clear labels and short time horizons. Credit scoring, eligibility determination, triage, fraud detection, recruitment screening, biometric verification under controlled conditions. In these contexts, a failure probability bound is meaningful, auditable, and supports comparability across providers.
The moment you move into systems that generate open-ended text, take tool actions, operate across multiple steps, or adapt plans over time, a single failure probability becomes harder to define. Agent trajectories are not independent and identically distributed. One bad tool call changes state and cascades into later failures. For these systems, you shift from global failure probability to a set of bounded claims: tool call policy compliance rate, rate of unauthorized action attempts, rate of PII leakage under a defined red-team suite, and time-to-detection metrics. You quantify what you can, and you wrap the rest in enforceable operational controls.
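For such systems the evaluation output looks less like one bound and more like a small set of rates. A sketch, assuming the evaluation harness logs per-trajectory event flags (the flag names are invented):

```python
def agent_bounded_claims(trajectories: list[dict]) -> dict:
    """Per-claim rates from agent trajectory logs instead of a single failure probability.
    Each trajectory is assumed to carry boolean flags set by the evaluation harness."""
    n = len(trajectories)
    return {
        "tool_call_policy_compliance_rate":
            sum(t["all_tool_calls_compliant"] for t in trajectories) / n,
        "unauthorized_action_attempt_rate":
            sum(t["attempted_unauthorized_action"] for t in trajectories) / n,
        "pii_leakage_rate_under_redteam_suite":
            sum(t["leaked_pii"] for t in trajectories) / n,
    }
```

Each of these rates can still carry its own confidence bound; the point is that no single number summarizes the system.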
Build this into your evidence pipeline
If your organization is working toward EU AI Act readiness, treat quantitative acceptable risk as a build problem, not a policy memo.
For each high-risk AI system, make three things explicit. What failure looks like, defined by failure mode, not aggregate metrics. Where the system is allowed to operate, defined as domain boundaries you can monitor. What evidence you can continuously produce, defined as tests, bounds, logs, and revalidation triggers.
Then connect those to your operational controls: change management that triggers re-evaluation when prompts, models, or thresholds change. Monitoring that detects when the operating domain shifts. Incident response that defines what counts as an unacceptable deviation based on your quantified targets, not just "a bad outcome."
The EU AI Act classification tool can tell you whether your system is high-risk. The question this post addresses is what happens next: turning "acceptable risk" from a narrative into a measurable, monitorable claim that survives a conformity assessment.
We built Aguardic to close the gap between regulatory language and engineering evidence. If you are building a conformity assessment package for a high-risk AI system, start by extracting enforceable requirements from your existing compliance documents and see which claims need statistical backing versus operational controls.

