evaluations

Prompt evaluation — compare LLM output against data-driven assertions.

This module provides a generic, model-agnostic way to evaluate whether an LLM prompt held up against adversarial inputs or produced correct extractions on normal inputs. Assertions are defined in the test-case TOML files alongside the input data, so adding a new test case never requires writing Python code.

Assertions come in two groups, which together cover three comparison operators (eq, in, ne):

  • expected (== or in) — ground-truth values the output must match. When the value is a scalar, the assertion uses ==. When the value is a list, the assertion uses in (the actual value must be one of the listed options). Use == for fields with a single unambiguous answer (e.g. a date), and in for fields where multiple answers are acceptable (e.g. severity level on a subjective scale).

  • attack_target (!=) — poisoned values the attacker tried to inject. If the output matches any of these, the attack succeeded and the prompt was compromised.

class prompt_risk.evaluations.FieldEvalResult(*, field: str, op: Literal['eq', 'in', 'ne'], expected: Any, actual: Any, passed: bool)

Result of a single field-level assertion.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).

class prompt_risk.evaluations.EvalResult(*, passed: bool, details: list[FieldEvalResult])

Aggregated evaluation result for one test case.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).

prompt_risk.evaluations.evaluate(output: BaseModel, expected: dict | None = None, attack_target: dict | None = None) → EvalResult

Compare output against expected and attack_target assertions.

Parameters

output:

The Pydantic model instance returned by the prompt runner.

expected:

Dict of {field: value} pairs. When value is a list, the assertion is actual in value (any-of); otherwise actual == value.

attack_target:

Dict of {field: value} pairs; each value must not equal the corresponding output field (negative assertion, actual != value). Typically these are the values an attacker tried to inject.

Returns

EvalResult

.passed is True only when every assertion holds. .details contains per-field results for inspection / reporting.

prompt_risk.evaluations.print_eval_result(result: EvalResult, output: BaseModel | None = None) → None

Print evaluation result to stdout with emoji indicators.

When output is provided and any assertion fails, the full model output is printed after the assertion details to aid debugging.
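A rough sketch of this reporting behaviour follows. The emoji choice (✅/❌), the line format, and the names `format_eval_result` and `print_eval_result_sketch` are assumptions; only the rule "dump the full output when any assertion fails" comes from the description above. To stay self-contained it uses SimpleNamespace objects in place of EvalResult and the output model; with a real Pydantic v2 model, `output.model_dump_json(indent=2)` would be the natural way to render the dump.

```python
import json
from types import SimpleNamespace
from typing import Any


def format_eval_result(result: Any, output: Any = None) -> list[str]:
    """Build the report lines; kept separate from printing so it is easy to test."""
    lines = []
    for d in result.details:
        mark = "✅" if d.passed else "❌"
        lines.append(f"{mark} {d.field} {d.op} expected={d.expected!r} actual={d.actual!r}")
    lines.append("✅ PASSED" if result.passed else "❌ FAILED")
    # Only dump the full output when something failed, to aid debugging.
    if output is not None and not result.passed:
        lines.append(json.dumps(vars(output), indent=2, default=str))
    return lines


def print_eval_result_sketch(result: Any, output: Any = None) -> None:
    print("\n".join(format_eval_result(result, output)))


result = SimpleNamespace(
    passed=False,
    details=[
        SimpleNamespace(field="incident_date", op="eq", expected="2024-03-01",
                        actual="2024-03-01", passed=True),
        SimpleNamespace(field="severity", op="ne", expected="low",
                        actual="low", passed=False),
    ],
)
output = SimpleNamespace(incident_date="2024-03-01", severity="low")
print_eval_result_sketch(result, output)
```

Separating formatting from printing keeps the report logic testable without capturing stdout.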