Judge Design — LLM-as-Judge Prompt Security Evaluation¶

Version	v0.1 Draft
Date	2026-04-23
Purpose	Document how Judges are designed, how they integrate with target prompts and test data, and how to verify judge quality.

1. What is a Judge¶

A Judge is an LLM prompt that evaluates another LLM prompt for security risks. Each judge focuses on one security topic — a single, well-defined category of prompt vulnerability.

Judge	Security Topic	Status
J1	Over-Permissive Authorization	Implemented
J2	(reserved)	Planned
J3	(reserved)	Planned

A judge is itself a versioned prompt, stored in the same structure as any other prompt in this project:

data/judges/prompts/j1-over-permissive/
  versions/
    01/
      system-prompt.jinja    # Judge's evaluation criteria and output schema
      user-prompt.jinja      # Template that injects the target prompt text
      metadata.toml

2. The Three Inputs of an Evaluation¶

Every evaluation run is defined by exactly three inputs:

1 Judge  +  1 Target Prompt  +  Optional User Data  →  Evaluation Result

Input	What it is	Example
Judge	Which security topic to evaluate, and which judge version	J1 v01
Target Prompt	The production prompt being evaluated, and which version	UC1-P1 v01
User Data (optional)	A concrete sample of runtime data loaded via a data loader	`a-01-injection-in-narrative`

Why user data is optional:

In practice, prompts are often evaluated as soon as they are written — before any real user data exists. The judge can assess the system prompt’s structure and guardrails on its own. When user data is available, the user prompt template is rendered with real data, giving the judge a more concrete picture of what the LLM will actually see in production.

Mode	What the judge sees	When to use
System prompt only	System prompt text; user prompt section says “No user prompt template provided”	Early-stage prompt review, before test data exists
With user data	System prompt text + rendered user prompt with real FNOL narrative	Full evaluation with concrete data exposure

3. Two-Layer Function Architecture¶

The evaluation code is split into two layers:

Binding Layer                        Framework Layer
run_j1_on_uc1_p1()         →        run_j1_over_permissive()
(prompt_risk.uc.uc1.j1_uc1_p1)      (prompt_risk.judges.j1_over_permissive)
        |                                    |
        | J1UserPromptData                   |
        | (raw strings)                      v
        +-----------------------------→  Bedrock Converse API
                                             |
                                             v
                                         J1Result

Layer	Function	Module	Responsibility
Framework	`run_j1_over_permissive()`	`prompt_risk.judges.j1_over_permissive`	Generic J1 evaluation logic. Accepts `J1UserPromptData` (raw strings), loads the judge prompt, calls the LLM, parses and validates the output into `J1Result`. Knows nothing about any specific use case.
Binding	`run_j1_on_uc1_p1()`	`prompt_risk.uc.uc1.j1_uc1_p1`	UC1-P1-specific wrapper. Loads the target prompt versions, optionally uses a data loader to render the user prompt with real FNOL data, assembles `J1UserPromptData`, and calls the framework layer.

Why two layers:

The framework layer is reusable — UC2, UC3, or any future use case can write its own binding layer and call the same run_j1_over_permissive().
The binding layer knows the specifics: which PromptIdEnum to use, what the data loader looks like, how to render the user prompt template. Each use case writes one binding function per judge.

4. Data Flow¶

A concrete example: running J1 judge (v01) on UC1-P1 prompt (v01) with test data a-01-injection-in-narrative.

Example Script
    |
    v
run_j1_on_uc1_p1(client, prompt_version="01", loader=a_01, judge_version="01")
    |
    |  1. Load UC1-P1 v01 target prompt
    |     → system-prompt.jinja, user-prompt.jinja
    |
    |  2. Load test data via P1LoaderEnum
    |     → P1ExtractionUserPromptData (source, narrative)
    |
    |  3. Render user-prompt.jinja with loader data
    |
    |  4. Assemble J1UserPromptData
    |     (target_system_prompt, target_user_prompt_template)
    |
    v
run_j1_over_permissive(client, data=J1UserPromptData, judge_version="01")
    |
    |  5. Load J1 judge v01
    |     → judge system-prompt.jinja, judge user-prompt.jinja
    |
    |  6. Render judge user-prompt.jinja with target prompt texts
    |
    |  7. Call Bedrock Converse API
    |     system = judge_system_prompt (cached)
    |     messages = [judge_user_prompt]
    |
    |  8. Parse & validate JSON response → J1Result
    |
    v
J1Result (overall_risk, score, findings[])

Step-by-step:

The binding layer loads the target prompt (UC1-P1 v01) — both system and user Jinja templates.
The binding layer uses the data loader to get FNOL test data, then renders the user prompt template with it. If no loader is provided, target_user_prompt_template is set to None.
The binding layer assembles J1UserPromptData (two plain strings) and passes it to the framework layer.
The framework layer loads the judge prompt (J1 v01) — its own system and user Jinja templates.
The framework layer renders the judge’s user prompt, injecting the target prompt texts into the template.
The framework layer calls the Bedrock Converse API with the judge’s system prompt (cached) and the rendered user message.
The LLM response is parsed and validated into a J1Result (overall risk, score, per-criterion findings).

5. Judge Quality Assurance¶

A judge is itself a prompt — and prompts can be unreliable. The following strategies verify that a judge produces trustworthy evaluations.

5.1 Known-Answer Testing¶

Prepare target prompts with known security postures and verify the judge scores them correctly:

Target Prompt	Expected Score	Purpose
Well-guarded prompt with explicit refusal, scope boundaries, anti-injection	4–5 (low/pass)	Verify judge does not over-flag
Fully over-permissive prompt (“always help, never refuse”)	1 (critical)	Verify judge catches obvious issues
Subtle prompt with mixed signals (helpful language + weak boundaries)	2–3 (high/medium)	Verify judge handles nuance

These are the UC1-P1 prompt versions (v01 through v04) — each designed to represent a different security posture.

5.2 Cross-Version Comparison¶

When iterating on a judge prompt (J1 v01 → v02), run both versions against the same set of target prompts and compare:

Do the scores agree on the clear-cut cases (very good / very bad)?
Does the new version improve on the ambiguous cases without regressing on the clear ones?

5.3 Cross-Model Comparison¶

Run the same judge prompt on different LLMs (e.g., Claude Sonnet vs. Nova Lite) and check:

Are the overall risk levels consistent?
Do per-criterion findings align, even if wording differs?

Significant divergence suggests the judge prompt is under-specified — the evaluation criteria need to be made more explicit so different models converge on the same conclusions.

6. Adding a New Judge¶

To add a new judge (e.g., J2 for Hardcoded Sensitive Data):

Step 1: Create the judge prompt¶

data/judges/prompts/j2-hardcoded-secrets/
  versions/01/
    system-prompt.jinja    # Evaluation criteria for this security topic
    user-prompt.jinja      # Template to inject target prompt text
    metadata.toml

Step 2: Register in constants¶

Add JUDGE_J2_HARDCODED_SECRETS to PromptIdEnum in prompt_risk/constants.py.

Step 3: Create the framework function¶

Create prompt_risk/judges/j2_hardcoded_secrets.py with:

J2UserPromptData — input model (always includes target_system_prompt, optional target_user_prompt_template)
J2Result — output model (structured findings specific to this security topic)
run_j2_hardcoded_secrets() — the framework-layer function

Step 4: Create binding functions¶

For each use case that should be evaluated, create a binding function:

prompt_risk/uc/uc1/j2_uc1_p1.py → run_j2_on_uc1_p1()
prompt_risk/uc/uc1/j2_uc1_p2.py → run_j2_on_uc1_p2()
etc.

Each binding function follows the same pattern: load target prompt, optionally render with data loader, call framework function.

Document maintained as part of the prompt_risk project — Last updated: 2026-04-23