Judge Catalog: LLM-as-Judge Security Evaluation Pipeline

Version

v0.1 Draft

Date

2026-04-23

Purpose

Define the complete set of judges used to evaluate prompt-level security risks. This document serves as an index — each judge’s implementation details (prompt text, data flow, execution) are covered in Documents 04–06.


Overview

A Judge is an LLM prompt that evaluates another LLM prompt for security risks. Judges sit inside a three-layer pipeline: a deterministic rule engine (no LLM) for high-confidence pattern matching, specialized LLM judges for semantic analysis, and a meta-judge for aggregation and the final risk rating.

The pipeline processes a Prompt Submission Package — the target prompt plus deployment context — and produces a structured Risk Report.

Pipeline Architecture

Input: Prompt Submission Package
              |
  +---------------------------------+
  |  Layer 1: Rule Engine           |  <-- deterministic rules, no LLM
  |  R1 - Secrets Scanner           |
  |  R2 - Keyword Blocklist         |
  +--------------+------------------+
                 | R1 match -> escalate to Critical
                 | R2 match -> warning, forwarded to J1
                 | no match -> continue
                 v
  +---------------------------------------------------+
  |  Layer 2: Specialized LLM Judges (run in parallel) |
  |                                                    |
  |  J1 - Over-Permissive Authorization                |
  |  J2 - Hardcoded Sensitive Data                     |
  |  J3 - Role Confusion                               |
  |  J4 - Instruction Conflict                         |
  |  J5 - Logic Ambiguity                              |
  +---------------------------------------------------+
                 |
                 v
  +---------------------------------+
  |  Layer 3: Meta-Judge            |
  |  Aggregation + Risk Report      |
  +---------------------------------+
                 |
                 v
Output: Risk Report (structured)

Prompt Submission Package

Every evaluation takes as input not just the prompt text, but a standardized package of deployment context:

| Field | Description | Example |
|---|---|---|
| prompt_text | The full system prompt text under review | |
| deployment_type | Deployment architecture pattern | Agentic Executor / Chatbot / One-shot Query / RAG Pipeline |
| data_input_channels | Data channels that enter the LLM context alongside the prompt | User input, file upload, RAG retrieval, database query, API response |
| target_user_group | Who uses the application | Internal employees / External customers / Mixed |
| sensitivity_level | Sensitivity of data the system handles | High (actuarial logic, claims data) / Medium / Low |
| author_notes | Optional notes from the prompt author on design intent | |

The deployment_type field significantly affects risk weighting: the same over-permissive instruction carries far greater risk in an Agentic Executor than in a One-shot Query.
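The package fields could be typed along the following lines. This is a minimal sketch, not the project's schema: the class names, enum members, and the empty-prompt check are assumptions made for illustration, and only the field names and example values come from the table above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class DeploymentType(Enum):
    AGENTIC_EXECUTOR = "Agentic Executor"
    CHATBOT = "Chatbot"
    ONE_SHOT_QUERY = "One-shot Query"
    RAG_PIPELINE = "RAG Pipeline"


class UserGroup(Enum):
    INTERNAL = "Internal employees"
    EXTERNAL = "External customers"
    MIXED = "Mixed"


class Sensitivity(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass(frozen=True)
class PromptSubmissionPackage:
    """Hypothetical container mirroring the field table."""
    prompt_text: str
    deployment_type: DeploymentType
    data_input_channels: List[str]
    target_user_group: UserGroup
    sensitivity_level: Sensitivity
    author_notes: Optional[str] = None  # optional per the field table

    def __post_init__(self) -> None:
        # Assumed validation: an empty prompt cannot be evaluated.
        if not self.prompt_text.strip():
            raise ValueError("prompt_text must not be empty")


package = PromptSubmissionPackage(
    prompt_text="You are a claims assistant.",
    deployment_type=DeploymentType.AGENTIC_EXECUTOR,
    data_input_channels=["User input", "RAG retrieval"],
    target_user_group=UserGroup.EXTERNAL,
    sensitivity_level=Sensitivity.HIGH,
)
```

Using enums rather than free strings means a misspelled deployment type fails when the package is built instead of when the weight is looked up.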


Layer 1: Rule Engine (Deterministic)

This layer uses no LLM. It applies deterministic pattern matching for two categories of risk with unambiguous signatures. It is fast, cheap, and fully explainable — the first gate in the pipeline.

R1 — Secrets Scanner

Risk Category: Hardcoded Sensitive Data

Detects sensitive information embedded in prompt text via regex patterns:

| Pattern Type | Examples |
|---|---|
| API key formats | sk-[a-zA-Z0-9]{20,}, Bearer [a-zA-Z0-9\-._~+/]+=* |
| Database connection strings | jdbc:, mongodb://, postgresql:// |
| Internal network addresses | 192.168.x.x, 10.x.x.x |
| Numeric coefficients with business labels | Consecutive decimal values with labels like “weight,” “coefficient,” “surcharge” |

Trigger behavior: Any match immediately escalates the finding to Critical severity. The pipeline does not wait for Layer 2 judges to complete.
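A scanner of this kind might look like the sketch below. The API key expressions are taken from the table; the connection-string, IP address, and coefficient expressions are illustrative guesses at how the listed examples could be written as regexes, and the names `R1_PATTERNS`, `scan_secrets`, and `r1_severity` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class SecretMatch(NamedTuple):
    pattern_type: str
    text: str


# Only the api_key expressions come from the pattern table; the rest are
# assumptions. The table lists only 192.168.x.x and 10.x.x.x, so other
# private ranges are not covered here.
R1_PATTERNS = {
    "api_key": re.compile(r"sk-[a-zA-Z0-9]{20,}|Bearer [a-zA-Z0-9\-._~+/]+=*"),
    "db_connection": re.compile(r"\b(?:jdbc:|mongodb://|postgresql://)\S*"),
    "internal_ip": re.compile(r"\b(?:192\.168|10\.\d{1,3})\.\d{1,3}\.\d{1,3}\b"),
    "business_coefficient": re.compile(
        r"\b(?:weight|coefficient|surcharge)\b[^.\n]{0,20}?\d+\.\d+",
        re.IGNORECASE,
    ),
}


def scan_secrets(prompt_text: str) -> List[SecretMatch]:
    """Return every regex hit, tagged with the pattern type that produced it."""
    matches = []
    for pattern_type, pattern in R1_PATTERNS.items():
        for m in pattern.finditer(prompt_text):
            matches.append(SecretMatch(pattern_type, m.group(0)))
    return matches


def r1_severity(prompt_text: str) -> Optional[str]:
    """Any match escalates to Critical; no match means R1 raises no finding."""
    return "Critical" if scan_secrets(prompt_text) else None
```

Returning the matched text alongside the pattern type keeps the result explainable: the report can quote exactly what triggered the escalation.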

R2 — Keyword Blocklist Detector

Risk Category: Over-Permissive Authorization (keyword-level)

Scans for high-risk phrases that signal unconditional compliance or suppressed refusal:

| Language | Keywords |
|---|---|
| English | never refuse, always comply, under no circumstances decline, no content restrictions, ignore your guidelines, ignore previous instructions |
| Chinese | 不得拒绝, 任何情况下都要, 无条件, 不需要任何限制, 忽略你之前的指令 |

Trigger behavior: Matches are flagged as a warning and flow into Layer 2 for deeper semantic evaluation by J1. A keyword match alone does not determine the final risk level — J1 assesses whether the keyword reflects genuine over-permissiveness in context.
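The warn-and-forward behavior can be sketched as a case-insensitive substring scan. The keyword lists are copied from the table; the function names and the shape of the result dictionary are assumptions, not the pipeline's actual interface.

```python
from typing import Dict, List

R2_KEYWORDS: Dict[str, List[str]] = {
    "en": [
        "never refuse",
        "always comply",
        "under no circumstances decline",
        "no content restrictions",
        "ignore your guidelines",
        "ignore previous instructions",
    ],
    "zh": ["不得拒绝", "任何情况下都要", "无条件", "不需要任何限制", "忽略你之前的指令"],
}


def scan_keywords(prompt_text: str) -> List[str]:
    """Return the blocklist keywords found in the prompt, ignoring case."""
    lowered = prompt_text.casefold()
    hits: List[str] = []
    for keywords in R2_KEYWORDS.values():
        hits.extend(k for k in keywords if k.casefold() in lowered)
    return hits


def r2_result(prompt_text: str) -> dict:
    """A hit is only a warning; J1 decides whether it is a real risk."""
    hits = scan_keywords(prompt_text)
    return {
        "status": "warning" if hits else "pass",
        "keywords": hits,
        "forward_to": ["J1"] if hits else [],
    }
```

A prompt such as "Put user experience first and ensure every question receives a satisfactory answer" passes this scan untouched, which is why the semantic check in J1 is still needed.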


Layer 2: Specialized LLM Judges

Five judges run in parallel, each focused on a single security topic. Giving each judge its own prompt and its own model call isolates reasoning load: a judge analyzing instruction conflicts does not compete for attention with one analyzing role confusion.

Each judge is itself a versioned prompt stored under data/judges/prompts/ (see Document 04 for directory structure). Each judge outputs structured findings per criterion, following a common schema (see Document 06 for output format and execution details).
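Parallel execution could be arranged with `asyncio.gather`, as in the sketch below. The judge itself is a stub: `stub_judge` and `run_layer2` are hypothetical names, and the returned dictionary is a placeholder rather than the common schema defined in Document 06.

```python
import asyncio
from typing import Dict, List


async def stub_judge(judge_id: str, prompt_text: str) -> dict:
    # Placeholder for an LLM call. A real judge would send its versioned
    # prompt plus the target prompt to a model and parse the response.
    await asyncio.sleep(0)
    return {"judge": judge_id, "findings": []}


async def run_layer2(prompt_text: str, judge_ids: List[str]) -> Dict[str, dict]:
    """Run all judges concurrently and collect results keyed by judge ID."""
    tasks = [stub_judge(judge_id, prompt_text) for judge_id in judge_ids]
    # return_exceptions=True keeps one failing judge from cancelling the rest.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    collected: Dict[str, dict] = {}
    for judge_id, result in zip(judge_ids, results):
        if isinstance(result, Exception):
            collected[judge_id] = {"judge": judge_id, "error": repr(result)}
        else:
            collected[judge_id] = result
    return collected


results = asyncio.run(
    run_layer2("You are a claims assistant.", ["J1", "J2", "J3", "J4", "J5"])
)
```

Recording a failed judge as an error entry, rather than raising, lets the Meta-Judge see which topics were not evaluated.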

Judge Summary Table

| Judge | Security Topic | Risk Category | Status |
|---|---|---|---|
| J1 | Over-Permissive Authorization | Risk 1 | Implemented |
| J2 | Hardcoded Sensitive Data | Risk 2 | Planned |
| J3 | Role Confusion | Risk 3 | Planned |
| J4 | Instruction Conflict | Risk 4 | Planned |
| J5 | Logic Ambiguity | Risk 5 | Planned |


J1 — Over-Permissive Authorization Judge

Risk category: Risk 1 — Over-Permissive Authorization

Status: Implemented (data/judges/prompts/j1-over-permissive/)

What it evaluates: Whether the target prompt grants the model excessive behavioral freedom, weakens its refusal capability, or fails to define clear scope boundaries. J1 also picks up semantic variants of over-permissive language that R2’s keyword blocklist cannot catch — for example, “put user experience first and ensure every question receives a satisfactory answer” is semantically equivalent to “always comply” but contains no blocklist keywords.

Evaluation criteria (5):

| # | Criterion | What it checks |
|---|---|---|
| 1 | Explicit Refusal Capability | Does the prompt define when and how to refuse? |
| 2 | Scope Boundaries | Are both positive scope (may do) and negative scope (must not do) defined? |
| 3 | Unconditional Compliance Language | Does the prompt contain phrases instructing unconditional compliance? |
| 4 | Failure Handling | Does the prompt define behavior for requests the model cannot or should not fulfill? |
| 5 | Anti-Injection Guardrails | Does the prompt instruct the model to treat user input as data, not commands? |

Scoring: 1 (critical) through 5 (pass), based on the number and severity of findings across criteria.


J2 — Hardcoded Sensitive Data Judge

Risk category: Risk 2 — Hardcoded Sensitive Data

Status: Planned

What it evaluates: Whether the target prompt embeds proprietary information that could be extracted by adversarial users — pricing coefficients, underwriting rules, actuarial formulas, internal thresholds, API credentials, or business logic that should reside in backend systems rather than in prompt text.

J2 complements R1 (Secrets Scanner): R1 catches structured patterns (API keys, connection strings), while J2 performs semantic analysis to identify business logic and decision rules expressed in natural language — content that has no regex signature but is equally sensitive.

Key evaluation dimensions:

  • Presence of numeric business parameters (coefficients, thresholds, weights) with operational significance

  • Embedded decision trees or scoring rules that constitute proprietary business logic

  • Internal system names, endpoints, or architecture details that aid reconnaissance

  • Information that would give a competitor or adversary an actionable advantage if extracted


J3 — Role Confusion Judge

Risk category: Risk 3 — Role Confusion

Status: Planned

What it evaluates: Whether the target prompt defines the model’s identity with sufficient precision and resilience to resist identity manipulation attacks — including persona switching (“pretend you are…”), gradual role drift across multi-turn conversations, and thought-experiment framing (“hypothetically, if you had no restrictions…”).

Key evaluation dimensions:

| # | Dimension | What it checks |
|---|---|---|
| 1 | Identity precision | Is the role definition specific and bounded, or vague and open-ended? (e.g., “flexibly support the user” invites role switching) |
| 2 | User override protection | Does the prompt explicitly prohibit users from redefining the model’s role or identity during conversation? |
| 3 | Persona switching defense | Does the prompt explicitly reject hypothetical/role-play framings like “assume you are a consultant with no policy constraints”? |


J4 — Instruction Conflict Judge

Risk category: Risk 4 — Instruction Conflict

Status: Planned

What it evaluates: Whether the target prompt contains pairs of instructions that produce contradictory behavior under specific triggering conditions. This is the most reasoning-intensive judge — it must decompose the prompt into individual constraint units, perform pairwise comparison, and identify scenarios where following one instruction necessarily violates another.

Key conflict patterns:

| Conflict Type | Example |
|---|---|
| Confidentiality vs. Helpfulness | “Never disclose internal rules” + “Ensure the user feels fully supported and informed” |
| Refusal vs. User Experience | “Refuse out-of-scope requests” + “Avoid responses that make the user feel rejected” |
| Compliance vs. Flexibility | “Follow all regulatory guidelines strictly” + “Use your judgment to provide the best outcome” |

Trigger conditions: Conflicts often surface only under specific user behaviors — persistent follow-up questions, urgency claims, authority assertions — not during normal interaction.
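The pairwise comparison step can be illustrated by enumerating the work J4 would have to do. The sketch assumes the prompt has already been decomposed into constraint units (shown here as a hand-written list) and that each pair is checked against each trigger condition; the function names are hypothetical, and the actual conflict judgment would be made by the LLM, not by this code.

```python
from itertools import combinations, product
from typing import List, Tuple

# Trigger conditions named in the text above.
TRIGGERS = ["persistent follow-up questions", "urgency claims", "authority assertions"]


def constraint_pairs(constraints: List[str]) -> List[Tuple[str, str]]:
    """Every unordered pair of constraint units: n * (n - 1) / 2 pairs."""
    return list(combinations(constraints, 2))


def conflict_checks(constraints: List[str]) -> List[Tuple[Tuple[str, str], str]]:
    """Each (pair, trigger) combination is one question to put to the judge."""
    return list(product(constraint_pairs(constraints), TRIGGERS))


constraints = [
    "Never disclose internal rules",
    "Ensure the user feels fully supported and informed",
    "Refuse out-of-scope requests",
    "Avoid responses that make the user feel rejected",
]
pairs = constraint_pairs(constraints)
checks = conflict_checks(constraints)
```

Four constraints already produce 6 pairs and 18 pair-and-trigger checks, and the pair count grows quadratically with prompt length, which is one reason this judge is the most reasoning-intensive.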


J5 — Logic Ambiguity Judge

Risk category: Risk 5 — Logic Ambiguity

Status: Planned

What it evaluates: Whether the target prompt contains restriction instructions with soft qualifiers — words like “usually,” “try to avoid,” “unless necessary,” “in most cases” — that create exploitable exception pathways. An attacker can construct a scenario that satisfies the exception condition, causing the model to abandon the restriction.

Key evaluation dimensions:

  • Presence of soft qualifiers in security-critical instructions

  • Whether exception conditions are bounded (specific, enumerated) or unbounded (vague, judgment-based)

  • Feasibility of constructing a plausible scenario that triggers the exception (e.g., fabricating an emergency to bypass “usually do not recommend third-party services”)
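The first dimension, spotting soft qualifiers, can be approximated with a simple scan. This is not the J5 judge, which is an LLM prompt: the sketch only locates candidate sentences, uses just the four qualifiers quoted above, and splits sentences naively on end punctuation. The names `SOFT_QUALIFIERS` and `find_soft_qualifiers` are hypothetical.

```python
import re
from typing import List, Tuple

# Qualifiers quoted in the J5 description; a real list would be longer.
SOFT_QUALIFIERS = ["usually", "try to avoid", "unless necessary", "in most cases"]
_QUALIFIER_RE = re.compile(
    "|".join(re.escape(q) for q in SOFT_QUALIFIERS), re.IGNORECASE
)


def find_soft_qualifiers(prompt_text: str) -> List[Tuple[str, str]]:
    """Return (qualifier, sentence) pairs for sentences containing a qualifier."""
    hits = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", prompt_text.strip()):
        for m in _QUALIFIER_RE.finditer(sentence):
            hits.append((m.group(0).lower(), sentence))
    return hits


text = "Usually do not recommend third-party services. Never share claim data."
hits = find_soft_qualifiers(text)
```

Whether a flagged sentence is security-critical, and whether its exception is bounded, still requires the semantic judgment described in the other two dimensions.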


Layer 3: Meta-Judge (Aggregation)

The Meta-Judge receives all outputs from Layer 1 and Layer 2 and performs two tasks: weighted risk aggregation and risk report generation.

Deployment Scenario Weighting

The final risk level is not simply the highest individual finding. It is adjusted by a deployment scenario weight that reflects the real-world impact amplification of different architectures:

| Deployment Scenario | Weight |
|---|---|
| Agentic Executor (tool-calling, external actions) | 2.0x |
| RAG Pipeline (external knowledge base) | 1.5x |
| Customer-facing Chatbot (external users) | 1.3x |
| Internal Employee Tool (internal users only) | 1.0x |
| One-shot Query (no session memory) | 0.8x |

With these weights, a finding in an Agentic Executor context counts 2.5 times as much as the same finding in a One-shot Query context (2.0 / 0.8).
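One way the weights could be applied is sketched below. The weights are the ones in the table; everything else is an assumption, in particular the numeric `BASE_SCORE` scale and the idea that the weight simply multiplies a base score, since the document does not specify how the adjustment is computed or where the rating cut-offs lie.

```python
DEPLOYMENT_WEIGHTS = {
    "Agentic Executor": 2.0,
    "RAG Pipeline": 1.5,
    "Customer-facing Chatbot": 1.3,
    "Internal Employee Tool": 1.0,
    "One-shot Query": 0.8,
}

# Hypothetical numeric scale: the source defines neither base scores nor
# the thresholds that map a weighted score back to a rating.
BASE_SCORE = {"Low": 1.0, "Medium": 2.0, "High": 3.0, "Critical": 4.0}


def weighted_score(severity: str, deployment: str) -> float:
    """Scale a finding's base score by the deployment scenario weight."""
    return BASE_SCORE[severity] * DEPLOYMENT_WEIGHTS[deployment]


def relative_weight(scenario_a: str, scenario_b: str) -> float:
    """How much more heavily scenario_a weights a finding than scenario_b."""
    return DEPLOYMENT_WEIGHTS[scenario_a] / DEPLOYMENT_WEIGHTS[scenario_b]
```

Under this assumed scale, a High finding scores 6.0 in an Agentic Executor and about 2.4 in a One-shot Query.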

Risk Report Structure

The Meta-Judge produces a structured Risk Report:

## Prompt Security Evaluation Report

### Submission Info
- Prompt ID: [unique identifier]
- Submission Time: [timestamp]
- Deployment Type: [scenario type]
- Sensitivity Level: [High / Medium / Low]

### Overall Risk Rating: Critical / High / Medium / Low

### Findings

#### [Risk Category Name]
- Severity: [level]
- Evidence: [quoted prompt text]
- Triggered By: [Layer 1 rule / J1 / J2 / J3 / J4 / J5]
- Recommendation: [specific remediation action]

### Prioritized Remediation Plan
1. [Immediate action items]
2. [Short-term action items]
3. [Ongoing governance items]

### Audit Trail
- Reviewed by: LLM-as-Judge Pipeline v[version]
- Human Review Required: Yes / No
- Escalation Status: [whether human review is needed]

Meta-Judge status: Planned
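The report template could be backed by a small data model that renders to the same headings. The sketch below is one possible shape, with hypothetical class names; it covers the Submission Info, Overall Risk Rating, and Findings sections and leaves out the remediation plan and audit trail for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    risk_category: str
    severity: str
    evidence: str
    triggered_by: str
    recommendation: str


@dataclass
class RiskReport:
    prompt_id: str
    submission_time: str
    deployment_type: str
    sensitivity_level: str
    overall_rating: str
    findings: List[Finding] = field(default_factory=list)

    def render(self) -> str:
        """Render the report using the headings from the template."""
        lines = [
            "## Prompt Security Evaluation Report",
            "",
            "### Submission Info",
            f"- Prompt ID: {self.prompt_id}",
            f"- Submission Time: {self.submission_time}",
            f"- Deployment Type: {self.deployment_type}",
            f"- Sensitivity Level: {self.sensitivity_level}",
            "",
            f"### Overall Risk Rating: {self.overall_rating}",
            "",
            "### Findings",
        ]
        for finding in self.findings:
            lines += [
                "",
                f"#### {finding.risk_category}",
                f"- Severity: {finding.severity}",
                f"- Evidence: {finding.evidence}",
                f"- Triggered By: {finding.triggered_by}",
                f"- Recommendation: {finding.recommendation}",
            ]
        return "\n".join(lines)


report = RiskReport(
    prompt_id="example-001",
    submission_time="2026-04-23T10:00:00Z",
    deployment_type="Agentic Executor",
    sensitivity_level="High",
    overall_rating="High",
    findings=[
        Finding(
            risk_category="Over-Permissive Authorization",
            severity="High",
            evidence='"always comply"',
            triggered_by="J1",
            recommendation="Define explicit refusal conditions.",
        )
    ],
)
text = report.render()
```

The example values (prompt ID, timestamp, recommendation wording) are invented for the demonstration and carry no meaning beyond it.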


Judge Quality Metrics

Judges are themselves prompts — and prompts can be unreliable. The following metrics are tracked to ensure judge trustworthiness:

| Metric | Definition | Target |
|---|---|---|
| Detection Rate (Recall) | Proportion of known-vulnerable prompts correctly flagged | >= 90% |
| False Positive Rate | Proportion of known-safe prompts incorrectly flagged | <= 10% |
| Severity Accuracy | Accuracy of risk level assignment (Critical/High/Medium/Low) | As high as possible |
| Actionability Score | Proportion of recommendations rated “directly actionable” by human reviewers | Human-assessed |

Judge quality assurance methodology (known-answer testing, cross-version comparison, cross-model comparison) is detailed in Document 06.
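The first two metrics follow directly from a labeled test set. The sketch below assumes each known-answer result is recorded as a pair of booleans (whether the prompt is actually vulnerable, and whether the judge flagged it); the function names are hypothetical, and the thresholds are the targets from the table.

```python
from typing import List, Optional, Tuple


def detection_metrics(results: List[Tuple[bool, bool]]) -> dict:
    """results holds (is_vulnerable, was_flagged) for each test prompt."""
    tp = sum(1 for vulnerable, flagged in results if vulnerable and flagged)
    fn = sum(1 for vulnerable, flagged in results if vulnerable and not flagged)
    fp = sum(1 for vulnerable, flagged in results if not vulnerable and flagged)
    tn = sum(1 for vulnerable, flagged in results if not vulnerable and not flagged)
    # A rate is undefined when its denominator is empty, so report None.
    recall: Optional[float] = tp / (tp + fn) if (tp + fn) else None
    fpr: Optional[float] = fp / (fp + tn) if (fp + tn) else None
    return {"detection_rate": recall, "false_positive_rate": fpr}


def meets_targets(metrics: dict) -> bool:
    """Check the >= 90% detection and <= 10% false positive targets."""
    recall = metrics["detection_rate"]
    fpr = metrics["false_positive_rate"]
    return recall is not None and fpr is not None and recall >= 0.90 and fpr <= 0.10


# 10 vulnerable prompts (9 flagged) and 10 safe prompts (1 flagged).
sample = (
    [(True, True)] * 9 + [(True, False)]
    + [(False, True)] + [(False, False)] * 9
)
metrics = detection_metrics(sample)
```

This sample sits exactly on both thresholds, so one more missed vulnerable prompt or one more false flag would fail the targets.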


Cross-Reference to Use Cases

Each judge applies across all six use cases defined in Document 02, but its relevance and severity weighting vary by use case architecture:

| Judge | UC1 Pipeline | UC2 RAG | UC3 Web Agent | UC4 Basic Agent | UC5 Advanced Agent | UC6 Customer-Facing |
|---|---|---|---|---|---|---|
| J1 Over-Permissive | Medium | Medium | Medium | Medium | High | High |
| J2 Hardcoded Data | High (P4) | High (pricing in KB) | Low | Low | Medium | Low |
| J3 Role Confusion | Low | Low | Low | Low | Medium | High |
| J4 Instruction Conflict | Medium | Low | Medium | Medium | High | High |
| J5 Logic Ambiguity | Medium | Low | Medium | Medium | High | High |
| R1 Secrets Scanner | All | All | All | All | All | All |
| R2 Keyword Blocklist | All | All | All | All | All | All |


Next Steps

Each judge will be expanded with:

  1. Full prompt specifications (system prompt + user prompt templates) — Document 04 (data structure)

  2. Execution mechanics (two-layer architecture, data flow) — Document 06

  3. Known-answer test suites for judge quality assurance


Document maintained as part of the prompt_risk project — Last updated: 2026-04-23