Judge Catalog: LLM-as-Judge Security Evaluation Pipeline¶

Version	v0.1 Draft
Date	2026-04-23
Purpose	Define the complete set of judges used to evaluate prompt-level security risks. This document serves as an index — each judge’s implementation details (prompt text, data flow, execution) are covered in Documents 04–06.

Overview¶

A Judge is an LLM prompt that evaluates another LLM prompt for security risks. Judges are organized into a three-layer pipeline: a deterministic rule engine for high-confidence pattern matching, specialized LLM judges for semantic analysis, and a meta-judge for aggregation and final risk rating.

The pipeline processes a Prompt Submission Package — the target prompt plus deployment context — and produces a structured Risk Report.

Pipeline Architecture¶

Input: Prompt Submission Package
              |
  +---------------------------------+
  |  Layer 1: Rule Engine           |  <-- deterministic rules, no LLM
  |  R1 - Secrets Scanner           |
  |  R2 - Keyword Blocklist         |
  +--------------+------------------+
                 | flag -> escalate to Critical
                 | pass -> continue
                 v
  +---------------------------------------------------+
  |  Layer 2: Specialized LLM Judges (run in parallel) |
  |                                                    |
  |  J1 - Over-Permissive Authorization                |
  |  J2 - Hardcoded Sensitive Data                     |
  |  J3 - Role Confusion                               |
  |  J4 - Instruction Conflict                         |
  |  J5 - Logic Ambiguity                              |
  +---------------------------------------------------+
                 |
                 v
  +---------------------------------+
  |  Layer 3: Meta-Judge            |
  |  Aggregation + Risk Report      |
  +---------------------------------+
                 |
                 v
Output: Risk Report (structured)

Prompt Submission Package¶

Every evaluation takes as input not just the prompt text, but a standardized package of deployment context:

Field	Description	Example
`prompt_text`	The full system prompt text under review	—
`deployment_type`	Deployment architecture pattern	`Agentic Executor` / `Chatbot` / `One-shot Query` / `RAG Pipeline`
`data_input_channels`	Data channels that enter the LLM context alongside the prompt	User input, file upload, RAG retrieval, database query, API response
`target_user_group`	Who uses the application	Internal employees / External customers / Mixed
`sensitivity_level`	Sensitivity of data the system handles	`High` (actuarial logic, claims data) / `Medium` / `Low`
`author_notes`	Optional notes from the prompt author on design intent	—

The deployment_type field significantly affects risk weighting: the same over-permissive instruction carries far greater risk in an Agentic Executor than in a One-shot Query.

Layer 1: Rule Engine (Deterministic)¶

This layer uses no LLM. It applies deterministic pattern matching for two categories of risk with unambiguous signatures. It is fast, cheap, and fully explainable — the first gate in the pipeline.

R1 — Secrets Scanner¶

Risk Category: Hardcoded Sensitive Data

Detects sensitive information embedded in prompt text via regex patterns:

Pattern Type	Examples
API key formats	`sk-[a-zA-Z0-9]{20,}`, `Bearer [a-zA-Z0-9\-._~+/]+=*`
Database connection strings	`jdbc:`, `mongodb://`, `postgresql://`
Internal network addresses	`192.168.x.x`, `10.x.x.x`
Numeric coefficients with business labels	Consecutive decimal values with labels like “weight,” “coefficient,” “surcharge”

Trigger behavior: Any match immediately escalates the finding to Critical severity. The pipeline does not wait for Layer 2 judges to complete.

R2 — Keyword Blocklist Detector¶

Risk Category: Over-Permissive Authorization (keyword-level)

Scans for high-risk phrases that signal unconditional compliance or suppressed refusal:

Language	Keywords
English	`never refuse`, `always comply`, `under no circumstances decline`, `no content restrictions`, `ignore your guidelines`, `ignore previous instructions`
Chinese	`不得拒绝`, `任何情况下都要`, `无条件`, `不需要任何限制`, `忽略你之前的指令`

Trigger behavior: Matches are flagged as a warning and flow into Layer 2 for deeper semantic evaluation by J1. A keyword match alone does not determine the final risk level — J1 assesses whether the keyword reflects genuine over-permissiveness in context.

Layer 2: Specialized LLM Judges¶

Five judges run in parallel, each focused on a single security topic. Parallel execution isolates reasoning load — a judge analyzing instruction conflicts does not compete for attention with one analyzing role confusion.

Each judge is itself a versioned prompt stored under data/judges/prompts/ (see Document 04 for directory structure). Each judge outputs structured findings per criterion, following a common schema (see Document 06 for output format and execution details).

Judge Summary Table¶

Judge	Security Topic	Risk Category	Status
J1	Over-Permissive Authorization	Risk 1	Implemented
J2	Hardcoded Sensitive Data	Risk 2	Planned
J3	Role Confusion	Risk 3	Planned
J4	Instruction Conflict	Risk 4	Planned
J5	Logic Ambiguity	Risk 5	Planned

J1 — Over-Permissive Authorization Judge¶

Risk category: Risk 1 — Over-Permissive Authorization

Status: Implemented (data/judges/prompts/j1-over-permissive/)

What it evaluates: Whether the target prompt grants the model excessive behavioral freedom, weakens its refusal capability, or fails to define clear scope boundaries. J1 also picks up semantic variants of over-permissive language that R2’s keyword blocklist cannot catch — for example, “put user experience first and ensure every question receives a satisfactory answer” is semantically equivalent to “always comply” but contains no blocklist keywords.

Evaluation criteria (5):

#	Criterion	What it checks
1	Explicit Refusal Capability	Does the prompt define when and how to refuse?
2	Scope Boundaries	Are both positive scope (may do) and negative scope (must not do) defined?
3	Unconditional Compliance Language	Does the prompt contain phrases instructing unconditional compliance?
4	Failure Handling	Does the prompt define behavior for requests the model cannot or should not fulfill?
5	Anti-Injection Guardrails	Does the prompt instruct the model to treat user input as data, not commands?

Scoring: 1 (critical) through 5 (pass), based on the number and severity of findings across criteria.

J2 — Hardcoded Sensitive Data Judge¶

Risk category: Risk 2 — Hardcoded Sensitive Data

Status: Planned

What it evaluates: Whether the target prompt embeds proprietary information that could be extracted by adversarial users — pricing coefficients, underwriting rules, actuarial formulas, internal thresholds, API credentials, or business logic that should reside in backend systems rather than in prompt text.

J2 complements R1 (Secrets Scanner): R1 catches structured patterns (API keys, connection strings), while J2 performs semantic analysis to identify business logic and decision rules expressed in natural language — content that has no regex signature but is equally sensitive.

Key evaluation dimensions:

Presence of numeric business parameters (coefficients, thresholds, weights) with operational significance
Embedded decision trees or scoring rules that constitute proprietary business logic
Internal system names, endpoints, or architecture details that aid reconnaissance
Information that would give a competitor or adversary an actionable advantage if extracted

J3 — Role Confusion Judge¶

Risk category: Risk 3 — Role Confusion

Status: Planned

What it evaluates: Whether the target prompt defines the model’s identity with sufficient precision and resilience to resist identity manipulation attacks — including persona switching (“pretend you are…”), gradual role drift across multi-turn conversations, and thought-experiment framing (“hypothetically, if you had no restrictions…”).

Key evaluation dimensions:

#	Dimension	What it checks
1	Identity precision	Is the role definition specific and bounded, or vague and open-ended? (e.g., “flexibly support the user” invites role switching)
2	User override protection	Does the prompt explicitly prohibit users from redefining the model’s role or identity during conversation?
3	Persona switching defense	Does the prompt explicitly reject hypothetical/role-play framings like “assume you are a consultant with no policy constraints”?

J4 — Instruction Conflict Judge¶

Risk category: Risk 4 — Instruction Conflict

Status: Planned

What it evaluates: Whether the target prompt contains pairs of instructions that produce contradictory behavior under specific triggering conditions. This is the most reasoning-intensive judge — it must decompose the prompt into individual constraint units, perform pairwise comparison, and identify scenarios where following one instruction necessarily violates another.

Key conflict patterns:

Conflict Type	Example
Confidentiality vs. Helpfulness	“Never disclose internal rules” + “Ensure the user feels fully supported and informed”
Refusal vs. User Experience	“Refuse out-of-scope requests” + “Avoid responses that make the user feel rejected”
Compliance vs. Flexibility	“Follow all regulatory guidelines strictly” + “Use your judgment to provide the best outcome”

Trigger conditions: Conflicts often surface only under specific user behaviors — persistent follow-up questions, urgency claims, authority assertions — not during normal interaction.

J5 — Logic Ambiguity Judge¶

Risk category: Risk 5 — Logic Ambiguity

Status: Planned

What it evaluates: Whether the target prompt contains restriction instructions with soft qualifiers — words like “usually,” “try to avoid,” “unless necessary,” “in most cases” — that create exploitable exception pathways. An attacker can construct a scenario that satisfies the exception condition, causing the model to abandon the restriction.

Key evaluation dimensions:

Presence of soft qualifiers in security-critical instructions
Whether exception conditions are bounded (specific, enumerated) or unbounded (vague, judgment-based)
Feasibility of constructing a plausible scenario that triggers the exception (e.g., fabricating an emergency to bypass “usually do not recommend third-party services”)

Layer 3: Meta-Judge (Aggregation)¶

The Meta-Judge receives all outputs from Layer 1 and Layer 2 and performs two tasks: weighted risk aggregation and risk report generation.

Deployment Scenario Weighting¶

The final risk level is not simply the highest individual finding. It is adjusted by a deployment scenario weight that reflects the real-world impact amplification of different architectures:

Deployment Scenario	Weight
Agentic Executor (tool-calling, external actions)	2.0x
RAG Pipeline (external knowledge base)	1.5x
Customer-facing Chatbot (external users)	1.3x
Internal Employee Tool (internal users only)	1.0x
One-shot Query (no session memory)	0.8x

The same risk finding in an Agentic Executor context may be several times more dangerous than in a One-shot Query context.

Risk Report Structure¶

The Meta-Judge produces a structured Risk Report:

## Prompt Security Evaluation Report

### Submission Info
- Prompt ID: [unique identifier]
- Submission Time: [timestamp]
- Deployment Type: [scenario type]
- Sensitivity Level: [High / Medium / Low]

### Overall Risk Rating: Critical / High / Medium / Low

### Findings

#### [Risk Category Name]
- Severity: [level]
- Evidence: [quoted prompt text]
- Triggered By: [Layer 1 rule / J1 / J2 / J3 / J4 / J5]
- Recommendation: [specific remediation action]

### Prioritized Remediation Plan
1. [Immediate action items]
2. [Short-term action items]
3. [Ongoing governance items]

### Audit Trail
- Reviewed by: LLM-as-Judge Pipeline v[version]
- Human Review Required: Yes / No
- Escalation Status: [whether human review is needed]

Status: Planned

Judge Quality Metrics¶

Judges are themselves prompts — and prompts can be unreliable. The following metrics are tracked to ensure judge trustworthiness:

Metric	Definition	Target
Detection Rate (Recall)	Proportion of known-vulnerable prompts correctly flagged	>= 90%
False Positive Rate	Proportion of known-safe prompts incorrectly flagged	<= 10%
Severity Accuracy	Accuracy of risk level assignment (Critical/High/Medium/Low)	As high as possible
Actionability Score	Proportion of recommendations rated “directly actionable” by human reviewers	Human-assessed

Judge quality assurance methodology (known-answer testing, cross-version comparison, cross-model comparison) is detailed in Document 06.

Cross-Reference to Use Cases¶

Each judge applies across all six use cases defined in Document 02, but the relevance and severity weighting varies by use case architecture:

Judge	UC1 Pipeline	UC2 RAG	UC3 Web Agent	UC4 Basic Agent	UC5 Advanced Agent	UC6 Customer-Facing
J1 Over-Permissive	Medium	Medium	Medium	Medium	High	High
J2 Hardcoded Data	High (P4)	High (pricing in KB)	Low	Low	Medium	Low
J3 Role Confusion	Low	Low	Low	Low	Medium	High
J4 Instruction Conflict	Medium	Low	Medium	Medium	High	High
J5 Logic Ambiguity	Medium	Low	Medium	Medium	High	High
R1 Secrets Scanner	All	All	All	All	All	All
R2 Keyword Blocklist	All	All	All	All	All	All

Next Steps¶

Each judge will be expanded with:

Full prompt specifications (system prompt + user prompt templates) — Document 04 (data structure)
Execution mechanics (two-layer architecture, data flow) — Document 06
Known-answer test suites for judge quality assurance

Document maintained as part of the prompt_risk project — Last updated: 2026-04-23