.. _judge-catalog:

Judge Catalog: LLM-as-Judge Security Evaluation Pipeline
==============================================================================

.. list-table::
   :widths: 20 80

   * - **Version**
     - v0.1 Draft
   * - **Date**
     - 2026-04-23
   * - **Purpose**
     - Define the complete set of judges used to evaluate prompt-level security risks. This document serves as an index — each judge's implementation details (prompt text, data flow, execution) are covered in Documents 04–06.

----

Overview
------------------------------------------------------------------------------
A **Judge** is an LLM prompt that evaluates another LLM prompt for security risks. Judges are organized into a three-layer pipeline: a deterministic rule engine for high-confidence pattern matching, specialized LLM judges for semantic analysis, and a meta-judge for aggregation and final risk rating.

The pipeline processes a **Prompt Submission Package** — the target prompt plus deployment context — and produces a structured **Risk Report**.

Pipeline Architecture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: text

   Input: Prompt Submission Package
                 |
     +---------------------------------+
     |  Layer 1: Rule Engine           |  <-- deterministic rules, no LLM
     |  R1 - Secrets Scanner           |
     |  R2 - Keyword Blocklist         |
     +--------------+------------------+
                    | flag -> escalate to Critical
                    | pass -> continue
                    v
     +---------------------------------------------------+
     |  Layer 2: Specialized LLM Judges (run in parallel) |
     |                                                    |
     |  J1 - Over-Permissive Authorization                |
     |  J2 - Hardcoded Sensitive Data                     |
     |  J3 - Role Confusion                               |
     |  J4 - Instruction Conflict                         |
     |  J5 - Logic Ambiguity                              |
     +---------------------------------------------------+
                    |
                    v
     +---------------------------------+
     |  Layer 3: Meta-Judge            |
     |  Aggregation + Risk Report      |
     +---------------------------------+
                    |
                    v
   Output: Risk Report (structured)

Prompt Submission Package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Every evaluation takes as input not just the prompt text, but a standardized package of deployment context:

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Field
     - Description
     - Example
   * - ``prompt_text``
     - The full system prompt text under review
     - —
   * - ``deployment_type``
     - Deployment architecture pattern
     - ``Agentic Executor`` / ``Chatbot`` / ``One-shot Query`` / ``RAG Pipeline``
   * - ``data_input_channels``
     - Data channels that enter the LLM context alongside the prompt
     - User input, file upload, RAG retrieval, database query, API response
   * - ``target_user_group``
     - Who uses the application
     - Internal employees / External customers / Mixed
   * - ``sensitivity_level``
     - Sensitivity of data the system handles
     - ``High`` (actuarial logic, claims data) / ``Medium`` / ``Low``
   * - ``author_notes``
     - Optional notes from the prompt author on design intent
     - —

The ``deployment_type`` field significantly affects risk weighting: the same over-permissive instruction carries far greater risk in an Agentic Executor than in a One-shot Query.

----

Layer 1: Rule Engine (Deterministic)
------------------------------------------------------------------------------

This layer uses **no LLM**. It applies deterministic pattern matching for two categories of risk with unambiguous signatures. It is fast, cheap, and fully explainable — the first gate in the pipeline.

R1 — Secrets Scanner
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Risk Category:** Hardcoded Sensitive Data

Detects sensitive information embedded in prompt text via regex patterns:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Pattern Type
     - Examples
   * - API key formats
     - ``sk-[a-zA-Z0-9]{20,}``, ``Bearer [a-zA-Z0-9\-._~+/]+=*``
   * - Database connection strings
     - ``jdbc:``, ``mongodb://``, ``postgresql://``
   * - Internal network addresses
     - ``192.168.x.x``, ``10.x.x.x``
   * - Numeric coefficients with business labels
     - Consecutive decimal values with labels like "weight," "coefficient," "surcharge"

**Trigger behavior:** Any match immediately escalates the finding to Critical severity. The pipeline does not wait for Layer 2 judges to complete.

R2 — Keyword Blocklist Detector
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Risk Category:** Over-Permissive Authorization (keyword-level)

Scans for high-risk phrases that signal unconditional compliance or suppressed refusal:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Language
     - Keywords
   * - English
     - ``never refuse``, ``always comply``, ``under no circumstances decline``, ``no content restrictions``, ``ignore your guidelines``, ``ignore previous instructions``
   * - Chinese
     - ``不得拒绝``, ``任何情况下都要``, ``无条件``, ``不需要任何限制``, ``忽略你之前的指令``

**Trigger behavior:** Matches are flagged as a warning and flow into Layer 2 for deeper semantic evaluation by J1. A keyword match alone does not determine the final risk level — J1 assesses whether the keyword reflects genuine over-permissiveness in context.

----

Layer 2: Specialized LLM Judges
------------------------------------------------------------------------------
Five judges run in parallel, each focused on a single security topic. Parallel execution isolates reasoning load — a judge analyzing instruction conflicts does not compete for attention with one analyzing role confusion.

Each judge is itself a versioned prompt stored under ``data/judges/prompts/`` (see Document 04 for directory structure). Each judge outputs structured findings per criterion, following a common schema (see Document 06 for output format and execution details).

Judge Summary Table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
   :header-rows: 1
   :widths: 10 30 30 30

   * - Judge
     - Security Topic
     - Risk Category
     - Status
   * - **J1**
     - Over-Permissive Authorization
     - Risk 1
     - Implemented
   * - **J2**
     - Hardcoded Sensitive Data
     - Risk 2
     - Planned
   * - **J3**
     - Role Confusion
     - Risk 3
     - Planned
   * - **J4**
     - Instruction Conflict
     - Risk 4
     - Planned
   * - **J5**
     - Logic Ambiguity
     - Risk 5
     - Planned

----

J1 — Over-Permissive Authorization Judge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Risk category:** Risk 1 — Over-Permissive Authorization

**Status:** Implemented (``data/judges/prompts/j1-over-permissive/``)

**What it evaluates:** Whether the target prompt grants the model excessive behavioral freedom, weakens its refusal capability, or fails to define clear scope boundaries. J1 also picks up semantic variants of over-permissive language that R2's keyword blocklist cannot catch — for example, "put user experience first and ensure every question receives a satisfactory answer" is semantically equivalent to "always comply" but contains no blocklist keywords.

**Evaluation criteria (5):**

.. list-table::
   :header-rows: 1
   :widths: 5 30 65

   * - #
     - Criterion
     - What it checks
   * - 1
     - Explicit Refusal Capability
     - Does the prompt define when and how to refuse?
   * - 2
     - Scope Boundaries
     - Are both positive scope (may do) and negative scope (must not do) defined?
   * - 3
     - Unconditional Compliance Language
     - Does the prompt contain phrases instructing unconditional compliance?
   * - 4
     - Failure Handling
     - Does the prompt define behavior for requests the model cannot or should not fulfill?
   * - 5
     - Anti-Injection Guardrails
     - Does the prompt instruct the model to treat user input as data, not commands?

**Scoring:** 1 (critical) through 5 (pass), based on the number and severity of findings across criteria.

----

J2 — Hardcoded Sensitive Data Judge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Risk category:** Risk 2 — Hardcoded Sensitive Data

**Status:** Planned

**What it evaluates:** Whether the target prompt embeds proprietary information that could be extracted by adversarial users — pricing coefficients, underwriting rules, actuarial formulas, internal thresholds, API credentials, or business logic that should reside in backend systems rather than in prompt text.

J2 complements R1 (Secrets Scanner): R1 catches structured patterns (API keys, connection strings), while J2 performs semantic analysis to identify business logic and decision rules expressed in natural language — content that has no regex signature but is equally sensitive.

**Key evaluation dimensions:**

- Presence of numeric business parameters (coefficients, thresholds, weights) with operational significance
- Embedded decision trees or scoring rules that constitute proprietary business logic
- Internal system names, endpoints, or architecture details that aid reconnaissance
- Information that would give a competitor or adversary an actionable advantage if extracted

----

J3 — Role Confusion Judge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Risk category:** Risk 3 — Role Confusion

**Status:** Planned

**What it evaluates:** Whether the target prompt defines the model's identity with sufficient precision and resilience to resist identity manipulation attacks — including persona switching ("pretend you are..."), gradual role drift across multi-turn conversations, and thought-experiment framing ("hypothetically, if you had no restrictions...").

**Key evaluation dimensions:**

.. list-table::
   :header-rows: 1
   :widths: 5 25 70

   * - #
     - Dimension
     - What it checks
   * - 1
     - Identity precision
     - Is the role definition specific and bounded, or vague and open-ended? (e.g., "flexibly support the user" invites role switching)
   * - 2
     - User override protection
     - Does the prompt explicitly prohibit users from redefining the model's role or identity during conversation?
   * - 3
     - Persona switching defense
     - Does the prompt explicitly reject hypothetical/role-play framings like "assume you are a consultant with no policy constraints"?

----

J4 — Instruction Conflict Judge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Risk category:** Risk 4 — Instruction Conflict

**Status:** Planned

**What it evaluates:** Whether the target prompt contains pairs of instructions that produce contradictory behavior under specific triggering conditions. This is the most reasoning-intensive judge — it must decompose the prompt into individual constraint units, perform pairwise comparison, and identify scenarios where following one instruction necessarily violates another.

**Key conflict patterns:**

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Conflict Type
     - Example
   * - Confidentiality vs. Helpfulness
     - "Never disclose internal rules" + "Ensure the user feels fully supported and informed"
   * - Refusal vs. User Experience
     - "Refuse out-of-scope requests" + "Avoid responses that make the user feel rejected"
   * - Compliance vs. Flexibility
     - "Follow all regulatory guidelines strictly" + "Use your judgment to provide the best outcome"

**Trigger conditions:** Conflicts often surface only under specific user behaviors — persistent follow-up questions, urgency claims, authority assertions — not during normal interaction.

----

J5 — Logic Ambiguity Judge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Risk category:** Risk 5 — Logic Ambiguity

**Status:** Planned

**What it evaluates:** Whether the target prompt contains restriction instructions with soft qualifiers — words like "usually," "try to avoid," "unless necessary," "in most cases" — that create exploitable exception pathways. An attacker can construct a scenario that satisfies the exception condition, causing the model to abandon the restriction.

**Key evaluation dimensions:**

- Presence of soft qualifiers in security-critical instructions
- Whether exception conditions are bounded (specific, enumerated) or unbounded (vague, judgment-based)
- Feasibility of constructing a plausible scenario that triggers the exception (e.g., fabricating an emergency to bypass "usually do not recommend third-party services")

----

Layer 3: Meta-Judge (Aggregation)
------------------------------------------------------------------------------
The Meta-Judge receives all outputs from Layer 1 and Layer 2 and performs two tasks: **weighted risk aggregation** and **risk report generation**.

Deployment Scenario Weighting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The final risk level is not simply the highest individual finding. It is adjusted by a **deployment scenario weight** that reflects the real-world impact amplification of different architectures:

.. list-table::
   :header-rows: 1
   :widths: 60 40

   * - Deployment Scenario
     - Weight
   * - Agentic Executor (tool-calling, external actions)
     - 2.0x
   * - RAG Pipeline (external knowledge base)
     - 1.5x
   * - Customer-facing Chatbot (external users)
     - 1.3x
   * - Internal Employee Tool (internal users only)
     - 1.0x
   * - One-shot Query (no session memory)
     - 0.8x

The same risk finding in an Agentic Executor context may be several times more dangerous than in a One-shot Query context.

Risk Report Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Meta-Judge produces a structured Risk Report:

.. code-block:: text

   ## Prompt Security Evaluation Report

   ### Submission Info
   - Prompt ID: [unique identifier]
   - Submission Time: [timestamp]
   - Deployment Type: [scenario type]
   - Sensitivity Level: [High / Medium / Low]

   ### Overall Risk Rating: Critical / High / Medium / Low

   ### Findings

   #### [Risk Category Name]
   - Severity: [level]
   - Evidence: [quoted prompt text]
   - Triggered By: [Layer 1 rule / J1 / J2 / J3 / J4 / J5]
   - Recommendation: [specific remediation action]

   ### Prioritized Remediation Plan
   1. [Immediate action items]
   2. [Short-term action items]
   3. [Ongoing governance items]

   ### Audit Trail
   - Reviewed by: LLM-as-Judge Pipeline v[version]
   - Human Review Required: Yes / No
   - Escalation Status: [whether human review is needed]

**Status:** Planned

----

Judge Quality Metrics
------------------------------------------------------------------------------
Judges are themselves prompts — and prompts can be unreliable. The following metrics are tracked to ensure judge trustworthiness:

.. list-table::
   :header-rows: 1
   :widths: 20 50 30

   * - Metric
     - Definition
     - Target
   * - **Detection Rate (Recall)**
     - Proportion of known-vulnerable prompts correctly flagged
     - >= 90%
   * - **False Positive Rate**
     - Proportion of known-safe prompts incorrectly flagged
     - <= 10%
   * - **Severity Accuracy**
     - Accuracy of risk level assignment (Critical/High/Medium/Low)
     - As high as possible
   * - **Actionability Score**
     - Proportion of recommendations rated "directly actionable" by human reviewers
     - Human-assessed

Judge quality assurance methodology (known-answer testing, cross-version comparison, cross-model comparison) is detailed in Document 06.

----

Cross-Reference to Use Cases
------------------------------------------------------------------------------
Each judge applies across all six use cases defined in Document 02, but the relevance and severity weighting varies by use case architecture:

.. list-table::
   :header-rows: 1
   :widths: 16 14 14 14 14 14 14

   * - Judge
     - UC1 Pipeline
     - UC2 RAG
     - UC3 Web Agent
     - UC4 Basic Agent
     - UC5 Advanced Agent
     - UC6 Customer-Facing
   * - **J1** Over-Permissive
     - Medium
     - Medium
     - Medium
     - Medium
     - High
     - High
   * - **J2** Hardcoded Data
     - High (P4)
     - High (pricing in KB)
     - Low
     - Low
     - Medium
     - Low
   * - **J3** Role Confusion
     - Low
     - Low
     - Low
     - Low
     - Medium
     - High
   * - **J4** Instruction Conflict
     - Medium
     - Low
     - Medium
     - Medium
     - High
     - High
   * - **J5** Logic Ambiguity
     - Medium
     - Low
     - Medium
     - Medium
     - High
     - High
   * - **R1** Secrets Scanner
     - All
     - All
     - All
     - All
     - All
     - All
   * - **R2** Keyword Blocklist
     - All
     - All
     - All
     - All
     - All
     - All

----

Next Steps
------------------------------------------------------------------------------
Each judge will be expanded with:

1. Full prompt specifications (system prompt + user prompt templates) — Document 04 (data structure)
2. Execution mechanics (two-layer architecture, data flow) — Document 06
3. Known-answer test suites for judge quality assurance

----

*Document maintained as part of the* ``prompt_risk`` *project — Last updated: 2026-04-23*