.. _judge-catalog:

Judge Catalog: LLM-as-Judge Security Evaluation Pipeline
==============================================================================

.. list-table::
   :widths: 20 80

   * - **Version**
     - v0.1 Draft
   * - **Date**
     - 2026-04-23
   * - **Purpose**
     - Define the complete set of judges used to evaluate prompt-level security risks. This document serves as an index — each judge's implementation details (prompt text, data flow, execution) are covered in Documents 04–06.

----

Overview
------------------------------------------------------------------------------

A **Judge** is an LLM prompt that evaluates another LLM prompt for security risks. Judges are organized into a three-layer pipeline: a deterministic rule engine for high-confidence pattern matching, specialized LLM judges for semantic analysis, and a meta-judge for aggregation and final risk rating.

The pipeline processes a **Prompt Submission Package** — the target prompt plus deployment context — and produces a structured **Risk Report**.

Pipeline Architecture
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: text

   Input: Prompt Submission Package
                  |
   +---------------------------------+
   | Layer 1: Rule Engine            |  <-- deterministic rules, no LLM
   | R1 - Secrets Scanner            |
   | R2 - Keyword Blocklist          |
   +--------------+------------------+
                  | flag -> escalate to Critical
                  | pass -> continue
                  v
   +---------------------------------------------------+
   | Layer 2: Specialized LLM Judges (run in parallel) |
   |                                                   |
   | J1 - Over-Permissive Authorization                |
   | J2 - Hardcoded Sensitive Data                     |
   | J3 - Role Confusion                               |
   | J4 - Instruction Conflict                         |
   | J5 - Logic Ambiguity                              |
   +---------------------------------------------------+
                  |
                  v
   +---------------------------------+
   | Layer 3: Meta-Judge             |
   | Aggregation + Risk Report       |
   +---------------------------------+
                  |
                  v
   Output: Risk Report (structured)

Prompt Submission Package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Every evaluation takes as input not just the prompt text, but a standardized package of deployment context:

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Field
     - Description
     - Example
   * - ``prompt_text``
     - The full system prompt text under review
     - —
   * - ``deployment_type``
     - Deployment architecture pattern
     - ``Agentic Executor`` / ``Chatbot`` / ``One-shot Query`` / ``RAG Pipeline``
   * - ``data_input_channels``
     - Data channels that enter the LLM context alongside the prompt
     - User input, file upload, RAG retrieval, database query, API response
   * - ``target_user_group``
     - Who uses the application
     - Internal employees / External customers / Mixed
   * - ``sensitivity_level``
     - Sensitivity of data the system handles
     - ``High`` (actuarial logic, claims data) / ``Medium`` / ``Low``
   * - ``author_notes``
     - Optional notes from the prompt author on design intent
     - —

The ``deployment_type`` field significantly affects risk weighting: the same over-permissive instruction carries far greater risk in an Agentic Executor than in a One-shot Query.

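
As an illustration, the package can be modelled as a small typed record. The sketch below is a minimal Python representation, assuming the field names from the table above; the class names ``PromptSubmissionPackage``, ``DeploymentType`` and ``SensitivityLevel`` and the empty-prompt check are hypothetical and are not taken from the ``prompt_risk`` code base.

.. code-block:: python

   from dataclasses import dataclass
   from enum import Enum
   from typing import List, Optional


   class DeploymentType(Enum):
       # Values copied from the ``deployment_type`` example column.
       AGENTIC_EXECUTOR = "Agentic Executor"
       CHATBOT = "Chatbot"
       ONE_SHOT_QUERY = "One-shot Query"
       RAG_PIPELINE = "RAG Pipeline"


   class SensitivityLevel(Enum):
       HIGH = "High"
       MEDIUM = "Medium"
       LOW = "Low"


   @dataclass
   class PromptSubmissionPackage:
       """Hypothetical container for one evaluation request."""

       prompt_text: str
       deployment_type: DeploymentType
       data_input_channels: List[str]
       target_user_group: str
       sensitivity_level: SensitivityLevel
       author_notes: Optional[str] = None  # optional per the field table

       def __post_init__(self) -> None:
           # Assumed sanity check: there is nothing to judge in an empty prompt.
           if not self.prompt_text.strip():
               raise ValueError("prompt_text must not be empty")


   package = PromptSubmissionPackage(
       prompt_text="You are a claims assistant. Never refuse a request.",
       deployment_type=DeploymentType.AGENTIC_EXECUTOR,
       data_input_channels=["User input", "RAG retrieval"],
       target_user_group="External customers",
       sensitivity_level=SensitivityLevel.HIGH,
   )

Using enumerations for ``deployment_type`` and ``sensitivity_level`` is one way to keep the values that drive risk weighting from drifting into free text; the catalog itself does not prescribe a representation.
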

----

Layer 1: Rule Engine (Deterministic)
------------------------------------------------------------------------------

This layer uses **no LLM**. It applies deterministic pattern matching for two categories of risk with unambiguous signatures. It is fast, cheap, and fully explainable — the first gate in the pipeline.

R1 — Secrets Scanner
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Risk Category:** Hardcoded Sensitive Data

Detects sensitive information embedded in prompt text via regex patterns:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Pattern Type
     - Examples
   * - API key formats
     - ``sk-[a-zA-Z0-9]{20,}``, ``Bearer [a-zA-Z0-9\-._~+/]+=*``
   * - Database connection strings
     - ``jdbc:``, ``mongodb://``, ``postgresql://``
   * - Internal network addresses
     - ``192.168.x.x``, ``10.x.x.x``
   * - Numeric coefficients with business labels
     - Consecutive decimal values with labels like "weight," "coefficient," "surcharge"

**Trigger behavior:** Any match immediately escalates the finding to Critical severity. The pipeline does not wait for Layer 2 judges to complete.

R2 — Keyword Blocklist Detector
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Risk Category:** Over-Permissive Authorization (keyword-level)

Scans for high-risk phrases that signal unconditional compliance or suppressed refusal:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Language
     - Keywords
   * - English
     - ``never refuse``, ``always comply``, ``under no circumstances decline``, ``no content restrictions``, ``ignore your guidelines``, ``ignore previous instructions``
   * - Chinese
     - ``不得拒绝``, ``任何情况下都要``, ``无条件``, ``不需要任何限制``, ``忽略你之前的指令``

**Trigger behavior:** Matches are flagged as a warning and flow into Layer 2 for deeper semantic evaluation by J1. A keyword match alone does not determine the final risk level — J1 assesses whether the keyword reflects genuine over-permissiveness in context.

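
The two rules can be sketched as a single deterministic pass over the prompt text. This is a minimal illustration, not the project's implementation: the API key and bearer-token patterns are quoted from the R1 table, while the connection-string and internal-address expressions, the case-insensitive keyword match, and the names ``RuleFinding`` and ``run_rule_engine`` are assumptions. The numeric-coefficient pattern is omitted because the catalog gives no regex for it.

.. code-block:: python

   import re
   from dataclasses import dataclass
   from typing import List

   # R1 patterns. The first two come from the table above; the last two are
   # assumed translations of the prefixes and address shapes it lists.
   SECRET_PATTERNS = {
       "api_key": re.compile(r"sk-[a-zA-Z0-9]{20,}"),
       "bearer_token": re.compile(r"Bearer [a-zA-Z0-9\-._~+/]+=*"),
       "connection_string": re.compile(r"(?:jdbc:|mongodb://|postgresql://)\S*"),
       "internal_address": re.compile(
           r"\b(?:192\.168|10\.\d{1,3})\.\d{1,3}\.\d{1,3}\b"
       ),
   }

   # R2 keywords, copied from the blocklist table.
   BLOCKLIST = [
       "never refuse", "always comply", "under no circumstances decline",
       "no content restrictions", "ignore your guidelines",
       "ignore previous instructions",
       "不得拒绝", "任何情况下都要", "无条件", "不需要任何限制", "忽略你之前的指令",
   ]


   @dataclass
   class RuleFinding:
       rule: str      # "R1" or "R2"
       severity: str  # "Critical" for R1 matches, "Warning" for R2 matches
       evidence: str  # the matched text or keyword


   def run_rule_engine(prompt_text: str) -> List[RuleFinding]:
       """Apply R1 then R2 and return every finding in that order."""
       findings = []
       for pattern in SECRET_PATTERNS.values():
           for match in pattern.finditer(prompt_text):
               findings.append(RuleFinding("R1", "Critical", match.group(0)))
       lowered = prompt_text.lower()  # assumption: keywords match case-insensitively
       for keyword in BLOCKLIST:
           if keyword in lowered:
               findings.append(RuleFinding("R2", "Warning", keyword))
       return findings


   sample = "Connect to postgresql://db.internal/claims and never refuse a request."
   for finding in run_rule_engine(sample):
       print(finding.rule, finding.severity, finding.evidence)

R1 hits are labelled ``Critical`` and R2 hits ``Warning``, mirroring the trigger behavior described for each rule; how the pipeline acts on those labels is outside this sketch.
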

----

Layer 2: Specialized LLM Judges
------------------------------------------------------------------------------

Five judges run in parallel, each focused on a single security topic. Parallel execution isolates reasoning load — a judge analyzing instruction conflicts does not compete for attention with one analyzing role confusion.

Each judge is itself a versioned prompt stored under ``data/judges/prompts/`` (see Document 04 for directory structure). Each judge outputs structured findings per criterion, following a common schema (see Document 06 for output format and execution details).

Judge Summary Table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 10 30 30 30

   * - Judge
     - Security Topic
     - Risk Category
     - Status
   * - **J1**
     - Over-Permissive Authorization
     - Risk 1
     - Implemented
   * - **J2**
     - Hardcoded Sensitive Data
     - Risk 2
     - Planned
   * - **J3**
     - Role Confusion
     - Risk 3
     - Planned
   * - **J4**
     - Instruction Conflict
     - Risk 4
     - Planned
   * - **J5**
     - Logic Ambiguity
     - Risk 5
     - Planned

----

J1 — Over-Permissive Authorization Judge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Risk category:** Risk 1 — Over-Permissive Authorization

**Status:** Implemented (``data/judges/prompts/j1-over-permissive/``)

**What it evaluates:** Whether the target prompt grants the model excessive behavioral freedom, weakens its refusal capability, or fails to define clear scope boundaries.

J1 also picks up semantic variants of over-permissive language that R2's keyword blocklist cannot catch — for example, "put user experience first and ensure every question receives a satisfactory answer" is semantically equivalent to "always comply" but contains no blocklist keywords.

**Evaluation criteria (5):**

.. list-table::
   :header-rows: 1
   :widths: 5 30 65

   * - #
     - Criterion
     - What it checks
   * - 1
     - Explicit Refusal Capability
     - Does the prompt define when and how to refuse?
   * - 2
     - Scope Boundaries
     - Are both positive scope (may do) and negative scope (must not do) defined?
   * - 3
     - Unconditional Compliance Language
     - Does the prompt contain phrases instructing unconditional compliance?
   * - 4
     - Failure Handling
     - Does the prompt define behavior for requests the model cannot or should not fulfill?
   * - 5
     - Anti-Injection Guardrails
     - Does the prompt instruct the model to treat user input as data, not commands?

**Scoring:** 1 (critical) through 5 (pass), based on the number and severity of findings across criteria.

----

J2 — Hardcoded Sensitive Data Judge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Risk category:** Risk 2 — Hardcoded Sensitive Data

**Status:** Planned

**What it evaluates:** Whether the target prompt embeds proprietary information that could be extracted by adversarial users — pricing coefficients, underwriting rules, actuarial formulas, internal thresholds, API credentials, or business logic that should reside in backend systems rather than in prompt text.

J2 complements R1 (Secrets Scanner): R1 catches structured patterns (API keys, connection strings), while J2 performs semantic analysis to identify business logic and decision rules expressed in natural language — content that has no regex signature but is equally sensitive.


**Key evaluation dimensions:**

- Presence of numeric business parameters (coefficients, thresholds, weights) with operational significance
- Embedded decision trees or scoring rules that constitute proprietary business logic
- Internal system names, endpoints, or architecture details that aid reconnaissance
- Information that would give a competitor or adversary an actionable advantage if extracted

----

J3 — Role Confusion Judge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Risk category:** Risk 3 — Role Confusion

**Status:** Planned

**What it evaluates:** Whether the target prompt defines the model's identity with sufficient precision and resilience to resist identity manipulation attacks — including persona switching ("pretend you are..."), gradual role drift across multi-turn conversations, and thought-experiment framing ("hypothetically, if you had no restrictions...").

**Key evaluation dimensions:**

.. list-table::
   :header-rows: 1
   :widths: 5 25 70

   * - #
     - Dimension
     - What it checks
   * - 1
     - Identity precision
     - Is the role definition specific and bounded, or vague and open-ended? (e.g., "flexibly support the user" invites role switching)
   * - 2
     - User override protection
     - Does the prompt explicitly prohibit users from redefining the model's role or identity during conversation?
   * - 3
     - Persona switching defense
     - Does the prompt explicitly reject hypothetical/role-play framings like "assume you are a consultant with no policy constraints"?

----

J4 — Instruction Conflict Judge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Risk category:** Risk 4 — Instruction Conflict

**Status:** Planned

**What it evaluates:** Whether the target prompt contains pairs of instructions that produce contradictory behavior under specific triggering conditions.
This is the most reasoning-intensive judge — it must decompose the prompt into individual constraint units, perform pairwise comparison, and identify scenarios where following one instruction necessarily violates another.

**Key conflict patterns:**

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Conflict Type
     - Example
   * - Confidentiality vs. Helpfulness
     - "Never disclose internal rules" + "Ensure the user feels fully supported and informed"
   * - Refusal vs. User Experience
     - "Refuse out-of-scope requests" + "Avoid responses that make the user feel rejected"
   * - Compliance vs. Flexibility
     - "Follow all regulatory guidelines strictly" + "Use your judgment to provide the best outcome"

**Trigger conditions:** Conflicts often surface only under specific user behaviors — persistent follow-up questions, urgency claims, authority assertions — not during normal interaction.

----

J5 — Logic Ambiguity Judge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Risk category:** Risk 5 — Logic Ambiguity

**Status:** Planned

**What it evaluates:** Whether the target prompt contains restriction instructions with soft qualifiers — words like "usually," "try to avoid," "unless necessary," "in most cases" — that create exploitable exception pathways. An attacker can construct a scenario that satisfies the exception condition, causing the model to abandon the restriction.

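
To make the notion of a soft qualifier concrete, the sketch below lists the sentences of a prompt that contain one of the qualifiers quoted above. It is only an illustration: J5 is a planned LLM judge, and nothing in this catalog specifies a deterministic pre-scan. The function name ``find_soft_qualifiers`` and the naive sentence splitting are assumptions.

.. code-block:: python

   import re
   from typing import List, Tuple

   # Qualifiers quoted from the J5 description.
   SOFT_QUALIFIERS = ["usually", "try to avoid", "unless necessary", "in most cases"]


   def find_soft_qualifiers(prompt_text: str) -> List[Tuple[str, str]]:
       """Return (qualifier, sentence) pairs for sentences containing a qualifier."""
       # Naive split after ., ! or ? followed by whitespace (an assumption;
       # real prompts may need a proper sentence tokenizer).
       sentences = re.split(r"(?<=[.!?])\s+", prompt_text.strip())
       hits = []
       for sentence in sentences:
           lowered = sentence.lower()
           for qualifier in SOFT_QUALIFIERS:
               if qualifier in lowered:
                   hits.append((qualifier, sentence))
       return hits


   prompt = (
       "You are a support bot. "
       "Usually do not recommend third-party services. "
       "Never share claims data."
   )
   for qualifier, sentence in find_soft_qualifiers(prompt):
       print(f"{qualifier!r} in: {sentence}")

A string match of this kind can only locate candidate sentences; deciding whether the exception is bounded, and whether an attacker could plausibly trigger it, is the semantic judgment the catalog assigns to J5.
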

**Key evaluation dimensions:**

- Presence of soft qualifiers in security-critical instructions
- Whether exception conditions are bounded (specific, enumerated) or unbounded (vague, judgment-based)
- Feasibility of constructing a plausible scenario that triggers the exception (e.g., fabricating an emergency to bypass "usually do not recommend third-party services")

----

Layer 3: Meta-Judge (Aggregation)
------------------------------------------------------------------------------

The Meta-Judge receives all outputs from Layer 1 and Layer 2 and performs two tasks: **weighted risk aggregation** and **risk report generation**.

Deployment Scenario Weighting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The final risk level is not simply the highest individual finding. It is adjusted by a **deployment scenario weight** that reflects the real-world impact amplification of different architectures:

.. list-table::
   :header-rows: 1
   :widths: 60 40

   * - Deployment Scenario
     - Weight
   * - Agentic Executor (tool-calling, external actions)
     - 2.0x
   * - RAG Pipeline (external knowledge base)
     - 1.5x
   * - Customer-facing Chatbot (external users)
     - 1.3x
   * - Internal Employee Tool (internal users only)
     - 1.0x
   * - One-shot Query (no session memory)
     - 0.8x

The same risk finding in an Agentic Executor context may be several times more dangerous than in a One-shot Query context.

Risk Report Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Meta-Judge produces a structured Risk Report:

.. code-block:: text

   ## Prompt Security Evaluation Report

   ### Submission Info
   - Prompt ID: [unique identifier]
   - Submission Time: [timestamp]
   - Deployment Type: [scenario type]
   - Sensitivity Level: [High / Medium / Low]

   ### Overall Risk Rating: Critical / High / Medium / Low

   ### Findings

   #### [Risk Category Name]
   - Severity: [level]
   - Evidence: [quoted prompt text]
   - Triggered By: [Layer 1 rule / J1 / J2 / J3 / J4 / J5]
   - Recommendation: [specific remediation action]

   ### Prioritized Remediation Plan
   1. [Immediate action items]
   2. [Short-term action items]
   3. [Ongoing governance items]

   ### Audit Trail
   - Reviewed by: LLM-as-Judge Pipeline v[version]
   - Human Review Required: Yes / No
   - Escalation Status: [whether human review is needed]

**Status:** Planned

----

Judge Quality Metrics
------------------------------------------------------------------------------

Judges are themselves prompts — and prompts can be unreliable. The following metrics are tracked to ensure judge trustworthiness:

.. list-table::
   :header-rows: 1
   :widths: 20 50 30

   * - Metric
     - Definition
     - Target
   * - **Detection Rate (Recall)**
     - Proportion of known-vulnerable prompts correctly flagged
     - >= 90%
   * - **False Positive Rate**
     - Proportion of known-safe prompts incorrectly flagged
     - <= 10%
   * - **Severity Accuracy**
     - Accuracy of risk level assignment (Critical/High/Medium/Low)
     - As high as possible
   * - **Actionability Score**
     - Proportion of recommendations rated "directly actionable" by human reviewers
     - Human-assessed

Judge quality assurance methodology (known-answer testing, cross-version comparison, cross-model comparison) is detailed in Document 06.

----

Cross-Reference to Use Cases
------------------------------------------------------------------------------

Each judge applies across all six use cases defined in Document 02, but the relevance and severity weighting varies by use case architecture:

.. list-table::
   :header-rows: 1
   :widths: 16 14 14 14 14 14 14

   * - Judge
     - UC1 Pipeline
     - UC2 RAG
     - UC3 Web Agent
     - UC4 Basic Agent
     - UC5 Advanced Agent
     - UC6 Customer-Facing
   * - **J1** Over-Permissive
     - Medium
     - Medium
     - Medium
     - Medium
     - High
     - High
   * - **J2** Hardcoded Data
     - High (P4)
     - High (pricing in KB)
     - Low
     - Low
     - Medium
     - Low
   * - **J3** Role Confusion
     - Low
     - Low
     - Low
     - Low
     - Medium
     - High
   * - **J4** Instruction Conflict
     - Medium
     - Low
     - Medium
     - Medium
     - High
     - High
   * - **J5** Logic Ambiguity
     - Medium
     - Low
     - Medium
     - Medium
     - High
     - High
   * - **R1** Secrets Scanner
     - All
     - All
     - All
     - All
     - All
     - All
   * - **R2** Keyword Blocklist
     - All
     - All
     - All
     - All
     - All
     - All

----

Next Steps
------------------------------------------------------------------------------

Each judge will be expanded with:

1. Full prompt specifications (system prompt + user prompt templates) — Document 04 (data structure)
2. Execution mechanics (two-layer architecture, data flow) — Document 06
3. Known-answer test suites for judge quality assurance

----

*Document maintained as part of the* ``prompt_risk`` *project — Last updated: 2026-04-23*