LLM Judge
LLM Judge uses an AI model to evaluate outputs based on custom criteria. This is useful when evaluation requires human-like judgment, such as assessing tone, completeness, or factual accuracy.

Basic Usage
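A minimal sketch of basic usage. The `LLMJudge` class name, import path, and `score()` method below are illustrative assumptions, not confirmed by this page; only the constructor parameters come from the table that follows:

```python
# Hypothetical import path -- adjust to wherever your framework
# exposes the LLM Judge scorer.
from evals.scorers import LLMJudge

# A judge that checks whether the output actually answers the input.
relevance_judge = LLMJudge(
    name="relevance",
    prompt=(
        "Question: {input}\n"
        "Answer: {output}\n"
        "Does the answer address the question? "
        "Reply with a score between 0.0 and 1.0."
    ),
)

# With the default output_type="percentage", the scorer returns a
# float in [0.0, 1.0]. The score() signature here is an assumption.
score = relevance_judge.score(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
)
print(score)
```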
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | str | Required | Scorer name in results |
| `prompt` | str | Required | Evaluation prompt template |
| `model` | str | "claude-3-5-haiku" | Model for judging |
| `output_type` | str | "percentage" | Output type (see below) |
| `categories` | list | None | For categorical type |
| `range` | tuple | None | For discrete type |
| `chain_of_thought` | bool | False | Enable reasoning |
| `temperature` | float | 0.0 | Model temperature |
Prompt Variables
Use these variables in your prompt:

| Variable | Description |
|---|---|
| `{input}` | The original input |
| `{output}` | The task output to evaluate |
| `{expected}` | The expected output (if provided) |
Output Types
Boolean
Returns 1.0 (true) or 0.0 (false).

Discrete
Returns a normalized score from an integer range.

Categorical
Returns a position-based score from the list of categories.

Percentage
Returns a decimal between 0.0 and 1.0.
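A sketch of one judge per output type, using the same assumed `LLMJudge` constructor as above; the normalization rules in the comments are inferred from the descriptions and should be checked against your framework:

```python
from evals.scorers import LLMJudge  # assumed import; see Basic Usage

# Boolean: the prompt should elicit a yes/no answer; scores are 1.0 or 0.0.
has_citation = LLMJudge(
    name="has_citation",
    prompt="Does the following output cite at least one source?\n{output}",
    output_type="boolean",
)

# Discrete: an integer answer is normalized into [0.0, 1.0]; presumably
# 4 on a 1-5 scale maps to (4 - 1) / (5 - 1) = 0.75.
clarity = LLMJudge(
    name="clarity",
    prompt="Rate the clarity of this text from 1 to 5:\n{output}",
    output_type="discrete",
    range=(1, 5),
)

# Categorical: the score presumably comes from the category's position,
# e.g. "good" in ["bad", "okay", "good"] -> 1.0, "okay" -> 0.5.
quality = LLMJudge(
    name="quality",
    prompt="Classify this output as bad, okay, or good:\n{output}",
    output_type="categorical",
    categories=["bad", "okay", "good"],
)

# Percentage (the default): the judge replies with a decimal in [0.0, 1.0].
relevance = LLMJudge(
    name="relevance",
    prompt=(
        "Question: {input}\n"
        "Answer: {output}\n"
        "How relevant is the answer? Reply with a decimal between 0.0 and 1.0."
    ),
)
```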
Chain of Thought
Enable step-by-step reasoning for better judgments. With chain_of_thought=True (sketched after this list), the judge is instructed to:
- Think through the evaluation step by step
- Provide its reasoning in <reasoning> tags
- Give its final answer in <answer> tags
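A sketch of a judge with reasoning enabled, again assuming the illustrative `LLMJudge` class:

```python
from evals.scorers import LLMJudge  # assumed import; see Basic Usage

answer_correctness = LLMJudge(
    name="answer_correctness",
    prompt=(
        "Question: {input}\n"
        "Answer: {output}\n"
        "Expected: {expected}\n"
        "Is the answer correct?"
    ),
    output_type="boolean",
    chain_of_thought=True,  # judge reasons step by step in <reasoning>
                            # tags, then answers in <answer> tags
)
```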
Practical Examples
Response Completeness
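A sketch of a completeness judge on a 1-5 scale (assumed `LLMJudge` class, illustrative prompt):

```python
from evals.scorers import LLMJudge  # assumed import; see Basic Usage

completeness = LLMJudge(
    name="completeness",
    prompt=(
        "Question: {input}\n"
        "Answer: {output}\n"
        "Rate how completely the answer addresses every part of the "
        "question, from 1 (ignores the question) to 5 (fully addresses it)."
    ),
    output_type="discrete",
    range=(1, 5),
)
```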
Code Quality
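A sketch of a categorical code-review judge (assumed `LLMJudge` class, illustrative prompt and categories):

```python
from evals.scorers import LLMJudge  # assumed import; see Basic Usage

code_quality = LLMJudge(
    name="code_quality",
    prompt=(
        "Review the following code for readability, correctness, and "
        "idiomatic style:\n{output}\n"
        "Classify it as poor, acceptable, or excellent."
    ),
    output_type="categorical",
    categories=["poor", "acceptable", "excellent"],
    chain_of_thought=True,  # code review benefits from explicit reasoning
)
```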
Tone Analysis
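A sketch of a boolean tone judge (assumed `LLMJudge` class, illustrative prompt):

```python
from evals.scorers import LLMJudge  # assumed import; see Basic Usage

professional_tone = LLMJudge(
    name="professional_tone",
    prompt=(
        "Message: {output}\n"
        "Is the tone of this message professional and courteous? "
        "Answer yes or no."
    ),
    output_type="boolean",
)
```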
Factual Verification
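A sketch of a percentage-based factuality judge that compares the output against {expected} (assumed `LLMJudge` class, illustrative prompt):

```python
from evals.scorers import LLMJudge  # assumed import; see Basic Usage

factual_accuracy = LLMJudge(
    name="factual_accuracy",
    prompt=(
        "Claim: {output}\n"
        "Reference: {expected}\n"
        "What fraction of the claims are supported by the reference? "
        "Reply with a decimal between 0.0 and 1.0."
    ),
    output_type="percentage",
    chain_of_thought=True,
)
```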
Multiple Judges
Use multiple judges for different aspects of the same output, as sketched below:
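A sketch of running several judges over one output (assumed `LLMJudge` class and `score()` method, illustrative data):

```python
from evals.scorers import LLMJudge  # assumed import; see Basic Usage

judges = [
    LLMJudge(
        name="accuracy",
        prompt="Given the reference {expected}, is {output} correct? Answer yes or no.",
        output_type="boolean",
    ),
    LLMJudge(
        name="tone",
        prompt="Rate the politeness of {output} from 1 to 5.",
        output_type="discrete",
        range=(1, 5),
    ),
    LLMJudge(
        name="relevance",
        prompt="How relevant is {output} to {input}? Reply with 0.0-1.0.",
    ),
]

# Collect one score per aspect; the score() signature is the same
# assumption as in the Basic Usage sketch.
results = {
    judge.name: judge.score(
        input="What is 2 + 2?",
        output="2 + 2 equals 4.",
        expected="4",
    )
    for judge in judges
}
```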
Best Practices
- Be Specific: Write clear, specific prompts
- Use Examples: Include examples in prompts when helpful
- Choose Appropriate Type: Match output type to your criteria
- Use Chain of Thought: Enable chain_of_thought for complex, multi-step evaluations
- Select Right Model: Use capable models for nuanced judgments
- Low Temperature: Keep at 0.0 for consistency