# Running Evaluations
The `client.evals.run()` method executes evaluations against your dataset.
## Basic Usage
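A minimal sketch built from the parameters documented below; the `myevals` import path and the `ExactMatch` scorer are hypothetical stand-ins for your SDK's actual names:

```python
from myevals import Client              # hypothetical import path
from myevals.scorers import ExactMatch  # hypothetical scorer, for illustration

client = Client()

def task(input):
    # Produce an output for one test case; replace with your model call.
    return "Hello!" if "hello" in input.lower() else input

result = client.evals.run(
    name="greeting-eval",
    dataset=[{"input": "Say hello", "expected": "Hello!"}],
    task=task,
    scorers=[ExactMatch()],
)
```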
## Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `name` | `str` | Yes | Name for this evaluation |
| `dataset` | `list` | Yes | List of test cases |
| `task` | `callable` | Yes | Function that produces output |
| `scorers` | `list` | Yes | List of scorers to apply |
| `max_concurrent` | `int` | No | Max parallel tasks (default: 10) |
| `show_progress` | `bool` | No | Show progress bar (default: `True`) |
## Dataset Format
Each item in the dataset is a dictionary. The `input` can be a string or a dictionary for complex inputs.
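For example (the `expected` key is an assumption here; use whatever fields your scorers read):

```python
dataset = [
    # Simple string input
    {"input": "What is 2 + 2?", "expected": "4"},
    # Dictionary input for complex, multi-field test cases
    {
        "input": {"task": "translate to French", "text": "good morning"},
        "expected": "bonjour",
    },
]
```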
## Task Function
The task function receives the `input` and returns the output:
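A sketch handling both input shapes from the section above; `call_model` is a stand-in for your own inference call:

```python
def call_model(prompt: str) -> str:
    # Stand-in for your model or API call.
    return f"Answer to: {prompt}"

def task(input):
    # `input` is the dataset row's "input" value: a plain string,
    # or a dict for complex inputs.
    prompt = input if isinstance(input, str) else input["text"]
    return call_model(prompt)
```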
## Evaluation Result
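`run()` returns a result object summarizing the run. The field names below are assumptions for illustration; check your SDK reference for the exact shape:

```python
# Continuing the Basic Usage sketch; field names are assumed.
print(result.name)        # "greeting-eval"
print(len(result.rows))   # number of test cases evaluated
print(result.scores)      # aggregate score per scorer, e.g. {"ExactMatch": 1.0}
```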
## Individual Results
Access individual row results:
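A sketch of iterating rows; `row.input`, `row.output`, and `row.scores` are assumed names (`row.error` is the field referenced under Error Handling and Best Practices below):

```python
for row in result.rows:
    print(row.input)   # the dataset item's input
    print(row.output)  # what the task returned
    print(row.scores)  # per-scorer scores for this row
    if row.error:      # set when the task raised for this row
        print(f"Task failed: {row.error}")
```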
## Multiple Scorers

Apply multiple scorers in one run:
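Each scorer in the list is applied to every row. The scorer names here are illustrative, not a documented catalog:

```python
from myevals.scorers import ExactMatch, Relevance, Fluency  # hypothetical names

result = client.evals.run(
    name="comprehensive-eval",
    dataset=dataset,
    task=task,
    scorers=[ExactMatch(), Relevance(), Fluency()],  # one score per scorer per row
)
```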
## Controlling Concurrency

Adjust parallel execution:
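For example, lowering `max_concurrent` from its default of 10 helps stay under provider rate limits:

```python
result = client.evals.run(
    name="rate-limited-eval",
    dataset=dataset,
    task=task,
    scorers=[ExactMatch()],
    max_concurrent=3,  # run at most 3 tasks in parallel (default: 10)
)
```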
## Progress Display

Set `show_progress=False` to disable the progress bar (it is shown by default).

## Async Evaluations
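A sketch assuming the runner detects and awaits `async` task functions; confirm this against your SDK before relying on it:

```python
import asyncio

async def task(input):
    # Simulate a non-blocking model call.
    await asyncio.sleep(0.1)
    return f"Answer to: {input}"

result = client.evals.run(
    name="async-eval",
    dataset=dataset,
    task=task,  # assumption: async callables are awaited by the runner
    scorers=[ExactMatch()],
)
```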
## Error Handling
Task errors are captured per-row:
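A sketch of a failing task: a row that raises does not abort the run, and the exception lands in `row.error` (the `result.rows` shape is assumed, as above):

```python
def flaky_task(input):
    if "bad" in input:
        raise ValueError("unsupported input")
    return input.upper()

result = client.evals.run(
    name="error-handling-demo",
    dataset=[
        {"input": "good case", "expected": "GOOD CASE"},
        {"input": "bad case", "expected": "BAD CASE"},
    ],
    task=flaky_task,
    scorers=[ExactMatch()],
)

for row in result.rows:
    if row.error:
        print(f"{row.input!r} failed: {row.error}")
```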
## Best Practices

- Start Small: Test with a small dataset first.
- Multiple Scorers: Use multiple scorers for comprehensive evaluation.
- Limit Concurrency: Avoid rate limits with `max_concurrent`.
- Review Failures: Check `row.error` for debugging.
- Save Results: Store results for historical comparison.