
Evaluations

Define what "success" looks like for your AI application. Create specific, measurable criteria that guide optimization and measure performance objectively.

What Are Evaluations?

Evaluations are specific, measurable criteria that define good performance for your AI application. Instead of subjective judgment, evaluations provide objective benchmarks for optimization.

How Evaluations Work

When your AI application processes an input:

1. Output Generated: Your AI creates a response
2. Evaluations Applied: Each active evaluation scores the output
3. Individual Scores Calculated: Each evaluation receives a score from 0 to 10
4. Overall Score Computed: The average of all evaluation scores (see the sketch after this list)
5. Results Logged: Scores and reasoning are saved to the Event Log
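To make steps 3-5 concrete, here is a minimal sketch of combining individual evaluation scores into an overall score. The data shape, variable names, and the simple-average assumption are illustrative only, not Empromptu's internal implementation.

```python
# Minimal sketch: combining per-evaluation scores into an overall score.
# The data shape and the simple-average assumption are illustrative only.

evaluation_scores = {
    "Extracted Complete Bug Set": 8.0,  # each active evaluation is scored 0-10
    "No Hallucination": 6.0,
    "Proper Structure": 9.0,
}

# Assumed here: the overall score is the plain average of all active evaluations.
overall_score = sum(evaluation_scores.values()) / len(evaluation_scores)

print(f"Overall score: {overall_score:.1f} / 10")  # Overall score: 7.7 / 10
```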

Creating Evaluations

You have two options for creating evaluations: let the system generate them automatically or create them manually for specific requirements.

Automatic Generation

Best for: Getting started quickly with proven evaluation criteria

How it works:

1. Access Actions → Evaluations
2. Select "Generate Automatically"
3. Empromptu analyzes your task and creates relevant evaluations
4. Review and activate the generated criteria

Benefits:

  • Proven criteria based on similar use cases
  • Good starting point for further customization
  • Saves time on initial setup

✍️ Manual Creation

Best for: Specific requirements or fine-tuned control over success criteria

How it works:

1. Access Actions → Evaluations
2. Select "Create Manual"
3. Write your evaluation name and criteria (see the sketch below)
4. Test and activate the evaluation

Benefits:

  • Complete control over criteria
  • Task-specific requirements
  • Custom business logic
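To make the name-plus-criteria idea concrete, here is a minimal sketch of how a manually created evaluation might be represented. The Evaluation dataclass and its fields are hypothetical and only mirror the concepts on this page; they are not the Empromptu API.

```python
from dataclasses import dataclass

# Hypothetical representation of an evaluation, for illustration only.
# The field names mirror the concepts on this page; this is not the Empromptu API.
@dataclass
class Evaluation:
    name: str            # short, descriptive label
    criteria: str        # specific, measurable statement of success
    active: bool = True  # only active evaluations contribute to scoring

solution_check = Evaluation(
    name="Solution Provided",
    criteria="Response includes at least one actionable step to resolve the problem",
)
print(solution_check.name, "-", solution_check.criteria)
```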

Writing Effective Evaluation Criteria

Be Specific and Measurable

❌ Poor: "Output should be good"
✅ Good: "Summary should include all product features mentioned in the review"

Focus on Observable Outcomes

❌ Poor: "Response should be helpful"
✅ Good: "Response should provide at least 2 actionable solutions to the customer's problem"

Use Clear, Objective Language

❌ Poor: "Information should be presented nicely"
✅ Good: "Information appears in a logical sequence that reflects the structure of the input"

Common Evaluation Categories

📊 Accuracy-Focused

Ensure factual correctness and completeness

"Extracted Complete Bug Set": "All bugs mentioned also appear in the output"
"Accurate Details": "All extracted details were present and correct in the original text"
"No Hallucination": "Output contains no information not found in the input"

📝 Format-Focused

Ensure consistent structure and presentation

"Correct Sequence": "Information appears in logical order"
"Proper Structure": "Output follows the specified template format"
"Length Requirements": "Response length falls within specified range"

Quality-Focused

Measure overall usefulness and appropriateness

"Addresses Question": "Response directly answers what was asked"
"Professional Tone": "Language is appropriate for business communication"
"Actionable Content": "Provides specific steps user can take"

Use Case Examples

📄 Data Extraction Applications

"Complete Extraction": "All contact information present in the document appears in the structured output"
"Accurate Formatting": "Phone numbers follow (XXX) XXX-XXXX format"
"No Duplication": "Each piece of information appears only once in the output"

🎧 Customer Support Applications

"Question Recognition": "Response demonstrates understanding of the customer's specific issue"
"Solution Provided": "Response includes at least one actionable step to resolve the problem"
"Appropriate Escalation": "Complex technical issues are escalated to human agents"

✍️ Content Generation Applications

"Brand Voice": "Content matches the company's established tone and style"
"Factual Accuracy": "All claims in the content can be verified from provided sources"
"Target Length": "Content falls within specified word count requirements"

Managing Your Evaluations

Active vs Inactive Evaluations

🟢 Active Evaluations

  • Used in optimization scoring
  • Contribute to overall accuracy metrics
  • Guide automatic optimization decisions

⚫ Inactive Evaluations

  • Don't affect current scoring
  • Can be reactivated when needed
  • Useful for testing different criteria

Evaluation Actions

For each evaluation, you can:

Activate/Deactivate: Toggle whether it's used in scoring
Modify: Edit criteria and descriptions
Delete: Remove evaluations you no longer need
Duplicate: Create variations of existing criteria

How Evaluations Impact Optimization

Evaluations guide both automatic and manual optimization.

Automatic Optimization

During automatic optimization, the system:

  • Focuses on improving the lowest-scoring evaluations (see the sketch after this list)
  • Creates Prompt Family variations to handle different criteria
  • Prioritizes changes that improve overall evaluation performance
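To make "lowest-scoring first" concrete, the sketch below ranks evaluations by their average score across recent runs. The data shape and the averaging choice are assumptions for illustration, not Empromptu's optimization logic.

```python
# Minimal sketch of "lowest-scoring first" prioritization.
# Averaging over recent runs is an assumption for illustration only.

recent_scores = {
    "No Hallucination": [4.0, 5.0, 3.0],
    "Proper Structure": [9.0, 8.0, 9.0],
    "Addresses Question": [7.0, 6.0, 8.0],
}

averages = {name: sum(scores) / len(scores) for name, scores in recent_scores.items()}

# Work on the weakest evaluations first.
priority_order = sorted(averages, key=averages.get)
print(priority_order)  # ['No Hallucination', 'Addresses Question', 'Proper Structure']
```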

Manual Optimization

Evaluations provide clear direction for manual improvements:

  • Shows which specific evaluations need attention
  • Helps you target optimization efforts effectively
  • Provides clear metrics for measuring improvement

Individual Evaluation Scores

Each evaluation follows the same 0-10 scale (see the sketch after this list):

  • 0-3: Evaluation criteria not met
  • 4-6: Partially meets criteria
  • 7-8: Meets criteria well
  • 9-10: Exceeds criteria expectations
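For reference, here is a small sketch that maps a numeric score to the bands above; how fractional scores near band boundaries are treated is an assumption for illustration.

```python
# Hypothetical helper mapping a 0-10 evaluation score to the bands above.
# Treatment of fractional scores near band boundaries is an assumption.
def score_band(score: float) -> str:
    if not 0 <= score <= 10:
        raise ValueError("evaluation scores range from 0 to 10")
    if score <= 3:
        return "Evaluation criteria not met"
    if score <= 6:
        return "Partially meets criteria"
    if score <= 8:
        return "Meets criteria well"
    return "Exceeds criteria expectations"

print(score_band(7.5))  # Meets criteria well
```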

Best Practices

Getting Started

  • Begin with 3-5 core evaluations
  • Test with sample inputs
  • Run initial optimization
  • Add specific criteria gradually

Balance & Focus

  • Cover accuracy, format, and quality
  • Avoid redundant evaluations
  • Focus on user impact
  • Keep the scope manageable

Ongoing Management

  • Monitor performance in the Event Log
  • Revise low-scoring criteria
  • Add evaluations for edge cases
  • Remove evaluations that no longer add value

Testing & Validation

  • Use diverse test inputs
  • Check edge-case reliability
  • Verify scores match expectations
  • Get team feedback