CouncilAsAJudge: 3-Step Quickstart Guide¶
The CouncilAsAJudge architecture enables multi-dimensional evaluation of task responses using specialized judge agents. Each judge evaluates a different dimension (accuracy, helpfulness, harmlessness, coherence, conciseness, instruction adherence), and an aggregator synthesizes their findings into a comprehensive report.
Overview¶
| Feature | Description |
|---|---|
| Multi-Dimensional Evaluation | 6 specialized judges evaluate different quality dimensions |
| Parallel Execution | All judges evaluate concurrently for maximum efficiency |
| Comprehensive Reports | Aggregator synthesizes detailed technical analysis |
| Specialized Judges | Each judge focuses on a specific evaluation criterion |
Task Response
│
▼
┌────────────────────────────────────────┐
│ Parallel Judge Evaluation │
├────────────────────────────────────────┤
│ Accuracy Judge → Analysis │
│ Helpfulness Judge → Analysis │
│ Harmlessness Judge → Analysis │
│ Coherence Judge → Analysis │
│ Conciseness Judge → Analysis │
│ Adherence Judge → Analysis │
└────────────────────────────────────────┘
│
▼
Aggregator Agent
│
▼
Comprehensive Evaluation Report
Step 1: Install and Import¶
Ensure you have Swarms installed and import the CouncilAsAJudge class:
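# Install the Swarms package (run in your shell)
# pip install -U swarms

from swarms import CouncilAsAJudge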
Step 2: Create the Council¶
Initialize the CouncilAsAJudge evaluation system:
# Create the council judge
council = CouncilAsAJudge(
    name="Quality-Evaluation-Council",
    description="Evaluates response quality across multiple dimensions",
    model_name="gpt-4o-mini",  # Model for judge agents
    aggregation_model_name="gpt-4o-mini",  # Model for aggregator
)
Step 3: Evaluate a Response¶
Run the evaluation on a task with a response:
# Task with response to evaluate
task_with_response = """
Task: Explain the concept of machine learning to a beginner.
Response: Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. It works by analyzing large amounts of data to identify patterns and make predictions or decisions. There are three main types: supervised learning (using labeled data), unsupervised learning (finding hidden patterns), and reinforcement learning (learning through trial and error). Machine learning is used in various applications like recommendation systems, image recognition, and natural language processing.
"""
# Run the evaluation
result = council.run(task=task_with_response)
# Print the comprehensive evaluation
print(result)
Complete Example¶
Here's a complete working example:
from swarms import CouncilAsAJudge
# Step 1: Initialize the council
council = CouncilAsAJudge(
    name="Technical-Writing-Evaluator",
    description="Evaluates technical documentation quality",
    model_name="gpt-4o-mini",
)
# Step 2: Prepare the task and response to evaluate
task_with_response = """
Task: Write documentation for a REST API endpoint that creates a new user.
Response: POST /api/users - Creates a new user account. Send a JSON body with 'email', 'password', and 'name' fields. Returns 201 on success with user object, 400 for validation errors, 409 if email exists. Requires no authentication. Example: {"email": "user@example.com", "password": "secure123", "name": "John Doe"}
"""
# Step 3: Run the evaluation
evaluation = council.run(task=task_with_response)
# Display results
print("=" * 60)
print("EVALUATION REPORT:")
print("=" * 60)
print(evaluation)
Evaluation Dimensions¶
The council evaluates responses across these six dimensions:
| Dimension | Focus Area |
|---|---|
| Accuracy | Factual correctness, evidence-based claims, source credibility |
| Helpfulness | Practical value, solution feasibility, completeness |
| Harmlessness | Safety, ethics, bias detection, appropriate disclaimers |
| Coherence | Logical flow, structure, clear transitions, argument quality |
| Conciseness | Communication efficiency, no redundancy, focused content |
| Instruction Adherence | Compliance with requirements, format specifications, scope |
Each judge provides:

- Detailed analysis of their dimension
- Specific examples from the response
- Impact assessment
- Concrete improvement suggestions
Evaluating Multiple Responses¶
Compare different responses to the same task:
from swarms import CouncilAsAJudge
council = CouncilAsAJudge(model_name="gpt-4o-mini")
# Response A
response_a = """
Task: Explain recursion in programming.
Response: Recursion is when a function calls itself to solve smaller versions of the same problem until reaching a base case that stops the recursion.
"""
# Response B
response_b = """
Task: Explain recursion in programming.
Response: Recursion is a programming technique where a function calls itself to solve a problem by breaking it down into smaller, similar subproblems. Each recursive call works on a simpler version of the problem until reaching a base case - a condition that stops the recursion. For example, calculating factorial(5) recursively would call factorial(4), which calls factorial(3), and so on until factorial(1) returns 1. The results then combine back up the chain. While elegant, recursion uses more memory than iteration due to the call stack, so it's best for naturally recursive problems like tree traversal or divide-and-conquer algorithms.
"""
# Evaluate both
eval_a = council.run(task=response_a)
eval_b = council.run(task=response_b)
print("Response A Evaluation:")
print(eval_a)
print("\n" + "="*60 + "\n")
print("Response B Evaluation:")
print(eval_b)
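To compare several candidates at once, a simple loop keeps the code compact; this sketch reuses the council and response strings defined above:

responses = {"A": response_a, "B": response_b}

for label, text in responses.items():
    print(f"Response {label} Evaluation:")
    print(council.run(task=text))
    print("\n" + "=" * 60 + "\n")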
Configuration Options¶
| Parameter | Default | Description |
|---|---|---|
| `name` | `"CouncilAsAJudge"` | Display name of the council |
| `model_name` | `"gpt-4o-mini"` | Model for judge agents |
| `aggregation_model_name` | `"gpt-4o-mini"` | Model for aggregator agent |
| `judge_agent_model_name` | `None` | Override model for specific judges |
| `output_type` | `"final"` | Output format type |
| `max_loops` | `1` | Maximum loops for agents |
| `random_model_name` | `True` | Use random models for diversity |
| `cache_size` | `128` | LRU cache size for prompts |
Advanced Configuration¶
# Use more powerful model for aggregation
council = CouncilAsAJudge(
    model_name="gpt-4o-mini",  # For individual judges
    aggregation_model_name="gpt-4o",  # More powerful for synthesis
    cache_size=256,  # Larger cache for better performance
)
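The other constructor options from the table can be combined the same way. A sketch (parameter names come from the configuration table above; accepted values may differ between Swarms versions):

# Pin every judge to one model and disable random model selection
council = CouncilAsAJudge(
    name="Deterministic-Council",
    judge_agent_model_name="gpt-4o-mini",  # override model for the judges
    random_model_name=False,               # deterministic model choice
    max_loops=1,                           # single evaluation pass
    output_type="final",                   # return the final report only
)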
Use Cases¶
| Domain | Evaluation Purpose |
|---|---|
| Content Review | Evaluate blog posts, articles, or documentation quality |
| LLM Output Evaluation | Assess AI-generated content across dimensions |
| Code Review | Evaluate code explanations and technical documentation |
| Customer Support | Assess quality of support responses |
| Educational Content | Evaluate clarity and accuracy of learning materials |
Example: Evaluating AI Responses¶
from swarms import CouncilAsAJudge
council = CouncilAsAJudge(name="AI-Response-Evaluator")
# AI-generated response to evaluate
ai_response = """
Task: Explain the benefits of cloud computing for small businesses.
Response: [Your AI-generated content here]
"""
evaluation = council.run(task=ai_response)
# Use evaluation to:
# 1. Identify weaknesses in AI output
# 2. Guide prompt refinement
# 3. Ensure quality before publishing
# 4. Compare different AI models
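For point 4 above, the council can sit behind any generation step. The generate_response function below is a hypothetical stand-in; swap in a real call to whichever model you are testing:

# Hypothetical generator; replace with a real model call
def generate_response(model: str, task: str) -> str:
    return f"(response from {model} for: {task})"

task = "Explain the benefits of cloud computing for small businesses."
for model in ["gpt-4o-mini", "gpt-4o"]:
    candidate = generate_response(model, task)
    report = council.run(task=f"Task: {task}\nResponse: {candidate}")
    print(f"--- {model} ---\n{report}")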
How It Works¶
- Task Submission: Submit a task containing the response to evaluate
- Parallel Evaluation: Six judge agents evaluate concurrently:
    - Each judge focuses on their specialized dimension
    - Judges provide detailed, technical feedback
    - Specific examples and improvement suggestions included
- Aggregation: Aggregator agent synthesizes all evaluations into:
    - Executive summary of key strengths/weaknesses
    - Cross-dimensional patterns and correlations
    - Prioritized improvement recommendations
    - Comprehensive technical report
- Result: Receive detailed evaluation report with actionable insights
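Conceptually this is a fan-out/synthesize pattern. The sketch below is not the Swarms implementation, only a minimal illustration of the same flow with placeholder judge and aggregator functions:

from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = [
    "accuracy", "helpfulness", "harmlessness",
    "coherence", "conciseness", "instruction adherence",
]

def judge(dimension: str, task: str) -> str:
    # Placeholder for an LLM judge focused on one dimension
    return f"[{dimension}] analysis of the response"

def aggregate(analyses: list) -> str:
    # Placeholder for the aggregator agent that writes the report
    return "\n\n".join(analyses)

def evaluate(task: str) -> str:
    # Fan out: all six judges run concurrently
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        analyses = list(pool.map(lambda d: judge(d, task), DIMENSIONS))
    # Synthesize: combine per-dimension analyses into one report
    return aggregate(analyses)

print(evaluate("Task: ... Response: ..."))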
Best Practices¶
- Clear Task Formatting: Clearly separate the task and response in your input (see the helper sketch after this list)
- Sufficient Context: Provide enough context for judges to evaluate properly
- Appropriate Models: Use more powerful models for aggregation when quality is critical
- Iterative Improvement: Use evaluation feedback to refine responses iteratively
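A small hypothetical helper for the first practice (format_for_council is not part of Swarms, just a convenience for building the combined input string):

def format_for_council(task: str, response: str) -> str:
    # Clearly separate the task and its candidate response in the
    # single string that council.run() expects
    return f"Task: {task.strip()}\nResponse: {response.strip()}"

task_with_response = format_for_council(
    "Explain recursion in programming.",
    "Recursion is when a function calls itself...",
)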
Evaluation Output¶
The council returns a comprehensive report including:
EXECUTIVE SUMMARY
- Key strengths identified
- Critical issues requiring attention
- Overall assessment
DETAILED ANALYSIS
- Cross-dimensional patterns
- Specific examples and implications
- Technical impact assessment
RECOMMENDATIONS
- Prioritized improvement areas
- Specific technical suggestions
- Implementation considerations
Related Architectures¶
- MajorityVoting - Multiple agents vote on best response
- LLM Council - Council members rank each other's responses
- DebateWithJudge - Two agents debate with judge synthesis
Next Steps¶
- Explore CouncilAsAJudge Tutorial for advanced examples
- See GitHub Examples
- Learn about Evaluation Frameworks