The Challenge

When you ship an LLM-powered feature, how do you know it's working? Traditional testing doesn't apply—there's no single "correct" answer to compare against. The model might produce grammatically perfect responses that are completely wrong, or technically accurate answers that miss the user's intent.

At ALDAR, we have multiple AI products in production: HelloAldar (a GPT-based conversational assistant for property queries) and Harbour (an AI investment planning tool). Both use RAG pipelines to ground responses in real data. Both can fail in subtle ways that users won't immediately notice.

DeepEval Integration

DeepEval is an open-source LLM evaluation framework that provides metrics specifically designed for RAG and conversational AI. We integrated it into our Azure DevOps pipelines to run automatically on every deployment.

Metrics We Track

  • Faithfulness: Does the response only contain information that can be attributed to the retrieved context? Catches hallucinations and fabricated details.
  • Answer Relevancy: Does the response actually address what the user asked? Catches responses that are accurate but off-topic.
  • Contextual Relevancy: Did the retrieval step pull the right documents? Catches upstream failures in the RAG pipeline.
  • Hallucination Score: Direct measurement of fabricated information using LLM-as-judge with chain-of-thought reasoning.
  • Bias: Checks for problematic patterns in responses—important for a property company operating in a diverse market.

Pipeline Architecture

# azure-pipelines.yml (simplified)
stages:
  - stage: Deploy
    jobs:
      - deployment: DeployRAG
        # ... deployment steps

  - stage: LLMEvaluation
    dependsOn: Deploy
    jobs:
      - job: RunDeepEval
        steps:
          - script: |
              python -m deepeval test run \
                --test-file tests/llm_eval.py \
                --output results.json
            displayName: 'Run DeepEval Tests'
          
          - task: PublishTestResults@2
            inputs:
              testResultsFiles: 'results.json'
              testRunTitle: 'LLM Evaluation'

Test Case Design

We maintain a curated set of ~200 test cases across both products, organized by:

  • Golden answers: Questions where we know the exact expected response (property prices, availability, policy details).
  • Behavioral tests: Questions designed to trigger specific failure modes (adversarial prompts, out-of-scope queries, ambiguous requests).
  • Regression tests: Real user queries that previously produced bad responses, captured from production logs.

Example Test Case

from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_property_availability():
    test_case = LLMTestCase(
        input="Is unit 1204 in Marina Heights available?",
        actual_output=rag_pipeline.query("Is unit 1204 in Marina Heights available?"),
        retrieval_context=rag_pipeline.get_context("unit 1204 Marina Heights"),
        expected_output="Unit 1204 in Marina Heights is currently available..."
    )
    
    faithfulness = FaithfulnessMetric(threshold=0.8)
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    
    assert_test(test_case, [faithfulness, relevancy])

Prompt Regression Testing

One non-obvious use case: catching prompt regressions. When someone updates a system prompt or tweaks retrieval parameters, the pipeline automatically runs the full test suite. We've caught several cases where "minor" prompt changes caused significant quality degradation that wouldn't have been noticed until users complained.

Cost Management

Running LLM evaluation at scale gets expensive. Each test case requires multiple LLM calls (the actual RAG query, plus evaluator calls for each metric). We manage this by:

  • Running full evaluation only on staging/production deploys, not every PR
  • Using a smaller, faster evaluator model (GPT-3.5-turbo) for initial passes
  • Escalating to GPT-4 only for borderline cases
  • Caching retrieval results to avoid redundant embedding calls

Results

Since implementing this pipeline six months ago:

  • Caught 12 regressions before they reached production
  • Reduced average hallucination rate from 8% to under 2%
  • Built confidence to ship prompt changes more frequently
  • Created a feedback loop from production issues back to test cases

What's Next

We're working on continuous evaluation in production—sampling real user queries and running them through the evaluation pipeline with human review for edge cases. The goal is to detect drift before users report it.