How to Evaluate a Self-Evaluating RAG System using DeepEval / RAGAS with a Golden Dataset?


I've built a Self-Evaluating RAG System using LangChain, ChromaDB, BM25 hybrid retrieval, query rewriting, and cross-encoder reranking, with LLaMA 3.3 70B via Groq as the LLM.

I'm trying to evaluate it using DeepEval with a golden dataset (QA pairs generated from my documents). Here's my current setup:

Stack:

RAG: LangChain + ChromaDB + BM25 + CrossEncoder reranker

LLM: Groq (LLaMA 3.3 70B)

Evaluation framework: DeepEval v3.9.7

Custom LLM for evaluation: Gemini 1.5 Flash (via Google GenAI SDK)

Golden dataset: Generated using DeepEval Synthesizer from 25 Wikipedia .txt documents

Metrics: Answer Relevancy, Faithfulness, Contextual Precision, Contextual Recall
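
For context on the retrieval side: the hybrid retriever merges BM25 and dense (ChromaDB) rankings before the cross-encoder rerank. A minimal sketch of the fusion idea, using reciprocal rank fusion as a stand-in for whatever weighting the actual LangChain ensemble applies:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked doc-id lists (e.g. one from BM25, one from dense
    retrieval) into a single ranking. Each doc's fused score is the sum of
    1 / (k + rank) over every list it appears in; k=60 is a common default."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; these top docs then go to the cross-encoder.
    return sorted(scores, key=scores.get, reverse=True)
```

A doc ranked by both retrievers (like "b" in `reciprocal_rank_fusion([["a", "b"], ["b", "c"]])`) beats a doc that only one retriever found, which is the point of hybrid retrieval.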

What I've done so far:

Generated a golden dataset using DeepEval's Synthesizer with a custom HuggingFace embedder and Gemini as critic model

Built test cases by running each golden question through my actual RAG pipeline to get real actual_output and retrieval_context

Running evaluation using DeepEval's evaluate() function with a custom Gemini model
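
The test-case assembly step looks roughly like the sketch below. To keep it self-contained, `rag_pipeline` is a stand-in for my actual LangChain chain, and `GoldenPair` / `RAGTestCase` are simplified placeholders for DeepEval's Golden and LLMTestCase types:

```python
from dataclasses import dataclass

@dataclass
class GoldenPair:
    """Stand-in for a DeepEval Golden from the synthesized dataset."""
    input: str            # question generated by the Synthesizer
    expected_output: str  # reference answer from the Synthesizer

@dataclass
class RAGTestCase:
    """Stand-in for DeepEval's LLMTestCase fields."""
    input: str
    actual_output: str            # answer produced by the RAG pipeline
    expected_output: str
    retrieval_context: list[str]  # chunks left after reranking

def rag_pipeline(question: str) -> tuple[str, list[str]]:
    """Placeholder for the real hybrid-retrieval + LLaMA 3.3 chain."""
    context = [f"retrieved chunk for: {question}"]
    answer = f"answer to: {question}"
    return answer, context

def build_test_cases(goldens: list[GoldenPair]) -> list[RAGTestCase]:
    """Run every golden question through the REAL pipeline so that
    actual_output and retrieval_context come from production behavior."""
    cases = []
    for g in goldens:
        answer, context = rag_pipeline(g.input)
        cases.append(RAGTestCase(
            input=g.input,
            actual_output=answer,
            expected_output=g.expected_output,
            retrieval_context=context,
        ))
    return cases
```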

Problems I'm facing:

DeepEval's evaluate() times out after ~30 minutes when running 8 test cases in parallel with 4 metrics

Getting occasional 500 Internal Server errors from the Gemini API mid-evaluation, which fails the whole run

Not sure if running evaluation one-by-one using metric.measure() instead of evaluate() is the right approach
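
As a workaround I've been experimenting with running each (metric, test case) pair one at a time, with exponential backoff around each judge call, instead of one big parallel evaluate(). A generic sketch of that pattern — `measure_fn` is a stand-in for a bound metric.measure() call that can raise on a Gemini 500:

```python
import time

def measure_with_retry(measure_fn, test_case, max_attempts=4, base_delay=1.0):
    """Call one metric on one test case, retrying transient failures with
    exponential backoff (1s, 2s, 4s, ...) instead of killing the run."""
    for attempt in range(max_attempts):
        try:
            return measure_fn(test_case)
        except RuntimeError:  # stand-in for the Gemini 500 error class
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the real error
            time.sleep(base_delay * (2 ** attempt))

def run_sequential(metrics, test_cases):
    """Evaluate every (metric name, test case) pair one by one, so a single
    slow or failing judge call can't time out the whole batch."""
    results = {}
    for name, measure_fn in metrics.items():
        for case in test_cases:
            results[(name, case)] = measure_with_retry(measure_fn, case)
    return results
```

This trades wall-clock time for robustness: no parallelism, but partial results survive and each retry only repeats one judge call.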

Questions:

What's the recommended way to run DeepEval evaluation without hitting timeouts? Should I call metric.measure() one test case at a time, or is there a way to configure the timeout in evaluate()?

Is there a better open-source/free LLM choice for the evaluation judge model that's more stable than Gemini for DeepEval metrics?

Has anyone successfully used RAGAS instead of DeepEval for a similar setup? Would it be easier to integrate?

Any tips on generating better quality golden datasets without using OpenAI (since I don't have an OpenAI key)?

Any help would be appreciated. Thanks!
