I've built a self-evaluating RAG system using LangChain, ChromaDB, BM25 hybrid retrieval, query rewriting, and cross-encoder reranking, with LLaMA 3.3 70B (via Groq) as the generator LLM.
I'm trying to evaluate it with DeepEval against a golden dataset (QA pairs generated from my own documents). Here's my current setup:
Stack:
RAG: LangChain + ChromaDB + BM25 + CrossEncoder reranker
LLM: Groq (LLaMA 3.3 70B)
Evaluation framework: DeepEval v3.9.7
Evaluation judge model: Gemini 1.5 Flash (via the Google GenAI SDK, wrapped as a DeepEval custom LLM)
Golden dataset: Generated using DeepEval Synthesizer from 25 Wikipedia .txt documents
Metrics: Answer Relevancy, Faithfulness, Contextual Precision, Contextual Recall
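For context, here's a trimmed sketch of the retrieval side of the pipeline (the query-rewriting step is omitted, docs is loaded elsewhere, and exact module paths and model names may differ slightly depending on your LangChain version):

```python
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_groq import ChatGroq

# Dense retriever backed by ChromaDB
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(persist_directory="chroma_db", embedding_function=embeddings)
dense = vectorstore.as_retriever(search_kwargs={"k": 10})

# Sparse BM25 retriever over the same documents (docs loaded elsewhere)
bm25 = BM25Retriever.from_documents(docs)
bm25.k = 10

# Hybrid retrieval: weighted fusion of dense and sparse results
hybrid = EnsembleRetriever(retrievers=[dense, bm25], weights=[0.5, 0.5])

# Cross-encoder reranking on top of the hybrid retriever
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"),
    top_n=5,
)
retriever = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=hybrid)

# Generator LLM via Groq
llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)
```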
What I've done so far:
Generated a golden dataset with DeepEval's Synthesizer, using a custom HuggingFace embedder and Gemini as the critic model (rough sketch after this list)
Built test cases by running each golden question through my actual RAG pipeline to capture the real actual_output and retrieval_context
Ran the evaluation with DeepEval's evaluate() function and the custom Gemini judge model (also sketched below)
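The golden-generation step looks roughly like this. Synthesizer keyword arguments have shifted between DeepEval releases, so treat the parameter names as approximate for my version; judge is the custom Gemini wrapper shown in the next snippet, and the HuggingFace embedder is a separate custom embedding wrapper plugged in the same way:

```python
from deepeval.synthesizer import Synthesizer

# Gemini as critic model; the custom HuggingFace embedder (a DeepEval
# custom-embedding wrapper, defined elsewhere) is passed in as well
synthesizer = Synthesizer(model=judge)

# 25 Wikipedia .txt files -> golden QA pairs with expected outputs
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=wikipedia_txt_paths,  # list of the 25 .txt file paths
    include_expected_output=True,
)
```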
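And the evaluation side, trimmed down. The Gemini wrapper follows the custom-LLM pattern from the DeepEval docs as I understand it, and run_rag_pipeline is my own helper that returns the generated answer plus the retrieved chunks:

```python
import google.generativeai as genai
from deepeval import evaluate
from deepeval.models import DeepEvalBaseLLM
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
)
from deepeval.test_case import LLMTestCase

class GeminiJudge(DeepEvalBaseLLM):
    """Gemini 1.5 Flash wrapped as a DeepEval custom judge model."""

    def __init__(self):
        genai.configure(api_key="...")  # real key comes from an env var
        self.model = genai.GenerativeModel("gemini-1.5-flash")

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        return self.model.generate_content(prompt).text

    async def a_generate(self, prompt: str) -> str:
        # No true async call here; DeepEval still invokes this in async mode
        return self.generate(prompt)

    def get_model_name(self):
        return "gemini-1.5-flash"

judge = GeminiJudge()

# One test case per golden: the question goes through my real RAG pipeline
test_cases = []
for golden in goldens:
    answer, contexts = run_rag_pipeline(golden.input)  # my helper: (str, list[str])
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=answer,
            expected_output=golden.expected_output,
            retrieval_context=contexts,
        )
    )

metrics = [
    AnswerRelevancyMetric(model=judge),
    FaithfulnessMetric(model=judge),
    ContextualPrecisionMetric(model=judge),
    ContextualRecallMetric(model=judge),
]

# This is the call that stalls: 8 test cases x 4 metrics, times out after ~30 min
evaluate(test_cases=test_cases, metrics=metrics)
```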
Problems I'm facing:
DeepEval's evaluate() times out after roughly 30 minutes when running 8 test cases in parallel across 4 metrics
I get occasional 500 Internal Server errors from the Gemini API during evaluation
I'm not sure whether running the metrics one by one with metric.measure() instead of evaluate() is the right approach (see the sketch below)
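For the last point, this is the serial fallback I'm considering instead of a single evaluate() call (just a sketch, reusing test_cases and metrics from above):

```python
# Score each test case against each metric one at a time, so a single slow call
# or intermittent Gemini 500 doesn't take down the whole batch
results = []
for tc in test_cases:
    row = {"input": tc.input}
    for metric in metrics:
        try:
            metric.measure(tc)
            row[type(metric).__name__] = metric.score
        except Exception as err:  # e.g. Gemini 500 Internal errors
            row[type(metric).__name__] = None
            print(f"{type(metric).__name__} failed on {tc.input[:40]!r}: {err}")
    results.append(row)
```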
Questions:
What's the recommended way to run a DeepEval evaluation without hitting timeouts? Should I call metric.measure() one by one, or is there a way to configure the timeout in evaluate()?
Is there a more stable open-source/free LLM than Gemini to use as the judge model for DeepEval metrics?
Has anyone successfully used RAGAS instead of DeepEval for a similar setup? Would it be easier to integrate?
Any tips on generating better quality golden datasets without using OpenAI (since I don't have an OpenAI key)?
Any help would be appreciated. Thanks!
