LLM as a Judge: Evaluating LLM Outputs and the Challenge of Hallucinations

Large Language Models (LLMs) like ChatGPT, Bard, and GPT-4 have become essential tools in research, writing, and decision-making. They generate text that is often coherent, persuasive, and seemingly factual. However, as reliance on these models grows, a critical question arises:

Can we trust LLMs to evaluate themselves or each other accurately?

LLMs are known to produce "hallucinations"—responses that range from subtly inaccurate to entirely fabricated. Using one LLM to detect another’s errors is akin to relying on a faulty compass for navigation.

This article explores the opportunities and challenges of using LLMs as evaluators, drawing on recent research to outline best practices and future directions.

LLMs as Judges: Promise and Pitfalls

The Role of LLMs in Self-Evaluation

LLMs are increasingly used to assess the quality of text generated by other LLMs through libraries like DeepEval and RAGAS. This self-referential evaluation is particularly useful when human oversight is impractical due to cost or time constraints. LLMs can analyze coherence, relevance, and factual accuracy, making them valuable for tasks such as automated essay grading, content moderation, and systematic literature reviews.
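The LLM-as-judge pattern these libraries implement can be sketched in a few lines. The sketch below is illustrative only: `fake_llm` is a stand-in for whatever model client you actually use (the JSON-verdict prompt format is an assumption, not DeepEval's or RAGAS's actual API).

```python
import json

def build_judge_prompt(question: str, answer: str) -> str:
    """Compose an evaluation prompt asking a judge model for a structured verdict."""
    return (
        "You are an impartial evaluator. Rate the answer on a 1-5 scale for "
        "coherence, relevance, and factual accuracy, and reply with JSON like "
        '{"score": 4, "reason": "..."}.\n\n'
        f"Question: {question}\nAnswer: {answer}"
    )

def judge_answer(question: str, answer: str, call_llm) -> dict:
    """Send the prompt to any LLM callable and parse its JSON verdict."""
    raw = call_llm(build_judge_prompt(question, answer))
    return json.loads(raw)

# Stub standing in for a real model API; a production setup would pass a
# client wrapped by an evaluation library instead.
def fake_llm(prompt: str) -> str:
    return '{"score": 4, "reason": "Mostly accurate and relevant."}'

verdict = judge_answer("What causes tides?", "The Moon's gravity.", fake_llm)
print(verdict["score"])  # 4 with the stub above
```

The structured JSON verdict is the key design choice: it turns a free-text judgment into something a pipeline can aggregate and audit.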

However, significant challenges arise when LLMs act as judges. Chief among them is hallucination, where models generate text that appears credible but is factually incorrect or unsupported. According to Shi et al. (2025), these errors range from minor misquotes to entirely fabricated studies or references. This issue is especially concerning in fields requiring high accuracy, such as scientific writing and systematic reviews.

Hallucinations in LLMs: A Persistent Problem

Hallucinations are well-documented in LLMs. A study by Chelli et al. (2024) found that models frequently generate references that appear legitimate but are entirely fabricated. Their research revealed that:

  • GPT-3.5 hallucinated 39.6% of its references.
  • Bard hallucinated an alarming 91.4% when conducting systematic reviews in the medical field.
  • GPT-4 performed better, but still had a 28.6% hallucination rate—posing a significant risk in critical applications.

Additionally, the study found that LLMs often fail to follow explicit instructions, such as identifying randomized controlled trials or excluding systematic reviews. While they generate text that appears coherent and relevant, they do not always adhere to precise directives, leading to misleading or incorrect outputs.

Evaluating LLM Outputs: The Need for Robust Metrics

To mitigate the risks posed by hallucinations, robust evaluation metrics for LLMs are essential. Traditional metrics like precision, recall, and F1-score are commonly used to assess LLM performance in tasks such as information retrieval. However, these metrics do not fully capture hallucinations—particularly when the generated text appears plausible but is factually incorrect.
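For reference accuracy specifically, these metrics reduce to set comparisons: precision is the share of generated references that actually exist, so the hallucination rate is simply one minus precision. A minimal sketch (the toy reference IDs are invented for illustration):

```python
def reference_metrics(generated, ground_truth):
    """Precision: share of generated references that are real; recall: share of
    true references the model retrieved. Hallucination rate = 1 - precision."""
    gen, truth = set(generated), set(ground_truth)
    tp = len(gen & truth)  # references that are both generated and real
    precision = tp / len(gen) if gen else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 2 of 4 generated references appear in the review's true list.
p, r, f = reference_metrics(["A", "B", "X", "Y"], ["A", "B", "C", "D", "E"])
print(round(p, 2), round(r, 2))  # 0.5 0.4
```

Note what this misses: a fabricated reference that happens to look plausible scores the same as an obviously fake one, which is exactly the gap the article describes.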

Chelli et al. (2024) used these metrics to evaluate LLM performance in systematic reviews and found:

  • GPT-4 had a precision rate of 13.4%, meaning only 13.4% of its references were actually present in the original systematic reviews.
  • GPT-3.5 had a slightly lower precision rate of 9.4%.
  • Bard failed to retrieve any relevant papers.

These findings highlight the limitations of traditional evaluation methods. However, it is important to note that LLMs have continued to evolve since this study, and their performance may differ today.

The Role of Knowledge Graphs in Mitigating Hallucinations

One promising approach to mitigating hallucinations in LLMs is the integration of Knowledge Graphs (KGs). KGs are structured representations of knowledge that provide factual grounding, helping to ensure the accuracy and relevance of generated text. As discussed by Lavrinovics et al. (2025), KGs can fill gaps in an LLM's understanding of certain topics, thereby reducing the likelihood of hallucinations.

Lavrinovics et al. (2025) propose a categorization of various stages at which KGs can be integrated into LLMs, including pretraining, inference, and post-generation. By incorporating KGs at these stages, LLMs can be conditioned on reliable external knowledge sources, improving factual consistency and reducing the risk of generating incorrect or unsupported information.
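The inference-stage variant of this idea is the easiest to illustrate: retrieve verified triples from the KG and prepend them to the prompt so the model is conditioned on them rather than on its parametric memory. The toy triple store and prompt format below are assumptions for illustration, not a specific system from the paper.

```python
# Tiny illustrative knowledge graph: (subject, predicate, object) triples.
KG = {
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
}

def retrieve_facts(entity: str) -> list:
    """Fetch triples mentioning the entity, rendered as plain sentences."""
    return [f"{s} {p.replace('_', ' ')} {o}"
            for s, p, o in sorted(KG) if entity in (s, o)]

def grounded_prompt(question: str, entity: str) -> str:
    """Inference-stage grounding: prepend verified facts so the model answers
    from the supplied context instead of guessing."""
    facts = "\n".join(retrieve_facts(entity))
    return f"Use only these facts:\n{facts}\n\nQuestion: {question}"

print(grounded_prompt("What is the capital of France?", "France"))
```

A real deployment would replace the in-memory set with a graph store and an entity-linking step, but the conditioning principle is the same.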

When LLMs Evaluate LLMs: Potential Pitfalls

  1. Hallucination Echo Chambers
    If both the content-generating LLM and the evaluating LLM are prone to hallucinations, they may reinforce each other’s errors, creating an echo chamber of misinformation. Without an external ground truth or human oversight, erroneous data can propagate unchecked.
  2. Limited Factual Grounding
    LLMs learn patterns from vast text corpora but lack true "understanding." Novel or complex claims may not be verifiable through pattern matching alone. Lavrinovics et al. (2025) highlight that when LLMs act as fact-checkers, they often rely on surface-level patterns rather than grounded knowledge—especially in specialized domains.
  3. Bias Amplification
    LLMs can inherit biases from their training data and may amplify them when used to evaluate other models. As Chelli et al. (2024) note, biases—such as a preference for American authors or open-access papers—can become more pronounced when a model evaluates content quality.

Best Practices for LLM-as-Judge Scenarios

Despite these challenges, LLMs can still be useful judges if organizations adopt the following strategies:

  1. Multi-Model Consensus
    Use multiple LLMs with different architectures or training datasets to evaluate the same output. Consensus across models increases confidence in the result.
  2. Human-in-the-Loop Oversight
    Retain domain experts to review critical judgments. In high-stakes areas such as healthcare, policy, or academic publishing, human expertise is indispensable.
  3. Clear and Narrow Evaluation Criteria
    Provide explicit, structured criteria for the LLMs to follow (e.g., "Mark only randomized trials"). Narrow, well-defined criteria improve the model's reliability.
  4. Periodic Validation
    Regularly compare LLM judgments against ground-truth data or human evaluations. This feedback loop helps identify drift or emerging biases.
  5. Knowledge Graph Integration
    As proposed by Lavrinovics et al. (2025), integrating knowledge graphs at the pretraining, inference, or post-generation stages can help ground LLM outputs in verifiable facts.
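The first two strategies compose naturally: require agreement among independent judge models, and escalate anything contested to a human. A minimal sketch (the labels and threshold are illustrative assumptions):

```python
from collections import Counter

def consensus_verdict(verdicts, min_agreement=2):
    """Accept a judgment only when enough independent judge models agree;
    otherwise escalate to a human reviewer (human-in-the-loop)."""
    label, votes = Counter(verdicts).most_common(1)[0]
    return label if votes >= min_agreement else "needs_human_review"

# Three hypothetical judges score the same output.
print(consensus_verdict(["pass", "pass", "fail"]))     # pass
print(consensus_verdict(["pass", "fail", "unclear"]))  # needs_human_review
```

Using models with different architectures or training data matters here: consensus among near-identical models can simply reproduce a shared hallucination rather than catch it.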

Looking Ahead

The challenges of hallucination, bias, and limited factual grounding mean that, despite their brilliance, LLMs cannot be trusted to evaluate themselves without careful checks and balances. However, when integrated thoughtfully—through multi-model consensus, knowledge graphs, and rigorous human oversight—LLMs can offer scalable, cost-effective evaluations.

The future of AI evaluation is likely to be hybrid: automated LLM-based screening complemented by specialized verification tools and human experts. Community feedback and transparent reporting will also play a crucial role, enabling practitioners to correct errors and improve these systems over time.

As researchers such as Chelli et al. (2024), Lavrinovics et al. (2025), and Shi et al. (2025) have demonstrated, the key to leveraging LLM judges lies in understanding their limitations. In the world of artificial intelligence, even the judges need judging—a reality we must embrace to responsibly advance AI’s capabilities.

To learn more about how Factored can help you quickly and efficiently build and scale high-caliber machine learning, data science, data engineering, and data analytics teams, contact us at:

sales@factored.ai or call (650) 353-5484.

Factored AI

Center of Excellence: Machine Learning

Team Lead: Kelvin Andre Pacheco

References

  1. Chelli, M., Descamps, J., Lavoué, V., Trojani, C., Azar, M., Deckert, M., Raynier, J. L., Clowez, G., Boileau, P., & Ruetsch-Chelli, C. (2024). Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis. Journal of Medical Internet Research, 26, e53164. https://doi.org/10.2196/53164
  2. Lavrinovics, E., Biswas, R., Bjerva, J., & Hose, K. (2025). Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective. Web Semantics: Science, Services and Agents on the World Wide Web, 85, 100844. https://doi.org/10.1016/j.websem.2024.100844
  3. Shi, P., Wang, Y., & D'Innocente, J. (2025). Understanding Hallucination in Large Language Models: A Comprehensive Survey. Transactions of the Association for Computational Linguistics, 33, 102–118.
