FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models

1Peking University, 2Microsoft Research, 3Westlake University


Abstract

The rapid growth of evaluation methodologies and datasets for large language models (LLMs) has created a pressing need for their cost-effective integration while ensuring reliability, reproducibility, and efficiency. Concerns about data contamination and bias often compromise the trustworthiness of evaluation findings, and the efficiency of evaluation pipelines is frequently overlooked despite the significant computational cost of LLM inference. In response to these challenges, we introduce FreeEval, a modular framework for conducting trustworthy and efficient automatic evaluations of LLMs that also serves as a platform for developing and validating new evaluation methodologies. FreeEval addresses these challenges through: (1) unified abstractions that simplify the integration of diverse evaluation methods, including dynamic evaluations requiring complex LLM interactions; (2) built-in meta-evaluation techniques such as data contamination detection and human evaluation to enhance the fairness of results; and (3) a high-performance infrastructure with distributed computation and caching strategies for efficient, large-scale evaluations. The framework also features an interactive Visualizer for result analysis and interpretation, further supporting the development of new evaluation techniques. We open-source all our code at https://github.com/WisdomShell/FreeEval, and our demonstration video, live demo, and installation guides are available at https://freeeval.zhuohao.me/.