BatchEval: Towards Human-like Text Evaluation

Categories: robustness, prompt engineering
The BatchEval paradigm improves text evaluation with large language models by 10.5% while reducing API cost.
Author: gpt-3.5-turbo-1106

Published: December 31, 2023

Abstract

The paper introduces BatchEval, a new paradigm that evaluates texts batch-wise and iteratively to address the limitations of sample-wise evaluation methods. By scoring a batch of samples together, much as human annotators do, the approach aims to alleviate sensitivity to prompt design, poor resistance to noise, and inferior ensemble performance. Comprehensive experiments show that BatchEval outperforms state-of-the-art methods by 10.5% in Pearson correlation with human judgments while incurring lower API cost.
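To make the batch-wise idea concrete, the sketch below shows one plausible way to run iterative batch-wise scoring with an LLM. The prompt wording, batch size, round count, and the `call_llm` helper are illustrative assumptions, not the paper's exact protocol.

```python
import random
from statistics import mean
from typing import Callable

def batch_eval(samples: list[str],
               call_llm: Callable[[str], str],
               batch_size: int = 8,
               rounds: int = 3,
               seed: int = 0) -> list[float]:
    """Average per-sample scores collected over several batch-wise rounds.

    `call_llm` is a hypothetical helper that sends one prompt and returns the
    model's reply as plain text (e.g. a single chat-completion request).
    """
    rng = random.Random(seed)
    scores = [[] for _ in samples]      # one score list per sample
    order = list(range(len(samples)))
    for _ in range(rounds):
        rng.shuffle(order)              # re-group samples each round
        for start in range(0, len(order), batch_size):
            batch_ids = order[start:start + batch_size]
            prompt = ("Score each text from 1 to 10 for overall quality. "
                      "Reply with one number per line, in the given order.\n\n"
                      + "\n".join(f"[{k + 1}] {samples[i]}"
                                  for k, i in enumerate(batch_ids)))
            reply = call_llm(prompt)    # one request covers the whole batch
            for i, line in zip(batch_ids, reply.splitlines()):
                try:
                    scores[i].append(float(line.strip()))
                except ValueError:
                    continue            # skip lines the model formatted oddly
    return [mean(s) if s else float("nan") for s in scores]
```

Batching lets the model compare candidates against one another within a single request (fewer API calls than one prompt per sample), and averaging over re-shuffled rounds smooths out noise from any single batch composition.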

Introduction

The paper outlines the significance of accurate text evaluation in the context of rapid progress in large language models (LLMs) and highlights the limitations of existing automatic evaluation methods in aligning with human judgments.

Background

The paper provides an overview of existing automatic text evaluation methods, including rule-based, embedding-based, and LLM-based approaches.
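As a rough illustration of the first two families, the toy functions below contrast a rule-based score (n-gram overlap, in the spirit of BLEU/ROUGE) with an embedding-based score (cosine similarity over sentence vectors, in the spirit of BERTScore). Both are simplified stand-ins, not the metrics the paper evaluates.

```python
import math
from collections import Counter

def rule_based_score(candidate: str, reference: str) -> float:
    """Toy rule-based metric: unigram precision against a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # matched unigrams
    return overlap / max(sum(cand.values()), 1)

def embedding_based_score(vec_a: list[float], vec_b: list[float]) -> float:
    """Toy embedding-based metric: cosine similarity of precomputed sentence vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = (math.sqrt(sum(a * a for a in vec_a))
            * math.sqrt(sum(b * b for b in vec_b)))
    return dot / norm if norm else 0.0
```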

Appendix

Date Generated: 2024-01-02
HTML: https://browse.arxiv.org/html/2401.00437v1
Truncated: True
Word Count: 15893