Split and Merge: Aligning Position Biases in LLM-based Evaluators

  • Zongjie LI
  • Chaozheng WANG
  • Pingchuan MA
  • Daoyuan WU*
  • Shuai WANG*
  • Cuiyun GAO
  • Yang LIU

*Corresponding author for this work

Research output: Book Chapters | Papers in Conference Proceedings › Conference paper (refereed) › Research › peer-review

Abstract

Large language models (LLMs) have shown promise as automated evaluators for assessing the quality of answers generated by AI systems. However, LLM-based evaluators exhibit position bias, or inconsistency, when used to evaluate candidate answers in pairwise comparisons, favoring either the first or second answer regardless of content. To address this limitation, we propose PORTIA, an alignment-based system designed to mimic human comparison strategies to calibrate position bias in a lightweight yet effective manner. Specifically, PORTIA splits the answers into multiple segments, taking into account both length and semantics, and merges them back into a single prompt for evaluation by LLMs. Extensive experiments with six LLMs on 11,520 answer pairs demonstrate that PORTIA markedly enhances the consistency rates for all models and forms of comparison tested, achieving an average relative improvement of 47.46%. It also enables PORTIA-enhanced GPT-3.5 to achieve agreement rates with humans comparable to GPT-4 and elevates GPT-4's consistency rate up to 98%. Subsequent human evaluations indicate that the PORTIA-enhanced GPT-3.5 model can even surpass standalone GPT-4 in terms of alignment with human evaluators, highlighting PORTIA's ability to correct position bias, improve LLM consistency, and boost performance while keeping cost efficiency.
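The split-and-merge idea described above can be illustrated with a short sketch. This is not the paper's implementation: PORTIA also aligns segment boundaries semantically, whereas the hypothetical helpers below (`split_answer`, `merge_for_evaluation`) split purely by length at sentence breaks and interleave the resulting segments so that neither candidate answer occupies a fixed position in the evaluation prompt.

```python
import re


def split_answer(answer: str, k: int = 3) -> list[str]:
    """Split an answer into at most k roughly length-balanced segments,
    cutting only at sentence boundaries (a simplification of PORTIA's
    length- and semantics-aware splitting)."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    target = max(1, len(answer) // k)
    segments, current = [], ""
    for s in sentences:
        current += s + " "
        # Close a segment once it reaches the target length,
        # keeping the final segment open for the remainder.
        if len(current) >= target and len(segments) < k - 1:
            segments.append(current.strip())
            current = ""
    if current.strip():
        segments.append(current.strip())
    return segments or [answer]


def merge_for_evaluation(answer_a: str, answer_b: str, k: int = 3) -> str:
    """Interleave corresponding segments of the two candidate answers
    into a single prompt for the LLM evaluator."""
    seg_a = split_answer(answer_a, k)
    seg_b = split_answer(answer_b, k)
    parts = []
    for i, (a, b) in enumerate(zip(seg_a, seg_b), start=1):
        parts.append(f"[Part {i} of Answer A]\n{a}\n[Part {i} of Answer B]\n{b}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    a = "First point. Second point. Third point. Fourth point."
    b = "Alpha. Beta. Gamma. Delta."
    print(merge_for_evaluation(a, b, k=2))
```

Because each part of Answer A is immediately followed by the corresponding part of Answer B, any preference the evaluator has for a fixed position is spread across both candidates rather than consistently favoring one.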
Original language: English
Title of host publication: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Editors: Yaser AL-ONAIZAN, Mohit BANSAL, Yun-Nung CHEN
Publisher: Association for Computational Linguistics
Pages: 11084-11108
Number of pages: 25
ISBN (Electronic): 9798891761643
Publication status: Published - 2024
Externally published: Yes
Event: 2024 Conference on Empirical Methods in Natural Language Processing - Miami, United States
Duration: 12 Nov 2024 - 16 Nov 2024

Conference

Conference: 2024 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2024
Country/Territory: United States
City: Miami
Period: 12/11/24 - 16/11/24

Bibliographical note

Acknowledgements:
We are grateful to the anonymous reviewers for their valuable comments.

Publisher Copyright:
© 2024 Association for Computational Linguistics.

Funding

The HKUST authors are supported in part by an RGC GRF grant under contract 16214723, an RGC CRF grant under contract C6015-23G, a research fund provided by HSBC, and a Webank research fund WEB24EG01. The HITSZ authors are supported in part by the National Natural Science Foundation of China under project No. 62472126, the Natural Science Foundation of Guangdong Province (Project No. 2023A1515011959), the Shenzhen-Hong Kong Jointly Funded Project (Category A, No. SGDX20230116091246007), and Shenzhen Basic Research (General Project No. JCYJ20220531095214031).

