Abstract
Even though data annotation is extremely important for interpretability, research, and development of artificial intelligence solutions, annotating data remains costly. Research efforts such as active learning or few-shot learning alleviate the cost by increasing sample efficiency, yet the problem of annotating data more quickly has received comparatively little attention. Leveraging a predictor has been shown to reduce annotation cost in practice but has not been theoretically considered. We ask the following question: to annotate a binary classification dataset with N samples, can the annotator answer less than N yes/no questions? Framing this question- and-answer (Q&A) game as an optimal encoding problem, we find a positive answer given by the Huffman encoding of the possible labelings. Unfortunately, the algorithm is computationally intractable even for small dataset sizes. As a practical method, we propose to minimize a cost function a few steps ahead, similarly to lookahead minimization in optimal control. This solution is analyzed, compared with the optimal one, and evaluated using several synthetic and real-world datasets. The method allows a significant improvement (23−86%) in the annotation efficiency of real-world datasets.
| Original language | English |
|---|---|
| Pages (from-to) | 14336-14343 |
| Number of pages | 8 |
| Journal | Proceedings of the AAAI Conference on Artificial Intelligence |
| Volume | 39 |
| Issue number | 13 |
| DOIs | |
| Publication status | Published - 11 Apr 2025 |
| Externally published | Yes |
| Event | 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States Duration: 25 Feb 2025 → 4 Mar 2025 |
Bibliographical note
Publisher Copyright:Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Funding
This work was supported by a CIFRE grant from ANRT. It was also partially financed by ANII Uruguay. Centre Borelli is also with Université Paris Cité, SSA and INSERM.