TY - GEN
T1 - Rethinking RobustBench: Is High Synthetic-Test Data Similarity an Implicit Information Advantage Inflating Robustness Scores?
AU - PAN, Chao
AU - TANG, Ke
AU - LI, Qing
AU - YAO, Xin
PY - 2025/11/24
Y1 - 2025/11/24
AB - Standardized benchmarks like RobustBench are crucial for evaluating adversarial robustness. However, the increasing dominance of models trained on massive synthetic datasets (orders of magnitude larger than the original training sets) raises questions about reported performance gains. This work identifies and investigates a potential inflation factor: high feature-level similarity between large-scale synthetic training data and benchmark test sets. We argue this similarity is an inherent characteristic arising from the probabilistic generation process of these large datasets, which naturally produces examples highly similar to test instances in feature space. This creates what we term an “Implicit Information Advantage,” where models effectively train on near-duplicates of test instances. Through comprehensive empirical analysis, we demonstrate that: (1) Synthetic datasets exhibit significantly higher similarity to the test set than the original training data. (2) A direct correlation exists between this similarity and robustness outcomes, with the test images that benefit most having the highest similarity scores. (3) Strikingly, ablation studies show that training on just a small fraction (e.g., 1%) of the most similar synthetic examples can yield robustness comparable to using the full massive dataset. These findings suggest current benchmarks may overestimate true robust generalization due to this similarity artifact. We call for revised evaluation protocols and greater transparency to ensure benchmarks accurately measure true generalization. Code and data can be found at https://github.com/fzjcdt/RethinkingRobustBench.
DO - 10.1109/DSAA65442.2025.11248007
M3 - Conference paper (refereed)
SN - 9798331511807
T3 - Proceedings of the International Conference on Data Science and Advanced Analytics
BT - 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA)
PB - IEEE
T2 - 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA)
Y2 - 9 October 2025 through 12 October 2025
ER -