Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track
Feiyang Kang, Hoang Anh Just, Anit Kumar Sahu, Ruoxi Jia
Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recent efforts fit scaling functions that use the limited available samples to predict model performance at any data size and source composition. However, these scaling functions are usually black-box, computationally expensive to fit, highly susceptible to overfitting, and/or difficult to optimize for data selection. This paper proposes a framework called