Often when I read ML papers, the authors compare their results against a benchmark (e.g. using RMSE, accuracy, …) and say "our new method improved results by X%". Nobody runs a significance test to check whether the new method Y actually outperforms benchmark Z. Is there a reason why? This seems especially important to me when you break results down further, e.g. to the analysis of individual classes in object classification. Or am I overlooking something? (A sketch of the kind of test I have in mind is below.)
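To make concrete what I mean: even a simple paired comparison of two classifiers on the same test set, such as an exact McNemar's test on the discordant predictions, would already say something. A minimal sketch follows; `correct_a` and `correct_b` are hypothetical per-example correctness arrays standing in for two models' predictions, not taken from any specific paper.

```python
# Illustrative sketch: exact McNemar's test for comparing two classifiers
# evaluated on the SAME test set. `correct_a` / `correct_b` are assumed
# boolean arrays (one entry per test example) saying whether each model
# classified that example correctly -- placeholder data, not real results.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n = 2000
correct_a = rng.random(n) < 0.82   # stand-in for model A's per-example correctness
correct_b = rng.random(n) < 0.80   # stand-in for model B's per-example correctness

# Discordant pairs: examples where exactly one of the two models is correct.
b = int(np.sum(correct_a & ~correct_b))   # A right, B wrong
c = int(np.sum(~correct_a & correct_b))   # A wrong, B right

# Under H0 (equal error rates), b ~ Binomial(b + c, 0.5).
result = binomtest(b, n=b + c, p=0.5)
print(f"accuracy A = {correct_a.mean():.3f}, accuracy B = {correct_b.mean():.3f}")
print(f"discordant pairs: b={b}, c={c}, McNemar exact p-value = {result.pvalue:.4f}")
```

The paired setup matters: because both models are scored on the same examples, the test only looks at the cases where they disagree, which is usually far more sensitive than comparing two raw accuracy numbers.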
They should, but many don't, because often their results are not statistically significant, or they would have to spend a ton of compute just to demonstrate very small but statistically significant improvements. So they report averages over 5 runs (sometimes even fewer) and hope for the best. I have been a reviewer at most of the top ML conferences, and I'm usually the only reviewer holding people accountable for the statistical significance of their results when confidence intervals are missing.
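For reference, reporting a confidence interval over those few runs and a paired test against the baseline is cheap once the per-seed scores exist. Here's a minimal sketch under the assumption that the same 5 seeds were used for both methods; `scores_new` and `scores_baseline` are made-up accuracies purely for illustration.

```python
# Illustrative sketch: t-based confidence intervals and a paired t-test over
# per-seed scores. The score arrays below are hypothetical, not real results.
import numpy as np
from scipy import stats

scores_new      = np.array([0.913, 0.918, 0.909, 0.921, 0.915])  # new method, 5 seeds
scores_baseline = np.array([0.910, 0.912, 0.908, 0.916, 0.911])  # baseline, same seeds

def t_ci(x, confidence=0.95):
    """Two-sided t-based confidence interval for the mean of x."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    sem = stats.sem(x)  # standard error of the mean
    half = sem * stats.t.ppf((1 + confidence) / 2, df=len(x) - 1)
    return mean, mean - half, mean + half

for name, s in [("new", scores_new), ("baseline", scores_baseline)]:
    mean, lo, hi = t_ci(s)
    print(f"{name:9s} mean={mean:.4f}  95% CI=[{lo:.4f}, {hi:.4f}]")

# Paired t-test across seeds (valid only if both methods share the same seeds).
t_stat, p_val = stats.ttest_rel(scores_new, scores_baseline)
print(f"paired t-test: t={t_stat:.2f}, p={p_val:.4f}")
```

With only 5 runs the intervals will be wide, which is exactly the point: if the claimed X% improvement sits inside the baseline's interval, the average alone isn't telling you much.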