Often when I read ML papers, the authors compare their results against a benchmark (e.g. using RMSE, accuracy, …) and say "our new method improved results by X%". Nobody runs a significance test to check whether the new method Y actually outperforms benchmark Z. Is there a reason why? This seems especially important when you break your results down, e.g. to the analysis of certain classes in object classification. Or am I overlooking something?
Without going into some of the fallacies that people posted in the thread, I'll share some basic strategies I personally use to validate my work:
I repeat the experiment at least 30 times (on small datasets) with different random seeds, plot the distribution of the resulting scores, and analyze them.
This is very basic and easy, and if someone complains about compute, it can be automated to run overnight on commodity hardware, run on a smaller dataset, or run against a simple benchmark for comparison.
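Something like this, a minimal sketch with scikit-learn (the dataset, the two models, and the paired t-test are my own illustrative choices, not a prescription):

```python
# Minimal sketch: repeat an experiment across random seeds, compare the
# score distributions, and run a paired significance test. The dataset and
# the two models are illustrative placeholders, not anyone's actual setup.
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def scores_over_seeds(make_model, n_seeds=30):
    # One accuracy estimate per seed; the seed reshuffles the CV split,
    # so each run sees a different partition of the data.
    scores = []
    for seed in range(n_seeds):
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        scores.append(cross_val_score(make_model(), X, y, cv=cv,
                                      scoring="accuracy").mean())
    return np.array(scores)

baseline = scores_over_seeds(lambda: LogisticRegression(max_iter=5000))
candidate = scores_over_seeds(lambda: RandomForestClassifier(random_state=0))

# Paired test is fair here: both methods were scored on the same 30 splits.
t_stat, p_value = stats.ttest_rel(candidate, baseline)
print(f"baseline:  {baseline.mean():.4f} +/- {baseline.std():.4f}")
print(f"candidate: {candidate.mean():.4f} +/- {candidate.std():.4f}")
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```

If the score distributions look non-normal, swap the t-test for scipy.stats.wilcoxon.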
As to OP's question, I personally feel that ML is more focused on optimizing metrics to achieve a goal and less on inferential analysis or the validity of results. As an example, the majority of Kaggle notebooks I see use logistic regression without checking its assumptions.
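For instance, multicollinearity among predictors is one assumption that's cheap to check. Here's a sketch with statsmodels; the features and data are fabricated just to show the mechanics:

```python
# Minimal sketch: before trusting a logistic regression, check one of its
# assumptions (low multicollinearity among predictors) with variance
# inflation factors. Feature names and data are made up for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(50, 15, 500),
})
# Deliberately near-collinear feature to trip the check.
df["spend"] = 0.8 * df["income"] + rng.normal(0, 2, 500)

exog = sm.add_constant(df)  # VIF needs the intercept column included
vif = pd.Series(
    [variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
    index=df.columns,
)
print(vif)  # rule of thumb: VIF above ~5-10 signals problematic collinearity
```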