Often when I read ML papers, the authors compare their results against a benchmark (e.g. using RMSE, accuracy, …) and say "our new method improved results by X%". Nobody runs a significance test to check whether the new method Y actually outperforms benchmark Z. Is there a reason why? This seems especially important to me when you break your results down, e.g. to the analysis of certain classes in object classification. Or am I overlooking something?

  • me_but_darker@alien.top · 1 year ago

    Without going into some of the fallacies that people posted in this thread, I'll share some basic strategies I personally use to validate my work:

    • Bootstrap sampling to train and test the model.
    • Modifying the random seed.
    • Using inferential statistics (confidence intervals if you're a fan of frequentist statistics, or ROPE if you prefer Bayesian).

    I repeat the experiment at least 30 times (on small datasets), draw the distribution of scores, and analyze the results (see the sketch below).
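
    A minimal sketch of that loop, assuming scikit-learn; the dataset, model, and metric here are illustrative placeholders, not my actual setup:

    ```python
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)  # placeholder small dataset

    scores = []
    for seed in range(30):  # at least 30 repetitions, as above
        # A new seed each run gives a different train/test resample of the data.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y
        )
        model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))

    scores = np.array(scores)
    # Percentile-bootstrap 95% CI on the mean score across repetitions.
    rng = np.random.default_rng(0)
    boot = rng.choice(scores, size=(10_000, scores.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"mean accuracy {scores.mean():.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
    ```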

    This is very basic and easy. If someone complains about compute, it can be automated to run overnight on commodity hardware, or you can use a smaller dataset, or build a simple benchmark and compare performance against it.
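
    And for OP's specific case of checking whether a new method Y beats a benchmark Z, a paired test over the per-seed scores is one cheap option. A hedged sketch, where the two models are stand-ins rather than anyone's actual method:

    ```python
    import numpy as np
    from scipy.stats import wilcoxon
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    acc_y, acc_z = [], []
    for seed in range(30):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed, stratify=y
        )
        # Both methods see the identical split, so the per-seed scores are paired.
        m_y = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)  # "new method Y"
        m_z = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)          # "benchmark Z"
        acc_y.append(accuracy_score(y_te, m_y.predict(X_te)))
        acc_z.append(accuracy_score(y_te, m_z.predict(X_te)))

    diff = np.array(acc_y) - np.array(acc_z)
    # Wilcoxon signed-rank test on the paired differences (no normality assumption).
    stat, p = wilcoxon(diff)
    print(f"median diff {np.median(diff):.4f}, p = {p:.4f}")
    ```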

    As to OP's question, I personally feel that ML is more focused on optimizing metrics to achieve a goal, and less focused on inferential analysis or on whether the results actually hold up. As an example, the majority of Kaggle notebooks I see use logistic regression without checking its assumptions.