Hi all.
I was researching generative model evaluation and found this post interesting: https://deepsense.ai/evaluation-derangement-syndrome-in-gpu-poor-genai
A lot of it matches what I see happening in the industry, and it feels like a good fit here.
The typical measure at most ML conferences is the Fréchet inception distance (FID), but having read a number of generative AI papers, I find it can be extremely hard to tell what those values actually mean in practice. I appreciate papers that report the FID as a metric and also show some representative examples of the output (in the supplementary material if space is an issue).
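For anyone unfamiliar with what FID actually computes: it fits a Gaussian to the feature embeddings of real and generated samples and takes the Fréchet distance between the two Gaussians. Here's a minimal sketch with NumPy/SciPy, assuming you already have feature vectors (a real evaluation would use InceptionV3 pool3 activations over tens of thousands of images; the random arrays below are just toy stand-ins):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product.
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        # Numerical noise can leave tiny imaginary parts; drop them.
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Toy stand-in features (NOT real Inception activations).
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 8))
y = rng.normal(size=(2000, 8))
print(fid(x, x))  # identical sets: FID ~ 0 up to floating-point error
print(fid(x, y))  # small positive value for two samples of the same distribution
```

One practical gotcha worth knowing when reading papers: FID is biased with respect to sample size, so numbers computed on different numbers of images aren't directly comparable.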