Hi, you wonderful people!

Here’s a thought that came to my mind: Since training LLMs involves a degree of randomness, is there potentially a way to create an architecture for LLMs (or other AI) that would be somewhat deterministic in its training instead?

What I mean is, could a theoretical architecture exist where everyone trains their own separate checkpoints on different datasets, and those checkpoints can then be merged into a single checkpoint that carries the combined learning of all the smaller ones?

This would let thousands of people create their own checkpoints which, when combined, would add up to something greater than the individual parts. And since training is what takes the longest in developing LLMs (or any AI), this approach would let almost everyone contribute their share of processing power towards creating something together.
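To make “combining” a bit more concrete, the most naive version I can picture is plain parameter averaging of checkpoints that share a single architecture, roughly like the sketch below (the layer names and shapes are made up purely for illustration; maybe a workable scheme would need something much smarter, which is exactly what I’m asking about):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Naively average parameters across checkpoints with identical architecture.

    `checkpoints` is a list of dicts mapping parameter names to arrays.
    """
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}

# Toy example: three "contributors" each train a copy of the same 2-layer MLP
# on their own data, then ship back only the weights.
rng = np.random.default_rng(0)

def random_checkpoint():
    return {
        "layer1.weight": rng.normal(size=(16, 8)),
        "layer1.bias": rng.normal(size=(16,)),
        "layer2.weight": rng.normal(size=(4, 16)),
        "layer2.bias": rng.normal(size=(4,)),
    }

merged = average_checkpoints([random_checkpoint() for _ in range(3)])
print({k: v.shape for k, v in merged.items()})
```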

If viable, this could have huge potential implications for Open Source Software.

I’m looking forward to hearing what all of you smart people have to say about it!

  • synthphreak@alien.topB · 10 months ago

    “Naively averaging weights of models trained on disjoint datasets won’t work for LLMs or 1+ hidden layer DNNs”

    Why would simply aggregating the weights like this categorically fail to produce a reasonable model? Assuming of course that the datasets are all “the same” in some meaningful sense (e.g., equally representative of the same underlying X→Y mappings).

    • ohmygad45@alien.topB · 10 months ago

      Here’s a simple intuition as to why averaging the weights of a NN with one or more hidden layers won’t work: pick a hidden layer in your model and apply a permutation matrix to its weights (along the input axis) and the inverse permutation matrix to the previous layer’s weights (along the output axis). Obviously the model is unchanged from an input/output perspective. Repeat that with N different permutations, e.g. the N cyclic shifts (where N is the input dimension of the hidden layer you picked). You now have N models that are identical from an input/output perspective. But if you average their weights, every row of that hidden layer’s weight matrix collapses to a single repeated value (its mean), so the layer can no longer distinguish between its inputs. The averaged model is completely broken even though it’s the average of N “identical” (from an input/output perspective) models. QED.
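      If it helps, here’s a small NumPy sketch of that argument (toy layer sizes, my own illustration): each permuted copy of a two-layer MLP computes exactly the same outputs, yet averaging their weights collapses every hidden unit onto the same values.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# A tiny 2-layer MLP: x -> relu(W1 @ x + b1) -> W2 @ h + b2 (toy sizes).
d_in, d_hidden, d_out = 5, 8, 3
params = (rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden),
          rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out))

def forward(p, x):
    W1, b1, W2, b2 = p
    return W2 @ relu(W1 @ x + b1) + b2

def permuted(p, perm):
    """Reorder the hidden units: rows of layer 1, matching columns of layer 2."""
    W1, b1, W2, b2 = p
    return (W1[perm], b1[perm], W2[:, perm], b2)

x = rng.normal(size=d_in)

# N functionally identical models, one per cyclic shift of the hidden units.
copies = [permuted(params, np.roll(np.arange(d_hidden), k)) for k in range(d_hidden)]
assert all(np.allclose(forward(c, x), forward(params, x)) for c in copies)

# Average their weights: every hidden unit collapses onto the same averaged row...
avg = tuple(np.mean(stack, axis=0) for stack in zip(*copies))
print(np.ptp(avg[0], axis=0))                  # ~0: all rows of averaged W1 match
# ...so the averaged model no longer behaves like the models it was built from.
print(forward(avg, x), forward(params, x))
```

      (Real checkpoints trained separately won’t be exact permutations of each other, of course, but this same symmetry is why their weights generally don’t sit in directly averageable positions.)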

      • synthphreak@alien.topB · 10 months ago

        Interesting. I love a good thought experiment :)

        But what about the idea of bagging? As in aggregating multiple models that have each been trained on different examples, and have thus learned different things. Why is that not subject to similar criticism?
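        To spell out what I mean by bagging: the ensemble averages each model’s predictions rather than its weights, so nothing inside the individual models ever has to line up. Here’s a toy NumPy sketch (made-up models, just to show the contrast between the two kinds of averaging):

```python
import numpy as np

# Two toy "models" that compute the same function but store their parameters
# in a different (permuted) order -- made-up stand-ins, just for illustration.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
perm = np.array([2, 3, 0, 1])

def model_a(x):
    return np.tanh(W @ x).sum()          # original row order

def model_b(x):
    return np.tanh(W[perm] @ x).sum()    # same function, rows permuted

x = rng.normal(size=3)

# Bagging-style combination: average the models' *outputs*.
# Permutation-invariant, so the ensemble agrees with its members here.
bagged = np.mean([model_a(x), model_b(x)])
print(np.isclose(bagged, model_a(x)))                     # True

# Weight-level combination: average the *parameters*. The row orderings clash,
# and the merged model no longer matches the models it came from.
W_avg = (W + W[perm]) / 2.0
print(np.isclose(np.tanh(W_avg @ x).sum(), model_a(x)))   # generally False
```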