Hi, you wonderful people!

Here’s a thought that came to my mind: Since training LLMs involves a degree of randomness, is there potentially a way to create an architecture for LLMs (or other AI) that would be somewhat deterministic in its training instead?

What I mean is, could a theoretical architecture exist where everyone trains their own separate checkpoints on different datasets, and those checkpoints can then be merged into a single checkpoint that carries the combined learning of all the smaller ones?

This would let thousands of people create their own checkpoints which, when combined, would add up to something greater than the individual parts. And since training is what takes the longest in developing LLMs (or any AI), this approach would let almost everyone contribute their share of processing power towards creating something together.
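To make “combining” a bit more concrete, the most naive version I can picture is plain parameter averaging of checkpoints that share a single architecture, roughly like the sketch below (the layer names and shapes are made up purely for illustration; maybe a workable scheme would need something much smarter, which is exactly what I’m asking about):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Naively average parameters across checkpoints with identical architecture.

    `checkpoints` is a list of dicts mapping parameter names to arrays.
    """
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}

# Toy example: three "contributors" each train a copy of the same 2-layer MLP
# on their own data, then ship back only the weights.
rng = np.random.default_rng(0)

def random_checkpoint():
    return {
        "layer1.weight": rng.normal(size=(16, 8)),
        "layer1.bias": rng.normal(size=(16,)),
        "layer2.weight": rng.normal(size=(4, 16)),
        "layer2.bias": rng.normal(size=(4,)),
    }

merged = average_checkpoints([random_checkpoint() for _ in range(3)])
print({k: v.shape for k, v in merged.items()})
```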

If viable, this could have huge potential implications for Open Source Software.

I’m looking forward to hearing what all of you smart people have to say about it!

  • synthphreak@alien.topB · 10 months ago

    “Naively averaging weights of models trained on disjoint datasets won’t work for LLMs or 1+ hidden layer DNNs”

    Why would simply aggregating the weights like this categorically fail to produce a reasonable model? Assuming of course that the datasets are all “the same” in some meaningful sense (e.g., equally representative of the same underlying X→Y mappings).

    • ohmygad45@alien.topB · 10 months ago

      Here’s a simple intuition as to why averaging the weights of a NN with one or more hidden layers won’t work: pick a hidden layer in your model and apply a permutation matrix to its weights (along the input axis) and the inverse permutation matrix to the previous layer’s weights (along the output axis). Obviously the model is unchanged from an input/output perspective. Repeat that with N different permutations, e.g. the N cyclic shifts (where N is the input dimension of the hidden layer you picked). You now have N models that are identical from an input/output perspective. But if you average their weights, every row of that hidden layer’s weight matrix collapses to a single repeated value (its mean), so the layer can no longer distinguish between its inputs. The averaged model is completely broken even though it’s the average of N “identical” (from an input/output perspective) models. QED.
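      If it helps, here’s a small NumPy sketch of that argument (toy layer sizes, my own illustration): each permuted copy of a two-layer MLP computes exactly the same outputs, yet averaging their weights collapses every hidden unit onto the same values.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# A tiny 2-layer MLP: x -> relu(W1 @ x + b1) -> W2 @ h + b2 (toy sizes).
d_in, d_hidden, d_out = 5, 8, 3
params = (rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden),
          rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out))

def forward(p, x):
    W1, b1, W2, b2 = p
    return W2 @ relu(W1 @ x + b1) + b2

def permuted(p, perm):
    """Reorder the hidden units: rows of layer 1, matching columns of layer 2."""
    W1, b1, W2, b2 = p
    return (W1[perm], b1[perm], W2[:, perm], b2)

x = rng.normal(size=d_in)

# N functionally identical models, one per cyclic shift of the hidden units.
copies = [permuted(params, np.roll(np.arange(d_hidden), k)) for k in range(d_hidden)]
assert all(np.allclose(forward(c, x), forward(params, x)) for c in copies)

# Average their weights: every hidden unit collapses onto the same averaged row...
avg = tuple(np.mean(stack, axis=0) for stack in zip(*copies))
print(np.ptp(avg[0], axis=0))                  # ~0: all rows of averaged W1 match
# ...so the averaged model no longer behaves like the models it was built from.
print(forward(avg, x), forward(params, x))
```

      (Real checkpoints trained separately won’t be exact permutations of each other, of course, but this same symmetry is why their weights generally don’t sit in directly averageable positions.)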

      • synthphreak@alien.topB · 10 months ago

        Interesting. I love a good thought experiment :)

        But what about the idea of bagging? As in aggregating multiple models that have each been trained on different examples, and have thus learned different things. Why is that not subject to similar criticism?
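        To spell out what I mean by bagging: the ensemble averages each model’s predictions rather than its weights, so nothing inside the individual models ever has to line up. Here’s a toy NumPy sketch (made-up models, just to show the contrast between the two kinds of averaging):

```python
import numpy as np

# Two toy "models" that compute the same function but store their parameters
# in a different (permuted) order -- made-up stand-ins, just for illustration.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
perm = np.array([2, 3, 0, 1])

def model_a(x):
    return np.tanh(W @ x).sum()          # original row order

def model_b(x):
    return np.tanh(W[perm] @ x).sum()    # same function, rows permuted

x = rng.normal(size=3)

# Bagging-style combination: average the models' *outputs*.
# Permutation-invariant, so the ensemble agrees with its members here.
bagged = np.mean([model_a(x), model_b(x)])
print(np.isclose(bagged, model_a(x)))                     # True

# Weight-level combination: average the *parameters*. The row orderings clash,
# and the merged model no longer matches the models it came from.
W_avg = (W + W[perm]) / 2.0
print(np.isclose(np.tanh(W_avg @ x).sum(), model_a(x)))   # generally False
```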