I’m trying to understand whether, at the lowest level where errors are back-propagated through the network, LLM training has basically converged on a single technique or is still evolving rapidly. For example (see the sketch after this list for what I mean by the lowest level):

  • gradient descent
  • stochastic gradient descent
  • resilient propagation (rprop)
  • rprop variants: rprop+, grprop, arcprop
  • Something else?
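
To be concrete, here's roughly what I picture as the "lowest level" — a single plain SGD update step in PyTorch. The toy model, shapes, and learning rate are just mine for illustration, not anyone's actual training setup:

```python
import torch

# Toy model: a single linear layer, purely illustrative.
model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(8, 16)
y = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()   # backpropagation: computes dLoss/dParam for every parameter
opt.step()        # SGD update: param -= lr * grad
```

Is it essentially this loop (with a fancier optimizer), or something structurally different?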

Also, is training typically done in fp32 or fp16?
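
For context on the precision question, this is the kind of mixed-precision setup I've seen described in PyTorch (fp16/bf16 forward pass with loss scaling); the toy model is again just illustrative and assumes a CUDA device is available:

```python
import torch

model = torch.nn.Linear(16, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid fp16 gradient underflow

x = torch.randn(8, 16, device="cuda")
y = torch.randn(8, 1, device="cuda")

with torch.cuda.amp.autocast():        # runs eligible ops in reduced precision
    loss = torch.nn.functional.mse_loss(model(x), y)

opt.zero_grad()
scaler.scale(loss).backward()          # backward pass on the scaled loss
scaler.step(opt)                       # unscales gradients, then applies the optimizer step
scaler.update()
```

Is something like this the norm for large-scale LLM training, or is full fp32 (or pure bf16) more common?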