I’m trying to understand what happens at the lowest level, where errors are back-propagated through the network: have LLMs basically converged on a single technique, or is this still evolving rapidly? For example:
- gradient descent
- stochastic gradient descent
- resilient propagation (rprop)
- rprop variants: rprop+, grprop, arcprop
- Something else?
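
To make the distinction I'm asking about concrete, here is a minimal sketch of how I understand two of these update rules to differ. It is only an illustration, not how any real LLM framework implements them: the function names are my own, the Rprop version is the simplified form without weight backtracking, and the hyperparameter defaults (`eta_plus=1.2`, `eta_minus=0.5`) are just commonly quoted values that I am assuming here.

```python
def sgd_step(w, grad, lr=0.1):
    """Plain (stochastic) gradient descent: the step is proportional to the gradient."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Simplified Rprop (no weight backtracking).

    Only the sign of each gradient component is used. Every weight has its
    own step size, which grows while the gradient sign stays the same and
    shrinks when the sign flips.
    """
    new_w, new_step = [], []
    for wi, gi, pgi, si in zip(w, grad, prev_grad, step):
        if gi * pgi > 0:        # same sign as last time: accelerate
            si = min(si * eta_plus, step_max)
        elif gi * pgi < 0:      # sign flipped: we overshot, slow down
            si = max(si * eta_minus, step_min)
        sign = (gi > 0) - (gi < 0)   # evaluates to -1, 0, or +1
        new_w.append(wi - sign * si)
        new_step.append(si)
    return new_w, new_step


# One step of each rule on a single weight (hypothetical values)
w_sgd = sgd_step([1.0], [2.0], lr=0.1)
w_rprop, step_rprop = rprop_step([1.0], [2.0], prev_grad=[1.0], step=[0.1])
```

With SGD the size of the gradient matters; with Rprop only its sign does, which is the difference I have in mind when asking which family is used in practice.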
Also, is training typically done in fp32? fp16?