Defeating Nondeterminism in LLM Inference
I didn’t understand why such a high-profile lab was working on this. Determinism is useful, but is it worth that much research?
Today I think I have a decent guess about this.
During RLHF, the labels users cast on responses, especially those with multi-turn input, are only useful if the policy model’s forward pass during training can replicate all of those answers. Otherwise, the human feedback will reward or punish the wrong responses.
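
To make the guess concrete, here is a toy sketch in Python. It is not how any real RLHF stack works: the names (accumulate, greedy_token, CONTRIBUTIONS) and the scores are made up for illustration, and I am assuming the only difference between the serving path and the training path is the order in which floating-point terms are added. Floating-point addition is not associative, so that alone can flip a near-tied greedy choice.

```python
def accumulate(values):
    """Add floats strictly left to right (plain loop, no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def greedy_token(contributions, reverse_order):
    """Pick the highest-scoring token; ties go to the first token listed."""
    scores = {}
    for token, parts in contributions.items():
        ordered = list(reversed(parts)) if reverse_order else list(parts)
        scores[token] = accumulate(ordered)
    return max(scores, key=scores.get)


# Hypothetical per-token score contributions (illustrative numbers only).
CONTRIBUTIONS = {"B": [0.6], "A": [0.1, 0.2, 0.3]}

# Assumed "serving" path: adds terms left to right, so A scores 0.6000000000000001.
logged_token = greedy_token(CONTRIBUTIONS, reverse_order=False)

# Assumed "training" path: adds the same terms right to left, so A scores 0.6.
replayed_token = greedy_token(CONTRIBUTIONS, reverse_order=True)

# The human label was given to logged_token; it only transfers cleanly
# if the training-time replay reproduces the same token.
label_is_attributable = logged_token == replayed_token

print(logged_token, replayed_token, label_is_attributable)
```

In this toy setup the user labeled a response that began with one token, while the training-time replay would have begun with the other, so the reward lands on a response the policy does not reproduce. In a multi-turn conversation a single early flip like this would change everything that follows.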