Today I read a paper about the choice of loss function.
One of its explanations of the DQN loss was new to me. It goes something like this:
The update minimizes the difference between two outputs computed with the same network architecture, but with weights taken at different points in time.
There are the current weights theta and a less frequently updated/synced copy theta^-, which plays the role of the previous "brain" that the update is meant to improve upon.
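The two-sets-of-weights idea can be sketched with a tiny linear "Q-network". This is a minimal illustration, not the paper's code; all names (`q_values`, `theta`, `theta_target`) are my own, and a real DQN would use a deep network and batched gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_values(state, weights):
    # Q(s, ·): one action-value per action, here just a linear map.
    return state @ weights

n_features, n_actions, gamma = 4, 2, 0.99
theta = rng.normal(size=(n_features, n_actions))  # online (current) weights
theta_target = theta.copy()                       # periodically synced copy, theta^-

# One transition (s, a, r, s')
s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
a, r, done = 1, 0.5, False

# The TD target is computed with the *target* weights theta^-,
# while the prediction uses the *online* weights theta.
td_target = r + (0.0 if done else gamma * q_values(s_next, theta_target).max())
td_error = td_target - q_values(s, theta)[a]
loss = 0.5 * td_error ** 2
```

Only `theta` is updated by gradient descent; `theta_target` is overwritten with a copy of `theta` every few thousand steps, which keeps the regression target from moving on every update.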
Separately, I tried a clearer frame-timeline approach in my own code. It has been rewarding: the refactor is speeding up improvements to existing functions.