How to explain why RL is difficult to someone who knows nothing about it?

I’ve been working on an RL project at work. The person who assigned it to me is a computer scientist who is not an expert in RL but understands that it’s a difficult problem. (My boss is on equal footing with the person who assigned the project to me; my boss is not a computer scientist and doesn’t know anything about RL.) This guy’s boss is a business manager who doesn’t know anything about RL and knows very little about ML. The business manager wants a report from me on how the project is going, and I’m getting the sense that he doesn’t really understand why this is taking so long.

For context, I’ve been working on this project for about 4 months at 15 hours per week. In that time, I’ve built an entire code base for the problem from scratch and programmed several models. I have one that mostly works at the moment, but I need to make some changes to the reward functions to get it performing well consistently. I’m the only one working on this project, so I’ve done all of this myself. I had also only done vanilla RL prior to this, so I’ve had to learn a ton about deep RL to make this work. Luckily I know someone who’s an expert in deep RL (outside work) and has been able to give me pointers. I feel like I’ve made a ton of progress and am nearing the home stretch in terms of having a completely polished model. However, I’m getting the sense that this guy is not super thrilled with me. He doesn’t have any official authority over me, so this is mainly about trying to explain how much work RL is, in addition to my normal slides about the project and where I’m at.

  • currentscurrents@alien.topB · 10 months ago

    Model-based RL has been looking a little more stable in the last year. DreamerV3 and TD-MPC2 claim to be able to train on hundreds of tasks with no per-task hyperparameter tuning, and report smooth loss curves that scale predictably.

    Have to wait and see if it pans out though.

    • OptimizedGarbage@alien.topB · 10 months ago

      I think this is overstating the contribution of these kinds of works. They still learn a Q-function via Mean-Squared Bellman Error, which means they’re subject to the same kind of instability in the value function as DDPG. They use a maximum entropy exploration method on the policy, which doesn’t come with exploration efficiency guarantees (at least not ones that are anywhere near optimal). The issue is that RL is extremely implementation-dependent. You can correctly implement an algorithm that got great results in a paper and have it still crash and burn.
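
      To make the value-function instability concrete, here is a minimal sketch (my own illustration, not code from any of these papers) of the mean-squared Bellman error loss that DDPG-style critics minimize; the network class, batch layout, and policy argument are made-up names for the example.

      ```python
      # Hedged sketch of a mean-squared Bellman error (MSBE) critic loss.
      # QNetwork, the batch fields, and the policy argument are illustrative
      # assumptions, not any particular library's API.
      import torch
      import torch.nn as nn

      class QNetwork(nn.Module):
          def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
              super().__init__()
              self.net = nn.Sequential(
                  nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, 1),
              )

          def forward(self, obs, act):
              return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

      def msbe_loss(q, q_target, policy, batch, gamma=0.99):
          """Mean-squared Bellman error with a bootstrapped target.

          The regression target is built from the learner's own (delayed)
          estimate of the next state's value, so estimation errors can feed
          back into the targets -- the root of the divergence risk.
          """
          obs, act, rew, next_obs, done = batch  # tensors sampled from a replay buffer
          with torch.no_grad():
              next_act = policy(next_obs)  # (target) policy's action at the next state
              target = rew + gamma * (1.0 - done) * q_target(next_obs, next_act)
          return nn.functional.mse_loss(q(obs, act), target)
      ```

      Nothing in that loss prevents the Q-values from chasing their own errors; whether it stays stable ends up depending on target-network update rates, learning rates, and other implementation details.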

      At a basic level, the issue is that we just don’t have sound theory for extending RL to continuous non-linear MDPs. You can try stuff, but it’s all engineers’ algorithms, not mathematicians’ algorithms – you have no idea if or when it’ll all break down, and if it does break down, they’re not gonna tell you that in the paper. Fundamentally we need theoretical work showing how to correctly solve these kinds of problems, and that’s a problem that these experimentally focused papers are not attempting to address.

      Progress requires directly addressing these issues. In my opinion, that’s most likely to come through theoretically driven work. For the value-divergence problem, that means Gradient Temporal Difference algorithms and their practical extensions (such as TD with Regularized Corrections). For exploration, that means using insights from online learning, like best-of-both-worlds algorithms that give a clear “exploration objective” that policies can optimize.
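
      For a sense of what the gradient-TD family looks like in practice, here is a rough sketch of linear TD with gradient correction (TDC), the classic member of that family; the variable names and the toy loop are purely illustrative.

      ```python
      # Hedged sketch of TD with gradient correction (TDC) for linear value
      # estimation. Unlike semi-gradient TD, the correction term makes the
      # update a true stochastic gradient of a well-defined objective (the
      # MSPBE), which is where the convergence guarantees come from.
      import numpy as np

      def tdc_update(theta, w, phi, phi_next, reward, gamma, alpha, beta):
          """One TDC step on feature vectors phi (current) and phi_next."""
          delta = reward + gamma * (theta @ phi_next) - (theta @ phi)  # TD error
          theta = theta + alpha * (delta * phi - gamma * phi_next * (w @ phi))
          w = w + beta * (delta - (w @ phi)) * phi  # auxiliary weights estimating the expected TD error
          return theta, w

      # Toy usage on random features, just to show the shape of the update loop.
      rng = np.random.default_rng(0)
      theta, w = np.zeros(8), np.zeros(8)
      for _ in range(1000):
          phi, phi_next = rng.standard_normal(8), rng.standard_normal(8)
          theta, w = tdc_update(theta, w, phi, phi_next,
                                reward=rng.standard_normal(),
                                gamma=0.9, alpha=0.01, beta=0.05)
      ```

      Extensions like TD with Regularized Corrections keep that same two-weight structure but, roughly speaking, regularize the auxiliary weight update to make the method more practical beyond the linear setting.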