How to explain why RL is difficult to someone who knows nothing about it?

I’ve been working on an RL project at work. The person who assigned it to me is a computer scientist who is not an expert on RL, but understands it’s a difficult problem. (My boss is on equal footing with the person who assigned the project to me; my boss is not a computer scientist and doesn’t know anything about RL.) This guy’s boss is a business manager who doesn’t know anything about RL and knows very little about ML. The business manager wants a report from me on how the project is going, and I’m getting the sense that he doesn’t really understand why this is taking so long.

For context, I’ve been working on this project for about 4 months at 15 hours per week. In that time, I’ve built an entire code base for the problem from scratch and programmed several models. I have one that mostly works at the moment, but I need to make some changes to the reward functions to get it performing well consistently. I’m the only one working on this project, so I’ve done all of this myself. I had also only done vanilla RL prior to this, so I’ve had to learn a ton about deep RL to make this work. Luckily I know someone who’s an expert in deep RL (outside work) and has been able to give me pointers. I feel like I’ve made a ton of progress and am nearing the home stretch in terms of having a completely polished model. However, I’m getting the sense that this guy is not super thrilled with me. He doesn’t have any official authority over me, so this is mainly about trying to explain how much work RL is, in addition to my normal slides about the project and where I’m at.

  • Impressive-Cat-2680@alien.top · 10 months ago

    Just a side question: can I tell people I understand RL just because I solved a lot of Bellman-equation-type problems in economics?

  • Creature1124@alien.top · 10 months ago

    Switch the entire project to use neuroevolution. Then you can chart a graph for how long it took similarly complex behavior to occur naturally vs with your implementation. Have some fun with that graph and make it really dramatic, then ask for a raise because you’re millions, if not billions, of times faster than god was.

    • Smart-Emu5581@alien.top · 10 months ago

      “I’m millions of times faster than god was” is such a fantastic take. I’m going to remember that one.

  • jms4607@alien.top · 10 months ago

    Why didn’t you use existing software to allow you to finish this quicker?

    • savvyms@alien.top (OP) · 10 months ago

      It’s a really specific problem. I drew inspiration from a bunch of other code, but it still took a lot to put it together.

  • sdmat@alien.top · 10 months ago

    Just try a huge number of ways to explain it. See how you go and iterate on the best approaches. Maybe trying to glean some high level concepts about education in the process would help.

  • Piledhigher-deeper@alien.top · 10 months ago

    Did you have to use RL? RL is pretty much just another word for gradient-free optimization, which is obviously hard, but I guess that isn’t going to help you.

  • OptimizedGarbage@alien.top · 10 months ago

    In general? Because deep RL is a skyscraper made of sticks and glue. Nothing, and I mean nothing, is actually guaranteed to work or has any kind of theoretical foundation at all. There are guarantees for toy problems, but everything past that is the wild West. In practice it’s janky in a way no other field of ML is.

    The standard way of learning value functions is to use the Temporal Difference update. Except we’ve known since the ’90s that this doesn’t really work – sometimes the solutions diverge, and there’s no known way of ensuring the neural net weights won’t all go to infinity. In practice this means that authors will frequently do multiple runs and only report the runs where the weights don’t explode. Even if your weights don’t explode, in general the policy class is not expressive enough to learn the optimal max-entropy policy, and even if it is, the loss isn’t convex. It’s possible to learn the right value function and still not be able to recover the optimal policy.
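
    Concretely (a toy sketch with my own names and numbers, not anyone’s actual setup), the semi-gradient TD(0) update with linear function approximation looks like this, and in the classic two-state “w, 2w” setup the weights really do grow without bound:

    ```python
    import numpy as np

    # Semi-gradient TD(0) with linear value approximation: V(s) = phi(s) @ w.
    # Illustrative toy example only (names and numbers are mine).

    def td0_step(w, phi_s, r, phi_s_next, gamma=0.99, alpha=0.1):
        td_error = r + gamma * phi_s_next @ w - phi_s @ w
        return w + alpha * td_error * phi_s  # "semi-gradient": the next-state term is not differentiated

    # Classic two-state setup: V(s0) = w, V(s1) = 2w, reward 0, transition s0 -> s1.
    phi = {0: np.array([1.0]), 1: np.array([2.0])}
    w = np.array([1.0])
    for _ in range(100):
        w = td0_step(w, phi[0], r=0.0, phi_s_next=phi[1])
    print(w)  # grows geometrically: each update multiplies w by roughly 1 + alpha * (2*gamma - 1)
    ```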

    And even granted that those don’t cause issues, you have to have an exploration strategy. Exploration is far and away the hardest problem in machine learning. You have to reason about the expected value of regions where you don’t have any data to make estimates from. And even when you do have estimates, none of your data is i.i.d. It’s basically impossible to do any kind of normal statistics to solve exploration. If you look into the literature on exploration and online learning, you’ll instead find some incredibly unusual math, most frequently involving an algorithm called Mirror Descent that does gradient descent in non-Euclidean geometry. But even that’s really only usable for toy problems right now. The only viable strategy for real problems is trial and error.
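
    To give a flavor of what that non-Euclidean gradient descent looks like (a minimal sketch, my own toy example rather than anything from a specific paper): mirror descent over the probability simplex with a negative-entropy mirror map reduces to the exponentiated-gradient / Hedge update – multiply each weight by exp(-eta * gradient) and renormalize:

    ```python
    import numpy as np

    # Mirror descent on the probability simplex with a negative-entropy mirror
    # map (a.k.a. exponentiated gradient / Hedge). Toy sketch; names are mine.

    def mirror_descent_step(p, grad, eta=0.1):
        p = p * np.exp(-eta * grad)   # multiplicative update in the "mirror" geometry
        return p / p.sum()            # projecting back onto the simplex is just renormalizing

    # Toy usage: learn a distribution over 3 arms with fixed expected losses.
    losses = np.array([0.9, 0.5, 0.1])
    p = np.ones(3) / 3
    for _ in range(300):
        p = mirror_descent_step(p, losses)
    print(p.round(3))  # mass concentrates on the lowest-loss arm
    ```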

    • currentscurrents@alien.top · 10 months ago

      Model-based RL is looking a little more stable in the last year. DreamerV3 and TD-MPC2 claim to be able to train on hundreds of tasks with no per-task hyperparameter tuning, and report smooth loss curves that scale predictably.

      Have to wait and see if it pans out though.

      • OptimizedGarbage@alien.top · 10 months ago

        I think this is overstating the contribution of these kinds of works. They still learn a Q-function via Mean-Squared Bellman Error, which means they’re subject to the same kind of instability in the value function as DDPG. They use a maximum entropy exploration method on the policy, which doesn’t come with exploration efficiency guarantees (at least not ones that are anywhere near optimal). The issue is that RL is extremely implementation-dependent. You can correctly implement an algorithm that got great results in a paper and have it still crash and burn.
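
        For reference, the Mean-Squared Bellman Error objective in question is, roughly (one common one-step form, in my own notation):

        ```latex
        \mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s')}\left[\Big(r + \gamma \max_{a'} Q_{\bar{\theta}}(s',a') - Q_{\theta}(s,a)\Big)^{2}\right]
        ```

        where \bar{\theta} is a lagged “target” copy of the parameters. Minimizing this by semi-gradient descent is exactly the step with no general convergence guarantee under non-linear function approximation.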

        At a basic level, the issue is that we just don’t have sound theory for extending RL to continuous non-linear MDPs. You can try stuff, but it’s all engineers’ algorithms, not mathematicians’ algorithms – you have no idea if or when it’ll all break down, and if it does all break down, they’re not gonna tell you that in the paper. Fundamentally we need theoretical work showing how to correctly solve these kinds of problems, and that’s a problem these experimentally focused papers are not attempting to address.

        Progress requires directly addressing these issues. In my opinion, that’s most likely to come through theoretically driven work. For the value-divergence problem, that means Gradient Temporal Difference algorithms and their practical extensions (such as TD with Regularized Corrections). For exploration, that means using insights from online learning, like best-of-both-worlds algorithms that give a clear “exploration objective” that policies can optimize.
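
        As a rough illustration of what the gradient-TD family looks like (linear-feature TDC, the simplest member; step sizes and variable names here are mine, and the deep “TD with Regularized Corrections” variant adds more machinery on top):

        ```python
        import numpy as np

        # TD with gradient correction (TDC), linear features, two-timescale form.
        # Sketch only; step sizes and variable names are illustrative.

        def tdc_step(w, h, phi, r, phi_next, gamma=0.99, alpha=0.01, beta=0.1):
            delta = r + gamma * phi_next @ w - phi @ w                      # TD error
            w = w + alpha * (delta * phi - gamma * phi_next * (phi @ h))    # corrected TD update
            h = h + beta * (delta - phi @ h) * phi                          # auxiliary weights track E[delta | s]
            return w, h
        ```

        Roughly speaking, this takes a stochastic gradient step on the projected Bellman error rather than a semi-gradient step, which is where the off-policy convergence guarantees come from in the linear setting.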

  • tarsiospettro@alien.top · 10 months ago

    I think it’s a hopeless problem. Maybe someone could act as third-party confirmation to the business manager of the complexity of the task.

  • Smart-Emu5581@alien.top · 10 months ago

    Most problems in machine learning give feedback of the form “you did something wrong, here is what you should have done instead”. RL gives feedback of the form “you did something wrong, but I’m not telling you what you should have done instead”. Imagine trying to learn anything difficult like that.
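
    In code the difference looks something like this (an illustrative sketch only – the labels, reward value, and shapes are made up):

    ```python
    import torch
    import torch.nn.functional as F

    logits = torch.randn(1, 4, requires_grad=True)  # model output over 4 possible classes/actions

    # Supervised learning: "you were wrong, the correct answer was class 2".
    supervised_loss = F.cross_entropy(logits, torch.tensor([2]))

    # RL (REINFORCE-style): "the action you took scored 0.3" – no hint at all
    # about what the best action would have been.
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = 0.3  # opaque scalar feedback
    rl_loss = -(dist.log_prob(action) * reward).sum()
    ```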

  • testuser514@alien.top · 10 months ago

    I guess my question would be: what do you want out of this meeting?

    More resources?

    Make sure they don’t give you trouble for a lack of results?

    That will basically color your update. Essentially, trying to explain the complexity of a technical problem to someone who isn’t interested in learning is not going to be easy.

  • matty961@alien.top · 10 months ago

    Unfortunately it’s hard to sell your project’s difficulty to someone who doesn’t know as much about ML.

    I’m not sure if you’ve done this, but IMO the best way to give non-technical people confidence in a long-running project is to break down your work into milestones, each with a concrete deliverable, and then estimate how long you think each milestone will take. That way, the business manager can determine:

    • When will the project be done?

    • Is the project on-track to finish on time?

    • What will be delivered, and by what dates?

    That’s probably all they care about. If the person who assigned you the project and your boss think your timeline is reasonable, your project is on time, and you’re delivering what you’re supposed to, then no one can really complain.

    I’ve always been of the opinion that writing timelines for ML projects is a lot harder than for non-ML work, since for ML you are often trying to hit some “good enough” bar, but you don’t know exactly what it will take to get there.

    • savvyms@alien.top (OP) · 10 months ago

      I have done that for several other projects (usually they ask at the outset, which tells you something about how this one is being run, lol). Neither the project lead nor the business manager has given me a time limit or target date for this project, but that might be the best way to recalibrate their expectations.