shyamal's space


Applied AI @ OpenAI • Startups GTM • On Deck Fellow • Proud Son • Duke + Wisconsin Alum • Building for impact • Venture Scout • Neo Mentor • Duke AI Advisory Board

11 August 2022

Reinforcement Learning — A Primer

by Shyamal Anadkat


Reinforcement Learning (RL) is an exciting area of Machine Learning whose roots go back to the 1950s. It has produced several interesting applications, particularly in gaming (e.g., you might have heard of DeepMind’s AlphaGo, the first computer program to defeat a professional human Go player). How did AlphaGo achieve such a feat? It first learned from records of actual Go games and then kept improving by playing against itself over and over, using Reinforcement Learning to make better sequential decisions. This is how powerful RL is.

The applications of RL are not constrained to games though. For instance, RL is used in online recommendation systems to help customers discover new products that they might buy. Other applications include robotics, multi-agent interaction, vehicle navigation/self-driving, and industrial logistics.

Reinforcement learning enables an “agent” to learn from its environment by trial and error, using positive or negative feedback from its own experiences, much like humans learn. The agent ultimately learns to make sequences of decisions that maximize its long-term reward.

There are a few important terms to know when we are framing RL problems. Let’s use a classic board game, Snakes & Ladders, to understand these terms better. Let’s assume that in this game our baby agent gets to control the number rolled on the die, so choosing that number is the agent’s action.

:: State (agent’s position): The agent’s current situation, as returned by the environment. In our example, this is the current position of the player/agent on the game board.

:: Environment (agent’s world): The world in which the agent operates (our entire game board with the positions of snakes, ladders, and the numbers). Note that a Markov Decision Process (MDP) formally describes an environment for reinforcement learning. The Markov property basically says that the “future is independent of the past given the present”: the next state depends only on the current state and action, not on how the agent got there.

:: Reward (feedback from the environment): A scalar feedback signal telling the agent how well it is doing at a particular time step. The agent’s goal is to maximize the cumulative reward. For example, the agent could earn +5 points for landing on a shorter ladder and +10 for a longer ladder, and lose 5 points for landing on a shorter snake and 10 points for a longer snake.

:: Policy (agent’s strategy): This is the mapping from states to actions. It determines how the agent chooses an action (number on the die).

:: Value function (long-term return with some discount): Expected discounted sum of future rewards under a particular policy. It’s an indicator of goodness/badness of states & actions.
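To make the last two terms a bit more concrete, here is a minimal Python sketch. It assumes a made-up policy table (the board positions and die values are purely illustrative) and a discount factor `gamma`, which the definitions above imply but do not name. The function computes the discounted sum of rewards for one sequence of rewards; a value function is the expectation of this quantity under a policy.

```python
# A policy maps each state (board position) to an action (die value).
# These positions and choices are hypothetical, just for illustration.
policy = {0: 3, 11: 6, 17: 3}


def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a list of rewards.

    gamma (between 0 and 1) is an assumed discount factor: the smaller it is,
    the less the agent cares about rewards that arrive later.
    """
    total = 0.0
    # Work backwards so each earlier reward adds the discounted future total.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Short ladder (+5), then short snake (-5), then long ladder (+10).
    print(discounted_return([5, -5, 10], gamma=0.5))  # 5 - 2.5 + 2.5 = 5.0
```

With `gamma` close to 1 the agent values later rewards almost as much as immediate ones; with `gamma` close to 0 it becomes short-sighted.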

In a nutshell, at each time step, the agent executes an action and receives an observation and a scalar reward. Over time, the agent learns to select actions that maximize total future reward. Note that actions may have long-term consequences, and it may be better to give up short-term rewards to gain more long-term rewards (like a financial investment).

So, what makes Reinforcement Learning different from other types of machine learning, like supervised learning? The answer is that in RL there isn’t really a supervisor. While supervised learning works on existing or given samples of data, RL works by interacting with the environment. RL is all about making good decisions sequentially, whereas in supervised learning each output depends only on the input it is given. Moreover, an RL agent’s reward/feedback signal may be delayed rather than instantaneous, unlike the labels in supervised learning.
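The loop described above (act, observe, receive a reward, repeat) can be sketched in a few lines of Python. Everything here is an assumption for illustration: a tiny 20-square board, made-up ladder and snake positions, the point values from our reward example, and a baby agent that just picks die values at random.

```python
import random

# Hypothetical board: landing square -> (destination square, reward).
JUMPS = {
    3: (11, 10),   # longer ladder
    6: (9, 5),     # shorter ladder
    14: (12, -5),  # shorter snake
    18: (2, -10),  # longer snake
}
GOAL = 20


def step(state, action):
    """Environment: apply a chosen die value (1-6).

    Returns (next_state, reward, done).
    """
    position = min(state + action, GOAL)
    reward = 0
    if position in JUMPS:
        position, reward = JUMPS[position]
    return position, reward, position == GOAL


def random_policy(state):
    """Agent: a baby policy that ignores the state and picks any die value."""
    return random.randint(1, 6)


def run_episode(policy, max_steps=100):
    """Run one game and return (total_reward, list of transitions)."""
    state, total_reward, trajectory = 0, 0, []
    for _ in range(max_steps):
        action = policy(state)                          # agent acts
        next_state, reward, done = step(state, action)  # environment responds
        trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward, trajectory


if __name__ == "__main__":
    total, trajectory = run_episode(random_policy)
    print(f"steps: {len(trajectory)}, total reward: {total}")
```

A learning agent would replace `random_policy` with something that improves from the transitions it collects; the loop itself stays the same.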

Let’s look at some of the pros and cons of Reinforcement Learning. On the good side, RL can be used to solve some complicated problems that cannot be easily tackled by other machine learning methods. The learning process resembles how human beings learn, by trial and error, and it optimizes for good long-term results rather than only immediate ones. Moreover, in the absence of training datasets, RL can be a powerful technique for learning from the agent’s own experience navigating the environment. On the downside, RL needs a lot of data and is computationally expensive. It might not be a practical approach for a task if we’re constrained on the resources needed to train the model.

So, when should we consider applying reinforcement learning (and when should we not)? At present, RL cannot solve every problem. We should avoid using RL in situations where we cannot afford to make errors. For example, we should be very careful about using reinforcement learning to operate on a patient in a hospital setting. RL can also be difficult to use when not all of the environment variables have been quantified or mapped out, since partial information can lead to inaccurate and suboptimal results. Time is another limitation: if learning is mostly online, trials must be run many times to produce an effective policy. On the other hand, RL can be a good fit in scenarios where we can afford to make mistakes, have the time and resources to train, and could benefit from exploring/exploiting the environment to maximize the cumulative discounted reward.

We often talk about “online” and “offline” reinforcement learning settings. What’s the difference? Simply put, in online reinforcement learning the agent learns while it interacts with the environment, ingesting data one observation at a time. Most research/implementations in RL revolve around the online setting, but it can be impractical when data collection is incredibly expensive and time-consuming.

Offline (or batch) reinforcement learning, on the other hand, learns from a static dataset that was collected ahead of time, so no further interaction with the environment is required. It is often the only option in situations where we cannot afford the risks associated with online learning, and offline algorithms can turn large datasets into robust decision-making engines.

Let’s take an example of offline RL from the real world to better understand this. Let’s say we’re diagnosing a patient in a hospital setting. Here, actions correspond to ordering certain diagnostic tests, and observations correspond to the results of those tests. In such a scenario, we can use historical data on the diagnostic tests of real patients, together with the recommendations given by the healthcare provider, to frame our offline reinforcement learning problem. Pretty neat, right? Note that one of the biggest challenges with offline RL is that we cannot improve the policy by exploring the environment, since we’re working with static data.
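To get a feel for the offline setting, here is a minimal sketch that runs tabular Q-learning updates over a fixed list of logged transitions, without ever touching the environment. The state and action names, the tiny dataset, and the hyperparameters are all invented for illustration, and this is a toy, not how a production offline RL system (or a clinical one) would be built.

```python
from collections import defaultdict


def offline_q_learning(dataset, actions, gamma=0.9, alpha=0.5, sweeps=200):
    """Estimate Q-values from a static list of logged transitions.

    Each transition is (state, action, reward, next_state, done).
    The environment is never queried: we only replay the same data.
    """
    q = defaultdict(float)  # Q-values default to 0 for unseen pairs
    for _ in range(sweeps):
        for state, action, reward, next_state, done in dataset:
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
    return q


def greedy_action(q, state, actions):
    """Pick the action with the highest estimated Q-value in this state."""
    return max(actions, key=lambda a: q[(state, a)])


if __name__ == "__main__":
    actions = ["test_a", "test_b"]
    # Hypothetical logged data: test_b pays off immediately, but test_a
    # leads to a state where a much larger reward is available.
    logged = [
        ("intake", "test_a", 0, "follow_up", False),
        ("intake", "test_b", 1, "done", True),
        ("follow_up", "test_a", 10, "done", True),
    ]
    q = offline_q_learning(logged, actions)
    print(greedy_action(q, "intake", actions))  # test_a
```

Notice that the pair (`follow_up`, `test_b`) never appears in the logged data, so its estimate simply stays at the initial value of 0. That is the exploration challenge in miniature: the agent cannot find out what would have happened for actions the dataset never tried.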

Overall, interest in the Reinforcement Learning framework is rising. Applications of reliable and efficient reinforcement learning methods in verticals like healthcare, recommendation engines, autonomous systems, and robotics are creating immense opportunities in industry and research. If you’d like to keep learning more about RL, I highly recommend David Silver’s course on RL.

tags: Reinforcement Learning - Machine Learning