Introduction

When I was in high school, I had a knack for the sciences; chemistry, biology, and mathematics were among the most interesting things I had found in the world. Among them, I chose to side with physics, eventually enrolling in the physics specialization at a prominent high school in my hometown. Along the way I also earned some results in competitions: some in my province, the 30/4 Traditional Olympiad at Le Hong Phong High School for the Gifted in Ho Chi Minh City, and finally the Vietnam National Physics Olympiad (VPhO). Studying physics taught me a lot about how the world works and how a few general principles govern the behavior of physical systems over time. For example, we can fully work out how a ball falls if we throw it at a given speed and angle, or how velocity and acceleration determine the way an object moves. Newton’s laws of motion and classical dynamics are among the most fundamental and interesting concepts in physics. Studying physics also builds mental models that are immensely useful for problem solving, whether the problems come from daily life or from academia. Inspired by how physics models the world, I became fascinated by continuous functions in calculus and by how we can track their changes over time through their derivatives alone. With differential equations (specifically ODEs, ordinary differential equations) acting as the language of change, we learned to solve all kinds of ODEs daily. Whenever we stumbled upon a PDE, a partial differential equation, it felt like a jump scare out of nowhere; we would laugh out loud with each other and perhaps run away to find some peace.

When I encountered maths in university, things became very discrete: Karnaugh maps, Boolean algebra, and quantities that must be in exactly one of two states, true or false. Honestly, I had neither an interest in nor a knack for these discrete structures, since I had been with physics for a long time and everything there was continuous. As a result, I struggled a lot during my first few semesters in uni, where I had to learn things I didn’t even care about. Only when I heard that AI, and specifically deep learning, is built mainly on calculus and optimization (where continuous functions are of course a major ingredient) did I begin to feel that this field might be the right one for me to dive deep into, something I had longed for through my early years of undergrad.

First, I started to learn the basics of machine learning. To be honest, this is a vast field with a lot of material that is overwhelming for anyone who has barely set foot in it: the bias-variance tradeoff, loss functions, optimization with first-order and second-order derivatives, and so on. One of the most fascinating parts of machine learning is definitely deep learning, where we move away from simple models such as decision trees to very deep neural networks that may contain up to a thousand layers. Like everyone else, the first thing you do when you dive into deep learning is train a neural network on a gold dataset, a curated one that has been extensively preprocessed so that we beginners can step in safely. This type of learning is called Supervised Learning (SL). At this point, again, I grew bored with the training paradigm. Learning from data is the most straightforward method you can think of, since we need something already good to mimic, but I wanted training to be more universal: preparing a training dataset can consume an enormous amount of time, and there are many data engineering difficulties to cope with. That is when I learned about a fantastic branch of machine learning called Reinforcement Learning (RL), which aims to create an intelligent agent that trains itself through a series of interactions with its environment. In this post I would like to share some thoughts on why I find RL much more fascinating than traditional SL and why it may well be the next cornerstone of frontier artificial intelligence. Some prominent AI researchers consider RL the paradigm that could lead us to ASI (Artificial Superintelligence) or AGI (Artificial General Intelligence), whichever term you prefer.

Fitting a model vs Training an agent

In this section, I would like to briefly compare these two training paradigms and offer some insight into how different they are from each other. The comparison is based entirely on my intuition and accumulated experience up to the time of writing, so some points may deserve further checking, but I will try my best to validate them myself. Let’s start with the traditional supervised learning setting.

Fitting a model

In supervised learning settings, we are trying to fit a model to a fixed training dataset and evaluate that model on another dataset, a test dataset, to assess its performance in real life scenarios. To create, train and use a model for production, there is a big picture to consider. But here, I just want to point out some important steps in our journey to do such a task:

First, in order to train a model, we must have enough data. Machine learning demands a large amount of data for training, and even more for deep learning with billion-parameter models. Therefore, the first and foremost task is gathering a large number of data samples that represent our objective. This data can be collected from the real world, synthesized with formulas, or even generated by another well-trained model.
Second, we do exploratory data analysis. This is one of the most important steps in our journey; those who work with temporal or financial data in particular must not skip it. This step gives us precious information about what our dataset looks like and which of its notable traits need careful consideration while modeling. Without it, we barely know anything about the dataset, which makes our model selection suboptimal or even misleading.
Third, from the insights of the second step, we begin to make assumptions about what the real-world data distribution should look like. Based on these assumptions, we choose a mathematical, specifically a statistical, model that we believe will fit the dataset well and help us predict unseen data. Wait, if machine learning allows us to create an intelligent model that can perform without explicit instructions, shouldn’t we just choose a model that is as general as possible? Unfortunately, that is not the case. We are denied this holy blessing by the No Free Lunch Theorem: averaged over all possible problems, no learning algorithm outperforms any other, so a model can only do well by making assumptions that match the dataset we are working on.
Fourth, we fit our chosen model to the training dataset. This step contains a lot that feels fascinating and overwhelming at the same time, especially with deep learning and very large neural networks. Some things to consider are loss functions, optimizers, training steps, the learning rate (and its scheduler), feature engineering (to make the data more expressive to the model), etc. This is undeniably the step people talk about most when doing ML, since training a model is such an exciting task (though that does not diminish the importance of the previous steps).
Finally, we evaluate our best model on a separate test set to see whether its performance matches our expectations and decide whether it is ready for production use. This step often reveals hidden problems that are hard to anticipate without a lot of experience. Some unpleasant ones include: the test set has a very different distribution from the training set; the model performs poorly on both sets (underfitting); the model performs very well on the training set but badly on the test set (overfitting); the model performs very well on both sets but fails dismally in real-world scenarios (the test set has a different distribution from the real world); etc. Many problems show up only as symptoms, and we have to act as doctors running thousands of diagnostics to identify which enemy we are fighting.
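The underfitting and overfitting symptoms from the last step can be reproduced on a toy problem. The sketch below is only an illustration under assumptions of my own: the target function `sin(2*pi*x)`, the noise level, the dataset sizes, and the choice of a k-nearest-neighbour regressor are all made up for this example. With `k=1` the model memorizes the training set, with `k=40` (every training point) it predicts a constant, and `k=5` sits in between.

```python
import math
import random

def make_data(n, rng):
    """Sample n noisy points from y = sin(2*pi*x) + Gaussian noise (an assumed toy target)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def knn_predict(x, train_x, train_y, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN predictor on the points (xs, ys)."""
    errors = [(knn_predict(x, train_x, train_y, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)
train_x, train_y = make_data(40, rng)    # fixed training set
test_x, test_y = make_data(200, rng)     # held-out test set from the same distribution

results = {}
for k in (1, 5, 40):
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(test_x, test_y, train_x, train_y, k)
    results[k] = (train_error, test_error)
    print(f"k={k:2d}  train MSE={train_error:.3f}  test MSE={test_error:.3f}")
```

A gap between a near-zero training error and a larger test error is the overfitting symptom; two similarly large errors are the underfitting symptom. Note that the data never changes while the model is being evaluated, which is the property the rest of this post contrasts with RL.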

Now, let’s look at how RL differs from SL, and specifically from the steps described above.

Training an agent

Now, we switch our attention to the superstar in my heart, Reinforcement Learning. In this setting, there is a never-ending interaction between data and model.

Data Gathering: in the previous setting, we only need to gather data that already exists in various sources on the internet: articles, blog posts, books, newspapers, images of humans and animals, new anime release announcements, etc. All kinds of data are already there and we just need to collect them. The story for RL is entirely different. Since we want to train our agent through interaction with an environment, we can only gather data from that specific environment. In other words, we must have the agent perform many different actions and record the resulting states, rewards, and any other useful information about the environment. For the same reason, we cannot simply use data from an energy management environment to train an agent that is supposed to manage traffic flow by controlling traffic lights. We therefore have to gather the data ourselves, without relying on a curated dataset prepared by someone else. This first challenge pushes us to parallelize this phase as much as possible for faster collection. For example, in order to train Bob, an agent on the channel AI Warehouse, the author had to collect experience in 200 scenarios at the same time to gather data as fast as possible.
Non-stationarity: now it’s time for one of the strangest properties that distinguish the RL paradigm from the SL paradigm, the non-stationarity of the data distribution. In the SL setting, the data is assumed to be drawn from one fixed underlying distribution, which we could in theory recover given an infinite number of samples. In the RL setting, on the other hand, the data comes from the interaction between the agent and the environment, so the data distribution depends on the model parameters, that is, on the policy. During training the policy is updated continually, which shifts the data distribution away from what it was before. Our data distribution is therefore not fixed; it changes over time. With this behaviour in mind, policy learning algorithms can be divided into two categories: on-policy and off-policy learning. The former trains the model on data generated by its own current policy. The latter draws data from a different distribution, called the behavior policy (for example, an $\epsilon$-greedy policy or an older version of the agent), and uses that data to update ours. Since off-policy learning introduces another distribution, the mismatch between the two has to be accounted for; one classical tool for this is importance sampling, although some off-policy methods such as Q-learning avoid it through the form of their update.
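The interaction loop described under Data Gathering can be sketched in a few lines. Everything here is a hypothetical toy: the `Corridor` environment, its reward, and the `random_policy` are invented for illustration and do not come from any library. The loop also collects episodes one after another, whereas a real setup would run many environment copies in parallel.

```python
import random

class Corridor:
    """Hypothetical toy environment: positions 0..4, start in the middle, reward +1 at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = length // 2

    def reset(self):
        self.state = self.length // 2
        return self.state

    def step(self, action):
        """Apply action -1 (left) or +1 (right); the episode ends at either end."""
        self.state += action
        done = self.state in (0, self.length - 1)
        reward = 1.0 if self.state == self.length - 1 else 0.0
        return self.state, reward, done

def random_policy(state, rng):
    """Untrained agent: ignore the state and pick a direction at random."""
    return rng.choice((-1, 1))

def collect_episode(env, policy, rng, max_steps=100):
    """Run one episode and return its (state, action, reward, next_state, done) tuples."""
    transitions = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return transitions

rng = random.Random(0)
dataset = []
for _ in range(200):  # 200 episodes, gathered sequentially in this sketch
    dataset.extend(collect_episode(Corridor(), random_policy, rng))
print(len(dataset), "transitions collected")
```

The point of the sketch is that `dataset` does not exist until the agent acts, and its contents depend on which `policy` was passed in.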

Reinforcement learning as a natural learning process

When we talk about something natural, we usually mean something that can emerge by itself without human-crafted intervention or manual management. Evolution gives us thousands of hints about what is natural and what is not. A dog knows how to bark without being taught. A butterfly knows how to fly from the moment it emerges from its chrysalis to begin a new journey with an entirely new body. Nearly all animals share the same reaction when they face something wrong, dangerous, or insidious: they run away. Running away from what we fear is definitely one of the most natural phenomena we can think of. This mechanism has been encoded deep in our genes since the dawn of time. Then how about learning? If running is the natural thing to do in dangerous circumstances, what would natural learning look like? Supervised learning has indeed given us an incredibly nice framework for creating intelligent agents by training them on a dataset that describes what we want them to learn. This paradigm has long been a mainstay of artificial intelligence due to its close relationship with statistics. Learning from a dataset is powered by some of the most fundamental and influential concepts in statistics: maximum likelihood estimation, maximum a posteriori estimation, divergences between distributions, etc. These concepts were established decades ago with immense mathematical rigour.

Then we ask a single question: what if we don’t even have data to train our model? In other words, what if we need our model to interact with the environment we want to deploy it in and learn from that environment by itself? This is where RL comes to the rescue. The most interesting property of RL is that we don’t need to prepare any dataset before training. We just need our agent (the term for the model in the context of RL) to interact directly with the environment and gain experience from it. Let’s take a small example of how a child discovers for himself that fire is dangerous and must be avoided, using no fancy RL jargon. Let’s name our cute child Bob and see how he learns this by himself:

Bob sees a fire flickering in the corner of the house and has no clue what that brightness is.
Bob approaches the fire out of pure curiosity and wants to touch it to feel what it is like.
The heat from the fire immediately inflicts insufferable pain on Bob’s hand. This is the first time Bob has experienced something so harsh in his life, barely 2-3 years after the day he was brought into this world.
Bob cannot stop himself from panicking and crying out loud from immense pain. He immediately deems fire something that would cause him such pain when he touches it.
From this moment onwards, fire is something to be deemed dangerous by Bob, and Bob would never dare to come close to it any longer.

From the above example, we can clearly see that we don’t need any dataset to train Bob to learn the danger of fire. All we need to do is watch closely what he has done and how he reacts to the consequences following his actions. Of course, we would want to ensure his safety, because not everything in life gives us a second chance if we fail; some things are far too dangerous to begin with. Nevertheless, learning through trial and error seems to be the ultimate, natural, and automatic solution when an agent has to make its way into how the system works and how the surrounding environment works. Of course, we can use other examples such as learning to ride a bike, control a game controller, fix a water leak, use computer devices, etc. There are thousands of realistic skills that can be unlocked with this kind of learning, as long as it is intrinsically based on trial and error. Now we shall see the final fascinating property that makes RL so compelling, but also so nightmarish for us human beings.

Instability as a Service

In software engineering and cloud computing, there are several popular architectures or models: Software as a Service, Infrastructure as a Service, Platform as a Service. Turning to reinforcement learning, I would like to talk about a special kind of service that is nearly the heart of its complexity: instability.

Overfitting & underfitting

In machine learning, we usually encounter two very popular terms when talking about a model’s performance after training: overfitting and underfitting. Although the two are logically symmetric, I believe the former is far more common in practice than the latter. In this setting, we want our model to fit the true distribution of the world as well as possible. However, since we have no access to the true distribution, we fit our model to the data distribution, which acts as a proxy for it. This training paradigm has a very simple property: the data distribution is never affected by the model parameters. Since our data is fixed and painstakingly curated beforehand, the information flow is one-way, from the data to the model. The model is updated from the data, not the other way around. This is not the case for the reinforcement learning framework. There the data does not stay fixed throughout the process; it moves along with the model parameters!

Non-stationary data

Now we come to the magic of reinforcement learning. Since our agent can only learn from its direct interaction with the environment, the data used to train it cannot be obtained beforehand. We can only let the agent interact with the environment and get the data from that interaction. Since at the beginning we have no clue about the optimal policy (a distribution over actions for each state), we naturally initialize the model weights randomly. With a typical initialization, the resulting policy picks each action with roughly equal probability in every state, which lets the model explore the state-action space freely and with as little bias as possible. Assume, then, that at the start of training our data is spread roughly uniformly over the states and actions the agent can reach. After a period of time, the agent has been updated, using the Bellman optimality equations or some other training framework, and has learned something about the environment. At this point it begins to discover that some actions should be preferred over others. Its policy begins to change and gradually diverges from its initial random distribution, and the data distribution changes shape along with it. Since some actions are now preferred, they are chosen much more often than before. The states that follow those actions gain probability mass, producing spikes in some regions of the data distribution. As the policy changes even more, the data distribution no longer resembles its initial form; it now gives much more probability to some states and nearly none to others (because those states follow actions that the model never chooses). Since this behavior is intrinsically unavoidable, many algorithms and training methods have been developed to cope with it, and they can be separated into two main divisions: on-policy learning and off-policy learning. The following section discusses these two paradigms at a general level, suitable for anyone who is not an expert in the field or is just getting a first taste of reinforcement learning.
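The shift in the data distribution can be made visible with a small simulation. This is a minimal sketch under assumptions of my own: a five-state chain with walls at both ends, and a policy reduced to a single number `p_right`, the probability of stepping right. The value `0.5` stands in for a freshly initialized policy and `0.9` for a policy that has learned to prefer one action; no actual learning happens in this code.

```python
import random
from collections import Counter

def visitation(p_right, rng, n_states=5, episodes=500, horizon=20):
    """Empirical state-visit frequencies of a walker that steps right with probability p_right."""
    counts = Counter()
    for _ in range(episodes):
        state = 0  # every episode starts at the left end
        for _ in range(horizon):
            counts[state] += 1
            move = 1 if rng.random() < p_right else -1
            state = min(n_states - 1, max(0, state + move))  # walls at both ends
    total = sum(counts.values())
    return [counts[s] / total for s in range(n_states)]

rng = random.Random(0)
early = visitation(0.5, rng)  # near-uniform policy, as at the start of training
late = visitation(0.9, rng)   # policy that now strongly prefers "right"

print("early:", [round(p, 2) for p in early])
print("late: ", [round(p, 2) for p in late])
```

The environment is identical in both runs; only the policy differs, yet the states that appear in the collected data are distributed very differently. A learner trained on the `late` data sees the left side of the chain far less often than one trained on the `early` data.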

On-policy and off-policy

Among RL practitioners, there are several ways to divide algorithms into classes. One way is to separate them by whether they are on-policy or off-policy. I will use two major representatives of these classes: Proximal Policy Optimization (PPO, on-policy) and Soft Actor-Critic (SAC, off-policy). Since the distribution and statistics of our data do not persist over time, there are several approaches to designing an algorithm that copes with this. PPO is called on-policy because it always uses the current weights of the model to generate action probabilities at each time step, without any other parameterized or specially designed distribution. With this approach, the data constantly changes with the policy. To keep the data consistent with the current policy, old data must be discarded once the policy has been updated on it; otherwise, data from an outdated distribution may distort training. (PPO does reuse each freshly collected batch for a few gradient steps, and it uses a clipped probability ratio between the new and old policy to keep those steps safe.) Continually discarding previous data means each sample is used only briefly, which can be a significant disadvantage. In problems where gathering data is very expensive, on-policy algorithms like PPO may not be the best choice. SAC, on the other hand, takes another approach. It stores past experience in a replay buffer, often a very large one, and keeps sampling from it throughout training. This addresses the cost of data collection but introduces a new challenge, since the old data belongs to old distributions. How can we reduce the bias from the old data? This is where the policy being learned (the target policy) gets a counterpart: the behavior policy that actually generated the data, such as $\epsilon$-greedy selection, stochastic action selection based on probabilities, or simply an earlier version of the agent. Since the data is sampled from a distribution different from the one given by the current network parameters, this discrepancy has to be handled. One classical technique from statistics is Importance Sampling. Long story short, importance sampling reweights each data sample by the ratio of its probability under the policy distribution to its probability under the sampling distribution, since an expectation computed from the samples as they are would be an expectation under the wrong distribution. A sample may be likely under one distribution and unlikely under the other. Off-policy methods that estimate returns over whole trajectories need such a correction; otherwise their estimates would carry a large statistical bias and possibly accumulate error. Methods like SAC and Q-learning sidestep it instead by learning an action-value function from single transitions, whose update does not depend on how the action in the buffer was chosen.
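To see what the importance sampling correction does, consider the smallest possible case: one state and two actions. The numbers below (the rewards and both probability tables) are assumptions chosen for illustration. Data is generated by a `behavior` distribution, but we want the expected reward under a different `target` policy.

```python
import random

# Rewards of two actions in a single state (assumed values for illustration).
REWARD = {"left": 1.0, "right": 5.0}
behavior = {"left": 0.5, "right": 0.5}  # distribution that generated the data
target = {"left": 0.1, "right": 0.9}    # policy we actually want to evaluate

rng = random.Random(0)
actions = rng.choices(list(behavior), weights=list(behavior.values()), k=10_000)

# Naive average: this estimates the expected reward of the behavior policy.
naive = sum(REWARD[a] for a in actions) / len(actions)

# Importance-weighted average: each sample is reweighted by target(a) / behavior(a).
weighted = sum(target[a] / behavior[a] * REWARD[a] for a in actions) / len(actions)

# Exact expected reward under the target policy, for comparison.
true_value = sum(target[a] * REWARD[a] for a in target)

print(f"naive={naive:.2f}  weighted={weighted:.2f}  true={true_value:.2f}")
```

The naive average lands near the behavior policy’s value (3.0 with these numbers), while the reweighted average lands near the target policy’s value (4.6). The price is variance: when the two distributions differ a lot, a few samples receive very large weights.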

In summary, reinforcement learning introduces new challenges but also new opportunities for us to think about what learning actually is. Depending on the situation at hand, we might prefer an on-policy approach over an off-policy one, or vice versa.

Push and pull duality

While supervised learning (SL) and reinforcement learning (RL) are both machine learning subfields, I discovered they have an interesting push-and-pull relationship. This is just my personal thought on the relationship between these two branches. I try to make some connections between them intuitively to facilitate my learning and gain a mental model to think properly about machine learning as an interdisciplinary field with a broader view.

In supervised learning settings, we want to train our model to fit the training dataset properly without overfitting so it performs well on the test set (and in real-world scenarios) as well. Since our target is clear, we can easily define an objective function to act as a steering wheel to guide our model to the ground truth. Our target is usually fixed in this context, since the ground truth values in the dataset don’t change over time. They are static, stationary, and tractable. The objective function produces the loss value to signal how wrong the model is with its current predictions, then calculates the gradients and updates the model parameters to better align with the ground truth values. All of these things are static. Perhaps the two most important things we want to consider are numerical stability in calculating gradients and regularization to prevent overfitting. There are no dynamic components in this setup.

On the other hand, the main objective function for training an RL agent is to increase the total expected reward over a period of time. There are no ground truth values about the optimal value of expected reward an agent can get given its current state and available set of actions. Therefore, this objective is non-stationary and dynamic, based on what states the agent encounters during its training and what actions it has taken given those states. The interaction between states can be unexpectedly complicated, since one state can have bad estimated values but lead to a much more valuable state where the agent can get a significant amount of expected reward. Therefore, there is a natural framework in reinforcement learning, the Actor-Critic model. In such settings, we have two main components: the Actor, which actually takes actions given the current state of the environment, and the Critic, which tries to evaluate the value of the current state as precisely as possible. Wrong estimation can directly lead to poor performance, because the value of other states is bootstrapped from those estimations. Besides bootstrapping methods, we also have Monte Carlo simulation-based methods to counter this estimation bias, but that topic is beyond the scope of this post.
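To make the contrast between bootstrapping and Monte Carlo estimation concrete, here is a sketch of the quantities involved. The notation is standard textbook notation that this post does not otherwise define, so treat it as an assumption: $\gamma \in [0, 1]$ is a discount factor, $\alpha$ is a step size, $r_t$ is the reward at step $t$, $T$ is the final step of an episode, and $V$ is the Critic’s value estimate. The agent’s objective is the expected discounted return:

\[J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]\]

A bootstrapped (temporal-difference) Critic moves its estimate toward a target built from its own estimate of the next state:

\[V(s_t) \leftarrow V(s_t) + \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right)\]

A Monte Carlo Critic instead waits for the episode to finish and uses the observed return:

\[G_t = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k}, \qquad V(s_t) \leftarrow V(s_t) + \alpha \left( G_t - V(s_t) \right)\]

The first target is available after a single step but inherits any error in $V(s_{t+1})$; the second does not depend on $V$ at all but tends to have higher variance.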

In conclusion, supervised learning provides us with an exact target and a clear goal for our model to reach; reinforcement learning does not do such benevolent things. This training framework forces the agent to interact with the environment by itself and use that data to update its internal policy. Therefore, handing an agent a dataset beforehand is not the usual way to train a reinforcement learning agent. There is, however, a discipline called Imitation Learning, notably Behavioral Cloning, that helps the agent warm-start its training by using expert data. It is usually used as the first phase of training an agent because of the simplicity it inherits from supervised learning.

Challenges

All the aforementioned sections have introduced how interesting this machine learning discipline is, along with its differences from other domains of artificial intelligence. However, interesting things usually introduce new challenges. This section briefly covers a handful of popular challenges in reinforcement learning that are still on the frontier of academic research.

Credit Assignment

Beyond the dynamic nature of the data, RL introduces some fundamental hurdles that SL simply doesn’t have to deal with. The most famous one is probably the Credit Assignment Problem. This is perhaps one of the most interesting challenges in reinforcement learning, and I believe there is no mature mathematical framework that tackles it effectively. Let’s use an example to see what this challenge is about. Imagine we are training a robot to walk to a specified destination. In its first stage of training, the robot performs around 100 joint movements and eventually falls over. Was it the very last move at step 100 that caused the fall, or a poor decision made back at step 20? We don’t actually know what the real cause is. In another scenario, if the robot successfully walks and reaches the target, we also cannot directly say which actions in its trajectory led to this result. Determining which factors actually contribute to the agent’s learning progress and assigning them the credit they deserve is at the heart of this challenge. Hence the name Credit Assignment Problem.

Basically, the credit assignment problem has two forms: one spatial (space) and one temporal (time).

Temporal assignment: this is the case I talked about in the previous paragraph. In a long trajectory of actions that the agent has performed in its life, we cannot actually know which actions or decisions in the past led to a particular result in the future, thus creating difficulty when trying to find the best policy to teach the agent.
Spatial assignment: this case does not occur in a single-agent setup because its primary domain is multi-agent. In that scenario, we have multiple agents acting together collaboratively or competitively based on how we define the problem. When a specific result happens to all agents in the system, we cannot actually know which agent is the true cause, assuming we only care about which agent is responsible (not a specific action in a particular agent’s trajectory, which would be a space-time assignment problem, much harder).
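One common, admittedly crude, answer to temporal credit assignment is the discounted return: every earlier step shares in a later outcome, with a weight that decays with distance in time. The sketch below applies it to the falling-robot example; the discount factor `gamma=0.99` and the reward of `-1` for the fall are assumptions made for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step t, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical episode: 99 steps with no reward, then a penalty of -1 for the fall.
rewards = [0.0] * 99 + [-1.0]
returns = discounted_returns(rewards)

print(f"blame assigned to step 100: {returns[99]:.3f}")
print(f"blame assigned to step 20:  {returns[19]:.3f}")
```

Step 100 receives the full penalty and step 20 a discounted share of it, purely because of when they happened. Nothing in this scheme knows whether step 20 was actually the poor decision, which is exactly why credit assignment remains hard.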

Exploration-Exploitation trade-off

Another thing I find quite exciting is the Exploration-Exploitation trade-off. In SL, the model just tries to fit the data you give it. But in RL, the agent faces a dilemma: should it stick with the best action it knows so far and collect a reward it can already count on (Exploitation), or should it try something different and potentially suboptimal to see whether there is a better path it hasn’t discovered yet (Exploration)?

We often use a simple strategy like $\epsilon$-greedy to handle this:

\[a_t = \begin{cases} \text{a uniformly random action} & \text{with probability } \epsilon \\ \arg\max_{a} Q(s_t, a) & \text{with probability } 1-\epsilon \end{cases}\]

This reminds me of thermal noise in physics: sometimes you need a bit of heat or randomness in the system to jump out of a local trap and find the true global optimum. For someone who likes the smooth shapes and continuous representations of physics, watching an agent navigate these high-dimensional policy spaces and solve these hurdles is much more satisfying than just fitting a curve to a static cloud of points.
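As a concrete illustration of the $\epsilon$-greedy rule, here is a minimal sketch on a three-armed bandit. The payout probabilities, the value `epsilon=0.1`, and the number of pulls are assumptions picked for the example; `q_values` plays the role of $Q(s, a)$ for a problem with a single state.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the current best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# Hypothetical 3-armed bandit: each arm pays 1 with the given probability, else 0.
PAYOUT_PROB = [0.2, 0.5, 0.8]

rng = random.Random(0)
q_values = [0.0] * len(PAYOUT_PROB)  # running estimate of each arm's mean reward
pulls = [0] * len(PAYOUT_PROB)

for _ in range(5000):
    arm = epsilon_greedy(q_values, epsilon=0.1, rng=rng)
    reward = 1.0 if rng.random() < PAYOUT_PROB[arm] else 0.0
    pulls[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / pulls[arm]  # incremental sample mean

print("estimates:", [round(q, 2) for q in q_values])
print("pulls:", pulls)
```

Without the exploration branch, the agent in this sketch would keep pulling the first arm that ever paid out and might never try the best one. With it, the better arms are eventually sampled and the greedy choice moves to them.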

This is the end of my first technical post on my personal website. If you are also interested in reinforcement learning (especially deep reinforcement learning) and diffusion models, feel free to contact me via my email and GitHub profile on my page. Below are some resources you may find useful and interesting for your learning journey.

Resources to look at

If you’re interested in this stuff, I really recommend checking out these labs and blogs. They helped me bridge the gap between standard AI and the more advanced research that’s happening right now:

1. Research Labs and Academic Blogs

UC Berkeley BAIR Lab: This is a gold mine for anyone into robotics and RL. This is probably the strongest lab in robotics, with very prominent names in robotics and reinforcement learning, including Sergey Levine.

2. Blog Posts

Lilian Weng’s Blog (Lil’Log): Lilian is an expert at summarizing massive amounts of research. Her posts are like the ultimate cheat sheets for technical summaries on RL and many other fields of research. She previously did research at OpenAI and currently works at Thinking Machines Lab with John Schulman, one of the pioneers of reinforcement learning at OpenAI (he is the father of both Trust Region Policy Optimization and Proximal Policy Optimization).

Endings

This post is just my personal thoughts on reinforcement learning and how I’m fascinated by its power, especially compared to our old friend supervised training. That being said, I bear no ill will toward supervised training. I only say that in our long journey of creating something intelligent, supervised training is definitely not our ultimate weapon. Learning is more complicated than we thought. Optimizing parameters against the ground truth in a dataset may be just one form of what we call learning. Reinforcement learning has seen massive progress, from TD-Gammon in the 1990s to AlphaGo and AlphaTensor; Google DeepMind seems to have a strong affection for the letter $\alpha$ when naming its innovations. Perhaps one day I will be a researcher contributing to these magnificent advancements.