Introduction
When I was in high school, I had a knack for scientific subjects: chemistry, biology and mathematics were always among the most interesting things I had found in the world. Among them, I chose to side with physics and eventually enrolled in the physics specialization at a prominent high school in my hometown. Along the way, I also earned some achievements in competitions: several contests in my province, the 30/4 Traditional Olympiad at Le Hong Phong High School for the Gifted in Ho Chi Minh City, and finally the Vietnam National Physics Olympiad (VPhO). Studying physics taught me a lot about how the world works and how a few general principles continuously govern its behavior over time. For example, we can fully work out how a ball will fall if we throw it at a given speed and angle, or how velocity and acceleration determine the way an object moves. Newton’s laws of motion and classical dynamics are among the most fundamental and interesting things we learn in physics. Studying physics also builds mental models that are immensely useful for problem solving, whether the problems come from daily life or from academia. Inspired by how physics models the world, I became fascinated by the continuity of functions in calculus and by how we can track their changes over time by caring only about their derivatives. With differential equations (specifically ODEs, ordinary differential equations) acting as the language of a changing world, we learnt to solve all kinds of ODEs day after day. Whenever we stumbled upon a PDE, a partial differential equation, it felt like a jump scare out of nowhere: we would laugh out loud with each other and perhaps run away to find some peace.
When I encountered maths at university, things became very discrete: Karnaugh maps, Boolean algebra and things that must be in exactly one state, either true or false. Honestly, I had neither interest in nor a knack for these discrete structures, since I had been with physics for a long time and everything there was continuous. As a result, I struggled a lot during my first few semesters at uni, where I had to learn things I did not even care about. Only when I heard that AI, and specifically deep learning, is built mainly on calculus and optimization (where continuous functions are, of course, a major factor) did I begin to feel that this might be the right field for me to dive deep into, something I had been longing for throughout my early undergraduate years.
First I started to learn some basics of machine learning. To be honest, this field is vast and contains a lot of material that is definitely overwhelming for anyone who has barely set foot in it: the bias-variance tradeoff, loss functions, optimization with first-order and second-order derivatives, and so on. One of the most fascinating parts of machine learning is definitely deep learning, where we move away from simple models such as decision trees to very deep neural networks that may contain up to a thousand layers. Like everyone else, the first thing I encountered when diving into deep learning was training a neural network on a gold dataset, a curated, production-ready one that has been extensively preprocessed so that we beginners can step in safely. This type of learning is called Supervised Learning (SL). At this point, again, I felt bored with the training paradigm. Learning from data is the most straightforward method you can think of, since we need something already good to mimic. Still, I wanted training to be somehow more universal, because preparing a training dataset can consume an unfathomable amount of time and data engineering comes with many difficulties. That is when I learned about a fantastic branch of machine learning called Reinforcement Learning (RL), which aims to create an intelligent agent by letting it train itself through a series of interactions with an environment. In this blog I would like to share some thoughts on why I find RL much more fascinating than traditional SL and why it is likely to be the next cornerstone of frontier artificial intelligence.
Some popular AI researchers consider RL to be the paradigm that can lead us to AGI (Artificial General Intelligence) or ASI (Artificial Superintelligence), depending on which term you prefer.
Fitting a model vs Training an agent
In this section, I would like to compare these two training paradigms briefly and offer some insight into how different they are from each other. The comparisons are based entirely on my intuition and collective experience up to the time of writing, so some points may need further checking, but I will try my best to validate them myself. Let’s start with the traditional supervised learning setting.
Fitting a model
In the supervised learning setting, we try to fit a model to a fixed training dataset and evaluate it on another dataset, the test set, to assess how it would perform in real-life scenarios. Creating, training and deploying a model for production involves a big picture, but here I just want to point out the important steps along the way:
- First, in order to train a model, we must have enough data. Machine learning demands a large amount of data for training, and even more when we are talking about deep learning with billion-parameter models. Therefore, the first and foremost task is gathering a large number of data samples that represent our objective. This data can be drawn from real life, synthesized from formulas, or even generated by another well-trained model.
- Second, we do exploratory data analysis. This is one of the most important steps in the journey, and anyone working with temporal or financial data certainly must not skip it. It gives us precious information about what our dataset looks like and which of its notable traits need careful consideration during modeling. Without this step we know barely anything about the dataset, which makes our model selection suboptimal or even misleading.
- Third, from the insights of the second step, we begin to make assumptions about what the real-world data distribution should look like. From these assumptions, we choose a mathematical, specifically statistical, model that we believe will fit the dataset well and help us predict unseen data. Wait, if machine learning lets us create an intelligent model that performs without explicit instructions, shouldn’t we simply choose a model that is as general as possible? Unfortunately, that is not the case. We are denied this holy blessing by the No Free Lunch Theorem: averaged over all possible problems, no learning algorithm outperforms any other, so a model can only do well by making assumptions about the data it is working on.
- Fourth, we fit our chosen model to the training dataset. This step involves a lot of things that feel fascinating and overwhelming at the same time, especially when working with deep learning and very large neural networks. Some things to consider are loss functions, optimizers, the number of training steps, the learning rate (and its scheduler), feature engineering (to make the data more expressive to the model), and so on. This is undeniably the step people talk about most when doing ML, since training a model is such an exciting task (though that does not diminish the importance of the previous steps).
- Finally, we evaluate our best model on a separate test set to see whether its performance matches our expectations and decide whether it is ready for production. This step often reveals hidden problems that are hard to anticipate without a lot of experience. Some unpleasant ones: the test set has a very different distribution from the training set; the model performs poorly on both sets (underfitting); the model performs very well on the training set but badly on the test set (overfitting); the model performs very well on both sets but fails dismally in real-world scenarios (the test set does not follow the real-world distribution); and so on. Many problems show up only as symptoms, and we have to act like doctors, running diagnostic after diagnostic to identify which enemy we are fighting.
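To make the steps above concrete, here is a minimal sketch of the fit-then-evaluate loop in plain Python. Everything in it is an assumption made for illustration: the synthetic data comes from a made-up line `y = 3x + 0.5` plus noise, the "model" is a straight line fitted by least squares, and the helper names (`make_dataset`, `fit_line`, `mse`) are hypothetical.

```python
import random


def make_dataset(n, rng):
    """Step 1: gather data (here: synthesized from an assumed formula plus noise)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 0.5 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Steps 3-4: assume a linear model and fit it by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def mse(params, xs, ys):
    """Step 5: mean squared error of the fitted line on a dataset."""
    slope, intercept = params
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
xs, ys = make_dataset(200, rng)

# The dataset is fixed: split once into a training part and a held-out test part.
train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]

params = fit_line(train_x, train_y)
train_mse = mse(params, train_x, train_y)
test_mse = mse(params, test_x, test_y)
print(f"slope={params[0]:.3f} intercept={params[1]:.3f}")
print(f"train MSE={train_mse:.4f} test MSE={test_mse:.4f}")
```

Note that nothing the model does here can change `xs` or `ys`: the information flows one way, from the data to the model.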
Now, let’s look at how RL differs from SL, and specifically from the steps we have just discussed.
Training an agent
Now we switch our attention to the superstar in my heart, Reinforcement Learning. In this setting, there is a never-ending interaction between data and model.
- Data gathering: in the previous setting, we only need to gather data that already exists in various sources on the internet: articles, blog posts, books, newspapers, images of humans and animals, new anime release announcements, and so on. All kinds of data are already there and we just need to collect them. The story for RL is entirely different. Since we want to train our agent through interaction with an environment, we can only gather data from that very environment. In other words, we must let the agent perform many different actions and record the resulting states, rewards and any other useful information about the environment. For the same reason, we cannot simply use data from an energy management environment to train an agent that is supposed to manage traffic flow by controlling traffic lights. We therefore have to gather the data ourselves, without relying on a dataset curated by someone else. This first challenge pushes us to parallelize the collection phase as much as possible. For example, to train Bob, an agent on the AI Warehouse channel, the author collected experience in 200 scenarios at the same time to gather data as fast as possible.
- Non-stationarity: now it is time for one of the strangest properties that sets the RL paradigm apart from SL, the non-stationarity of the data distribution. In the SL setting, the data is assumed to be drawn from one fixed underlying distribution, which we could in theory model exactly given an infinite number of samples. In the RL setting, on the other hand, the data comes from the interaction between the agent and the environment, so the data distribution depends on the model parameters, that is, on the policy distribution. During training the policy is updated continually, which shifts the data distribution away from what it was before. Our data distribution is therefore not fixed; it changes over time. With this behaviour in mind, policy learning algorithms can be split into two categories: on-policy and off-policy learning. The former trains the model on data generated by its own current policy. The latter draws data from another distribution, called the behavior policy (for example, an $\epsilon$-greedy policy or an older version of the agent), and uses that data to update ours. Since off-policy learning introduces a second distribution, we often need a correction such as importance sampling to account for the discrepancy between the two.
Reinforcement learning as a natural learning process
When we talk about something natural, we usually mean something that emerges by itself without human-crafted intervention or manual management. Evolution gives us thousands of hints about what is natural and what is not. A dog knows how to bark from the day it is born. A butterfly knows how to fly from the moment it leaves its cocoon to begin a new journey with an entirely new body. Almost all animals share the same reaction when they face something wrong, dangerous or insidious: they run away. Running away from what we fear is definitely one of the most natural phenomena we can think of, a mechanism encoded deep in our genes since the dawn of time. Then how about learning? If running is the natural thing to do in dangerous circumstances, what would it mean for learning to be natural? Supervised learning has indeed given us an incredibly nice framework for creating intelligent agents by training them on a dataset that describes what we want them to learn. This paradigm has a long standing in artificial intelligence thanks to its close relationship with statistics. Learning from a dataset is powered by some of the most fundamental and influential concepts in statistics: maximum likelihood estimation, maximum a posteriori estimation, divergences between distributions, and so on. These concepts were established decades ago with immense mathematical rigour.
Then we ask one single question: what if we do not even have data to train our model? In other words, what if we need the model to interact with the environment we want to deploy it in and wait for it to learn from that environment by itself? This is where RL comes to the rescue. The most interesting property of RL is that we do not need to prepare any dataset beforehand. We just need our agent (the term for the model in the context of RL) to interact directly with the environment and gain experience from it. Let’s take a small example of how a child discovers for himself that fire is dangerous and must be avoided, using no fancy RL jargon. Let’s name our cute child Bob and see how Bob learns by himself that fire is dangerous:
- Bob sees a fire sparkling in the corner of the house and has no clue what that brightness is.
- Bob approaches the fire out of pure curiosity and wants to touch it to feel what it is like.
- The heat from the fire immediately inflicts an insufferable pain on Bob’s hand. This is the first time Bob has experienced something so harsh in his life, barely 2-3 years after he was brought into this world.
- Bob cannot stop himself from panicking and crying out loud in immense pain. He immediately concludes that fire is something that causes him such pain when he touches it.
- From this moment onwards, Bob deems fire dangerous and never dares to come close to this witchcraft again.
From the above example, we can clearly see that we do not need any dataset to teach Bob the danger of fire. All we need to do is watch closely what he does and how he reacts to the consequences of his actions. Of course, we would want to ensure his safety, because not everything in life gives us a second chance if we fail; some things are far too dangerous to try in the first place. Nevertheless, learning through trial and error seems to be the ultimate natural and automatic solution when an agent has to figure out how a system, or its surrounding environment, works. We could use other examples too, such as learning to ride a bike, handle a game controller, fix the plumbing or use a computer. There are thousands of practical skills that can be unlocked this way, as long as the learning process is intrinsically based on trial and error. Now we shall see the final fascinating property that makes RL so sexy, but also so nightmarish for us human beings.
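Bob’s story can be written down as a tiny trial-and-error loop. This is only a toy sketch under assumptions of my own: the two action names and their rewards are made up, Bob starts with no opinion about either action (both values are 0), ties are broken in favour of the first action to stand in for his curiosity, and a step size of 1 means a single painful experience fully overwrites his estimate.

```python
def learn_by_trial(rewards, n_trials, step_size=1.0):
    """Greedy trial-and-error learning over a fixed set of actions.

    `rewards` maps each (hypothetical) action name to the reward the
    environment returns for it. No dataset is given to the learner.
    """
    values = {action: 0.0 for action in rewards}  # Bob knows nothing at first
    history = []
    for _ in range(n_trials):
        # Pick the action currently believed to be best; on a tie the first
        # action in the dict wins, which plays the role of curiosity here.
        action = max(values, key=values.get)
        reward = rewards[action]  # the environment responds
        # Move the estimate towards the observed reward.
        values[action] += step_size * (reward - values[action])
        history.append(action)
    return values, history


rewards = {"touch_fire": -10.0, "stay_away": 0.0}
values, history = learn_by_trial(rewards, n_trials=5)
print(history)
print(values)
```

The only "data" the learner ever sees is the reward that follows its own action, which is the point of the example.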
Instability as a Service
In software engineering and cloud computing, there are several popular service models: Software as a Service, Infrastructure as a Service, Platform as a Service. Coming to reinforcement learning, I would like to talk about a special kind of service that sits near the heart of its complexity: instability.
Overfitting & underfitting
In machine learning, we usually encounter two very popular terms when talking about the performance of a model after training: overfitting and underfitting. Although the two are logically on equal footing, I believe the former is far more common in practice. In this setting, we would like our model to fit the true distribution of the world as well as possible. However, since we have no access to the true distribution, we fit our model to the data distribution, which acts as a proxy for it. This training paradigm has a very simple property: the data distribution cannot be affected by the model parameters at any time. Since our data is fixed and painstakingly curated beforehand, the information flows one way, from the data to the model. The model is updated from the data, but never the other way around. However, this is not the case for our lovely reinforcement learning framework, where the data never stands still during the process: it moves along with the model parameters!
Non-stationary data
Now we come to the magic of reinforcement learning. Since our agent can only learn from direct interaction with the environment, the data used to train it cannot be obtained beforehand. We can only let the agent interact with the environment and take the data from those interactions. Since at the beginning we have no clue about the optimal policy (the distribution of actions given states), we naturally initialize all model weights randomly. These random parameters make the model try a wide range of actions, each with a roughly equal chance of being chosen at every state. This initialization lets the model explore the state-action space as freely and with as little bias as possible. Let us assume that at the start of training our data is spread roughly uniformly across all states and actions. After a while, the agent has been updated and has learnt something about the environment, using the Bellman optimality equations or some other training framework. At this point it begins to discover that some actions should be preferred over others. Its policy distribution begins to change and gradually diverges from its random initialization, and the data distribution changes shape along with it. Since the preferred actions are chosen much more often than before, the probability mass of the states that follow those actions grows, producing spikes in some regions of the data distribution. As the policy changes even further, the data distribution no longer resembles its initial form: it now puts much more probability on some states and nearly none on others (because those states follow actions that the model almost never chooses).
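The shift described above can be seen in a toy simulation. The sketch below assumes a hypothetical five-state corridor in which the agent moves left or right and is clamped at both ends; the "policy" is reduced to a single number, the probability of moving right. An untrained policy (0.5) and a later, more opinionated one (0.9) are both illustrative values, not the output of any real training run.

```python
import random


def visit_frequencies(p_right, n_steps, rng, n_states=5):
    """Fraction of time spent in each state of a small corridor.

    The agent starts in the middle, moves right with probability `p_right`
    and left otherwise, and cannot leave the corridor.
    """
    counts = [0] * n_states
    state = n_states // 2
    for _ in range(n_steps):
        step = 1 if rng.random() < p_right else -1
        state = min(n_states - 1, max(0, state + step))
        counts[state] += 1
    return [c / n_steps for c in counts]


rng = random.Random(0)
early = visit_frequencies(0.5, 10_000, rng)  # random initial policy
late = visit_frequencies(0.9, 10_000, rng)   # policy that now prefers "right"

print("early:", [round(f, 2) for f in early])
print("late: ", [round(f, 2) for f in late])
```

The environment is identical in both runs; only the policy changed, yet the states that make up the "dataset" are distributed very differently.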
Since this behavior is intrinsically unavoidable, we have come up with many algorithms and training methods, which can be separated into two main divisions: on-policy learning and off-policy learning. The following section discusses these two paradigms at a general level, suitable for anyone who is not an expert in the field or has only had a first taste of what goes on in a reinforcement learning setup.
On-policy and off-policy
Among reinforcement learning folks, there are several ways to group algorithms into classes. One of them is to separate algorithms by whether they are on-policy or off-policy. I will use one major representative of each class: Proximal Policy Optimization (PPO, on-policy) and Soft Actor-Critic (SAC, off-policy). Since the distribution and statistics of our data do not persist over time, there are several approaches to designing an algorithm that copes with this. PPO is called on-policy because it always uses the current weights of the model to generate the action probabilities at each time step, without relying on any other parameterized or specially designed distribution. With this approach, our data constantly changes along with the policy distribution. To keep the data up to date with the policy, old data is discarded after each update; otherwise it would affect training with its own outdated distribution. Continually throwing away previous data is wasteful, and this drawback can be painful: for problems where gathering data is very expensive, on-policy algorithms like PPO may not be the best choice. SAC, on the other hand, takes another approach. It keeps data from the past in a replay buffer and reuses it throughout training. This solves the problem of expensive data collection but introduces a new challenge, since the old data belongs to old distributions. Then how can we reduce the bias from the old data? This is where the policy distribution introduces a friend.
In such settings, the data is sampled from a distribution other than the current policy, for example from $\epsilon$-greedy selection, from stochastic action selection, or from older versions of the policy. Because the sampling (behavior) distribution differs from the distribution induced by the current network parameters, we must resolve the discrepancy, and a nice statistical technique for this is Importance Sampling. Long story short, importance sampling reweights each data sample by the ratio between its probability under the policy distribution and its probability under the sampling distribution, since the same quantity may have a higher expected value under one distribution and a lower one under the other. This correction matters for many off-policy estimators: without it, they suffer from a statistical bias that can accumulate over time. (Value-based methods such as Q-learning, and SAC itself, avoid explicit importance weights by learning a Q-function through Bellman backups, but the mismatch they have to handle is the same.)
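Here is a small sketch of the importance sampling idea on the simplest case I could think of: a single state with two actions. The probabilities and rewards are made-up numbers. Data is collected with a uniform behavior policy, and we want the expected reward under a target policy that prefers action 1; its true value is 0.1 * 0 + 0.9 * 1 = 0.9.

```python
import random


def sample_action(probs, rng):
    """Draw action 0 or 1 from a two-action distribution."""
    return 0 if rng.random() < probs[0] else 1


behavior = [0.5, 0.5]  # distribution that generated the data (assumed)
target = [0.1, 0.9]    # current policy we actually care about (assumed)
rewards = [0.0, 1.0]   # deterministic reward of each action (assumed)

rng = random.Random(0)
actions = [sample_action(behavior, rng) for _ in range(20_000)]

# Naive average: treats behavior data as if it came from the target policy.
naive_estimate = sum(rewards[a] for a in actions) / len(actions)

# Importance-weighted average: each sample is scaled by target / behavior.
weighted_estimate = sum(
    target[a] / behavior[a] * rewards[a] for a in actions
) / len(actions)

print(f"naive:    {naive_estimate:.3f}")
print(f"weighted: {weighted_estimate:.3f}")
```

The naive average estimates the value of the behavior policy, not ours; the weighted one recovers the value of the target policy, at the price of higher variance when the two distributions are far apart.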
In summary, reinforcement learning introduces new challenges but also new opportunities for us to think about what learning actually is. Depending on the situation at hand, we might prefer the on-policy approach over the off-policy one, or vice versa.
Push and pull duality
While supervised learning (SL) and reinforcement learning (RL) are both subfields of machine learning, I find that they have an interesting push-and-pull relationship. These are just my personal thoughts on the relationship between the two branches. I try to connect them intuitively to facilitate my own learning and to build a mental model for thinking about machine learning as an interdisciplinary field with a broader view.
In the supervised learning setting, we want our model to fit the training dataset properly without overfitting, so that it also performs well on the test set (and in real-world scenarios). Since our target is clear, we can easily define an objective function that acts as a steering wheel guiding the model towards the ground truth. The target is usually fixed in this context, since the ground truth values in the dataset do not change over time. They are static, stationary and tractable. The objective function produces a loss value that tells the model how wrong its current predictions are; we then compute the gradients and update the model parameters to align better with the ground truth. All of these things are static. Perhaps the two most important things to consider are numerical stability when computing gradients and regularization to prevent overfitting. There are no changing or dynamic components in this setup.
On the other hand, the main objective when training an RL agent is to maximize the total expected reward over a period of time. There are no ground truth values for the best expected reward an agent can get given its current state and available actions. This objective is therefore non-stationary and dynamic, depending on which states the agent encounters during training and which actions it takes in them. The interaction between states can be unexpectedly complicated: a state may look bad on its own yet lead to a much more valuable state from which the agent can collect a significant amount of reward. This naturally gives rise to a framework in reinforcement learning called the Actor-Critic model. It has two main components: the Actor, which actually takes actions given the current state of the environment, and the Critic, which tries to evaluate the value of the current state as precisely as possible. A wrong estimate can lead directly to poor performance, because the values of other states are bootstrapped from those estimates. Besides bootstrapping, there are also Monte Carlo methods that counter this estimation bias, but that topic is beyond the scope of this post.
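To show what "bootstrapped" means here, the sketch below runs the tabular TD(0) update, one standard bootstrapping rule, on a made-up two-step episode: state A gives no reward itself but leads to state B, which pays 10 before the episode ends. The state names, the rewards, the discount factor and the step size are all assumptions for illustration, and the helper names are hypothetical.

```python
def td0_values(episode, n_episodes, gamma, step_size):
    """Tabular TD(0): each state's value is nudged towards
    reward + gamma * (current estimate of the next state's value)."""
    values = {}
    for _ in range(n_episodes):
        for state, reward, next_state in episode:
            # A terminal transition (next_state is None) has no future value.
            v_next = 0.0 if next_state is None else values.get(next_state, 0.0)
            target = reward + gamma * v_next  # target built from an estimate
            v = values.get(state, 0.0)
            values[state] = v + step_size * (target - v)
    return values


def monte_carlo_return(episode, gamma):
    """Discounted return from the first state, using only observed rewards."""
    g = 0.0
    for _, reward, _ in reversed(episode):
        g = reward + gamma * g
    return g


# (state, reward, next_state); None marks the end of the episode.
episode = [("A", 0.0, "B"), ("B", 10.0, None)]

values = td0_values(episode, n_episodes=50, gamma=0.9, step_size=0.5)
mc_return_from_a = monte_carlo_return(episode, gamma=0.9)
print(values, mc_return_from_a)
```

On the very first pass the critic still believes B is worth 0, so A’s target is wrong too; A’s value only becomes accurate once B’s does. That dependence of one estimate on another is exactly why a poor critic hurts everything downstream, while the Monte Carlo return avoids it by waiting for the real outcome.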
In conclusion, supervised learning provides an exact target and a clear goal for our model to reach, while reinforcement learning does no such benevolent thing. It forces the agent to interact with the environment by itself and use the resulting data to update its internal policy distribution. Simply handing an agent a dataset beforehand is therefore not how reinforcement learning is meant to work. However, there is a discipline called Imitation Learning, and notably Behavioral Cloning, that helps the agent warm-start its training with expert data. It is often used as the first phase of training an agent, thanks to the simplicity it inherits from supervised learning.
Challenges
The previous sections have introduced how interesting this machine learning discipline is and how it differs from other domains of artificial intelligence. However, interesting things usually bring new things with them, and most of them are challenges. This section briefly covers a couple of popular challenges in reinforcement learning that are still at the frontier of academic research.
Credit Assignment
Beyond the dynamic nature of the data, RL introduces some fundamental hurdles that SL simply does not have to deal with. The most famous one is (probably) the Credit Assignment Problem. It is perhaps one of the most interesting challenges in reinforcement learning, and I believe there is not yet a mature mathematical framework that tackles it effectively. Let’s use an example to see what this challenge is about. Imagine we are training a robot to walk to a specified destination. Early in training, the robot performs around 100 different joint movements and then falls over. Was it the very last move at step 100 that caused the fall, or a poor decision made back at step 20? We do not actually know which one is the real culprit behind the robot’s downfall. In the opposite scenario, if the robot successfully walks to the target, we likewise cannot directly say which actions in its trajectory led to that result. Determining which factors really contributed to the outcome, and assigning them the credit they deserve, is at the heart of this challenge. Hence the name Credit Assignment Problem.
Basically, the credit assignment problem has two forms: one spatial (space) and one temporal (time).
- Temporal assignment: this is the case I described in the previous paragraph. In a long trajectory of actions, we cannot know which specific actions or decisions in the past led to a particular result in the future, which makes it difficult to find the best policy to teach the agent.
- Spatial assignment: this case does not arise in a single-agent setup, because its primary domain is the multi-agent setup. There, multiple agents act together, collaboratively or competitively depending on how we define the problem. When a particular result happens to all the agents in the system, we cannot know who the true culprit is, assuming we only care about which agent is responsible (and not which specific action in that agent’s trajectory, which would be a space-time assignment problem and much harder).
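For the temporal case, one common heuristic (a heuristic, not a solution to the problem) is to spread the final outcome backwards through time with a discount factor, so that every step gets a share of the blame or credit. The sketch below applies it to the falling robot: 100 steps, a reward of -1 only at the final step, and a discount of 0.95. All of these numbers are assumptions chosen for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return-to-go for every time step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# The robot gets no feedback for 99 steps, then falls at step 100.
rewards = [0.0] * 99 + [-1.0]
returns = discounted_returns(rewards, gamma=0.95)

print(f"blame at step 100: {returns[99]:.4f}")
print(f"blame at step 20:  {returns[19]:.4f}")
```

Discounting simply assumes that recent actions matter more. If the real mistake was made at step 20, this scheme assigns it very little blame, which is precisely why credit assignment remains hard.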
Exploration-Exploitation trade-off
Another thing I find quite exciting is the Exploration-Exploitation trade-off. In SL, the model just tries to fit the data you give it. But in RL, the agent faces a dilemma: should it stick with the best action it knows so far to get a guaranteed reward (Exploitation), or should it try something completely random and potentially stupid to see if there is a better path it hasn’t discovered yet (Exploration)?
We often use a simple strategy like $\epsilon$-greedy to handle this:
\[\pi(a|s) = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s,a) & \text{with probability } 1-\epsilon \end{cases}\]
This reminds me of thermal noise in physics: sometimes you need a bit of heat or randomness in the system to jump out of a local trap and find the true global optimum. For someone who likes the smooth shape and continuous representations of physics, watching an agent navigate these high-dimensional policy spaces and solve these hurdles is much more satisfying than just fitting a curve to a static cloud of points.
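The $\epsilon$-greedy rule above fits in a few lines of Python. This is a minimal sketch: the function name and the example Q-values are made up, and the random branch picks uniformly among all actions (including the greedy one), which is one common convention.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from a list of Q-values for the current state."""
    if rng.random() < epsilon:
        # Explore: any action, chosen uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical Q(s, a) for three actions

picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(10_000)]
greedy_fraction = picks.count(1) / len(picks)
print(f"fraction of greedy picks: {greedy_fraction:.3f}")
```

In practice $\epsilon$ is often decayed over training: a lot of heat at the start, then gradual cooling as the estimates become trustworthy.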
This is the end of my first technical blog post on my personal website. If you (I don’t know whether anyone will actually discover this place) are also interested in reinforcement learning (especially deep reinforcement learning, I like deep things) and diffusion models, or just one of them, feel free to contact me via the email and GitHub profile on my page. Below are some resources that you may find useful and interesting for your learning journey.
Resources to look at
If you’re interested in this stuff, I really recommend checking out these labs and blogs. They helped me bridge the gap between standard AI and the more advanced research that’s happening right now:
1. Research Labs and Academic Blogs
- UC Berkeley BAIR Lab: This is a gold mine for anyone into robotics and RL. It is probably the strongest lab in robotics, home to very prominent names in robotics and reinforcement learning such as Sergey Levine.
2. Blog Posts
- Lilian Weng’s Blog (Lil’Log): Lilian is an expert at summarizing massive amounts of research. Her posts are like the ultimate cheat sheets for technical summaries on RL and many other fields of research. She did research at OpenAI and is currently working at Thinking Machines Lab with John Schulman, one of the pioneers of reinforcement learning at OpenAI (he is the father of both Trust Region Policy Optimization and Proximal Policy Optimization).
Endings
This blog is just my personal thoughts on reinforcement learning and how I am fascinated by its magic tricks, especially compared to our old friend supervised training. That being said, I bear no ill will towards supervised training. I only say that in our long journey of creating something intelligent, supervised training is definitely not our ultimate weapon. Learning is more complicated than we thought. Optimizing parameters against the ground truth of a dataset may be just one form of the thing we call learning. Reinforcement learning has seen massive progress, with milestones such as AlphaGo and AlphaTensor (Deep Blue and AlphaFold are often mentioned alongside them, although Deep Blue relied on search rather than learning and AlphaFold is trained mostly with supervision). Google DeepMind seems to have a strong affection for the letter $\alpha$ when naming its innovations. Perhaps one day I will be a researcher contributing to these magnificent advancements.