
Yesterday, I came across “The Bitter Lesson”, an essay by Richard Sutton. I immediately recognized the name: he is the author of my favourite book, Reinforcement Learning: An Introduction. Along with his co-author Andrew Barto, he is often described as a godfather of reinforcement learning because of his contributions to the field, especially that foundational book, which many consider the bible for anyone interested in reinforcement learning. The book covers the most important material in the development of the field, from its basic form in tabular methods to modern approaches that use deep neural networks as feature extractors for observations.

In the essay, Sutton points out something subtle yet important about decades of artificial intelligence research. Features and knowledge that humans build into a system by hand yield only modest gains, and mostly in the short term. As time passes and computation becomes cheaper, more general, computation-hungry methods turn out to scale better and produce much better results. They need far more computation and look more like black boxes, but they give the model, especially something as complicated and dynamic as a neural network, a large space of representations that it can gradually learn from the data itself rather than from human-crafted features. History has a long record showing that this observation is hard to accept, yet it keeps turning out to be true.

After reading the essay, I kept thinking about why this observation seems intrinsically true, in a way we might deduce by reasoning rather than only observe in practice. One reason is that when we handcraft features, we cannot talk to a computer as expressively as we talk to other people. It cannot directly understand the ideas, thoughts, or clues that we believe would help solve the problem, such as a game of chess or other games that need extensive logical reasoning. To get around this, we have to encode them somehow, with one-hot encoding or any other encoding method we can think of. But as the number of parameters in the model grows, the space it can represent becomes much larger than anything we can think of. Our handcrafted clues therefore occupy a very small and, painful as it is to say, almost negligible part of that vast space. Instead, something far more informative may emerge, something we cannot explain or see directly by inspecting the network. Those features are what the network actually learned during training, not the features we injected.
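To make the encoding step concrete, here is a minimal sketch of one-hot encoding a handcrafted feature. The piece vocabulary, the four-square "board", and the hidden-layer width of 256 are all assumptions I made up for illustration; none of them come from Sutton's essay or from any real chess engine.

```python
# Hypothetical vocabulary of square contents (illustrative only).
PIECES = ["empty", "pawn", "knight", "bishop", "rook", "queen", "king"]


def one_hot(piece: str) -> list[int]:
    """Return a vector with a single 1 at the index of `piece`."""
    vec = [0] * len(PIECES)
    vec[PIECES.index(piece)] = 1
    return vec


def encode_board(squares: list[str]) -> list[int]:
    """Concatenate the one-hot vector of every square into one flat vector."""
    features: list[int] = []
    for piece in squares:
        features.extend(one_hot(piece))
    return features


# A toy 4-square "board", assumed purely for the example.
board = ["rook", "empty", "empty", "king"]
features = encode_board(board)

# Size of what we handcrafted: 4 squares * 7 categories.
n_inputs = len(features)

# Weights and biases of one dense hidden layer fed by those inputs.
# The width of 256 is an arbitrary assumption.
hidden_units = 256
n_params = n_inputs * hidden_units + hidden_units

print(n_inputs)   # 28
print(n_params)   # 7424
```

Even in this toy setting, the single hidden layer already has far more adjustable numbers than the handful of values we encoded by hand, which is the imbalance the paragraph above is pointing at.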

Second, since we have so many problems to solve, injecting task-specific features and clues into the network is too narrow and does not generalize. Such features are tied tightly to the problem at hand, so they do not help the network solve problems whose settings and assumptions differ from those of the original experiment, even when those problems share the same logical structure or some common latent representation, as chess and Go might. Because our features are sometimes too complicated for a machine to understand, we also cannot be sure that the encoding we use to pass that information to the network is abstract enough. Some features that we inject as a handful of numbers might really need much larger vectors or matrices. So there are more ways for handcrafted features to fail than we might expect. To name a few:

Are the features we handcraft and inject into the model actually correct, or are they misleading or short-sighted in a way that steers the model in the wrong direction and leads to much worse results?
Are the representations of these features rich and informative enough for the model to grasp their meaning? The features themselves may be correct, but what about their representation?
Finally, the dynamics of the game, the environment, or the data distribution, whatever we are dealing with, might not be interpreted the same way by us and by the machine. Even when the meaning is the same, our interpretation of the features and the computer’s may take entirely different forms.

These are the reasons and failure modes I can think of for why such methods break down. There are others, of course, such as scalability: a method that requires handcrafting every time we meet a new problem neither scales nor transfers to other problems. But here I am focusing on the abstract reasoning and meaning behind the features, so I am setting such technical matters aside for now.

Final thoughts

The bitter lesson that Sutton describes is largely about handcrafted features. As large language models keep growing, I sometimes think that the next bitter lesson, one that points our thinking about intelligence in a new direction, might be hidden in the model’s number of parameters. Although I don’t have much experience with complicated architectures and their topologies, I suspect that the returns from adding parameters might saturate at some point, after which adding more would not yield much better performance. If so, further progress would have to come from the network’s architecture and its mechanism for learning representations from data. There may be more effective architectures waiting to be discovered. As architectures grow more complicated in both size and the connections between modules, fields that study general structure, such as topology, might give us some insight into this messy, dynamic, high-dimensional space of neural architectures.

Then there are the objective functions, known as loss functions in traditional supervised learning. “Objective function” is the more general term, since it does not say which criterion we optimize, so the optimization may be a maximization or a minimization depending on how the function is defined and what role it plays in the problem. To guide our models toward better results on a task, we need a good way to convey such an abstract idea to our bits-and-bytes machines. If the idea is expressed wrongly, it can undermine what we want from the model and steer it in the wrong direction. An objective function acts as a proxy for our true intentions. It may never capture them exactly, however hard we try, but we can still convey as much information as possible and leave the model little or no room to hack it. By hacking I mean that the model reaches very low loss values and high evaluation scores yet performs poorly on real data, or takes actions that are unexpected and, in the worst case, harmful.
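The gap between a proxy objective and the real intention can be shown with a deliberately silly toy. In this sketch, which is entirely my own assumption (the even-number rule, the `Memorizer` class, and the data are invented for illustration), a "model" that simply memorizes its training pairs drives the training error to zero while learning nothing about the rule we actually cared about.

```python
def true_rule(x: int) -> int:
    """The behavior we really want: label a number 1 if it is even, else 0."""
    return 1 if x % 2 == 0 else 0


class Memorizer:
    """A hypothetical 'model' that stores training pairs in a lookup table."""

    def __init__(self) -> None:
        self.table: dict[int, int] = {}

    def fit(self, xs: list[int], ys: list[int]) -> None:
        self.table = dict(zip(xs, ys))

    def predict(self, x: int) -> int:
        # Unseen inputs fall back to 0: nothing about "evenness" was learned.
        return self.table.get(x, 0)


def error_rate(model: Memorizer, xs: list[int], ys: list[int]) -> float:
    """Fraction of inputs on which the model disagrees with the label."""
    wrong = sum(1 for x, y in zip(xs, ys) if model.predict(x) != y)
    return wrong / len(xs)


train_x = [1, 2, 3, 4, 5, 6]
train_y = [true_rule(x) for x in train_x]
test_x = [7, 8, 9, 10, 12, 14]
test_y = [true_rule(x) for x in test_x]

model = Memorizer()
model.fit(train_x, train_y)

train_error = error_rate(model, train_x, train_y)  # the proxy we optimized
test_error = error_rate(model, test_x, test_y)     # what we actually care about

print(train_error)  # 0.0
print(test_error)   # 4 of 6 unseen inputs are wrong
```

The training error is a perfect score on the proxy, yet the model fails on most unseen inputs. Real cases are subtler, but the shape of the failure is the same: the objective was satisfied without the intention being met.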

References

You can find the original essay by Richard Sutton here: The Bitter Lesson
