
The Bitter Lesson of Domain Knowledge

By Charles K. Fisher

March 11, 2024

Since the beginning of the field, machine learning researchers have been tempted by the alluring idea of baking pre-existing domain knowledge directly into a model’s architecture. This idea lives on today in applications of knowledge graphs, symbolic AI, and feature engineering. In general, I don’t think this is a good idea and, in fact, I think there’s a much better way to provide a model with domain knowledge.

In my opinion, there are at least three reasons why the idea of building pre-existing domain knowledge into a model is alluring. First, it feels like a model is more interpretable if you force a particular structure on it. Second, it makes you feel clever when you find a neat architecture trick that makes your model better. And, third, it does, in fact, usually make your model better, but only in the short run. The first two reasons are delusions, although I guarantee that I, like every other machine learning researcher who’s been at it for a while, have fallen for them. The third reason, however, is worth exploring further.

To paraphrase Richard Sutton, the bitter lesson of AI that has repeatedly presented itself over the past 30 years is that unconstrained architectures eventually surpass architectures that incorporate specific domain knowledge once the training dataset and compute are large enough. Therefore, an architecture that incorporates pre-existing domain knowledge may perform better today, but probably not in a year, and definitely not in a decade.

Creating architectures that incorporate pre-existing domain knowledge solves a today problem by creating a tomorrow problem.

Obviously, injecting domain knowledge into the architecture of a model is bad if that domain knowledge is actually wrong. Even though it’s obvious, this has happened frequently in the history of machine learning for biology, particularly in applications of knowledge graphs built from the literature. The literature says that increasing protein A decreases protein B, so I put that edge in my knowledge graph, but it later turns out the experiment wasn’t performed correctly and the fact isn’t true at all. That’s a clear failure mode and, unfortunately, it can hide for a long time if that part of the knowledge graph isn’t needed for the examples encountered during training. I may not find out until my faulty knowledge graph surprises me and messes up my model’s predictions when I actually need them most.

A second problem with incorporating domain knowledge into the model architecture is that, even if the knowledge is correct, it may not be obvious how to encode it directly into the model. I know that at a four-way stop, the car that came to a stop first has the right of way. But that’s not true if one of the other vehicles is an ambulance with its siren on, for example. Or what if I came to a stop first but there’s a group of pedestrians crossing the street in front of me? Can one of the other cars go instead? What if another car is speeding down the road like a madman and doesn’t look like it’s going to stop? Should I go anyway and tempt fate? The list of possibilities goes on. How do I code all of this up into some kind of machine learning architecture with the right pre-existing knowledge?
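To make the difficulty concrete, here is a minimal sketch of what hand-coding that rule might look like. Everything in it, the scenario fields, the function, is my own hypothetical illustration, not a real system; the point is that every exception becomes another branch, and the branches never end.

```python
from dataclasses import dataclass

@dataclass
class Vehicle:
    arrival_order: int             # 1 = came to a stop first
    is_ambulance: bool = False
    siren_on: bool = False
    appears_to_be_stopping: bool = True

def i_have_right_of_way(me: Vehicle, others: list[Vehicle],
                        pedestrians_crossing: bool) -> bool:
    """Naive hand-coded right-of-way rule for a four-way stop."""
    # Exception: an ambulance with its siren on always goes first.
    if any(v.is_ambulance and v.siren_on for v in others):
        return False
    # Exception: pedestrians in the crosswalk take priority over me.
    if pedestrians_crossing:
        return False
    # Exception: a car that isn't going to stop makes my "turn" moot.
    if any(not v.appears_to_be_stopping for v in others):
        return False
    # Base rule: the first car to stop has the right of way.
    return all(me.arrival_order < v.arrival_order for v in others)
    # ...and we still haven't handled cyclists, simultaneous arrivals,
    # hand signals, blocked intersections, and so on.
```

Every clause is a judgment call, and the real world keeps supplying cases that the next clause can’t cover.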

Even though it’s difficult for me as a human to formulate my knowledge in a precise mathematical way that I can force onto a machine learning architecture, it’s actually quite easy to generate examples. I don’t know how to precisely describe all the possibilities in the four-way stop problem, but I can easily provide some examples that illustrate the concept. By providing a model with these generated examples during training, I can teach it my domain knowledge without forcing that knowledge into the architecture. This is the right way to inject domain knowledge into a machine learning model, in my opinion.
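Here is a minimal sketch of what that looks like in practice, with entirely hypothetical names and features: instead of an exhaustive rule, I write down a handful of cases I’m confident about and let randomization fill in the details that shouldn’t matter.

```python
import random

rng = random.Random(0)

# Hand-authored cases I'm confident about: (scenario, should_I_go).
# The domain knowledge lives in these examples, not in the model.
seed_examples = [
    ({"arrived_first": 1, "ambulance_siren": 0, "pedestrians": 0}, 1),
    ({"arrived_first": 1, "ambulance_siren": 1, "pedestrians": 0}, 0),
    ({"arrived_first": 1, "ambulance_siren": 0, "pedestrians": 1}, 0),
    ({"arrived_first": 0, "ambulance_siren": 0, "pedestrians": 0}, 0),
]

def perturb(scenario: dict) -> dict:
    """Randomize features that shouldn't change the label, so the
    model learns to ignore them."""
    out = dict(scenario)
    out["raining"] = rng.randint(0, 1)
    out["night_time"] = rng.randint(0, 1)
    return out

# Expand a few confident cases into a large synthetic training set.
synthetic_data = [(perturb(s), label)
                  for s, label in seed_examples
                  for _ in range(250)]
```

Any unconstrained classifier can then be trained on these pairs, with no four-way-stop logic wired into its architecture.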

By using synthetic data, I’m acting like a teacher and my model is acting like a student. I’m teaching it to make certain types of predictions that agree with my pre-existing domain knowledge, but I’m not constraining how it makes those predictions. This immediately circumvents the second problem of having to figure out how to turn my knowledge into a precise algorithm. I no longer have to do that; all I have to do is generate examples, which is usually pretty easy. It also helps mitigate the first problem, because I can always delete bad training examples and retrain the model if I have to. And if I gain new knowledge, I can simply create new examples to add to the training set and, voilà, the model will have that knowledge too. Even better, if there’s a new architectural or methodological breakthrough in machine learning, I can leverage it immediately by training the new architecture on my synthetic examples to get a state-of-the-art model that incorporates my domain knowledge. This way, I’m never left behind.
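As a sketch of the workflow this enables, with toy data and scikit-learn models standing in for whatever unconstrained architecture you prefer: bad knowledge is a delete, new knowledge is an append, and a new architecture is just another retrain on the same curated examples.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

def retrain(model_factory, real_data, synthetic_data):
    """Retrain an unconstrained model from scratch on real plus
    curated synthetic examples; nothing is baked into the architecture."""
    X, y = zip(*(real_data + synthetic_data))
    model = model_factory()
    model.fit(list(X), list(y))
    return model

# Toy placeholders for real and synthetic (feature_vector, label) pairs.
real_data = [([0.0, 1.0], 0), ([1.0, 0.0], 1)]
synthetic_data = [([1.0, 1.0], 0), ([0.5, 0.5], 1)]

model = retrain(LogisticRegression, real_data, synthetic_data)

# An example turns out to encode bad knowledge? Delete it and retrain.
synthetic_data = [ex for ex in synthetic_data if ex != ([0.5, 0.5], 1)]
# New knowledge arrives? Generate new examples and retrain.
synthetic_data.append(([0.2, 0.9], 0))
# Architectural breakthrough? Swap the model; the knowledge comes along.
model = retrain(GradientBoostingClassifier, real_data, synthetic_data)
```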

To be more succinct, I think it’s better to use domain knowledge to create synthetic examples for training machine learning models with unconstrained architectures than to try to directly incorporate that knowledge into the architecture itself.
