The Importance of Good Models and Better Data
I remember learning about linear and quadratic regression in middle school. Bored out of my mind in my after-school program, I took out a periodic table and typed every element’s atomic number and corresponding atomic weight into my graphing calculator. I tried fitting both a linear and a quadratic regression, but neither equation was very accurate; as I would find out years later, the relationship is much better described by a power-law fit. Little did I know then that linear regression is a basic yet powerful technique in machine learning. Essentially, machine learning aims to infer patterns from data and make predictions, something that linear regression does quite well.
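Out of curiosity, here’s roughly what that comparison looks like in code. This is just a quick sketch with a handful of elements typed in by hand (not my original calculator session), fitting linear, quadratic, and power-law curves with NumPy:

```python
import numpy as np

# A few hand-picked (atomic number, atomic weight) pairs: H, C, Fe, Ag, Au, U.
z = np.array([1, 6, 26, 47, 79, 92], dtype=float)
w = np.array([1.008, 12.011, 55.845, 107.868, 196.967, 238.029])

linear = np.polyfit(z, w, 1)      # w ~ a*z + b
quadratic = np.polyfit(z, w, 2)   # w ~ a*z^2 + b*z + c

# Power law: fit log(w) ~ p*log(z) + log(k), i.e. w ~ k * z^p
p, log_k = np.polyfit(np.log(z), np.log(w), 1)

for name, pred in [
    ("linear", np.polyval(linear, z)),
    ("quadratic", np.polyval(quadratic, z)),
    ("power law", np.exp(log_k) * z ** p),
]:
    rmse = np.sqrt(np.mean((pred - w) ** 2))
    print(f"{name}: RMSE = {rmse:.2f}")
```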
Machine learning algorithms are everywhere and come in many forms. For instance, the book "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville explains how logistic regression can be used to recommend cesarean deliveries, while a naive Bayes classifier might be better suited to tasks like spam detection. The choice of algorithm depends heavily on how the data is represented.
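To make the spam example concrete, here’s a rough sketch with a tiny made-up dataset: the same bag-of-words representation from scikit-learn’s CountVectorizer feeds both a naive Bayes classifier and logistic regression.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# A made-up toy dataset: 1 = spam, 0 = not spam.
messages = [
    "win a free prize now",
    "limited time offer, click here",
    "are we still on for lunch",
    "see you at the meeting tomorrow",
]
labels = [1, 1, 0, 0]

# The representation: raw text -> word-count vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

for model in (MultinomialNB(), LogisticRegression()):
    model.fit(X, labels)
    pred = model.predict(vectorizer.transform(["claim your free offer now"]))
    print(type(model).__name__, "-> spam" if pred[0] == 1 else "-> not spam")
```

Swap in a different representation (TF-IDF, character n-grams, embeddings) and each model’s performance shifts, which is exactly the point: the representation matters as much as the algorithm.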
An old acronym from the US Navy, KISS (Keep It Simple, Stupid), is something I’ve embraced in problem-solving. It’s incredibly relevant in machine learning. You can craft a sophisticated model with cutting-edge methods, but if your data is flawed, you won’t get far. As I delve deeper into machine learning, I sometimes forget this, chasing after fancy architectures and new libraries. It’s important to occasionally step back and remember the basics.
For example, imagine building a model to predict someone’s age. We might hypothesize that older individuals use a more complex vocabulary and discuss serious issues, while younger people use slang and talk about casual topics. However, I’ve heard older people say things like “the steaks are kinda mid but the burgers hit different” and “sheeeesh, you slay queen,” so this feature alone might not be reliable. We could consider athletic data, but athleticism varies widely within age groups. Appearance features like wrinkles, facial structure, and hair color might be more indicative of age, but even these can vary by race.
To create an effective machine learning model, we need to choose datasets wisely. Athletic ability might not correlate well with age, so we could focus on linguistic and appearance features. Alternatively, we might use just facial features if we have enough data. The success of the model depends heavily on the quality, quantity, and relevance of the collected data.
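One quick way to check these hunches is to score each candidate feature group on how well it predicts age. Here’s a toy sketch with synthetic data, where the hypothetical “appearance” features are deliberately built to track age and the “athletic” ones mostly aren’t:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(10, 80, n)

feature_groups = {
    # Appearance features loosely tied to age (plus noise), e.g. wrinkle score, greyness.
    "appearance": np.column_stack([
        age * 0.8 + rng.normal(0, 5, n),
        age * 0.5 + rng.normal(0, 10, n),
    ]),
    # Athletic features only weakly tied to age.
    "athletic": np.column_stack([
        rng.normal(50, 10, n) - age * 0.05,
        rng.normal(0, 1, n),
    ]),
}

for name, X in feature_groups.items():
    score = cross_val_score(RandomForestRegressor(n_estimators=50), X, age, cv=5).mean()
    print(f"{name}: mean R^2 = {score:.2f}")
```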
While an age prediction model might be straightforward, other models can be more complex. In a recent project, I worked on predicting the binding affinities between molecules and proteins. For each molecule, my dataset included the SMILES strings of its three building blocks, the full molecule’s SMILES string, the target protein, and a Boolean value indicating whether the pair binds. I decided to exclude the building block SMILES strings, thinking the full SMILES string was sufficient since it contained all the information the building blocks did. I converted the full SMILES strings into graph objects and used graph neural networks to make predictions.
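For reference, here’s a simplified sketch of that kind of pipeline, assuming RDKit to parse SMILES and PyTorch Geometric for the graph neural network; the atom features are stripped down to just atomic numbers to keep it short.

```python
import torch
from rdkit import Chem
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, global_mean_pool

def smiles_to_graph(smiles: str, label: float) -> Data:
    mol = Chem.MolFromSmiles(smiles)
    # One node per atom, featurized by its atomic number.
    x = torch.tensor([[a.GetAtomicNum()] for a in mol.GetAtoms()], dtype=torch.float)
    # One pair of directed edges per bond.
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges += [[i, j], [j, i]]
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    return Data(x=x, edge_index=edge_index, y=torch.tensor([label]))

class BindingGNN(torch.nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(1, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.out = torch.nn.Linear(hidden, 1)

    def forward(self, data):
        h = self.conv1(data.x, data.edge_index).relu()
        h = self.conv2(h, data.edge_index).relu()
        batch = getattr(data, "batch", None)   # set by the DataLoader when batching
        h = global_mean_pool(h, batch)         # one embedding per molecule
        return self.out(h).squeeze(-1)         # binding logit
```

In training, graphs like these would go through torch_geometric’s DataLoader with a binary cross-entropy loss against the binding labels.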
This novel approach was exciting, but I overlooked the importance of the building blocks. Many molecules shared one or two building blocks, and by encoding that information explicitly, I could have inferred a lot about the binding relationships. My graph neural network models had no way to capture this, which made generalization difficult. I was so focused on the neural network architectures that I lost sight of using the data effectively. That mistake cost me around 50 hours of creating, debugging, and training the model. Had I taken the time to understand the data better, I could have built a more effective model. It sucks to not make progress, but you live and you learn.
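In hindsight, even a much simpler model that treats the three building blocks and the protein as categorical IDs with learned embeddings would have let molecules that share building blocks share parameters. Here’s a hypothetical sketch of that idea (all names and dimensions are made up):

```python
import torch

class BuildingBlockModel(torch.nn.Module):
    def __init__(self, num_blocks: int, num_proteins: int, dim: int = 32):
        super().__init__()
        # One learned vector per building block and per protein.
        self.block_emb = torch.nn.Embedding(num_blocks, dim)
        self.protein_emb = torch.nn.Embedding(num_proteins, dim)
        self.head = torch.nn.Sequential(
            torch.nn.Linear(4 * dim, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )

    def forward(self, block_ids, protein_id):
        # block_ids: (batch, 3) integer IDs; protein_id: (batch,) integer IDs.
        blocks = self.block_emb(block_ids).flatten(start_dim=1)  # (batch, 3*dim)
        protein = self.protein_emb(protein_id)                   # (batch, dim)
        return self.head(torch.cat([blocks, protein], dim=1)).squeeze(-1)  # binding logit
```

The block and protein IDs would just come from label-encoding the building block SMILES strings and protein names already sitting in the dataset.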