I undertook a slightly ambitious exploration of machine learning approaches to predicting where proteins localize within cells. Localization is an important part of understanding how proteins function and how cells are organized. For this, I used UniProt's comprehensive dataset of human proteins, extracted through their API. I restricted the data to human proteins only, and to the smaller reviewed subset, to make sure the data was high quality. In short, I tried to start from the best data available. Ideally, a good protein localization model could significantly inform drug discovery and basic research.

The data analysis was pretty involved. I started by pulling data from UniProt using their API, but noticed there were quite a few missing values for protein locations. To fix this, I used the Human Protein Atlas database to fill in the gaps, which reduced missing values by 56%, a big improvement. I then did some exploratory analysis of amino acid patterns. The PCA was interesting: it took 17 components to explain 95% of the variance, which shows that the variation in the data is spread across many dimensions rather than a few dominant ones. I also tried different clustering approaches to see if proteins naturally group based on their amino acid makeup.
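As a rough illustration of the gap-filling and feature steps described above, here is a minimal sketch using only the Python standard library. The record layout, the protein IDs, the toy sequence, and the helper names (`fill_missing_locations`, `missing_reduction`, `aa_composition`) are all made up for illustration; they are not the actual UniProt or Human Protein Atlas response formats, and this is a sketch under those assumptions rather than the exact pipeline.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def fill_missing_locations(primary, fallback):
    """Return a copy of `primary` where missing (None) locations are
    filled from `fallback` when the fallback has an entry for that ID."""
    filled = dict(primary)
    for protein_id, location in primary.items():
        if location is None and fallback.get(protein_id):
            filled[protein_id] = fallback[protein_id]
    return filled


def missing_reduction(before, after):
    """Percent reduction in the number of missing locations."""
    n_before = sum(1 for v in before.values() if v is None)
    n_after = sum(1 for v in after.values() if v is None)
    if n_before == 0:
        return 0.0
    return 100.0 * (n_before - n_after) / n_before


def aa_composition(sequence):
    """Fraction of each standard amino acid in a sequence (20 values)."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy, made-up records (hypothetical IDs and labels)
primary = {"P1": "Nucleus", "P2": None, "P3": None, "P4": "Cytoplasm"}
fallback = {"P2": "Membrane"}

filled = fill_missing_locations(primary, fallback)
reduction = missing_reduction(primary, filled)
features = aa_composition("MKKLLA")
```

Composition vectors like `features` are one plausible input for the PCA and clustering steps; they discard residue order, which is part of why sequence-order signals can be lost.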

When it came to the models, I tested quite a few approaches: Support Vector Machines (SVM), Decision Trees, Naïve Bayes, Regression, and ensemble methods like XGBoost and Bagging. Most of them ended up with similar accuracy, around 39-40%. What’s really interesting is that all the models were better at predicting nuclear proteins than proteins in other locations. The ensemble methods (XGBoost and Bagging) both hit about 39.6% accuracy, and SVM with different kernels was pretty similar. Since such different approaches converged on similar results, this might be close to as good as these kinds of models can get using sequence data alone.
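An overall accuracy near 40% is easier to read next to a majority-class baseline and a per-class breakdown. The sketch below shows one way to compute both in plain Python; the labels and predictions are toy values made up for illustration, not results from this project, and the function names are my own.

```python
from collections import Counter, defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its examples predicted correctly."""
    hits = defaultdict(int)
    totals = Counter(y_true)
    for t, p in zip(y_true, y_pred):
        if t == p:
            hits[t] += 1
    return {label: hits[label] / n for label, n in totals.items()}


# Toy labels, made up for illustration only
y_true = ["Nucleus", "Nucleus", "Nucleus", "Nucleus",
          "Cytoplasm", "Cytoplasm", "Cytoplasm",
          "Membrane", "Membrane", "Membrane"]
y_pred = ["Nucleus", "Nucleus", "Nucleus", "Cytoplasm",
          "Nucleus", "Nucleus", "Cytoplasm",
          "Nucleus", "Membrane", "Nucleus"]

overall = accuracy(y_true, y_pred)
baseline = majority_baseline(y_true)
recall = per_class_recall(y_true, y_pred)
```

If one class dominates a dataset, a model that mostly predicts that class can land near the baseline while only doing well on that class, so the per-class numbers are worth checking alongside the headline accuracy.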

Biology is a little too complex for some of these models. The biggest challenge was that proteins don’t always stay in one place in the cell. Nuclear proteins were easier to predict because they have some clear patterns in their sequence, while proteins in other places, like the cytoplasm, were harder to figure out. This makes sense biologically: many proteins move between different parts of the cell, and some are found in multiple locations. The amino acid sequence alone gives us some good information, but it’s probably not enough to catch all the subtle signals that tell a protein where to go.

Figure 1: A 3D plot of one of the k-means clusterings, the one formed with k=3.

Looking ahead, there are a few ways this work could be improved. First, we could try adding more types of data beyond just amino acid sequences. Things like protein structure or modifications could help make better predictions. Second, instead of trying to predict all locations at once, we might do better making specialized models for each location, especially for nuclear proteins since they have clearer patterns. Finally, deep learning might be worth trying – it’s good at finding complex patterns that simpler models might miss. The main thing I learned is that protein localization is really complex, and we probably need more than just sequence data to predict it accurately.
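The second idea, one specialized model per location, amounts to a one-vs-rest setup. Here is a minimal sketch of building the binary targets, assuming each protein comes with a set of annotated locations. The location names, the toy annotations, and the helper name `one_vs_rest_targets` are hypothetical.

```python
def one_vs_rest_targets(annotations, locations):
    """Build one binary target list per location.

    `annotations` is a list of sets, one per protein, holding every
    location that protein is annotated with. A protein counts as a
    positive (1) for each location in its set, so proteins found in
    several compartments are positives for several models.
    """
    return {
        loc: [1 if loc in protein_locs else 0 for protein_locs in annotations]
        for loc in locations
    }


# Toy annotations, made up for illustration
annotations = [
    {"Nucleus"},
    {"Nucleus", "Cytoplasm"},  # a protein found in two places
    {"Membrane"},
    {"Cytoplasm"},
]
locations = ["Nucleus", "Cytoplasm", "Membrane"]

targets = one_vs_rest_targets(annotations, locations)
# Each targets[loc] could then serve as the label vector for its own binary classifier
```

A side benefit of this framing is that a protein annotated with several locations no longer has to be forced into a single class.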

Figure 2: A visualization of a deep neural network architecture with multiple hidden layers. The network shows how complex patterns can be learned through multiple layers of processing, making it particularly suitable for analyzing the intricate patterns in protein sequences.

source: https://ecraft2learn.github.io/ai/AI-Teacher-Guide/chapter-6.html

Things I have also tried

Although not shown in this work, I’ve also used LSTM networks, with pretty promising results. There’s also potential in using transformer models to create protein embeddings. Deep learning seems better suited to this kind of problem because it can pick up on complex patterns in protein sequences that simpler models might miss. The multiple layers shown in Figure 2 help the network learn increasingly complex features, kind of like how proteins have different levels of structure that all contribute to where they end up in the cell. Deep learning looks like the most promising way to push protein localization prediction accuracy beyond what the traditional machine learning models achieved here.
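Sequence models such as LSTMs usually take integer-encoded, fixed-length inputs rather than summary features. Here is a minimal sketch of that preprocessing step, assuming a vocabulary of the 20 standard amino acids, index 0 reserved for padding, and one extra index for any other character. The `max_len` value and the toy sequences are arbitrary, and this is a sketch under those assumptions rather than the exact setup from my experiments.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_INDEX = 0
UNKNOWN_INDEX = len(AMINO_ACIDS) + 1  # 21, for non-standard residues
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 1..20


def encode_sequence(sequence, max_len):
    """Map residues to integers, then truncate or right-pad to `max_len`."""
    encoded = [AA_TO_INDEX.get(aa, UNKNOWN_INDEX) for aa in sequence.upper()]
    encoded = encoded[:max_len]
    return encoded + [PAD_INDEX] * (max_len - len(encoded))


# Toy sequences, made up for illustration
batch = [encode_sequence(seq, max_len=6) for seq in ["MKV", "ACDXEFGHIK"]]
```

Truncation here keeps the start of each sequence; which end to keep, and how long `max_len` should be, are design choices worth testing rather than fixed rules.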