Ensemble – MLMichael

This ML technique is where many models are combined together in an ‘ensemble’ to improve overall performance.Though the aggregation of predictions from multiple models, robustness is increased. Some of the methods used are bagging, boosting, stacking, and voting. Boosting trains models sequentially, where each subsequent model corrects the errors of the previous one, while bagging trains models in parallel using random subsets of the data. .

Example
Boosting is as technique where models are trained sequentially where the error of the prior model is corrected by the next model that will take in the weighted data points with higher weight assigned to misclassification. The goal is that by the end of the training process the ensemble of models would have leant to accurately predictive model.

Bagging uses a different method where models are trained in parallel(not sequential) using random subsets of the training data through bootstrap sampling. In this ensemble each model is trained separately, and their predictions are combined through voting/averaging to create to bus model that reduces variance. In short, the predictions are aggregated from all individual models hence bootstrap aggregating or bagging.

Here the figure shows the parallel training process in bagging, where models are trained independently on random subsets of data, compared to the sequential process in boosting, where each model corrects the errors of its predecessor.

Source: (https://medium.com/@senozanAleyna/ensemble-boosting-bagging-and-stacking-machine-learning-6a09c31df778)

Code

1st Model

XGboost

The python code below applies XGBoost.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier

file_path = "C:/Users/GAMING PC/Downloads/for_SVM/processed_V1.xlsx"  
df = pd.read_excel(file_path)


X = df['Encoded_Sequence'].apply(lambda x: eval(x)).tolist() 
y = df['Encoded_Location']


from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

xgb = XGBClassifier(eval_metric='mlogloss', random_state=42)
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='pink', cbar=True)
plt.title("Confusion Matrix - XGBoost", fontsize=14)
plt.xlabel("Predicted", fontsize=12)
plt.ylabel("Actual", fontsize=12)
plt.show()

feature_importance = xgb.feature_importances_
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importance)), feature_importance, color='pink')
plt.title("Feature Importance - XGBoost", fontsize=14)
plt.xlabel("Feature Index", fontsize=12)
plt.ylabel("Importance", fontsize=12)
plt.show()

Results for XGBoost

Note: I explored two modifications to address the class imbalance issue. First, I applied SMOTE-based oversampling, which generates synthetic samples for the minority classes to balance the training data. Second, I implemented cost-sensitive learning by adjusting class weights to penalize misclassifications of underrepresented classes. Both approaches aimed to improve the model’s ability to classify minority classes, such as Cytoplasm, Cell Membrane, and Secreted proteins. However, neither modification resulted in a significant improvement, with the accuracy remaining approximately 39.5%. This suggests that the feature representation or inherent separability of the data may be the primary limitations, rather than the class imbalance alone.

2nd model

Bagging

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier 

file_path = "C:/Users/GAMING PC/Downloads/for_SVM/processed_V1.xlsx"
df = pd.read_excel(file_path)


X = df['Encoded_Sequence'].apply(lambda x: eval(x)).tolist()
y = df['Encoded_Location']

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100, 
    max_samples=0.8,  
    max_features=0.8, 
    random_state=42
)
bagging.fit(X_train, y_train)


y_pred = bagging.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

plt.figure(figsize=(6, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), 
            annot=True, 
            fmt='d', 
            cmap='Greens', 
            cbar=True)
plt.title("Confusion Matrix - Bagging Classifier", fontsize=14)
plt.xlabel("Predicted", fontsize=12)
plt.ylabel("Actual", fontsize=12)
plt.show()

feature_importance = np.mean([tree.feature_importances_ 
                            for tree in bagging.estimators_
                            if hasattr(tree, 'feature_importances_')], 
                           axis=0)

plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importance)), feature_importance, color='green')
plt.title("Average Feature Importance - Bagging Classifier", fontsize=14)
plt.xlabel("Feature Index", fontsize=12)
plt.ylabel("Importance", fontsize=12)
plt.show()

The Accuracy result from the XGBoost and Bagging classifier is around 39.6%. There is not much difference.

Results

XGBoost confusion matrix

XGBoost feature importance graph

Bagging classifier confusion matrix

Bagging classifiers feature importance graph

Summary

in short, through the analysis of the XGBoost and Bagging classifiers,it is apparent that they both got similar accuracies of around 39.6%. When I look at the confusion matrices, I can see that both models are mostly predicting class 0, which means I probably have a class imbalance issue to deal with. Looking at the feature importance plots, I noticed that all my features have pretty similar importance values between 0.05 and 0.07, so none of them are standing out as super important for predictions. Based on these results, I think my location prediction task is pretty tricky, probably because the features for different locations are too similar. To improve this, I might need to work on the class imbalance problem and maybe come up with better features.

***Biology is a little complex for some of these models tested***