Feature | K-means | Hierarchical | DBSCAN
--- | --- | --- | ---
Basic Principle | Partitions data into K clusters, each described by the mean of the samples in the cluster | Creates a tree-like hierarchy of clusters | Groups together points that are closely packed in space
Number of Clusters | Requires pre-specification of K | Can produce any number of clusters | Automatically determines the number of clusters
Cluster Shape | Assumes convex, isotropic blob shapes | Can handle various shapes, but often produces globular clusters | Can find clusters of arbitrary shapes
Scalability | Very large n_samples, medium n_clusters | Large n_samples and n_clusters | Very large n_samples, medium n_clusters
Handling Outliers | Sensitive to outliers | Can be sensitive to outliers, depending on the linkage method | Robust to outliers, marks them as noise points
Handling Uneven Cluster Sizes | Assumes even cluster sizes | Can handle uneven cluster sizes | Can handle uneven cluster sizes
Memory Requirements | Low to Medium | High | Low to Medium
Sensitivity to Initialization | Highly sensitive to initial centroids | N/A | No
Parameters | Number of clusters (K) | Number of clusters or distance threshold, linkage type | eps (neighborhood size), min_samples
Interpretability | Easy to interpret centroids | Dendrogram provides hierarchical view | Core points, border points, and noise points concept
Handling High-dimensional Data | Can struggle with high-dimensional data | Can struggle with high-dimensional data | Can perform well if density is well-defined
Consistency | Results can vary due to random initialization | Consistent results for same parameters | Consistent results for same parameters
Use Cases | General-purpose, even cluster size, flat geometry | Many clusters, possibly with hierarchy, connectivity constraints | Non-flat geometry, uneven cluster sizes, noise removal

Table 1: Comparison of K-means, hierarchical clustering, and DBSCAN
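To make the Parameters row of Table 1 concrete, here is a minimal sketch of how the three algorithms are typically invoked in scikit-learn, using a random placeholder feature matrix X (not part of the analysis below):

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Placeholder feature matrix, only for illustration
X = np.random.rand(100, 3)

# K-means: the number of clusters K must be specified up front
kmeans_labels = KMeans(n_clusters=3, random_state=42).fit_predict(X)

# Hierarchical (agglomerative): number of clusters or a distance threshold, plus a linkage type
hierarchical_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

# DBSCAN: eps (neighborhood size) and min_samples; label -1 marks noise points
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)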

Data Preparation

To explore clustering patterns based on amino acid composition, proteins localized exclusively to the Nucleus, Cytoplasm, or Cell Membrane were selected. Focusing on these three subcellular localizations makes it possible to test whether proteins with the same localization share amino acid profiles that form distinct clusters. Extracting and separating these proteins also simplifies the dataset and makes the clustering results easier to compare across localizations.

The Python code below prepares the data for this purpose.

import pandas as pd

# Loading the data (the original CSV data)
file_path = r'path_to_Cleaned_raw_2.csv'
data = pd.read_csv(file_path)

# Defining the amino acid percentage columns to extract
amino_acid_columns = ['Percent_A', 'Percent_C', 'Percent_D', 'Percent_E', 'Percent_F',
                      'Percent_G', 'Percent_H', 'Percent_I', 'Percent_K', 'Percent_L', 
                      'Percent_M', 'Percent_N', 'Percent_P', 'Percent_Q', 'Percent_R', 
                      'Percent_S', 'Percent_T', 'Percent_V', 'Percent_W', 'Percent_Y']

# Extracting proteins with exclusive subcellular localization - Nucleus
nucleus_exclusive = data[data['Subcellular_Locations'].apply(lambda x: x.strip() == 'Nucleus')]
nucleus_exclusive_amino_acids = nucleus_exclusive[amino_acid_columns].copy()  
nucleus_exclusive_amino_acids['Subcellular_Location'] = 'Nucleus'  

# Extracting proteins with exclusive subcellular localization - Cytoplasm
cytoplasm_exclusive = data[data['Subcellular_Locations'].apply(lambda x: x.strip() == 'Cytoplasm')]
cytoplasm_exclusive_amino_acids = cytoplasm_exclusive[amino_acid_columns].copy() 
cytoplasm_exclusive_amino_acids['Subcellular_Location'] = 'Cytoplasm' 

# Extracting proteins with exclusive subcellular localization - Cell membrane
cell_membrane_exclusive = data[data['Subcellular_Locations'].apply(lambda x: x.strip() == 'Cell membrane')]
cell_membrane_exclusive_amino_acids = cell_membrane_exclusive[amino_acid_columns].copy() 
cell_membrane_exclusive_amino_acids['Subcellular_Location'] = 'Cell membrane' 

# Concatenating the data
merged_data_with_location = pd.concat([nucleus_exclusive_amino_acids, cytoplasm_exclusive_amino_acids, cell_membrane_exclusive_amino_acids], ignore_index=True)

# Saving the merged data with Subcellular_Location to an Excel file
excel_file_with_labels_path = r'path_output_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_WITH_LABELS.xlsx'
merged_data_with_location.to_excel(excel_file_with_labels_path, index=False)

# Saving the merged data without Subcellular_Location to another Excel file
merged_data_without_labels = merged_data_with_location.drop(columns=['Subcellular_Location']) 
excel_file_without_labels_path = r'path_output_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_LABELS_REMOVED.xlsx'
merged_data_without_labels.to_excel(excel_file_without_labels_path, index=False)

# Just a little print message
print('Merging complete')


Links to the Excel files generated by the Python code above. I will be using the file without labels for clustering, but I also keep the labeled version in a separate Excel file. The only difference between the two is a single column holding the label for the three subcellular localizations.

Here are the links to the two Excel files:

1. amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_WITH_LABELS.xlsx
2. amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_LABELS_REMOVED.xlsx

Screenshot showing the data with the Subcellular_Location label.

Screenshot showing the data without the Subcellular_Location label.
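As a quick sanity check of the claim that the two files differ only in that one column, something like the following could be run (file paths are placeholders for the two files generated above):

import pandas as pd

# Placeholder paths to the two Excel files generated above
with_labels = pd.read_excel(r'path_to_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_WITH_LABELS.xlsx')
without_labels = pd.read_excel(r'path_to_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_LABELS_REMOVED.xlsx')

# The column sets should differ only by 'Subcellular_Location'
print(set(with_labels.columns) - set(without_labels.columns))  # expected: {'Subcellular_Location'}

# The shared amino acid columns should be identical row for row
shared_columns = list(without_labels.columns)
print(with_labels[shared_columns].equals(without_labels[shared_columns]))  # expected: True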

Normalizing the data

Python code used to normalize the unlabeled data from above.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Loading the dataset without labels
file_path = r'path_to_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_LABELS_REMOVED.xlsx'
data = pd.read_excel(file_path)

# Normalizing the data
scaler = StandardScaler()
normalized_data = scaler.fit_transform(data)

# Converting back to a DataFrame
normalized_data_df = pd.DataFrame(normalized_data, columns=data.columns)

# Printing the first few rows of the normalized data
print(normalized_data_df.head())

# Saving the normalized data to a new Excel file
normalized_file_path = r'path_to_normalized_amino_acid_data.xlsx'
normalized_data_df.to_excel(normalized_file_path, index=False)

print("Normalization complete")

Link to the normalized Excel file: normalized_amino_acid_data.xlsx

Screenshot showing the normalized data from the code above.
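As a quick, optional check of what StandardScaler did (each column is rescaled to z = (x - mean) / std), the column means of the saved file should be close to 0 and the standard deviations close to 1; a minimal sketch, reusing the placeholder path:

import pandas as pd

# Placeholder path to the normalized file produced above
normalized = pd.read_excel(r'path_to_normalized_amino_acid_data.xlsx')

# Means should be close to 0, standard deviations close to 1
print(normalized.mean().round(3))
print(normalized.std(ddof=0).round(3))  # ddof=0 matches StandardScaler's population standard deviation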

Performing PCA

import pandas as pd
from sklearn.decomposition import PCA

# Load the normalized data
file_path = r'path_to_normalized_amino_acid_data.xlsx'
normalized_data = pd.read_excel(file_path)

# Performing PCA with 3 components
pca = PCA(n_components=3)
data_pca = pca.fit_transform(normalized_data)

# Getting the explained variance ratio (percentage of variance retained)
explained_variance = pca.explained_variance_ratio_
total_variance_retained = sum(explained_variance) * 100

# Print the variance retained by the 3 principal components
print(f'Explained variance by 3 components: {explained_variance}')
print(f'Total variance retained: {total_variance_retained:.2f}%')

# Convert the PCA-reduced data back to a DataFrame
data_pca_df = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3'])

# Saving the PCA-reduced data to an Excel file
pca_file_path = r'path_to_pca_reduced_amino_acid_data.xlsx'
data_pca_df.to_excel(pca_file_path, index=False)

print("PCA complete")

Percentage of Variance Explained by Top 3 Principal Components

  • PC1 (Principal Component 1): 21.44% of the variance.
  • PC2 (Principal Component 2): 13.78% of the variance.
  • PC3 (Principal Component 3): 10.11% of the variance.

After reducing the data to 3 dimensions, I retain 45.33% of the original variance.

Screenshot showing the PCA-reduced data generated by the Python code above.
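Before settling on n_components=3, it can also help to look at the cumulative explained variance over all components; a minimal sketch, reusing the placeholder path to the normalized data:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

# Placeholder path to the normalized data
normalized_data = pd.read_excel(r'path_to_normalized_amino_acid_data.xlsx')

# Fitting PCA with all components and printing the cumulative variance curve
pca_full = PCA()
pca_full.fit(normalized_data)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

for i, cum in enumerate(cumulative_variance, start=1):
    print(f'{i} components retain {cum * 100:.2f}% of the variance')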

Performing K-means Clustering with Silhouette Method

Testing k values of 2, 3, 4, and 5.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Loading the PCA-reduced data
file_path = r'path_to_pca_reduced_amino_acid_data.xlsx'
data_pca = pd.read_excel(file_path)

# Defining the range of k values to test
# picking 2, 3, 4, and 5
k_values = range(2, 6)

# Storing silhouette scores for each k value listed above
silhouette_scores = {}

# Looping through each k value and computing the Silhouette score
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(data_pca)
    
    # Calculate the Silhouette score
    silhouette_avg = silhouette_score(data_pca, labels)
    silhouette_scores[k] = silhouette_avg
    
    print(f"For n_clusters = {k}, the Silhouette score is {silhouette_avg:.4f}")

# Cleaner output visualization
# Sorting and displaying the best k values based on Silhouette score
sorted_silhouette_scores = sorted(silhouette_scores.items(), key=lambda x: x[1], reverse=True)
print("\nBest k values based on Silhouette score:")
for k, score in sorted_silhouette_scores:
    print(f"k = {k}, Silhouette score = {score:.4f}")

Output of the code above:

For n_clusters = 2, the Silhouette score is 0.3381
For n_clusters = 3, the Silhouette score is 0.3693
For n_clusters = 4, the Silhouette score is 0.3920
For n_clusters = 5, the Silhouette score is 0.3254

Best k values based on Silhouette score:
k = 4, Silhouette score = 0.3920
k = 3, Silhouette score = 0.3693
k = 2, Silhouette score = 0.3381
k = 5, Silhouette score = 0.3254

Takeaway: The silhouette method shows that 4 clusters give the best score (0.3920), indicating reasonably well-separated clusters. Using 3 or 2 clusters is also acceptable, but the separation is not as clear. Using 5 clusters scores the worst, so adding more clusters probably will not help much for this data. The selected cluster counts, k = 2, k = 3, and k = 4, will now be used to create 3D K-means clustering plots.
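Before moving on to those plots, the per-sample silhouette values can give a closer look than the single average score; a minimal sketch using silhouette_samples (same placeholder path as above):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Placeholder path to the PCA-reduced data
data_pca = pd.read_excel(r'path_to_pca_reduced_amino_acid_data.xlsx')

# Refitting K-means at k = 4, the best average silhouette score above
labels = KMeans(n_clusters=4, random_state=42).fit_predict(data_pca)

# Summarizing per-sample silhouette values for each cluster
sample_scores = pd.Series(silhouette_samples(data_pca, labels))
print(sample_scores.groupby(labels).agg(['mean', 'min', 'count']))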

3D K-Means Clustering (k = 2, 3, 4)

import pandas as pd
from sklearn.cluster import KMeans
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os

# Loading PCA-reduced data
file_path = r'path_to_pca_reduced_amino_acid_data.xlsx'
data_pca = pd.read_excel(file_path)

# Ensuring the data has the correct structure
required_columns = ['PC1', 'PC2', 'PC3']
if not all(col in data_pca.columns for col in required_columns):
    raise ValueError(f"Data must contain columns named {', '.join(required_columns)}.")

# Defining the k values we are interested in (k = 4, 3, 2 based on Silhouette scores)
k_values = [4, 3, 2]

# Creating a color map for up to 4 clusters
color_maps = {
    0: 'rgb(31, 119, 180)',  # Blue
    1: 'rgb(255, 127, 14)',  # Orange
    2: 'rgb(44, 160, 44)',   # Green
    3: 'rgb(214, 39, 40)'    # Red
}

# Directory to save the plots
output_dir = r'path_to_output_directory'

# Looping through each value of k and creating a plot
for k in k_values:
    # Performing KMeans clustering on the PCA components only
    # (restricting to required_columns keeps the 'Cluster' column added
    # on a previous iteration from being used as a feature)
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(data_pca[required_columns])

    # Adding cluster labels to the dataframe
    data_pca['Cluster'] = labels

    # Creating an interactive 3D scatter plot
    fig = make_subplots(rows=1, cols=1, specs=[[{'type': 'scatter3d'}]])

    # Creating colors based on the cluster labels
    colors = [color_maps[label] for label in labels]

    # Adding scatter plot with hover information
    scatter = go.Scatter3d(
        x=data_pca['PC1'],
        y=data_pca['PC2'],
        z=data_pca['PC3'],
        mode='markers',
        marker=dict(
            size=3,
            color=colors,
            opacity=0.8,
            line=dict(width=0)
        ),
        text=[
            f'Protein: {idx}<br>Cluster: {label}<br>PC1: {pc1}<br>PC2: {pc2}<br>PC3: {pc3}'
            for idx, (label, pc1, pc2, pc3) in enumerate(zip(labels, data_pca['PC1'], data_pca['PC2'], data_pca['PC3']), start=1)
        ],
        hoverinfo='text',
        name='Proteins'  # Renamed Trace 0
    )
    fig.add_trace(scatter)  # Trace 0

    # Plotting the centroids
    centroids = kmeans.cluster_centers_
    centroid_trace = go.Scatter3d(
        x=centroids[:, 0],
        y=centroids[:, 1],
        z=centroids[:, 2],
        mode='markers',
        marker=dict(size=6, color='black', symbol='x', line=dict(width=2)),
        name='Centroid'  # Trace 1
    )
    fig.add_trace(centroid_trace)  # Trace 1

    # Updating layout with customized axes and background
    fig.update_layout(
        title=f"KMeans Clustering of Proteins with k={k}",
        scene=dict(
            xaxis=dict(
                title='PC1',
                backgroundcolor='white',
                gridcolor='lightgrey',
                linecolor='black',
                showbackground=True,
                zerolinecolor='black'
            ),
            yaxis=dict(
                title='PC2',
                backgroundcolor='white',
                gridcolor='lightgrey',
                linecolor='black',
                showbackground=True,
                zerolinecolor='black'
            ),
            zaxis=dict(
                title='PC3',
                backgroundcolor='white',
                gridcolor='lightgrey',
                linecolor='black',
                showbackground=True,
                zerolinecolor='black'
            ),
            bgcolor='white',  
            camera=dict(
                eye=dict(x=1.25, y=1.25, z=1.25) 
            )
        ),
        width=900,
        height=700,
        margin=dict(r=20, b=10, l=10, t=40),
        paper_bgcolor='white', 
        plot_bgcolor='white'    
    )

    # Saving each plot as an interactive HTML file (easier to explore than a static image)
    output_file = os.path.join(output_dir, f'kmeans_plot_k{k}_interactive.html')
    fig.write_html(output_file, include_plotlyjs=True, full_html=True)
    print(f"The interactive plot for k={k} has been saved as 'kmeans_plot_k{k}_interactive.html' in {output_dir}")

The code above will generate an interactive 3D visualization to display the K-means clustering results.

Summary

As k increases from 2 to 4, the clusters become more refined. For k = 2, the clusters group the data into broader categories, likely oversimplifying the differences between proteins localized to the nucleus, cytoplasm, and cell membrane. At k = 3, the clustering better aligns with the three known labels, suggesting that this value of k is well-suited for capturing the distinctions between these subcellular locations. By k = 4, the clusters are even more distinct, but may overdivide the data, indicating that k = 3 might be the best representation of the subcellular localization patterns in this case.
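One way to check the claim that k = 3 aligns with the three known localizations is to cross-tabulate the k = 3 cluster assignments against the Subcellular_Location column of the labeled file. The sketch below assumes the labeled and unlabeled files keep their rows in the same order, which is how the data-preparation script above produced them (placeholder paths):

import pandas as pd
from sklearn.cluster import KMeans

# Placeholder paths; rows in both files share the same order by construction
data_pca = pd.read_excel(r'path_to_pca_reduced_amino_acid_data.xlsx')
labeled = pd.read_excel(r'path_to_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_WITH_LABELS.xlsx')

# Clustering at k = 3 and comparing against the known subcellular locations
clusters = KMeans(n_clusters=3, random_state=42).fit_predict(data_pca[['PC1', 'PC2', 'PC3']])
print(pd.crosstab(labeled['Subcellular_Location'], clusters, rownames=['Location'], colnames=['Cluster']))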

Hierarchical Clustering

import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Loading the PCA-reduced data
file_path = r'path_to_pca_reduced_amino_acid_data.xlsx'
data_pca = pd.read_excel(file_path)

# Performing Hierarchical Clustering using Ward's method 
# it helps to minimize variance within clusters
linked = linkage(data_pca, method='ward')

# Creating the dendrogram plot
plt.figure(figsize=(12, 8))  
dendrogram(linked,
           orientation='top',       
           distance_sort='descending', 
           show_leaf_counts=True)   

# Customizing the title and axis labels for better readability
plt.title('Hierarchical Clustering', fontsize=16)
plt.xlabel('Protein Index', fontsize=12)  
plt.ylabel('Distance', fontsize=12) 

# Cleaning up the plot
plt.gca().spines['top'].set_visible(False) 
plt.gca().spines['right'].set_visible(False) 

# Saving the dendrogram plot
output_file = r'path_to_hierarchical_clustering.png'
plt.savefig(output_file, dpi=300, bbox_inches='tight')

# Displaying
plt.show()

print("Dendrogram saved as 'updated_hierarchical_clustering.png'.")

Comparing the dendrogram results to the K-means results

The K-means clustering approach is straightforward, providing a fixed number of clusters (e.g., k = 2, 3, 4), which can simplify the data but may overlook finer structures. On the other hand, hierarchical clustering, as visualized above through the dendrogram, offers greater flexibility by allowing clusters to be formed at different distance thresholds.

For example, when setting the threshold to 5 in the hierarchical clustering analysis, 113 distinct clusters emerged. This provides a much more detailed picture of how the data is organized compared to simpler methods. While K-means clustering is useful when the number of groups is predetermined, hierarchical clustering creates a sort of family tree for the data. It reveals not just the groups, but how they're all related to each other. This is especially helpful when dealing with complex datasets where the relationships aren't immediately obvious.
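The 113-cluster figure mentioned above comes from cutting the dendrogram at a distance threshold; a minimal sketch of how such a cut can be made with fcluster, using the same Ward linkage and a threshold of 5:

import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder path to the PCA-reduced data
data_pca = pd.read_excel(r'path_to_pca_reduced_amino_acid_data.xlsx')

# Same Ward linkage as in the dendrogram code above
linked = linkage(data_pca, method='ward')

# Cutting the tree at distance 5 and counting the resulting flat clusters
flat_labels = fcluster(linked, t=5, criterion='distance')
print(f'Number of clusters at distance threshold 5: {len(set(flat_labels))}')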

DBSCAN clustering

Python code used to run DBSCAN on the PCA-reduced data.

import pandas as pd
from sklearn.cluster import DBSCAN
import plotly.graph_objects as go
import numpy as np
import os

# Loading PCA-reduced data
file_path = r'path_to_pca_reduced_amino_acid_data.xlsx'
data_pca = pd.read_excel(file_path)

# Performing DBSCAN clustering
# DBSCAN eps parameter
eps_value = 0.5 
# Minimum points to form a cluster
min_samples_value = 5  
dbscan = DBSCAN(eps=eps_value, min_samples=min_samples_value)
dbscan_labels = dbscan.fit_predict(data_pca)

# Adding cluster labels to the dataframe
data_pca['Cluster'] = dbscan_labels

# Identifying the number of clusters (excluding noise)
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = list(dbscan_labels).count(-1)

# Printing cluster and noise information
print(f"Number of clusters found: {n_clusters}")
print(f"Number of noise points: {n_noise}")

# Creating a 3D scatter plot for DBSCAN results
fig = go.Figure()

# Creating color map for clusters (using black for noise)
unique_labels = set(dbscan_labels)
colors = np.linspace(0, 1, len(unique_labels))

for label, color_value in zip(unique_labels, colors):
    if label == -1:
        # Using black for noise
        color = 'rgb(0, 0, 0)'
    else:
        # Assigning color based on the label
        color = f'rgb({color_value * 255}, {100}, {255 - color_value * 255})' 

    # Filtering data points for the current cluster
    cluster_data = data_pca[dbscan_labels == label]

    # Adding scatter plot for each cluster
    fig.add_trace(go.Scatter3d(
        x=cluster_data['PC1'],
        y=cluster_data['PC2'],
        z=cluster_data['PC3'],
        mode='markers',
        marker=dict(
            size=4,
            color=color,
            opacity=0.8
        ),
        name=f'Cluster {label}' if label != -1 else 'Noise'
    ))

# Updating plot layout
fig.update_layout(
    title=f'DBSCAN Clustering on 3D PCA Data (eps={eps_value}, min_samples={min_samples_value})',
    scene=dict(
        xaxis_title='PC1',
        yaxis_title='PC2',
        zaxis_title='PC3'
    ),
    width=900,
    height=700
)

# Saving the plot as an HTML file
output_dir = r'path'
output_file = os.path.join(output_dir, 'dbscan_plot_interactive.html')
fig.write_html(output_file, include_plotlyjs=True, full_html=True)

# Printing confirmation of the saved plot
print("Saved as 'dbscan_plot_interactive.html'")

DBSCAN groups points based on their density, identifying regions of high point density as clusters and labeling sparser regions as noise. In this case, the DBSCAN code identified 8 distinct clusters from the 3D PCA-reduced data, indicating that there are 8 dense groups of points within the dataset. Additionally, 350 points were classified as noise, meaning they didn't belong to any cluster based on the eps=0.5 and min_samples=5 parameters. The interactive 3D plot visualizes these clusters, allowing for exploration of how the data points are distributed in the reduced feature space.

***

The result is very different and much less clean compared to the clustering I have seen from K-means and hierarchical clustering.
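How clean the DBSCAN result looks depends heavily on eps. A common heuristic, sketched below with the same placeholder path, is to plot the sorted distance of every point to its min_samples-th nearest neighbor and pick eps near the elbow of that curve:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Placeholder path to the PCA-reduced data
data_pca = pd.read_excel(r'path_to_pca_reduced_amino_acid_data.xlsx')

# Distance of each point to its 5th neighbor (the first neighbor is the point itself)
neighbors = NearestNeighbors(n_neighbors=5).fit(data_pca)
distances, _ = neighbors.kneighbors(data_pca)
k_distances = np.sort(distances[:, -1])

# The elbow of this curve is a reasonable starting value for eps
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to 5th nearest neighbor')
plt.title('k-distance plot for choosing eps')
plt.show()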

Comparing DBSCAN clustering to the dendrogram and the K-means results

K-means Clustering
In K-means clustering with k=3, distinct clusters emerged that closely corresponded to the three subcellular localizations (Nucleus, Cytoplasm, and Cell Membrane). However, K-means assumes spherical clusters and forces all points into clusters, potentially misclassifying points that don’t fit well.

Hierarchical Clustering
Hierarchical clustering using Ward’s method illustrated how clusters are hierarchically merged, offering a clear tree structure that reflects relationships between data points. However, it didn’t delineate the subcellular localization clusters as strongly as K-means.

DBSCAN
DBSCAN, with eps=0.6 and min_samples=5, formed 47 clusters and identified 269 points as noise, demonstrating its strength in handling outliers and arbitrary-shaped clusters. However, it resulted in fragmented clusters without clear alignment to subcellular localizations due to its density-based nature, which excels in high-density regions but can struggle with less defined datasets.
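Because the cluster and noise counts change noticeably between eps = 0.5 and eps = 0.6, a small parameter sweep (a sketch, using the same placeholder path) makes this sensitivity explicit:

import pandas as pd
from sklearn.cluster import DBSCAN

# Placeholder path to the PCA-reduced data
data_pca = pd.read_excel(r'path_to_pca_reduced_amino_acid_data.xlsx')

# Reporting cluster and noise counts over a small range of eps values
for eps in [0.3, 0.4, 0.5, 0.6, 0.7]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(data_pca)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print(f'eps={eps}: {n_clusters} clusters, {n_noise} noise points')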

Conclusions

From what I see, the K-means clustering really nailed it with well-separated clusters for the three subcellular localizations (Nucleus, Cytoplasm, and Cell Membrane), but it can struggle with outliers since it forces all points into clusters. Hierarchical clustering showed a more flexible merging process, while DBSCAN was great at spotting outliers but ended up creating a lot of smaller clusters, so overall, K-means offered the clearest separation for the defined groups.