Clustering is the process of grouping similar data points to uncover hidden patterns within a dataset. The primary goal is to maximize similarity within clusters while maximizing the separation between clusters. Various algorithms achieve this, each with its own advantages and disadvantages, helping us gain insights into complex datasets. In this analysis, I will compare and contrast three algorithms: K-means, hierarchical clustering, and DBSCAN, which represent different approaches to the clustering problem.
| Feature | K-means | Hierarchical | DBSCAN |
| --- | --- | --- | --- |
| Basic Principle | Partitions data into K clusters, each described by the mean of the samples in the cluster | Creates a tree-like hierarchy of clusters | Groups together points that are closely packed in space |
| Number of Clusters | Requires pre-specification of K | Can produce any number of clusters (chosen by cutting the tree) | Determines the number of clusters automatically |
| Cluster Shape | Assumes convex, isotropic blob shapes | Can handle various shapes, but often produces globular clusters | Can find clusters of arbitrary shapes |
| Scalability | Very large n_samples, medium n_clusters | Large n_samples and n_clusters | Very large n_samples, medium n_clusters |
| Handling Outliers | Sensitive to outliers | Can be sensitive to outliers, depending on the linkage method | Robust to outliers, marks them as noise points |
| Handling Uneven Cluster Sizes | Assumes even cluster sizes | Can handle uneven cluster sizes | Can handle uneven cluster sizes |
| Memory Requirements | Low to medium | High | Low to medium |
| Sensitivity to Initialization | Highly sensitive to initial centroids | Not applicable (deterministic) | Not sensitive (deterministic) |
| Parameters | Number of clusters (K) | Number of clusters or distance threshold, linkage type | eps (neighborhood radius), min_samples |
| Interpretability | Easy to interpret centroids | Dendrogram provides a hierarchical view | Core, border, and noise point concepts |
| Handling High-dimensional Data | Can struggle with high-dimensional data | Can struggle with high-dimensional data | Can perform well if density is well defined |
| Consistency | Results can vary due to random initialization | Consistent results for the same parameters | Consistent results for the same parameters |
| Use Cases | General-purpose, even cluster sizes, flat geometry | Many clusters, possibly with hierarchy, connectivity constraints | Non-flat geometry, uneven cluster sizes, noise removal |

Table 1: Comparison of K-means, hierarchical clustering, and DBSCAN
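To make the parameter differences in Table 1 concrete, the minimal sketch below shows how the three algorithms are typically instantiated in scikit-learn. The synthetic make_blobs data and the parameter values are purely illustrative assumptions, not the settings used in the analysis that follows.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Illustrative toy data only (three blobs in 2D)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-means: K must be given up front
kmeans_labels = KMeans(n_clusters=3, random_state=42).fit_predict(X)

# Hierarchical (agglomerative): either n_clusters or a distance_threshold, plus a linkage type
hier_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

# DBSCAN: no cluster count; eps and min_samples control density, -1 marks noise
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(np.unique(kmeans_labels), np.unique(hier_labels), np.unique(dbscan_labels))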
Data Preparation
To explore clustering patterns based on amino acid composition, proteins with exclusive localizations to the Nucleus, Cytoplasm, and Cell Membrane were selected. This focus on specific subcellular localizations aims to investigate whether proteins with similar localizations exhibit similar amino acid profiles that can form distinct clusters. Extracting and separating these proteins facilitates a clear comparison of clustering results. So, in essence, this approach simplifies the dataset, making it easier to analyze clustering patterns linked to specific localizations.
The Python code below prepares the data to that end.
import pandas as pd

# Loading the data (the original CSV data)
file_path = r'path_to_Cleaned_raw_2.csv'
data = pd.read_csv(file_path)

# Defining the amino acid percentage columns to extract
amino_acid_columns = ['Percent_A', 'Percent_C', 'Percent_D', 'Percent_E', 'Percent_F',
                      'Percent_G', 'Percent_H', 'Percent_I', 'Percent_K', 'Percent_L',
                      'Percent_M', 'Percent_N', 'Percent_P', 'Percent_Q', 'Percent_R',
                      'Percent_S', 'Percent_T', 'Percent_V', 'Percent_W', 'Percent_Y']

# Extracting proteins with exclusive subcellular localization - Nucleus
nucleus_exclusive = data[data['Subcellular_Locations'].apply(lambda x: x.strip() == 'Nucleus')]
nucleus_exclusive_amino_acids = nucleus_exclusive[amino_acid_columns].copy()
nucleus_exclusive_amino_acids['Subcellular_Location'] = 'Nucleus'

# Extracting proteins with exclusive subcellular localization - Cytoplasm
cytoplasm_exclusive = data[data['Subcellular_Locations'].apply(lambda x: x.strip() == 'Cytoplasm')]
cytoplasm_exclusive_amino_acids = cytoplasm_exclusive[amino_acid_columns].copy()
cytoplasm_exclusive_amino_acids['Subcellular_Location'] = 'Cytoplasm'

# Extracting proteins with exclusive subcellular localization - Cell membrane
cell_membrane_exclusive = data[data['Subcellular_Locations'].apply(lambda x: x.strip() == 'Cell membrane')]
cell_membrane_exclusive_amino_acids = cell_membrane_exclusive[amino_acid_columns].copy()
cell_membrane_exclusive_amino_acids['Subcellular_Location'] = 'Cell membrane'

# Concatenating the data
merged_data_with_location = pd.concat([nucleus_exclusive_amino_acids,
                                       cytoplasm_exclusive_amino_acids,
                                       cell_membrane_exclusive_amino_acids],
                                      ignore_index=True)

# Saving the merged data with Subcellular_Location to an Excel file
excel_file_with_labels_path = r'path_output_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_WITH_LABELS.xlsx'
merged_data_with_location.to_excel(excel_file_with_labels_path, index=False)

# Saving the merged data without Subcellular_Location to another Excel file
merged_data_without_labels = merged_data_with_location.drop(columns=['Subcellular_Location'])
excel_file_without_labels_path = r'path_output_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_LABELS_REMOVED.xlsx'
merged_data_without_labels.to_excel(excel_file_without_labels_path, index=False)

# Just a little print message
print('Merging complete')
Links to the Excel files generated by the Python code above. I will of course be using the file without labels for clustering, but I keep the labeled version in a separate Excel file; the only difference between the two is a single column holding the label for the three subcellular localizations.
Here are the links to the two Excel files:
1. amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_WITH_LABELS.xlsx
2. amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_LABELS_REMOVED.xlsx
Screenshot showing the data with the Subcellular_Location label.
Screenshot showing the data without the Subcellular_Location label.
Normalizing the data
Python code used to normalize the unlabeled data from above.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Loading the dataset without labels
file_path = r'path_to_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_LABELS_REMOVED.xlsx'
data = pd.read_excel(file_path)

# Normalizing the data
scaler = StandardScaler()
normalized_data = scaler.fit_transform(data)

# Converting back to a DataFrame
normalized_data_df = pd.DataFrame(normalized_data, columns=data.columns)

# Printing the first few rows of the normalized data
print(normalized_data_df.head())

# Saving the normalized data to a new Excel file
normalized_file_path = r'path_to_normalized_amino_acid_data.xlsx'
normalized_data_df.to_excel(normalized_file_path, index=False)

print("Normalization complete")
Link to the normalized Excel file: normalized_amino_acid_data.xlsx
Screenshot showing the normalized data from the code above.
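As an optional sanity check, a minimal sketch like the one below (assuming the normalized file path used above) confirms what StandardScaler did: every amino acid percentage column should now have a mean close to 0 and a standard deviation close to 1.

import pandas as pd

# Sanity check on the StandardScaler output (file path assumed from the step above)
normalized_df = pd.read_excel(r'path_to_normalized_amino_acid_data.xlsx')
print(normalized_df.mean().round(3))  # expected: approximately 0 for every column
print(normalized_df.std().round(3))   # expected: approximately 1 for every column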
Performing PCA
import pandas as pd
from sklearn.decomposition import PCA

# Load the normalized data
file_path = r'path_to_normalized_amino_acid_data.xlsx'
normalized_data = pd.read_excel(file_path)

# Performing PCA with 3 components
pca = PCA(n_components=3)
data_pca = pca.fit_transform(normalized_data)

# Getting the explained variance ratio (percentage of variance retained)
explained_variance = pca.explained_variance_ratio_
total_variance_retained = sum(explained_variance) * 100

# Print the variance retained by the 3 principal components
print(f'Explained variance by 3 components: {explained_variance}')
print(f'Total variance retained: {total_variance_retained:.2f}%')

# Convert the PCA-reduced data back to a DataFrame
data_pca_df = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3'])

# Saving the PCA-reduced data to an Excel file
pca_file_path = r'path_to_pca_reduced_amino_acid_data.xlsx'
data_pca_df.to_excel(pca_file_path, index=False)

print("PCA complete")
Percentage of Variance Explained by Top 3 Principal Components
- PC1 (Principal Component 1): 21.44% of the variance.
- PC2 (Principal Component 2): 13.78% of the variance.
- PC3 (Principal Component 3): 10.11% of the variance.
After reducing the data to 3 dimensions, I retain 45.33% of the original variance.
Screenshot showing the PCA-reduced data generated by the Python code above.
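Since three components keep under half of the variance, it is worth knowing how quickly the rest accumulates. The sketch below (a check only, assuming the same normalized-data path as above) prints the cumulative explained variance for every component.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Cumulative explained variance across all components (path assumed from above)
normalized_data = pd.read_excel(r'path_to_normalized_amino_acid_data.xlsx')
pca_full = PCA().fit(normalized_data)  # keep every component
cumulative = np.cumsum(pca_full.explained_variance_ratio_) * 100
for i, pct in enumerate(cumulative, start=1):
    print(f'{i} components retain {pct:.2f}% of the variance')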
Performing K-means Clustering with Silhouette Method
Testing k values of 2, 3, 4, and 5.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Loading the PCA-reduced data
file_path = r'path_to_pca_reduced_amino_acid_data.xlsx'
data_pca = pd.read_excel(file_path)

# Defining the range of k values to test (picking 2, 3, 4 and 5)
k_values = range(2, 6)

# Storing silhouette scores for each k value listed above
silhouette_scores = {}

# Looping through each k value and computing the Silhouette score
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(data_pca)

    # Calculate the Silhouette score
    silhouette_avg = silhouette_score(data_pca, labels)
    silhouette_scores[k] = silhouette_avg
    print(f"For n_clusters = {k}, the Silhouette score is {silhouette_avg:.4f}")

# Cleaner output visualization:
# sorting and displaying the best k values based on Silhouette score
sorted_silhouette_scores = sorted(silhouette_scores.items(), key=lambda x: x[1], reverse=True)
print("\nBest k values based on Silhouette score:")
for k, score in sorted_silhouette_scores:
    print(f"k = {k}, Silhouette score = {score:.4f}")
Output of the code above:
For n_clusters = 2, the Silhouette score is 0.3381
For n_clusters = 3, the Silhouette score is 0.3693
For n_clusters = 4, the Silhouette score is 0.3920
For n_clusters = 5, the Silhouette score is 0.3254
Best k values based on Silhouette score:
k = 4, Silhouette score = 0.3920
k = 3, Silhouette score = 0.3693
k = 2, Silhouette score = 0.3381
k = 5, Silhouette score = 0.3254
Takeaway: The silhouette method points to 4 clusters as the best choice, with a score of 0.3920, although a score in this range indicates only moderately separated clusters. Using 3 or 2 clusters is also reasonable, but the separation is slightly weaker, and 5 clusters scores worst, so adding more clusters probably won't help much for this data. Before moving on, the optional sketch below cross-checks this choice against the K-means inertia (elbow) curve. The selected values, k = 2, k = 3, and k = 4, will then be used to create 3D K-means clustering plots.
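This cross-check is not part of the original silhouette analysis; it is a minimal sketch assuming the PCA-reduced file path used above. The k at which the drop in inertia flattens (the "elbow") often agrees with the silhouette-based choice.

import pandas as pd
from sklearn.cluster import KMeans

# Inertia (within-cluster sum of squares) for the same k values (path assumed from above)
data_pca = pd.read_excel(r'path_to_pca_reduced_amino_acid_data.xlsx')
for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42).fit(data_pca)
    print(f'k = {k}: inertia = {km.inertia_:.1f}')  # look for the k where the decrease flattens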
3D K-Means Clustering plots (k = 2, 3, 4)
import pandas as pd
from sklearn.cluster import KMeans
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os

# Loading PCA-reduced data
file_path = r'path_to_pca_reduced_amino_acid_data.xlsx'
data_pca = pd.read_excel(file_path)

# Ensuring the data has the correct structure
required_columns = ['PC1', 'PC2', 'PC3']
if not all(col in data_pca.columns for col in required_columns):
    raise ValueError(f"Data must contain columns named {', '.join(required_columns)}.")

# Defining the k values we are interested in (k = 4, 3, 2 based on Silhouette scores)
k_values = [4, 3, 2]

# Creating a color map for up to 4 clusters
color_maps = {
    0: 'rgb(31, 119, 180)',   # Blue
    1: 'rgb(255, 127, 14)',   # Orange
    2: 'rgb(44, 160, 44)',    # Green
    3: 'rgb(214, 39, 40)'     # Red
}

# Directory to save the plots
output_dir = r'C:\Users\GAMING PC\Downloads\XA'

# Looping through each value of k and creating a plot
for k in k_values:
    # Performing KMeans clustering on the three principal components only
    # (fitting on required_columns keeps the 'Cluster' column added below out of later fits)
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(data_pca[required_columns])

    # Adding cluster labels to the dataframe
    data_pca['Cluster'] = labels

    # Creating an interactive 3D scatter plot
    fig = make_subplots(rows=1, cols=1, specs=[[{'type': 'scatter3d'}]])

    # Creating colors based on the cluster labels
    colors = [color_maps[label] for label in labels]

    # Adding scatter plot with hover information
    scatter = go.Scatter3d(
        x=data_pca['PC1'],
        y=data_pca['PC2'],
        z=data_pca['PC3'],
        mode='markers',
        marker=dict(size=3, color=colors, opacity=0.8, line=dict(width=0)),
        text=[
            f'Protein: {idx}<br>Cluster: {label}<br>PC1: {pc1}<br>PC2: {pc2}<br>PC3: {pc3}'
            for idx, (label, pc1, pc2, pc3) in enumerate(
                zip(labels, data_pca['PC1'], data_pca['PC2'], data_pca['PC3']), start=1)
        ],
        hoverinfo='text',
        name='Proteins'  # Renamed Trace 0
    )
    fig.add_trace(scatter)  # Trace 0

    # Plotting the centroids
    centroids = kmeans.cluster_centers_
    centroid_trace = go.Scatter3d(
        x=centroids[:, 0],
        y=centroids[:, 1],
        z=centroids[:, 2],
        mode='markers',
        marker=dict(size=6, color='black', symbol='x', line=dict(width=2)),
        name='Centroid'  # Trace 1
    )
    fig.add_trace(centroid_trace)  # Trace 1

    # Updating layout with customized axes and background
    fig.update_layout(
        title=f"KMeans Clustering of Proteins with k={k}",
        scene=dict(
            xaxis=dict(title='PC1', backgroundcolor='white', gridcolor='lightgrey',
                       linecolor='black', showbackground=True, zerolinecolor='black'),
            yaxis=dict(title='PC2', backgroundcolor='white', gridcolor='lightgrey',
                       linecolor='black', showbackground=True, zerolinecolor='black'),
            zaxis=dict(title='PC3', backgroundcolor='white', gridcolor='lightgrey',
                       linecolor='black', showbackground=True, zerolinecolor='black'),
            bgcolor='white',
            camera=dict(eye=dict(x=1.25, y=1.25, z=1.25))
        ),
        width=900,
        height=700,
        margin=dict(r=20, b=10, l=10, t=40),
        paper_bgcolor='white',
        plot_bgcolor='white'
    )

    # Saving each plot as an HTML file
    # (I thought this would be a better way to visualize than a static image)
    output_file = os.path.join(output_dir, f'kmeans_plot_k{k}_interactive.html')
    fig.write_html(output_file, include_plotlyjs=True, full_html=True)
    print(f"The interactive plot for k={k} has been saved as 'kmeans_plot_k{k}_interactive.html' in {output_dir}")
The code above will generate an interactive 3D visualization to display the K-means clustering results.
Summary
As k increases from 2 to 4, the clusters become more refined. For k = 2, the clusters group the data into broader categories, likely oversimplifying the differences between proteins localized to the nucleus, cytoplasm, and cell membrane. At k = 3, the clustering better aligns with the three known labels, suggesting that this value of k is well-suited for capturing the distinctions between these subcellular locations. By k = 4, the clusters are even more distinct, but may overdivide the data, indicating that k = 3 might be the best representation of the subcellular localization patterns in this case.
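The suggestion that k = 3 lines up best with the three known localizations can be checked quantitatively. The sketch below is only illustrative: it assumes the labeled Excel file saved in the data-preparation step shares its row order with the PCA-reduced file, and it computes the adjusted Rand index between each K-means partition and the Subcellular_Location column (1.0 means perfect agreement, values near 0 mean chance-level agreement).

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Comparing K-means partitions with the known labels (file paths and column name assumed from earlier steps)
data_pca = pd.read_excel(r'path_to_pca_reduced_amino_acid_data.xlsx')
labeled = pd.read_excel(r'path_to_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_WITH_LABELS.xlsx')
true_labels = labeled['Subcellular_Location']  # assumed to be in the same row order as the PCA file

for k in [2, 3, 4]:
    pred = KMeans(n_clusters=k, random_state=42).fit_predict(data_pca)
    print(f'k = {k}: adjusted Rand index = {adjusted_rand_score(true_labels, pred):.3f}')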
Hierarchical Clustering
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Loading the PCA-reduced data
file_path = r'path_to_pca_reduced_amino_acid_data.xlsx'
data_pca = pd.read_excel(file_path)

# Performing Hierarchical Clustering using Ward's method
# (it helps to minimize variance within clusters)
linked = linkage(data_pca, method='ward')

# Creating the dendrogram plot
plt.figure(figsize=(12, 8))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)

# Customizing the title and axis labels for better readability
plt.title('Hierarchical Clustering', fontsize=16)
plt.xlabel('Protein Index', fontsize=12)
plt.ylabel('Distance', fontsize=12)

# Cleaning up the plot
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

# Saving the dendrogram plot
output_file = r'path_to_hierarchical_clustering.png'
plt.savefig(output_file, dpi=300, bbox_inches='tight')

# Displaying
plt.show()

print("Dendrogram saved as 'hierarchical_clustering.png'.")
Comparing the dendrogram results to the K-means results
The K-means clustering approach is straightforward, providing a fixed number of clusters (e.g., k = 2, 3, 4), which can simplify the data but may overlook finer structures. On the other hand, hierarchical clustering, as visualized above through the dendrogram, offers greater flexibility by allowing clusters to be formed at different distance thresholds.
For example, when setting the distance threshold to 5 in the hierarchical clustering analysis, 113 distinct clusters emerged (see the sketch below for how such a cut is made). This provides a much more detailed picture of how the data is organized compared to simpler methods. While K-means clustering is useful when the number of groups is predetermined, hierarchical clustering creates a sort of family tree for the data. It reveals not just the groups, but how they are all related to each other. This is especially helpful when dealing with complex datasets where the relationships aren't immediately obvious.
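A minimal sketch of how the dendrogram can be cut into flat clusters at a chosen distance threshold, assuming the same PCA-reduced file path and the threshold of 5 mentioned above:

import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Cutting the Ward dendrogram at a distance threshold (path and threshold assumed from above)
data_pca = pd.read_excel(r'path_to_pca_reduced_amino_acid_data.xlsx')
linked = linkage(data_pca, method='ward')

flat_labels = fcluster(linked, t=5, criterion='distance')  # cut the tree at distance 5
print(f'Number of clusters at threshold 5: {flat_labels.max()}')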
DBSCAN clustering
Python code
import pandas as pd
from sklearn.cluster import DBSCAN
import plotly.graph_objects as go
import numpy as np
import os

# Loading PCA-reduced data
file_path = r'path_to_pca_reduced_amino_acid_data.xlsx'
data_pca = pd.read_excel(file_path)

# Performing DBSCAN clustering
eps_value = 0.5          # DBSCAN eps parameter (neighborhood radius)
min_samples_value = 5    # Minimum points to form a cluster
dbscan = DBSCAN(eps=eps_value, min_samples=min_samples_value)
dbscan_labels = dbscan.fit_predict(data_pca)

# Adding cluster labels to the dataframe
data_pca['Cluster'] = dbscan_labels

# Identifying the number of clusters (excluding noise)
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = list(dbscan_labels).count(-1)

# Printing cluster and noise information
print(f"Number of clusters found: {n_clusters}")
print(f"Number of noise points: {n_noise}")

# Creating a 3D scatter plot for DBSCAN results
fig = go.Figure()

# Creating color map for clusters (using black for noise)
unique_labels = set(dbscan_labels)
colors = np.linspace(0, 1, len(unique_labels))

for label, color_value in zip(unique_labels, colors):
    if label == -1:
        # Using black for noise
        color = 'rgb(0, 0, 0)'
    else:
        # Assigning color based on the label
        color = f'rgb({color_value * 255}, {100}, {255 - color_value * 255})'

    # Filtering data points for the current cluster
    cluster_data = data_pca[dbscan_labels == label]

    # Adding scatter plot for each cluster
    fig.add_trace(go.Scatter3d(
        x=cluster_data['PC1'],
        y=cluster_data['PC2'],
        z=cluster_data['PC3'],
        mode='markers',
        marker=dict(size=4, color=color, opacity=0.8),
        name=f'Cluster {label}' if label != -1 else 'Noise'
    ))

# Updating plot layout
fig.update_layout(
    title=f'DBSCAN Clustering on 3D PCA Data (eps={eps_value}, min_samples={min_samples_value})',
    scene=dict(xaxis_title='PC1', yaxis_title='PC2', zaxis_title='PC3'),
    width=900,
    height=700
)

# Saving the plot as an HTML file
output_dir = r'path'
output_file = os.path.join(output_dir, 'dbscan_plot_interactive.html')
fig.write_html(output_file, include_plotlyjs=True, full_html=True)

# Printing confirmation of the saved plot
print("Saved as 'dbscan_plot_interactive.html'")
DBSCAN groups points based on their density, identifying regions of high point density as clusters and labeling sparser regions as noise. In this case, the DBSCAN code identified 8 distinct clusters from the 3D PCA-reduced data, indicating that there are 8 dense groups of points within the dataset. Additionally, 350 points were classified as noise, meaning they didn't belong to any cluster under the eps=0.5 and min_samples=5 parameters. The interactive 3D plot visualizes these clusters, allowing exploration of how the data points are distributed in the reduced feature space.
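The eps = 0.5 value was set somewhat arbitrarily here. A common way to guide that choice is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbour and look for the "knee", which suggests a reasonable eps. The sketch below is only an illustration of that heuristic, assuming the same PCA-reduced file path as above.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# k-distance plot heuristic for choosing eps (path assumed from above)
data_pca = pd.read_excel(r'path_to_pca_reduced_amino_acid_data.xlsx')
min_samples = 5

nn = NearestNeighbors(n_neighbors=min_samples).fit(data_pca)
distances, _ = nn.kneighbors()                 # neighbours of the training points, excluding each point itself
k_distances = np.sort(distances[:, -1])        # distance to the min_samples-th nearest neighbour, sorted

plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {min_samples}th nearest neighbour')
plt.title('k-distance plot for choosing eps')
plt.show()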
***
The result looks very different and much less clean than the clusterings I obtained from K-means and hierarchical clustering.
Comparing DBSCAN clustering to the dendrogram and the K-means results
K-means Clustering
In K-means clustering with k=3, distinct clusters emerged that closely corresponded to the three subcellular localizations (Nucleus, Cytoplasm, and Cell Membrane). However, K-means assumes spherical clusters and forces all points into clusters, potentially misclassifying points that don’t fit well.
Hierarchical Clustering
Hierarchical clustering using Ward’s method illustrated how clusters are hierarchically merged, offering a clear tree structure that reflects relationships between data points. However, it didn’t delineate the subcellular localization clusters as strongly as K-means.
DBSCAN
DBSCAN, with eps=0.5 and min_samples=5, formed 8 clusters and flagged 350 points as noise, demonstrating its strength in handling outliers and arbitrary-shaped clusters. However, it resulted in fragmented clusters without clear alignment to subcellular localizations due to its density-based nature, which excels in high-density regions but can struggle with less defined datasets.
Conclusions
From what I see, K-means clustering produced the clearest separation among the three methods for the three subcellular localizations (Nucleus, Cytoplasm, and Cell Membrane), although it can struggle with outliers since it forces every point into a cluster. Hierarchical clustering showed a more flexible merging process, and DBSCAN was great at spotting outliers but ended up creating many small clusters. Overall, K-means offered the clearest separation for the defined groups.