Association rule mining (ARM) is a data mining technique used to discover interesting relationships or patterns in large datasets. Specifically, it aims to find associations between items or events that frequently occur together. Three metrics are fundamental to ARM: 1) Support, 2) Confidence, and 3) Lift.
A brief explanation of these three key metrics guiding ARM:
Support: This measures how frequently a particular combination of items (here, amino acids or localization patterns) appears in the dataset. For example, if certain amino acids are found in 30% of proteins that localize to the nucleus, the support for that itemset is 0.30. This tells us how common the association is within the dataset.
Confidence: Confidence measures the reliability of an association. It answers the question: given that a protein contains certain amino acids (A), how likely is it to be found in a specific subcellular location (B)? For instance, if 70% of proteins with a certain amino acid profile are localized to the cytoplasm, the confidence for that rule is 0.70.
Lift: Lift indicates whether an association is stronger than random chance. A lift greater than 1 means that the presence of a particular amino acid pattern increases the likelihood of a protein being in a specific localization; a lift of 1.5, for instance, means that proteins with those amino acids are 1.5 times more likely to localize to that part of the cell than would be expected by chance.
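To make these definitions concrete, here is a minimal Python sketch that computes all three metrics for a single rule. It uses a tiny made-up grocery transaction list in the spirit of Figure 1, not the protein data, so the items and numbers are purely illustrative.

# Minimal sketch: computing support, confidence, and lift by hand
# on a small made-up transaction list (items are illustrative only).
transactions = [
    {"Milk", "Eggs", "Bread"},
    {"Milk", "Bread"},
    {"Milk", "Coke"},
    {"Eggs", "Bread"},
    {"Milk", "Eggs", "Bread", "Coke"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Toy rule: {Milk, Eggs} => {Bread}
lhs, rhs = {"Milk", "Eggs"}, {"Bread"}
rule_support = support(lhs | rhs)         # how common the full itemset is
confidence = rule_support / support(lhs)  # P(rhs | lhs)
lift = confidence / support(rhs)          # strength relative to independence

print(f"support={rule_support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")

For this toy rule the script prints support = 0.40, confidence = 1.00, and lift = 1.25, i.e. the rule is somewhat stronger than chance.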
Figure 1: A network graph showing the relationships between items based on the discovered association rules, with nodes representing items and edges representing rules connecting them. The labels on the edges indicate which rule each connection represents, depicting how different items are associated in the dataset. Because milk has multiple connections in the association rule network, it appears to be a central item in the dataset, frequently purchased alongside various other products such as eggs, bread, and Coke.
Figure 2: A heatmap visualizing the presence (dark red) or absence (light yellow) of items in each transaction, with rows representing transactions and columns representing items. It provides a quick overview of which items frequently appear together across different transactions.
Data Prep
import pandas as pd

# Loading the data without labels
file_path = r'path_to_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_LABELS_REMOVED.xlsx'
data = pd.read_excel(file_path)

# Converting the percentages to decimal format
data = data / 100

# Setting the binarization threshold
threshold = 0.1

# Applying binarization: convert values above the threshold to 1 and the rest to 0
binary_data = data.applymap(lambda x: 1 if x > threshold else 0)

# Saving the binarized data to a new Excel file
binary_file_path = r'path_to_binary_amino_acid_data.xlsx'
binary_data.to_excel(binary_file_path, index=False)

# Showing a small preview of the binarized data
print(binary_data.head())
Binarized excel file link: binary_amino_acid_data.xlsx
Screenshot of the binarized data from above
I binarized the amino acid composition data to create a transaction-like format. Each protein is treated as a transaction, and each amino acid is treated as an item. This format is required for the Apriori algorithm to identify frequent amino acid patterns across proteins and generate useful association rules.
The transaction data consists of amino acid compositions (represented as binary values) for proteins localized to different subcellular regions (e.g., nucleus, cytoplasm, cell membrane). Each protein is treated as a "transaction," and each amino acid type (e.g., A, C, D, E) is treated as a "product" or "item" in that transaction, as sketched below.
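As a small illustration, this is how one row of the binarized table maps to a transaction of amino-acid items. The 0/1 values here are made up; only the Percent_ column naming mirrors the binarized file.

import pandas as pd

# Hypothetical binarized row for a single protein (1 = composition above the 0.1 threshold)
row = pd.Series({"Percent_A": 1, "Percent_C": 0, "Percent_D": 1, "Percent_E": 0, "Percent_P": 1})

# The "transaction" is simply the set of items (amino acids) present in the protein
transaction = [item for item, present in row.items() if present == 1]
print(transaction)  # ['Percent_A', 'Percent_D', 'Percent_P']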
Apriori Algorithm
# Loading the libraries needed
library(arules)
library(arulesViz)
library(readxl)
library(htmlwidgets)

# Setting my working directory
setwd("path")

# Loading the binary data from the Excel file
file_path <- "path_to_binary_amino_acid_data.xlsx"
data <- read_excel(file_path, sheet = 1)

# Converting the data to transaction format
# (coercing the 0/1 matrix to logical so arules accepts it)
binary_data <- as(as.matrix(data) == 1, "transactions")

# Exploring the loaded data (just to see)
summary(binary_data)

# Step 1: Running the Apriori algorithm with adjusted support and confidence thresholds
# (support and confidence tuned to ensure 15+ rules)
rules <- apriori(binary_data, parameter = list(support = 0.004, confidence = 0.4, minlen = 2))

# Checking how many rules were generated
cat("Number of rules generated:", length(rules), "\n")

# Step 2: Sorting rules by lift, confidence, and support
# Sorting and printing the top 15 rules by lift
sorted_rules_by_lift <- sort(rules, by = "lift", decreasing = TRUE)
cat("\nTop 15 rules sorted by Lift:\n")
inspect(head(sorted_rules_by_lift, 15))

# Sorting and printing the top 15 rules by confidence
sorted_rules_by_confidence <- sort(rules, by = "confidence", decreasing = TRUE)
cat("\nTop 15 rules sorted by Confidence:\n")
inspect(head(sorted_rules_by_confidence, 15))

# Sorting and printing the top 15 rules by support
sorted_rules_by_support <- sort(rules, by = "support", decreasing = TRUE)
cat("\nTop 15 rules sorted by Support:\n")
inspect(head(sorted_rules_by_support, 15))

# Step 3: Visualizing the top 15 rules by lift
if (length(sorted_rules_by_lift) > 0) {
  top_rules <- head(sorted_rules_by_lift, 15)
  plot_obj <- plot(top_rules, method = "graph", engine = "htmlwidget")
  # Saving the plot as an interactive HTML file
  htmlwidgets::saveWidget(plot_obj, file = "top_15_rules.html")
  cat("Interactive plot saved as 'top_15_rules.html'.\n")
} else {
  cat("No rules available for plotting.\n")
}

# Step 4: Plotting the most frequent items across the transactions
cat("\nTop 10 most frequent items:\n")
itemFrequencyPlot(binary_data, topN = 10, type = "absolute")

# Step 5: Printing a summary of the binary transaction data
cat("\nSummary of Binary Transaction Data:\n")
summary(binary_data)
Reference: Dr. Ami Gates, CU Boulder (https://gatesboltonanalytics.com/?page_id=268)
Top 15 Association Rules for Amino Acid Percentages (interactive)
This graph visualizes the top 15 association rules identified in the amino acid percentage data, based on the Apriori algorithm. The rules were generated using a support threshold of 0.004 (0.4%), a confidence threshold of 0.4 (40%), and a minimum rule length of 2. The nodes represent individual amino acid percentages, while the arrows illustrate the direction of each association, with line thickness indicating its strength (lift). These rules help uncover relationships between amino acids across proteins, providing insight into common patterns.
Interpreting the chart above
When hovering over Rule 1 in the association graph, we can see the following insights:
- Support: 0.00687 – This indicates that 0.687% of all transactions (protein compositions) contain Percent_P, Percent_R, and Percent_A together.
- Confidence: 0.653 – This means that 65.3% of the time, if Percent_P and Percent_R are present, Percent_A will also be present.
- Lift: 5 – The lift value suggests that the presence of Percent_P and Percent_R makes it 5 times more likely that Percent_A will also occur compared to if they were independent.
- Count: 32 – This is the actual number of transactions (rows in the data) that support this rule.
These four values are tied together arithmetically, as sketched below.
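Because support, confidence, lift, and count are defined in terms of one another, the rounded Rule 1 values above can be used to back out a few quantities the hover box does not show directly. This is only a rough sanity check, since the displayed metrics are rounded.

# Rough sanity check using the (rounded) Rule 1 metrics reported above
support_rule = 0.00687   # support of {Percent_P, Percent_R, Percent_A}
confidence   = 0.653     # P(Percent_A present | Percent_P and Percent_R present)
lift         = 5         # confidence / support(Percent_A)
count        = 32        # transactions containing the full itemset

n_transactions = count / support_rule       # total transactions, roughly 4658 proteins
support_lhs    = support_rule / confidence  # support of {Percent_P, Percent_R}, roughly 0.0105
support_rhs    = confidence / lift          # support of {Percent_A}, roughly 0.13

print(round(n_transactions), round(support_lhs, 4), round(support_rhs, 3))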
Note (just a reminder): values like Percent_A, Percent_P, and Percent_R represent the percentage of a specific amino acid (one of the 20) in a given protein.
Association Rule Localization Analysis for Predicting Subcellular Protein Locations
Note: I thought of adding this (even though I don't think it is required) as an addition to the analysis of the association rules. By mapping the rules to specific subcellular localizations (Nucleus, Cytoplasm, and Cell Membrane), this script helps identify which amino acid patterns are most predictive of a protein's location within the cell.
import pandas as pd

# Loading the labeled data containing amino acid compositions with subcellular locations
file_path = 'path_to_amino_acid_seq_Nucleus_Cytoplasm_Cell_membrane_WITH_LABELS.xlsx'
labeled_data = pd.read_excel(file_path)

# Loading the rules output from the previous association rule analysis
rules_output = pd.read_excel('path_to_rules_output.xlsx')

# Defining a function to check whether a rule's lhs (left-hand side) and
# rhs (right-hand side) amino acids match the protein composition
def rule_matches_protein(rule, protein_data):
    # Splitting the rule into lhs and rhs components
    lhs_rhs = rule.strip('{}').split('=>')
    lhs = lhs_rhs[0].replace('{', '').replace('}', '').split(',')
    rhs = lhs_rhs[1].strip().replace('{', '').replace('}', '')

    # Removing any extra spaces in lhs and rhs
    lhs = [item.strip() for item in lhs]
    rhs = rhs.strip()

    try:
        # Ensuring the 'Percent_' prefix for columns to match the data format
        lhs_columns = [aa if aa.startswith('Percent_') else f'Percent_{aa}' for aa in lhs]
        rhs_column = rhs if rhs.startswith('Percent_') else f'Percent_{rhs}'

        # Checking if all amino acids in the lhs have non-zero percentages
        lhs_matches = all(protein_data[aa] > 0 for aa in lhs_columns)

        # Checking if the rhs amino acid has a non-zero percentage
        rhs_matches = protein_data[rhs_column] > 0
    except KeyError:
        return False

    return lhs_matches and rhs_matches

# Initializing a dictionary to store counts for each subcellular location
localization_counts = {'Nucleus': [], 'Cytoplasm': [], 'Cell membrane': []}

# Iterating over each rule and counting occurrences in proteins from different localizations
for rule in rules_output['rules']:
    for loc in ['Nucleus', 'Cytoplasm', 'Cell membrane']:
        # Filtering proteins by subcellular location
        localized_proteins = labeled_data[labeled_data['Subcellular_Location'] == loc]

        # Counting how many proteins match the rule
        match_count = sum(localized_proteins.apply(lambda row: rule_matches_protein(rule, row), axis=1))
        localization_counts[loc].append(match_count)

# Creating a DataFrame to store the localization counts for each rule
localization_df = pd.DataFrame(localization_counts, index=rules_output['rules'])

# Adding a column identifying the dominant location for each rule based on the highest match count
localization_df['Dominant_Location'] = localization_df.idxmax(axis=1)

# Saving the results to an Excel file
output_file = 'path_to_rule_dominant_location_analysis.xlsx'
localization_df.to_excel(output_file)
print(f'Dominant location analysis saved to {output_file}')
Link to the Excel file created: rule_dominant_location_analysis.xlsx
Screenshot of the Excel file created
This Python script above identifies which subcellular location (nucleus, cytoplasm, or cell membrane) is most strongly associated with each of the top association rules derived from the amino acid composition data. It does so by comparing how often each rule matches proteins in different subcellular locations, providing insight into which rules best predict the localization of proteins based on their amino acid content.
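As a small follow-up (not part of the script above, but assuming the same output file and the Dominant_Location column it creates), the saved results can be summarized to see how the top rules split across the three locations:

import pandas as pd

# Loading the dominant-location analysis produced by the script above
localization_df = pd.read_excel('path_to_rule_dominant_location_analysis.xlsx', index_col=0)

# Counting how often each subcellular location is the dominant one across the top rules
print(localization_df['Dominant_Location'].value_counts())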
Conclusions
The analysis of amino acid percentages using association rule mining revealed several patterns that predominantly predict nuclear localization. All of the top 15 rules were most strongly associated with proteins localized in the nucleus, as shown by the dominant "Nucleus" prediction for every rule. The confidence and lift values indicate that specific combinations of amino acids, such as {Percent_P, Percent_R} => {Percent_A}, are strongly indicative of nuclear proteins. No comparable patterns emerged for cytoplasmic or cell membrane localization, suggesting that the data, in its current form, is more predictive of nuclear proteins. Further refinement of the rules or exploration of additional features may be necessary to uncover patterns linked to cytoplasmic or cell membrane proteins.