Aggregation-prone regions Analysis

Datasets¶

In this analysis, we compare Eisenberg's dataset of interacting APRs (labeled as Template structures) and our own created dataset of hexapeptide pairs (extracted from known amyloid proteins in the PDB database and repaired with FoldX). Eisenberg's dataset contains 181 structures, our dataset ~tens of thousands.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import norm
from Bio.PDB import PDBParser, PPBuilder

import math
import os

# RSA #
from Bio.PDB import PDBParser
from Bio.Data.IUPACData import protein_letters_3to1
from Bio.PDB.Polypeptide import is_aa
from Bio.PDB import SASA
from Bio.PDB.SASA import ShrakeRupley
from collections import defaultdict

from concurrent.futures import ProcessPoolExecutor, as_completed
from Bio.PDB import PDBParser, PDBExceptions
#######

# Shape Analysis #
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

from sklearn.decomposition import PCA
import umap

from skbio.stats.ordination import pcoa
from skbio import DistanceMatrix
##################

# Trees #
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform
#########


sns.set_theme(style="whitegrid")

Some structures needed to be standardized before feeding them in FoldX's Repair command. We load our datasets and remove non-standardized structures (which have a standardized equivalent).

# Get rid of non-standardized files if standardized exists

def filter_non_standardized(df):
    df = df.copy()
    df['base_id'] = df['PDB_File'].apply(get_base_id)

    # Find which base IDs have both versions
    has_standardized = set(
        df[df['PDB_File'].str.endswith('_standardized_Repair.pdb')]['base_id']
    )

    is_standardized = set(
        df[df['PDB_File'].str.endswith('_standardized_Repair.pdb')]['PDB_File']
    )

    # Keep:
    # - all standardized files
    # - non-standardized files only if no standardized version exists
    filtered_df = df[
        (
            ~df['PDB_File'].str.endswith('_Repair.pdb') | 
            (df['PDB_File'].str.endswith('_Repair.pdb') & ~df['base_id'].isin(has_standardized)) |
            df['PDB_File'].isin(is_standardized)
        )
    ]

    # Drop helper column
    filtered_df = filtered_df.drop(columns=['base_id'])
    
    return filtered_df

# Extract base ID (before _Repair.pdb or _standardized_Repair.pdb)
def get_base_id(filename):
    if filename.endswith('_standardized_Repair.pdb'):
        return filename.replace('_standardized_Repair.pdb', '')
    elif filename.endswith('_Repair.pdb'):
        return filename.replace('_Repair.pdb', '')
    else:
        return filename

# Stability datasets
print("Reading stability datasets...")

## Read template (Eisenberg's) dataset
template_df = pd.read_csv("./results/foldx_stability_results_templates.csv")
## Read combined stability results for unrepaired and repaired fragment pairs and split them
combined_stability_df = pd.read_csv("./results/foldx_stability_results_fragment_pairs.csv")
extracted_df = combined_stability_df[~combined_stability_df['PDB_File'].str.endswith('Repair.pdb')]
repaired_df = combined_stability_df[combined_stability_df['PDB_File'].str.endswith('Repair.pdb')]
print("Loaded\n")

## Remove unwanted elements
print(f"Removing unwanted elements from the datasets...")
print(f"Extracted: {len(extracted_df[extracted_df['PDB_File'].str.endswith('standardized.pdb')])} standardized extracts out of {len(extracted_df)}")
print(f"Repaired: {len(repaired_df[repaired_df['PDB_File'].str.endswith('standardized_Repair.pdb')])} standardized repairs out of {len(repaired_df)}")
extracted_df = extracted_df[~extracted_df['PDB_File'].str.endswith('standardized.pdb')] # Remove standardized unrepaired duplicates
repaired_df = filter_non_standardized(repaired_df) # remove Repairs with standardized duplicates
print(f"Extracted: {len(extracted_df)} dihexapeptide fragment pairs after initial filtering")
print(f"Repaired: {len(repaired_df)} dihexapeptide fragment pairs after initial filtering")

Reading stability datasets...
Loaded

Removing unwanted elements from the datasets...
Extracted: 1508 standardized extracts out of 193431
Repaired: 1508 standardized repairs out of 193431
Extracted: 191923 dihexapeptide fragment pairs after initial filtering
Repaired: 191923 dihexapeptide fragment pairs after initial filtering

We are going to be working with repaired_df (or its subsample) which is our extracted dataset after repair.

# Repaired dataset's info
repaired_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 191923 entries, 1 to 386861
Data columns (total 8 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   PDB_File               191923 non-null  object 
 1   Total_Energy           191923 non-null  float64
 2   Backbone_HBond         191923 non-null  float64
 3   Sidechain_HBond        191923 non-null  float64
 4   Van_der_Waals          191923 non-null  float64
 5   Electrostatics         191923 non-null  float64
 6   Solvation_Polar        191923 non-null  float64
 7   Solvation_Hydrophobic  191923 non-null  float64
dtypes: float64(7), object(1)
memory usage: 13.2+ MB

Here's a quick summary of all the three dataset's (including estimation of total energy):

# Print how many unique PDB ids are in each dataframe
pdb_ids_t = template_df["PDB_File"].str[:4].unique()
pdb_ids_e = extracted_df["PDB_File"].str[:4].unique()
pdb_ids_r = repaired_df["PDB_File"].str[:4].unique()

K = pd.DataFrame({
    "Dataframe": ["Template", "Extracted", "Repaired"],
    "Unique PDB IDs": [len(pdb_ids_t), len(pdb_ids_e), len(pdb_ids_r)],
    "Total Entries": [len(template_df), len(extracted_df), len(repaired_df)],
    "Mean Total Energy": [template_df["Total_Energy"].mean(), extracted_df["Total_Energy"].mean(), repaired_df["Total_Energy"].mean()],
    "Median Total Energy": [template_df["Total_Energy"].median(), extracted_df["Total_Energy"].median(), repaired_df["Total_Energy"].median()],
    "Max Total Energy": [template_df["Total_Energy"].max(), extracted_df["Total_Energy"].max(), repaired_df["Total_Energy"].max()],
    "Min Total Energy": [template_df["Total_Energy"].min(), extracted_df["Total_Energy"].min(), repaired_df["Total_Energy"].min()],
})

K.head()

	Dataframe	Unique PDB IDs	Total Entries	Mean Total Energy	Median Total Energy	Max Total Energy	Min Total Energy
0	Template	80	181	12.377997	14.6891	57.066	-44.0693
1	Extracted	500	191923	49.973542	48.2727	538.420	-30.4614
2	Repaired	500	191923	30.177946	29.5613	270.050	-44.7015

During extraction, information about each structure was written down in a protein_properties csv. Here are few sanity checks if our pipeline worked as expected.

# Protein properties dataset
print("Loading protein properties dataset...")

protein_properties = pd.read_csv("./results/protein_data_pdb_amyloid_structures.csv")
protein_properties['pdb_file'] = protein_properties['pdb_file'].str[:4]
protein_properties = protein_properties.rename(columns={"pdb_file": "PDB ID"})
print("Loaded\n")

## Simple qualities
print(f"We have {len(protein_properties[protein_properties['num_of_pairs'] == 0])}/{len(protein_properties)} proteins with zero pairs.")
print(f"{len(protein_properties[protein_properties['avg_num_of_layers'] != 5])} proteins have an average number of layers different from 5.\n")

## Correction info
corrected_succ = protein_properties[(protein_properties['num_of_layers_before'] != 5) & (protein_properties['avg_num_of_layers'] == 5)]['PDB ID'].unique()
corrected_unsucc = protein_properties[protein_properties['avg_num_of_layers'] != 5]['PDB ID'].unique()
non_five_layer_proteins = protein_properties[protein_properties['avg_num_of_layers'] != 5]['PDB ID']
num_pairs_non_five = len(repaired_df[(repaired_df['PDB_File'].str[:4]).isin(non_five_layer_proteins)])
print(f"{len(corrected_succ)+len(corrected_unsucc)} proteins needed correction, {len(corrected_succ)} were successful, {len(corrected_unsucc)} were not!")
print(f"{len(repaired_df[(repaired_df['PDB_File'].str[:4]).isin(non_five_layer_proteins)])}/{len(repaired_df)} fragments come from NON-5-layered structs!")
print(f"{len(protein_properties[(protein_properties['avg_num_of_layers'] == 5) & (protein_properties['num_of_pairs'] > 0)])} 5-layered proteins contributed with {len(repaired_df) - num_pairs_non_five}/{len(repaired_df)} pairs.")

Loading protein properties dataset...
Loaded

We have 2/502 proteins with zero pairs.
8 proteins have an average number of layers different from 5.

344 proteins needed correction, 336 were successful, 8 were not!
12592/191923 fragments come from NON-5-layered structs!
492 5-layered proteins contributed with 179331/191923 pairs.

As we can see, few of our structures failed the layer correction. Some of them still contributed to our datasets. We will remove them.

# Protein properties dataset's info
protein_properties.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 502 entries, 0 to 501
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   PDB ID                 502 non-null    object 
 1   num_of_polypeptides    502 non-null    int64  
 2   num_of_windows         502 non-null    int64  
 3   num_of_stacks          502 non-null    int64  
 4   avg_num_of_layers      502 non-null    float64
 5   num_of_pairs           502 non-null    int64  
 6   num_of_pps_before      502 non-null    int64  
 7   num_of_stacks_before   502 non-null    int64  
 8   num_of_layers_before   502 non-null    float64
 9   num_of_shorter_layers  502 non-null    int64  
 10  antiparallel_count     502 non-null    int64  
dtypes: float64(2), int64(8), object(1)
memory usage: 43.3+ KB

# Remove fragment pairs coming from NON-5-layered structures
print("Removing every fragment from the stability dataset's which comes from NON-5-layered structures...")

repaired_df = repaired_df[~(repaired_df['PDB_File'].str[:4]).isin(non_five_layer_proteins)]
extracted_df = extracted_df[~(extracted_df['PDB_File'].str[:4]).isin(non_five_layer_proteins)]
print(f"{len(extracted_df)} is the length of the Extracted dataset after correction")
print(f"{len(repaired_df)} is the length of the Repaired dataset after correction.")

Removing every fragment from the stability dataset's which comes from NON-5-layered structures...
179331 is the length of the Extracted dataset after correction
179331 is the length of the Repaired dataset after correction.

For our repaired_df dataset to be managable during computationally-heavy analyses, we take a 20% sample.

# Create a 20% sample of the repaired dataset for certain heavy analyses
sample_repaired = repaired_df.sample(frac=0.20, random_state=42)
print(f"Sampled {len(sample_repaired)} rows from repaired_df for later analyses.")

paths = set(sample_repaired["PDB_File"].unique())
with open("pdb_file_list_test.txt", "w") as f:
    for pdb in paths:
        f.write(pdb + "\n")
# Then use:
# rsync -av --progress --partial --append-verify --files-from=pdb_file_list.txt meta:/storage/praha1/home/tobiasma/switch-lab/amyloid-interactions/fragment_pairs/ ./fragment_pairs/

Sampled 35866 rows from repaired_df for later analyses.

Our catch-all condition needs further filtering. Let's filter the structure on distance - take structures which:

have at least 5 residues in interaction (within one layer) <= 4.5 Angstroms
at least 2 residues are used on each side of the layer

# Filter the sample
stricter_paths = []
with open("fragment_pairs_filtered_five_two_sample.txt", "r") as f:
    for line in f:
        name = line.split('/')[-1]
        name = name if name[-1] != '\n' else name[:-1] 
        stricter_paths.append({"PDB_File": name})

filter_helper = pd.DataFrame(stricter_paths)
filter_helper = filter_helper[filter_helper["PDB_File"].isin(set(sample_repaired["PDB_File"]))]
sample_repaired = pd.merge(sample_repaired, filter_helper, on=["PDB_File"], how='right')
sample_repaired = sample_repaired.dropna()
print(f"Correctly filtered on a condition, real sample has {len(sample_repaired)} structures")
sample_repaired.head()

Correctly filtered on a condition, real sample has 12304 structures

	PDB_File	Total_Energy	Backbone_HBond	Sidechain_HBond	Van_der_Waals	Electrostatics	Solvation_Polar	Solvation_Hydrophobic
0	7qkj_pair_765_Repair.pdb	18.45160	-30.7581	-8.22142	-42.7342	3.79509	61.2674	-53.1292
1	8olq_pair_012_Repair.pdb	50.32270	-36.4475	-31.34510	-59.5809	11.08980	101.5600	-67.5650
2	8ci8_pair_225_Repair.pdb	12.05720	-36.2584	-6.14683	-67.4686	4.42839	89.9917	-93.9525
3	7nrq_pair_279_Repair.pdb	49.53490	-31.8512	-9.81061	-45.1977	3.33110	86.0854	-51.5352
4	7qkk_pair_121_Repair.pdb	8.61727	-46.0144	-19.36770	-57.6874	6.74647	89.7038	-71.9427

# Filter the whole repaired_df as well
stricter_paths = []
with open("fragment_pairs_filtered_five_two.txt", "r") as f:
    for line in f:
        name = line.split('/')[-1]
        name = name if name[-1] != '\n' else name[:-1] 
        stricter_paths.append({"PDB_File": name})

filter_repaired = pd.DataFrame(stricter_paths)
repaired_df = pd.merge(repaired_df, filter_repaired, on=["PDB_File"], how='right')
repaired_df = repaired_df.dropna()
print(f"Correctly filtered on a condition, repaired df has {len(repaired_df)} structures now")
repaired_df.head()

Correctly filtered on a condition, repaired df has 61951 structures now

	PDB_File	Total_Energy	Backbone_HBond	Sidechain_HBond	Van_der_Waals	Electrostatics	Solvation_Polar	Solvation_Hydrophobic
0	8v1n_pair_342_Repair.pdb	75.12660	-24.4209	-10.17090	-35.7653	9.865810	81.5012	-34.3863
1	8otg_pair_073_Repair.pdb	7.37734	-48.2354	-16.49210	-53.7371	5.152580	79.6257	-68.4023
2	7xo2_pair_150_Repair.pdb	38.10280	-29.0117	-8.14856	-39.2094	2.617550	63.2756	-48.1347
3	7xo3_pair_287_Repair.pdb	30.79310	-32.1495	-10.58240	-46.0042	-0.740443	73.4309	-57.7334
4	8ot4_pair_130_Repair.pdb	75.06350	-16.4509	-20.34710	-43.9366	10.024300	95.2448	-44.8586

Stability¶

The first thing to look at is if our dataset's estimation of total energy (with FoldX's Stability command) matches with the Eisenberg's

# Do a histogram of total energy for both dataframes for comparison
template_label = f"Template (n={len(template_df)})"
extracted_label = f"Extracted (n={len(extracted_df)})"
repaired_label = f"Repaired (n={len(repaired_df)})"

all_stability_data = np.concatenate([
    template_df["Total_Energy"].values,
    #extracted_df["Total_Energy"].values,
    repaired_df["Total_Energy"].values
])

# Define bin edges
bins = np.linspace(all_stability_data.min(), all_stability_data.max(), 106)

plt.figure(figsize=(10, 6))
sns.histplot(template_df["Total_Energy"], bins=bins, color="mediumorchid", label=template_label, kde=False, stat="density")
#sns.histplot(extracted_df["Total_Energy"], bins=50, color="red", label=extracted_label, kde=True, stat="density")
sns.histplot(repaired_df["Total_Energy"], bins=bins, color="cornflowerblue", label=repaired_label, kde=False, stat="density")

plt.xlim(-50, 110)
plt.xlabel("Total Energy")
plt.ylabel("Probability Density")
plt.title("Total Energy Histograms of Template and Repaired fragment pairs (Normalized)")
plt.grid(False)
plt.axvline(0, color="lightgray", linewidth=2, linestyle="-")
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
plt.gca().set_yticks([0, 0.01, 0.02])

plt.legend()
plt.savefig("plots/total_energy_histogram_fragments.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/total_energy_histogram_fragments.pdf", bbox_inches="tight")
plt.savefig("plots/total_energy_histogram_fragments.svg", bbox_inches="tight")
plt.show()

# Best fragments
repaired_df.nsmallest(50, 'Total_Energy').head()

	PDB_File	Total_Energy	Backbone_HBond	Sidechain_HBond	Van_der_Waals	Electrostatics	Solvation_Polar	Solvation_Hydrophobic
4514	8pkg_pair_302_Repair.pdb	-44.7015	-51.9016	-1.37386	-69.9902	4.036450	70.8653	-107.5340
7157	8q98_pair_008_Repair.pdb	-39.7086	-51.9347	-18.13530	-70.7382	1.057940	76.6335	-103.1080
29008	8q98_pair_309_Repair.pdb	-39.7062	-51.6920	-15.25990	-70.4398	1.018250	76.2852	-102.5010
44807	7r5h_pair_010_Repair.pdb	-39.6906	-50.3670	-12.58230	-70.4432	0.979359	74.7961	-102.7270
13972	8q98_pair_305_Repair.pdb	-38.9145	-52.5437	-20.16940	-66.3700	0.840698	73.8202	-94.4693

# Worst fragments
repaired_df.nlargest(50, 'Total_Energy').head()

	PDB_File	Total_Energy	Backbone_HBond	Sidechain_HBond	Van_der_Waals	Electrostatics	Solvation_Polar	Solvation_Hydrophobic
16336	8bg9_pair_084_Repair.pdb	215.356	-34.8066	-10.9174	-51.8010	44.2806	110.281	-59.6230
28803	8bg9_pair_091_Repair.pdb	215.356	-34.8066	-10.9174	-51.8010	44.2806	110.281	-59.6230
11842	8bg9_pair_104_Repair.pdb	210.551	-35.2233	-11.3284	-52.7205	44.2306	109.575	-61.5739
42611	8bg9_pair_097_Repair.pdb	210.551	-35.2233	-11.3284	-52.7205	44.2306	109.575	-61.5739
8658	8bg9_pair_098_Repair.pdb	201.444	-37.8533	-17.3518	-50.9960	34.9611	110.994	-55.5531

Ramachandran¶

To quickly check whether we found some new shapes which are not present in Eisenberg's dataset, we plot a Ramachandran plot.

# Create a Ramachandran plot of every hexapeptide within the repaired dataset's sample
def extract_phi_psi_angles(folder, sample=None):
    parser = PDBParser(QUIET=True)
    ppb = PPBuilder()
    phi_psi_all = []

    for pdb_file in os.listdir(folder):
        if not pdb_file.endswith(".pdb") or (sample and pdb_file not in sample):
            continue

        structure = parser.get_structure(pdb_file, os.path.join(folder, pdb_file))
        for pp in ppb.build_peptides(structure):
            phi_psi = pp.get_phi_psi_list()
            for phi, psi in phi_psi:
                if phi and psi:  # skip None values (termini)
                    phi_psi_all.append((math.degrees(phi), math.degrees(psi)))
    return phi_psi_all

# === Extract phi/psi angles if not computed already ===
if os.path.exists("./results/phi_psi_angles_sample.npz") and os.path.exists("./results/phi_psi_angles_sample.npz"):
    angles1 = np.load("./results/phi_psi_angles_sample.npz")["angles"]
    angles2 = np.load("./results/phi_psi_angles_templates.npz")["angles"]
else:
    angles1 = extract_phi_psi_angles("fragment_pairs/", set(sample_repaired["PDB_File"].unique())) # only from the sample
    angles2 = extract_phi_psi_angles("templates/")
    np.savez_compressed("./results/phi_psi_angles_sample.npz", angles=np.array(angles1))
    np.savez_compressed("./results/phi_psi_angles_templates.npz", angles=np.array(angles2))

fig, axs = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)

phi1, psi1 = zip(*angles1)
axs[0].scatter(phi1, psi1, s=1, alpha=0.008, color="cornflowerblue")
axs[0].set_title(f"Ramachandran Plot - Extracted (n={len(phi1)})")
axs[0].set_xlabel("ϕ")
axs[0].set_ylabel("ψ")
axs[0].grid(False)
for spine in axs[0].spines.values():
    spine.set_color('gray')
axs[0].axvline(0, color="lightgray", linewidth=2, linestyle="-")
axs[0].axhline(0, color="lightgray", linewidth=2, linestyle="-")

phi2, psi2 = zip(*angles2)
axs[1].scatter(phi2, psi2, s=1, alpha=0.1, color="mediumorchid")
axs[1].set_title(f"Ramachandran Plot - Templates (n={len(phi2)})")
axs[1].set_xlabel("ϕ")
axs[1].grid(False)
for spine in axs[1].spines.values():
    spine.set_color('gray')
axs[1].axvline(0, color="lightgray", linewidth=2, linestyle="-")
axs[1].axhline(0, color="lightgray", linewidth=2, linestyle="-")

# Set shared limits
for ax in axs:
    ax.set_xlim(-180, 180)
    ax.set_ylim(-180, 180)

plt.savefig("plots/ramachandran.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/ramachandran.pdf", bbox_inches="tight")
plt.savefig("plots/ramachandran.svg", bbox_inches="tight")
plt.tight_layout()
plt.show()

Sequence detail¶

To assess each amino-acid's contribution to the stability their structures, we use FoldX's SequenceDetail command.

# Load sequence detail
print ("Loading sequence detail datasets...")
seqdetail_templates = pd.read_csv("./results/foldx_seqdetail_results_templates.csv")
seqdetail = pd.read_csv("./results/foldx_seqdetail_results_fragment_pairs.csv")
## Get rid of non-standardized files
seqdetail = filter_non_standardized(seqdetail)
print("Loaded")

print("Head of Sequence detail dataset of Repaired fragments")
seqdetail.head()

Loading sequence detail datasets...
Loaded
Head of Sequence detail dataset of Repaired fragments

	PDB_File	Three_Letter	Chain	Residue_Number	Total_Energy	BackHBond
0	./SD_2n0a_pair_2691_Repair.pdb	ALA	C	85	-0.032793	0.000000
1	./SD_2n0a_pair_2691_Repair.pdb	GLY	C	86	0.657411	0.000000
2	./SD_8oq4_pair_180_Repair.pdb	GLN	F	37	0.643152	0.000000
3	./SD_2n0a_pair_5190_Repair.pdb	H2S	F	50	0.514245	-0.485031
4	./SD_8uq7_pair_304_Repair.pdb	GLY	A	335	0.397621	-0.596954

# Check how many residues in seqdetail come from pairs of NON-5-layer proteins.
percentage=round(100*len(seqdetail[(seqdetail['PDB_File'].str[5:9]).isin(non_five_layer_proteins)])/len(seqdetail), 4)
print(f"{len(seqdetail[(seqdetail['PDB_File'].str[5:9]).isin(non_five_layer_proteins)])}/{len(seqdetail)} ({percentage}%) elements come from NON-5-layered structures")

# Remove unwanted elements from the dataframe
seqdetail = seqdetail[~(seqdetail['PDB_File'].str[5:9]).isin(non_five_layer_proteins)]
print(f"{len(seqdetail)} is the length of the seqdetail dataset after correction.")

151104/10910230 (1.385%) elements come from NON-5-layered structures
10759126 is the length of the seqdetail dataset after correction.

# Distance filtering
seqdetail = seqdetail[(seqdetail['PDB_File'].str[5:]).isin(set(repaired_df["PDB_File"]))]
print(f"After filtering based on length, we have {len(seqdetail)} residues")

After filtering based on length, we have 3716740 residues

Let's check the dataset's redundancies (in sequence).

# TODO: Use already known functions (to get the ordered list of atoms - and then residues)
print("TODO")

TODO

After seqdetail dataset's initial filtering, we check the residue distribution between the sequence detail datasets of both repaired hexapeptide pairs (our repaired extracted fragments) and template structures (Eisenberg's).

seqdetail['Three_Letter'].value_counts()

Three_Letter
GLY    600140
VAL    509335
ALA    356712
LYS    318728
THR    261343
SER    250223
ILE    181325
GLN    179907
LEU    165575
ASN    157148
GLU    141643
PRO    102752
ASP     79595
TYR     75767
H1S     71267
PHE     65764
H2S     55404
MET     53825
CYS     37430
ARG     25908
HIS     18514
TRP      7930
PTR       505
Name: count, dtype: int64

# FIXME Get rid of hetatoms? e.g. ZN
print("Residues and het-atoms in the template dataset")
seqdetail_templates['Three_Letter'].value_counts()

Residues and het-atoms in the template dataset

Three_Letter
GLY    1930
VAL    1760
ALA     950
THR     920
ASN     850
SER     840
LEU     840
ILE     800
TYR     370
PHE     330
GLN     310
MET     300
LYS     240
GLU     130
H1S     111
ZN       70
ASP      40
HIS      32
PRO      30
H2S      27
TRP      20
Name: count, dtype: int64

Total Energies¶

Let's see the total energy of each individual residue - again comparing ours and Eisenberg's dataset.

desc = seqdetail['Total_Energy'].describe()
print(desc.apply(lambda x: f"{x:.6f}"))

count    3716740.000000
mean           0.489640
std            1.178426
min           -4.176860
25%           -0.152940
50%            0.527413
75%            1.212220
max           14.548800
Name: Total_Energy, dtype: object

desc = seqdetail_templates['Total_Energy'].describe()
print(desc.apply(lambda x: f"{x:.6f}"))

count    10906.000000
mean         0.205430
std          1.178158
min         -4.129810
25%         -0.446808
50%          0.259494
75%          0.908089
max          7.062650
Name: Total_Energy, dtype: object

Our best (lowest total energy) and worst (highest) residues:

# Best residues according to seqdetail energy estimation
best_repaired = repaired_df.nsmallest(100, "Total_Energy").head() # TODO

seqdetail.nsmallest(100, "Total_Energy").head()

	PDB_File	Three_Letter	Chain	Residue_Number	Total_Energy	BackHBond	SideHBond
3577542	./SD_8r47_pair_205_Repair.pdb	ARG	C	28	-4.17686	-0.247444	-1.69147
7354890	./SD_8r47_pair_199_Repair.pdb	ARG	C	28	-4.17686	-0.247444	-1.69147
4467223	./SD_7u11_pair_006_Repair.pdb	TYR	A	125	-4.08744	-1.371750	-1.56968
650860	./SD_9fnb_pair_004_Repair.pdb	TYR	B	125	-3.96797	-1.542940	-1.04820
7055515	./SD_9fnb_pair_005_Repair.pdb	TYR	B	125	-3.95643	-1.549240	-1.04820

# Worst residues according to seqdetail energy estimation
worst_repaired = repaired_df.nlargest(100, "Total_Energy").head() # TODO

seqdetail.nlargest(100, "Total_Energy").head()

	PDB_File	Three_Letter	Chain	Residue_Number	Total_Energy	BackHBond
4182516	./SD_8bg9_pair_085_Repair.pdb	GLU	I	22	14.5488	-0.687146
7962205	./SD_8bg9_pair_092_Repair.pdb	GLU	I	22	14.5488	-0.687146
805688	./SD_8bg9_pair_091_Repair.pdb	GLU	A	22	14.3261	-0.622249
1120332	./SD_8bg9_pair_084_Repair.pdb	GLU	A	22	14.3261	-0.622249
4180836	./SD_8bg9_pair_085_Repair.pdb	GLU	A	22	14.2986	-0.626909

To see which residues make our worst extracted hexapeptides pairs bad, we take a sample which consists of those whose total energy (estimated with Stability command) is among the 1000 largest. We compare those bad fragments with all of our fragments.

# Label bad fragment's sequence detail elements (highest energy fragments according to Stability)
print(len(repaired_df[repaired_df['Total_Energy'] >= 80]))
print(f"{100*len(repaired_df[repaired_df['Total_Energy'] >= 80])/len(repaired_df)} %")

highest_energy_repaired = repaired_df.nlargest(800, "Total_Energy")
seqdetail_bad = seqdetail[(seqdetail["PDB_File"].str[5:]).isin(highest_energy_repaired["PDB_File"])]
# Add a category column to differentiate both datasets
seqdetail.loc[:, "Category"] = "All Fragments"
seqdetail_bad = seqdetail_bad.copy()
seqdetail_bad.loc[:, "Category"] = "Bad Fragments"

799
1.2897289793546514 %

paths = set(seqdetail_bad["PDB_File"].str[5:].unique())
with open("pdb_file_list_bad_test.txt", "w") as f:
    for pdb in paths:
        f.write(pdb + "\n")
# Then use:
# rsync -av --progress --partial --append-verify --files-from=pdb_file_list_bad_test.txt meta:/storage/praha1/home/tobiasma/switch-lab/amyloid-interactions/fragment_pairs/ ./fragment_pairs/

# Define amino acid groups (uppercase)
amino_acid_groups = {
    "Hydrophobic": ["ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"],
    "Polar": ["SER", "THR", "ASN", "GLN", "TYR", "CYS", "GLY"],  # Added Glycine to Polar
    "Charged_Positive": ["LYS", "ARG", "HIS"],
    "Charged_Negative": ["ASP", "GLU"]
}

# Define the custom order for x-axis (grouped)
custom_order = (
    amino_acid_groups["Hydrophobic"] +
    amino_acid_groups["Polar"] +
    amino_acid_groups["Charged_Positive"] +
    amino_acid_groups["Charged_Negative"]
)

# Define color palette for groups
group_colors = {
    "Hydrophobic": "yellow", # #bdc6d5
    "Polar": "cyan", # #e7c65b
    "Charged_Positive": "red", # #4c99cf
    "Charged_Negative": "blue", # #df6b6a
    "Unknown": "gray"
}

# Merge both datasets
combined_data = pd.concat([seqdetail, seqdetail_bad])

# Assign the "Unknown" group for amino acids not in the predefined groups
all_amino_acids = set(combined_data["Three_Letter"].unique())  # Get unique amino acids from data
known_amino_acids = set(sum(amino_acid_groups.values(), []))  # Flatten list of known amino acids
amino_acid_groups["Unknown"] = list(all_amino_acids - known_amino_acids) # Find any unknown amino acids
custom_order += (amino_acid_groups["Unknown"])

aa_color_map = {aa: group_colors[group] for group, amino_acids in amino_acid_groups.items() for aa in amino_acids}

plt.figure(figsize=(18, 6))

# Create a violin plot with split violins
sns.violinplot(
    x="Three_Letter", y="Total_Energy", hue="Category", 
    data=combined_data, split=True, inner="quartile",
    palette={"All Fragments": "cornflowerblue", "Bad Fragments": "lightcoral"},
    order=custom_order
)

plt.xlabel("Amino Acid")
plt.ylabel("Total Energy")
plt.title("Total Energy Contribution of Each Amino Acid of All Fragments and 100 Worst Fragments")
plt.legend(title="Fragment Type")
plt.xticks(rotation=45)
plt.ylim(-6, 13.5)

ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
hue_labels = {
    "All Fragments": f"All Fragments (n={len(seqdetail)} Res)",
    "Bad Fragments": f"Bad Fragments (n={len(seqdetail_bad)} Res)"
}
new_labels = [hue_labels.get(label, label) for label in labels]
ax.legend(handles, new_labels, title="Fragment Type")

for label in ax.get_xticklabels():
    aa = label.get_text()
    color = aa_color_map.get(aa)
    label.set_bbox(dict(facecolor=color, edgecolor='none', boxstyle='round,pad=0.3', alpha=0.5)) # Apply background color

plt.show()

Let's plot the estimated energy of each residue to compare the datasets.

seqdetail_templates.loc[:, "Category"] = "Template Fragments"

plt.figure(figsize=(10, 6))

# Merge both dataset (the whole seqdetail and template's seqdetail)
combined_data = pd.concat([seqdetail_templates, seqdetail])
combined_data = combined_data.reset_index(drop=True)

# kde plot of seqdetail templates vs seqdetail repaired total energy (of all residues, not-type specific)
sns.kdeplot(
    data=combined_data, x="Total_Energy",
    hue="Category",
    fill=True, common_norm=False,
    palette={"Template Fragments": "mediumorchid", "All Fragments": "cornflowerblue"},
    alpha=0.5
)

plt.axvline(0, color="lightgray", linewidth=2, linestyle="--")
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
plt.gca().set_yticks([0, 0.1, 0.2, 0.3, 0.4])

plt.xlabel("Total Energy")
plt.ylabel("Density")
plt.grid(False)
plt.xlim(-6, 6)
plt.title("Per-residue Total Energy Distribution of Extracted Fragments and Template Fragments")
plt.savefig("plots/total_energy_kde_residues.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/total_energy_kde_residues.pdf", bbox_inches="tight")
plt.savefig("plots/total_energy_kde_residues.svg", bbox_inches="tight")
plt.show()

Relative Solvent Accessibility¶

We compute and compare the Relative Solvent Accessibility (RSA) of our bad sample's residues and the residues from structures in our repaired_df sample.

# Biopython's solvent accessibility per residue (one number per residue)

AA_TO_MAX_SASA_MAP = {
"ALA": 129,
"ARG": 274,
"ASN": 195,
"ASP": 193,
"CYS": 167,
"GLN": 223,
"GLU": 225,
"GLY": 104,
"HIS": 224,
"ILE": 197,
"LEU": 201,
"LYS": 236,
"MET": 224,
"PHE": 240,
"PRO": 159,
"SER": 155,
"THR": 155,
"TRP": 285,
"TYR": 263,
"VAL": 174
}

def clip_value(lower, value, upper):
    return lower if value < lower else upper if value > upper else value

def remove_heteroatoms(structure):
    # Cannot delete heteroatoms (which are considered as residues in biopython) while iterating over structure.get_residues(), so we first need to get a list of their ids
    chain_to_heteroatoms_map = defaultdict(list)

    model = structure[0]

    for residue in model.get_residues():
        if not is_aa(residue, standard=True):
            chain_to_heteroatoms_map[residue.parent.id].append(residue.id)

    for chain, heteroatoms in chain_to_heteroatoms_map.items():
        try:
            chain_object = model[chain]
        except KeyError:
            print(f'Chain {chain} not found in structure.')
            print(f'Chains in structure: {[chain.id for chain in model.get_chains()]}')
            print(f'Chain to heteroatoms map: {chain_to_heteroatoms_map}')
            raise
        for heteroatom in heteroatoms:
            chain_object.detach_child(heteroatom)
    return

def calculate_residue_RSA(residue):
    residue_sasa, residue_max_sasa = residue.sasa, AA_TO_MAX_SASA_MAP[residue.resname]
    RSA = (residue_sasa / residue_max_sasa) * 100
    return round(RSA, ndigits = None) # Returns an integer type, which is what we want because the RSA probabilities table is set up using bins of size 1.

def get_residue_RSA_values(PDB_file_path, full_residue_IDs_list, chain_subset=None):
    """
    The RSA (Relative Solvent Accessibility) values of the list of full residue IDs are calculated in the context of the provided chain subset. For example, if the
    full PDB is an antibody antigen complex with chains AHL, chain_subset = 'HL' will calculate RSA values of residues in the HL complex context and ignore
    chain A. This is important to calculate the RSA values of interface residues correctly.
    """
    parser = PDBParser(QUIET = True)
    structure = parser.get_structure('', PDB_file_path)

    remove_heteroatoms(structure) # If not removed, heteroatoms would be accounted for when calculating the SASA of each residue.

    if chain_subset is not None:
        chains_to_remove = set(chain.id for chain in structure.get_chains()) - set(chain_subset)
        for chain_ID in chains_to_remove:
            structure.detach_child(chain_ID)

    sasa_calculator = SASA.ShrakeRupley()
    sasa_calculator.compute(structure, level='R')

    full_residue_IDs_set = set(full_residue_IDs_list)
    full_residue_ID_to_RSA_map = {}
    for residue in structure.get_residues():
        AA = protein_letters_3to1[residue.resname.title()]
        structure_ID, model_ID, chain_ID, (heteroatom_flag, position, icode) = residue.full_id # Biopython residue ID

        full_residue_ID = f'{AA}{chain_ID}{position}'
        if full_residue_ID not in full_residue_IDs_set:
            continue
        residue_RSA = calculate_residue_RSA(residue)
        full_residue_ID_to_RSA_map[full_residue_ID] = clip_value(lower=0, value=residue_RSA, upper=99) # The RSA bins in the probability table range between 0 and 99.

    residue_RSA_values = [full_residue_ID_to_RSA_map[full_residue_ID] for full_residue_ID in full_residue_IDs_list]
    # check if it doesnt contain none values:
    if None in residue_RSA_values:
        print(f'Full residue IDs list: {residue_RSA_values}')
        print(f'Full residue ID to RSA map: {full_residue_ID_to_RSA_map}')
        print(PDB_file_path, structure.get_residues())

    if len(residue_RSA_values) != len(full_residue_IDs_list):
        print(f'Full residue IDs list: {full_residue_IDs_list}')
        print(f'Full residue ID to RSA map: {full_residue_ID_to_RSA_map}')
        print(PDB_file_path, structure.get_residues())
        raise ValueError('Some residues do not have RSA values calculated.')

    return residue_RSA_values

def process_pdb_RSA(pdb_file, directory):
    parser = PDBParser(QUIET=True)
    try:
        structure = parser.get_structure(pdb_file, f"{directory}/{pdb_file}")
    except PDBExceptions.PDBConstructionException as e:
        print(f"Failed to parse {pdb_file}: {e}")
        return pdb_file, []

    residue_ID_list = []
    residue_key = []
    model = structure[0]  # Assuming we want the first model
    for chain in model:
        for residue in chain:
            structure_ID, model_ID, chain_ID, (heteroatom_flag, position, icode) = residue.get_full_id()
            try:
                residue_ID_list.append(f"{protein_letters_3to1[residue.resname.title()]}{chain_ID}{position}")
            except KeyError:
                continue
            residue_key.append((chain_ID, position))

    RSA_values = get_residue_RSA_values(f"{directory}/{pdb_file}", residue_ID_list)
    result_list = []
    for i, residue_ID in enumerate(residue_ID_list):
        chain_ID, position = residue_key[i]
        result_list.append((chain_ID, position, RSA_values[i]))

    return pdb_file, result_list

def get_residue_RSA_values_from_df(seqdetail_df, directory="./fragment_pairs", max_workers=8):
    """
    Parallel RSA (Relative Solvent Accessibility) calculation for residues in the provided dataframe.
    """
    seqdetail_df = seqdetail_df.copy()
    seqdetail_df['PDB_File'] = seqdetail_df['PDB_File'].str[5:]
    pdb_files = seqdetail_df['PDB_File'].unique()

    updates = []  # collect updates here

    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_pdb_RSA, pdb_file, directory): pdb_file for pdb_file in pdb_files}
        for future in as_completed(futures):
            pdb_file, results = future.result()
            for chain_ID, position, rsa_value in results:
                updates.append({
                    'PDB_File': pdb_file,
                    'Chain': chain_ID,
                    'Residue_Number': position,
                    'RSA': rsa_value
                })

    # Create DataFrame from updates
    updates_df = pd.DataFrame(updates)

    # Merge in the new RSA values
    seqdetail_df = seqdetail_df.merge(
        updates_df,
        on=['PDB_File', 'Chain', 'Residue_Number'],
        how='left'
    )

    return seqdetail_df

# Create a seqdetail's sample
K = seqdetail[(seqdetail["PDB_File"].str[5:]).isin(sample_repaired["PDB_File"])]
## Label seqdetail categories
K.loc[:, "Category"] = "Sampled Fragments"

# Get the residue RSA for each dataset
if os.path.exists("./results/templates_with_rsa.csv") and os.path.exists("./results/sample_with_rsa.csv") and os.path.exists("./results/bad_with_rsa.csv"):
    seqdetail_templates_with_RSA = pd.read_csv("./results/templates_with_rsa.csv")
    seqdetail_sample_with_RSA = pd.read_csv("./results/sample_with_rsa.csv")
    seqdetail_bad_with_RSA = pd.read_csv("./results/bad_with_rsa.csv")
else:
    seqdetail_templates_with_RSA = get_residue_RSA_values_from_df(seqdetail_templates, directory="./templates")
    seqdetail_sample_with_RSA = get_residue_RSA_values_from_df(K)
    seqdetail_bad_with_RSA = get_residue_RSA_values_from_df(seqdetail_bad)
    seqdetail_templates_with_RSA.to_csv("./results/templates_with_rsa.csv", index=False)
    seqdetail_sample_with_RSA.to_csv("./results/sample_with_rsa.csv", index=False)
    seqdetail_bad_with_RSA.to_csv("./results/bad_with_rsa.csv", index=False)

len(seqdetail_sample_with_RSA)

plt.figure(figsize=(18, 6))

# Merge both seqdetail datasets (bad and sampled fragments)
combined_data = pd.concat([seqdetail_sample_with_RSA, seqdetail_bad_with_RSA], ignore_index=True)

# Create a violin plot with split violins
sns.violinplot(
    x="Three_Letter", y="RSA", hue="Category", 
    data=combined_data, split=True, inner="quartile",
    palette={"Sampled Fragments": "cornflowerblue", "Bad Fragments": "lightcoral"},
    order=custom_order,
    cut=0 # Do not estimate KDE out of possible range (make it strictly 0-100)
)

plt.xlabel("Amino Acid")
plt.ylabel("RSA")
plt.title("RSA of Residues in Fragments With the Highest Total Energy Compared to Sampled Fragments")
plt.legend(title="Fragment Type")
plt.xticks(rotation=45)
plt.axhline(0, color="lightgray", linewidth=2)
plt.axhline(100, color="lightgray", linewidth=2)

ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
hue_labels = {
    "Sampled Fragments": f"Sampled Fragments (n={len(seqdetail_sample_with_RSA)} Res)",
    "Bad Fragments": f"Bad Fragments (n={len(seqdetail_bad_with_RSA)} Res)"
}
new_labels = [hue_labels.get(label, label) for label in labels]
ax.legend(handles, new_labels, title="Fragment Type")

for label in ax.get_xticklabels():
    aa = label.get_text()
    color = aa_color_map.get(aa)
    label.set_bbox(dict(facecolor=color, edgecolor='none', boxstyle='round,pad=0.3', alpha=0.5)) # Apply background color

plt.show()

We compare the sample of our dataset to Eisenberg's as well.

plt.figure(figsize=(18, 6))

# Merge both datasets (sampled and seqdetail from templates)
combined_data = pd.concat([seqdetail_sample_with_RSA, seqdetail_templates_with_RSA])
combined_data = combined_data.reset_index(drop=True)

# Create a violin plot with split violins
sns.violinplot(
    x="Three_Letter", y="RSA", hue="Category", 
    data=combined_data, split=True, inner="quartile",
    palette={"Sampled Fragments": "cornflowerblue", "Template Fragments": "mediumorchid"},
    alpha=0.8, saturation=1,
    order=custom_order, cut=0 # Do not estimate KDE out of possible range (make it strictly 0-100)
)

plt.xlabel("Amino Acid")
plt.ylabel("RSA")
plt.title("RSA Comparison of Residues From Extracted Fragments and Template Fragments")
plt.legend(title="Fragment Type")
plt.xticks(rotation=45)

plt.gca().set_yticks([0, 20, 40, 60, 80, 100])
plt.ylim(0, 100)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
plt.grid(False)

ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
hue_labels = {
    "Sampled Fragments": f"Sampled Fragments (n={len(seqdetail_sample_with_RSA)} Res)",
    "Template Fragments": f"Template Fragments (n={len(seqdetail_templates_with_RSA)} Res)"
}
new_labels = [hue_labels.get(label, label) for label in labels]
ax.legend(handles, new_labels, title="Fragment Type")

for label in ax.get_xticklabels():
    aa = label.get_text()
    color = aa_color_map.get(aa)
    label.set_bbox(dict(facecolor=color, edgecolor='none', boxstyle='round,pad=0.3', alpha=0.5)) # Apply background color


plt.savefig("plots/rsa_violin_all_residues.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/rsa_violin_all_residues.pdf", bbox_inches="tight")
plt.savefig("plots/rsa_violin_all_residues.svg", bbox_inches="tight")
plt.show()

We divide the residues in 3 categories - Stabilizing, Neutral and Destabilizing. Stabilizing residues have estimated total energies < -0.5, Destabilizing > 0.5 and Neutral all the values between. Let's see whether the distributions match across datasets.

# Create a new column with default value (optional, but good practice)
seqdetail_templates_with_RSA["Stability"] = "Unknown"
seqdetail_sample_with_RSA["Stability"] = "Unknown"

# Assign values using .loc
seqdetail_templates_with_RSA.loc[seqdetail_templates_with_RSA["Total_Energy"] < -0.5, "Stability"] = "Stabilizing"
seqdetail_templates_with_RSA.loc[seqdetail_templates_with_RSA["Total_Energy"] >= -0.5, "Stability"] = "Neutral"
seqdetail_templates_with_RSA.loc[seqdetail_templates_with_RSA["Total_Energy"] > 0.5, "Stability"] = "Destabilizing"

seqdetail_sample_with_RSA.loc[seqdetail_sample_with_RSA["Total_Energy"] < -0.5, "Stability"] = "Stabilizing"
seqdetail_sample_with_RSA.loc[seqdetail_sample_with_RSA["Total_Energy"] >= -0.5, "Stability"] = "Neutral"
seqdetail_sample_with_RSA.loc[seqdetail_sample_with_RSA["Total_Energy"] > 0.5, "Stability"] = "Destabilizing"

# Combine the DataFrames
combined_data = pd.concat([seqdetail_templates_with_RSA, seqdetail_sample_with_RSA], ignore_index=True)

plt.figure(figsize=(10, 6))

# sns.violinplot of destabilizing and stabilizing RSA
sns.violinplot(
    x="Stability", y="RSA", hue="Category",
    data=combined_data, split=True, inner="quartile",
    palette={"Sampled Fragments": "cornflowerblue", "Template Fragments": "mediumorchid"},
    order=["Stabilizing", "Neutral", "Destabilizing"],
    alpha=0.8, saturation=1,
    cut=0
)
plt.xlabel("Stability")
plt.ylabel("RSA")
plt.title("RSA Comparison of Different Types of Residues between Sampled and Template Fragments")
plt.legend(title="Fragment Type")

plt.gca().set_yticks([0, 20, 40, 60, 80, 100])
plt.ylim(0, 100)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
plt.grid(False)

ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
new_labels = [hue_labels.get(label, label) for label in labels]
ax.legend(handles, new_labels, title="Fragment Type")

plt.savefig("plots/rsa_stabilizing_destabilizing_residues.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/rsa_stabilizing_destabilizing_residues.pdf", bbox_inches="tight")
plt.savefig("plots/rsa_stabilizing_destabilizing_residues.svg", bbox_inches="tight")
plt.show()

Hydrogen Bonds¶

We compare the hydrogen bonds as well. Backbone and Side-chain.

plt.figure(figsize=(10, 6))

# Merge both dataset (the whole seqdetail and template's seqdetail)
combined_data = pd.concat([seqdetail_templates, seqdetail])
combined_data = combined_data.reset_index(drop=True)

# kde plot of seqdetail templates vs seqdetail repaired backhbond (of all residues, not-type specific)
sns.kdeplot(
    data=combined_data, x="BackHBond",
    hue="Category",
    fill=True, common_norm=False,
    palette={"Template Fragments": "mediumorchid", "All Fragments": "cornflowerblue"},
    alpha=0.5
)
plt.xlabel("Backbone Hydrogen Bond")
plt.ylabel("Density")
plt.title("Per-residue BackHBond Distribution of Extracted Fragments and Template Fragments")

plt.xlim(-2.5, 0.5)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
plt.grid(False)

plt.savefig("plots/backbone_h_bond_kde.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/backbone_h_bond_kde.pdf", bbox_inches="tight")
plt.savefig("plots/backbone_h_bond_kde.svg", bbox_inches="tight")
plt.show()

plt.figure(figsize=(10, 6))

# kde plot of seqdetail templates vs seqdetail repaired sidehbond (of all residues, not-type specific)
sns.kdeplot(
    data=combined_data, x="SideHBond",
    hue="Category",
    fill=True, common_norm=False,
    palette={"Template Fragments": "mediumorchid", "All Fragments": "cornflowerblue"},
    alpha=0.5
)
plt.xlabel("Sidechain Hydrogen Bond")
plt.ylabel("Density")
plt.title("Per-residue SideHBond Distribution of Extracted Fragments and Template Fragments")

plt.gca().set_yticks([0, 4, 8, 12, 16])
plt.xlim(-2.5, 0.5)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
plt.grid(False)

plt.savefig("plots/sidechain_h_bond_kde.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/sidechain_h_bond_kde.pdf", bbox_inches="tight")
plt.savefig("plots/sidechain_h_bond_kde.svg", bbox_inches="tight")
plt.show()

Burried Surface Area¶

We compute the Burried Surface Area (BSA) of Eisenberg's structures and structures from our sample.

# Burried Surface Area (one number per fragment)

import io
from collections import defaultdict
from Bio.PDB import PDBParser, PDBIO, Select

from utils.peptides import get_chains_from_stacks
from utils.peptides import ordered_stacks_split_multi_pp

class ChainSelect(Select):
    def __init__(self, chain_ids):
        self.chain_ids = set(chain_ids)
    def accept_chain(self, chain):
        return chain.id in self.chain_ids

def calculate_structure_SASA(structure):
    # Print all the chains in the structure
    # print(f"Calculating SASA for structure with chains: {[chain.id for chain in structure.get_chains()]}")
    sasa_calculator = SASA.ShrakeRupley()
    sasa_calculator.compute(structure, level='R')
    #pdb_io = PDBIO()
    #pdb_io.set_structure(structure)
    #pdb_io.save(f"temp_{structure.id}.pdb", select=ChainSelect(set([chain.id for chain in structure.get_chains()])))  # Save the structure to a temporary file
    #print(len(list(structure.get_residues())))
    return sum(residue.sasa for residue in structure.get_residues())

def get_unbound_structure(original_structure, subset_chains):
    pdb_io = PDBIO() 
    pdb_parser = PDBParser(QUIET=True)
    buf = io.StringIO()
    pdb_io.set_structure(original_structure)
    pdb_io.save(buf, select=ChainSelect(subset_chains))
    #pdb_io.save(f"unbound_{'_'.join(subset_chains)}{original_structure}.pdb", select=ChainSelect(subset_chains))
    buf.seek(0) # FIXME - using this in parallel is dangerous
    return pdb_parser.get_structure('', buf)

def calculate_BSA(structure, chain_subset_1, chain_subset_2):
    remove_heteroatoms(structure)

    # Filter to only chains of interest (both subsets)
    model = structure[0]
    combined_subset = set(chain_subset_1) | set(chain_subset_2)
    chains_to_remove = set(chain.id for chain in structure.get_chains()) - combined_subset
    #print(chain_subset_1, chain_subset_2, "END")

    for chain_ID in chains_to_remove:
        if chain_ID != ' ': # FIXME what is this?? It is present in the models chains
            try:
                model.detach_child(chain_ID)
            except Exception as e:
                raise KeyError(f"idk, {e}, {set(chain.id for chain in structure.get_chains())}, {combined_subset}")

    #print("Calculate SASA")
    # Calculate bound SASA
    #full_struct = get_unbound_structure(structure, combined_subset)
    bound_SASA = calculate_structure_SASA(structure)

    # Calculate unbound SASAs
    unbound_1 = get_unbound_structure(structure, chain_subset_1)
    unbound_SASA_1 = calculate_structure_SASA(unbound_1)

    unbound_2 = get_unbound_structure(structure, chain_subset_2)
    unbound_SASA_2 = calculate_structure_SASA(unbound_2)

    # Total unbound SASA (sum of both)
    # print(f"Unbound SASA 1: {unbound_SASA_1} + Unbound SASA 2: {unbound_SASA_2}")
    unbound_SASA_total = unbound_SASA_1 + unbound_SASA_2

    # Buried surface area
    # print(f"Unbound SASA Total: {unbound_SASA_total} - Bound SASA: {bound_SASA}")
    buried_surface_area = unbound_SASA_total - bound_SASA

    return unbound_SASA_total, bound_SASA, buried_surface_area

def process_pdb_BSA(pdb_file, directory):
    pdb_parser = PDBParser(QUIET=True)
    try:
        print(f"Processing {pdb_file} for BSA calculation...")
        structure = pdb_parser.get_structure(pdb_file, f"{directory}/{pdb_file}")
    except Exception as e:
        print(f"Failed to parse {pdb_file}: {e}")
        return pdb_file, None, None, None

    #print(structure.id, "Chains:", [chain.id for chain in structure.get_chains()])
    ordered_stacks, peptides, split_struct = ordered_stacks_split_multi_pp(structure, expected_num_of_stacks=2, max_attempts=10)

    #print(split_struct.id, "Chains:", [chain.id for chain in split_struct.get_chains()])
    stack_chains = get_chains_from_stacks(peptides, ordered_stacks)

    if len(stack_chains) != 2:
        print(f"Expected different number of stacks, got {len(stack_chains)}, {ordered_stacks}")
        print(f"Skipping {pdb_file}")
        return pdb_file, None, None, None

    unbound_SASA_total, bound_SASA, buried_surface_area = calculate_BSA(split_struct, stack_chains[0], stack_chains[1])
    return pdb_file, buried_surface_area, bound_SASA, unbound_SASA_total



def get_BSA_from_df(fragment_df, directory="./fragment_pairs", max_workers=8):
    fragment_df = fragment_df.copy()
    pdb_files = fragment_df['PDB_File'].unique()

    updates = []

    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_pdb_BSA, pdb_file, directory): pdb_file for pdb_file in pdb_files}
        for future in as_completed(futures):
            pdb_file, buried_surface_area, bound_SASA, unbound_SASA_total = future.result()
            if buried_surface_area is not None:
                updates.append({
                    'PDB_File': pdb_file,
                    'BSA': buried_surface_area,
                    'Bound_SASA': bound_SASA,
                    'Unbound_SASA': unbound_SASA_total
                })

    # Create updates DataFrame
    updates_df = pd.DataFrame(updates)
    
    # Merge in new values
    fragment_df = fragment_df.merge(updates_df, on='PDB_File', how='left')

    return fragment_df

# Get BSA for the template dataset
if os.path.exists("./results/templates_with_bsa.csv"):
    templates_with_BSA = pd.read_csv("./results/templates_with_bsa.csv")
else:
    templates_with_BSA = get_BSA_from_df(template_df, directory="./templates")
    templates_with_BSA.to_csv("./results/templates_with_bsa.csv", index=False)

templates_with_BSA.head()

	PDB_File	Total_Energy	Backbone_HBond	Sidechain_HBond	Van_der_Waals	Electrostatics	Solvation_Polar	Solvation_Hydrophobic	BSA	Bound_SASA	Unbound_SASA
0	1yjo.pdb	-27.20710	-41.6648	-70.69830	-80.8205	-7.77834	164.9760	-84.2058	1430.331422	3589.657794	5019.989216
1	1yjp_1.pdb	6.20392	-45.4826	-70.98320	-65.0946	6.56874	113.7480	-63.5063	1472.207771	3359.014188	4831.221959
2	1yjp_2.pdb	-14.26580	-50.0210	-72.01630	-80.3370	2.57071	126.6720	-83.8945	1433.401072	3552.675946	4986.077018
3	2kib_1.pdb	16.43380	-44.0706	0.00000	-43.1552	-3.11863	68.1515	-59.0119	1073.027307	4801.665177	5874.692484
4	2kib_2.pdb	17.88910	-33.9384	-1.84035	-41.7314	1.71712	55.3471	-57.0966	929.267625	4687.238910	5616.506535

Some of the template structures failed the calculation (have NaNs). This is mostly due to their weird shape and therefore incorrect identification of stacks (sides of the hexpeptide fragment pairs) with our algorithm.

# Mostly weird fragments which our alg doesn't know how to handle
# only 2y3l_3 and 5e5z should be computable
# (with looser threshold - but it is just 2 sturcts)
templates_with_BSA[templates_with_BSA['BSA'].isna()]

	PDB_File	Total_Energy	Backbone_HBond	Sidechain_HBond	Van_der_Waals	Electrostatics	Solvation_Polar	Solvation_Hydrophobic	BSA	Bound_SASA	Unbound_SASA
30	2y3l_3.pdb	38.711100	-23.52490	0.00000	-27.1616	-0.103731	42.3114	-36.6882	NaN	NaN	NaN
111	4uby.pdb	-0.626345	-31.55220	0.00000	-48.7639	-3.787230	67.9743	-68.0499	NaN	NaN	NaN
118	4w5p_2.pdb	-8.510380	-37.39760	-6.91971	-48.4529	-3.225380	63.8204	-66.1297	NaN	NaN	NaN
141	5e5v_2.pdb	7.228250	-30.64120	-5.05200	-51.7335	2.772190	68.4774	-72.0727	NaN	NaN	NaN
143	5e5z.pdb	23.775900	-32.34680	-19.45090	-44.6157	-1.399470	77.1989	-52.8902	NaN	NaN	NaN
175	6cfh_1.pdb	57.066000	-13.72370	-3.42594	-22.1422	5.805420	38.5544	-29.6550	NaN	NaN	NaN
176	6cfh_2.pdb	15.489300	-33.83700	0.00000	-50.5679	2.129130	66.7553	-72.3968	NaN	NaN	NaN
179	6cfh_5.pdb	13.991000	-22.49910	-1.86111	-39.6010	1.458860	42.7467	-58.5108	NaN	NaN	NaN
180	6cfh_6.pdb	55.837500	-3.76357	-7.22857	-11.8741	3.006000	21.9272	-15.7097	NaN	NaN	NaN

Simple summary of the template's BSA values:

desc = templates_with_BSA['BSA'].describe()
print(desc.apply(lambda x: f"{x:.6f}"))

count     172.000000
mean     1055.313185
std       448.291994
min        -0.000000
25%       841.395469
50%      1118.878224
75%      1394.837274
max      1954.759795
Name: BSA, dtype: object

# BSA is an approximation - these values are very close to zero (and should be zero, or little above)
print(f"{len(templates_with_BSA.loc[(templates_with_BSA['BSA'] < 0)])} template pairs are below 0.")
print(f"{len(templates_with_BSA[(templates_with_BSA['BSA'] < 0) & (templates_with_BSA['BSA'] > -1)])} are approx errors. Removing them..")

# Set approximation errors (negative BSA values) to zero
templates_with_BSA.loc[(templates_with_BSA['BSA'] < 0) & (templates_with_BSA['BSA'] > -1), 'BSA'] = 0

4 template pairs are below 0.
4 are approx errors. Removing them..

Calculate BSA of our sample.

# Get BSA for the repaired sample dataset
if os.path.exists("./results/sample_with_bsa.csv"):
    sample_with_BSA = pd.read_csv("./results/sample_with_bsa.csv")
else:
    sample_with_BSA = get_BSA_from_df(sample_repaired, "./fragment_pairs")
    sample_with_BSA.to_csv("./results/sample_with_bsa.csv", index=False)

sample_with_BSA.head()

	PDB_File	Total_Energy	Backbone_HBond	Sidechain_HBond	Van_der_Waals	Electrostatics	Solvation_Polar	Solvation_Hydrophobic	BSA	Bound_SASA	Unbound_SASA
0	7qkj_pair_765_Repair.pdb	18.45160	-30.7581	-8.22142	-42.7342	3.79509	61.2674	-53.1292	1075.330095	3265.915691	4341.245786
1	8olq_pair_012_Repair.pdb	50.32270	-36.4475	-31.34510	-59.5809	11.08980	101.5600	-67.5650	926.019646	3922.125508	4848.145153
2	8ci8_pair_225_Repair.pdb	12.05720	-36.2584	-6.14683	-67.4686	4.42839	89.9917	-93.9525	1567.003055	3808.659626	5375.662681
3	7nrq_pair_279_Repair.pdb	49.53490	-31.8512	-9.81061	-45.1977	3.33110	86.0854	-51.5352	1483.879077	3417.488690	4901.367767
4	7qkk_pair_121_Repair.pdb	8.61727	-46.0144	-19.36770	-57.6874	6.74647	89.7038	-71.9427	1072.009079	4046.350113	5118.359192

print(f"{len(sample_with_BSA[sample_with_BSA['BSA'].isna()])}/{len(sample_repaired)} have failed during the BSA calculation.")
sample_with_BSA[sample_with_BSA['BSA'].isna()]

0/12304 have failed during the BSA calculation.

	PDB_File	Total_Energy	Backbone_HBond	Sidechain_HBond	Van_der_Waals	Electrostatics	Solvation_Polar	Solvation_Hydrophobic	BSA	Bound_SASA	Unbound_SASA

Simple summary of sample's BSA values:

desc = sample_with_BSA['BSA'].describe()
print(desc.apply(lambda x: f"{x:.6f}"))

count    12304.000000
mean      1195.089564
std        209.751344
min        499.762813
25%       1035.606377
50%       1185.562046
75%       1335.289928
max       2297.255447
Name: BSA, dtype: object

Comparison of BSA across datasets:

plt.figure(figsize=(10,6))

# Merge
templates_with_BSA['Category'] = "Template Fragments"
sample_with_BSA['Category'] = "Sampled Fragments"
combined_data = pd.concat([templates_with_BSA, sample_with_BSA], ignore_index=True)

# Plot a comparison of Burried Surface Accessibility
sns.kdeplot(
    data=combined_data, x="BSA",
    hue="Category",
    cut=0,
    fill=True, common_norm=False,
    palette={"Template Fragments": "mediumorchid", "Sampled Fragments": "cornflowerblue"},
    alpha=0.5
)
plt.xlabel("Burried Surface Accessibility")
plt.ylabel("Density")
plt.title(f"Burried Surface Accessibility of Extracted Fragments and Template Fragments\nTemplate n={len(templates_with_BSA)}, Sampled n={len(sample_with_BSA)}")

plt.gca().set_yticks([0, 0.001, 0.002])
#plt.xlim(-2.5, 0.5)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
plt.grid(False)

plt.savefig("plots/bsa_kde.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/bsa_kde.pdf", bbox_inches="tight")
plt.savefig("plots/bsa_kde.svg", bbox_inches="tight")
plt.show()

Packing/Stacking Energies¶

We compare the dataset's packing (energy between the two stacks of each structure) and stacking (energy between the first two and last three layers in each stack) energies.

# Load results from Analyse complex for templates
ac_templates = pd.read_csv("./results/foldx_analyse_complex_results_templates.csv")
ac_templates['ID'] = ac_templates['PDB_File'].str.split('_').apply(lambda parts: '_'.join(parts[:-2]))
ac_templates['Category'] = "Templates"
ac_templates.head()

	PDB_File	Group1	Group2	IntraclashesGroup1	IntraclashesGroup2	Interaction_Energy	BackHBond	SideHBond	VDWClashes	ID	Category
0	3q2x_b_JH_FDB.pdb	HJ	BDF	0.044701	0.095801	-2.01238	-6.071510	-0.789906	0.051573	3q2x_b	Templates
1	2m5n_3_JHFDB_IGECA.pdb	BDFHJ	ACEGI	5.677250	4.744780	-22.21540	-0.450673	0.000000	2.607290	2m5n_3	Templates
2	3fr1_AC_EGI.pdb	CA	GIE	0.110695	0.405026	-7.39247	-6.188010	-0.867155	0.028225	3fr1	Templates
3	2ona_BD_FHJ.pdb	BD	FHJ	0.204879	0.496664	-3.11506	-5.760990	0.000000	0.022129	2ona	Templates
4	3dgj_1_BD_FHJ.pdb	DB	HJF	0.913331	0.249727	-4.73258	-4.370030	-1.285720	0.060052	3dgj_1	Templates

# Load results from Analyse complex for repaired fragments
ac_extracted = pd.read_csv("./results/foldx_analyse_complex_results_fragment_pairs.csv") # FIXME
ac_extracted['ID'] = ac_extracted['PDB_File'].str[:-4]
ac_extracted['Category'] = "Extracted"
## Replace original files of standardized files
ac_extracted = filter_non_standardized(ac_extracted)
ac_extracted.head()

	PDB_File	Group1	Group2	IntraclashesGroup1	IntraclashesGroup2	Interaction_Energy	BackHBond	SideHBond	VDWClashes	ID	Category
0	6ufr_pair_196_Repair.pdb	CGFJB	ADEHI	0.450845	0.054773	-21.882800	-0.410710	0.000000e+00	0.002200	6ufr_pair_196_Repair	Extracted
1	7e0f_pair_124_Repair.pdb	DEFGH	ABCIJ	4.111020	4.127980	-14.520100	-1.207550	0.000000e+00	1.108400	7e0f_pair_124_Repair	Extracted
2	6xfm_pair_135_Repair.pdb	1A	234	0.026684	0.028753	-0.407182	-4.296730	-2.064610e+00	0.000000	6xfm_pair_135_Repair	Extracted
3	6xyq_pair_344_Repair.pdb	HJ	BDF	0.307053	0.433553	-4.612390	-4.757160	0.000000e+00	0.017628	6xyq_pair_344_Repair	Extracted
4	8q9g_pair_232_Repair.pdb	ACEIJ	BDFGH	1.864520	0.124207	-17.891600	-0.508668	1.776360e-15	0.001564	8q9g_pair_232_Repair	Extracted

# Filter based on distance (take only what's in repaired_df)
ac_extracted = ac_extracted[ac_extracted["PDB_File"].isin(set(repaired_df["PDB_File"]))]
print(f"After filtering based on length, we have ~{len(ac_extracted)//3} structures ({len(ac_extracted)} rows)")

After filtering based on length, we have ~61940 structures (185820 rows)

combined_data = pd.concat([ac_templates, ac_extracted], ignore_index=True)

combined_data['Interaction Type'] = "Unknown"
combined_data.loc[combined_data['Group1'].str.len() == combined_data['Group2'].str.len(), 'Interaction Type'] = "Packing"
combined_data.loc[combined_data['Group1'].str.len() != combined_data['Group2'].str.len(), 'Interaction Type'] = "Stacking"
combined_data.head(10)

	PDB_File	Group1	Group2	IntraclashesGroup1	IntraclashesGroup2	Interaction_Energy	BackHBond	SideHBond	VDWClashes	ID	Category	Interaction Type
0	3q2x_b_JH_FDB.pdb	HJ	BDF	0.044701	0.095801	-2.01238	-6.071510	-0.789906	0.051573	3q2x_b	Templates	Stacking
1	2m5n_3_JHFDB_IGECA.pdb	BDFHJ	ACEGI	5.677250	4.744780	-22.21540	-0.450673	0.000000	2.607290	2m5n_3	Templates	Packing
2	3fr1_AC_EGI.pdb	CA	GIE	0.110695	0.405026	-7.39247	-6.188010	-0.867155	0.028225	3fr1	Templates	Stacking
3	2ona_BD_FHJ.pdb	BD	FHJ	0.204879	0.496664	-3.11506	-5.760990	0.000000	0.022129	2ona	Templates	Stacking
4	3dgj_1_BD_FHJ.pdb	DB	HJF	0.913331	0.249727	-4.73258	-4.370030	-1.285720	0.060052	3dgj_1	Templates	Stacking
5	4r0w_b_1_IG_ECA.pdb	GI	CAE	0.991309	0.947355	-3.19053	-6.102070	0.000000	0.008595	4r0w_b_1	Templates	Stacking
6	4xfo_JH_FDB.pdb	HJ	BDF	0.094704	0.382677	-5.00158	-6.254170	-1.028470	0.009694	4xfo	Templates	Stacking
7	4ril_a_3_IG_ECA.pdb	GI	CAE	0.752904	0.750379	-2.34154	-5.993340	-0.522533	0.000000	4ril_a_3	Templates	Stacking
8	6cfh_4_ACEGI_BDFHJ.pdb	CAGEI	BDHFJ	0.233039	0.345059	-28.30400	-1.715980	0.000000	0.290562	6cfh_4	Templates	Packing
9	3fr1_BD_FHJ.pdb	DB	HJF	0.127741	0.459395	-7.71799	-5.972650	-0.597861	0.118170	3fr1	Templates	Stacking

# Sanity check
print(combined_data['Group1'].str.len().value_counts())
print(combined_data['Group2'].str.len().value_counts())

Group1
2    124224
5     62112
Name: count, dtype: int64
Group2
3    124224
5     62112
Name: count, dtype: int64

plt.figure(figsize=(10,6))
K = combined_data[combined_data['Interaction Type'] == "Packing"]

# Plot a comparison of Packing Energies between Templates and Extracted
sns.kdeplot(
    data=K, x="Interaction_Energy",
    hue="Category",
    fill=True, common_norm=False,
    palette={"Templates": "mediumorchid", "Extracted": "cornflowerblue"},
    alpha=0.5
)
plt.xlabel("Packing Interaction Energy")
plt.ylabel("Density")
plt.title(f"Packing Energy Interaction Comparison\nTemplates n={len(K[K['Category']=='Templates'])}, Extracted n={len(K[K['Category']=='Extracted'])}")

plt.axvline(0, color="lightgray", linewidth=2, linestyle="--")
plt.xlim(-50, 50)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
#plt.gca().set_yticks([0, 0.1, 0.2, 0.3, 0.4])
plt.grid(False)

plt.savefig("plots/packing_energy_kde.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/packing_energy_kde.pdf", bbox_inches="tight")
plt.savefig("plots/packing_energy_kde.svg", bbox_inches="tight")
plt.show()

plt.figure(figsize=(10,6))
K = combined_data[combined_data['Interaction Type'] == "Stacking"]

# Plot a comparison of Packing Energies between Templates and Extracted
sns.kdeplot(
    data=K, x="Interaction_Energy",
    hue="Category",
    fill=True, common_norm=False,
    palette={"Templates": "mediumorchid", "Extracted": "cornflowerblue"},
    alpha=0.5
)
plt.xlabel("Stacking Interaction Energy")
plt.ylabel("Density")
plt.title(f"Stacking Energy Interaction Comparison\nTemplates n={len(K[K['Category']=='Templates'])}, Extracted n={len(K[K['Category']=='Extracted'])}")

plt.axvline(0, color="lightgray", linewidth=2, linestyle="--")
plt.xlim(-12.5, 12.5)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
plt.gca().set_yticks([0, 0.05, 0.1, 0.15, 0.2])
plt.grid(False)

plt.savefig("plots/stacking_energy_kde.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/stacking_energy_kde.pdf", bbox_inches="tight")
plt.savefig("plots/stacking_energy_kde.svg", bbox_inches="tight")
plt.show()

Let's see if there's any correlation between packing and stacking energy.

combined_data.head()
K = combined_data.copy()

packing = (
    K[K["Interaction Type"].str.lower() == "packing"]
      .loc[:, ["ID", "Interaction_Energy", "Category"]]
      .rename(columns={"Interaction_Energy": "Packing_Energy"})
)

stacking = (
    K[K["Interaction Type"].str.lower() == "stacking"]
      .loc[:, ["ID", "Interaction_Energy"]]
      .rename(columns={"Interaction_Energy": "Stacking_Energy"})
)

merged = stacking.merge(packing, on="ID", how="inner")
merged.head()

	ID	Stacking_Energy	Packing_Energy	Category
0	3q2x_b	-2.01238	-31.501000	Templates
1	3fr1	-7.39247	-21.395700	Templates
2	2ona	-3.11506	-15.931600	Templates
3	3dgj_1	-4.73258	-22.033400	Templates
4	4r0w_b_1	-3.19053	-0.840478	Templates

plt.figure(figsize=(10,6))
K = merged[merged['Category'] == "Templates"]

# Plot the correlation between stacking and packing energy
sns.scatterplot(
    data=K,
    x="Packing_Energy",
    y="Stacking_Energy",
    hue="Category",
    # fill=True, common_norm=False,
    palette={"Templates": "mediumorchid", "Extracted": "cornflowerblue"},
    alpha=0.5
)
plt.xlabel("Packing")
plt.ylabel("Stacking")
plt.title(f"Packing Interaction Energy vs Stacking Interaction Energy")
corr = np.corrcoef(K['Packing_Energy'], K['Stacking_Energy'])[0, 1]
print("Pearson correlation:", corr)

#plt.axvline(0, color="lightgray", linewidth=2, linestyle="--")
plt.xlim(-50, 50)
plt.ylim(-15, 15)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
#plt.gca().set_yticks([0, 0.05, 0.1, 0.15, 0.2])
plt.grid(False)

plt.savefig("plots/packing_stacking_templates_scatter.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/packing_stacking_templates_scatter.pdf", bbox_inches="tight")
plt.savefig("plots/packing_stacking_templates_scatter.svg", bbox_inches="tight")
plt.show()

Pearson correlation: 0.1850314465829147

plt.figure(figsize=(10,6))
K = merged[merged['Category'] == "Extracted"]

# Plot the correlation between stacking and packing energy
sns.scatterplot(
    data=K,
    x="Packing_Energy",
    y="Stacking_Energy",
    hue="Category",
    # fill=True, common_norm=False,
    palette={"Templates": "mediumorchid", "Extracted": "cornflowerblue"},
    s=10,
    alpha=0.1
)
plt.xlabel("Packing")
plt.ylabel("Stacking")
plt.title(f"Packing Interaction Energy vs Stacking Interaction Energy")
corr = np.corrcoef(K['Packing_Energy'], K['Stacking_Energy'])[0, 1]
print("Pearson correlation:", corr)

#plt.axvline(0, color="lightgray", linewidth=2, linestyle="--")
plt.xlim(-50, 50)
plt.ylim(-15, 15)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
#plt.gca().set_yticks([0, 0.05, 0.1, 0.15, 0.2])
plt.grid(False)

plt.savefig("plots/packing_stacking_extracted_scatter.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/packing_stacking_extracted_scatter.pdf", bbox_inches="tight")
plt.savefig("plots/packing_stacking_extracted_scatter.svg", bbox_inches="tight")
plt.show()

Pearson correlation: 0.3159500790333334

Shape diversity¶

To compare the shapes from both datasets and compare them, we have computed two RMSD matrices and clustered in a hierarchical fashion (two-level clustering).

First level: Coordinates of the hexapeptide pairs were projected onto a plane (from 3D to 2D), from these, an RMSD matrix was computed using a QCP superimposer. This matrix was than used as a basis for clustering with a dendrogram (using a ward distance metric).

Second level: Full 3D coordinates were used to create yet another RMSD matrix (with modified QCP superimposer which, after superimposition of the whole structures, took the maximum of the subset's RMSDs of the two stacks). The clustering pipeline was as follows: take all the structures which belong to the same cluster in the First level, take the rows corresponding to them in Second level RMSD matrix and cluster them (again using the dendrogram with the ward distance metric).

In the case of 2D structures, we assess the cluster's shape validity by viewing the superimposed projected coordinates (in the first level). For the 3D case, we superimpose all the structures to its cluster center - we export this cluster in a .cif file, we plot the 2D shapes as well - by averaging across layers and then projecting onto a plane by taking the PCA components.

import warnings

warnings.filterwarnings("ignore", message="'force_all_finite' was renamed to 'ensure_all_finite'", category=FutureWarning)
warnings.filterwarnings("ignore", message="n_jobs value .* overridden to .* by setting random_state", category=UserWarning)
warnings.filterwarnings("ignore", message="'force_all_finite' was renamed to 'ensure_all_finite'", category=FutureWarning, module=r".*sklearn.*")
warnings.filterwarnings("ignore", message=r"n_jobs value .* overridden to .* by setting random_state", category=UserWarning, module=r".*umap.*")

def print_pcoa_explained_var(pcoa_results, num_components=2):
    # Extract the raw eigenvalues, taking only the first 'num_components'
    # pcoa_results.eigvals is a pandas Series, direct slicing works
    raw_eigenvalues = pcoa_results.eigvals[:num_components]
    explained_variance_ratio = pcoa_results.proportion_explained[:num_components]
    cumulative_explained_variance = explained_variance_ratio.cumsum()

    # You now have the other metrics stored in these variables for the top 10 components:
    print("\nRaw Eigenvalues:")
    print(raw_eigenvalues)
    print("\nExplained Variance Ratio:")
    print(explained_variance_ratio)
    print("\nCumulative Explained Variance:")
    print(cumulative_explained_variance)

# Shape analysis
def plot_pca_and_umap(pca, umap, hue, palette):
    # Create wide figure with two subplots
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))  # 1 row, 2 columns
    # Get value counts
    hue_series = pd.Series(hue).astype(str)
    category_counts = hue_series.value_counts()
    hue_labels = {label: f"{label} (n={category_counts.get(label, 0)})" for label in category_counts.index}
    print(hue_labels)
    num_of_templates = category_counts.get(category_counts.index[1], 0)
    print(f"Number of templates used {num_of_templates}")
    hue_group1 = hue[:num_of_templates]
    hue_group2 = hue[num_of_templates:]

    # Plot PCA
    try:
        sns.scatterplot(x=pca[:, 0], y=pca[:, 1], hue=hue, palette=palette, s=20, ax=axes[0])
    except:
        print("Error, trying to plot a df")
        sns.scatterplot(data=pca[num_of_templates:], x='PCoA_1', y='PCoA_2', hue=hue_group2, palette=palette,s=20, ax=axes[0], alpha=0.3)
        sns.scatterplot(data=pca[:num_of_templates], x='PCoA_1', y='PCoA_2', hue=hue_group1, palette=palette,s=20, ax=axes[0], alpha=0.9)

    axes[0].set_title("PCoA of Reconstructed Fragments")
    axes[0].set_xlabel("PCoA 1")
    axes[0].set_ylabel("PCoA 2")
    handles, labels = axes[0].get_legend_handles_labels()
    new_labels = [hue_labels.get(label, label) for label in labels]
    axes[0].legend(handles, new_labels, title="Category")
    axes[0].spines['top'].set_visible(False)
    axes[0].spines['right'].set_visible(False)
    axes[0].spines['left'].set_color('dimgray')
    axes[0].spines['bottom'].set_color('dimgray')
    axes[0].grid(False)

    # Plot UMAP
    # Split the UMAP coordinates
    umap_group1 = umap[:num_of_templates]
    umap_group2 = umap[num_of_templates:]

    # Plot Group 2 (overrepresented) first
    sns.scatterplot(x=umap_group2[:, 0], y=umap_group2[:, 1], hue=hue_group2,
                    palette=palette, s=20, ax=axes[1], alpha=0.3)
    sns.scatterplot(x=umap_group1[:, 0], y=umap_group1[:, 1], hue=hue_group1,
                    palette=palette, s=20, ax=axes[1], alpha=0.9)

    # Adjust labels and legend
    axes[1].set_title("UMAP of Reconstructed Fragments")
    axes[1].set_xlabel("UMAP 1")
    axes[1].set_ylabel("UMAP 2")
    handles, labels = axes[1].get_legend_handles_labels()
    new_labels = [hue_labels.get(label, label) for label in labels]
    axes[1].legend(handles, new_labels, title="Category")
    axes[1].spines['top'].set_visible(False)
    axes[1].spines['right'].set_visible(False)
    axes[1].spines['left'].set_color('dimgray')
    axes[1].spines['bottom'].set_color('dimgray')
    axes[1].grid(False)
    # Adjust layout
    plt.tight_layout()
    plt.show()

def num_components_explaining(percent_of_variance, pcoa_results):
    explained_variance_sum = 0
    components_explaining = 0

    for i in range(0, pcoa_results.samples.shape[1]):
        explained_variance_sum += pcoa_results.proportion_explained.iloc[i]
        if explained_variance_sum >= 0.9:
            components_explaining = i + 1
            print(f"{percent_of_variance}% of variance is explained by the first {i+1} components")
            return components_explaining

def plot_umap(num_templates, pdb_files, umap_emb, save_path="plots/rmsd_umap"):
    umapca_df = pd.DataFrame({
    'PDB_File': pdb_files,
    'UMAP_1': umap_emb[:, 0],
    'UMAP_2': umap_emb[:, 1]
    })

    fig = plt.figure(figsize=(10, 6))
    sampled_K = umapca_df.iloc[num_templates:]
    template_K = umapca_df.iloc[:num_templates]
    n_sampled =  len(sampled_K) - sampled_K['UMAP_1'].isnull().sum()
    n_template = len(template_K) - template_K['UMAP_1'].isnull().sum()

    sns.scatterplot(data=sampled_K, x='UMAP_1', y='UMAP_2',
                    color='cornflowerblue', s=20, alpha=0.3, label=f'Sampled Fragments (n={n_sampled})')
    sns.scatterplot(data=template_K, x='UMAP_1', y='UMAP_2',
                    color='mediumorchid', s=20, alpha=0.9, label=f'Template Fragments (n={n_template})')

    plt.title('UMAP Projection of All-to-All RMSD Matrix')
    plt.xlabel('UMAP 1')
    plt.ylabel('UMAP 2')
    plt.legend(title='Fragment Type')
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['left'].set_color('dimgray')
    plt.gca().spines['bottom'].set_color('dimgray')
    plt.grid(False)

    # plt.xlim(-0,5, 5.5)
    # plt.gca().set_yticks([0, 0.3, 0.6])

    plt.tight_layout()

    plt.savefig(f"{save_path}.png", dpi=600, bbox_inches="tight")
    plt.savefig(f"{save_path}.pdf", bbox_inches="tight")
    plt.savefig(f"{save_path}.svg", bbox_inches="tight")
    plt.show()

def plot_dendrogram(Z, labels, num_clusters, t_distance, save_path="plots/rmsd_dendrogram"):
    fig = plt.figure(figsize=(20, 8))
    ax = fig.add_subplot(111)

    sch.dendrogram(
        Z,
        labels=labels,
        leaf_rotation=90,
        leaf_font_size=8,
        #truncate_mode='lastp',
        p=num_clusters,
        show_contracted=True,
        color_threshold=t_distance,
        above_threshold_color='black',
        ax=ax
    )
    ax.grid(False)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.gca().spines['left'].set_color('dimgray')
    ax.spines['bottom'].set_visible(False)
    ax.set_xticklabels([]) # [str(label) for label in range(1, num_clusters + 1)]

    plt.title('Hierarchical Clustering Dendrogram with Clusters Highlighted')
    plt.xlabel('')
    plt.ylabel('Distance (Ward)')
    plt.tight_layout() # Adjust layout to prevent labels from overlapping

    plt.savefig(f"{save_path}.png", dpi=600, bbox_inches="tight")
    plt.savefig(f"{save_path}.pdf", bbox_inches="tight")
    plt.savefig(f"{save_path}.svg", bbox_inches="tight")
    plt.show()

def load_features(path):
    # Load precomputed features
    data = np.load(path, allow_pickle=True)
    return data["X"], data["files"]

def load_matrix(path):
    print("Loading an RMSD matrix...")
    X, pdb_files = load_features(path)
    print(f"Loaded. Matrix has {X.shape} shape.")
    print(f"With corresponding number of files {len(pdb_files)}")

    # Get the NaN values counts
    nan_count_per_row = np.isnan(X).sum(axis=1)
    nan_count_counts = pd.Series(nan_count_per_row).value_counts()
    print(nan_count_counts)

    # Find indices of files which failed
    nan_count_to_remove = max(nan_count_per_row)
    if nan_count_to_remove > 0:
        indices_to_remove = [idx for idx, count in enumerate(nan_count_per_row) if count == nan_count_to_remove]
        print(f"Files to remove due to high NaN count ({nan_count_to_remove} nans)")
    else:
        indices_to_remove = []

    # Remove these rows and columns
    print(f"Removing {len(indices_to_remove)} files with {nan_count_to_remove} NaN values in their respective row.")
    keep_indices = [i for i in range(len(pdb_files)) if i not in indices_to_remove]
    X = X[np.ix_(keep_indices, keep_indices)]
    print(f"Resulting matrix shape is {X.shape}")

    # Update the pdb_files list
    pdb_files = [pdb_files[i] for i in range(len(pdb_files)) if i not in indices_to_remove]
    print(f"Adjusted number of files {len(pdb_files)}")

    return X, pdb_files

def reduce_matrix(X, files, max_samples=5_000, float_32=True):
    # Get a subset so we can handle it computationally
    if X.shape[0] > max_samples:
        print(f"Reducing matrix size to the first {max_samples} samples")
        selected = np.arange(max_samples)
        X = X[np.ix_(selected, selected)]
        files = files[:max_samples]

    if float_32:
        print("Converting to float32")
        X = X.astype(np.float32) # and normalize if necessary

    return X, files

def filter_matrix(X, files, wanted_files):
    filter_indices = []
    not_found_files = []
    for filtered_file in wanted_files:
        try:
            filter_indices.append(files.index(filtered_file))
        except ValueError:
            not_found_files.append(filtered_file)
    print(f"{len(not_found_files)} were not found in the pdb_files_projected matrix")
    print(np.array(filter_indices).shape)

    X = X[np.ix_(filter_indices, filter_indices)]
    files = np.asarray(files)[filter_indices].tolist()

    return X, files

# Load and reduce projected coordinates "2D-RMSD" matrix
rmsd_X_projected, pdb_files_projected = load_matrix("rmsd_matrix_projected.npz")
# Filter on our new condition
helper_templates = pd.DataFrame([{"PDB_File": file} for file in pdb_files_projected[:170]])
temporary = pd.concat([helper_templates, filter_helper])
temporary.reset_index()
temporary
#rmsd_X_projected, pdb_files_projected = reduce_matrix(rmsd_X_projected, pdb_files_projected)
rmsd_X_projected, pdb_files_projected = filter_matrix(rmsd_X_projected, pdb_files_projected, temporary["PDB_File"].tolist())

Loading an RMSD matrix...
Loaded. Matrix has (35969, 35969) shape.
With corresponding number of files 35969
0    35969
Name: count, dtype: int64
Removing 0 files with 0 NaN values in their respective row.
Resulting matrix shape is (35969, 35969)
Adjusted number of files 35969
16 were not found in the pdb_files_projected matrix
(12458,)

plt.figure(figsize=(5,3))

sns.kdeplot(
    rmsd_X_projected[np.triu_indices_from(rmsd_X_projected, k=1)].flatten(),
    color = "midnightblue",
    linewidth=2,
)

plt.xlim(-0.5, 5.5)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
plt.gca().set_yticks([0, 0.3, 0.6])
plt.gca().set_xticks([0, 1, 2, 3, 4, 5])
plt.grid(False)

plt.savefig("plots/rmsd_projected_kde.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/rmsd_projected_kde.pdf", bbox_inches="tight")
plt.savefig("plots/rmsd_projected_kde.svg", bbox_inches="tight")
plt.show()

# Load and reduce "3D-RMSD" matrix with max_stack superimposer
rmsd_X_max, pdb_files_max = load_matrix("rmsd_matrix_max_stack.npz")
# rmsd_X_max, pdb_files_max = reduce_matrix(rmsd_X_max, pdb_files_max)
pdb_files_max = [file[17:] for file in pdb_files_max]

rmsd_X_max, pdb_files_max = filter_matrix(rmsd_X_max, pdb_files_max, temporary["PDB_File"].tolist())

Loading an RMSD matrix...
Loaded. Matrix has (36047, 36047) shape.
With corresponding number of files 36047
78       35969
36046       78
Name: count, dtype: int64
Files to remove due to high NaN count (36046 nans)
Removing 78 files with 36046 NaN values in their respective row.
Resulting matrix shape is (35969, 35969)
Adjusted number of files 35969
16 were not found in the pdb_files_projected matrix
(12458,)

plt.figure(figsize=(5,3))

sns.kdeplot(
    rmsd_X_max[np.triu_indices_from(rmsd_X_max, k=1)].flatten(),
    color = "midnightblue",
    linewidth=2,
)

plt.xlim(-0.5, 16)
plt.ylim(0, 0.4)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_color('dimgray')
plt.gca().set_yticks([0, 0.2, 0.4])
plt.gca().set_xticks([0, 5, 10, 15])
plt.grid(False)

plt.savefig("plots/rmsd_max_kde.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/rmsd_max_kde.pdf", bbox_inches="tight")
plt.savefig("plots/rmsd_max_kde.svg", bbox_inches="tight")
plt.show()

First level¶

distance_matrix = DistanceMatrix(rmsd_X_projected)

# Perform PCoA
num_top_components = 2
pcoa_results = pcoa(distance_matrix, number_of_dimensions=400)
results_df = pcoa_results.samples.iloc[:, :num_top_components].copy()
## Create column names for the DataFrame based on the number of top components
column_names = [f'PCoA_{i+1}' for i in range(num_top_components)]
results_df.columns = column_names
print_pcoa_explained_var(pcoa_results, num_top_components)

# UMAP
## Initialize UMAP with 'precomputed' metric
umap_model = umap.UMAP(n_neighbors=30, n_components=2, metric='precomputed', random_state=42, min_dist=0.05)
emb = umap_model.fit_transform(rmsd_X_projected)

num_templates_projected=170
category = np.array(
    ["Templates"] * num_templates_projected + ["Extracted"] * (len(pdb_files_projected) - num_templates_projected)
)

plot_pca_and_umap(results_df, emb, category, {"Templates": "mediumorchid", "Extracted": "cornflowerblue"})

Raw Eigenvalues:
PC1    13451.882343
PC2     6343.336759
dtype: float64

Explained Variance Ratio:
PC1    0.235486
PC2    0.111045
dtype: float64

Cumulative Explained Variance:
PC1    0.235486
PC2    0.346531
dtype: float64

{'Extracted': 'Extracted (n=12288)', 'Templates': 'Templates (n=170)'}
Number of templates used 170
Error, trying to plot a df

components_explaining = num_components_explaining(percent_of_variance=90, pcoa_results=pcoa_results)
umapca_model = umap.UMAP(n_neighbors=30, n_components=2, random_state=42, min_dist=0.1, spread=1)
umapca_emb = umapca_model.fit_transform(pcoa_results.samples.iloc[:, :components_explaining].values)

plot_umap(num_templates_projected, pdb_files_projected, umapca_emb, save_path="plots/rmsd_projected_umap")

90% of variance is explained by the first 266 components

# Perform hierarchical clustering
t_distance = 25
Z = sch.linkage(squareform(rmsd_X_projected), method='ward', metric='precomputed')

labels = sch.fcluster(Z, t=t_distance, criterion='distance')

num_clusters = len(np.unique(labels))
print(f"Found {num_clusters} number of clusters... labeled {len(labels)} structures")

custom_cluster_palette = [ # set tab10 cluster colors to match the dendrogram
'tab:orange', 'tab:green', 'tab:red',
'tab:purple', 'tab:brown', 'tab:pink',
'tab:gray', 'tab:olive', 'tab:cyan'#, 'tab:blue'
]
plot_dendrogram(Z, pdb_files_projected, num_clusters, t_distance, save_path="plots/rmsd_projected_dendrogram")

Found 21 number of clusters... labeled 12458 structures

import superpose_fragments
import projected_pairwise_rmsd_matrix
from Bio.PDB.qcprot import QCPSuperimposer

pdb_to_idx = {file: idx for idx, file in enumerate(pdb_files_projected)}

def calculate_summed_distance_within_cluster(row):
    current_pdb = row['PDB_File']
    current_cluster = row['Cluster']

    # Get all PDB files in the current cluster
    cluster_members_df = K[K['Cluster'] == current_cluster]
    cluster_member_pdbs = cluster_members_df['PDB_File'].tolist()

    # Get the index of the current and other cluster members' PDB files in the RMSD matrix
    current_pdb_idx = pdb_to_idx[current_pdb]
    cluster_member_indices = [pdb_to_idx[pdb] for pdb in cluster_member_pdbs]

    # Extract the relevant row from the RMSD matrix (distances from current PDB)
    distances_from_current_pdb = rmsd_X_projected[current_pdb_idx, :] # FIXME
    summed_distance = distances_from_current_pdb[cluster_member_indices].sum()
    # summed_distance_squared = (distances_from_current_pdb[cluster_member_indices]**2).sum() # TODO
    # if i want to know which files distort the cluster
    return summed_distance

def superpose_cluster(cluster_num, distance_matrix, projected_alignment=False, K_labeled=K):
    # Get the indices of the PDB files in the target cluster
    cluster_rows = K_labeled[K_labeled['Cluster'] == cluster_num].sort_values(by='Summed_Distances_Within_Cluster', ascending=True)
    cluster_indices = cluster_rows['PDB_File'].apply(lambda x: pdb_to_idx[x]).tolist()
    cluster_rmsds_submatrix = distance_matrix[np.ix_(cluster_indices, cluster_indices)]
    # Flatten the upper triangular part
    upper_triangle_indices = np.triu_indices_from(cluster_rmsds_submatrix, k=1) # FIXME: can be rectangular not a square
    all_pairwise_cluster_distances = cluster_rmsds_submatrix[upper_triangle_indices]
    # Get all the cluster paths and superpose
    paths = []
    for file in list(cluster_rows['PDB_File'].values):
        file = file.split('/')[-1]
        if file.split('_')[-1] == "Repair.pdb":
            paths.append(f"./fragment_pairs/{file}")
        else:
            paths.append(f"./templates/{file}")
    if not projected_alignment:
        rmsds, super_opts, aligned_coords = superpose_fragments.superpose_all(paths, output_path=f"superposed_cluster_{cluster_num}.cif")
    else:
        center_projected_coords, _ = projected_pairwise_rmsd_matrix.get_projected_coords_and_std(paths[0])
        superimposer = QCPSuperimposer()
        aligned_coords = [center_projected_coords]
        rmsds = []

        for path in paths[1:]:
            projected_coords, _ = projected_pairwise_rmsd_matrix.get_projected_coords_and_std(path)
            if projected_coords is None: # FIXME: why some templates failing
                print(f"{projected_coords} of {path}")
                continue
            
            possible_aligns = projected_pairwise_rmsd_matrix.generate_possible_alignments_test(projected_coords)

            best_rmsd = float('inf')
            best_variant = None
            for align in possible_aligns:
                superimposer.set(center_projected_coords, align)
                superimposer.run()
                rmsd = superimposer.get_rms()
                
                if rmsd < best_rmsd:
                    best_rmsd = rmsd
                    best_variant = superimposer.get_transformed()

            rmsds.append(best_rmsd)
            aligned_coords.append(best_variant)

    print(f"Number of structures in cluster {cluster_num}: {len(cluster_rows)}")
    print(f"Mean RMSD within cluster: {np.mean(all_pairwise_cluster_distances):.2f} Å")
    print(f"Standard deviation of RMSDs within cluster: {np.std(all_pairwise_cluster_distances):.2f} Å")
    print(f"Standard deviation of RMSDs superposition to centroid structure: {np.std(rmsds):.2f} Å")
    return aligned_coords

def plot_projected_cluster(cluster_num, projected_alignment=True, X=rmsd_X_projected, K_labeled=K, save_path="plots/cluster_test"):
    aligned_coords = superpose_cluster(cluster_num, X, projected_alignment=projected_alignment, K_labeled=K_labeled)
    aligned_coords = np.array(aligned_coords)
    avg_layer_coords = aligned_coords.reshape(-1, 12, 3)
    mean_coords = np.mean(avg_layer_coords, axis=0)
    std_coords = np.std(avg_layer_coords, axis=0)

    centered_mean_coords = mean_coords - np.mean(mean_coords, axis=0)
    pca = PCA(n_components=2)
    pca.fit(centered_mean_coords)
    projected_coords = pca.transform(centered_mean_coords)

    print(f"Total variance explained by best-fit plane: ~{np.sum(pca.explained_variance_ratio_):.2f}")
    print(f"Variance explained by PC1: {pca.explained_variance_ratio_[0]:.2f}")
    print(f"Variance explained by PC2: {pca.explained_variance_ratio_[1]:.2f}")

    projected_stdev_x = np.dot(std_coords, pca.components_[0]) # Project stdev along PC1
    projected_stdev_y = np.dot(std_coords, pca.components_[1]) # Project stdev along PC2
    projected_stdev_x_abs = np.abs(projected_stdev_x)
    projected_stdev_y_abs = np.abs(projected_stdev_y)

    fig_2d_pca = plt.figure(figsize=(6, 4))
    ax_2d_pca = fig_2d_pca.add_subplot(111)

    # Plot the mean trace on the best-fit planeplot_projected_cluster
    ax_2d_pca.plot(projected_coords[:6, 0], 
                    projected_coords[:6, 1], 
                    'o-', label='Mean Trace (Hexapeptide 1)', color='black', linewidth=10, markersize=6)
    ax_2d_pca.plot(projected_coords[6:, 0], 
                    projected_coords[6:, 1], 
                    'o-', label='Mean Trace (Hexapeptide 2)', color='black', linewidth=10, markersize=6)
    # Add "blurriness" as error bars
    # Use projected_stdev_x_abs and projected_stdev_y_abs for error bars along the new axes
    ax_2d_pca.errorbar(projected_coords[:, 0], projected_coords[:, 1],
                       xerr=projected_stdev_x_abs, yerr=projected_stdev_y_abs,
                       fmt='none', capsize=3, color='grey', alpha=0.5, zorder=0)

    ax_2d_pca.axes.get_xaxis().set_visible(False)
    ax_2d_pca.axes.get_yaxis().set_visible(False)
    # remove the box
    for spine in ax_2d_pca.spines.values():
        spine.set_visible(False)
    ax_2d_pca.grid(False)
    ax_2d_pca.set_aspect('equal', adjustable='box') # Keep aspect ratio for true shape
    plt.tight_layout()

    plt.savefig(f"{save_path}_{cluster_num}.png", dpi=300, bbox_inches="tight")
    plt.savefig(f"{save_path}_{cluster_num}.pdf", bbox_inches="tight")
    plt.savefig(f"{save_path}_{cluster_num}.svg", bbox_inches="tight")
    plt.show()

K = pd.DataFrame({
    'PDB_File': pdb_files_projected,
    'Cluster': labels,
    'Category': category
})

# Apply the function to each row
K['Summed_Distances_Within_Cluster'] = K.apply(calculate_summed_distance_within_cluster, axis=1)

cluster_num = 11
# superpose_cluster(cluster_num, rmsd_X_projected)
plot_projected_cluster(cluster_num, True, rmsd_X_projected, K)

Number of structures in cluster 11: 309
Mean RMSD within cluster: 2.67 Å
Standard deviation of RMSDs within cluster: 0.59 Å
Standard deviation of RMSDs superposition to centroid structure: 0.61 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.84
Variance explained by PC2: 0.16

for i in range(max(labels)):
    plot_projected_cluster(i+1, True, rmsd_X_projected, K, save_path="plots/projected_clusters/projected_cluster")

Number of structures in cluster 1: 370
Mean RMSD within cluster: 1.96 Å
Standard deviation of RMSDs within cluster: 0.63 Å
Standard deviation of RMSDs superposition to centroid structure: 0.56 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.98
Variance explained by PC2: 0.02

Number of structures in cluster 2: 612
Mean RMSD within cluster: 1.66 Å
Standard deviation of RMSDs within cluster: 0.46 Å
Standard deviation of RMSDs superposition to centroid structure: 0.49 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.95
Variance explained by PC2: 0.05

Number of structures in cluster 3: 433
Mean RMSD within cluster: 1.89 Å
Standard deviation of RMSDs within cluster: 0.48 Å
Standard deviation of RMSDs superposition to centroid structure: 0.46 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.88
Variance explained by PC2: 0.12

Number of structures in cluster 4: 591
Mean RMSD within cluster: 2.28 Å
Standard deviation of RMSDs within cluster: 0.53 Å
Standard deviation of RMSDs superposition to centroid structure: 0.52 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.92
Variance explained by PC2: 0.08

Number of structures in cluster 5: 360
Mean RMSD within cluster: 1.70 Å
Standard deviation of RMSDs within cluster: 0.48 Å
Standard deviation of RMSDs superposition to centroid structure: 0.47 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.80
Variance explained by PC2: 0.20

Number of structures in cluster 6: 868
Mean RMSD within cluster: 2.27 Å
Standard deviation of RMSDs within cluster: 0.51 Å
Standard deviation of RMSDs superposition to centroid structure: 0.42 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.78
Variance explained by PC2: 0.22

Number of structures in cluster 7: 469
Mean RMSD within cluster: 1.31 Å
Standard deviation of RMSDs within cluster: 0.46 Å
Standard deviation of RMSDs superposition to centroid structure: 0.36 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.79
Variance explained by PC2: 0.21

Number of structures in cluster 8: 468
Mean RMSD within cluster: 1.51 Å
Standard deviation of RMSDs within cluster: 0.47 Å
Standard deviation of RMSDs superposition to centroid structure: 0.47 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.93
Variance explained by PC2: 0.07

Number of structures in cluster 9: 500
Mean RMSD within cluster: 1.49 Å
Standard deviation of RMSDs within cluster: 0.44 Å
Standard deviation of RMSDs superposition to centroid structure: 0.36 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.82
Variance explained by PC2: 0.18

Number of structures in cluster 10: 776
Mean RMSD within cluster: 2.21 Å
Standard deviation of RMSDs within cluster: 0.50 Å
Standard deviation of RMSDs superposition to centroid structure: 0.57 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.82
Variance explained by PC2: 0.18

Number of structures in cluster 11: 309
Mean RMSD within cluster: 2.67 Å
Standard deviation of RMSDs within cluster: 0.59 Å
Standard deviation of RMSDs superposition to centroid structure: 0.61 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.84
Variance explained by PC2: 0.16

Number of structures in cluster 12: 453
Mean RMSD within cluster: 1.90 Å
Standard deviation of RMSDs within cluster: 0.55 Å
Standard deviation of RMSDs superposition to centroid structure: 0.49 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.65
Variance explained by PC2: 0.35

Number of structures in cluster 13: 1168
Mean RMSD within cluster: 2.18 Å
Standard deviation of RMSDs within cluster: 0.45 Å
Standard deviation of RMSDs superposition to centroid structure: 0.47 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.55
Variance explained by PC2: 0.45

Number of structures in cluster 14: 408
Mean RMSD within cluster: 1.96 Å
Standard deviation of RMSDs within cluster: 0.53 Å
Standard deviation of RMSDs superposition to centroid structure: 0.49 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.60
Variance explained by PC2: 0.40

Number of structures in cluster 15: 611
Mean RMSD within cluster: 1.91 Å
Standard deviation of RMSDs within cluster: 0.53 Å
Standard deviation of RMSDs superposition to centroid structure: 0.54 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.63
Variance explained by PC2: 0.37

Number of structures in cluster 16: 366
Mean RMSD within cluster: 1.15 Å
Standard deviation of RMSDs within cluster: 0.36 Å
Standard deviation of RMSDs superposition to centroid structure: 0.31 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.71
Variance explained by PC2: 0.29

Expected 60 atoms, got 0 atoms in ./fragment_pairs/7ncj_pair_009_Repair.pdb
None of ./fragment_pairs/7ncj_pair_009_Repair.pdb
Number of structures in cluster 17: 857
Mean RMSD within cluster: 1.58 Å
Standard deviation of RMSDs within cluster: 0.50 Å
Standard deviation of RMSDs superposition to centroid structure: 0.50 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.79
Variance explained by PC2: 0.21

Number of structures in cluster 18: 581
Mean RMSD within cluster: 1.46 Å
Standard deviation of RMSDs within cluster: 0.52 Å
Standard deviation of RMSDs superposition to centroid structure: 0.53 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.73
Variance explained by PC2: 0.27

Number of structures in cluster 19: 451
Mean RMSD within cluster: 1.58 Å
Standard deviation of RMSDs within cluster: 0.51 Å
Standard deviation of RMSDs superposition to centroid structure: 0.48 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.68
Variance explained by PC2: 0.32

Number of structures in cluster 20: 875
Mean RMSD within cluster: 1.39 Å
Standard deviation of RMSDs within cluster: 0.39 Å
Standard deviation of RMSDs superposition to centroid structure: 0.26 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.57
Variance explained by PC2: 0.43

Number of structures in cluster 21: 932
Mean RMSD within cluster: 1.76 Å
Standard deviation of RMSDs within cluster: 0.60 Å
Standard deviation of RMSDs superposition to centroid structure: 0.63 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.60
Variance explained by PC2: 0.40

def plot_labeled_umap(df):
    fig = plt.figure(figsize=(10, 6))

    sns.scatterplot(data=df, x='UMAP_1', y='UMAP_2',
                    hue='Cluster', s=20, alpha=1, palette=custom_cluster_palette)

    plt.title('UMAP Projection of Sampled and Template Fragments')
    plt.xlabel('UMAP 1')
    plt.ylabel('UMAP 2')
    plt.legend(title='Clusters')
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.grid(False)
    #plt.ylim(-0.5, 7)
    plt.gca().spines['left'].set_color('dimgray')
    plt.gca().spines['bottom'].set_color('dimgray')
    #plt.gca().set_yticks([0, 2, 4, 6])

    plt.tight_layout()
    plt.savefig("plots/rmsd_projected_labeled_umap.png", dpi=600, bbox_inches="tight")
    plt.savefig("plots/rmsd_projected_labeled_umap.pdf", bbox_inches="tight")
    plt.savefig("plots/rmsd_projected_labeled_umap.svg", bbox_inches="tight")
    plt.show()

umapca_df = pd.DataFrame({
'PDB_File': pdb_files_projected,
'UMAP_1': umapca_emb[:, 0],
'UMAP_2': umapca_emb[:, 1]
})
K_merged = pd.merge(K, umapca_df, on="PDB_File")
plot_labeled_umap(K_merged)

unique_cluster_names = K['Cluster'].unique()
cluster_order_numeric_sorted_str = sorted(unique_cluster_names, key=int)
K['Cluster'] = pd.Categorical(K['Cluster'], categories=cluster_order_numeric_sorted_str, ordered=True)

# Calculate value counts based on the ordered categorical 'Cluster'
# This will now respect the order defined in the categorical type
value_counts_ordered = K['Cluster'].value_counts().sort_index()
#print([f"Cluster {i+1}, {val}" for i, val in enumerate(value_counts_ordered.values)])
# Prepare data for "Template Fragments"
stacked_counts = K.groupby(['Cluster', 'Category']).size().reset_index(name='counts')
stacked_counts = stacked_counts[stacked_counts['Category'] == "Templates"]
#print([f"Cluster {i+1}, {val}" for i, val in enumerate(stacked_counts['counts'].values)])

plt.figure(figsize=(25, 2))
# Plot the total cluster sizes, now ordered by cluster number
ax = sns.barplot(x=value_counts_ordered.index, y=value_counts_ordered.values, color='midnightblue')
# Plot the "Template Fragments" counts, which will also be ordered by cluster number
#ax = sns.barplot(x='Cluster', y='counts', color='mediumorchid', data=stacked_counts) # TODO: purple distribution (kde)
tmp = stacked_counts.set_index('Cluster')['counts']
x_template = np.repeat([i-1 for i in tmp.index.values], tmp.values)
x_template_num = pd.Series(x_template).astype("int64").to_numpy(dtype=float)

# Match x-limits to the clusters
xmin, xmax = value_counts_ordered.index.min(), value_counts_ordered.index.max()
ax.set_xlim(-1, xmax)

# --- KDE on a secondary y-axis ---
ax2 = ax.twinx()
sns.kdeplot(
    x=x_template_num, ax=ax2, fill=True, alpha=0.3,
    bw_adjust=0.1, color='mediumorchid', clip=(xmin - 2, xmax + 2)
)

plt.xlabel('Cluster')
plt.ylabel('Count')
ax.grid(False)
ax2.grid(False)
# remove spines
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_color('dimgray')
ax2.spines['left'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.spines['bottom'].set_visible(False)
ax2.spines['right'].set_color('dimgray')
ax2.set_ylabel('Density')
plt.xticks(rotation=90)
plt.tight_layout()

plt.savefig("plots/cluster_counts_projected_bar.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/cluster_counts_projected_bar.pdf", bbox_inches="tight")
plt.savefig("plots/cluster_counts_projected_bar.svg", bbox_inches="tight")
plt.show()

pdb_to_idx = {file: idx for idx, file in enumerate(pdb_files_projected)}
rmsd_plotting_df = []

for cluster in cluster_order_numeric_sorted_str:
    cluster_rows = K[K['Cluster'] == cluster]
    cluster_indices = cluster_rows['PDB_File'].apply(lambda x: pdb_to_idx[x]).tolist()
    cluster_rmsds_submatrix = rmsd_X_projected[np.ix_(cluster_indices, cluster_indices)]
    # Flatten the upper triangular part
    upper_triangle_indices = np.triu_indices_from(cluster_rmsds_submatrix, k=1)
    all_pairwise_cluster_distances = cluster_rmsds_submatrix[upper_triangle_indices]

    rmsds_test=[]
    for dist in all_pairwise_cluster_distances:
        if dist == 0:
            continue
        rmsds_test.append(dist)
        rmsd_plotting_df.append({'Cluster': cluster, 'RMSD': dist})

rmsd_plotting_df = pd.DataFrame(rmsd_plotting_df)

plt.figure(figsize=(10, 6))
sns.violinplot(
    x='Cluster',
    y='RMSD',
    data=rmsd_plotting_df,
    palette=custom_cluster_palette,
    inner='quartile',
    width=0.7,
    cut=0,
)

plt.xlabel('Cluster', fontsize=12)
plt.ylabel('Pairwise RMSD', fontsize=12)
plt.title('Distribution of Pairwise RMSDs within Clusters', fontsize=14)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

plt.ylim(-0.5, 7)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('dimgray')
plt.gca().spines['bottom'].set_visible(False)
plt.gca().set_yticks([0, 2, 4, 6])

plt.grid(axis='y', linestyle='--', alpha=0.7) #, linewidth=2)
plt.tight_layout()

plt.savefig("plots/rmsd_projected_within_cluster_violin.png", dpi=600, bbox_inches="tight")
plt.savefig("plots/rmsd_projected_within_cluster_violin.pdf", bbox_inches="tight")
plt.savefig("plots/rmsd_projected_within_cluster_violin.svg", bbox_inches="tight")
plt.show()

Second level¶

distance_matrix = DistanceMatrix(rmsd_X_max) # FIXME: MAX_STACK SUPERIMPOSER DOESNT GIVE US A `DISTANCE` MATRIX

# Perform PCoA
num_top_components = 2
pcoa_results = pcoa(distance_matrix, number_of_dimensions=3000)
results_df = pcoa_results.samples.iloc[:, :num_top_components].copy()
## Create column names for the DataFrame based on the number of top components
column_names = [f'PCoA_{i+1}' for i in range(num_top_components)]
results_df.columns = column_names
print_pcoa_explained_var(pcoa_results, num_top_components)


# UMAP
## Initialize UMAP with 'precomputed' metric
umap_model = umap.UMAP(n_neighbors=30, n_components=2, metric='precomputed', random_state=42)
emb = umap_model.fit_transform(rmsd_X_max)

num_templates_max = 170
category_spatial = np.array(
    ["Templates"] * num_templates_max + ["Extracted"] * (len(pdb_files_max) - num_templates_max)
)

plot_pca_and_umap(results_df, emb, category, {"Templates": "mediumorchid", "Extracted": "cornflowerblue"})

Raw Eigenvalues:
PC1    34424.583235
PC2    20307.085038
dtype: float64

Explained Variance Ratio:
PC1    0.049221
PC2    0.029035
dtype: float64

Cumulative Explained Variance:
PC1    0.049221
PC2    0.078256
dtype: float64

{'Extracted': 'Extracted (n=12288)', 'Templates': 'Templates (n=170)'}
Number of templates used 170
Error, trying to plot a df

components_explaining = num_components_explaining(percent_of_variance=90, pcoa_results=pcoa_results)
umapca_model = umap.UMAP(n_neighbors=15, n_components=2, random_state=42, min_dist=0.1)
umapca_emb = umapca_model.fit_transform(pcoa_results.samples.iloc[:, :components_explaining].values)

plot_umap(num_templates_max, pdb_files_max, umapca_emb, save_path="plots/rmsd_max_umap")

90% of variance is explained by the first 2765 components

pdb_to_idx = {file: idx for idx, file in enumerate(pdb_files_max)}

def calculate_summed_distance_within_cluster_spatial(row):
    current_pdb = row['PDB_File']
    current_cluster = row['Cluster']

    # Get all PDB files in the current cluster
    cluster_members_df = K_spatial[K_spatial['Cluster'] == current_cluster]
    cluster_member_pdbs = cluster_members_df['PDB_File'].tolist()

    # Get the index of the current and other cluster members' PDB files in the RMSD matrix
    current_pdb_idx = pdb_to_idx[current_pdb]
    cluster_member_indices = [pdb_to_idx[pdb] for pdb in cluster_member_pdbs]

    # Extract the relevant row from the RMSD matrix (distances from current PDB)
    distances_from_current_pdb = rmsd_X_projected[current_pdb_idx, :] # FIXME
    summed_distance = distances_from_current_pdb[cluster_member_indices].sum()
    # summed_distance_squared = (distances_from_current_pdb[cluster_member_indices]**2).sum() # TODO
    # if i want to know which files distort the cluster
    return summed_distance

# Perform hierarchical clustering
t_distance_spatial = 25
Z_spatial = sch.linkage(squareform(rmsd_X_max), method='ward', metric='precomputed')

labels_spatial = sch.fcluster(Z_spatial, t=t_distance_spatial, criterion='distance')

num_clusters_spatial = len(np.unique(labels_spatial))
print(f"Found {num_clusters_spatial} number of clusters... labeled {len(labels_spatial)} structures")

K_spatial = pd.DataFrame({
    'PDB_File': pdb_files_max,
    'Cluster': labels_spatial,
    'Category': category_spatial
})
# Apply the function to each row
K_spatial['Summed_Distances_Within_Cluster'] = K_spatial.apply(calculate_summed_distance_within_cluster_spatial, axis=1)

plot_dendrogram(Z_spatial, pdb_files_max, num_clusters_spatial, t_distance_spatial, save_path="plots/rmsd_max_dendrogram")

Found 38 number of clusters... labeled 12458 structures

def calculate_summed_distance_within_cluster_spatial_sub(row):
    current_pdb = row['PDB_File']
    current_cluster = row['Cluster']

    # Get all PDB files in the current cluster
    cluster_members_df = K_hier[K_hier['Cluster'] == current_cluster]
    cluster_member_pdbs = cluster_members_df['PDB_File'].tolist()

    # Get the index of the current and other cluster members' PDB files in the RMSD matrix
    current_pdb_idx = pdb_to_idx[current_pdb]
    cluster_member_indices = [pdb_to_idx[pdb] for pdb in cluster_member_pdbs]

    # Extract the relevant row from the RMSD matrix (distances from current PDB)
    distances_from_current_pdb = rmsd_X_projected[current_pdb_idx, :] # FIXME
    summed_distance = distances_from_current_pdb[cluster_member_indices].sum()
    # summed_distance_squared = (distances_from_current_pdb[cluster_member_indices]**2).sum() # TODO
    # if i want to know which files distort the cluster
    return summed_distance

def hierarchical_dendrogram(t_dist, cluster_num, cluster_df, pdb_files, X):
    # Perform hierarchical clustering and return sub-cluster objects    
    if cluster_num > cluster_df['Cluster'].max() or cluster_num <= 0:
        print(f"Pick correct cluster number... not {cluster_num}")
        return
    
    sub_cluster_df = cluster_df[cluster_df['Cluster'] == cluster_num]
    query_files = sub_cluster_df['PDB_File'].to_numpy()
    pdb_files = np.asarray([file.split('/')[-1] for file in pdb_files])
    
    X_sub = X.copy()
    mask = np.isin(pdb_files, query_files)
    selected = np.flatnonzero(mask) 
    selected_files = pdb_files[selected]

    X_sub = X_sub[np.ix_(selected, selected)]

    Z_sub = sch.linkage(squareform(X_sub), method='ward', metric='precomputed')
    labels_sub = sch.fcluster(Z_sub, t=t_dist, criterion='distance')

    num_clusters_sub = len(np.unique(labels_sub))
    print(f"Found {num_clusters_sub} number of clusters... labeled {len(labels_sub)} structures")

    plot_dendrogram(Z_sub, selected_files, num_clusters_sub, t_dist, save_path=f"plots/rmsd_max_of_projected_cluster_{cluster_num}_dendrogram")

    K_hier = pd.DataFrame({
        'PDB_File': selected_files,
        'Cluster': labels_sub,
    })
    return K_hier, X_sub

# Get a clustering of subcluster number 11 with ward threshold 10
subcluster_num = 11
K_hier, X_sub = hierarchical_dendrogram(10, subcluster_num, K, pdb_files_max, rmsd_X_max)
pdb_to_idx = {file: idx for idx, file in enumerate(K_hier['PDB_File'])}
# Apply the function to each row
K_hier['Summed_Distances_Within_Cluster'] = K_hier.apply(calculate_summed_distance_within_cluster_spatial_sub, axis=1)

Found 11 number of clusters... labeled 309 structures

# Plot each 3D-clustered subcluster in 2D
for i in range(len(K_hier['Cluster'].unique())):
    plot_projected_cluster(i+1, projected_alignment=False, X=X_sub, K_labeled=K_hier, save_path=f"plots/max_clusters/sub{subcluster_num}_max_cluster")

Saved superimposed structure to superposed_cluster_1.cif
Number of structures in cluster 1: 48
Mean RMSD within cluster: 2.73 Å
Standard deviation of RMSDs within cluster: 0.79 Å
Standard deviation of RMSDs superposition to centroid structure: 0.67 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.76
Variance explained by PC2: 0.23

Saved superimposed structure to superposed_cluster_2.cif
Number of structures in cluster 2: 31
Mean RMSD within cluster: 2.98 Å
Standard deviation of RMSDs within cluster: 1.20 Å
Standard deviation of RMSDs superposition to centroid structure: 0.61 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.84
Variance explained by PC2: 0.16

Saved superimposed structure to superposed_cluster_3.cif
Number of structures in cluster 3: 27
Mean RMSD within cluster: 2.88 Å
Standard deviation of RMSDs within cluster: 0.83 Å
Standard deviation of RMSDs superposition to centroid structure: 0.78 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.75
Variance explained by PC2: 0.25

Saved superimposed structure to superposed_cluster_4.cif
Number of structures in cluster 4: 45
Mean RMSD within cluster: 2.35 Å
Standard deviation of RMSDs within cluster: 0.70 Å
Standard deviation of RMSDs superposition to centroid structure: 0.50 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.70
Variance explained by PC2: 0.29

Saved superimposed structure to superposed_cluster_5.cif
Number of structures in cluster 5: 9
Mean RMSD within cluster: 0.54 Å
Standard deviation of RMSDs within cluster: 0.19 Å
Standard deviation of RMSDs superposition to centroid structure: 0.20 Å
Total variance explained by best-fit plane: ~0.99
Variance explained by PC1: 0.71
Variance explained by PC2: 0.28

Saved superimposed structure to superposed_cluster_6.cif
Number of structures in cluster 6: 21
Mean RMSD within cluster: 3.16 Å
Standard deviation of RMSDs within cluster: 0.97 Å
Standard deviation of RMSDs superposition to centroid structure: 0.51 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.74
Variance explained by PC2: 0.26

Saved superimposed structure to superposed_cluster_7.cif
Number of structures in cluster 7: 34
Mean RMSD within cluster: 3.07 Å
Standard deviation of RMSDs within cluster: 0.80 Å
Standard deviation of RMSDs superposition to centroid structure: 0.43 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.84
Variance explained by PC2: 0.16

Saved superimposed structure to superposed_cluster_8.cif
Number of structures in cluster 8: 42
Mean RMSD within cluster: 3.13 Å
Standard deviation of RMSDs within cluster: 1.03 Å
Standard deviation of RMSDs superposition to centroid structure: 0.76 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.76
Variance explained by PC2: 0.23

Saved superimposed structure to superposed_cluster_9.cif
Number of structures in cluster 9: 11
Mean RMSD within cluster: 2.28 Å
Standard deviation of RMSDs within cluster: 1.15 Å
Standard deviation of RMSDs superposition to centroid structure: 0.83 Å
Total variance explained by best-fit plane: ~0.99
Variance explained by PC1: 0.75
Variance explained by PC2: 0.24

Saved superimposed structure to superposed_cluster_10.cif
Number of structures in cluster 10: 14
Mean RMSD within cluster: 3.10 Å
Standard deviation of RMSDs within cluster: 1.39 Å
Standard deviation of RMSDs superposition to centroid structure: 0.51 Å
Total variance explained by best-fit plane: ~1.00
Variance explained by PC1: 0.69
Variance explained by PC2: 0.30

Saved superimposed structure to superposed_cluster_11.cif
Number of structures in cluster 11: 27
Mean RMSD within cluster: 2.51 Å
Standard deviation of RMSDs within cluster: 1.17 Å
Standard deviation of RMSDs superposition to centroid structure: 0.79 Å
Total variance explained by best-fit plane: ~0.99
Variance explained by PC1: 0.82
Variance explained by PC2: 0.17