Molecule Clustering

Similarity Based Clustering

Here is an example of clusting molecule with Tanimoto distance. Firstly, a similarity matrix whose element in the $i^{th}$ row and $j^{th}$ column is the structure similarity between molecule $i$ and molecule $j$, is constructed as a placeholder. Then, use RDKit tools computes the similarity between each molecule pair based on their topological fingerprints.

import numpy as np
from rdkit import Chem
from rdkit.DataStructs import FingerprintSimilarity

sm = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        m1, m2 = Chem.MolFromSmiles(smile_list[i]), Chem.MolFromSmiles(smile_list[j])
        sm[i, j] = FingerprintSimilarity(Chem.RDKFingerprint(m1), Chem.RDKFingerprint(m2))
sm = sm + sm.T - np.eye(n)

Rows (or columns) of the similarity matrix can be regarded as the features of the corresponding molecules. Therefore, the molecules are clusted based on these similarity features.

from sklearn.cluster import AffinityPropagation

af = AffinityPropagation().fit(sm)
cluster_centers_indices = af.cluster_centers_indices_
n_clusters = len(cluster_centers_indices)
labels = af.labels_

Representation Based Clustering

The vector representation of a molecule can be generated obained from either machine learning model (e.g. molecule 3D structure embedding model) or chemical features (e.g. descriptors). Here we give an example of clusting drugs based on several descriptor modules.

from rdkit.Chem import Descriptors as ds
from rdkit import Chem

modules = ['ExactMolWt', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'HeavyAtomMolWt', 'MaxAbsPartialCharge', 'MaxPartialCharge', 'MinAbsPartialCharge', 'MinPartialCharge', 'MolWt', 'NumValenceElectrons']

def generate_descriptor_features(smile_list, module_list) -> np.array:
    features = np.zeros((len(smile_list), len(modules)))
    for i, smile in enumerate(smile_list):
        mol = Chem.MolFromSmiles(smile)
        features[i] = [ds.__dict__[f](mol) for f in modules]

    return features
    
features = generate_descriptor_features(smiles, modules)

Firstly, 11 descriptor modules are selected to generate molecular features.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap

def generate_embedding(scaled_data, dim=3, mod='pca', seed = 20):
    if mod == 'umap':
        reducer = umap.UMAP(n_components=dim, random_state=seed)
        embeddings = reducer.fit_transform(scaled_data) 
    if mod == 'pca':
        pca = PCA(n_components=3, random_state=seed)
        pca.fit(scaled_data)
        embeddings = pca.transform(scaled_data)
    
    return embeddings, mod

features = StandardScaler().fit_transform(features)
embeddings, mod = generate_embedding(features, mod='umap', seed=8)

Then, standardize the features by removing the mean and scaling to unit variance, which helps to avoid the dominance of features with larger values than others. Finally, linear and(or) nonlinear dimensionality reduction methods can be used for visualization.