Activity Cliff identification using BitBIRCH

This repository provides a comprehensive toolkit for identifying and analyzing activity cliffs in molecular datasets using BitBIRCH clustering algorithms.

Figure: (a) Diameter BitBIRCH cluster (contains activity cliffs); (b) smooth clusters (no activity cliffs).

Overview of methods

This repository provides tools for:

Activity Cliff Detection: Identify pairs of structurally similar molecules with significantly different biological activities
BitBIRCH Clustering: Efficient clustering algorithm optimized for molecular fingerprints
Multi-Fingerprint Analysis: Support for RDKit, ECFP4, and MACCS molecular fingerprints
Smooth vs Cliff Clustering: Compare clustering behavior for activity cliffs vs smooth activity relationships
Visualization: Generate molecular structure visualizations for cluster analysis

Key Features

Flexible Ordering: Multiple ordering strategies for fingerprint processing (random, sum-based, centroid-based)
Recursive Analysis: Optional recursive clustering to identify additional activity cliffs
Comprehensive Benchmarking: Built-in parameter tuning and performance comparison tools
Rich Visualizations: SVG-based molecular structure displays with activity annotations
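
BIRCH-style clustering processes items incrementally, so the order in which fingerprints are presented can affect the resulting clusters. The sketch below shows one plausible reading of the random and sum-based strategies: shuffling the rows, or sorting them by the number of set bits. The function name order_fingerprints and the interpretation of "increasing_sum" are assumptions for illustration and may not match the repository's implementation; the centroid-based variant is omitted.

```python
import numpy as np


def order_fingerprints(fps, strategy="increasing_sum", seed=0):
    """Return row indices giving the order in which to process fingerprints.

    fps is a (n_molecules, n_bits) array of 0/1 values. This is a hypothetical
    helper, not a function from the repository.
    """
    fps = np.asarray(fps)
    if strategy == "random":
        # Reproducible shuffle of the row indices.
        return np.random.default_rng(seed).permutation(len(fps))
    if strategy == "increasing_sum":
        # Assumption: "sum" means the number of on-bits per fingerprint.
        # A stable sort keeps the input order among ties.
        return np.argsort(fps.sum(axis=1), kind="stable")
    raise ValueError(f"Unknown ordering strategy: {strategy!r}")


toy_fps = np.array([[1, 1, 1], [0, 0, 1], [1, 1, 0]])
order = order_fingerprints(toy_fps)
ordered_fps = toy_fps[order]  # rows now sorted by increasing bit count
```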

Installation

Prerequisites

pip install numpy pandas rdkit matplotlib pillow scikit-learn tqdm

This toolkit also requires the BitBIRCH module; installation instructions can be found at https://github.com/mqcomplab/bitbirch.

Setup

Clone the repository:

git clone https://github.com/mqcomplab/BitBIRCH_AC.git
cd BitBIRCH_AC

Quick Start

Data Preparation

1. Generate Fingerprints

Convert SMILES strings to molecular fingerprints:

python gen_fp.py

Input: CSV files with columns:

smiles: SMILES strings
exp_mean [nM]: Experimental activity values

Output: Pickle files containing fingerprints for RDKit, ECFP4, and MACCS

2. Process Library Data

Generate similarity and property difference matrices:

python process_library.py

Output: Numpy arrays in files/ directory:

fps_*.npy: Fingerprint arrays
props_*.npy: Property arrays (log10-transformed)
tani_matrix_*.npy: Binarized pairwise Tanimoto similarity matrices (1 if similarity ≥ threshold, else 0)
prop_diff_matrix_*.npy: Binarized pairwise property difference matrices (1 if difference ≥ 1 log unit, else 0)
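
As a rough illustration of how the two binary matrices combine, the sketch below builds them with NumPy for a toy set of fingerprints and flags as activity cliffs the pairs that are 1 in both. The function names, the 0.8 similarity threshold used in the toy call, and the toy data are assumptions for illustration; they are not taken from process_library.py.

```python
import numpy as np


def tanimoto_matrix(fps):
    """Pairwise Tanimoto similarity for a (n_molecules, n_bits) 0/1 array."""
    fps = np.asarray(fps, dtype=np.int64)
    common = fps @ fps.T                     # shared on-bits for every pair
    counts = fps.sum(axis=1)                 # on-bits per molecule
    union = counts[:, None] + counts[None, :] - common
    # Two all-zero fingerprints have an empty union; treat their similarity as 0.
    out = np.zeros(common.shape, dtype=float)
    return np.divide(common, union, out=out, where=union > 0)


def activity_cliff_pairs(fps, activities_nM, sim_threshold=0.9, diff_threshold=1.0):
    """Return the two binary matrices and the (i, j) pairs flagged in both."""
    props = np.log10(np.asarray(activities_nM, dtype=float))
    tani = (tanimoto_matrix(fps) >= sim_threshold).astype(np.uint8)
    diffs = np.abs(props[:, None] - props[None, :])
    prop_diff = (diffs >= diff_threshold).astype(np.uint8)
    cliffs = tani & prop_diff                # similar AND different in activity
    i, j = np.nonzero(np.triu(cliffs, k=1))  # upper triangle: each pair once
    return tani, prop_diff, list(zip(i.tolist(), j.tolist()))


# Toy data: molecules 0 and 1 share 9 of 10 bits but differ 100-fold in activity.
toy_fps = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
]
toy_activities_nM = [10.0, 1000.0, 12.0]
tani, prop_diff, pairs = activity_cliff_pairs(toy_fps, toy_activities_nM, sim_threshold=0.8)
```

Molecules 1 and 2 also differ by more than 1 log unit, but they are not structurally similar, so that pair is not flagged.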

Activity Cliff Analysis

Perform comprehensive AC analysis with different parameter combinations:

# Basic analysis
python AC.py --order increasing_sum --recursive False

# Advanced analysis with multiple configurations
python AC.py --order increasing_sum increasing_sum_cent --recursive True --use_offsets --max_workers 20

Generation and Visualization of Smooth/Cliff Clusters

Figure: (a) Standard deviation in properties of Diameter BitBIRCH clusters and smooth clusters.

Our methods naturally lead to a recipe for generating clusters devoid of activity cliffs; see tutorial2.ipynb.

import numpy as np

from scripts.smooth_ac import FingerprintClusterAnalyzer

# Load the input data
smiles_file = 'data/CHEMBL4005_Ki.csv'  # Ensure this is consistent with the npy files

# Initialize the analyzer
analyzer = FingerprintClusterAnalyzer(
    fingerprint_types=['ECFP', 'MACCS', 'RDKIT'],
    thresholds=np.linspace(0.3, 0.9, 7),  # 0.3, 0.4, ..., 0.9; adjust as needed
    top=10,
)

# Load data and perform clustering
analyzer.load_fingerprint_data(data_prefix='CHEMBL4005_Ki_fp', prop_prefix='CHEMBL4005_Ki_fp')
analyzer.perform_clustering()
analyzer.save_results_to_csv(filename='results/clustering_results_CHEMBL4005_Ki.csv')

# Compare fingerprint types at a specific threshold.
# Visualizes and saves the first (or most populated) cluster across all fingerprint types at the 0.9 threshold.
analyzer.compare_fingerprint_types(0.9, 1, smiles_file, max_molecules=20)
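
One simple way to judge how smooth a clustering is, in the spirit of the standard-deviation comparison in this section, is to measure the spread of the log-transformed property within each cluster. The sketch below assumes a hypothetical labels array of cluster assignments alongside the property array; the function name and the toy values are illustrative and are not part of FingerprintClusterAnalyzer.

```python
import numpy as np


def cluster_property_spread(labels, props):
    """Per-cluster size, standard deviation, and range of a property.

    labels: cluster index for each molecule (hypothetical input).
    props: log10-transformed property for each molecule.
    """
    labels = np.asarray(labels)
    props = np.asarray(props, dtype=float)
    spread = {}
    for label in np.unique(labels):
        members = props[labels == label]
        spread[int(label)] = {
            "size": int(members.size),
            "std": float(members.std()),  # population standard deviation
            "range": float(members.max() - members.min()),
        }
    return spread


# Toy example: cluster 0 is tight, cluster 1 spans 2 log units.
toy_labels = [0, 0, 0, 1, 1]
toy_props = [1.0, 1.1, 1.2, 1.0, 3.0]
spread = cluster_property_spread(toy_labels, toy_props)
```

A cluster whose range reaches 1 log unit contains at least one pair that meets the property-difference criterion used above, so it is a candidate for containing an activity cliff.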

Notebooks

Further details of the code can be found in the Tutorial.ipynb notebook.

Citation

https://www.biorxiv.org/content/10.1101/2025.09.17.676791v1
