Show and Tell – Structure Search and Cheminée

Javier Pineda, PhD

July 6, 2023

Presentation Overview

1. What is a chemical compound?

2. Why do researchers care about chemical compounds?

3. How does one search for chemical compounds?

4. Where do we come in?

What is a chemical compound?

Why do researchers care about chemical compounds?

Chemical compounds do stuff

[Figure labels: Compound, Protein, DNA, Complex, Methyl Group]

Proteins get messed up sometimes and compounds help

Problem: Cancer cells develop mutations in proteins → cells grow too fast → TUMOR

Solution: Plug up a protein with a compound to stop the cancer cells from growing

Summary

✓ Chemical compounds do stuff

✓ Chemical compounds generally do stuff to or with proteins

✓ Sometimes proteins get messed up, and compounds can bind to proteins and make things better

How does one search for chemical compounds?

Compounds are “screened” in the lab

1 protein tested against 100,000 compounds → hopefully a few “hits”

A successful screen is followed by lead generation

[Figure: Hit / Scaffold (R = Molecular Fragments) → Lead]

Structure combinations are either purchased or synthesized

Researchers need to search for compounds online*

SUBSTRUCTURE SEARCH (Scaffold): “Give me all compounds that contain the following scaffold”

SIMILARITY SEARCH (Reference): “Give me compounds that are similar to this reference molecule”

*Here researchers are purchasing rather than synthesizing

We can convert compound structures to 1s and 0s

100000000001000000

000000000000000000

000100000000000000

000000000000000010

000000000000100000

000000000000000000

000000000010000000

000000000000000000

000000001000000000

000000000000000000
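The idea of turning a structure into 1s and 0s can be sketched as follows. This is a toy illustration only: it assumes the molecule has already been broken into fragments by a cheminformatics toolkit, and the fragment strings, the 18-bit width (chosen to match the rows above), and the CRC32 hash are all assumptions made for the example, not the fingerprint scheme used in Cheminée or RDKit.

```python
import zlib
from typing import Iterable

N_BITS = 18  # assumption: matches the row width on the slide; real fingerprints are much longer


def fragment_bit(fragment: str, n_bits: int = N_BITS) -> int:
    """Map a fragment string to a stable bit position via a CRC32 hash."""
    return zlib.crc32(fragment.encode("utf-8")) % n_bits


def fingerprint(fragments: Iterable[str], n_bits: int = N_BITS) -> int:
    """Switch on one bit per fragment; different fragments may collide on a bit."""
    fp = 0
    for fragment in fragments:
        fp |= 1 << fragment_bit(fragment, n_bits)
    return fp


def as_bit_string(fp: int, n_bits: int = N_BITS) -> str:
    """Render the fingerprint as a fixed-width string of 1s and 0s."""
    return format(fp, f"0{n_bits}b")


# Hypothetical fragment lists, for illustration only.
molecule_fragments = ["C=O", "c1ccccc1", "CN", "CC(C)C"]
scaffold_fragments = ["c1ccccc1"]

molecule_fp = fingerprint(molecule_fragments)
scaffold_fp = fingerprint(scaffold_fragments)
print(as_bit_string(molecule_fp))
print(as_bit_string(scaffold_fp))
```

Because every fragment of the scaffold is also a fragment of the molecule, every bit set in the scaffold fingerprint is also set in the molecule fingerprint.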

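Substructure Search via fingerprints

Molecule: O=C(N(C1CC(C)C)CC2=CC=CC=C2)C(C3CC(C=CC=C4)=C4C3)NC1=O → 1 0 1 0 1 0 1

Query: C1(C=CC=C2)=C2CCC1 → 0 0 0 0 1 0 0

Molecule AND Query = 0 0 0 0 1 0 0 (equal to the query fingerprint, so the molecule may contain the query)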
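A minimal sketch of the AND screen, using the two 7-bit fingerprints from this slide. The function names are made up for the example. The check can only rule molecules out: passing it means the molecule may contain the query, not that it does.

```python
def bits_to_int(bits: str) -> int:
    """Parse a fingerprint written like '1 0 1 0 1 0 1' into an integer."""
    return int(bits.replace(" ", ""), 2)


def may_contain(molecule_fp: int, query_fp: int) -> bool:
    """True if every bit set in the query is also set in the molecule."""
    return molecule_fp & query_fp == query_fp


molecule_fp = bits_to_int("1 0 1 0 1 0 1")
query_fp = bits_to_int("0 0 0 0 1 0 0")

passes = may_contain(molecule_fp, query_fp)

# A hypothetical second molecule that lacks the query's bit is rejected outright.
rejected = may_contain(bits_to_int("1 0 1 0 0 0 1"), query_fp)
```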

Similarity Search via fingerprints

Reference: O=C(N(C1CC(C)C)CC2=CC=CC=C2)C(C3CC(C=CC=C4)=C4C3)NC1=O → 1 0 1 0 1 0 1

Candidate: O=C1C(C2=CC(C=CC=C3)=C3C=C2)CC(C)CC1CC4=CC=CN=C4 → 0 1 1 0 1 0 0

T = 2 shared on-bits / 5 on-bits in either fingerprint = 0.4 (i.e. 40% similarity)
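The Tanimoto value for the two fingerprints shown here can be reproduced with a short sketch. Fingerprints are held as plain Python integers, and returning 0.0 for two empty fingerprints is a convention assumed for the example.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: bits set in both / bits set in either."""
    either = bin(a | b).count("1")
    if either == 0:
        return 0.0  # assumed convention for two all-zero fingerprints
    both = bin(a & b).count("1")
    return both / either


reference_fp = 0b1010101  # 1 0 1 0 1 0 1
candidate_fp = 0b0110100  # 0 1 1 0 1 0 0

similarity = tanimoto(reference_fp, candidate_fp)
print(similarity)  # 2 shared bits out of 5 set in either fingerprint
```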

Fingerprints are a great start, but are not enough

✓ Fingerprints capture fragments, but not connections and positioning

✓ For rigorous substructure matching, we likely need to build the 3D molecules and overlay them to verify a substructure relationship

✓ RDKit can do all of this!

Where do we come in?

Challenges

✓ Substructure, superstructure, exact match, and similarity search

✓ Search among tens of millions of compounds (potentially more)

✓ Get quick results (e.g. within 1 second)

OUR SOLUTION

Say hello to Cheminée*

*Name and package design attributed to Xavier Lange

Cheminée and RX will be friends

1. User draws chemical structure

2. Structure is converted to SMILES (i.e. ChemWriter)

3. Query SMILES is passed to Cheminée, which will:

   a. Standardize SMILES

   b. Get chemical descriptors

   c. Use descriptors to cut through Tantivy index → coarse results

   d. Use fingerprints to filter further → fine results

   e. Use RDKit to perform 3D structure comparisons → final results

4. Cheminée returns Compound IDs

5. Results on Product Hub
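The coarse → fine → final funnel can be sketched in a few lines. This is not the Cheminée implementation: the `Compound` record, the descriptor names, and the `verify` callback are hypothetical. A list scan stands in for the Tantivy index query, and a lookup set stands in for the RDKit structure comparison.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Compound:
    compound_id: str
    descriptors: Dict[str, int]  # hypothetical descriptors, e.g. ring and atom counts
    fingerprint: int             # fingerprint bits packed into an integer


def passes_descriptors(query: Compound, candidate: Compound) -> bool:
    """Coarse filter: a superstructure needs at least as much of everything."""
    return all(candidate.descriptors.get(name, 0) >= count
               for name, count in query.descriptors.items())


def passes_fingerprint(query: Compound, candidate: Compound) -> bool:
    """Fine filter: every query bit must also be set in the candidate."""
    return candidate.fingerprint & query.fingerprint == query.fingerprint


def substructure_search(query: Compound,
                        database: List[Compound],
                        verify: Callable[[Compound, Compound], bool]) -> List[str]:
    """Run the cheap filters first so the expensive check sees few candidates."""
    coarse = [c for c in database if passes_descriptors(query, c)]
    fine = [c for c in coarse if passes_fingerprint(query, c)]
    return [c.compound_id for c in fine if verify(query, c)]


# Toy data, for illustration only.
query = Compound("query", {"rings": 2, "carbons": 9}, 0b0000100)
database = [
    Compound("A", {"rings": 4, "carbons": 25}, 0b1010101),  # true match
    Compound("B", {"rings": 1, "carbons": 6}, 0b0000100),   # fails descriptors
    Compound("C", {"rings": 3, "carbons": 20}, 0b1010001),  # fails fingerprint
    Compound("D", {"rings": 2, "carbons": 10}, 0b0000110),  # fingerprint false positive
]

confirmed = {"A"}  # stand-in for the result of a rigorous structure comparison
hits = substructure_search(query, database,
                           verify=lambda q, c: c.compound_id in confirmed)
```

Ordering the stages from cheapest to most expensive is what keeps the rigorous comparison affordable: in this toy run only two of the four compounds reach the final step.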

Similarity search presents a challenge

✓ Similarity search generally involves bit comparisons (i.e. Tanimoto similarity)

✓ Not feasible to compute Tanimoto similarities across an entire compound database for every similarity search query

“Simple” machine learning can help with similarity search

✓ Compute principal component analysis (PCA) just once on a representative database subset using all chemical descriptors → PCA descriptors

✓ Save the PC matrix from PCA (i.e. a 40 x 2 matrix)

✓ Discretize the PCA descriptors into bins

✓ Assign the query compound to a PCA bin and compute Tanimoto similarity for all neighboring molecules

[Figure: Bins]
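A rough sketch of the binning idea, assuming the PC matrix and descriptor means were already computed and saved. For brevity it uses a made-up 3 x 2 matrix instead of 40 x 2. The bin width, the 3 x 3 neighborhood, and all data values are assumptions for the example, and similarity search is still listed as unfinished in the project itself.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Sequence, Tuple

Bin = Tuple[int, int]


def project(descriptors: Sequence[float], mean: Sequence[float],
            components: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Center the descriptors and multiply by the saved (n_descriptors x 2) PC matrix."""
    centered = [d - m for d, m in zip(descriptors, mean)]
    pc1 = sum(c * row[0] for c, row in zip(centered, components))
    pc2 = sum(c * row[1] for c, row in zip(centered, components))
    return pc1, pc2


def to_bin(point: Tuple[float, float], width: float) -> Bin:
    """Discretize a 2D PCA point into a grid cell."""
    return math.floor(point[0] / width), math.floor(point[1] / width)


def neighborhood(cell: Bin) -> List[Bin]:
    """The cell itself plus its eight adjacent cells."""
    return [(cell[0] + dx, cell[1] + dy) for dx, dy in product((-1, 0, 1), repeat=2)]


def tanimoto(a: int, b: int) -> float:
    either = bin(a | b).count("1")
    return bin(a & b).count("1") / either if either else 0.0


# Made-up saved PCA parameters and toy compounds (descriptors, fingerprint).
mean = [0.0, 0.0, 0.0]
components = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
width = 1.0
compounds = {
    "A": ([0.2, 0.3, 5.0], 0b1010101),
    "B": ([1.5, 0.1, 2.0], 0b0110100),
    "C": ([5.0, 5.0, 1.0], 0b1111111),
    "D": ([-0.5, -0.2, 0.0], 0b0000001),
}

# Build the bin index once, ahead of any query.
index: Dict[Bin, List[str]] = defaultdict(list)
for cid, (desc, _) in compounds.items():
    index[to_bin(project(desc, mean, components), width)].append(cid)

# At query time, only compounds in nearby bins get a Tanimoto comparison.
query_desc, query_fp = [0.4, 0.6, 9.0], 0b1010101
query_bin = to_bin(project(query_desc, mean, components), width)
candidates = sorted(cid for cell in neighborhood(query_bin) for cid in index.get(cell, []))
ranked = sorted(((cid, tanimoto(query_fp, compounds[cid][1])) for cid in candidates),
                key=lambda pair: pair[1], reverse=True)
```

Compound C sits far from the query in PCA space, so it is never compared, which is the saving the bins are meant to provide.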

How far along are we?

✓ Currently working with PubChem compounds

✓ Substructure search is working

✓ Superstructure search is working

✓ Similarity search remains to be done

✓ Need to build out REST API

Acknowledgments

✓ Xavier Lange

✓ Maria Dubyaga

✓ RDKit developers

✓ Tantivy developers

Questions?