1
Show and Tell: Structure Search and Cheminée
Javier Pineda, PhD
July 6, 2023
2
Presentation Overview
1. What is a chemical compound?
2. Why do researchers care about chemical compounds?
3. How does one search for chemical compounds?
4. Where do we come in?
4
What is a chemical compound?
6
Chemical compounds do stuff
[Figure: compound, protein, DNA, complex, and methyl group illustrations]
7
Proteins get messed up sometimes and compounds help
Problem: Cancer cells develop mutations in proteins → cells grow too fast → TUMOR
Solution: Plug up a protein with a compound to stop the cancer cells from growing
8
Summary
✓ Chemical compounds do stuff
✓ Chemical compounds generally do stuff to or with proteins
✓ Sometimes proteins get messed up, and compounds can bind to proteins and make things better
10
Compounds are “screened” in the lab
1 protein, 100,000 compounds → hopefully a few “hits”
11
A successful screen is followed by lead generation
[Figure: a hit/scaffold with a variable group R, where R = molecular fragments; scaffold plus fragment gives a lead]
Structure combinations are either purchased or synthesized
12
Researchers need to search for compounds online*
SUBSTRUCTURE SEARCH
“Give me all compounds that contain the following scaffold.”
[Figure: scaffold with a variable group R]
SIMILARITY SEARCH
“Give me compounds that are similar to this reference molecule.”
[Figure: reference molecule]
*Here researchers are purchasing rather than synthesizing
13
We can convert compound structures to 1s and 0s
100000000001000000
000000000000000000
000100000000000000
000000000000000010
000000000000100000
000000000000000000
000000000000000000
000000000000000000
000000000000000000
000000000010000000
000000000000000000
000000000000000000
000000001000000000
000000000000000000
000000000000000000
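One way to picture the conversion is a hashed fingerprint: each fragment found in the molecule is hashed to a bit position, and that bit is switched on. The sketch below is only an illustration under stated assumptions. The fragment labels are hypothetical strings (a cheminformatics toolkit would enumerate real fragments from the molecular graph), and the 270-bit length is simply the size of the grid shown above.

```python
import zlib

N_BITS = 270  # size of the 15 x 18 grid above; any fixed length works


def fingerprint(fragments, n_bits=N_BITS):
    """Fold a collection of fragment labels into an integer bit vector."""
    bits = 0
    for frag in fragments:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        position = zlib.crc32(frag.encode("utf-8")) % n_bits
        bits |= 1 << position
    return bits


def to_bit_string(bits, n_bits=N_BITS):
    """Render the fingerprint as a string of 1s and 0s, lowest bit first."""
    return "".join("1" if (bits >> i) & 1 else "0" for i in range(n_bits))


# Hypothetical fragment labels, for illustration only
mol_fp = fingerprint(["benzene ring", "amide", "isobutyl", "indane"])
print(to_bit_string(mol_fp))
```

Because two different fragments can hash to the same position, a set bit means "this fragment may be present", not "this fragment is present".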
14
Substructure Search via fingerprints
A: O=C(N(C1CC(C)C)CC2=CC=CC=C2)C(C3CC(C=CC=C4)=C4C3)NC1=O
   1 0 1 0 1 0 1
B: C1(C=CC=C2)=C2CCC1
   0 0 0 0 1 0 0
A AND B:
   0 0 0 0 1 0 0
A AND B equals B, so B may be a substructure of A.
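The AND check on this slide can be sketched in a few lines. This is a minimal illustration, not the Cheminée implementation: fingerprints are held as plain Python integers, and the helper names `bits` and `may_contain` are made up for the example.

```python
def bits(text):
    """Parse a string such as '1 0 1 0 1 0 1' into an integer bit vector."""
    return int(text.replace(" ", ""), 2)


def may_contain(molecule_fp, query_fp):
    """True if every bit set in the query is also set in the molecule."""
    return (molecule_fp & query_fp) == query_fp


A = bits("1 0 1 0 1 0 1")  # fingerprint of molecule A from the slide
B = bits("0 0 0 0 1 0 0")  # fingerprint of molecule B from the slide

print(may_contain(A, B))  # True: B may be a substructure of A
print(may_contain(B, A))  # False: A cannot be a substructure of B

# Screening a (hypothetical) library keeps only the candidates worth a closer look
library = {"cpd-1": A, "cpd-2": bits("0 1 0 0 0 1 0")}
candidates = [cid for cid, fp in library.items() if may_contain(fp, B)]
print(candidates)  # ['cpd-1']
```

A passing screen only says a match is possible; a failing screen rules the molecule out.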
15
Similarity Search via fingerprints
A: O=C(N(C1CC(C)C)CC2=CC=CC=C2)C(C3CC(C=CC=C4)=C4C3)NC1=O
   1 0 1 0 1 0 1
B: O=C1C(C2=CC(C=CC=C3)=C3C=C2)CC(C)CC1CC4=CC=CN=C4
   0 1 1 0 1 0 0
T = 2 bits set in both / 5 bits set in either = 0.4 (i.e. 40% similarity)
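The Tanimoto coefficient T is the number of bits set in both fingerprints divided by the number of bits set in either. A minimal sketch using the two bit vectors from this slide, again with fingerprints stored as plain Python integers (an assumption made for illustration):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two integer bit vectors."""
    shared = bin(fp_a & fp_b).count("1")   # bits set in both
    either = bin(fp_a | fp_b).count("1")   # bits set in at least one
    return shared / either if either else 0.0


A = int("1010101", 2)
B = int("0110100", 2)

print(tanimoto(A, B))  # 0.4
```

Here two bits are shared and five bits are set overall, which gives 2 / 5 = 0.4.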
16
Fingerprints are a great start, but are not enough
✓ Fingerprints capture fragments, but not connections and positioning
✓ For rigorous substructure matching, we likely need to build the 3D molecules and overlay them to verify a substructure relationship
✓ RDKit can do all of this!
18
Challenges
✓ Substructure, superstructure, exact match, and similarity search
✓ Search among 10s of millions of compounds (potentially more)
✓ Get quick results (e.g. 1 sec)
19
OUR SOLUTION
Powered by
20
Say hello to Cheminée*
*Name and package design attributed to Xavier Lange
21
Cheminée and RX will be friends
RX
User draws chemical structure
Structure is converted to SMILES (i.e. ChemWriter)
→ Query SMILES →
Cheminée
Standardize SMILES
Get chemical descriptors
Use descriptors to cut through Tantivy index → coarse results
Use fingerprints to filter further → fine results
Use RDKit to perform 3D structure comparisons → final results
→ Compound IDs →
Results on Product Hub
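The coarse → fine → final funnel can be sketched as three filters of increasing cost. This is an illustration under loud assumptions, not Cheminée's code: the `Compound` record, the use of a heavy-atom count as the only descriptor, and the `exact_match` callback are all hypothetical stand-ins (on this slide the first stage is a Tantivy index query and the last stage is an RDKit comparison).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Compound:
    compound_id: str
    smiles: str
    heavy_atoms: int   # stand-in for the chemical descriptors
    fingerprint: int   # bit vector stored as an integer


def substructure_search(
    query: Compound,
    library: Iterable[Compound],
    exact_match: Callable[[str, str], bool],
) -> List[str]:
    # Coarse: a molecule smaller than the query cannot contain it
    coarse = [c for c in library if c.heavy_atoms >= query.heavy_atoms]
    # Fine: every fingerprint bit of the query must be set in the molecule
    fine = [c for c in coarse
            if (c.fingerprint & query.fingerprint) == query.fingerprint]
    # Final: run the expensive exact check only on the survivors
    return [c.compound_id for c in fine if exact_match(c.smiles, query.smiles)]


library = [
    Compound("cpd-1", "c1ccccc1CCN", 9, 0b1011),
    Compound("cpd-2", "CCO", 3, 0b0011),
    Compound("cpd-3", "c1ccccc1O", 7, 0b0111),  # passes the fingerprint screen only
]
query = Compound("query", "c1ccccc1C", 7, 0b0011)

# Stand-in verifier: a lookup of known (molecule, query) matches
verified = {("c1ccccc1CCN", "c1ccccc1C")}
hits = substructure_search(query, library, lambda mol, q: (mol, q) in verified)
print(hits)  # ['cpd-1']
```

Each stage is cheaper than the next, so the expensive comparison runs on as few compounds as possible.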
22
Similarity search presents a challenge
✓ Similarity search generally involves bit comparisons (i.e. Tanimoto similarity)
✓ It is not feasible to compute Tanimoto similarities across an entire compound database for every similarity search query
23
“Simple” machine learning can help with similarity search
✓ Compute principal components analysis (PCA) just once on a representative database subset using all chemical descriptors → PCA descriptors
✓ Save PC matrix from PCA (i.e. 40 x 2 matrix)
✓ Discretize the PCA descriptors into bins
✓ Assign the query compound to a PCA bin and compute Tanimoto similarity for all neighboring molecules
[Figure: bins]
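The binning idea can be sketched as follows. This is a rough illustration, not the planned implementation: it assumes the PC matrix has already been fitted and saved, that descriptor vectors are already mean-centered, and that "neighboring" means the query's bin plus the eight bins around it. The 3 x 2 matrix, bin width, and compound data are made-up values.

```python
import math
from collections import defaultdict


def project(descriptors, pc_matrix):
    """Project a centered descriptor vector onto the two saved principal components."""
    return tuple(
        sum(x * row[k] for x, row in zip(descriptors, pc_matrix)) for k in range(2)
    )


def bin_of(point, width):
    """Discretize a 2D PCA point into an integer grid cell."""
    return tuple(math.floor(coord / width) for coord in point)


def tanimoto(fp_a, fp_b):
    either = bin(fp_a | fp_b).count("1")
    return bin(fp_a & fp_b).count("1") / either if either else 0.0


def build_bins(database, pc_matrix, width):
    """Group compounds by PCA bin once, ahead of any query."""
    bins = defaultdict(list)
    for compound_id, (descriptors, fp) in database.items():
        bins[bin_of(project(descriptors, pc_matrix), width)].append((compound_id, fp))
    return bins


def similarity_search(query_descriptors, query_fp, bins, pc_matrix, width):
    """Score only compounds in the query's bin and its eight neighbors."""
    i, j = bin_of(project(query_descriptors, pc_matrix), width)
    scored = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for compound_id, fp in bins.get((i + di, j + dj), []):
                scored.append((compound_id, tanimoto(query_fp, fp)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-descriptor x 2-component matrix (the slide mentions 40 x 2)
PC = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
WIDTH = 1.0
database = {
    "cpd-1": ([0.2, 0.3, 5.0], 0b1010101),
    "cpd-2": ([0.9, 0.1, 1.0], 0b0110100),
    "cpd-3": ([7.5, 8.2, 0.0], 0b0110100),  # far away in PCA space, never scored
}
bins = build_bins(database, PC, WIDTH)
results = similarity_search([1.1, 0.4, 2.0], 0b1010101, bins, PC, WIDTH)
print(results)  # [('cpd-1', 1.0), ('cpd-2', 0.4)]
```

The trade-off is that similar compounds falling outside the neighboring bins are missed, so bin width controls the balance between speed and recall.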
24
How far along are we?
✓ Currently working with PubChem compounds
✓ Substructure search is working
✓ Superstructure search is working
✓ Similarity search remains to be done
✓ Need to build out REST API
25
Acknowledgments
✓ Xavier Lange
✓ Maria Dubyaga
✓ RDKit developers
✓ Tantivy developers
26
Questions?