"Simple" machine learning can help with similarity search
✓ Compute principal component analysis (PCA) just once, on a representative database subset, using all chemical descriptors → PCA descriptors
✓ Save the PC matrix from PCA (e.g. a 40 x 2 matrix)
✓ Discretize the PCA descriptors into bins
✓ Assign the query compound to a PCA bin and compute Tanimoto similarity for all neighboring molecules
[Figure: bins in PCA descriptor space]
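The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation behind the slide: the descriptor and fingerprint data are synthetic, the function names (`fit_pca`, `to_bins`, `tanimoto`, `search`) are hypothetical, the grid uses 10 equal-width bins per PC axis, and "neighboring" is taken to mean molecules in the query's bin or an adjacent bin.

```python
import numpy as np

def fit_pca(X_subset, n_components=2):
    """Fit PCA once on a representative subset; return (mean, PC matrix)."""
    mean = X_subset.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X_subset - mean, full_matrices=False)
    return mean, Vt[:n_components].T  # shape (n_descriptors, n_components)

def to_bins(X, mean, pcs, edges):
    """Project descriptors onto the saved PCs and discretize each PC axis."""
    scores = (X - mean) @ pcs
    cols = [np.digitize(scores[:, j], edges[j]) for j in range(pcs.shape[1])]
    return np.stack(cols, axis=1)  # one integer bin index per PC axis

def tanimoto(query_fp, fps):
    """Tanimoto similarity between one binary fingerprint and many."""
    inter = (fps & query_fp).sum(axis=1)
    union = (fps | query_fp).sum(axis=1)
    return np.where(union > 0, inter / np.maximum(union, 1), 0.0)

def search(query_desc, query_fp, db_bins, db_fps, mean, pcs, edges):
    """Compare the query only against molecules in its own or adjacent bins."""
    qbin = to_bins(query_desc[None, :], mean, pcs, edges)[0]
    idx = np.flatnonzero((np.abs(db_bins - qbin) <= 1).all(axis=1))
    sims = tanimoto(query_fp, db_fps[idx])
    order = np.argsort(-sims)
    return idx[order], sims[order]

# Synthetic stand-in data (assumption): 1000 molecules, 40 descriptors,
# 128-bit binary fingerprints.
rng = np.random.default_rng(0)
X_db = rng.normal(size=(1000, 40))
fps_db = rng.random((1000, 128)) < 0.2

# Steps 1-2: PCA on a 200-molecule subset; keep the 40 x 2 PC matrix.
mean, pcs = fit_pca(X_db[:200])

# Step 3: inner bin edges per PC axis, derived from the subset's scores.
subset_scores = (X_db[:200] - mean) @ pcs
edges = [np.linspace(s.min(), s.max(), 11)[1:-1] for s in subset_scores.T]
db_bins = to_bins(X_db, mean, pcs, edges)

# Step 4: use molecule 0 as the query.
hits, sims = search(X_db[0], fps_db[0], db_bins, fps_db, mean, pcs, edges)
print(f"{len(hits)} of {len(X_db)} molecules compared; best hit index {hits[0]}")
```

The saving comes from the last step: Tanimoto similarity is computed only for the candidates returned by the bin lookup rather than for the whole database, at the cost of possibly missing similar molecules that fall outside the neighboring bins.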