Roadmap
The next major milestone for Cheminee is integration with Quickwit, a cloud-native search engine built on Tantivy. This will bring S3-backed storage, elastic compute scaling, and the full Quickwit operational model to chemical structure search.
Cheminee currently runs as a standalone server wrapping Tantivy directly. This works well, but requires a dedicated server sized for peak indexing workloads even when most of the time is spent serving search traffic. Quickwit solves this by separating compute from storage:
- Index splits go to S3 — no local disk to manage
- Indexers scale independently from searchers — burst to many cores for indexing, scale to zero when idle
- Searchers stay small — they pull splits from S3 on demand
                 +-----------------------+
                 |   quickwit-cheminee   |  <-- new crate, the glue
                 +-----------------------+
                /            |            \
    +-------------+   +-------------+   +------------+
    | DocProcessor|   |  Tokenizer  |   |  Query +   |
    | (transform) |   | (indexing)  |   | Collector  |
    +-------------+   +-------------+   |  (search)  |
           |                 |          +------------+
     +----------+      +-----------+          |
     |  rdkit   |      |   rdkit   |     +----------+
     +----------+      +-----------+     |  rdkit   |
                                         +----------+
A quickwit-cheminee crate will plug into Quickwit at compile time via a cargo feature flag, adding chemistry awareness at three points:
- Indexing: a custom `DocProcessor` that takes raw SMILES, standardizes them, computes fingerprints and descriptors, and enriches the document before it enters the index
- Tokenization: a `SmilesTokenizer` for chemistry-aware text indexing
- Search: custom query types (substructure, superstructure, similarity, identity) and collectors that perform chemical post-filtering using fingerprint screening and RDKit substructure matching

Also planned:
- `cheminee-core` extraction: chemistry logic (standardization, fingerprinting, descriptor computation, substructure matching) will be extracted into a standalone `cheminee-core` crate, shared by both the existing Cheminee server and the Quickwit plugin
- `fingerprint` field type: a first-class Tantivy/Quickwit field type for chemical fingerprints, enabling fast Tanimoto similarity queries and substructure screening directly in the index
- Native Quickwit doc mapping: use Quickwit’s existing schema system rather than maintaining a separate schema library
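The fingerprint screening and Tanimoto similarity mentioned above can be sketched in plain Rust. This is only an illustration of the two bit-vector operations, not Cheminee's or RDKit's implementation: the `Fingerprint` type, its 2048-bit width, and the function names `tanimoto` and `may_contain` are all hypothetical, and in a real deployment the bits would come from RDKit rather than from hand-picked positions.

```rust
/// Hypothetical fixed-width fingerprint: 2048 bits stored as 32 u64 words.
/// The width and layout are assumptions made for this sketch only.
pub const WORDS: usize = 32;

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Fingerprint(pub [u64; WORDS]);

impl Fingerprint {
    /// Build a fingerprint from a list of set-bit positions.
    /// Positions beyond the fingerprint width are folded back with a modulo.
    pub fn from_bits(bits: &[usize]) -> Self {
        let mut words = [0u64; WORDS];
        for &b in bits {
            let b = b % (WORDS * 64);
            words[b / 64] |= 1u64 << (b % 64);
        }
        Fingerprint(words)
    }

    /// Number of set bits in the fingerprint.
    pub fn count_ones(&self) -> u32 {
        self.0.iter().map(|w| w.count_ones()).sum()
    }
}

/// Tanimoto similarity: (bits set in both) / (bits set in either).
/// Two empty fingerprints are scored 0.0 here; that convention is a choice
/// made for this sketch.
pub fn tanimoto(a: &Fingerprint, b: &Fingerprint) -> f64 {
    let mut shared = 0u32;
    let mut total = 0u32;
    for (x, y) in a.0.iter().zip(b.0.iter()) {
        shared += (x & y).count_ones();
        total += (x | y).count_ones();
    }
    if total == 0 {
        0.0
    } else {
        f64::from(shared) / f64::from(total)
    }
}

/// Substructure screen: a candidate can only contain the query as a
/// substructure if every bit set in the query is also set in the candidate.
/// A `true` result is not a match; it only means the candidate must still be
/// checked with an exact substructure search.
pub fn may_contain(candidate: &Fingerprint, query: &Fingerprint) -> bool {
    candidate
        .0
        .iter()
        .zip(query.0.iter())
        .all(|(c, q)| (c & q) == *q)
}
```

In this sketch a search collector would call `may_contain` first to discard most candidates cheaply, and run the expensive RDKit substructure match only on the survivors. Similarity queries would instead keep documents whose `tanimoto` score clears a caller-supplied threshold.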
| Phase | Work | Status |
|---|---|---|
| 1 | Extract cheminee-core crate | Planned |
| 2 | Implement ChemicalDocMapper for indexing | Planned |
| 3 | Implement chemical search queries + collectors | Planned |
| 4 | REST API and integration testing | Planned |
| 5 | Deploy and validate | Planned |
We’re targeting Quickwit release-0.9 (v0.9.0). Key changes from earlier Quickwit versions that affect our integration:
- `DocMapper` is now a concrete struct (not a `dyn` trait)
- Indexing pipeline: `Source → DocProcessor → Indexer → IndexSerializer → Packager → Uploader → Sequencer → Publisher`
- Tantivy is pinned to a Quickwit fork
- License is Apache 2.0