CDPKit (short for Chemical Data Processing Toolkit) is an open-source cheminformatics toolkit implemented in C++. It comprises a suite of software tools and a programming library, the Chemical Data Processing Library (CDPL), which provides a high-quality, well-tested and modular implementation of the basic functionality typically required by higher-level software applications in the field of cheminformatics. In addition to the CDPL C++ API, an equivalent Python interface layer is provided that makes all of CDPL’s functionality easily accessible from Python code.

Key Features

  • Data structures for the representation and processing of molecules, chemical reactions and pharmacophores

  • Routines for all typical cheminformatics pre-processing tasks (e.g. ring and aromaticity perception, stereochemistry processing, …)

  • Powerful methods for molecule and reaction substructure searching

  • Readers/writers for various file formats (MDL Mol, SDF, Rxn, RDF, Mol2, PDB, MMTF, SMILES, SMARTS, etc.) allowing the I/O of small molecule, macromolecular, reaction and pharmacophore data

  • Molecule fragmentation algorithms (RECAP [1], BRICS [2])

  • Generation of molecule and pharmacophore fingerprints (e.g. ECFP [3])

  • Large collection of implemented chemical structure descriptors

  • 2D structure layout and rendering of molecules and reactions

  • Gaussian shape-based molecule alignment and descriptor calculation [4]

  • Pharmacophore generation, alignment and screening

  • 3D structure and conformer generation [5]

  • Prediction of a wide panel of physicochemical properties

  • Full-blown test-suite compliant implementation of the MMFF94 [6] force field

  • Runs without flaws on Linux, macOS and Windows

  • C++ implementation follows best practices for maximum robustness and speed

  • … and many more …

Machine Learning Integration

CDPKit integrates seamlessly with machine learning libraries such as scikit-learn, PyTorch, and TensorFlow. Using CDPKit for tasks such as molecular data I/O, feature extraction, and descriptor calculation greatly aids scientists who intend to build ML models for predicting physicochemical properties, biological activity, site of metabolism, toxicity, and other attributes of potential drug candidates. An example of such an integration with ML methods is showcased in the source code of the software described in Wieder et al. [7].
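
As a rough illustration of the hand-off between feature extraction and model building, the following minimal sketch turns fingerprints, given as lists of set-bit indices, into a dense 0/1 feature matrix of the kind ML libraries expect. It deliberately uses only the Python standard library and makes no CDPL calls: the helper names `bits_to_row` and `build_feature_matrix` and the toy input data are hypothetical and not part of CDPKit. In practice the bit indices would come from a fingerprint generated with CDPKit.

```python
from typing import Iterable, List, Sequence


def bits_to_row(on_bits: Iterable[int], num_bits: int) -> List[int]:
    """Expand the indices of set fingerprint bits into a dense 0/1 feature row."""
    row = [0] * num_bits
    for idx in on_bits:
        if not 0 <= idx < num_bits:
            raise ValueError(f"bit index {idx} outside fingerprint of size {num_bits}")
        row[idx] = 1
    return row


def build_feature_matrix(fingerprints: Sequence[Iterable[int]],
                         num_bits: int) -> List[List[int]]:
    """Stack one dense row per molecule into a (molecules x bits) matrix."""
    return [bits_to_row(fp, num_bits) for fp in fingerprints]


# Hypothetical placeholder data: set-bit indices of three 8-bit fingerprints.
# Real input would be read from fingerprints computed with CDPKit.
toy_fingerprints = [[0, 3], [1, 3, 7], []]

X = build_feature_matrix(toy_fingerprints, num_bits=8)
for row in X:
    print(row)
```

A nested list like `X` can be converted with `numpy.asarray(X)` and passed, together with a vector of target values, to the `fit` method of a scikit-learn estimator. Keeping one row per molecule and one column per fingerprint bit matches the sample-by-feature layout that these libraries assume.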


The CDPKit source code is released under the terms of the GNU Lesser General Public License (LGPL) V2.1-or-later. CDPKit documentation is licensed under the terms of the GNU Free Documentation License (GFDL) V1.2-or-later. Code snippets in tutorials and the source code of CDPL programming examples are distributed under the terms of the Zero-Clause BSD License (0BSD).

Scientific publications

Published scientific work that relies on CDPKit functionality:

How to cite