Introduction

About

CDPKit (short for Chemical Data Processing Toolkit) is an open-source cheminformatics toolkit implemented in C++. It comprises a suite of software tools and a programming library, the Chemical Data Processing Library (CDPL), which provides a high-quality, well-tested and modular implementation of the basic functionality typically required by higher-level software applications in the field of cheminformatics. In addition to the CDPL C++ API, an equivalent Python interface layer makes all of CDPL’s functionality easily accessible from Python code.

Key Features

  • Data structures for the representation and processing of molecules, chemical reactions and pharmacophores

  • Routines for all typical cheminformatics pre-processing tasks (e.g. ring and aromaticity perception, stereochemistry processing)

  • Powerful methods for molecule and reaction substructure searching

  • Readers/writers for various file formats (MDL Mol, SDF, Rxn, RDF, Mol2, PDB, MMTF, SMILES, SMARTS, etc.) allowing the I/O of small molecule, macromolecular, reaction and pharmacophore data

  • Molecule fragmentation algorithms (RECAP [1], BRICS [2])

  • Generation of molecule and pharmacophore fingerprints (e.g. ECFP [3])

  • Large collection of implemented chemical structure descriptors

  • 2D structure layout and rendering of molecules and reactions

  • Gaussian shape-based molecule alignment and descriptor calculation [4]

  • Pharmacophore generation, alignment and screening

  • 3D structure and conformer generation [5]

  • Prediction of a wide panel of physicochemical properties

  • Complete, test-suite-compliant implementation of the MMFF94 [6] force field

  • Runs on Linux, macOS and Windows

  • C++ implementation follows best practices for maximum robustness and speed

  • … and many more …

Machine Learning Integration

CDPKit integrates seamlessly with machine learning libraries such as scikit-learn, PyTorch, and TensorFlow. Using CDPKit for tasks such as molecular data I/O, feature extraction, and descriptor calculation greatly aids scientists who intend to build ML models for predicting physicochemical properties, biological activity, sites of metabolism, toxicity, and other attributes of potential drug candidates. An example of such an integration with ML methods is showcased in the source code of the software described in Wieder et al. [7].

License

The CDPKit source code is released under the terms of the GNU Lesser General Public License (LGPL) V2.1-or-later. CDPKit documentation is licensed under the terms of the GNU Free Documentation License (GFDL) V1.2-or-later. Code snippets in tutorials and the source code of CDPL programming examples are distributed under the terms of the Zero-Clause BSD License (0BSD).

Scientific publications

Published scientific work that relies on CDPKit functionality:

How to cite

People