Introduction
About
CDPKit (short for Chemical Data Processing Toolkit) is an open-source cheminformatics toolkit implemented in C++. CDPKit comprises a suite of software tools and a programming library called the Chemical Data Processing Library (CDPL), which provides a high-quality and well-tested modular implementation of basic functionality typically required by any higher-level software application in the field of cheminformatics. In addition to the CDPL C++ API, an equivalent Python interface is provided that makes all of CDPL’s functionality easily accessible from Python code.
Key Features
Data structures for the representation and processing of molecules, chemical reactions and pharmacophores
Routines for all typical cheminformatics pre-processing tasks (e.g. ring and aromaticity perception, stereochemistry processing, …)
Powerful methods for molecule and reaction substructure searching
Readers/writers for various file formats (MDL Mol, SDF, Rxn, RDF, Mol2, PDB, MMTF, SMILES, SMARTS, etc.) allowing the I/O of small molecule, macromolecular, reaction and pharmacophore data
Generation of molecule and pharmacophore fingerprints (e.g. ECFP [3])
Large collection of implemented chemical structure descriptors
2D structure layout and rendering of molecules and reactions
Gaussian shape-based molecule alignment and descriptor calculation [4]
Pharmacophore generation, alignment and screening
3D structure and conformer generation [5]
Prediction of a wide panel of physicochemical properties
Complete implementation of the MMFF94 [6] force field that is compliant with its test suite
Runs on Linux, macOS and Windows
C++ implementation follows best practices for maximum robustness and speed
… and many more …
Machine Learning Integration
CDPKit integrates seamlessly with machine learning libraries such as scikit-learn, PyTorch, and TensorFlow. Using CDPKit for tasks such as molecular data I/O, feature extraction, and descriptor calculation greatly aids scientists who intend to build ML models for the prediction of physicochemical properties, biological activity, site of metabolism, toxicity, and other attributes of potential drug candidates. An example of such an integration with ML methods is showcased in the source code of the software described in Wieder et al. [7].
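The general hand-off pattern (molecule input, fixed-length feature vector, model) can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `toy_featurize` is a hypothetical stand-in that hashes SMILES character pairs and is not a CDPKit function, and the 1-nearest-neighbour predictor merely takes the place of a real estimator from one of the ML libraries named above. In practice the featurizer would be replaced by a CDPL descriptor or fingerprint calculation.

```python
from typing import Callable, List, Sequence, Tuple


def toy_featurize(smiles: str, n_bits: int = 1024) -> List[int]:
    """Hypothetical stand-in for a toolkit featurizer (NOT a CDPKit call).

    Hashes overlapping 2-character substrings of the SMILES string into a
    fixed-length bit vector. It has no chemical meaning; it only mimics the
    shape of a fingerprint so the downstream code has something to consume.
    """
    bits = [0] * n_bits
    for i in range(len(smiles) - 1):
        pair = smiles[i:i + 2]
        # Deterministic hash; the built-in hash() of str is salted per process.
        idx = (ord(pair[0]) * 131 + ord(pair[1])) % n_bits
        bits[idx] = 1
    return bits


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto similarity of two equal-length bit vectors."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


def predict_nearest(
    train: Sequence[Tuple[str, int]],
    query_smiles: str,
    featurize: Callable[[str], List[int]] = toy_featurize,
) -> int:
    """Return the label of the training molecule most similar to the query."""
    query = featurize(query_smiles)
    best_label, best_sim = train[0][1], -1.0
    for smiles, label in train:
        sim = tanimoto(featurize(smiles), query)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


# Made-up labels for illustration only: 0 = aliphatic alcohol, 1 = aromatic.
training_set = [("CCO", 0), ("CCCO", 0), ("c1ccccc1", 1), ("c1ccccc1C", 1)]

if __name__ == "__main__":
    print(predict_nearest(training_set, "c1ccccc1"))
```

Because the featurizer is passed in as a plain callable, swapping the toy function for a real descriptor calculation leaves the rest of the pipeline unchanged.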
License
The CDPKit source code is released under the terms of the GNU Lesser General Public License (LGPL) V2.1-or-later. CDPKit documentation is licensed under the terms of the GNU Free Documentation License (GFDL) V1.2-or-later. Code snippets in tutorials and the source code of CDPL programming examples are distributed under the terms of the Zero-Clause BSD License (0BSD).
Scientific publications
Published scientific work that relies on CDPKit functionality:
Thomas Seidel, Christian Permann, Oliver Wieder, Stefan M. Kohlbacher, and Thierry Langer. High-quality conformer generation with conforge: algorithm and performance assessment. Journal of Chemical Information and Modeling, 2023. PMID: 37624145. URL: https://doi.org/10.1021/acs.jcim.3c00563, doi:10.1021/acs.jcim.3c00563.
Oliver Wieder, Mélaine Kuenemann, Marcus Wieder, Thomas Seidel, Christophe Meyer, Sharon D. Bryant, and Thierry Langer. Improved lipophilicity and aqueous solubility prediction with composite graph neural networks. Molecules, 2021. URL: https://www.mdpi.com/1420-3049/26/20/6185, doi:10.3390/molecules26206185.
Ya Chen, Thomas Seidel, Roxane Axel Jacob, Steffen Hirte, Angelica Mazzolari, Alessandro Pedretti, Giulio Vistoli, Thierry Langer, Filip Miljković, and Johannes Kirchmair. Active learning approach for guiding site-of-metabolism measurement and annotation. Journal of Chemical Information and Modeling, 64(2):348–358, 2024. PMID: 38170877. URL: https://doi.org/10.1021/acs.jcim.3c01588, doi:10.1021/acs.jcim.3c01588.
Doris A. Schuetz, Thomas Seidel, Arthur Garon, Riccardo Martini, Markus Körbel, Gerhard F. Ecker, and Thierry Langer. Grail: grids of pharmacophore interaction fields. Journal of Chemical Theory and Computation, 14(9):4958–4970, 2018. PMID: 30075621. URL: https://doi.org/10.1021/acs.jctc.8b00495, doi:10.1021/acs.jctc.8b00495.
Marcus Wieder, Arthur Garon, Ugo Perricone, Stefan Boresch, Thomas Seidel, Anna Maria Almerico, and Thierry Langer. Common hits approach: combining pharmacophore modeling and molecular dynamics simulations. Journal of Chemical Information and Modeling, 57(2):365–385, 2017. PMID: 28072524. URL: https://doi.org/10.1021/acs.jcim.6b00674, doi:10.1021/acs.jcim.6b00674.
Stefan Michael Kohlbacher, Matthias Schmid, Thomas Seidel, and Thierry Langer. Applications of the novel quantitative pharmacophore activity relationship method qphar in virtual screening and lead-optimisation. Pharmaceuticals, 2022. URL: https://www.mdpi.com/1424-8247/15/9/1122, doi:10.3390/ph15091122.
Jörg Heider, Jonas Kilian, Aleksandra Garifulina, Steffen Hering, Thierry Langer, and Thomas Seidel. Apo2ph4: a versatile workflow for the generation of receptor-based pharmacophore models for virtual screening. Journal of Chemical Information and Modeling, 63(1):101–110, 2023. PMID: 36526584. URL: https://doi.org/10.1021/acs.jcim.2c00814, doi:10.1021/acs.jcim.2c00814.
Stefan M. Kohlbacher, Thierry Langer, and Thomas Seidel. Qphar: quantitative pharmacophore activity relationship: method and validation. Journal of Cheminformatics, 13(1):57, Aug 2021. URL: https://doi.org/10.1186/s13321-021-00537-9, doi:10.1186/s13321-021-00537-9.
How to cite
Source code: Thomas Seidel, Chemical Data Processing Toolkit source code repository, https://github.com/molinfo-vienna/CDPKit
Documentation: Thomas Seidel, Oliver Wieder, Chemical Data Processing Toolkit documentation pages, https://cdpkit.org
People
Thomas Seidel (project founder, main developer)
Oliver Wieder (documentation)