Database Cleaning ================= The script *clean_mol_db.py* reads molecules from an input file, performs (optional) preprocessing, and then writes only those molecules that fulfill particular user-defined criteria to an output file. **Synopsis** :program:`python` *clean_mol_db.py* [-h] -i <file> -o <file> [-d <file>] [-s] [-c] [-x <element list>] [-a <element list>] [-m <element count list>] [-M <element count list>] [-v <0|1|2|3>] **Mandatory options** -i <file> Input molecule file. -o <file> Output molecule file. **Other options** -h, --help Shows help message. -d <file> Discarded molecule output file. -s Keep only the largest molecule component (default: false). -c Minimize the number of charged atoms (default: false) by protonation/deprotonation and charge equalization. -x <element list> List of excluded chem. elements (default: no elements are excluded). -a <element list> List of allowed chem. elements (default: all elements are allowed). -m <element count list> Minimum chem. element specific atom counts (default: no count limits). -M <element count list> Maximum chem. element specific atom counts (default: no count limits). -v <0|1|2|3> Verbosity level (default: 1; 0 -> no console output, 1 -> print summary, 2 -> verbose, 3 -> extra verbose). The options *-a* and *-x* both require a list of chemical elements as argument. Chemical element lists are specified in the form *<S>,...,<S>* where *<S>* is the symbol of a chemical element or generic atom type. Supported generic types are: ====== ======= Symbol Meaning ====== ======= M any metal MH any metal or hydrogen A any element except hydrogen AH any element \* any element (equivalent to AH) X any halogen XH any halogen or hydrogen Q any element except hydrogen and carbon QH any element except carbon ====== ======= The options *-m* and *-M* both require a list of chemical element counts as argument. Chemical element counts are specified in the form *<S>:<N>,...,<S>:<N>* where *<S>* is the symbol of a chemical element or generic atom type (see above) and *<N>* is the corresponding minimum or maximum count. If the count part is omitted and only *<S>* gets specified then the count is assumed to be ``1``. **Example usage** .. code-block:: shell python clean_mol_db.py -i <path/to/molecule/input/file> -o <path/to/molecule/output/file> -a C,H,N,O,S,P,F,Cl,Br,I -m C,A:3 -M F:9 -c -s When executed as shown, the script will perform the following operations on each read input molecule (in the order listed): #. Reduction of the number of charged atoms (if any and if possible) #. Removal of all but the largest molecular graph component (only if multi-comp. molecule) #. Check whether the chem. element of each atom of the working molecule (= result of prev. steps) is either C, H, N, O, S, P, F, Cl, Br, or I. #. Check whether the atom list of the working molecule contains at least one carbon and three heavy atoms #. Check whether the atom list of the working molecule contains not more than 9 fluorine atoms The first check that fails leads to a rejection of the molecule. Working molecules that pass all checks will be written to the specified output file. **Code** .. literalinclude:: /downloads/clean_mol_db.py :language: python :linenos: :lines: 21- :download:`Download source file</downloads/clean_mol_db.py>`