1.1.3.3. Database Cleaning

The script clean_mol_db.py reads molecules from an input file, performs (optional) preprocessing, and then writes only those molecules that fulfill particular user-defined criteria to an output file.

Synopsis

python clean_mol_db.py [-h] -i <file> -o <file> [-d <file>] [-s] [-c] [-x <element list>] [-a <element list>] [-m <element count list>] [-M <element count list>] [-v <0|1|2|3>]

Mandatory options

-i <file>

Input molecule file

-o <file>

Output molecule file

Other options

-h, --help

Show help message and exit

-d <file>

Discarded molecule output file

-s

Keep only the largest molecule component (default: false)

-S

Remove atom and bond stereochemistry information (default: false)

-c

Minimize the number of charged atoms (default: false) by protonation/deprotonation and charge equalization

-C <true|false>

Canonicalize output molecule (default: true)

-x <element list>

List of excluded chem. elements (default: no elements are excluded)

-a <element list>

List of allowed chem. elements (default: all elements are allowed)

-m <element count list>

Minimum chem. element specific atom counts (default: no count limits)

-M <element count list>

Maximum chem. element specific atom counts (default: no count limits)

-v <0|1|2|3>

Verbosity level (default: 1; 0 -> no console output, 1 -> print summary, 2 -> verbose, 3 -> extra verbose)

The options -a and -x both require a list of chemical elements as argument. Chemical element lists are specified in the form <S>,…,<S> where <S> is the symbol of a chemical element or generic atom type. Supported generic types are:

Symbol

Meaning

M

any metal

MH

any metal or hydrogen

A

any element except hydrogen

AH

any element

*

any element (equivalent to AH)

X

any halogen

XH

any halogen or hydrogen

Q

any element except hydrogen and carbon

QH

any element except carbon

The options -m and -M both require a list of chemical element counts as argument. Chemical element counts are specified in the form <S>:<N>,…,<S>:<N> where <S> is the symbol of a chemical element or generic atom type (see above) and <N> is the corresponding minimum or maximum count. If the count part is omitted and only <S> gets specified then the count is assumed to be 1.

Example usage

python clean_mol_db.py -i <path/to/molecule/input/file> -o <path/to/molecule/output/file> -a C,H,N,O,S,P,F,Cl,Br,I -m C,A:3 -M F:9 -c -s

When executed as shown, the script will perform the following operations on each read input molecule (in the order listed):

  1. Reduction of the number of charged atoms (if any and if possible)

  2. Removal of all but the largest molecular graph component (only if multi-comp. molecule)

  3. Check whether the chem. element of each atom of the working molecule (= result of prev. steps) is either C, H, N, O, S, P, F, Cl, Br, or I.

  4. Check whether the atom list of the working molecule contains at least one carbon and three heavy atoms

  5. Check whether the atom list of the working molecule contains not more than 9 fluorine atoms

The first check that fails leads to a rejection of the molecule. Working molecules that pass all checks will be written to the specified output file.

Code

  1import sys
  2import argparse
  3
  4import CDPL.Chem as Chem
  5import CDPL.MolProp as MolProp
  6
  7    
  8# performs a (optional) standardization of the argument molecule and then checks whether
  9# it fulfills certain user-defined criteria for forwarding/rejection
 10def processMolecule(mol: Chem.Molecule, args: argparse.Namespace) -> tuple:
 11    chgs_mod = False
 12
 13    if args.min_charges:
 14        # will store the 'uncharged' version of the argument mol
 15        res_mol = Chem.BasicMolecule() 
 16
 17        # minimize the number of charged atoms using the corresponding functionality implemented
 18        # in class Chem.ProtonationStateStandardizer
 19        chgs_mod = Chem.ProtonationStateStandardizer().standardize(mol, res_mol, Chem.ProtonationStateStandardizer.MIN_CHARGED_ATOM_COUNT)
 20
 21        # if changes were made then from now on use the 'uncharged' molecule for further checks
 22        if chgs_mod:  
 23            mol = res_mol
 24        
 25    comps_strpd = False
 26    
 27    if args.strip_comps:
 28        # perceive components (if not done already)
 29        comps = Chem.perceiveComponents(mol, False) 
 30
 31        # find largest component (regarding heavy atom count) but only if multi-comp. molecule
 32        if comps.size > 1:    # check if multi-comp. molecule
 33            lgst_comp = None
 34            lgst_comp_hvy_atom_count = 0
 35
 36            for comp in Chem.getComponents(mol):                 # for each component
 37                hvy_atom_count = MolProp.getHeavyAtomCount(comp) # calc. non-hydrogen atom count
 38
 39                if hvy_atom_count > lgst_comp_hvy_atom_count:    # check if the largest comp. so far has been found
 40                    lgst_comp_hvy_atom_count = hvy_atom_count    # if so, store for later use
 41                    lgst_comp = comp
 42
 43            # if the input molecule has structure data then pass them on (for SDF output)
 44            if Chem.hasStructureData(mol):                       
 45                Chem.setStructureData(lgst_comp, Chem.getStructureData(mol))
 46
 47            # for further checks use only the identified largest component 
 48            comps_strpd = True
 49            mol = lgst_comp
 50
 51    # calc. implicit hydrogen counts (if not done already)
 52    Chem.calcImplicitHydrogenCounts(mol, False)
 53    
 54    # create instance of class MolProp.ElementHistogram for storing the per-element atom counts
 55    # of the molecule (or its largest comp.)
 56    elem_histo = MolProp.ElementHistogram()              
 57
 58    # get per-element atom counts
 59    MolProp.generateElementHistogram(mol, elem_histo, False)    
 60
 61    # check if the found chem. elements are all in the set of allowed elements (if specified)
 62    if not checkAllowedElements(elem_histo, args.allowed_elements):    
 63        return (None, 'prohibited element detected')
 64
 65    # check if none of the found chem. elements is in the set of excluded elements (if specified)
 66    if not checkExcludedElements(elem_histo, args.excluded_elements):
 67        return (None, 'prohibited element detected')
 68
 69    # check if all specified minium chem. element atom counts are reached
 70    if not checkMinAtomCounts(elem_histo, args.min_atom_counts):
 71        return (None, 'minimum atom counts not reached')
 72
 73    # check if none of the specified maximum chem. element atom counts gets exceeded
 74    if not checkMaxAtomCounts(elem_histo, args.max_atom_counts):
 75        return (None, 'maximum atom count exceeded')
 76
 77    log_msg = None
 78    
 79    if chgs_mod:
 80        log_msg = 'reduced charged atom count'
 81
 82    if comps_strpd:
 83        if log_msg:
 84            log_msg += ', removed components'
 85        else:
 86            log_msg = 'removed components'
 87
 88    if args.strip_stereo:
 89        for atom in mol.atoms:  # clear atom stereo descriptors
 90            Chem.clearStereoDescriptor(atom)
 91
 92        for bond in mol.bonds:  # clear bond stereo descriptors
 93            Chem.clearStereoDescriptor(bond) 
 94    
 95    if args.canonicalize:
 96        Chem.calcBasicProperties(mol, False)    # calc. basic properties required for canonicalization
 97        Chem.calcCanonicalNumbering(mol, True)  # calc. canon. numbering of atoms
 98        Chem.canonicalize(mol)                  # order atoms/bonds according to canon. numbering
 99            
100    return (mol, log_msg)
101
102# checks if all found chem. elements are in the set of allowed elements (if specified)
103def checkAllowedElements(elem_histo: MolProp.ElementHistogram, elem_list: list) -> bool:
104    if not elem_list:
105        return True
106
107    for atom_type in elem_histo.getKeys():
108        allowed = False
109        
110        for all_atom_type in elem_list:
111            if Chem.atomTypesMatch(all_atom_type, atom_type):
112                allowed = True
113                break
114
115        if not allowed:
116            return False
117
118    return True
119
120# checks if none of the found chem. elements is in the set of excluded elements (if specified)
121def checkExcludedElements(elem_histo: MolProp.ElementHistogram, elem_list: list) -> bool:
122    if not elem_list:
123        return True
124
125    for atom_type in elem_histo.getKeys():
126        for x_atom_type in elem_list:
127            if Chem.atomTypesMatch(x_atom_type, atom_type):
128                return False
129
130    return True
131
132# return the total number of atoms matching the specified atom type (element or generic type)
133def getNumAtomsOfType(elem_histo: MolProp.ElementHistogram, qry_atom_type: int) -> int:
134    tot_count = 0
135
136    for atom_type, count in elem_histo.getEntries():
137        # check if the elem. histogram entry matches the specified atom type (element or generic type)
138        # if so, add the stored count the total count
139        if Chem.atomTypesMatch(qry_atom_type, atom_type):  
140            tot_count += count
141
142    return tot_count
143
144# checks if all the specified minium counts of atoms matching a particular element or
145# generic type are reached
146def checkMinAtomCounts(elem_histo: MolProp.ElementHistogram, atom_counts: dict) -> bool:
147    if not atom_counts:
148        return True
149    
150    for atom_type, min_count in atom_counts.items():
151        if getNumAtomsOfType(elem_histo, atom_type) < min_count:
152            return False
153
154    return True
155
156# checks if none of the specified maximum counts of atoms matching a particular element or
157# generic type gets exceeded
158def checkMaxAtomCounts(elem_histo: MolProp.ElementHistogram, atom_counts: dict) -> bool:
159    if not atom_counts:
160        return True
161    
162    for atom_type, max_count in atom_counts.items():
163        if getNumAtomsOfType(elem_histo, atom_type) > max_count:
164            return False
165
166    return True
167
168# converts a comma separated list of chem. element/generic type symbols into a list of the
169# corresponding numeric atom types defined in Chem.AtomType
170def parseElementList(elem_list: str) -> list:
171    if not elem_list:
172        return []
173    
174    atom_types = []
175    
176    for elem in elem_list.split(','):
177        elem = elem.strip()
178        
179        if not elem: # zero length?
180            continue
181        
182        atom_type = Chem.AtomDictionary.getType(elem)
183
184        if atom_type == Chem.AtomType.UNKNOWN:
185            sys.exit('Error: unknown chemical element/generic type \'%s\'' % elem)
186
187        atom_types.append(atom_type)
188            
189    return atom_types
190
191# converts a comma separated list of chem. element (or generic type) symbol/count pairs (sep. by colon) into a dictionary
192# mapping the corresponding numeric atom types (defined in Chem.AtomType) to an integer number
193def parseElementCountList(elem_count_list: str) -> dict:
194    if not elem_count_list:
195        return {}
196    
197    atom_type_counts = {}
198    
199    for elem_count in elem_count_list.split(','):
200        elem_count = elem_count.strip()
201        
202        if not elem_count: # zero length?
203            continue
204        
205        elem_count = elem_count.split(':')
206        elem_count[0] = elem_count[0].strip()
207        atom_type = Chem.AtomDictionary.getType(elem_count[0])
208        
209        if atom_type == Chem.AtomType.UNKNOWN:
210            sys.exit('Error: unknown chemical element/generic type \'%s\'' % elem_count[0])
211
212        count = 1
213        
214        if len(elem_count) > 1: # has count spec.?
215            count = int(elem_count[1])
216            
217        atom_type_counts[atom_type] = count
218            
219    return atom_type_counts
220                                
221def parseArgs() -> argparse.Namespace:
222    def strtobool(value: str) -> bool:
223        value = value.lower()
224        
225        if value in ('y', 'yes', 'on', '1', 'true', 't'):
226            return True
227        
228        return False
229
230    parser = argparse.ArgumentParser(description='Strips compounds that fulfill particular user-defined criteria from a molecule database.')
231
232    parser.add_argument('-i',
233                        dest='in_file',
234                        required=True,
235                        metavar='<file>',
236                        help='Input molecule file')
237    parser.add_argument('-o',
238                        dest='out_file',
239                        required=True,
240                        metavar='<file>',
241                        help='Output molecule file')
242    parser.add_argument('-d',
243                        dest='disc_file',
244                        required=False,
245                        metavar='<file>',
246                        help='Discarded molecule output file')
247    parser.add_argument('-s',
248                        dest='strip_comps',
249                        required=False,
250                        action='store_true',
251                        default=False,
252                        help='Keep only the largest molecule component (default: false)')
253    parser.add_argument('-S',
254                        dest='strip_stereo',
255                        required=False,
256                        action='store_true',
257                        default=False,
258                        help='Remove atom and bond stereochemistry information (default: false)')
259    parser.add_argument('-c',
260                        dest='min_charges',
261                        required=False,
262                        action='store_true',
263                        default=False,
264                        help='Minimize number of charged atoms (default: false)')
265    parser.add_argument('-C',
266                        dest='canonicalize',
267                        required=False,
268                        metavar='<true|false>',
269                        type=lambda x:bool(strtobool(x)),
270                        default=True,
271                        help='Canonicalize output molecule (default: true)')
272    parser.add_argument('-x',
273                        dest='excluded_elements',
274                        required=False,
275                        metavar="<element list>",
276                        help='List of excluded chem. elements (default: no elements are excluded)')
277    parser.add_argument('-a',
278                        dest='allowed_elements',
279                        required=False,
280                        metavar="<element list>",
281                        help='List of allowed chem. elements (default: all elements are allowed)')
282    parser.add_argument('-m',
283                        dest='min_atom_counts',
284                        required=False,
285                        metavar="<element count list>",
286                        help='Minimum chem. element specific atom counts (default: no count limits)')
287    parser.add_argument('-M',
288                        dest='max_atom_counts',
289                        required=False,
290                        metavar="<element count list>",
291                        help='Maximum chem. element specific atom counts (default: no count limits)')
292    parser.add_argument('-v',
293                        dest='verb_level',
294                        required=False,
295                        metavar='<0|1|2|3>',
296                        choices=range(0, 4),
297                        default=1,
298                        help='Verbosity level (default: 1; 0 -> no console output, 1 -> print summary, 2 -> verbose, 3 -> extra verbose)',
299                        type=int)
300
301    return parser.parse_args()
302
303def main() -> None:
304    args = parseArgs() # process command line arguments
305
306    # convert specified lists of chem. element/gen. type symbols into a corresponding list of
307    # numeric atom types (defined in Chem.Atomtype)
308    args.allowed_elements  = parseElementList(args.allowed_elements)
309    args.excluded_elements = parseElementList(args.excluded_elements)
310
311    # convert specified comma separated lists of chem. element (or generic type) symbol/count pairs (sep. by colon) into
312    # corresponding dictionaries mapping numeric atom types (defined in Chem.AtomType) to integers
313    args.min_atom_counts = parseElementCountList(args.min_atom_counts)
314    args.max_atom_counts = parseElementCountList(args.max_atom_counts)
315    
316    # create reader for input molecules (format specified by file extension)
317    reader = Chem.MoleculeReader(args.in_file) 
318
319    # create writer for molecules passing the checks (format specified by file extension)
320    writer = Chem.MolecularGraphWriter(args.out_file) 
321
322    # if canonicalization is enabled and the output format is SMILES then generate canonical SMILES directly 
323    if args.canonicalize and (writer.getDataFormat() == Chem.DataFormat.SMILES or \
324                              writer.getDataFormat() == Chem.DataFormat.SMILES_GZ or \
325                              writer.getDataFormat() == Chem.DataFormat.SMILES_BZ2):
326        args.canonicalize = False
327        Chem.setSMILESOutputCanonicalFormParameter(writer, True)
328            
329    if args.disc_file:
330        # create writer for sorted out molecules (format specified by file extension)
331        disc_writer = Chem.MolecularGraphWriter(args.disc_file)
332    else:
333        disc_writer = None
334        
335    # create instances of the default implementation of the Chem.Molecule interface for the input and output molecules
336    in_mol = Chem.BasicMolecule()
337
338    i = 0
339    num_changed = 0
340    num_disc = 0
341    
342    try:
343        # read and process molecules one after the other until the end of input has been reached (or a severe error occurs)
344        while reader.read(in_mol):
345            # compose a molecule identifier
346            mol_id = Chem.getName(in_mol).strip() 
347
348            if mol_id == '':
349                mol_id = '#' + str(i + 1)  # fallback if name is empty or not available
350            else:
351                mol_id = '\'%s\' (#%s)' % (mol_id, str(i + 1))
352            
353            try:
354                # process input molecule
355                out_mol, log_msg = processMolecule(in_mol, args) 
356                
357                if out_mol is None:     # check whether the molecule has been sorted out
358                    if args.verb_level > 1 and log_msg:
359                        print('- Molecule %s: discarded, %s' % (mol_id, log_msg))
360
361                    num_disc += 1
362                        
363                    if not disc_writer: # check whether discarded molecules should be saved to a separate file
364                        i += 1
365                        continue
366
367                    # write discarded molecule to the specified target file
368                    out_mol = in_mol
369                    out_mol_writer = disc_writer
370                else:
371                    if log_msg:
372                        if args.verb_level > 1 :
373                            print('- Molecule %s: %s' % (mol_id, log_msg))
374                            
375                        num_changed += 1
376                    else:
377                        if args.verb_level > 2:
378                            print('- Molecule %s: left unchanged' % mol_id)
379
380                    # molecule passed all checks and needs to be written to the regular output file
381                    out_mol_writer = writer
382                    
383                try:
384                    # calculate (if not already present) some basic properties of the output molecule
385                    # that might be required for writing (output format dependent)
386                    Chem.calcImplicitHydrogenCounts(out_mol, False)
387                    Chem.perceiveHybridizationStates(out_mol, False)
388                    Chem.perceiveSSSR(out_mol, False)
389                    Chem.setRingFlags(out_mol, False)
390                    Chem.setAromaticityFlags(out_mol, False)
391                    Chem.perceiveComponents(out_mol, False)
392                              
393                    # output molecule
394                    if not out_mol_writer.write(out_mol):
395                        sys.exit('Error: writing molecule %s failed' % mol_id)
396
397                except Exception as e: # handle exception raised in case of severe write errors
398                    sys.exit('Error: writing molecule %s failed: %s' % (mol_id, str(e)))
399
400                i += 1
401                
402            except Exception as e: # handle exception raised in case of severe processing errors
403                sys.exit('Error: processing of molecule %s failed: %s' % (mol_id, str(e)))
404                
405    except Exception as e: # handle exception raised in case of severe read errors
406        sys.exit('Error: reading of molecule %s failed: %s' % (str(i), str(e)))
407
408    if args.verb_level > 0:
409        print('Summary:')
410        print(' - %s molecules processed' % str(i))
411        print(' - %s modified ' % str(num_changed))
412        print(' - %s discarded' % str(num_disc))
413        
414    writer.close()
415    sys.exit(0)
416
417if __name__ == '__main__':
418    main()

Download source file