1.1.3.3. Database Cleaning

The script clean_mol_db.py reads molecules from an input file, performs (optional) preprocessing, and then writes only those molecules that fulfill particular user-defined criteria to an output file.

Synopsis

python clean_mol_db.py [-h] -i <file> -o <file> [-d <file>] [-s] [-c] [-x <element list>] [-a <element list>] [-m <element count list>] [-M <element count list>] [-v <0|1|2|3>]

Mandatory options

-i <file>

Input molecule file

-o <file>

Output molecule file

Other options

-h, --help

Show help message and exit

-d <file>

Discarded molecule output file

-s

Keep only the largest molecule component (default: false)

-c

Minimize the number of charged atoms (default: false) by protonation/deprotonation and charge equalization

-x <element list>

List of excluded chem. elements (default: no elements are excluded)

-a <element list>

List of allowed chem. elements (default: all elements are allowed)

-m <element count list>

Minimum chem. element specific atom counts (default: no count limits)

-M <element count list>

Maximum chem. element specific atom counts (default: no count limits)

-v <0|1|2|3>

Verbosity level (default: 1; 0 -> no console output, 1 -> print summary, 2 -> verbose, 3 -> extra verbose)

The options -a and -x both require a list of chemical elements as argument. Chemical element lists are specified in the form <S>,…,<S> where <S> is the symbol of a chemical element or generic atom type. Supported generic types are:

Symbol

Meaning

M

any metal

MH

any metal or hydrogen

A

any element except hydrogen

AH

any element

*

any element (equivalent to AH)

X

any halogen

XH

any halogen or hydrogen

Q

any element except hydrogen and carbon

QH

any element except carbon

The options -m and -M both require a list of chemical element counts as argument. Chemical element counts are specified in the form <S>:<N>,…,<S>:<N> where <S> is the symbol of a chemical element or generic atom type (see above) and <N> is the corresponding minimum or maximum count. If the count part is omitted and only <S> gets specified then the count is assumed to be 1.

Example usage

python clean_mol_db.py -i <path/to/molecule/input/file> -o <path/to/molecule/output/file> -a C,H,N,O,S,P,F,Cl,Br,I -m C,A:3 -M F:9 -c -s

When executed as shown, the script will perform the following operations on each read input molecule (in the order listed):

  1. Reduction of the number of charged atoms (if any and if possible)

  2. Removal of all but the largest molecular graph component (only if multi-comp. molecule)

  3. Check whether the chem. element of each atom of the working molecule (= result of prev. steps) is either C, H, N, O, S, P, F, Cl, Br, or I.

  4. Check whether the atom list of the working molecule contains at least one carbon and three heavy atoms

  5. Check whether the atom list of the working molecule contains not more than 9 fluorine atoms

The first check that fails leads to a rejection of the molecule. Working molecules that pass all checks will be written to the specified output file.

Code

  1import sys
  2import argparse
  3
  4import CDPL.Chem as Chem
  5import CDPL.MolProp as MolProp
  6
  7    
  8# performs a (optional) standardization of the argument molecule and then checks whether
  9# it fulfills certain user-defined criteria for inclusion/exclusion
 10def processMolecule(mol: Chem.Molecule, args: argparse.Namespace) -> tuple:
 11    chgs_mod = False
 12
 13    if args.min_charges:
 14        # will store the 'uncharged' version of the argument mol
 15        res_mol = Chem.BasicMolecule() 
 16
 17        # minimize the number of charged atoms using the corresponding functionality implemented
 18        # in class Chem.ProtonationStateStandardizer
 19        chgs_mod = Chem.ProtonationStateStandardizer().standardize(mol, res_mol, Chem.ProtonationStateStandardizer.MIN_CHARGED_ATOM_COUNT)
 20
 21        # if changes were made then from now on use the 'uncharged' molecule for further checks
 22        if chgs_mod:  
 23            mol = res_mol
 24        
 25    comps_strpd = False
 26    
 27    if args.strip_comps:
 28        # perceive components (if not done already)
 29        comps = Chem.perceiveComponents(mol, False) 
 30
 31        # find largest component (regarding heavy atom count) but only if multi-comp. molecule
 32        if comps.size > 1:    # check if multi-comp. molecule
 33            lgst_comp = None
 34            lgst_comp_hvy_atom_count = 0
 35
 36            for comp in Chem.getComponents(mol):                 # for each component
 37                hvy_atom_count = MolProp.getHeavyAtomCount(comp) # calc. non-hydrogen atom count
 38
 39                if hvy_atom_count > lgst_comp_hvy_atom_count:    # check if the largest comp. so far has been found
 40                    lgst_comp_hvy_atom_count = hvy_atom_count    # if so, store for later use
 41                    lgst_comp = comp
 42
 43            # if the input molecule has structure data then pass them on (for SDF output)
 44            if Chem.hasStructureData(mol):                       
 45                Chem.setStructureData(lgst_comp, Chem.getStructureData(mol))
 46
 47            # for further checks use only the identified largest component 
 48            comps_strpd = True
 49            mol = lgst_comp
 50
 51    # calc. implicit hydrogen counts (if not done already)
 52    Chem.calcImplicitHydrogenCounts(mol, False)
 53    
 54    # create instance of class MolProp.ElementHistogram for storing the per-element atom counts
 55    # of the molecule (or its largest comp.)
 56    elem_histo = MolProp.ElementHistogram()              
 57
 58    # get per-element atom counts
 59    MolProp.generateElementHistogram(mol, elem_histo, False)    
 60
 61    # check if the found chem. elements are all in the set of allowed elements (if specified)
 62    if not checkAllowedElements(elem_histo, args.allowed_elements):    
 63        return (None, 'prohibited element detected')
 64
 65    # check if none of the found chem. elements is in the set of excluded elements (if specified)
 66    if not checkExcludedElements(elem_histo, args.excluded_elements):
 67        return (None, 'prohibited element detected')
 68
 69    # check if all specified minium chem. element atom counts are reached
 70    if not checkMinAtomCounts(elem_histo, args.min_atom_counts):
 71        return (None, 'minimum atom counts not reached')
 72
 73    # check if none of the specified maximum chem. element atom counts gets exceeded
 74    if not checkMaxAtomCounts(elem_histo, args.max_atom_counts):
 75        return (None, 'maximum atom count exceeded')
 76
 77    log_msg = None
 78    
 79    if chgs_mod:
 80        log_msg = 'reduced charged atom count'
 81
 82    if comps_strpd:
 83        if log_msg:
 84            log_msg += ', removed components'
 85        else:
 86            log_msg = 'removed components'
 87            
 88    return (mol, log_msg)
 89
 90# checks if all found chem. elements are in the set of allowed elements (if specified)
 91def checkAllowedElements(elem_histo: MolProp.ElementHistogram, elem_list: list) -> bool:
 92    if not elem_list:
 93        return True
 94
 95    for atom_type in elem_histo.getKeys():
 96        allowed = False
 97        
 98        for all_atom_type in elem_list:
 99            if Chem.atomTypesMatch(all_atom_type, atom_type):
100                allowed = True
101                break
102
103        if not allowed:
104            return False
105
106    return True
107
108# checks if none of the found chem. elements is in the set of excluded elements (if specified)
109def checkExcludedElements(elem_histo: MolProp.ElementHistogram, elem_list: list) -> bool:
110    if not elem_list:
111        return True
112
113    for atom_type in elem_histo.getKeys():
114        for x_atom_type in elem_list:
115            if Chem.atomTypesMatch(x_atom_type, atom_type):
116                return False
117
118    return True
119
120# return the total number of atoms matching the specified atom type (element or generic type)
121def getNumAtomsOfType(elem_histo: MolProp.ElementHistogram, qry_atom_type: int) -> int:
122    tot_count = 0
123
124    for atom_type, count in elem_histo.getEntries():
125        # check if the elem. histogram entry matches the specified atom type (element or generic type)
126        # if so, add the stored count the total count
127        if Chem.atomTypesMatch(qry_atom_type, atom_type):  
128            tot_count += count
129
130    return tot_count
131
132# checks if all the specified minium counts of atoms matching a particular element or
133# generic type are reached
134def checkMinAtomCounts(elem_histo: MolProp.ElementHistogram, atom_counts: dict) -> bool:
135    if not atom_counts:
136        return True
137    
138    for atom_type, min_count in atom_counts.items():
139        if getNumAtomsOfType(elem_histo, atom_type) < min_count:
140            return False
141
142    return True
143
144# checks if none of the specified maximum counts of atoms matching a particular element or
145# generic type gets exceeded
146def checkMaxAtomCounts(elem_histo: MolProp.ElementHistogram, atom_counts: dict) -> bool:
147    if not atom_counts:
148        return True
149    
150    for atom_type, max_count in atom_counts.items():
151        if getNumAtomsOfType(elem_histo, atom_type) > max_count:
152            return False
153
154    return True
155
156# converts a comma separated list of chem. element/generic type symbols into a list of the
157# corresponding numeric atom types defined in Chem.AtomType
158def parseElementList(elem_list: str) -> list:
159    if not elem_list:
160        return []
161    
162    atom_types = []
163    
164    for elem in elem_list.split(','):
165        elem = elem.strip()
166        
167        if not elem: # zero length?
168            continue
169        
170        atom_type = Chem.AtomDictionary.getType(elem)
171
172        if atom_type == Chem.AtomType.UNKNOWN:
173            sys.exit('Error: unknown chemical element/generic type \'%s\'' % elem)
174
175        atom_types.append(atom_type)
176            
177    return atom_types
178
179# converts a comma separated list of chem. element (or generic type) symbol/count pairs (sep. by colon) into a dictionary
180# mapping the corresponding numeric atom types (defined in Chem.AtomType) to an integer number
181def parseElementCountList(elem_count_list: str) -> dict:
182    if not elem_count_list:
183        return {}
184    
185    atom_type_counts = {}
186    
187    for elem_count in elem_count_list.split(','):
188        elem_count = elem_count.strip()
189        
190        if not elem_count: # zero length?
191            continue
192        
193        elem_count = elem_count.split(':')
194        elem_count[0] = elem_count[0].strip()
195        atom_type = Chem.AtomDictionary.getType(elem_count[0])
196        
197        if atom_type == Chem.AtomType.UNKNOWN:
198            sys.exit('Error: unknown chemical element/generic type \'%s\'' % elem_count[0])
199
200        count = 1
201        
202        if len(elem_count) > 1: # has count spec.?
203            count = int(elem_count[1])
204            
205        atom_type_counts[atom_type] = count
206            
207    return atom_type_counts
208                                
209def parseArgs() -> argparse.Namespace:
210    parser = argparse.ArgumentParser(description='Strips compounds that fulfill particular user-defined criteria from a molecule database.')
211
212    parser.add_argument('-i',
213                        dest='in_file',
214                        required=True,
215                        metavar='<file>',
216                        help='Input molecule file')
217    parser.add_argument('-o',
218                        dest='out_file',
219                        required=True,
220                        metavar='<file>',
221                        help='Output molecule file')
222    parser.add_argument('-d',
223                        dest='disc_file',
224                        required=False,
225                        metavar='<file>',
226                        help='Discarded molecule output file')
227    parser.add_argument('-s',
228                        dest='strip_comps',
229                        required=False,
230                        action='store_true',
231                        default=False,
232                        help='Keep only the largest molecule component (default: false)')
233    parser.add_argument('-c',
234                        dest='min_charges',
235                        required=False,
236                        action='store_true',
237                        default=False,
238                        help='Minimize number of charged atoms (default: false)')
239    parser.add_argument('-x',
240                        dest='excluded_elements',
241                        required=False,
242                        metavar="<element list>",
243                        help='List of excluded chem. elements (default: no elements are excluded)')
244    parser.add_argument('-a',
245                        dest='allowed_elements',
246                        required=False,
247                        metavar="<element list>",
248                        help='List of allowed chem. elements (default: all elements are allowed)')
249    parser.add_argument('-m',
250                        dest='min_atom_counts',
251                        required=False,
252                        metavar="<element count list>",
253                        help='Minimum chem. element specific atom counts (default: no count limits)')
254    parser.add_argument('-M',
255                        dest='max_atom_counts',
256                        required=False,
257                        metavar="<element count list>",
258                        help='Maximum chem. element specific atom counts (default: no count limits)')
259    parser.add_argument('-v',
260                        dest='verb_level',
261                        required=False,
262                        metavar='<0|1|2|3>',
263                        choices=range(0, 4),
264                        default=1,
265                        help='Verbosity level (default: 1; 0 -> no console output, 1 -> print summary, 2 -> verbose, 3 -> extra verbose)',
266                        type=int)
267
268    return parser.parse_args()
269
270def main() -> None:
271    args = parseArgs() # process command line arguments
272
273    # convert specified lists of chem. element/gen. type symbols into a corresponding list of
274    # numeric atom types (defined in Chem.Atomtype)
275    args.allowed_elements  = parseElementList(args.allowed_elements)
276    args.excluded_elements = parseElementList(args.excluded_elements)
277
278    # convert specified comma separated lists of chem. element (or generic type) symbol/count pairs (sep. by colon) into
279    # corresponding dictionaries mapping numeric atom types (defined in Chem.AtomType) to integers
280    args.min_atom_counts = parseElementCountList(args.min_atom_counts)
281    args.max_atom_counts = parseElementCountList(args.max_atom_counts)
282    
283    # create reader for input molecules (format specified by file extension)
284    reader = Chem.MoleculeReader(args.in_file) 
285
286    # create writer for molecules passing the checks (format specified by file extension)
287    writer = Chem.MolecularGraphWriter(args.out_file) 
288
289    # write canonical SMILES
290    Chem.setSMILESOutputCanonicalFormParameter(writer, True)
291    
292    if args.disc_file:
293        # create writer for sorted out molecules (format specified by file extension)
294        disc_writer = Chem.MolecularGraphWriter(args.disc_file)
295        # write canonical SMILES
296        Chem.setSMILESOutputCanonicalFormParameter(disc_writer, True)
297    else:
298        disc_writer = None
299        
300    # create instances of the default implementation of the Chem.Molecule interface for the input and output molecules
301    in_mol = Chem.BasicMolecule()
302
303    i = 0
304    num_changed = 0
305    num_disc = 0
306    
307    try:
308        # read and process molecules one after the other until the end of input has been reached (or a severe error occurs)
309        while reader.read(in_mol):
310            # compose a molecule identifier
311            mol_id = Chem.getName(in_mol).strip() 
312
313            if mol_id == '':
314                mol_id = '#' + str(i + 1)  # fallback if name is empty or not available
315            else:
316                mol_id = '\'%s\' (#%s)' % (mol_id, str(i + 1))
317            
318            try:
319                # process input molecule
320                out_mol, log_msg = processMolecule(in_mol, args) 
321                
322                if not out_mol:         # check whether the molecule has been sorted out
323                    if args.verb_level > 1 and log_msg:
324                        print('- Molecule %s: discarded, %s' % (mol_id, log_msg))
325
326                    num_disc += 1
327                        
328                    if not disc_writer: # check whether discarded molecules should be saved to a separate file
329                        i += 1
330                        continue
331
332                    # write discarded molecule to the specified target file
333                    out_mol = in_mol
334                    out_mol_writer = disc_writer
335                else:
336                    if log_msg:
337                        if args.verb_level > 1 :
338                            print('- Molecule %s: %s' % (mol_id, log_msg))
339                            
340                        num_changed += 1
341                    else:
342                        if args.verb_level > 2:
343                            print('- Molecule %s: left unchanged' % mol_id)
344
345                    # molecule passed all checks and needs to be written to the regular output file
346                    out_mol_writer = writer
347                    
348                try:
349                    # calculate (if not already present) some basic properties of the output molecule
350                    # that might be required for writing (output format dependent)
351                    Chem.calcImplicitHydrogenCounts(out_mol, False)
352                    Chem.perceiveHybridizationStates(out_mol, False)
353                    Chem.perceiveSSSR(out_mol, False)
354                    Chem.setRingFlags(out_mol, False)
355                    Chem.setAromaticityFlags(out_mol, False)
356                    Chem.perceiveComponents(out_mol, False)
357                              
358                    # output molecule
359                    if not out_mol_writer.write(out_mol):
360                        sys.exit('Error: writing molecule %s failed' % mol_id)
361
362                except Exception as e: # handle exception raised in case of severe write errors
363                    sys.exit('Error: writing molecule %s failed: %s' % (mol_id, str(e)))
364
365                i += 1
366                
367            except Exception as e: # handle exception raised in case of severe processing errors
368                sys.exit('Error: processing of molecule %s failed: %s' % (mol_id, str(e)))
369                
370    except Exception as e: # handle exception raised in case of severe read errors
371        sys.exit('Error: reading of molecule %s failed: %s' % (str(i), str(e)))
372
373    if args.verb_level > 0:
374        print('Summary:')
375        print(' - %s molecules processed' % str(i))
376        print(' - %s modified ' % str(num_changed))
377        print(' - %s discarded' % str(num_disc))
378        
379    writer.close()
380    sys.exit(0)
381
382if __name__ == '__main__':
383    main()

Download source file