1.1.3.3. Database Cleaning
The script clean_mol_db.py reads molecules from an input file, performs (optional) preprocessing, and then writes only those molecules that fulfill particular user-defined criteria to an output file.
Synopsis
python clean_mol_db.py [-h] -i <file> -o <file> [-d <file>] [-s] [-c] [-x <element list>] [-a <element list>] [-m <element count list>] [-M <element count list>] [-v <0|1|2|3>]
Mandatory options
- -i <file>
Input molecule file
- -o <file>
Output molecule file
Other options
- -h, --help
Show help message and exit
- -d <file>
Discarded molecule output file
- -s
Keep only the largest molecule component (default: false)
- -S
Remove atom and bond stereochemistry information (default: false)
- -c
Minimize the number of charged atoms (default: false) by protonation/deprotonation and charge equalization
- -C <true|false>
Canonicalize output molecule (default: true)
- -x <element list>
List of excluded chem. elements (default: no elements are excluded)
- -a <element list>
List of allowed chem. elements (default: all elements are allowed)
- -m <element count list>
Minimum chem. element specific atom counts (default: no count limits)
- -M <element count list>
Maximum chem. element specific atom counts (default: no count limits)
- -v <0|1|2|3>
Verbosity level (default: 1; 0 -> no console output, 1 -> print summary, 2 -> verbose, 3 -> extra verbose)
The options -a and -x both require a list of chemical elements as argument. Chemical element lists are specified in the form <S>,…,<S> where <S> is the symbol of a chemical element or generic atom type. Supported generic types are:
Symbol |
Meaning |
---|---|
M |
any metal |
MH |
any metal or hydrogen |
A |
any element except hydrogen |
AH |
any element |
* |
any element (equivalent to AH) |
X |
any halogen |
XH |
any halogen or hydrogen |
Q |
any element except hydrogen and carbon |
QH |
any element except carbon |
The options -m and -M both require a list of chemical element counts as argument.
Chemical element counts are specified in the form <S>:<N>,…,<S>:<N> where <S> is
the symbol of a chemical element or generic atom type (see above) and <N> is
the corresponding minimum or maximum count. If the count part is omitted and only <S>
gets specified then the count is assumed to be 1
.
Example usage
python clean_mol_db.py -i <path/to/molecule/input/file> -o <path/to/molecule/output/file> -a C,H,N,O,S,P,F,Cl,Br,I -m C,A:3 -M F:9 -c -s
When executed as shown, the script will perform the following operations on each read input molecule (in the order listed):
Reduction of the number of charged atoms (if any and if possible)
Removal of all but the largest molecular graph component (only if multi-comp. molecule)
Check whether the chem. element of each atom of the working molecule (= result of prev. steps) is either C, H, N, O, S, P, F, Cl, Br, or I.
Check whether the atom list of the working molecule contains at least one carbon and three heavy atoms
Check whether the atom list of the working molecule contains not more than 9 fluorine atoms
The first check that fails leads to a rejection of the molecule. Working molecules that pass all checks will be written to the specified output file.
Code
1import sys
2import argparse
3
4import CDPL.Chem as Chem
5import CDPL.MolProp as MolProp
6
7
8# performs a (optional) standardization of the argument molecule and then checks whether
9# it fulfills certain user-defined criteria for forwarding/rejection
10def processMolecule(mol: Chem.Molecule, args: argparse.Namespace) -> tuple:
11 chgs_mod = False
12
13 if args.min_charges:
14 # will store the 'uncharged' version of the argument mol
15 res_mol = Chem.BasicMolecule()
16
17 # minimize the number of charged atoms using the corresponding functionality implemented
18 # in class Chem.ProtonationStateStandardizer
19 chgs_mod = Chem.ProtonationStateStandardizer().standardize(mol, res_mol, Chem.ProtonationStateStandardizer.MIN_CHARGED_ATOM_COUNT)
20
21 # if changes were made then from now on use the 'uncharged' molecule for further checks
22 if chgs_mod:
23 mol = res_mol
24
25 comps_strpd = False
26
27 if args.strip_comps:
28 # perceive components (if not done already)
29 comps = Chem.perceiveComponents(mol, False)
30
31 # find largest component (regarding heavy atom count) but only if multi-comp. molecule
32 if comps.size > 1: # check if multi-comp. molecule
33 lgst_comp = None
34 lgst_comp_hvy_atom_count = 0
35
36 for comp in Chem.getComponents(mol): # for each component
37 hvy_atom_count = MolProp.getHeavyAtomCount(comp) # calc. non-hydrogen atom count
38
39 if hvy_atom_count > lgst_comp_hvy_atom_count: # check if the largest comp. so far has been found
40 lgst_comp_hvy_atom_count = hvy_atom_count # if so, store for later use
41 lgst_comp = comp
42
43 # if the input molecule has structure data then pass them on (for SDF output)
44 if Chem.hasStructureData(mol):
45 Chem.setStructureData(lgst_comp, Chem.getStructureData(mol))
46
47 # for further checks use only the identified largest component
48 comps_strpd = True
49 mol = lgst_comp
50
51 # calc. implicit hydrogen counts (if not done already)
52 Chem.calcImplicitHydrogenCounts(mol, False)
53
54 # create instance of class MolProp.ElementHistogram for storing the per-element atom counts
55 # of the molecule (or its largest comp.)
56 elem_histo = MolProp.ElementHistogram()
57
58 # get per-element atom counts
59 MolProp.generateElementHistogram(mol, elem_histo, False)
60
61 # check if the found chem. elements are all in the set of allowed elements (if specified)
62 if not checkAllowedElements(elem_histo, args.allowed_elements):
63 return (None, 'prohibited element detected')
64
65 # check if none of the found chem. elements is in the set of excluded elements (if specified)
66 if not checkExcludedElements(elem_histo, args.excluded_elements):
67 return (None, 'prohibited element detected')
68
69 # check if all specified minium chem. element atom counts are reached
70 if not checkMinAtomCounts(elem_histo, args.min_atom_counts):
71 return (None, 'minimum atom counts not reached')
72
73 # check if none of the specified maximum chem. element atom counts gets exceeded
74 if not checkMaxAtomCounts(elem_histo, args.max_atom_counts):
75 return (None, 'maximum atom count exceeded')
76
77 log_msg = None
78
79 if chgs_mod:
80 log_msg = 'reduced charged atom count'
81
82 if comps_strpd:
83 if log_msg:
84 log_msg += ', removed components'
85 else:
86 log_msg = 'removed components'
87
88 if args.strip_stereo:
89 for atom in mol.atoms: # clear atom stereo descriptors
90 Chem.clearStereoDescriptor(atom)
91
92 for bond in mol.bonds: # clear bond stereo descriptors
93 Chem.clearStereoDescriptor(bond)
94
95 if args.canonicalize:
96 Chem.calcBasicProperties(mol, False) # calc. basic properties required for canonicalization
97 Chem.calcCanonicalNumbering(mol, True) # calc. canon. numbering of atoms
98 Chem.canonicalize(mol) # order atoms/bonds according to canon. numbering
99
100 return (mol, log_msg)
101
102# checks if all found chem. elements are in the set of allowed elements (if specified)
103def checkAllowedElements(elem_histo: MolProp.ElementHistogram, elem_list: list) -> bool:
104 if not elem_list:
105 return True
106
107 for atom_type in elem_histo.getKeys():
108 allowed = False
109
110 for all_atom_type in elem_list:
111 if Chem.atomTypesMatch(all_atom_type, atom_type):
112 allowed = True
113 break
114
115 if not allowed:
116 return False
117
118 return True
119
120# checks if none of the found chem. elements is in the set of excluded elements (if specified)
121def checkExcludedElements(elem_histo: MolProp.ElementHistogram, elem_list: list) -> bool:
122 if not elem_list:
123 return True
124
125 for atom_type in elem_histo.getKeys():
126 for x_atom_type in elem_list:
127 if Chem.atomTypesMatch(x_atom_type, atom_type):
128 return False
129
130 return True
131
132# return the total number of atoms matching the specified atom type (element or generic type)
133def getNumAtomsOfType(elem_histo: MolProp.ElementHistogram, qry_atom_type: int) -> int:
134 tot_count = 0
135
136 for atom_type, count in elem_histo.getEntries():
137 # check if the elem. histogram entry matches the specified atom type (element or generic type)
138 # if so, add the stored count the total count
139 if Chem.atomTypesMatch(qry_atom_type, atom_type):
140 tot_count += count
141
142 return tot_count
143
144# checks if all the specified minium counts of atoms matching a particular element or
145# generic type are reached
146def checkMinAtomCounts(elem_histo: MolProp.ElementHistogram, atom_counts: dict) -> bool:
147 if not atom_counts:
148 return True
149
150 for atom_type, min_count in atom_counts.items():
151 if getNumAtomsOfType(elem_histo, atom_type) < min_count:
152 return False
153
154 return True
155
156# checks if none of the specified maximum counts of atoms matching a particular element or
157# generic type gets exceeded
158def checkMaxAtomCounts(elem_histo: MolProp.ElementHistogram, atom_counts: dict) -> bool:
159 if not atom_counts:
160 return True
161
162 for atom_type, max_count in atom_counts.items():
163 if getNumAtomsOfType(elem_histo, atom_type) > max_count:
164 return False
165
166 return True
167
168# converts a comma separated list of chem. element/generic type symbols into a list of the
169# corresponding numeric atom types defined in Chem.AtomType
170def parseElementList(elem_list: str) -> list:
171 if not elem_list:
172 return []
173
174 atom_types = []
175
176 for elem in elem_list.split(','):
177 elem = elem.strip()
178
179 if not elem: # zero length?
180 continue
181
182 atom_type = Chem.AtomDictionary.getType(elem)
183
184 if atom_type == Chem.AtomType.UNKNOWN:
185 sys.exit('Error: unknown chemical element/generic type \'%s\'' % elem)
186
187 atom_types.append(atom_type)
188
189 return atom_types
190
191# converts a comma separated list of chem. element (or generic type) symbol/count pairs (sep. by colon) into a dictionary
192# mapping the corresponding numeric atom types (defined in Chem.AtomType) to an integer number
193def parseElementCountList(elem_count_list: str) -> dict:
194 if not elem_count_list:
195 return {}
196
197 atom_type_counts = {}
198
199 for elem_count in elem_count_list.split(','):
200 elem_count = elem_count.strip()
201
202 if not elem_count: # zero length?
203 continue
204
205 elem_count = elem_count.split(':')
206 elem_count[0] = elem_count[0].strip()
207 atom_type = Chem.AtomDictionary.getType(elem_count[0])
208
209 if atom_type == Chem.AtomType.UNKNOWN:
210 sys.exit('Error: unknown chemical element/generic type \'%s\'' % elem_count[0])
211
212 count = 1
213
214 if len(elem_count) > 1: # has count spec.?
215 count = int(elem_count[1])
216
217 atom_type_counts[atom_type] = count
218
219 return atom_type_counts
220
221def parseArgs() -> argparse.Namespace:
222 def strtobool(value: str) -> bool:
223 value = value.lower()
224
225 if value in ('y', 'yes', 'on', '1', 'true', 't'):
226 return True
227
228 return False
229
230 parser = argparse.ArgumentParser(description='Strips compounds that fulfill particular user-defined criteria from a molecule database.')
231
232 parser.add_argument('-i',
233 dest='in_file',
234 required=True,
235 metavar='<file>',
236 help='Input molecule file')
237 parser.add_argument('-o',
238 dest='out_file',
239 required=True,
240 metavar='<file>',
241 help='Output molecule file')
242 parser.add_argument('-d',
243 dest='disc_file',
244 required=False,
245 metavar='<file>',
246 help='Discarded molecule output file')
247 parser.add_argument('-s',
248 dest='strip_comps',
249 required=False,
250 action='store_true',
251 default=False,
252 help='Keep only the largest molecule component (default: false)')
253 parser.add_argument('-S',
254 dest='strip_stereo',
255 required=False,
256 action='store_true',
257 default=False,
258 help='Remove atom and bond stereochemistry information (default: false)')
259 parser.add_argument('-c',
260 dest='min_charges',
261 required=False,
262 action='store_true',
263 default=False,
264 help='Minimize number of charged atoms (default: false)')
265 parser.add_argument('-C',
266 dest='canonicalize',
267 required=False,
268 metavar='<true|false>',
269 type=lambda x:bool(strtobool(x)),
270 default=True,
271 help='Canonicalize output molecule (default: true)')
272 parser.add_argument('-x',
273 dest='excluded_elements',
274 required=False,
275 metavar="<element list>",
276 help='List of excluded chem. elements (default: no elements are excluded)')
277 parser.add_argument('-a',
278 dest='allowed_elements',
279 required=False,
280 metavar="<element list>",
281 help='List of allowed chem. elements (default: all elements are allowed)')
282 parser.add_argument('-m',
283 dest='min_atom_counts',
284 required=False,
285 metavar="<element count list>",
286 help='Minimum chem. element specific atom counts (default: no count limits)')
287 parser.add_argument('-M',
288 dest='max_atom_counts',
289 required=False,
290 metavar="<element count list>",
291 help='Maximum chem. element specific atom counts (default: no count limits)')
292 parser.add_argument('-v',
293 dest='verb_level',
294 required=False,
295 metavar='<0|1|2|3>',
296 choices=range(0, 4),
297 default=1,
298 help='Verbosity level (default: 1; 0 -> no console output, 1 -> print summary, 2 -> verbose, 3 -> extra verbose)',
299 type=int)
300
301 return parser.parse_args()
302
303def main() -> None:
304 args = parseArgs() # process command line arguments
305
306 # convert specified lists of chem. element/gen. type symbols into a corresponding list of
307 # numeric atom types (defined in Chem.Atomtype)
308 args.allowed_elements = parseElementList(args.allowed_elements)
309 args.excluded_elements = parseElementList(args.excluded_elements)
310
311 # convert specified comma separated lists of chem. element (or generic type) symbol/count pairs (sep. by colon) into
312 # corresponding dictionaries mapping numeric atom types (defined in Chem.AtomType) to integers
313 args.min_atom_counts = parseElementCountList(args.min_atom_counts)
314 args.max_atom_counts = parseElementCountList(args.max_atom_counts)
315
316 # create reader for input molecules (format specified by file extension)
317 reader = Chem.MoleculeReader(args.in_file)
318
319 # create writer for molecules passing the checks (format specified by file extension)
320 writer = Chem.MolecularGraphWriter(args.out_file)
321
322 # if canonicalization is enabled and the output format is SMILES then generate canonical SMILES directly
323 if args.canonicalize and (writer.getDataFormat() == Chem.DataFormat.SMILES or \
324 writer.getDataFormat() == Chem.DataFormat.SMILES_GZ or \
325 writer.getDataFormat() == Chem.DataFormat.SMILES_BZ2):
326 args.canonicalize = False
327 Chem.setSMILESOutputCanonicalFormParameter(writer, True)
328
329 if args.disc_file:
330 # create writer for sorted out molecules (format specified by file extension)
331 disc_writer = Chem.MolecularGraphWriter(args.disc_file)
332 else:
333 disc_writer = None
334
335 # create instances of the default implementation of the Chem.Molecule interface for the input and output molecules
336 in_mol = Chem.BasicMolecule()
337
338 i = 0
339 num_changed = 0
340 num_disc = 0
341
342 try:
343 # read and process molecules one after the other until the end of input has been reached (or a severe error occurs)
344 while reader.read(in_mol):
345 # compose a molecule identifier
346 mol_id = Chem.getName(in_mol).strip()
347
348 if mol_id == '':
349 mol_id = '#' + str(i + 1) # fallback if name is empty or not available
350 else:
351 mol_id = '\'%s\' (#%s)' % (mol_id, str(i + 1))
352
353 try:
354 # process input molecule
355 out_mol, log_msg = processMolecule(in_mol, args)
356
357 if out_mol is None: # check whether the molecule has been sorted out
358 if args.verb_level > 1 and log_msg:
359 print('- Molecule %s: discarded, %s' % (mol_id, log_msg))
360
361 num_disc += 1
362
363 if not disc_writer: # check whether discarded molecules should be saved to a separate file
364 i += 1
365 continue
366
367 # write discarded molecule to the specified target file
368 out_mol = in_mol
369 out_mol_writer = disc_writer
370 else:
371 if log_msg:
372 if args.verb_level > 1 :
373 print('- Molecule %s: %s' % (mol_id, log_msg))
374
375 num_changed += 1
376 else:
377 if args.verb_level > 2:
378 print('- Molecule %s: left unchanged' % mol_id)
379
380 # molecule passed all checks and needs to be written to the regular output file
381 out_mol_writer = writer
382
383 try:
384 # calculate (if not already present) some basic properties of the output molecule
385 # that might be required for writing (output format dependent)
386 Chem.calcImplicitHydrogenCounts(out_mol, False)
387 Chem.perceiveHybridizationStates(out_mol, False)
388 Chem.perceiveSSSR(out_mol, False)
389 Chem.setRingFlags(out_mol, False)
390 Chem.setAromaticityFlags(out_mol, False)
391 Chem.perceiveComponents(out_mol, False)
392
393 # output molecule
394 if not out_mol_writer.write(out_mol):
395 sys.exit('Error: writing molecule %s failed' % mol_id)
396
397 except Exception as e: # handle exception raised in case of severe write errors
398 sys.exit('Error: writing molecule %s failed: %s' % (mol_id, str(e)))
399
400 i += 1
401
402 except Exception as e: # handle exception raised in case of severe processing errors
403 sys.exit('Error: processing of molecule %s failed: %s' % (mol_id, str(e)))
404
405 except Exception as e: # handle exception raised in case of severe read errors
406 sys.exit('Error: reading of molecule %s failed: %s' % (str(i), str(e)))
407
408 if args.verb_level > 0:
409 print('Summary:')
410 print(' - %s molecules processed' % str(i))
411 print(' - %s modified ' % str(num_changed))
412 print(' - %s discarded' % str(num_disc))
413
414 writer.close()
415 sys.exit(0)
416
417if __name__ == '__main__':
418 main()