1.1.3.3. Database Cleaning
The script clean_mol_db.py reads molecules from an input file, performs (optional) preprocessing, and then writes only those molecules that fulfill particular user-defined criteria to an output file.
Synopsis
python clean_mol_db.py [-h] -i <file> -o <file> [-d <file>] [-s] [-c] [-x <element list>] [-a <element list>] [-m <element count list>] [-M <element count list>] [-v <0|1|2|3>]
Mandatory options
- -i <file>
Input molecule file
- -o <file>
Output molecule file
Other options
- -h, --help
Show help message and exit
- -d <file>
Discarded molecule output file
- -s
Keep only the largest molecule component (default: false)
- -c
Minimize the number of charged atoms (default: false) by protonation/deprotonation and charge equalization
- -x <element list>
List of excluded chem. elements (default: no elements are excluded)
- -a <element list>
List of allowed chem. elements (default: all elements are allowed)
- -m <element count list>
Minimum chem. element specific atom counts (default: no count limits)
- -M <element count list>
Maximum chem. element specific atom counts (default: no count limits)
- -v <0|1|2|3>
Verbosity level (default: 1; 0 -> no console output, 1 -> print summary, 2 -> verbose, 3 -> extra verbose)
The options -a and -x both require a list of chemical elements as argument. Chemical element lists are specified in the form <S>,…,<S> where <S> is the symbol of a chemical element or generic atom type. Supported generic types are:
Symbol |
Meaning |
---|---|
M |
any metal |
MH |
any metal or hydrogen |
A |
any element except hydrogen |
AH |
any element |
* |
any element (equivalent to AH) |
X |
any halogen |
XH |
any halogen or hydrogen |
Q |
any element except hydrogen and carbon |
QH |
any element except carbon |
The options -m and -M both require a list of chemical element counts as argument.
Chemical element counts are specified in the form <S>:<N>,…,<S>:<N> where <S> is
the symbol of a chemical element or generic atom type (see above) and <N> is
the corresponding minimum or maximum count. If the count part is omitted and only <S>
gets specified then the count is assumed to be 1
.
Example usage
python clean_mol_db.py -i <path/to/molecule/input/file> -o <path/to/molecule/output/file> -a C,H,N,O,S,P,F,Cl,Br,I -m C,A:3 -M F:9 -c -s
When executed as shown, the script will perform the following operations on each read input molecule (in the order listed):
Reduction of the number of charged atoms (if any and if possible)
Removal of all but the largest molecular graph component (only if multi-comp. molecule)
Check whether the chem. element of each atom of the working molecule (= result of prev. steps) is either C, H, N, O, S, P, F, Cl, Br, or I.
Check whether the atom list of the working molecule contains at least one carbon and three heavy atoms
Check whether the atom list of the working molecule contains not more than 9 fluorine atoms
The first check that fails leads to a rejection of the molecule. Working molecules that pass all checks will be written to the specified output file.
Code
1import sys
2import argparse
3
4import CDPL.Chem as Chem
5import CDPL.MolProp as MolProp
6
7
8# performs a (optional) standardization of the argument molecule and then checks whether
9# it fulfills certain user-defined criteria for inclusion/exclusion
10def processMolecule(mol: Chem.Molecule, args: argparse.Namespace) -> tuple:
11 chgs_mod = False
12
13 if args.min_charges:
14 # will store the 'uncharged' version of the argument mol
15 res_mol = Chem.BasicMolecule()
16
17 # minimize the number of charged atoms using the corresponding functionality implemented
18 # in class Chem.ProtonationStateStandardizer
19 chgs_mod = Chem.ProtonationStateStandardizer().standardize(mol, res_mol, Chem.ProtonationStateStandardizer.MIN_CHARGED_ATOM_COUNT)
20
21 # if changes were made then from now on use the 'uncharged' molecule for further checks
22 if chgs_mod:
23 mol = res_mol
24
25 comps_strpd = False
26
27 if args.strip_comps:
28 # perceive components (if not done already)
29 comps = Chem.perceiveComponents(mol, False)
30
31 # find largest component (regarding heavy atom count) but only if multi-comp. molecule
32 if comps.size > 1: # check if multi-comp. molecule
33 lgst_comp = None
34 lgst_comp_hvy_atom_count = 0
35
36 for comp in Chem.getComponents(mol): # for each component
37 hvy_atom_count = MolProp.getHeavyAtomCount(comp) # calc. non-hydrogen atom count
38
39 if hvy_atom_count > lgst_comp_hvy_atom_count: # check if the largest comp. so far has been found
40 lgst_comp_hvy_atom_count = hvy_atom_count # if so, store for later use
41 lgst_comp = comp
42
43 # if the input molecule has structure data then pass them on (for SDF output)
44 if Chem.hasStructureData(mol):
45 Chem.setStructureData(lgst_comp, Chem.getStructureData(mol))
46
47 # for further checks use only the identified largest component
48 comps_strpd = True
49 mol = lgst_comp
50
51 # calc. implicit hydrogen counts (if not done already)
52 Chem.calcImplicitHydrogenCounts(mol, False)
53
54 # create instance of class MolProp.ElementHistogram for storing the per-element atom counts
55 # of the molecule (or its largest comp.)
56 elem_histo = MolProp.ElementHistogram()
57
58 # get per-element atom counts
59 MolProp.generateElementHistogram(mol, elem_histo, False)
60
61 # check if the found chem. elements are all in the set of allowed elements (if specified)
62 if not checkAllowedElements(elem_histo, args.allowed_elements):
63 return (None, 'prohibited element detected')
64
65 # check if none of the found chem. elements is in the set of excluded elements (if specified)
66 if not checkExcludedElements(elem_histo, args.excluded_elements):
67 return (None, 'prohibited element detected')
68
69 # check if all specified minium chem. element atom counts are reached
70 if not checkMinAtomCounts(elem_histo, args.min_atom_counts):
71 return (None, 'minimum atom counts not reached')
72
73 # check if none of the specified maximum chem. element atom counts gets exceeded
74 if not checkMaxAtomCounts(elem_histo, args.max_atom_counts):
75 return (None, 'maximum atom count exceeded')
76
77 log_msg = None
78
79 if chgs_mod:
80 log_msg = 'reduced charged atom count'
81
82 if comps_strpd:
83 if log_msg:
84 log_msg += ', removed components'
85 else:
86 log_msg = 'removed components'
87
88 return (mol, log_msg)
89
90# checks if all found chem. elements are in the set of allowed elements (if specified)
91def checkAllowedElements(elem_histo: MolProp.ElementHistogram, elem_list: list) -> bool:
92 if not elem_list:
93 return True
94
95 for atom_type in elem_histo.getKeys():
96 allowed = False
97
98 for all_atom_type in elem_list:
99 if Chem.atomTypesMatch(all_atom_type, atom_type):
100 allowed = True
101 break
102
103 if not allowed:
104 return False
105
106 return True
107
108# checks if none of the found chem. elements is in the set of excluded elements (if specified)
109def checkExcludedElements(elem_histo: MolProp.ElementHistogram, elem_list: list) -> bool:
110 if not elem_list:
111 return True
112
113 for atom_type in elem_histo.getKeys():
114 for x_atom_type in elem_list:
115 if Chem.atomTypesMatch(x_atom_type, atom_type):
116 return False
117
118 return True
119
120# return the total number of atoms matching the specified atom type (element or generic type)
121def getNumAtomsOfType(elem_histo: MolProp.ElementHistogram, qry_atom_type: int) -> int:
122 tot_count = 0
123
124 for atom_type, count in elem_histo.getEntries():
125 # check if the elem. histogram entry matches the specified atom type (element or generic type)
126 # if so, add the stored count the total count
127 if Chem.atomTypesMatch(qry_atom_type, atom_type):
128 tot_count += count
129
130 return tot_count
131
132# checks if all the specified minium counts of atoms matching a particular element or
133# generic type are reached
134def checkMinAtomCounts(elem_histo: MolProp.ElementHistogram, atom_counts: dict) -> bool:
135 if not atom_counts:
136 return True
137
138 for atom_type, min_count in atom_counts.items():
139 if getNumAtomsOfType(elem_histo, atom_type) < min_count:
140 return False
141
142 return True
143
144# checks if none of the specified maximum counts of atoms matching a particular element or
145# generic type gets exceeded
146def checkMaxAtomCounts(elem_histo: MolProp.ElementHistogram, atom_counts: dict) -> bool:
147 if not atom_counts:
148 return True
149
150 for atom_type, max_count in atom_counts.items():
151 if getNumAtomsOfType(elem_histo, atom_type) > max_count:
152 return False
153
154 return True
155
156# converts a comma separated list of chem. element/generic type symbols into a list of the
157# corresponding numeric atom types defined in Chem.AtomType
158def parseElementList(elem_list: str) -> list:
159 if not elem_list:
160 return []
161
162 atom_types = []
163
164 for elem in elem_list.split(','):
165 elem = elem.strip()
166
167 if not elem: # zero length?
168 continue
169
170 atom_type = Chem.AtomDictionary.getType(elem)
171
172 if atom_type == Chem.AtomType.UNKNOWN:
173 sys.exit('Error: unknown chemical element/generic type \'%s\'' % elem)
174
175 atom_types.append(atom_type)
176
177 return atom_types
178
179# converts a comma separated list of chem. element (or generic type) symbol/count pairs (sep. by colon) into a dictionary
180# mapping the corresponding numeric atom types (defined in Chem.AtomType) to an integer number
181def parseElementCountList(elem_count_list: str) -> dict:
182 if not elem_count_list:
183 return {}
184
185 atom_type_counts = {}
186
187 for elem_count in elem_count_list.split(','):
188 elem_count = elem_count.strip()
189
190 if not elem_count: # zero length?
191 continue
192
193 elem_count = elem_count.split(':')
194 elem_count[0] = elem_count[0].strip()
195 atom_type = Chem.AtomDictionary.getType(elem_count[0])
196
197 if atom_type == Chem.AtomType.UNKNOWN:
198 sys.exit('Error: unknown chemical element/generic type \'%s\'' % elem_count[0])
199
200 count = 1
201
202 if len(elem_count) > 1: # has count spec.?
203 count = int(elem_count[1])
204
205 atom_type_counts[atom_type] = count
206
207 return atom_type_counts
208
209def parseArgs() -> argparse.Namespace:
210 parser = argparse.ArgumentParser(description='Strips compounds that fulfill particular user-defined criteria from a molecule database.')
211
212 parser.add_argument('-i',
213 dest='in_file',
214 required=True,
215 metavar='<file>',
216 help='Input molecule file')
217 parser.add_argument('-o',
218 dest='out_file',
219 required=True,
220 metavar='<file>',
221 help='Output molecule file')
222 parser.add_argument('-d',
223 dest='disc_file',
224 required=False,
225 metavar='<file>',
226 help='Discarded molecule output file')
227 parser.add_argument('-s',
228 dest='strip_comps',
229 required=False,
230 action='store_true',
231 default=False,
232 help='Keep only the largest molecule component (default: false)')
233 parser.add_argument('-c',
234 dest='min_charges',
235 required=False,
236 action='store_true',
237 default=False,
238 help='Minimize number of charged atoms (default: false)')
239 parser.add_argument('-x',
240 dest='excluded_elements',
241 required=False,
242 metavar="<element list>",
243 help='List of excluded chem. elements (default: no elements are excluded)')
244 parser.add_argument('-a',
245 dest='allowed_elements',
246 required=False,
247 metavar="<element list>",
248 help='List of allowed chem. elements (default: all elements are allowed)')
249 parser.add_argument('-m',
250 dest='min_atom_counts',
251 required=False,
252 metavar="<element count list>",
253 help='Minimum chem. element specific atom counts (default: no count limits)')
254 parser.add_argument('-M',
255 dest='max_atom_counts',
256 required=False,
257 metavar="<element count list>",
258 help='Maximum chem. element specific atom counts (default: no count limits)')
259 parser.add_argument('-v',
260 dest='verb_level',
261 required=False,
262 metavar='<0|1|2|3>',
263 choices=range(0, 4),
264 default=1,
265 help='Verbosity level (default: 1; 0 -> no console output, 1 -> print summary, 2 -> verbose, 3 -> extra verbose)',
266 type=int)
267
268 return parser.parse_args()
269
270def main() -> None:
271 args = parseArgs() # process command line arguments
272
273 # convert specified lists of chem. element/gen. type symbols into a corresponding list of
274 # numeric atom types (defined in Chem.Atomtype)
275 args.allowed_elements = parseElementList(args.allowed_elements)
276 args.excluded_elements = parseElementList(args.excluded_elements)
277
278 # convert specified comma separated lists of chem. element (or generic type) symbol/count pairs (sep. by colon) into
279 # corresponding dictionaries mapping numeric atom types (defined in Chem.AtomType) to integers
280 args.min_atom_counts = parseElementCountList(args.min_atom_counts)
281 args.max_atom_counts = parseElementCountList(args.max_atom_counts)
282
283 # create reader for input molecules (format specified by file extension)
284 reader = Chem.MoleculeReader(args.in_file)
285
286 # create writer for molecules passing the checks (format specified by file extension)
287 writer = Chem.MolecularGraphWriter(args.out_file)
288
289 # write canonical SMILES
290 Chem.setSMILESOutputCanonicalFormParameter(writer, True)
291
292 if args.disc_file:
293 # create writer for sorted out molecules (format specified by file extension)
294 disc_writer = Chem.MolecularGraphWriter(args.disc_file)
295 # write canonical SMILES
296 Chem.setSMILESOutputCanonicalFormParameter(disc_writer, True)
297 else:
298 disc_writer = None
299
300 # create instances of the default implementation of the Chem.Molecule interface for the input and output molecules
301 in_mol = Chem.BasicMolecule()
302
303 i = 0
304 num_changed = 0
305 num_disc = 0
306
307 try:
308 # read and process molecules one after the other until the end of input has been reached (or a severe error occurs)
309 while reader.read(in_mol):
310 # compose a molecule identifier
311 mol_id = Chem.getName(in_mol).strip()
312
313 if mol_id == '':
314 mol_id = '#' + str(i + 1) # fallback if name is empty or not available
315 else:
316 mol_id = '\'%s\' (#%s)' % (mol_id, str(i + 1))
317
318 try:
319 # process input molecule
320 out_mol, log_msg = processMolecule(in_mol, args)
321
322 if not out_mol: # check whether the molecule has been sorted out
323 if args.verb_level > 1 and log_msg:
324 print('- Molecule %s: discarded, %s' % (mol_id, log_msg))
325
326 num_disc += 1
327
328 if not disc_writer: # check whether discarded molecules should be saved to a separate file
329 i += 1
330 continue
331
332 # write discarded molecule to the specified target file
333 out_mol = in_mol
334 out_mol_writer = disc_writer
335 else:
336 if log_msg:
337 if args.verb_level > 1 :
338 print('- Molecule %s: %s' % (mol_id, log_msg))
339
340 num_changed += 1
341 else:
342 if args.verb_level > 2:
343 print('- Molecule %s: left unchanged' % mol_id)
344
345 # molecule passed all checks and needs to be written to the regular output file
346 out_mol_writer = writer
347
348 try:
349 # calculate (if not already present) some basic properties of the output molecule
350 # that might be required for writing (output format dependent)
351 Chem.calcImplicitHydrogenCounts(out_mol, False)
352 Chem.perceiveHybridizationStates(out_mol, False)
353 Chem.perceiveSSSR(out_mol, False)
354 Chem.setRingFlags(out_mol, False)
355 Chem.setAromaticityFlags(out_mol, False)
356 Chem.perceiveComponents(out_mol, False)
357
358 # output molecule
359 if not out_mol_writer.write(out_mol):
360 sys.exit('Error: writing molecule %s failed' % mol_id)
361
362 except Exception as e: # handle exception raised in case of severe write errors
363 sys.exit('Error: writing molecule %s failed: %s' % (mol_id, str(e)))
364
365 i += 1
366
367 except Exception as e: # handle exception raised in case of severe processing errors
368 sys.exit('Error: processing of molecule %s failed: %s' % (mol_id, str(e)))
369
370 except Exception as e: # handle exception raised in case of severe read errors
371 sys.exit('Error: reading of molecule %s failed: %s' % (str(i), str(e)))
372
373 if args.verb_level > 0:
374 print('Summary:')
375 print(' - %s molecules processed' % str(i))
376 print(' - %s modified ' % str(num_changed))
377 print(' - %s discarded' % str(num_disc))
378
379 writer.close()
380 sys.exit(0)
381
382if __name__ == '__main__':
383 main()