1.1.4.2. Editing of Substructures

The script edit_mols.py performs modifications on the molecular graph of molecules read from a specified file according to a provided set of SMARTS/SMILES pattern-based substructure editing rules and writes the results to a given output file.

Synopsis

python edit_mols.py [-h] -i <file> -o <file> -p <file/string> [-m] [-d] [-c] [-q]

Mandatory options

-i <file>

Molecule input file

-o <file>

Edited molecule output file

-p <file/string>

A string specifying search, (optional) exclude and result patterns or path to a file providing these (one set per line)

Other options

-h, --help

Show help message and exit

-m

Output input molecule before the resulting edited molecule (default: false)

-d

Remove ordinary explicit hydrogens (default: false)

-c

Saturate free valences with explicit hydrogens (default: false)

-q

Disable progress output (default: false)

A molecular graph editing operation is specified by a string consisting of one or more SMARTS patterns describing the substructures to edit, optional substructure exclude patterns, and a SMILES string encoding the atom and bond modifications to perform. This string can be passed directly on the command line as the value of option -p. Alternatively, the value may be the path to a file that stores one set of patterns per line. Such a file allows multiple distinct editing operations to be specified, which are then applied in turn to each input molecule.
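As a rough illustration of the file-based variant, the following sketch writes a small rules file and lists the lines that would be treated as editing rules. The file name edit_rules.txt and the helper functions are hypothetical; the two rules are taken from the examples further below, and the '#' comment convention follows the script code listed at the end of this section.

```python
import pathlib
import tempfile

# Two of the editing rules documented in this section, one rule per line.
# Lines starting with '#' are treated as comments by edit_mols.py.
RULES_TEXT = (
    "# nitro group standardization\n"
    "1 [#6][N:1](~[O:2])~[O:3] 1 [#6][N+](=[O+0])-[O-] [~+:1](-[~-:2])=[~+0:3]\n"
    "# amide cleavage\n"
    "1 [#6]-[C:1](=O)-[N:2] 0 [C:1](x[N:2])-O\n"
)


def write_rules_file(path: pathlib.Path) -> pathlib.Path:
    """Write the example rules to 'path' (hypothetical helper)."""
    path.write_text(RULES_TEXT)
    return path


def active_rules(path: pathlib.Path) -> list:
    """Return the rule lines of the file with comment lines dropped."""
    with open(path, 'r') as rules_file:
        return [line.strip() for line in rules_file if not line.startswith('#')]


# write the file to a temporary directory and show the effective rules
with tempfile.TemporaryDirectory() as tmp_dir:
    rules_path = write_rules_file(pathlib.Path(tmp_dir) / 'edit_rules.txt')

    for rule in active_rules(rules_path):
        print(rule)
```

A file like this would then be passed in place of the pattern string, for example as python edit_mols.py -i <file> -o <file> -p edit_rules.txt.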

The sequence of whitespace-separated SMARTS/SMILES patterns must be formatted as follows (for examples see below):

<#Search Patterns> <SMARTS Pattern> … <#Exclude Patterns> [<SMARTS Pattern> …] <Editing Result SMILES>

In the substructure search pattern(s), any atoms to be edited and any atoms connected by bonds to be modified must be labeled with a unique non-zero integer (written as a colon followed by the integer at the end of the SMARTS atom specification). These numeric ids establish an unambiguous mapping between the atoms/bonds of the search pattern and the SMILES string encoding the editing instructions. For the purpose of substructure editing, the SMILES format has been extended by additional atom type and bond order symbols that either mark atoms/bonds for deletion or act as a ‘do not change’ marker for the atom type or bond order.
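The whitespace-separated pattern sequence described above can be split into its parts with a few lines of plain Python. The sketch below is not part of edit_mols.py; EditSpec and parse_edit_spec are hypothetical names, and the function only checks the token layout, not whether the patterns are valid SMARTS/SMILES.

```python
from typing import List, NamedTuple


class EditSpec(NamedTuple):
    """Hypothetical container for one parsed editing specification."""
    search: List[str]    # SMARTS search patterns
    exclude: List[str]   # SMARTS exclude patterns (may be empty)
    result: str          # editing result SMILES


def parse_edit_spec(spec: str) -> EditSpec:
    """Split '<#Search> <SMARTS> ... <#Exclude> [<SMARTS> ...] <SMILES>' into its parts."""
    tokens = spec.split()
    pos = 0

    def read_counted(kind: str) -> List[str]:
        # reads a pattern count followed by that many pattern tokens
        nonlocal pos

        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise ValueError(f'missing number of {kind} patterns')

        count = int(tokens[pos])
        patterns = tokens[pos + 1:pos + 1 + count]

        if len(patterns) != count:
            raise ValueError(f'expected {count} {kind} pattern(s)')

        pos += 1 + count
        return patterns

    search = read_counted('search')

    if not search:
        raise ValueError('at least one search pattern is required')

    exclude = read_counted('exclude')

    # exactly one token, the result SMILES, must remain
    if len(tokens) - pos != 1:
        raise ValueError('expected exactly one result SMILES at the end')

    return EditSpec(search, exclude, tokens[pos])


# usage with the nitro group standardization rule from this section
print(parse_edit_spec('1 [#6][N:1](~[O:2])~[O:3] 1 [#6][N+](=[O+0])-[O-] [~+:1](-[~-:2])=[~+0:3]'))
```

Checking the layout up front gives clearer error messages than an index error raised halfway through processing a malformed rule.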

Editing result SMILES strings have to be composed according to the following rules:

Atoms of the molecule matching labeled search pattern atoms are referenced by their numeric id (likewise specified by a colon followed by the integer at the end of the SMILES atom specification)
A bond between two labeled atoms in the result SMILES string is mapped to the bond of the molecule that matched the corresponding bond of the search pattern
If no such bond exists in the molecule, it will be created with the specified bond order
A molecule bond connecting two atoms that match labeled search pattern atoms but which does not occur in the result SMILES string will be left unchanged
A molecule atom matching a labeled search pattern atom that does not occur in the result SMILES string will be left unchanged
A labeled atom in the result SMILES string with a numeric id that does not occur in the search pattern will be created with the specified properties (symbol, formal charge, isotope, chirality, …)
Any unlabeled atoms in the result SMILES string will be created with the specified properties (symbol, formal charge, isotope, chirality, …)
Bonds to/between unlabeled result SMILES string atoms will be created with the specified bond order
For a mapped molecule atom, only those properties (symbol, formal charge, isotope, chirality, …) that were specified for the corresponding result SMILES string atom will be modified
The special result SMILES string atom type symbol x (only valid in brackets) results in the removal of the mapped molecule atom including any incident bonds
The special result SMILES string atom type symbol ~ (only valid in brackets) indicates that the type of the mapped molecule atom shall be left unchanged
The special result SMILES string bond order symbol x results in the removal of the mapped molecule bond
The special result SMILES string bond order symbol ~ indicates that the order of the mapped molecule bond shall be left unchanged
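To see how the numeric ids of a rule relate to each other, the labels can be extracted with a regular expression and sorted into the three cases the rules above distinguish. This is a heuristic sketch in plain Python, not the mapping logic of the editor itself: atom_labels and classify_labels are hypothetical helpers, and the regular expression assumes that a label is always written as ':<integer>' directly before a closing bracket (labels inside recursive SMARTS would also be picked up).

```python
import re

# heuristic: an atom label is ':<digits>' directly before the closing bracket
LABEL_RE = re.compile(r':(\d+)\]')


def atom_labels(pattern: str) -> list:
    """Return all numeric atom labels found in a SMARTS/SMILES pattern."""
    return [int(label) for label in LABEL_RE.findall(pattern)]


def classify_labels(search_patterns: list, result_smiles: str) -> dict:
    """Sort atom labels by how the rule refers to them."""
    search_ids = set()

    for ptn in search_patterns:
        ids = atom_labels(ptn)

        # labels must be non-zero and unique within a search pattern
        if 0 in ids or len(ids) != len(set(ids)):
            raise ValueError(f'invalid atom labels in search pattern {ptn}')

        search_ids.update(ids)

    result_ids = set(atom_labels(result_smiles))

    return {
        'referenced': sorted(result_ids & search_ids),  # mapped atoms named in the result
        'created': sorted(result_ids - search_ids),     # labeled atoms that will be created
        'unchanged': sorted(search_ids - result_ids),   # matched atoms left unchanged
    }


# usage with the amide cleavage rule from this section
print(classify_labels(['[#6]-[C:1](=O)-[N:2]'], '[C:1](x[N:2])-O'))
```

Unlabeled result atoms, such as the trailing O in the amide cleavage rule, carry no id and are therefore not reported; according to the rules above they are always created.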

Substructure editing examples

Nitro group standardization
Search pattern: [#6][N:1](~[O:2])~[O:3]

Exclude pattern: [#6][N+](=[O+0])-[O-]

Result pattern: [~+:1](-[~-:2])=[~+0:3]

Command line example:
$ python edit_mols.py -i wrong_trinitro_benz.sdf -o corr_trinitro_benz.smi -p '1 [#6][N:1](~[O:2])~[O:3] 1 [#6][N+](=[O+0])-[O-] [~+:1](-[~-:2])=[~+0:3]' -d
Input file: wrong_trinitro_benz.sdf
Output file: corr_trinitro_benz.smi

Oxidation of primary alcohols to carboxylic acids
Search pattern: [CD2^3:1]-[OD1]

Exclude pattern: -

Result pattern: [~:1]=O

Command line example:
$ python edit_mols.py -i 3ccz_A_5HI.sdf -o 3ccz_A_5HI_ox.sdf -p '1 [CD2^3:1]-[OD1] 0 [~:1]=O' -d
Input file: 3ccz_A_5HI.sdf
Output file: 3ccz_A_5HI_ox.sdf

Amide cleavage
Search pattern: [#6]-[C:1](=O)-[N:2]

Exclude pattern: -

Result pattern: [C:1](x[N:2])-O

Command line example:
$ python edit_mols.py -i cyclosporine.smi -o cyclosporine_cleaved.smi -p '1 [#6]-[C:1](=O)-[N:2] 0 [C:1](x[N:2])-O' -d
Input file: cyclosporine.smi
Output file: cyclosporine_cleaved.smi

Code

import sys
import argparse
import pathlib

import CDPL.Chem as Chem


# exhaustively edits matching substructures of the argument molecule according to the 
# specified editing instructions using the provided list of initialized
# Chem.SubstructureEditor instances
def editMolecule(mol: Chem.Molecule, ed_list: list, args: argparse.Namespace) -> int:
    # calculate several required properties
    Chem.initSubstructureSearchTarget(mol, False)

    h_changes = False
    
    if args.rem_h:   # remove ordinary (with standard form. charge, isotope, connectivity) hydrogens, if desired
        h_changes = Chem.makeOrdinaryHydrogenDeplete(mol, Chem.AtomPropertyFlag.ISOTOPE | Chem.AtomPropertyFlag.FORMAL_CHARGE | Chem.AtomPropertyFlag.EXPLICIT_BOND_COUNT, True)
    elif args.add_h: # make hydrogen complete, if desired
        h_changes = Chem.makeHydrogenComplete(mol)

    if h_changes:      # if expl. hydrogen count has changed -> recompute/invalidate dependent properties
        Chem.clearComponents(mol)
        Chem.calcAtomStereoDescriptors(mol, True, 0, False)
        Chem.calcBondStereoDescriptors(mol, True, 0, False)
 
    num_edits = 0

    # perform the editing work via the provided Chem.SubstructureEditor instances
    for editor in ed_list:
        num_edits += editor.edit(mol)

    # if structural changes were made, clear 2D and 3D atom coordinates since they have become invalid
    if num_edits > 0 or h_changes:
        Chem.setMDLDimensionality(mol, 0)        # for output in one of the MDL formats indicate that there are no atom coordinates present

        for atom in mol.atoms:
            Chem.clear2DCoordinates(atom)        # delete 2D coordinates
            Chem.clear3DCoordinates(atom)        # delete 3D coordinates
            Chem.clear3DCoordinatesArray(atom)   # delete conformer ensemble coordinates 

    return num_edits

# creates and initializes a Chem.SubstructureEditor instance as specified
# by the given string of substructure search, exclude (optional) and editing result patterns
# in the format <#Search Patterns> <Search Pattern SMARTS> ...  <#Exclude Patterns> [<Exclude Pattern SMARTS> ...] <Result Pattern SMILES>
def createSubstructureEditor(ed_ptns: str) -> Chem.SubstructureEditor:
    editor = Chem.SubstructureEditor()
    tokens = ed_ptns.split()
    i = 1

    for j in range(int(tokens[0])):
        editor.addSearchPattern(Chem.parseSMARTS(tokens[i]))
        i += 1
        
    for j in range(int(tokens[i])):
        i += 1
        editor.addExcludePattern(Chem.parseSMARTS(tokens[i]))

    editor.setResultPattern(Chem.parseSMILES(tokens[i + 1]))

    return editor

# processes the value of the argument -p which is either the path to a file containing multiple lines
# of substructure editing specifications (one per line) or a string directly providing a complete set
# of search, exclude (optional) and result patterns (format is outlined above)
def createSubstructureEditors(ed_ptns: str) -> list:
    if pathlib.Path(ed_ptns).is_file():  # if the argument value is a path to an existing file process it line by line
        editors = []

        with open(ed_ptns, 'r') as ed_ptns_file:
            for line in ed_ptns_file:
                line = line.strip()

                if not line or line.startswith('#'): # skip empty lines and comment lines starting with '#'
                    continue

                editors.append(createSubstructureEditor(line))

        return editors

    # at this point the argument value directly specifies a set of search, exclude (optional) and result patterns
    return [ createSubstructureEditor(ed_ptns) ] 

def parseArgs() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description='Performs modifications on the molecular graph of molecules read from a \
    specified file according to a provided set of SMARTS/SMILES pattern-based substructure editing rules.')

    parser.add_argument('-i',
                        dest='in_file',
                        required=True,
                        metavar='<file>',
                        help='Molecule input file')
    parser.add_argument('-o',
                        dest='out_file',
                        required=True,
                        metavar='<file>',
                        help='Edited molecule output file')
    parser.add_argument('-p',
                        dest='patterns',
                        required=True,
                        metavar='<file/string>',
                        help='A string specifying search, (optional) exclude and result patterns or path to a file providing these (one set per line)')
    parser.add_argument('-m',
                        dest='output_mol',
                        required=False,
                        action='store_true',
                        default=False,
                        help='Output input molecule before the resulting edited molecule (default: false)')
    parser.add_argument('-d',
                        dest='rem_h',
                        required=False,
                        action='store_true',
                        default=False,
                        help='Remove ordinary explicit hydrogens (default: false)')
    parser.add_argument('-c',
                        dest='add_h',
                        required=False,
                        action='store_true',
                        default=False,
                        help='Saturate free valences with explicit hydrogens (default: false)')
    parser.add_argument('-q',
                        dest='quiet',
                        required=False,
                        action='store_true',
                        default=False,
                        help='Disable progress output (default: false)')
      
    return parser.parse_args()

def main() -> None:
    args = parseArgs()

    # create reader for input molecules (format specified by file extension)
    reader = Chem.MoleculeReader(args.in_file) 

    # create writer for the edited molecules (format specified by file extension)
    writer = Chem.MolecularGraphWriter(args.out_file) 

    # create the list of one or more initialized Chem.SubstructureEditor instances doing the editing work
    ed_list = createSubstructureEditors(args.patterns)
    
    # create an instance of the default implementation of the Chem.Molecule interface
    mol = Chem.BasicMolecule()
    i = 1

    # read and process molecules one after the other until the end of input has been reached
    try:
        while reader.read(mol):
            # compose a simple molecule identifier
            mol_id = Chem.getName(mol).strip() 

            if mol_id == '':
                mol_id = '#' + str(i) # fallback if name is empty
            else:
                mol_id = f'\'{mol_id}\' (#{i})'

            try:
                # output original molecule before the editing result
                if args.output_mol:
                    Chem.calcBasicProperties(mol, False)

                    if not writer.write(mol):
                        sys.exit(f'Error: writing molecule {mol_id} failed')

                # modify the input molecule according to the specified editing rules 
                num_changes = editMolecule(mol, ed_list, args)
                
                if not args.quiet:
                    print(f'- Editing molecule {mol_id}: {num_changes} edit(s)')

                Chem.calcBasicProperties(mol, False)
                    
                # output the edited molecule
                if not writer.write(mol):   
                    sys.exit(f'Error: writing edited molecule {mol_id} failed')
                        
            except Exception as e:
                sys.exit(f'Error: editing or output of molecule {mol_id} failed: {str(e)}')

            i += 1

    except Exception as e: # handle exception raised in case of severe read errors
        sys.exit(f'Error: reading molecule failed: {str(e)}')

    writer.close()
    sys.exit(0)

if __name__ == '__main__':
    main()
