2. Signatures and duplicates selection

    1. 2019, 2020 Dr. Ramil Nugmanov;

    1. 2019 Dr. Timur Madzhidov; Ravil Mukhametgaleev

    1. 2022 Valentina Afonina

For CGRtools installation instructions and the tutorial files, see https://github.com/cimm-kzn/CGRtools

NOTE: The tutorial should be run sequentially from the start. Running cells out of order will lead to unexpected results.

import pkg_resources
if pkg_resources.get_distribution('CGRtools').version.split('.')[:2] != ['4', '1']:
    print('WARNING. Tutorial was tested on 4.1 version of CGRtools')
# load data for tutorial
from pickle import load
from traceback import format_exc

with open('molecules.dat', 'rb') as f:
    molecules = load(f) # list of MoleculeContainer objects
with open('reactions.dat', 'rb') as f:
    reactions = load(f) # list of ReactionContainer objects

m1, m2, m3 = molecules[:3] # molecule
m7 = m3.copy()
m11 = m3.copy()
r1 = reactions[0] # reaction
cgr2 = ~r1
benzene = m3.substructure([4,5,6,7,8,9])
m3.delete_bond(4, 5)

2.1. Molecule Signatures

MoleculeContainer provides methods for generating a unique molecule signature. The signature is a SMILES string with canonical atom ordering.

To generate the signature, call the str function on a MoleculeContainer object.
A fixed-length hash of the signature can be retrieved by calling the bytes function on the molecule (it corresponds to the SHA-512 hash of the signature).
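As a rough illustration of what "fixed-length hash" means here, the sketch below hashes a signature string with SHA-512 from the Python standard library. It assumes the hash is taken over the UTF-8 bytes of the signature, which is an assumption about CGRtools internals; the signature itself is the one printed for m2 in this tutorial.

```python
from hashlib import sha512

signature = 'OC(=O)C([O-])=O.[Na+]'  # signature of m2 as printed in this tutorial

# assumption: the hash is computed over the UTF-8 encoded signature string
digest = sha512(signature.encode()).digest()

print(len(digest))  # SHA-512 digests are always 64 bytes (512 bits) long
```

Whatever the length of the signature, the digest has the same size, which makes it convenient as a compact key.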

The order of atoms is calculated by a Morgan-like algorithm. In the initial state, an integer code is calculated for each atom based on its type. All bonds incident to the atom are also coded as integers and stored in a sorted tuple. The atom code and the tuple of its bonds are used for ordering and for detecting similar atoms. The rank of each ordered atom is then replaced with a new integer code, so atoms of the same type with the same types of incident bonds get equal numbers.

These integer codes are then used in the Morgan algorithm cycle. The loop is repeated until all atoms are unique or the number of unique atoms has not changed in 3 subsequent iterations.
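The procedure described above can be sketched in plain Python. This is a simplified illustration, not the CGRtools implementation: the function name, the data layout (dicts of atoms and bonds) and the way neighbor codes are combined are all assumptions made for the example.

```python
def morgan_like_ranks(atoms, bonds, max_stable=3):
    """atoms: {number: element symbol}; bonds: {(n, m): bond order}.
    Returns {number: integer rank}; equivalent atoms share a rank.
    Hypothetical helper, not part of CGRtools."""
    # build an adjacency map: atom -> {neighbor: bond order}
    adj = {n: {} for n in atoms}
    for (n, m), order in bonds.items():
        adj[n][m] = order
        adj[m][n] = order

    def to_ranks(keys):
        # replace sortable keys with dense integer ranks; equal keys get equal ranks
        order = {k: i for i, k in enumerate(sorted(set(keys.values())), start=1)}
        return {n: order[k] for n, k in keys.items()}

    # initial state: atom type plus sorted tuple of incident bond orders
    ranks = to_ranks({n: (atoms[n], tuple(sorted(adj[n].values()))) for n in atoms})

    stable = 0  # number of subsequent loops without new unique atoms
    while len(set(ranks.values())) < len(atoms) and stable < max_stable:
        # refine each atom by its own rank and the ranks of its neighbors
        new = to_ranks({n: (ranks[n],
                            tuple(sorted((ranks[m], b) for m, b in adj[n].items())))
                        for n in atoms})
        unchanged = len(set(new.values())) == len(set(ranks.values()))
        stable = stable + 1 if unchanged else 0
        ranks = new
    return ranks


# pentane carbon chain: C1-C2-C3-C4-C5
pentane = morgan_like_ranks({i: 'C' for i in range(1, 6)},
                            {(1, 2): 1, (2, 3): 1, (3, 4): 1, (4, 5): 1})
print(pentane)  # {1: 1, 2: 2, 3: 3, 4: 2, 5: 1}
```

In the pentane example the initial codes only separate terminal carbons from inner ones; the first loop additionally distinguishes the central carbon, and after that the number of unique atoms stops changing, so the loop ends with the symmetric pairs (1, 5) and (2, 4) sharing ranks.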

ms2 = str(m2)  # get signature
print(ms2)  # print signature
# or
print(m2)

hms2 = bytes(m2)  # get sha512 hash of signature as bytes-string

String formatting is supported, which is useful for reporting.

print(f'f string {m2}')  # use signature in string formatting
print('C-style string %s' % m2)
print('format method {}'.format(m2))
f string OC(=O)C([O-])=O.[Na+]
C-style string OC(=O)C([O-])=O.[Na+]
format method OC(=O)C([O-])=O.[Na+]

For queries, marks for the number of neighbors, number of hydrogens, ring membership, and hybridization are added to the signature. Note that in this case signatures are not readable as SMILES, but this data can be hidden.

mq = m2.substructure(m2, as_query=True)
print(f'{mq}')  # get signatures with neighbors, hydrogens, rings, and hybridization data
print('{:!n}'.format(mq))  # get signature without neighbors marks
print('{:!h}'.format(mq))  # get signature without hybridization marks
print('{:!H}'.format(mq))  # get signature without hydrogens marks
print('{:!R}'.format(mq))  # get signature without rings marks
print(format(mq, '!h!H'))  # get signature without hybridization and hydrogens marks
print(f'{mq:!n!h!H!R}')  # hide all data
Atoms in the QueryContainer are represented in the following way:
[isotope;element_symbol;stereo state;D;H;r;Z;radical state;charge].
D - number of neighbors,
H - number of hydrogens,
r - the atom is part of a ring; the mark includes the number of atoms in the ring,
Z - hybridization.
The notation for hybridization is the following:
s - all bonds of the atom are single
d - the atom has one double bond and the others are single
t - the atom has one triple or two double bonds and the others are single
a - the atom is in an aromatic ring
Examples:
[C;D2;r6;Za] - a carbon atom with 2 neighbors that is part of a six-membered aromatic ring.
[14N;D1;H0;Zs] - a nitrogen atom (isotope 14) with one neighbor and no attached hydrogens; all of its bonds are single (s hybridization).
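To make the layout and the hiding flags more concrete, here is a minimal sketch that assembles a query atom string from its marks. The function and its parameters are hypothetical and are not part of CGRtools; stereo, radical and charge fields are left out, and the isotope is written directly before the symbol as in the [14N;D1;H0;Zs] example.

```python
def format_query_atom(symbol, neighbors=None, hydrogens=None, ring=None,
                      hybridization=None, isotope=None, spec=''):
    """Build a query atom string such as [C;D2;r6;Za].
    Hypothetical helper that mimics the notation described above."""
    parts = [f'{isotope}{symbol}' if isotope else symbol]
    # each flag in spec hides one kind of mark (the check is case-sensitive)
    if neighbors is not None and '!n' not in spec:
        parts.append(f'D{neighbors}')
    if hydrogens is not None and '!H' not in spec:
        parts.append(f'H{hydrogens}')
    if ring is not None and '!R' not in spec:
        parts.append(f'r{ring}')
    if hybridization is not None and '!h' not in spec:
        parts.append(f'Z{hybridization}')
    return '[' + ';'.join(parts) + ']'


aromatic_c = format_query_atom('C', neighbors=2, ring=6, hybridization='a')
nitrogen = format_query_atom('N', neighbors=1, hydrogens=0,
                             hybridization='s', isotope=14)
nitrogen_short = format_query_atom('N', neighbors=1, hydrogens=0,
                                   hybridization='s', isotope=14, spec='!h!H')
print(aromatic_c)      # [C;D2;r6;Za]
print(nitrogen)        # [14N;D1;H0;Zs]
print(nitrogen_short)  # [14N;D1]
```

The real containers may render an atom without marks differently (for example without brackets); the sketch only mirrors the order of the marks and the meaning of the flags.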

Signatures for CGRContainer include only radical state marks (*) in addition to the common SMILES notation.

Atoms in the QueryCGRContainer are represented in the following way:
[isotope;element_symbol;stereo state;h1>h2n1>n2;radical state;charge].

h1 is the hybridization of the atom in the reactant, h2 is its hybridization in the product, n1 is the number of neighbors of the atom in the reactant, and n2 is the number in the product.

[.>-][C;s>s0>1;*] - formation of a single bond, followed by a carbon atom that keeps s hybridization in the reaction and whose number of neighbors changes from 0 to 1; the atom is a radical.

Molecules are comparable and hashable

Comparison of MoleculeContainer objects is based on their signatures. Moreover, since strings in Python are hashable, MoleculeContainer is also hashable.

NOTE: MoleculeContainer is mutable. Changing a molecule can lead to non-obvious behavior of the sets and dictionaries in which it was placed before the change. Avoid changing molecules (standardization, aromatization, hydrogen handling, and atom/bond changes) that are placed inside sets and dictionaries.
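The pitfall from the note can be reproduced without CGRtools. The sketch below uses a hypothetical toy class that, like MoleculeContainer, compares and hashes by a string signature while remaining mutable; the sorted list of symbols is only a stand-in for a real signature.

```python
class ToyMolecule:
    """Hypothetical mutable object that compares and hashes by its signature."""

    def __init__(self, atoms):
        self.atoms = list(atoms)

    def __str__(self):
        return ''.join(sorted(self.atoms))  # toy stand-in for a canonical signature

    def __eq__(self, other):
        return isinstance(other, ToyMolecule) and str(self) == str(other)

    def __hash__(self):
        return hash(str(self))


mol = ToyMolecule(['C', 'O'])
seen = {mol}
found_before = mol in seen  # the molecule is found as expected

mol.atoms.append('N')       # change the molecule after it was placed in the set
found_after = mol in seen   # the set still looks it up by the old hash

print(found_before, found_after, len(seen))
```

After the change the object is still stored in the set, but a membership test no longer finds it, because the set remembers the hash computed at insertion time.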

m1 != m2 # different molecules
m7 == m11 # copy of the same molecule
m7 is m11  # this is not same objects!
# Simplest way to exclude duplicated structures
len({m1, m2, m7, m11}) == 3 # create set of unique molecules. Only 3 of them were different.
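A set drops duplicates but does not keep the input order. If the order matters, duplicates can be filtered by signature while iterating, as in this minimal sketch. The helper name is hypothetical; plain strings are used in place of molecules so the example runs without the tutorial data, and str is applied to each item because, as described above, it returns the signature for a MoleculeContainer.

```python
def unique_by_signature(items):
    """Return items without duplicates, keeping the first occurrence of each.
    Hypothetical helper; items are compared by their str() value."""
    seen = set()
    unique = []
    for item in items:
        key = str(item)  # signature string used as the comparison key
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


structures = ['CCO', 'OC(=O)C([O-])=O.[Na+]', 'CCO']
print(unique_by_signature(structures))  # ['CCO', 'OC(=O)C([O-])=O.[Na+]']
```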

2.2. Reaction signatures

ReactionContainer has its own signature. The signature is a SMIRKS string in which the molecules of reactants, reagents, and products are presented in canonical order.

The API is the same as for molecules.
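The effect of canonical ordering can be illustrated with a toy function: if the molecules on each side are sorted before joining, the resulting string does not depend on the order in which they were supplied. The function is hypothetical, and plain lexicographic sorting of the molecule strings is used here only as a stand-in; the ordering rule actually used by CGRtools is not described in this tutorial.

```python
def toy_reaction_signature(reactants, products, reagents=()):
    """Join molecule strings as reactants>reagents>products.
    Hypothetical helper; lexicographic sorting stands in for canonical order."""
    def side(molecules):
        return '.'.join(sorted(molecules))
    return f'{side(reactants)}>{side(reagents)}>{side(products)}'


a = toy_reaction_signature(['CCO', 'CC(O)=O'], ['CC(=O)OCC', 'O'])
b = toy_reaction_signature(['CC(O)=O', 'CCO'], ['O', 'CC(=O)OCC'])
print(a)       # CC(O)=O.CCO>>CC(=O)OCC.O
print(a == b)  # True
```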


Get reaction signature with mapping:

print(f'f-string: {r1:m}')
print('format method: {:m}'.format(r1))
print('format function: ', format(r1, "m"))
f-string: [C:5]([C:6](=[O:9])[OH:10])(=[O:7])[OH:8].[CH2:11]([CH3:12])[O-:2].[CH2:4]([OH:1])[CH3:3]>>[O:1]([CH2:4][CH3:3])[C:6]([C:5]([O:2][CH2:11][CH3:12])=[O:7])=[O:9]
format method: [C:5]([C:6](=[O:9])[OH:10])(=[O:7])[OH:8].[CH2:11]([CH3:12])[O-:2].[CH2:4]([OH:1])[CH3:3]>>[O:1]([CH2:4][CH3:3])[C:6]([C:5]([O:2][CH2:11][CH3:12])=[O:7])=[O:9]
format function:  [C:5]([C:6](=[O:9])[OH:10])(=[O:7])[OH:8].[CH2:11]([CH3:12])[O-:2].[CH2:4]([OH:1])[CH3:3]>>[O:1]([CH2:4][CH3:3])[C:6]([C:5]([O:2][CH2:11][CH3:12])=[O:7])=[O:9]

2.3. CGR signature

CGRContainer has its own signature. Signatures are SMIRKS-like strings in which dynamic bond labels and dynamic atoms are also specified within square brackets, so not only atoms but also bonds can be written in brackets if a bond has complex parameters. Dynamic bonds in a CGR have a special label representing the change in bond order. A dynamic atom corresponds to a change of the formal charge or radical state of an atom in the reaction. Dynamic atom labels are also given in brackets and include the atom symbol and text keys for the atomic property in the reactant and product, separated by the symbol >. For a neutral atom A gaining a positive charge +n in the reaction, the dynamic atom is encoded as [A0>+n]. For charges +1 and -1, the number 1 is omitted. Properties for charges and radicals can be combined consecutively within one pair of brackets, e.g. [A0>-^>*] stands for an atom that becomes an anion-radical.
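The charge part of the dynamic atom notation can be sketched as follows. Both functions are hypothetical helpers written for this illustration and are not CGRtools API; they cover only the charge keys described above (0, a sign alone for +1 and -1, and sign plus number otherwise) and leave radical keys out. Applying the same keys to an atom that is already charged in the reactant is an extrapolation from the [A0>+n] pattern.

```python
def charge_key(charge):
    """Text key for a formal charge: '0', '+', '-', '+2', '-2', ..."""
    if charge == 0:
        return '0'
    sign = '+' if charge > 0 else '-'
    magnitude = abs(charge)
    # the number 1 is omitted for charges +1 and -1
    return sign if magnitude == 1 else f'{sign}{magnitude}'


def dynamic_atom_label(symbol, reactant_charge, product_charge):
    """Label of an atom whose charge changes in the reaction, e.g. [N0>+]."""
    return f'[{symbol}{charge_key(reactant_charge)}>{charge_key(product_charge)}]'


print(dynamic_atom_label('N', 0, 1))   # [N0>+]
print(dynamic_atom_label('O', 0, -1))  # [O0>-]
print(dynamic_atom_label('Fe', 0, 2))  # [Fe0>+2]
```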