1. Data types and operations with them
2019, 2020 Dr. Ramil Nugmanov;
2019 Dr. Timur Madzhidov; Ravil Mukhametgaleev
2022 Valentina Afonina
Installation instructions of CGRtools package information and tutorial’s files see on https://github.com/cimm-kzn/CGRtools
NOTE: Tutorial should be performed sequentially from the start. Random cell running will lead to unexpected results.
[1]:
import pkg_resources
if pkg_resources.get_distribution('CGRtools').version.split('.')[:2] != ['4', '1']:
print('WARNING. Tutorial was tested on 4.1 version of CGRtools')
else:
print('Welcome!')
Welcome!
[2]:
# load data for tutorial
from pickle import load
from traceback import format_exc
with open('molecules.dat', 'rb') as f:
molecules = load(f) # list of MoleculeContainer objects
with open('reactions.dat', 'rb') as f:
reactions = load(f) # list of ReactionContainer objects
m1, m2, m3, m4 = molecules # molecule
m7 = m3.copy()
m11 = m3.copy()
m11.standardize()
m7.standardize()
r1 = reactions[0] # reaction
m5 = r1.reactants[0]
m8 = m7.substructure([4, 5, 6, 7, 8, 9])
m10 = r1.products[0].copy()
CGRtools has subpackage containers with data structures classes:
MoleculeContainer - for molecular structure
ReactionContainer - for chemical reaction
CGRContainer - for Condensed Graph of Reaction
QueryContainer - queries for substructure search in molecules
QueryCGRContainer - queries for substructure search in CGRs
[3]:
from CGRtools.containers import * # import all containers
1.1. MoleculeContainer
Molecules are represented as undirected graphs. Molecules contain Atom objects and Bond objects.
Atom objects are represented as dictionary with unique number for each atom as key.
Bond objects are stored as sparse matrix with adjacent atoms pair as keys for rows and columns.
Hereafter, atom number is unique integer used to enumerate atoms in molecule. Please, don’t confuse it with element number in Periodic Table, hereafter called element number.
Methods for molecule handling and the arguments of MoleculeContainer are described below.
[4]:
m1.meta # dictionary for molecule properties storage. For example, DTYPE/DATUM fields of SDF file are read into this dictionary
[4]:
{'cdid': '001'}
[5]:
m1 # MoleculeContainer supports depiction and graphic representation in Jupyter notebooks.
[5]:
[6]:
m1.depict() # depiction returns SVG image in format string
[6]:
'<svg width="2.93cm" height="1.66cm" viewBox="-11.44 -0.76 2.93 1.66" xmlns="http://www.w3.org/2000/svg" version="1.1">\n <g>\n <defs>\n <mask id="mask-e3d2afc7-d422-420a-8063-9735fbb3c771">\n <rect x="-11.44" y="-0.76" width="2.93" height="1.66" fill="white"/>\n <g fill="black">\n <circle cx="-9.38" cy="-0.14" r="0.20"/>\n </g>\n <g font-family="monospace" stroke="black">\n <g font-family="sans-serif" font-size="0.50" stroke-width="0.05">\n <text x="-9.38" y="-0.14" dx="-0.20" dy="0.20">OH</text>\n </g>\n <g font-size="0.25" stroke-width="0.03">\n <text x="-10.81" y="-0.14" dx="-0.05" dy="0.20" text-anchor="end">3</text>\n <text x="-10.10" y="0.28" dx="-0.05" dy="0.20" text-anchor="end">4</text>\n <text x="-9.38" y="-0.14" dx="-0.15" dy="0.35" text-anchor="end">1</text>\n </g>\n <g font-family="monospace" font-size="0.3" stroke-width="0.03">\n <text x="-9.38" y="-0.14" dx="-0.05" dy="-0.20" text-anchor="end">15</text>\n </g>\n </g>\n </mask>\n </defs>\n <g fill="none" stroke="black" stroke-width="0.04" mask="url(#mask-e3d2afc7-d422-420a-8063-9735fbb3c771)">\n <line x1="-10.78" y1="-0.19" x2="-10.07" y2="0.22"/>\n <line x1="-10.84" y1="-0.09" x2="-10.13" y2="0.33"/>\n <line x1="-10.10" y1="0.28" x2="-9.38" y2="-0.14"/>\n </g>\n <g font-family="monospace">\n <g fill="#FF0D0D" font-family="sans-serif">\n <text x="-9.38" y="-0.14" dx="-0.20" dy="0.20" font-size="0.50">OH</text>\n </g>\n <g font-family="monospace" fill="black" font-size="0.30">\n <text x="-9.38" y="-0.14" dx="-0.05" dy="-0.20" text-anchor="end">15</text>\n </g>\n <g fill="#0305A7" font-size="0.25">\n <text x="-10.81" y="-0.14" dx="-0.05" dy="0.20" text-anchor="end">3</text>\n <text x="-10.10" y="0.28" dx="-0.05" dy="0.20" text-anchor="end">4</text>\n <text x="-9.38" y="-0.14" dx="-0.15" dy="0.35" text-anchor="end">1</text>\n </g>\n </g>\n </g>\n</svg>'
[7]:
with open('molecule.svg', 'w') as f: # saving image to SVG file
f.write(m1.depict())
[8]:
m_copy = m1.copy() # copy of molecule
m_copy
[8]:
[9]:
len(m1) # get number of atoms in molecule
# or
m1.atoms_count
[9]:
3
[10]:
m1.bonds_count # number of bonds
[10]:
2
[11]:
m1.atoms_numbers # list of atoms numbers
[11]:
(3, 4, 1)
Each structure has additional atoms attributes: number of neighbors and hybridization. The following notations are used for hybridization of atoms. Values are given as numbers below (in parenthesis symbols that are used in SMILES-like signatures are shown):
1 (s) - all bonds of atom are single, i.e. sp3 hybridization
2 (d) - atom has one double bond and others are single, i.e. sp2 hybridization
3 (t) - atom has one triple or two double bonds and other are single, i.e. sp hybridization
4 (a) - atom is in aromatic ring
Neighbors and hybridizations atom attributes are required for substructure operations and structure standardization. See below
[12]:
# iterate over atoms using its numbers
list(m1.atoms()) # works the same as dict.items()
[12]:
[(3, C()), (4, C()), (1, O(15))]
[13]:
# iterate over bonds using adjacent atoms numbers
list(m1.bonds())
[13]:
[(3, 4, Bond(2)), (4, 1, Bond(1))]
[14]:
# access to atom by number
m1.atom(1)
[14]:
O(15)
[15]:
try:
m1.atom(10) # raise error for absent atom numbers
except KeyError:
print(format_exc())
Traceback (most recent call last):
File "/tmp/ipykernel_2099034/3657801809.py", line 2, in <module>
m1.atom(10) # raise error for absent atom numbers
File "/home/valia/miniconda3/envs/cgrtools-master/lib/python3.10/site-packages/CGRtools/containers/common.py", line 89, in atom
return self._atoms[n]
KeyError: 10
[16]:
# access to bond using adjacent atoms numbers
m1.bond(1, 4)
[16]:
Bond(1)
[17]:
try:
m1.bond(1, 3) # raise error for absent bond
except KeyError:
print(format_exc())
Traceback (most recent call last):
File "/tmp/ipykernel_2099034/2926617244.py", line 2, in <module>
m1.bond(1, 3) # raise error for absent bond
File "/home/valia/miniconda3/envs/cgrtools-master/lib/python3.10/site-packages/CGRtools/containers/common.py", line 134, in bond
return self._bonds[n][m]
KeyError: 3
Atom objects are dataclasses which store information about:
element (symbol and number)
atomic mass and atomic radius
isotope
charge
radical state
xy coordinates
Also atoms has methods for data integrity checks and include some internally used data.
[18]:
a = m1.atom(1)
# access to information
a.atomic_symbol # element symbol
[18]:
'O'
[19]:
a.atomic_number # atomic number in periodic table
[19]:
8
[20]:
a.atomic_mass # atomic mass
[20]:
15.003065
[21]:
a.atomic_radius # atomic radius in periodic table
[21]:
0.48
[22]:
a.charge # formal charge
[22]:
0
[23]:
a.is_radical # atom radical state
[23]:
False
[24]:
a.isotope # atom isotope. Default isotope if not set. Default isotopes are the same as used in InChI notation
[24]:
15
[25]:
a.x # coordinates
a.y
#or
a.xy
[25]:
(-9.3843, 0.1375)
[26]:
a.neighbors # Number of neighboring atoms. It is read-only.
[26]:
1
[27]:
a.hybridization # Atoms hybridization. It is read-only.
[27]:
1
[28]:
try:
a.hybridization = 2 # Not assignable. Read-only! Thus error is raised.
except AttributeError:
print(format_exc())
Traceback (most recent call last):
File "/tmp/ipykernel_2099034/1699116341.py", line 2, in <module>
a.hybridization = 2 # Not assignable. Read-only! Thus error is raised.
AttributeError: can't set attribute 'hybridization'
Atomic attributes are assignable.
CGRtools has integrity checks for verification of changes induced by user
[29]:
a.charge = 1
m1
[29]:
[30]:
a.charge = 0
a.is_radical = True
m1
[30]:
[31]:
# bond objects also are data-like classes which store information about bond order
b = m1.bond(3, 4)
b.order
[31]:
2
[32]:
try:
b.order = 1 # order change not possible
except AttributeError:
print(format_exc())
Traceback (most recent call last):
File "/tmp/ipykernel_2099034/1138391940.py", line 2, in <module>
b.order = 1 # order change not possible
AttributeError: can't set attribute 'order'
Bonds are Read-only
For bond modification one should to use delete_bond
method to break bond and add_bond
for creating new.
[33]:
m1.delete_bond(3, 4)
m1
[33]:
Method delete_atom
removes atom from the molecule
[34]:
m1.delete_atom(3)
m1
[34]:
[35]:
m_copy # copy unchanged!
[35]:
Atoms and bonds objects can be converted into integer representation that could be used to classify their types.
Atom type is represented by python 3.8 hash of tuple formed from isotope (or 0), atom number, formal charge, and radical state.
[36]:
# 15 isotope
# 8 Oxygen
# 0 uncharged
# 1 is radical
# hash((15, 8, 0, 1)) == 831621142045386143
# If python 3.7 will use for CGRtools running, in-house hash implementation
# the same as in python 3.8 will use for hash generation, so, the result of
# hash function will be different from int(a)
int(a)
[36]:
831621142045386143
[37]:
int(b) # bonds are encoded by their order
[37]:
2
[38]:
a = m_copy.atom(1)
print(a.implicit_hydrogens) # get number of implicit hydrogens on atom 1
print(a.explicit_hydrogens) # get number of explicit hydrogens on atom 1
print(a.total_hydrogens) # get total number of hydrogens on atom 1
1
0
1
[39]:
m1
[39]:
[40]:
m1.check_valence() # return list of numbers of atoms with invalid valences
[40]:
[]
[41]:
m4 # molecule with valence errors
[41]:
[42]:
m4.check_valence()
[42]:
[1]
[43]:
m3
[43]:
[44]:
m3.sssr # Method for application of Smallest Set of Smallest Rings algorithm for rings
# identification. Returns tuple of tuples of atoms forming smallest rings
[44]:
((4, 5, 6, 7, 8, 9),)
Connected components.
Sometimes molecules has disconnected components (salts etc).
One can find them and split molecule to separate components.
[45]:
m2 # it's a salt represented as one graph
[45]:
[46]:
m2.connected_components # tuple of tuples of atoms belonging to graph components
[46]:
((5, 6, 7, 8, 9, 10), (11,))
[47]:
anion, cation = m2.split() # split molecule to components
[48]:
anion # graph of only one salt component
[48]:
[49]:
cation # graph of only one salt component
[49]:
Union of molecules
Sometimes it is more convenient to represent salts as ion pair. Otherwise ambiguity could be introduced, for example in reaction of salt exchange:
Ag+ + NO3- + Na+ + Br- = Ag+ + Br- + Na+ + NO3-. Reactants and products sets are the same.
In this case one can combine anion-cation pair into single graph. It could be convenient way to represent other molecule mixtures.
[50]:
salt = anion | cation
# or
salt = anion.union(cation)
salt # this graph has disconnected components, it is considered single compound now
[50]:
Substructures could be extracted from molecules.
[51]:
sub = m3.substructure([4,5,6,7,8,9]) # substructure with passed atoms
sub
[51]:
augmented_substructure
is a substructure consisting from atoms and a given number of shells of neighboring atoms around it. deep argument is a number of considered shells.
It also returns projection by default.
[52]:
aug = m3.augmented_substructure([10], deep=2) # atom 10 is Nitrogen
aug
[52]:
Atoms Ordering.
This functionality is used for canonic numbering of atoms in molecules. Morgan algorithm is used for atom ranking. Property atoms_order
returns dictionary of atom numbers as keys and their ranks according to canonicalization as values. Equal rank mean that atoms are symmetric (are mapped to each other in automorhisms).
[53]:
m5.atoms_order
[53]:
{4: 1, 1: 2, 3: 3}
Atom number can be changed by remap
method.
This method is useful when it is needed to change order of atoms in molecules. First argument to remap
method is dictionary with existing atom numbers as keys and desired atom number as values. It is possible to change atom numbers for only part of atoms. Atom numbers could be non-sequencial but need to be unique.
If argument copy is set True new object will be created, else existing molecule will be changed. Default is False.
[54]:
m5
[54]:
[55]:
remapped = m5.remap({4:2}, copy=True)
remapped
[55]:
1.2. ReactionContainer
ReactionContainer objects has the following properties:
reactants - list of reactants molecules
reagents - list of reagents molecules
products - list of products molecules
meta - dictinary of reaction metadata (DTYPE/DATUM block in RDF)
[56]:
r1 # depiction supported
[56]:
[57]:
r1.meta
[57]:
{'CdId': '1872',
'solvent': '3',
'temperature': '129.5',
'tabulated_constant': '-6.87'}
[58]:
print(r1.reactants, r1.products) # Access to lists of reactant and products.
reactant1, reactant2, reactant3 = r1.reactants
product = r1.products[0]
[<CGRtools.containers.molecule.MoleculeContainer object at 0x7fb0b93fa660>, <CGRtools.containers.molecule.MoleculeContainer object at 0x7fb0b93fa700>, <CGRtools.containers.molecule.MoleculeContainer object at 0x7fb0b93fa7a0>] [<CGRtools.containers.molecule.MoleculeContainer object at 0x7fb0b93fa840>]
Reactions also has standardize
, kekule
, thiele
, implicify_hydrogens
, explicify_hydrogens
, etc methods (see part 3). These methods are applied independently to every molecule in reaction.
1.3. CGR
CGRContainer object is similar to MoleculeConrtainer, except some methods. The following methods are not suppoted for CGRContainer:
standardization methods
hydrogens count methods
check_valence
CGRContainer also has some methods absent in MoleculeConrtainer:
centers_list
center_atoms
center_bonds
CGRContainer is undirected graph. Atoms and bonds in CGR has two states: reactant and product.
Composing to CGR
As mentioned above, atoms in MoleculeContainer have unique numbers. These numbers are used as atom-to-atom mapping in CGRtools upon CGR creation. Thus, atom order for molecules in reaction should correspond to atom-to-atom mapping.
Pair of molecules can be transformed into CGR. Notice that, the same atom numbers in reagents and products imply the same atoms.
Reaction also can be composed into CGR. Atom numbers of molecules in reaction are used as atom-to-atom mapping of reactants to products.
[59]:
cgr1 = m7 ^ m8 # CGR from molecules
# or
cgr1 = m7.compose(m8)
print(cgr1)
cgr1
O[->.]C1=CC=C(C=C1)[->.][N+](=O)[O-]
[59]:
[60]:
r1
[60]:
[61]:
cgr2 = ~r1 # CGR from reactions
# or
cgr2 = r1.compose()
print(cgr2) # signature is printed out.
cgr2.clean2d()
cgr2
C(O[.>-]C(C([.>-][O->0]CC)([->.]O)=O)(=O)[->.]O)C
[61]:
[62]:
a = cgr2.atom(2) # atom access is the same as for MoleculeContainer
[63]:
a.atomic_symbol # element attribute
[63]:
'O'
[64]:
a.isotope # isotope attribute
For CGRContainer attributes charge
, is_radical
, neighbors
and hybridization
refer to atom state in reactant of reaction; arguments p_charge
, p_is_radical
, p_neighbors
and p_hybridization
could be used to extract atom state in product part in reaction.
[65]:
a.charge # charge of atom in reactant
[65]:
-1
[66]:
a.p_charge # charge of atom in product
[66]:
0
[67]:
a.p_is_radical # radical state of atom in product.
[67]:
False
[68]:
a.neighbors # number of neighbors of atom in reactant
[68]:
1
[69]:
a.p_neighbors # number of neighbors of atom in product
[69]:
2
[70]:
a.hybridization # hybridization of atom in reactant. 1 means only single bonds are incident to atom
[70]:
1
[71]:
a.p_hybridization # hybridization of atom in product. 1 means only single bonds are incident to atom
[71]:
1
[72]:
b = cgr1.bond(4, 10) # take bond
Bonds has order
and p_order
attribute
order
attribute value is None, it means that bond was formedp_order
is None, it means that bond was brokenBoth order
and p_order
can’t be None
[73]:
b.order # bond order in reactant
[73]:
1
[74]:
b.p_order is None # bond order in product in None
[74]:
True
CGR can be decomposed back to reaction, i.e. reactants and products.
Notice that CGR can lose information in case of unbalanced reactions (where some atoms of reactant does not have counterpart in product, and vice versa). Decomposition of CGRs for unbalanced reactions back to reaction may lead to strange (and erroneous) structures.
[75]:
reactant_part, product_part = ~cgr1 # CGR of unbalanced reaction is decomposed back into reaction
# or
reactant_part, product_part = cgr1.decompose()
[76]:
reactant_part # reactants extracted. One can notice it is initial molecule
[76]:
[77]:
product_part #extracted products. Originally benzene was the product.
[77]:
For decomposition of CGRContainer back into ReactionContainer ReactionContainer.from_cgr
constructor method can be used.
[78]:
decomposed = ReactionContainer.from_cgr(cgr2)
decomposed.clean2d()
decomposed
[78]:
You can see that water absent in products initially was restored. This is a side-effect of CGR decomposing that could help with reaction balancing. But balancing using CGR decomposition works correctly only if minor part atoms are lost but multiplicity and formal charge are saved. In next release electronic state balansing will be added.
[79]:
r1 # compare with initial reaction
[79]:
1.4 Queries
CGRtools supports special objects for Queries. Queries are designed for substructure isomorphism. User can set number of neighbors and hybridization by himself (in molecules they could be calculated but could not be changed).
Queries don’t have reset_query_marks
method
[80]:
from CGRtools.containers import*
[81]:
m10 # ether
[81]:
[82]:
carb = m10.substructure([5,7,2], as_query=True) # extract of carboxyl fragment
print(carb)
carb
[O;D1;H0;Zd]=[C;D3;H0;Zd]-[O;D2;H0;Zs]
[82]:
CGRs also can be transformed into Query.
QueryCGRContainer
is similar to QueryContainer class for CGRs and has the same API.
QueryCGRContainer
take into account state of atoms and bonds in reactant and product, including neighbors and hybridization
[83]:
cgr_q = cgr1.substructure(cgr1, as_query=True) # transfrom CGRContainer into QueryCGRContainer
cgr_q
[83]:
1.5. Molecules, CGRs, Reactions construction
CGRtools has API for objects construction from scratch.
CGR and Molecule has methods add_atom
and add_bond
for adding atoms and bonds.
[84]:
from CGRtools.containers import MoleculeContainer
from CGRtools.containers.bonds import Bond
from CGRtools.periodictable import Na
[85]:
m = MoleculeContainer() # new empty molecule
m.add_atom('C') # add Carbon atom using element symbol
m.add_atom(6) # add Carbon atom using element number. {'element': 6} is not valid, but {'element': 'O'} is also acceptable
m.add_atom('O', charge=-1) # add negatively charged Oxygen atom. Similarly other atomic properties can be set
# add_atom has second argument for setting atom number.
# If not set, the next integer after the biggest among already created will be used.
m.add_atom(Na(23), 4, charge=1) # For isotopes required element object construction.
[85]:
4
[86]:
m.add_bond(1, 2, 1) # add bond with order = 1 between atoms 1 and 2
m.add_bond(3, 2, Bond(1)) # the other possibility to set bond order
[87]:
m.clean2d() # experimental function to calculate atom coordinates. Has number of flaws yet
m
[87]:
Reactions can be constructed from molecules. Reactions are tuple-like objects. Modification impossible.
[88]:
r = ReactionContainer(reactants=[m1], products=[m11]) # one-step way to construct reaction
# or
r = ReactionContainer([m1], [m11]) # first list of MoleculeContainers is interpreted as reactants,
# second one - as products
[89]:
r
[89]:
[90]:
r.fix_positions() # this method fixes coordinates of molecules in reaction without calculation of atoms coordinates.
r
[90]:
QueryContainers can be constructed in the same way as MoleculeContainers.
Unlike other containers QueryContainers additionally support atoms, neighbors and hybridization lists.
[91]:
q = QueryContainer() # creation of empty container
q.add_atom('N') # add N atom, any isotope, not radical, neutral charge,
# number of neighbors and hybridization are irrelevant
q.add_atom('C', neighbors=[2, 3], hybridization=2) # add carbon atom, any isotope, not radical, neutral charge,
# has 2 or 3 explicit neighbors and sp2 hybridization
q.add_atom('O', neighbors=1)
q.add_bond(1, 2, 1) # add single bond between atom 1 and 2
q.add_bond(2, 3, 2) # add double bond between atom 1 and 2
# any amide group will fit this query
[92]:
print(q) # print out signature (SMILES-like)
q.clean2d()
q
[N]-[C;D23;Zd]=[O;D1]
[92]:
1.6. Extending CGRtools
As an example, we show how special marks on atoms for ligand donor centers can be added.
[93]:
from CGRtools.periodictable import Core, C, O
[94]:
class Marked(Core):
__slots__ = '__mark' # all new attributes should be slotted!
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.__mark = None # set default value for added attribute
@property
def mark(self): # created new property
return self.__mark
@mark.setter
def mark(self, mark):
# do some checks and calculations
self.__mark = mark
def __repr__(self):
if self.__isotope:
return f'{self.__class__.__name__[6:]}({self.__isotope})'
return f'{self.__class__.__name__[6:]}()'
@property
def atomic_symbol(self) -> str:
return self.__class__.__name__[6:]
class MarkedC(Marked, C):
pass
class MarkedO(Marked, O):
pass
[95]:
m = MoleculeContainer() # create newly developed container MarkedMoleculeContainer
m.add_atom(MarkedC()) # add custom atom C
m.add_atom(MarkedO()) # add custom atom O
m.add_bond(1, 2, 1)
m.atom(2).mark = 1 # set mark on atom.
print(m)
m.clean2d()
m
OC
[95]:
[96]:
m.atom(2).mark # one can return mark
[96]:
1