datawarrior_saturate_murcko_scaffolds

Background

The Bemis-Murcko scaffold[1] provided by DataWarrior[2] retains information about bond order and chirality. Sometimes, however, it suffices to retain only atom connectivity, as if «there are only single bonds». Note that DataWarrior also offers the export of the Bemis-Murcko skeleton; this, however, also discards the element information and simplifies e.g. the scaffold of imidazole into cyclopentane.

Typical use

Run from the command line, the script processes one or multiple SMILES strings listed in an input file (symbolized by text below) and reports the processed SMILES strings back to the command line:

python saturate_murcko_scaffolds.py [-h] [-o] text

Instead of an input file, individual SMILES strings may be processed as well; to prevent their misinterpretation by the shell, enclose them in either single or double quotes. With either approach, the result may be redirected into a permanent record with a user-defined file name (flag --outfile, or shorter -o). The script uses only modules of the standard library of Python 3 (e.g., version 3.11.2).
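The choice between «input file» and «individual SMILES string» can be sketched as follows. This is a minimal illustration only and not necessarily how the script itself decides; the function name collect_smiles and the rule «an existing file wins, anything else is read as SMILES» are assumptions made for this example.

```python
import os


def collect_smiles(text):
    """Return the SMILES strings designated by one command line argument.

    Assumption of this sketch: if `text` names an existing file, its lines
    are read; otherwise `text` itself is treated as a SMILES string.
    """
    if os.path.isfile(text):
        with open(text, encoding="utf-8") as source:
            lines = source.read().splitlines()
        # A .smi line may carry a name after the SMILES string;
        # keep only the first whitespace-separated field.
        return [line.split()[0] for line in lines if line.strip()]
    return [text]
```

Quoting on the shell still matters for the second case, because characters such as parentheses, # and \ carry a meaning for the shell before Python ever sees them.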

Example

For a collection of organic materials, the Bemis-Murcko scaffolds were extracted with DataWarrior (then release 5.0.0 for Linux, January 2019) as listing test_input.smi, which includes higher bond orders (see folder test_data). The effect of the «artificial saturation» is easy to recognize when comparing the scaffold lists (fig. file_diff) in a difference view of the two .smi files.

Difference view of the SMILES strings of a Murcko scaffold prior to (left hand column) and after an «artificial saturation» (right hand column). The processing affects explicit bond order indicators, e.g. double bonds (equality sign, e.g., line #14) and triple bonds (number sign #, not shown), as well as implicit aromaticity (lower case changed to upper case) for atoms of carbon, nitrogen, oxygen (depicted), or phosphorus, sulfur (not depicted). Stereochemical indicators about double bonds are removed (e.g., slashes in lines #18 and #19). Descriptors of stereogenic centers (@-signs, e.g., line #25) and charges (not shown) are copied verbatim.

The work can be illustrated with OpenBabel[3] by commands in the pattern of

obabel -ismi test_input.smi -O test_input_color.svg -xc10 -xr12 -xl --addinindex

to generate a .svg file (vector representation), or

obabel -ismi test_input_sat.smi -O test_input_sat_color.png -xc10 -xr12 -xl --addinindex -xp 3000

to generate a bitmap .png with structure formulae depicted in a grid of 10 columns by 12 rows.

It is remarkable how well OpenBabel displays molecular structures with advanced motifs. In addition to those shown in the first illustration of this guide, see sub-folder test_data for a more extensive survey (e.g., the scaffold of cyclophane [entry #33], sparteine [#38], or adamantane [#50]).

Known peculiarities

The script provides «saturation» by dropping the explicit information about double and triple bonds that SMILES encode (= and # for bond order; / (forward slash) and \ (backward slash) for the (cis)-(trans) relationship around double bonds). When processing double bonds of e.g., ketones to yield secondary alcohols, the script refrains from assigning new CIP priorities and a corresponding label. It then depends on the program used for visualization whether an explicit wedge is drawn (e.g., OpenBabel), or the absence of information is highlighted as ambiguous (e.g., by a question mark in DataWarrior, or in the project CDK depict[4]). The absolute configuration of stereogenic centers already assigned in the input (indicated in SMILES with the @ sign), however, is retained.
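The removal of the bond symbols described above can be sketched with the standard library alone. This is a minimal illustration under the assumption that the four characters =, #, / and \ never carry another meaning in the input; the name drop_bond_symbols is hypothetical and not taken from the script.

```python
# Characters that encode bond order (=, #) or double bond geometry (/, \).
BOND_SYMBOLS = str.maketrans("", "", "=#/\\")


def drop_bond_symbols(smiles):
    """Return the SMILES string without explicit bond order and cis/trans marks.

    Descriptors of stereogenic centers (@, @@) are not touched.
    """
    return smiles.translate(BOND_SYMBOLS)


print(drop_bond_symbols("C1=CC(=O)C=CC1=O"))        # C1CC(O)CCC1O
print(drop_bond_symbols("O=C([C@H](c1ccccc1)O)O"))  # OC([C@H](c1ccccc1)O)O
```

In the second example the carbonyl oxygen becomes a hydroxyl group on a new, unlabeled stereogenic center, while the @ descriptor of the existing center is copied verbatim.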

For a selection of elements (C, N, O, P, S), the implicit description of aromatic systems (e.g., c1ccncc1 for pyridine, c1c[nH]cc1 for pyrrole) is recognized. To offer a «saturation», these characters are returned as upper case characters to yield e.g., piperidine (C1CCNCC1) and pyrrolidine (C1C[NH]CC1).

The script also preserves up to one single negative or single positive charge on these five elements (e.g., [O-]c1ccccc1 for the phenolate anion, and C[N+](c1ccccc1)(C)C for the N,N,N-trimethylbenzenaminium cation). Here, it can be sensible to «sanitize» the results of this script with other libraries such as RDKit.[5]

The capitalization of the five characters is constrained to prevent nonsensical transformations, e.g., of an (implicitly) aromatic atom of tin [sn] into the invalid form [SN]. Though the script writes tin as [Sn], an adjustment of valence for elements written with two characters is beyond the current scope of the script.
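One way to constrain the capitalization is to treat atoms inside and outside square brackets separately. The following is a sketch under stated assumptions and not the script's own implementation: outside brackets only the characters c, n, o, p, s are changed, and inside brackets only the first letter of a lower case element symbol is capitalized. The names capitalize_aromatic and _fix_token are hypothetical.

```python
import re

AROMATIC = frozenset("cnops")
# A token is either a complete bracket atom or one single character.
TOKEN = re.compile(r"\[[^\]]*\]|.")
# Start of a bracket atom: "[", an optional isotope number, a lower case letter.
BRACKET_START = re.compile(r"^(\[\d*)([a-z])")


def _fix_token(match):
    token = match.group(0)
    if token.startswith("["):
        # Capitalize only the first letter of the symbol: [sn] -> [Sn].
        return BRACKET_START.sub(
            lambda m: m.group(1) + m.group(2).upper(), token
        )
    return token.upper() if token in AROMATIC else token


def capitalize_aromatic(smiles):
    """Rewrite implicitly aromatic atoms as their upper case counterparts."""
    return TOKEN.sub(_fix_token, smiles)


print(capitalize_aromatic("c1ccncc1"))      # C1CCNCC1
print(capitalize_aromatic("c1c[nH]cc1"))    # C1C[NH]CC1
print(capitalize_aromatic("[O-]c1ccccc1"))  # [O-]C1CCCCC1
print(capitalize_aromatic("[sn]"))          # [Sn]
```

Because bracket atoms are handled as one token, charges and @ descriptors inside the brackets pass through unchanged, and a second letter such as the n in [sn] is never capitalized.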

A SMILES string may describe more than one molecule. Thus, the concatenation with "." (period character), as seen for example in descriptions of co-crystals such as the one of 1,4-benzoquinone and hydroquinone, C1=CC(=O)C=CC1=O.c1cc(ccc1O)O, is retained. The example is resolved as C1CC(O)CCC1O.C1CC(CCC1O)O.
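The treatment of such multi-component strings can be sketched by processing each component on its own and joining the results with the period again. The per-component step below is a deliberately compact stand-in (bond symbols dropped; c, n, o, p, s capitalized outside square brackets only), and both function names are hypothetical.

```python
import re

BOND_SYMBOLS = str.maketrans("", "", "=#/\\")
# Match a whole bracket atom, or one aromatic character outside brackets.
BRACKET_OR_AROMATIC = re.compile(r"\[[^\]]*\]|[cnops]")


def saturate_component(smiles):
    """Compact stand-in for the «saturation» of one molecule."""
    without_bonds = smiles.translate(BOND_SYMBOLS)

    def fix(match):
        token = match.group(0)
        # Bracket atoms are copied unchanged in this simplified sketch.
        return token if token.startswith("[") else token.upper()

    return BRACKET_OR_AROMATIC.sub(fix, without_bonds)


def saturate_mixture(smiles):
    """Process each "."-separated component and keep their order."""
    return ".".join(saturate_component(part) for part in smiles.split("."))


print(saturate_mixture("C1=CC(=O)C=CC1=O.c1cc(ccc1O)O"))
# C1CC(O)CCC1O.C1CC(CCC1O)O
```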

License

Norwid Behrnd, 2019–23, GPLv3.

Footnotes

  1. Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887–2893 (https://doi.org/10.1021/jm9602928).

  2. Sander, T.; Freyss, J.; von Korff, M.; Rufener, C. DataWarrior: An Open-Source Program For Chemistry Aware Data Visualization And Analysis. J. Chem. Inf. Model. 2015, 55, 460–473 (https://doi.org/10.1021/ci500588j). The program, (c) 2002–2023 by Idorsia Pharmaceuticals Ltd., is freely available at http://www.openmolecules.org. For the source code (GPLv3), see https://github.com/thsa/datawarrior.

  3. www.openbabel.org. For the most recent documentation, see https://open-babel.readthedocs.io/en/latest/ReleaseNotes/ob310.html.

  4. https://www.simolecule.com/cdkdepict/depict.html. For the mentioned annotation of CIP labels, change No Annotation (second pull down menu from the left) to CIP Stereo Label.

  5. For an overview of the freely available RDKit library, see www.rdkit.org. An introduction to the topic of «molecular sanitization» is provided in the section of this very title in the online RDKit Book.

datawarrior_saturate_murcko_scaffolds's Issues

processing constrained to one pair of square brackets (maximum)

In a SMILES string, square brackets are used to discern atoms described by a single character (C, N, O, P, S, etc.) from those with more than one character (e.g., Cl, Br as [Cl] and [Br], respectively). Square brackets are also used to annotate charges (e.g., [Fe3+]) as well as stereogenic centres (e.g., in O=C([C@H](c1ccccc1)O)O for (S)-mandelic acid). The current version of the script, however, allows at most one pair of square brackets.

This constraint prevents the processing of compounds with multiple stereogenic centres (e.g., tartaric acid) and/or charged molecules, including charge-compensating anions/cations and «inner salts» (e.g., pyridine hydrochloride, pyridine N-oxide). Future versions of the script should identify an approach without this limitation.
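Until this limitation is lifted, affected entries can be identified before processing by counting the bracket atoms of each SMILES string. The following is a small helper sketched for this purpose only; it is not part of the script, the function names are hypothetical, and the SMILES strings of the examples were written for this illustration.

```python
import re

# One bracket atom: "[", anything except "]", then "]".
BRACKET_ATOM = re.compile(r"\[[^\]]*\]")


def count_bracket_atoms(smiles):
    """Return the number of square bracket pairs in a SMILES string."""
    return len(BRACKET_ATOM.findall(smiles))


def within_bracket_limit(smiles, limit=1):
    """Report whether the string stays within the allowed number of pairs."""
    return count_bracket_atoms(smiles) <= limit


examples = {
    "(S)-mandelic acid": "O=C([C@H](c1ccccc1)O)O",
    "tartaric acid": "O[C@@H]([C@H](O)C(O)=O)C(O)=O",
    "pyridine hydrochloride": "c1cc[nH+]cc1.[Cl-]",
    "pyridine N-oxide": "[O-][n+]1ccccc1",
}
for name, smiles in examples.items():
    print(name, count_bracket_atoms(smiles), within_bracket_limit(smiles))
```

Only the first example stays within the limit; the other three contain two pairs of square brackets each.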
