APRL-SSP (APRL Substructure Search Program) is made up of several primary units:
- "spider_query.py": Query the ChemSpider database for SMILES strings and other properties of a molecule.
- "substructure_search.py": Use the Open Babel chemoinformatics tool to find number of instances of a substructure (specified by SMARTS pattern) occurring in a molecule (specified by SMILES pattern).
- "substructure_generate_fulltable.py": Use the Open Babel chemoinformatics tool to find atoms associated with a substructure (specified by SMARTS pattern) occurring in a molecule (specified by SMILES pattern).
- "generate_carbontypes.py": Define carbon types from associated functional groups (adapted from aprl-carbontypes).
- "validation_triple.py": Evaluate a set of patterns for specificity and completeness (adapted from aprl-carbontypes).
- "simpol.py": Estimate compound vapor pressures using SIMPOL.1.
Its application is described by
Ruggeri, G. and Takahama, S.: Technical Note: Development of chemoinformatic tools to enumerate functional groups in molecules for organic aerosol characterization, Atmos. Chem. Phys., 16, 4401-4422, doi:10.5194/acp-16-4401-2016, 2016.
Takahama, S. and Ruggeri, G.: Technical Note: Relating functional group measurements to carbon types for improved model–measurement comparisons of organic aerosol composition, Atmos. Chem. Phys., 17, 4433–4450, doi:10.5194/acp-17-4433-2017, 2017.
The program is released under the GNU General Public License (GPLv3). Please cite our manuscripts if you use it. The corresponding author can be contacted with any bug reports or questions.
Tables from the manuscript are provided in the "SMARTSpatterns/" subdirectory:
- "MCMgroups.csv": Table 1, substructures 1-30.
- "FTIRextra.csv": Table 1, substructures 31-54.
- "SIMPOLgroups.csv": Table 2, SIMPOL.1 groups. Use with `-e SIMPOLexportlist.csv` (see below) to get groups exported in the right order.
- "OScBonds.csv": Table 4, bonds for calculating oxidation state for each carbon atom.
- "MCMgroupsv33.csv": Update to MCMgroups.csv to accommodate conditions for aprl-carbontypes (compatible with MCMv3.2) and new additions to MCMv3.3 isoprene mechanism. (Carbon specificity condition is not met for anhydrides.)
Supplementary scripts:
- "substructure_molecular_attributes.py": Extract molecular attributes that can be retrieved from a pybel Molecule object (e.g., molecular weight).
Scripts and input files which reproduce the validation figures in the manuscript are also described below.
You may want to add the program directory to your PATH so that the python scripts can be accessed from anywhere. In Linux/Mac (include in `~/.bash_profile` to make this available across terminal sessions):
$ export PATH=$PATH:/path/to/aprl-ssp
In Windows (use `setx` instead of `set` to make this available across terminal sessions):
set PATH=%PATH%;Z:\path\to\aprl-ssp
See this page for GUI instructions (there may be better sites). Note that these scripts have only been tested under Ubuntu Linux. Windows users will have to take note to enter path separators as `\` rather than `/`.
In the examples below, it is assumed that the path has been added and the scripts have been made executable (`chmod +x scriptname.py` in Linux). Otherwise, each command must be issued as `python /path/to/aprl-ssp/scriptname.py`. All examples are run assuming this has been configured, and that the working directory is the "examples/" subdirectory (included in the program directory).
All scripts require python (tested with 2.7). The Anaconda distribution is recommended as it comes with the pandas library (assumed to be installed in further instructions). Python is installed with Mac OS X, but a separate installation of the scientific python stack is recommended. Packages can be installed with the pip package manager:
$ pip install packagename
"substructure_search.py" and "substructure_generate_fulltable.py" additionally require Open Babel (link to installation guide included below).
Required python packages:
- chemspipy (search from chemspider). For more information, see http://chemspipy.readthedocs.org/.
Query to CSV is the default. Flags to change this behavior:
- `-D`: search compounds in database rather than ChemSpider
- `-d`: output to database
- `-b`: output to both database and CSV file
Three main arguments:
- `-p`: value of `PREFIX`. Name of prefix to database or CSV tables. When `-D` is present, this also indicates the name of the database to be read in. If unspecified, will default to "queryresults".
- `-i`: value of `INPUTFILE` (optional). Name of file containing compounds; a csv file with a column called "compound". Optional if `-D`; otherwise required.
- `-t`: value of `TOKEN` (optional). Name of file containing ChemSpider token if not "~/.chemspidertoken". See next section.
Input/output included in folder "examples/".
- Query to CSV:
$ spider_query.py -p example -i compounds.csv
This will create files called "examples\_main.csv" and "examples\_alternates.csv".
- Query to database only:
$ spider_query.py -d -p example -i compounds.csv
This will create a file called "examples\_p.db". ("\_p" stands for python "pickle" object created by the shelve module.)
- Query to both database and CSV:
$ spider_query.py -b -p example -i compounds.csv
This will create files called "examples\_main.csv", "examples\_alternates.csv", and "examples\_db.p".
- Database to CSV:
$ spider_query.py -D -p example -i compounds.csv
This will read compounds from a file called "examples\_p.db" and create files called "examples\_main.csv" and "examples\_alternates.csv". If `-i compounds.csv` is omitted, all compounds in the database will be included.
If the ChemSpider token is not written to "~/.chemspidertoken", specify its location with the `-t` argument. For instance,
$ spider_query.py -t /path/to/token.txt -p example -i compounds.csv
Optionally, the argument passed can be the value of the token itself.
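The behavior described above (a `-t` value that is either a file name or the token itself) could be handled along the following lines. This is only a sketch and not the code in "spider_query.py"; the helper name `resolve_token` and its error handling are assumptions made for illustration.

```python
import os

DEFAULT_TOKEN_FILE = "~/.chemspidertoken"  # default location named in this README


def resolve_token(arg=None):
    """Return a ChemSpider token from a file path or a literal value.

    Hypothetical helper: with no argument the default file is read; if
    `arg` names an existing file its contents are read; otherwise `arg`
    itself is treated as the token.
    """
    candidate = os.path.expanduser(arg if arg is not None else DEFAULT_TOKEN_FILE)
    if os.path.isfile(candidate):
        with open(candidate) as handle:
            return handle.read().strip()
    if arg is None:
        raise IOError("no token file found at " + DEFAULT_TOKEN_FILE)
    return arg.strip()
```

Checking for an existing file first means a token file always takes precedence over a literal value with the same spelling.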
Where possible, superfluous spaces should be removed in the name of the compound. We have found inconsistencies in returned results. For instance:
- "benzo[b]fluoranthene" returns benzo[b]fluoranthene with SMILES string of "c1ccc2c(c1)cc-3c4c2cccc4-c5c3cccc5"
- "benzo[b] fluoranthene" returns (+)-Vincamine with SMILES string of "CC[C@@]12CCCN3[C@@H]1c4c(c5ccccc5n4C@(C(=O)OC)O)CC3"
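Names can be cleaned before they are written to the input file. The sketch below is a heuristic and not part of APRL-SSP: the function names `clean_name` and `write_compounds` are hypothetical, and the rule of dropping a space that follows a closing bracket is an assumption that covers the example above but may not suit every naming convention.

```python
import csv
import re


def clean_name(name):
    """Heuristic cleanup of a compound name (an assumption, not APRL-SSP code)."""
    name = re.sub(r"\s+", " ", name.strip())   # trim ends, collapse runs of whitespace
    name = re.sub(r"\]\s+(?=\w)", "]", name)    # drop a space right after a "]"
    return name


def write_compounds(names, path):
    """Write a CSV with the single column "compound", as expected by -i."""
    with open(path, "w") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["compound"])
        for name in names:
            writer.writerow([clean_name(name)])


# Example: the problematic spelling from the list above is normalized.
write_compounds(["benzo[b] fluoranthene"], "compounds_clean.csv")
```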
Required external programs:
- Open Babel (http://openbabel.org/wiki/Category:Installation)
Required python packages:
- pybel (interface to Open Babel)
Main arguments:
- `-g`: value of `GROUPFILE`. Name of file which contains columns {substructure, pattern}. An additional column, "export", consisting of 0/1 values indicating whether this substructure should be included in `OUTPUTFILE` is allowed.
- `-i`: value of `INPUTFILE`. Name of file which contains columns {compound, SMILES}.
- `-o`: value of `OUTPUTFILE`. Name of file which contains matrix of compound x substructure.
- `-e`: value of `EXPORT` (optional). Name of file which contains list of substructures to include in `OUTPUTFILE`. Overrides "export" column if present in `GROUPFILE`.
Flags:
- `-d`: When present, indicates that `GROUPFILE` (and `EXPORT`, if provided) exists in the subdirectory, `SMARTSpatterns/`, distributed with "substructure_search.py".
$ substructure_search.py -d -g SIMPOLgroups.csv -e SIMPOLexportlist.csv \
-i apinenemech.csv -o apinenemech_SIMPOLgroups.csv
In this example, SIMPOLexportlist.csv provides a list of groups in the order that they should be exported.
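The effect of an export list on the output matrix can be pictured with the sketch below. It is not taken from "substructure_search.py"; the function name `select_export_columns`, the in-memory layout of the counts, and the choice to report missing substructures as 0 are assumptions for illustration.

```python
def select_export_columns(counts, export_list):
    """Restrict a compound x substructure table to `export_list`, in that order.

    `counts` maps compound -> {substructure: count}. Substructures that are
    absent for a compound are reported as 0 (an assumption of this sketch).
    """
    header = ["compound"] + list(export_list)
    rows = [[compound] + [groups.get(name, 0) for name in export_list]
            for compound, groups in sorted(counts.items())]
    return header, rows


# Illustrative data; compound and group names are placeholders.
counts = {"cmpd2": {"x": 1, "y": 2, "z": 3}, "cmpd1": {"y": 5}}
header, rows = select_export_columns(counts, ["z", "x"])
```

Only the listed substructures are kept, and the column order follows the list rather than the order of the pattern file.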
Similar to substructure_search.py but returns atoms associated with each fragment. Technically, this information can be used to enumerate fragments as done by substructure_search.py; but the syntax can be more complex (on account of using set operations) when specifying fragments as a function of other SMARTS patterns.
Required external programs:
- Open Babel (http://openbabel.org/wiki/Category:Installation)
Required python packages:
- pybel (interface to Open Babel)
Main arguments:
- `-g`: value of `GROUPFILE`. Name of file which contains columns {substructure, pattern}. An additional column, "export", consisting of 0/1 values indicating whether this substructure should be included in `OUTPUTFILE` is allowed.
- `-i`: value of `INPUTFILE`. Name of file which contains columns {compound, SMILES}.
- `-o`: value of `OUTPUTPREFIX`. Name of file prefix to be used for generated output files: {PREFIX}_atomcounts.csv, {PREFIX}_groupcounts.csv, {PREFIX}_atomicmass.csv, {PREFIX}_atomfulltable.csv. {PREFIX}_groupcounts.csv is similar to the output of substructure_search.py but does not contain the full set of patterns in the `GROUPFILE`, only the matched ones.
- `-e`: value of `EXPORT` (optional). Name of file which contains list of substructures to include in `OUTPUTFILE`. Overrides "export" column if present in `GROUPFILE`. (currently not implemented)
Flags:
- `-d`: When present, indicates that `GROUPFILE` (and `EXPORT`, if provided) exists in the subdirectory, `SMARTSpatterns/`, distributed with "substructure_search.py".
$ substructure_generate_fulltable.py -d -g MCMgroups.csv \
-i apinenemech.csv -o apinenemech_MCMgroups
Generate functional group and carbon type matrices (from aprl-carbontypes; adapted to python).
Main arguments:
- `-i`: value of `INPUTFILE`. File generated by substructure_generate_fulltable.py; csv format.
- `-o`: value of `OUTPUTPREFIX`. Output prefix.
$ generate_carbontypes.py -i apinene_MCMgroups_atomfulltable.csv -o apinene
Adapted from aprl-carbontypes. Supersedes validation_atoms.R, except that validation_triple.py does not currently export a table of unmatched or non-unique atoms [TODO].
Main arguments:
- `-f`: value of `ATOMFULLTABLE`. File generated by substructure_generate_fulltable.py; csv format.
- `-a`: value of `ATOMCOMMON`. File generated by substructure_search.py with SMARTSpatterns/common_atoms.csv patterns; csv format.
- `-o`: value of `OUTPUTPREFIX`. Output prefix.
$ validation_triple.py -f apinene_MCMgroups_atomfulltable.csv -a apinene_common_atoms.csv -o apinene
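The two criteria can be illustrated with a small sketch. It assumes that completeness means every atom is matched by at least one pattern and that specificity means no atom is matched by more than one; the function name `check_atoms` and the dictionary layout are hypothetical and do not reflect the file formats read by "validation_triple.py".

```python
def check_atoms(atom_groups):
    """Return (unmatched, nonunique) atom identifiers.

    `atom_groups` maps an atom identifier to the list of groups that claim
    it. Under the assumptions of this sketch, completeness holds when
    `unmatched` is empty and specificity holds when `nonunique` is empty.
    """
    unmatched = sorted(a for a, g in atom_groups.items() if len(set(g)) == 0)
    nonunique = sorted(a for a, g in atom_groups.items() if len(set(g)) > 1)
    return unmatched, nonunique


# Illustrative atoms of one molecule; atom labels and group names are placeholders.
atoms = {"O1": ["hydroxyl"], "O2": ["ketone", "ester"], "H7": []}
unmatched, nonunique = check_atoms(atoms)
```

These two lists correspond to the unmatched and non-unique atoms mentioned above.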
Uses SIMPOL.1 (Pankow and Asher, doi:10.5194/acp-8-2773-2008, 2008) to estimate vapor pressure p^0 (atm) and \Delta H (kJ/mole) for a given temperature.
Main arguments:
- `-i`: value of `INPUTFILE`. File generated by substructure_search.py with SMARTSpatterns/SIMPOLgroups.csv patterns; csv format.
- `-o`: value of `OUTFILE`. Name of output file.
- `-t`: value of `TEMP`. Temperature at which to calculate properties.
$ simpol.py -i apinene_SIMPOLgroups.csv -o apinene_props_298.csv -t 298.15
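SIMPOL.1 is a group contribution method: log10 of the vapor pressure is a sum over groups of the group count times a temperature-dependent term b_k(T) = B1/T + B2 + B3*T + B4*ln(T). The sketch below shows only that summation and is not the code in "simpol.py". The function names are hypothetical, and the coefficients are placeholders chosen to make the arithmetic easy to follow; they are NOT the published SIMPOL.1 values.

```python
import math


def group_term(coeffs, temp):
    """b_k(T) = B1/T + B2 + B3*T + B4*ln(T) for one group."""
    b1, b2, b3, b4 = coeffs
    return b1 / temp + b2 + b3 * temp + b4 * math.log(temp)


def log10_vapor_pressure(counts, coeff_table, temp):
    """log10 of p^0 (atm): sum over groups of count * b_k(T).

    The constant (zeroeth group) term is included by giving it a count of 1.
    """
    return sum(n * group_term(coeff_table[name], temp)
               for name, n in counts.items())


# Placeholder coefficients (B1, B2, B3, B4); not the real SIMPOL.1 table.
placeholder_coeffs = {
    "zeroeth group": (0.0, 1.0, 0.0, 0.0),
    "hydroxyl": (-300.0, 0.0, 0.0, 0.0),
}
counts = {"zeroeth group": 1, "hydroxyl": 2}
log10_p = log10_vapor_pressure(counts, placeholder_coeffs, 300.0)
```

The group counts play the role of one row of the file passed with `-i`, which is why the groups must be exported in the order expected by the coefficient table.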
Patterns specified in GROUPFILE can be derived from a combination of SMARTS patterns using set operations. For instance, `ester, all` is defined as `"[CX3,CX3H1](=O)[OX2H0][#6]"`. `nitroester` is defined as `"[#6][OX2H0][CX3,CX3H1](=O)[C;$(C[N+](=O)[O-]),$(CC[N+](=O)[O-]),$(CCC[N+](=O)[O-]),$(CCCC[N+](=O)[O-]),$(CCCCC[N+](=O)[O-])]"`. `ester` can be defined as `{ester, all}-{nitroester}`. When present, such custom patterns are computed after all the SMARTS patterns have been matched and counted. Additionally, functions can be provided by the user. In the current implementation, functions would presumably use Open Babel methods.
- SMARTS pattern.
- Bracketed expression. E.g., `{ester, all}-{nitroester}`.
- Quoted expression. Substituted patterns in `'{}` are not evaluated before passing to function. E.g., `count_nitrophenol(molecule,'{phenol},'{ester})`.
- Keyword `eval` denotes another type of python expression, but can also be used to preface expressions containing brackets (redundant). E.g., `eval 1`, `eval count_aromatic_rings(molecule)`.
- Keyword `molecule` denotes the pybel Molecule class instance (useful for passing to user-defined functions). E.g., `eval count_aromatic_rings(molecule)`.
If only SMARTS patterns are used, the same pattern file can be provided to "substructure_search.py" and "substructure_generate_fulltable.py". When additional expressions are supplied, they must be written such that arithmetic operations are used for matched entries in "substructure_search.py" and set operations for "substructure_generate_fulltable.py".
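The difference between the two interpretations can be seen by evaluating the same bracketed expression once on counts and once on sets of matched atoms. The sketch below is an illustration only and does not reproduce how the scripts parse pattern files; the function name `evaluate` and the data layout are assumptions.

```python
import re


def evaluate(expression, matches):
    """Evaluate a bracketed expression such as "{ester, all}-{nitroester}".

    Each {name} is replaced by a lookup into `matches`, and the result is
    evaluated with Python operators: "-" subtracts when the values are
    counts and takes a set difference when they are sets of atom tuples.
    """
    python_expr = re.sub(r"\{([^{}]+)\}",
                         lambda m: "matches[%r]" % m.group(1),
                         expression)
    # Builtins are withheld; expressions should still come from trusted files.
    return eval(python_expr, {"__builtins__": {}}, {"matches": matches})


expression = "{ester, all}-{nitroester}"

# Arithmetic on counts (the substructure_search.py view).
n_ester = evaluate(expression, {"ester, all": 3, "nitroester": 1})

# Set operations on matched atom indices (the full-table view);
# the atom index tuples here are made up for illustration.
ester_atoms = evaluate(expression, {
    "ester, all": {(1, 2, 3, 4), (7, 8, 9, 10)},
    "nitroester": {(7, 8, 9, 10)},
})
```

With counts the result is a number of groups; with sets it is the collection of matches that remain after removing those also claimed by the subtracted pattern.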
Users can define functions to be called for evaluation. These functions should be contained in a file called "userdef.py". They will generally accept as their first argument the pybel Molecule object upon which operations are to be performed. In the pattern file, the argument to the function should be provided as `molecule` (lowercase), as this is the object whose value will be evaluated.
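A user-defined function might look like the following sketch of a "userdef.py" file. The function name is borrowed from the `eval count_aromatic_rings(molecule)` example above, but this body is an assumption and not code distributed with APRL-SSP. It relies on the pybel Molecule attribute `sssr` (the smallest set of smallest rings) and on the Open Babel ring method `IsAromatic()`.

```python
# userdef.py (sketch)


def count_aromatic_rings(molecule):
    """Count aromatic rings in a pybel Molecule.

    Assumes `molecule.sssr` yields Open Babel ring objects that provide
    IsAromatic(), as pybel Molecule objects do.
    """
    return sum(1 for ring in molecule.sssr if ring.IsAromatic())
```

In the pattern file, the corresponding entry would be written as `eval count_aromatic_rings(molecule)`.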
(Note this has been superseded by "validation_triple.py".)
Three scripts are provided for generating validation figures:
- "run_validation.py": Calls "validation_atoms.R" and "validation_groups.R" and applies to specific CSV files in "validation/" directory (output figures are also placed there).
- "validation_atoms.R": Tests completeness and specificity at the atom level (2 plots generated for each set of compounds).
- "validation_groups.R": Tests that the numbers of matched groups agree with manually enumerated values (1 plot generated for each set of compounds).
The main script, "run_validation.py", can be run in the "aprl-ssp/" directory. It will read in two lists of files and apply the comparison against reference and matched groups.
"validation/filelist_atoms.csv" for validation of atom membership criteria (completeness and specificity) includes:
- apinenemech.csv
- apinenepropenemech.csv
- tmbmech.csv
- tmbpropenemech.csv
"validation/filelist_groups.csv" for validation of group counts against manually enumerated values:
- compounds_SIMPOLgroups.csv
- compounds_FTIRextra.csv
Required software for validation:
- python (tested with 2.7)
- R (http://cran.r-project.org), also free. Required packages: `reshape2`, `dplyr`.
> install.packages(c("reshape2","dplyr"))