openchemistry / 42

The answer

Open Chemistry

Introduction

The Open Chemistry project is a collection of open source, cross-platform libraries and applications for the exploration, analysis and generation of chemical data. The project builds upon various efforts by collaborators and innovators in open chemistry such as the Blue Obelisk, Quixote and the associated projects. We aim to improve the state of the art and to facilitate the open exchange of ideas and chemical data, leveraging the best technologies in quantum chemistry codes, molecular dynamics, informatics and visualization.

This repository contains git submodules for the Open Chemistry projects: Avogadro, MoleQueue and MongoChem. It can be used to download all relevant source files and to build many of the necessary dependencies. Please see the documentation in the submodules for more details about each project.

Installing

We provide nightly binaries built by our dashboards for Mac OS X and Windows. If you would like to build from source, we recommend that you follow our building Open Chemistry guide, which takes care of building most dependencies.

Contributing

Our project uses the standard GitHub pull request process for code review and integration. Please check our development guide for more details on developing and contributing to the project. The GitHub issue tracker can be used to report bugs, make feature requests, etc.

Our wiki is used to document features, flesh out designs and host other documentation. Our API is documented using Doxygen, with updated documentation generated nightly. We have several mailing lists to coordinate development and to provide support.

42's Issues

Remote Open Chemistry server from local JupyterLab

We want to be able to connect to a specified Open Chemistry server using the server URL for public API/searches, and a username/API key for privileged endpoints. This can be used with a local JupyterLab instance, or central resources such as those provided at NERSC.
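As a rough sketch of what the connection settings could look like on the Python side: the class name `ServerConnection`, the `X-Api-Key` header name and the example URL below are all assumptions for illustration, not an existing Open Chemistry API.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import urljoin


@dataclass
class ServerConnection:
    """Connection settings for a remote server (hypothetical helper)."""

    server_url: str
    username: Optional[str] = None
    api_key: Optional[str] = None

    @property
    def is_privileged(self) -> bool:
        # Privileged endpoints need both a username and an API key.
        return self.username is not None and self.api_key is not None

    def endpoint(self, path: str) -> str:
        # Join the base URL and a relative path without doubling slashes.
        return urljoin(self.server_url.rstrip("/") + "/", path.lstrip("/"))

    def headers(self) -> Dict[str, str]:
        # Public requests carry no credentials. The header name is an
        # assumption made for this sketch.
        if not self.is_privileged:
            return {}
        return {"X-Api-Key": self.api_key}
```

A public connection would only set `server_url`; the same object with a username and key could then be reused for privileged calls from a local JupyterLab or a central deployment.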

Search API and interfaces to it

Currently, endpoints such as molecules and calculations either return every element in the database, which will not scale, or search on a few fields such as name, inchi, etc. Our search capabilities are also quite limited at present. On the backend, I would say we want to support search with some common features across our different collections/endpoints:

Automatically limit to 25 search results
Ensure returned data is useful, but not too big
Support for changing the limit
Support for specifying an offset
Support for ordering by parameters (default to most recent first)
Support for changing the sort order/direction

The returned data should follow a similar pattern too, with a JSON object containing high level summary of the results, and a results array containing result objects:

{
  "matches": 42,
  "limit": 2,
  "offset": 0,
  "results": [ { ... }, { ... } ]
}
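To make the limit/offset/ordering behaviour concrete, here is a minimal in-memory sketch that returns the envelope shown above. The function name `search`, the `created` sort field and the list-of-dicts storage are assumptions for illustration; a real backend would push these options down into the database query.

```python
from typing import Any, Dict, List

DEFAULT_LIMIT = 25  # mirrors the "automatically limit to 25" item above


def search(
    records: List[Dict[str, Any]],
    limit: int = DEFAULT_LIMIT,
    offset: int = 0,
    sort_by: str = "created",  # hypothetical timestamp-like field
    descending: bool = True,   # most recent first by default
) -> Dict[str, Any]:
    """Return one page of records wrapped in a summary object."""
    if limit < 0 or offset < 0:
        raise ValueError("limit and offset must be non-negative")
    ordered = sorted(records, key=lambda record: record[sort_by], reverse=descending)
    return {
        "matches": len(ordered),  # total hits, not just this page
        "limit": limit,
        "offset": offset,
        "results": ordered[offset:offset + limit],
    }
```

Reporting `matches` separately from the page lets a client work out how many further pages exist without fetching them.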

I think we need to work on extending our concept of users to include some useful data such as ORCID, Twitter username, etc. that can be set publicly, so that you might search on a name, ORCID, etc. to see results for that person in molecules, calculations, ... There are a few things we should try to get working in search too, including queries like USER AND heavy atom count = 10, > 10, etc. The same applies to molecular weight, formula, InChI, InChI key and SMILES.
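One possible shape for such queries, purely as a sketch: conditions of the form `field operator value` joined by `AND`. The query syntax and the field names (`owner`, `heavy_atom_count`) are assumptions made here for illustration and have not been agreed on anywhere.

```python
import operator
import re
from typing import Any, Callable, Dict, List, Tuple

Condition = Tuple[str, Callable[[Any, Any], bool], Any]

OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}
# Two-character operators are listed first so ">=" is not read as ">".
CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|=|>|<)\s*(\S+)\s*$")


def parse_query(query: str) -> List[Condition]:
    """Split 'a = 1 AND b > 2' into (field, comparison, value) tuples."""
    conditions: List[Condition] = []
    for part in query.split(" AND "):
        match = CONDITION.match(part)
        if match is None:
            raise ValueError(f"cannot parse condition: {part!r}")
        field, symbol, raw = match.groups()
        try:
            value: Any = float(raw)  # numeric fields such as atom counts
        except ValueError:
            value = raw  # text fields such as formula or SMILES
        conditions.append((field, OPERATORS[symbol], value))
    return conditions


def matches(record: Dict[str, Any], conditions: List[Condition]) -> bool:
    # A record matches only if every condition holds. Ordering comparisons
    # assume the stored value and the query value have compatible types.
    return all(
        field in record and compare(record[field], value)
        for field, compare, value in conditions
    )
```

For example, `parse_query("owner = marcus AND heavy_atom_count > 10")` yields two conditions that can be checked against each molecule record with `matches`.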

I think starting with molecules search is good as it is simpler. Calculations would then be good to think about, with queries such as calculations run by Marcus using NWChem or Psi4, sorted by most recent. These should come after the card stuff; this likely needs further discussion, but I am writing down some ideas.

Card and table view for molecules

Previous work featured a card view:

and a table view:

We need some equivalents for displaying results to searches, and should look at possibilities for 3D structures that summarize a sequence of results. I think we need to think about this for the single page interface, and also what we can do within Jupyter.

Errors from codes/Docker containers

JSON object communicating high level success/failure. Access to errors, warnings, etc from the Jupyter Python kernel APIs.
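A minimal sketch of what such a status object might look like, assuming the container reports an exit code and that log lines starting with `ERROR` or `WARNING` are worth surfacing. Both the `RunStatus` fields and the log prefixes are assumptions, since each code formats its output differently.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Iterable, List


@dataclass
class RunStatus:
    """High level outcome of one container run (hypothetical schema)."""

    success: bool
    exit_code: int
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def summarize_run(exit_code: int, log_lines: Iterable[str]) -> RunStatus:
    """Collect errors and warnings from a log into a RunStatus."""
    errors: List[str] = []
    warnings: List[str] = []
    for line in log_lines:
        text = line.strip()
        if text.upper().startswith("ERROR"):
            errors.append(text)
        elif text.upper().startswith("WARNING"):
            warnings.append(text)
    # A run counts as successful only with a zero exit code and no errors.
    succeeded = exit_code == 0 and not errors
    return RunStatus(succeeded, exit_code, errors, warnings)
```

The Jupyter kernel API could then expose `errors` and `warnings` as plain attributes, while `to_json()` gives the object that is stored or sent over the wire.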

Machine learning Jupyter kernel objects/API

Inputs, APIs, and data flow for machine learning tasks.

Export triples in ingest, add to Jena

Export triples for InChI, energy, SMILES, push to Jena triple store, expose as an endpoint.
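As an illustration of the export step only, the sketch below serializes one molecule to N-Triples lines using plain string formatting. The base URI and predicate names under `example.org` are placeholders, not an agreed vocabulary, and pushing the result to Jena is left out.

```python
from typing import List

# Placeholder namespace; a real deployment would pick its own vocabulary.
BASE = "http://example.org/openchemistry/"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def literal(value: str) -> str:
    """Quote a string as an N-Triples literal."""
    # Backslashes occur in SMILES (cis/trans bonds), so escape them first.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def molecule_triples(molecule_id: str, inchi: str, smiles: str, energy: float) -> List[str]:
    """Return N-Triples lines for the InChI, SMILES and energy of a molecule."""
    subject = f"<{BASE}molecule/{molecule_id}>"
    return [
        f"{subject} <{BASE}inchi> {literal(inchi)} .",
        f"{subject} <{BASE}smiles> {literal(smiles)} .",
        f'{subject} <{BASE}energy> "{float(energy)!r}"^^<{XSD_DOUBLE}> .',
    ]
```

Joining the returned lines with newlines gives a document that a triple store can load in bulk during ingest.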

Run multiple SMILES in one docker container

In order to be able to do this, I think we will need to modify a little bit of our calculation workflow. @alesgenova can correct me if I'm wrong about any parts, but I think this is what happens:

Before the docker container runs, a calculation is posted to the calculations collection. This calculation contains a single molecule ID and most of the needed input. It is considered to be a pending calculation.
A taskflow is created using information from this calculation. It gets the molecule information via the molecule ID, converts it to the format needed by the docker container, and also gets the input from the calculation. The docker container is then run for this single molecule.
When the docker container is finished, the output is written to the calculation and then put back into the database (overwriting the original calculation used for the input).
When a user tries to run a calculation in the future, it checks to see if a calculation already exists that used that moleculeId, input parameters, and docker image (to avoid re-running the same calculation).
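The duplicate check in the last step could be sketched as follows, assuming a calculation is identified by its molecule ID, input parameters and docker image. The helper names and the dict standing in for the calculations collection are hypothetical; the real check presumably runs as a database query.

```python
import hashlib
import json
from typing import Any, Dict, Tuple


def calculation_key(molecule_id: str, input_parameters: Dict[str, Any], image: str) -> str:
    """Fingerprint the three fields that define 'the same calculation'."""
    payload = json.dumps(
        {"moleculeId": molecule_id, "input": input_parameters, "image": image},
        sort_keys=True,  # parameter order must not change the fingerprint
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_or_create(
    store: Dict[str, Dict[str, Any]],
    molecule_id: str,
    input_parameters: Dict[str, Any],
    image: str,
) -> Tuple[Dict[str, Any], bool]:
    """Return (calculation, created); reuse an existing one when possible."""
    key = calculation_key(molecule_id, input_parameters, image)
    if key in store:
        return store[key], False
    calculation = {
        "moleculeId": molecule_id,
        "input": dict(input_parameters),
        "image": image,
        "status": "pending",  # filled in once the container finishes
    }
    store[key] = calculation
    return calculation, True
```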

To run multiple calculations in a single docker container in general (such as running multiple SMILES through chemml), I think we'll need to modify this workflow.

How? I'm not quite sure yet. But I have one proposal at least: what if we make it so that a calculation can have a list of molecule IDs instead of just a single molecule ID? And then the single docker container can run the same input parameters on all of the molecules in the list in one go.
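To make the proposal a bit more tangible, here is one way a calculation with a list of molecule IDs might be represented and filled in. The field names (`moleculeIds`, `results`) and both helpers are assumptions for discussion, not the current schema.

```python
from typing import Any, Dict, List, Sequence


def make_batch_calculation(
    molecule_ids: Sequence[str], input_parameters: Dict[str, Any], image: str
) -> Dict[str, Any]:
    """Build one pending calculation covering several molecules."""
    # Drop repeated IDs while keeping the order they were given in.
    unique_ids: List[str] = list(dict.fromkeys(molecule_ids))
    if not unique_ids:
        raise ValueError("at least one molecule ID is required")
    return {
        "moleculeIds": unique_ids,
        "input": dict(input_parameters),
        "image": image,
        "status": "pending",
        "results": {},
    }


def record_results(calculation: Dict[str, Any], outputs: Sequence[Any]) -> Dict[str, Any]:
    """Attach container outputs, given in the same order as moleculeIds."""
    if len(outputs) != len(calculation["moleculeIds"]):
        raise ValueError("expected exactly one output per molecule")
    calculation["results"] = dict(zip(calculation["moleculeIds"], outputs))
    calculation["status"] = "complete"
    return calculation
```

Keying the results by molecule ID would still allow a per-molecule lookup, which matters if the check for existing calculations is to keep working.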

Thoughts, @alesgenova, @cjh1, and @cryos?

Notes on issues with data.openchemistry.org

I was just taking a look at the new deployment, and thought I would point out a few issues I saw as I was using it.

Pasting in a link from a view I was looking at results in a 404, e.g. here
The menu on the visualization for that page (or any of the menus on the visualization widget) is broken, just shows text but no actual menu
Need user API key management piece

I think that is all I have for now, it would be good to get both working soon.

Geometry collection, linking to molecule collection

This collection would always have a parent molecule, and have one geometry, with optional provenance, i.e. from calculation a, generated from InChI using Open Babel, from structure resolver, etc.
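A possible document shape for this, sketched with dataclasses. The class and field names are hypothetical, and the provenance `source` values simply echo the examples given above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    """Where a geometry came from (hypothetical schema)."""

    source: str  # e.g. "calculation", "openbabel", "structure-resolver"
    reference: Optional[str] = None  # e.g. the ID of the originating calculation


@dataclass(frozen=True)
class Geometry:
    """One set of coordinates that always belongs to a parent molecule."""

    molecule_id: str
    atomic_numbers: Tuple[int, ...]
    coordinates: Tuple[Tuple[float, float, float], ...]
    provenance: Optional[Provenance] = None  # provenance is optional

    def __post_init__(self) -> None:
        if not self.molecule_id:
            raise ValueError("a geometry always needs a parent molecule")
        if len(self.atomic_numbers) != len(self.coordinates):
            raise ValueError("one coordinate triple is needed per atom")
```

Several `Geometry` documents can then point at the same `molecule_id`, for example one generated from InChI and one taken from an optimization.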

Plots/graphs in notebooks

Plotly and bqplot offer significant out of the box functionality for standard plots in notebooks. We should integrate some examples of their use, especially for simple plots where we just want to display scatter plots, bar charts, etc. This likely involves adding some optional dependencies, and ensuring they are present in our demonstration installation.