edrn / cancerdataexpo

Buildout for the EDRN backend data application server we affectionately call the CancerDataExpo

Home Page: https://edrn.jpl.nasa.gov/cancerdataexpo

License: Apache License 2.0

Python 99.87% Dockerfile 0.13%

cancer knowledge ontology rdf

cancerdataexpo's Issues

Buildout fails making "zope" recipe

The buildout of this software fails with an error:

TypeError: 'Version' object has no attribute '__getitem__'

The cause seems to be an older version of plone.recipe.zope2instance that is incompatible with a later setuptools. Pinning plone.recipe.zope2instance to 4.4.1 might help.

For more info, see: https://community.plone.org/t/buildout-typeerror-version-object-has-no-attribute-getitem/6607

Support Project Scientist and Program Officer

According to Jackie Dahlgren in <MW3PR11MB4617C5475637324BFFB90006C2A8A@MW3PR11MB4617.namprd11.prod.outlook.com>, there are two new roles appearing in the SOAP XML API for the Committee_Membership operation:

Project Scientist
Program Officer

Currently the CancerDataExpo treats these as mere "members", but they should have distinguished roles in the RDF.

Support DataSharingPolicyDT slot on Site() API

BW said:

In our TEST region, Perdy, both the Registered_Person() with new fields InterestNameList/ InterestDescList and Site() with new field DataSharingPolicyDT are available for you to test. You’ve previously tested there before, so hopefully you’ve still got access to that connection point and everything tests out smoothly. Let us know.

GenericSetup error

Buildout results in:

Got Products.PloneLDAP 1.2.
Version and requirements information containing products.genericsetup:
  [versions] constraint on products.genericsetup: 1.7.7
  Requirement of plone.app.ldap: Products.GenericSetup>=1.8.2
  Requirement of plone.app.dexterity: Products.GenericSetup
  Requirement of plone.app.caching: Products.GenericSetup
  Requirement of Products.CMFPlone: Products.GenericSetup>=1.4
  Requirement of Products.CMFPlacefulWorkflow: Products.GenericSetup
While:
  Installing zope.
Error: The requirement ('Products.GenericSetup>=1.8.2') is not allowed by your [versions] constraint (1.7.7)

Pinning Products.GenericSetup to 1.8.2 might help.

LabCAS Collaborative Group inconsistency

Map LabCAS's inconsistent naming of collaborative groups to the official group names.

The names currently in EDRN LabCAS Solr are:

Breast and Gynecologic (missing "Cancers Research Group") ❌
Breast/GYN ❌
GI and Other Associated (missing periods, "Cancers Research Group") ❌
Lung and Upper Aerodigestive Cancers Research Group ✅
Lung and Upper Aerodigestive (missing "Cancers Research Group") ❌
Lung and Upper Areodigestive (misspelled "aerodigestive", missing words) ❌
Not Applicable (not a collaborative group) ❌
Prostate and Urologic (missing "Cancers Research Group") ❌
TBD (not a collaborative group) ❌

The official names are:

Breast and Gynecologic Cancers Research Group
G.I. and Other Associated Cancers Research Group
Lung and Upper Aerodigestive Cancers Research Group
Prostate and Urologic Cancers Research Group

RDF for Protocols has incorrect cancerType

The RDF for protocols from the CancerDataExpo looks like this:

  <ns2:Protocol rdf:about="http://edrn.nci.nih.gov/data/protocols/288">
        …
        <ns1:cancerType>174, 182, 183, 182                                                                                  </ns1:cancerType>
        …

That cancerType is useless; it should be an rdf:resource to the corresponding Disease object.
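
As a minimal sketch of the conversion step, the padded, comma-separated code string could be split into one URI per distinct code, each of which would then be emitted as an rdf:resource. The prefix below is hypothetical; this issue does not state the actual URI scheme of the Disease objects, and the function name is made up for illustration.

```python
# Hypothetical prefix: the real Disease URI scheme is not given in this issue.
DISEASE_URI_PREFIX = "urn:example:disease:"


def cancer_type_uris(raw, prefix=DISEASE_URI_PREFIX):
    """Split a padded, comma-separated code string into unique resource URIs.

    Order of first appearance is preserved; repeated codes are dropped.
    """
    seen = set()
    uris = []
    for code in raw.split(","):
        code = code.strip()
        if code and code not in seen:
            seen.add(code)
            uris.append(prefix + code)
    return uris
```

With the value shown above, the repeated 182 collapses to a single reference and the trailing whitespace is discarded.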

Collaborative Group Filter

In LabCAS, we need to map these to proper values:

Breast and Gynecologic (missing "Cancers Research Group") ❌
Breast/GYN ❌
GI and Other Associated (missing periods, "Cancers Research Group") ❌
Lung and Upper Aerodigestive (missing "Cancers Research Group") ❌
Lung and Upper Areodigestive (misspelled "aerodigestive", missing words) ❌
Prostate and Urologic (missing "Cancers Research Group") ❌

And these should drop the collaborative group predicates:

Not Applicable (not a collaborative group) ❌
TBD (not a collaborative group) ❌
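
The mapping and the drop rule could be sketched as a lookup table, assuming the "proper values" are the four official group names. The function name is hypothetical, and passing unrecognized values through unchanged is a choice made here, not something this issue specifies.

```python
# Variants seen in LabCAS Solr, keyed in lowercase, mapped to official names.
OFFICIAL_NAMES = {
    "breast and gynecologic": "Breast and Gynecologic Cancers Research Group",
    "breast/gyn": "Breast and Gynecologic Cancers Research Group",
    "gi and other associated": "G.I. and Other Associated Cancers Research Group",
    "lung and upper aerodigestive": "Lung and Upper Aerodigestive Cancers Research Group",
    "lung and upper areodigestive": "Lung and Upper Aerodigestive Cancers Research Group",
    "prostate and urologic": "Prostate and Urologic Cancers Research Group",
}

# Values that are not collaborative groups; no predicate should be emitted.
NOT_GROUPS = {"not applicable", "tbd"}


def normalize_group(name):
    """Return the official group name, or None if the predicate should be dropped."""
    cleaned = " ".join(name.split())
    key = cleaned.lower()
    if key in NOT_GROUPS:
        return None
    # Unrecognized values (including already-official names) pass through as-is.
    return OFFICIAL_NAMES.get(key, cleaned)
```

A caller would skip writing the collaborative group predicate whenever the function returns None.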

Old Plone Hotfixes prevent running

A number of old Plone hotfixes are installed; these prevent HTTP requests from being serviced:

AttributeError: 'module' object has no attribute 'decode_htmlentity'

Removing the hotfixes from the Zope instance's eggs list helps. However, we should ascertain whether there are newer or better hotfixes that go with the version of Plone used in the CancerDataExpo.

BioMuta data uses incorrect object class

Apparently when David added BioMuta data he reused the object class for Biomarkers, i.e., http://edrn.nci.nih.gov/rdf/rdfs/bmdb-1.0.0#Biomarker. But that's the class for actual biomarkers, not mutation data.

The new portal mixes all the RDF into a single statement database which results in mutations and biomarkers being treated the same, and it cannot ingest mutations as biomarkers.

A better object class would be urn:edrn:types:biomarkers:mutation.

Staff photo resource URIs are incorrect

Grab a copy of https://edrn.jpl.nasa.gov/cancerdataexpo/rdf-data/registered-person/@@rdf

Look at, for example, Dean Brenner's photo URI:

http://edrn.jpl.nasa.gov/dmcc/staff-photographs/piPhoto67.gif

That URI is a URL, but it returns 404 Not Found when it should resolve. The correct URI is

https://edrn.jpl.nasa.gov/cancerdataexpo/staff-photographs/piphoto67.gif/@@images/image.gif
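
A minimal sketch of the rewrite, generalizing from this single example: it assumes every photo follows the same pattern (lowercased file name under the cancerdataexpo path, with an @@images/image suffix that reuses the file extension). That pattern is inferred from the one URI pair above and is not confirmed for other photos; the function name is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

NEW_BASE = "https://edrn.jpl.nasa.gov/cancerdataexpo/staff-photographs/"


def corrected_photo_uri(old_uri):
    """Rewrite an old /dmcc/staff-photographs/ URI to the assumed new form."""
    # Take the last path segment and lowercase it (piPhoto67.gif -> piphoto67.gif).
    filename = posixpath.basename(urlsplit(old_uri).path).lower()
    # Reuse the extension for the image view (assumption based on one example).
    extension = posixpath.splitext(filename)[1]
    return f"{NEW_BASE}{filename}/@@images/image{extension}"
```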

MemberGroup API support

DMCC has added a new API called MemberGroup to their SOAP API.

We will need to convert from proprietary SOAP to neutral RDF.

Upgrade to Python 3

Like it says

Cull Old RDF

The CancerDataExpo app saves every single RDF file it generates whenever its upstream data changes. While this is nice in theory, there's probably no use for reviewing a list of protocols from 2017 when the ones in 2020 are authoritative.

The app should automatically archive or just outright delete older RDF.
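
One possible retention rule, sketched under the assumption that each saved RDF file can be represented as a (timestamp, identifier) pair: keep the newest few and hand the rest to whatever archives or deletes them. The function name and the default of three retained files are hypothetical.

```python
from datetime import datetime


def select_stale(snapshots, keep=3):
    """Return the snapshots to archive or delete, oldest first.

    `snapshots` is an iterable of (datetime, identifier) pairs; the newest
    `keep` entries are retained and everything older is returned.
    """
    if keep < 0:
        raise ValueError("keep must not be negative")
    ordered = sorted(snapshots, key=lambda snapshot: snapshot[0])
    cutoff = max(len(ordered) - keep, 0)
    return ordered[:cutoff]
```

For example, with files from 2017, 2018 and 2020 and keep=1, the 2017 and 2018 files are selected for removal and the 2020 file stays.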

Support "Interest" and "Interest Description" slots on Registered_Person() API

The SOAP API at the DMCC is being upgraded so that the Registered_Person() endpoint will have two new slots:

Interest (varchar 500) is a |-separated list of interests that a person possesses
Interest Description (varchar 8000) is a |-separated list of in-depth summaries of those interests
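
A sketch of how the two slots might be parsed before conversion to RDF. It assumes the two |-separated lists are positionally aligned (the first description belongs to the first interest, and so on), which this issue does not state; the function name is hypothetical.

```python
def parse_interests(interest, description):
    """Pair each interest with its description from two |-separated strings.

    Missing descriptions become empty strings; blank interests are skipped.
    """
    names = [part.strip() for part in interest.split("|")] if interest else []
    descs = [part.strip() for part in description.split("|")] if description else []
    # Pad so that every interest has a (possibly empty) description.
    descs += [""] * (len(names) - len(descs))
    return [(name, desc) for name, desc in zip(names, descs) if name]
```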

Dockerize

What it says: use Docker to "containerize" the application and make a Docker Composition so we don't need to rely on sysadmins to participate in deployment of this.

Consortium Filter

For LabCAS RDF generation, filter on the "Consortium" field from Solr; include "EDRN" only.

Cannot buildout

A new buildout of this software fails trying to find distributions for z3c.recipe.staticlxml. In addition, the pinned setuptools prevents it from working, and zc.buildout needs to be unpinned as well.

Adding these allow-hosts seems to help:

[buildout]
allow-hosts +=
    oodt.jpl.nasa.gov
    pypi.fury.io
    *.githubusercontent.com
    *.github.com
    *.python.org
    *.plone.org
    launchpad.net
    files.pythonhosted.org
    pypi.org
    effbot.org

In addition, these version unpins are needed:

[versions]
setuptools =
zc.buildout =

Furthermore, these version pins are needed:

[versions]
biopython = 1.66
Products.GenericSetup = 1.8.2
plone.recipe.zope2instance = 4.4.1
Products.LDAPUserFolder = 2.27

Need separate DB server

Currently, the CancerDataExpo uses a single Zope instance to act as both app server and database server. This means we cannot do database maintenance (specifically packing the database) without also bringing down the app server.

The application should use a separate Zope instance as app server, backed by a Zope Enterprise Objects (ZEO) instance as database server.

This will also let us do daily database backups.

Automated Docker Imaging

So we don't have to hand-build multi-architecture images anymore

Add clearance

The clearance = CL № 22-6806

LabCAS RDF should have public data only

As the title says: LabCAS RDF should have public data only. Currently it has everything.

Provide more information on LabCAS to the portal

The issue EDRN/P5#102 requests that the same graphs that appear on EDRN LabCAS also appear on the portal. However, the RDF from LabCAS produced by the CancerDataExpo doesn't include the additional information necessary to produce these graphs, specifically

Number of datasets
Number of files

In addition, it may be more efficient to have the CancerDataExpo produce the graphs that the portal can then ingest for rapid display; this could happen too in this issue.

LabCAS RDF needs additional fields

In order to support statistical charting in the new Wagtail portal, LabCAS RDF generation needs these predicates:

discipline (proteomics, genomics…)
data category (mass spectrometry, DNA microarray…)

LabCAS Collection numbers incorrect

The LabCAS RDF generator produces a statement

  <ns2:statistics rdf:about="https://edrn-labcas.jpl.nasa.gov/data-access-api/collections">
    <ns1:cardinality>44</ns1:cardinality>
  </ns2:statistics>

but then has just 36 <ns2:collection> objects in its output.

The reason is that it asks Solr for the count of all collections and uses that for the <ns2:statistics> object. It then iterates over the collections and discards every collection whose consortium is not EDRN. That's why we end up with 36 < 44. We should constrain the searches to just Consortium:EDRN.
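
One way to keep the two numbers in step is to build both the count query and the listing query from the same filter. The sketch below only constructs standard Solr query parameters (q, fq, rows, wt); the function name is hypothetical and the Solr endpoint URL is deliberately left out because it is not given here.

```python
from urllib.parse import urlencode


def labcas_collection_query(rows, consortium="EDRN"):
    """Build a Solr query string restricted to a single consortium.

    Use rows=0 to fetch only the match count, or a positive number to
    fetch the collections themselves; both share the same filter query.
    """
    params = {
        "q": "*:*",
        "fq": f"Consortium:{consortium}",
        "rows": rows,
        "wt": "json",
    }
    return urlencode(params)
```

Because the cardinality and the iteration would both come from queries with the same fq, the count written to the statistics object should match the number of collection objects emitted.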
