
rd-datamodel's Introduction

Unified Molgenis Data Model

Welcome! The Unified Molgenis Data Model (UMDM) is a plug and play Molgenis EMX model for collating metadata on patients and samples, analyses that were performed on the samples, and files that were generated. The model comprises several modules built on the FAIR data principles and the FAIR Genomes Semantic Model. The module system enables the model to be adapted for research or care contexts.

Download the latest version of the model or view the model schema to see what's in the model.

Features

Here is what is included.

  • 📦 Unified Model: a plug and play FAIR data model for Molgenis databases on rare diseases. The model contains modules for patients, studies, consent, clinical events, biomaterials collected, preparation of samples, sample sequencing, and metadata from generated files.
  • 📚 Reference Datasets: an extensive library of reference datasets standardized to ontologies, international standards, and field specific standards.
  • 👥 User Module: table structure and script for tracking users, registrations, and authorization levels. Ideal for auditing and database management!
  • 🔧 Jobs: table structure for logging custom jobs and results. This is an extension of the existing jobs entity, but geared towards database management.

For more information on Molgenis, visit molgenis.org.

Getting Started

This repository includes scripts for building the model and collating lookup datasets, but you only need the model and a few other things to get started with building your own database. See the steps below.

  1. Install Molgenis: you will need a server running the latest version of Molgenis. See the Molgenis Documentation for more information about setting up Molgenis. Alternatively, you can use the latest Molgenis Docker container to create an instance on your local machine.
  2. Install the Molgenis Commander: install the latest version of the Molgenis Commander.
  3. Run the setup script: Run the UMDM Setup script. This will import the model and all datasets into your Molgenis instance.

Updating, building and deploying the UMDM

To update the UMDM, follow the steps below.

1 Make changes to the model

Make all of the changes that are requested. For new entities (i.e., tables), open the YAML file, scroll to the entities section, and enter the following information.

entities:
    ...
    - name: myNewTable
      label: My New Table
      description: some description about my new table will go here
      tags: https://url.to.some/ontology/code/that/describes/my_table
      attributes:
        ...
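
Before building, it can help to confirm that a new entity block carries the keys shown above. The sketch below is a minimal check written against a plain Python dictionary (what a YAML parser would return for the block). The function name `missing_entity_keys` and the idea that all five keys are expected are assumptions for illustration, not part of the UMDM tooling.

```python
# Hypothetical pre-build check; not part of the UMDM repository.
# Assumption: every entity should define the five keys used in the example above.
EXPECTED_ENTITY_KEYS = ("name", "label", "description", "tags", "attributes")


def missing_entity_keys(entity: dict) -> list:
    """Return the expected keys that are absent or empty in an entity block."""
    return [key for key in EXPECTED_ENTITY_KEYS if not entity.get(key)]


# The entity from the example, as a YAML parser would load it.
new_table = {
    "name": "myNewTable",
    "label": "My New Table",
    "description": "some description about my new table will go here",
    "tags": "https://url.to.some/ontology/code/that/describes/my_table",
    "attributes": [],  # nothing filled in yet, so this key is reported
}

print(missing_entity_keys(new_table))
```

An empty result means the block has a value for every expected key.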

If you are adding a new lookup, use one of the predefined attribute templates.

  • attributeTemplateDefault: the recommended attribute template where the column value is the primary key. The value, which is a label, will be displayed when referenced in another table
  • attributeTemplateCode: an alternative template where the column code is the primary key. This is useful for lookups where the code should be displayed instead of the label
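
To make the difference between the two templates concrete, the sketch below models a lookup row and shows which column acts as the primary key under each template. Only the template names and their key columns (`value` versus `code`) come from the description above. The helper `primary_key_for` and the row contents are made up for the example.

```python
# Illustrative only: maps each template name to the column used as primary key.
TEMPLATE_PRIMARY_KEY = {
    "attributeTemplateDefault": "value",  # the label is the key and is displayed
    "attributeTemplateCode": "code",      # the code is the key and is displayed
}


def primary_key_for(template: str, row: dict) -> str:
    """Return the primary key value of a lookup row under the given template."""
    return row[TEMPLATE_PRIMARY_KEY[template]]


# Hypothetical lookup row; both values are placeholders.
row = {"value": "Example label", "code": "EX001"}

print(primary_key_for("attributeTemplateDefault", row))
print(primary_key_for("attributeTemplateCode", row))
```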

To add new attributes, locate the appropriate table in the YAML file and create a new block under attributes.

entities:
    ...
    - name: myTableThatIWantToEdit
      label: My table that I want to edit
      description: The table that I want to edit
      tags: https://url.to.some/ontology/code/that/describes/my_table
      attributes:
        ...
        
        - name: myNewAttribute
          description: a description about my new attribute
          dataType: ...
          # refEntity: ... # if type is a ref
          tags: https://url.to.some/ontology/code/that/describes/my_attribute
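
The same edit can be rehearsed on a parsed copy of the model. This is a minimal sketch, assuming the YAML has been loaded into a list of entity dictionaries. `add_attribute` is a hypothetical helper, and the `string` data type is only a placeholder for whichever type fits the new attribute.

```python
def add_attribute(entities: list, table: str, attribute: dict) -> None:
    """Append an attribute to the named table, refusing duplicate names."""
    for entity in entities:
        if entity["name"] == table:
            attributes = entity.setdefault("attributes", [])
            if any(a["name"] == attribute["name"] for a in attributes):
                raise ValueError(f"{attribute['name']} already exists in {table}")
            attributes.append(attribute)
            return
    raise KeyError(f"table not found: {table}")


# Parsed model reduced to the one table being edited.
entities = [{"name": "myTableThatIWantToEdit", "attributes": []}]

add_attribute(
    entities,
    "myTableThatIWantToEdit",
    {
        "name": "myNewAttribute",
        "description": "a description about my new attribute",
        "dataType": "string",  # placeholder; choose the type that fits
        "tags": "https://url.to.some/ontology/code/that/describes/my_attribute",
    },
)

print([a["name"] for a in entities[0]["attributes"]])
```

Rejecting duplicate names early keeps the mistake out of the build step.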

2 Building

Test the model once all changes have been made.

yarn test

This test runs some basic error checking to make sure the model contains valid EMX markup. If errors are detected, a summary is printed at the end of the test. Scroll through the report to find the errors.

When all of the changes have been made, build the model. Make sure the Python library yamlemxconvert is installed, then run the following yarn script.

yarn emx:build

Make sure all tag errors have been resolved (Unable to process tag: None).
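
The message `Unable to process tag: None` suggests that an entity or attribute has no tag value. One way to locate such entries before building is to scan the parsed model, as sketched below. This reading of the message is an assumption, and `find_untagged` is a hypothetical helper rather than part of yamlemxconvert.

```python
def find_untagged(entities: list) -> list:
    """Return 'entity' or 'entity.attribute' paths that have no tags value."""
    untagged = []
    for entity in entities:
        if not entity.get("tags"):
            untagged.append(entity["name"])
        for attribute in entity.get("attributes", []):
            if not attribute.get("tags"):
                untagged.append(f"{entity['name']}.{attribute['name']}")
    return untagged


# Small stand-in for a parsed model; names reuse the examples in this guide.
model = [
    {
        "name": "myNewTable",
        "tags": "https://url.to.some/ontology/code/that/describes/my_table",
        "attributes": [
            {"name": "myNewAttribute", "tags": None},  # empty `tags:` key
            {
                "name": "anotherAttribute",
                "tags": "https://url.to.some/ontology/code/that/describes/my_attribute",
            },
        ],
    }
]

print(find_untagged(model))
```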

3 Deploying

Once the model is built, deploy it to the server. NOTE: make sure you export all data before importing the new model.

yarn m:config
yarn m:predeploy
yarn m:deploy
yarn m:postdeploy

yarn m:demo # if setting up a demo database

If you have added a new lookup table, you will need to update the setup.sh script.

yarn m:refresh-setup
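
If the deployment is driven from a script, the yarn commands above can be run in order and stopped at the first failure. The sketch below is one possible wrapper, assuming `yarn` is on the PATH and the commands are run from the repository root. The function names and the `runner` parameter are hypothetical, and `m:refresh-setup` is left out because its position in the sequence is not stated.

```python
import subprocess

# The four deployment scripts, in the order listed above.
DEPLOY_SCRIPTS = ["m:config", "m:predeploy", "m:deploy", "m:postdeploy"]


def deploy_commands(demo: bool = False) -> list:
    """Build the yarn commands; append m:demo when setting up a demo database."""
    scripts = list(DEPLOY_SCRIPTS)
    if demo:
        scripts.append("m:demo")
    return [["yarn", script] for script in scripts]


def run_deploy(commands: list, runner=subprocess.run) -> None:
    """Run each command in turn; check=True raises on the first failure."""
    for command in commands:
        runner(command, check=True)


# Print the plan without executing anything.
for command in deploy_commands(demo=True):
    print(" ".join(command))

# To execute for real: run_deploy(deploy_commands())
```

Passing the runner as a parameter makes it possible to print or test the sequence without touching a server.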

rd-datamodel's People

Contributors

davidruvolo51, dependabot[bot]


rd-datamodel's Issues

Proof of concept: discussion items

This issue lists all of the items to discuss in order to complete #3.

Package level discussion items

  • What is an appropriate name, label, and description for the data model? At the moment, I have used rdmodel
  • Which tags should be applied at the package level? I have used dcat:catalog
  • Which tags should we include? (this can probably be saved for the end)
  • Should reference entities be defined in another package? I prefer to put lookups in another package. Sometimes users find it difficult to navigate the folders when there are dozens of entities listed in a folder.

Entity level discussion items

  • How should we organize the tables? The RD3 core structure (subjects, samples, labs, files, etc.) has been used in other projects and works well in other contexts. I think the core structure should be kept, but we can move a few things around to make the model more flexible.
  • Should we incorporate row-level attributes for internal record keeping? All tables have dateLastUpdated, but I think we can add more. For example, availability of the data, authorization status, or other flags.

The personal information table

  • What names should we use for the individuals? Patients, subjects, individuals, personal?
  • Other family IDs (maternal, paternal, linked) are not in FG. How should we name these?
  • Should the additional family identifier attributes reference the personal identifier? This means these attributes will be defined as xref or mref and will reference the personal identifier attribute. The family identifiers must be present in the personal identifier attribute or else you will encounter a reference error. In practice, this would mean removing linked family identifiers if they were not included in the data.
  • For date columns, the import format should be standardized (e.g., yyyy-mm-dd). However, for yearsOf* attributes, should these attributes have the class int?
  • Should there also be a yearOfDeath column? (Given that there is a year of birth)
  • inclusionStatus: this attribute is used in COSAS and it would be useful to have for creating subsets. For the prototype, I think we can go ahead and integrate the FG lookup. What should be the default value?
  • How should we define sex? Genotypic or phenotypic? Another attribute altogether or multiple attributes?
  • For country-based attributes and ancestry, we can integrate the FG modules.
  • primaryAffiliation is to capture which organization owns the entry. Ideally, organizations should be standardized to ROR.
  • affiliatedNetworks is designed to group records by research networks such as ERNs. These networks should be standardized to ROR or another reference list. ERNs aren't available in ROR at the moment, but we could use the ern_metadata.csv file.
  • altPersonalIDs can contain any number of additional values. Values should be formatted as a comma separated string.
  • consentStatus: even though there is a separate module for consent information, there should be an attribute at the patient level that indicates if consent information is available or consent was given.
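
The reference constraint raised in the list above can be illustrated with a small sketch. Assuming records are dictionaries with a personal identifier and optional family identifiers, the code clears any family identifier that does not match a known personal identifier, which is the clean-up that would avoid a reference error on import. The field names `subjectID`, `maternalID`, and `paternalID` and the sample values are hypothetical, since the issue has not settled on names.

```python
def drop_unknown_family_ids(records, id_field="subjectID",
                            family_fields=("maternalID", "paternalID")):
    """Return copies of the records with dangling family references cleared."""
    known_ids = {record[id_field] for record in records}
    cleaned = []
    for record in records:
        copy = dict(record)  # leave the input untouched
        for field in family_fields:
            if field in copy and copy[field] not in known_ids:
                copy[field] = None  # the referenced person is not in the data
        cleaned.append(copy)
    return cleaned


# Made-up example: P9 is referenced but has no record of its own.
records = [
    {"subjectID": "P1", "maternalID": "P2", "paternalID": "P9"},
    {"subjectID": "P2"},
]

for row in drop_unknown_family_ids(records):
    print(row)
```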

Develop Proof of Concept

Using the model.yaml file as a starting point, develop a proof of concept using an example dataset. The prototype should demonstrate the following.

  1. We can create a single data model that is FAIR Genomes compatible
  2. The creation and setup of a database is reproducible
  3. The data model can be applied in more than one project

See #4 for additional information.

Main Tasks

To create a prototype, the major steps are listed below. Subtasks are listed in the following sections.

  • Create a single data model that covers a majority of use cases (model.yaml)
  • Create a new molgenis database using the model
  • Write mapping script and populate database with example data

Data Model Tasks

Use FAIR Genomes, RD3, COSAS, or some other system to create the data model.

  • Define table structure (i.e., how data will be organized)
  • Set the names for all tables
  • Determine where attributes belong and the order in the tables
  • Set attribute names
  • Set attribute labels where applicable
  • Set attribute descriptions where applicable
  • Set attribute dataType where possible; ignore reference types for now
  • Define semantic tags where possible

Database Tasks

  • Create a new Molgenis instance or use an existing one
  • Identify example dataset(s)
  • Create mapping script
  • Populate database

Out of scope

There are certain aspects of the prototype that require additional discussion and planning. These are listed below.

  • Lookup Tables: lookup tables (or reference entities) will take a bit more time to design and create. We can use some of the FAIR Genomes lookups as a starting point, but we will need to define a robust process for making example datasets more compatible with FAIR Genomes.

Related Projects

The prototype will also use the yaml-emx-converter module to compile the YAML-EMX.

Feat: incorporate EMX for Variant module

New Variant Classification Module

We would like to create a module for capturing variant information. Using the existing variants project as a starting point, create a module...

  • where attributes follow the same naming format as the rest of the model. e.g., belongsTo..., reasonFor..., etc.
  • that includes semantic tags for variant attributes
  • and lookup tables relevant to the new variant table

Tasks

  • convert existing model to yaml format
  • harmonize attribute names
  • add semantic tags
  • create lookups

Feat: reference for biospecimen usability

In the samples table, the column biospecimenUsability can be used to indicate if a sample can be used for further testing. At the moment, the type is bool, but this is still a bit vague. It would be nice to have a reference table where users could select multiple options to indicate how a sample could be used.

Model Revisions for Initial Release (v1.0)

Discussion Points

  • Name change: what should the new name be? And the appropriate short name? All entity IDs should be adjusted accordingly
  • naming for samples: is it materials, samples, or biospecimen?
  • further discussion needed for resolved (in clinical table; consult experts)
  • Discuss pathologicalState lookup
  • Is the term sequencing too specific?
  • Should we split the sequencing table into sequencing + analysis? (follow FG approach)
  • Discuss how to categorize subsets in studies? Maybe belongsToCohorts?
  • In the files table, we may need to revisit the status variables. Should these be a lookup?
  • Add analyses table and split sequencing table (use Fair Genomes as a guide) WAIT FOR FAIR GENOMES

Changes

This is a sticky for all changes that are needed for v1.0

  • organization entity: move out of lookups
  • releases: move out of lookups
  • in subjects, change inclusionStatus to subjectStatus (update tag)
  • add yearOfDeath and tag
  • record metadata: consider adding who imported/updated the record
  • change labIndication to samplingReason
  • add samplingDate or change sampleTimestamp
  • change samplingProtocol to type reference (value, description, codesystem, code, iri)
  • change samplingProtocolDeviation to type text
  • change biospecimenUsability to type xref or mref and create lookup SEE COMMENT BELOW
  • change pcrFree to type bool
  • change targetEnrichmentKit to type ref
  • change umIsPresent to type bool and rename to umisPresent
  • move labProcedure lookup out of lookups folder
  • in files, change belongsToSequencing to type xref and change name to producedBy*
  • in files, also add ref to cohort and study
  • create cohorts lookup table and reference in files, subjects, study, consent
