IGVF Subsample DB

This tool subsamples the Postgres database of an ENCODE/IGVF server based on a subsampling rule JSON file.

Subsampling rule JSON

This file defines subsampling rule(s) for each profile (e.g. experiment for ENCODE, measurement_set for IGVF). Multiple rules are allowed for each profile. Here is an example for ENCODE.

{
    "file": [
        {
            "subsampling_min": 100,
            "subsampling_rate": 1e-03
        }
    ],
    "experiment": [
        {
            "subsampling_min": 3,
            "subsampling_rate": 1e-05,
            "subsampling_cond": {
                "assay_term_name": "ATAC-seq"
            }
        },
        {
            "subsampling_min": 5,
            "subsampling_rate": 1e-05,
            "subsampling_cond": {
                "assay_term_name": "ChIP-seq"
            }
        }
    ]
}

A rule consists of subsampling_min, subsampling_rate and, optionally, subsampling_cond. See the following example for the experiment profile of ENCODE.

{
    "subsampling_min": 5,
    "subsampling_rate": 1e-05,
    "subsampling_cond": {
        "assay_term_name": "ChIP-seq"
    }
}
  • subsampling_min defines the minimum number of objects in the profile after subsampling. It is bounded by the actual number of objects, i.e. the tool takes MIN(number_of_objects, subsampling_min).
  • subsampling_rate defines a second minimum: the total number of objects in the profile (respecting subsampling_cond if defined) multiplied by the rate. The MAX of these two values is taken as the final number of subsampled objects in the profile.
  • subsampling_cond is a JSON object that defines conditions for the rule. In the above example, only objects whose assay_term_name property is ChIP-seq are subsampled. You can use any valid property of a profile; see the profile's schema JSON to find such properties.
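
The rule structure described above can be checked before running the tool. The following is a minimal sketch, not part of the tool itself: the function name validate_rules and the decision to flag unknown keys are assumptions made for illustration.

```python
import json

# Keys named in this README; anything else is flagged (an assumption of this sketch).
REQUIRED_KEYS = {"subsampling_min", "subsampling_rate"}
OPTIONAL_KEYS = {"subsampling_cond"}


def validate_rules(rules_by_profile):
    """Return a list of problems; an empty list means the structure looks well-formed."""
    problems = []
    for profile, rules in rules_by_profile.items():
        if not isinstance(rules, list):
            problems.append(f"{profile}: expected a list of rules")
            continue
        for index, rule in enumerate(rules):
            where = f"{profile}[{index}]"
            if not isinstance(rule, dict):
                problems.append(f"{where}: expected a JSON object")
                continue
            keys = set(rule)
            missing = REQUIRED_KEYS - keys
            unknown = keys - REQUIRED_KEYS - OPTIONAL_KEYS
            if missing:
                problems.append(f"{where}: missing {sorted(missing)}")
            if unknown:
                problems.append(f"{where}: unknown {sorted(unknown)}")
            cond = rule.get("subsampling_cond", {})
            if not isinstance(cond, dict):
                problems.append(f"{where}: subsampling_cond must be a JSON object")
    return problems


example = json.loads("""
{
    "file": [
        {"subsampling_min": 100, "subsampling_rate": 1e-03}
    ],
    "experiment": [
        {
            "subsampling_min": 5,
            "subsampling_rate": 1e-05,
            "subsampling_cond": {"assay_term_name": "ChIP-seq"}
        }
    ]
}
""")

print(validate_rules(example))  # []
```

A rule that omits subsampling_rate, for instance, would be reported as a problem instead of being silently accepted.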

There are currently 12548 ChIP-seq experiments, so this rule subsamples 12548 objects down to MAX(5, 1e-05 * 12548) = 5.

For the file profile in the above example, there are currently 1458539 file objects on ENCODE, so the rule subsamples 1458539 objects down to MAX(100, 1e-03 * 1458539) = 1458.
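
The arithmetic in the two examples above can be reproduced with a short sketch. The function name subsample_count is hypothetical, and truncating the rate-based value to an integer is an assumption chosen because it matches the 1458 result given for the file profile.

```python
def subsample_count(total, subsampling_min, subsampling_rate):
    """Number of objects kept for one rule, following the MIN/MAX description above."""
    # subsampling_min is bounded by the actual number of objects.
    bounded_min = min(total, subsampling_min)
    # Assumption: the fractional part is dropped (1458.539 -> 1458).
    by_rate = int(total * subsampling_rate)
    return max(bounded_min, by_rate)


print(subsample_count(12548, 5, 1e-05))      # 5
print(subsample_count(1458539, 100, 1e-03))  # 1458
```

With a subsampling_min larger than the number of objects, the result equals the total, so every object is kept.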

You can have multiple rules under a single profile, as with the experiment profile in the above example. It will include at least 3 ATAC-seq experiments and 5 ChIP-seq experiments.
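
How several rules under one profile might combine can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual selection logic: the README does not say how objects are picked, so random sampling, exact-match conditions, the uuid key, and the names matches and subsample_profile are all hypothetical. The objects are toy data.

```python
import random


def matches(obj, cond):
    """True when obj carries every property in cond with the same value (assumed semantics)."""
    return all(obj.get(key) == value for key, value in (cond or {}).items())


def subsample_profile(objects, rules, seed=0):
    """Apply each rule independently and return the union of the selected objects."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    selected = {}
    for rule in rules:
        candidates = [obj for obj in objects if matches(obj, rule.get("subsampling_cond"))]
        total = len(candidates)
        count = max(min(total, rule["subsampling_min"]), int(total * rule["subsampling_rate"]))
        count = min(count, total)  # never ask for more objects than exist
        for obj in rng.sample(candidates, count):
            selected[obj["uuid"]] = obj  # "uuid" is a hypothetical unique key
    return list(selected.values())


# Toy data: 10 ATAC-seq, 4 ChIP-seq and 6 other experiments.
assays = ["ATAC-seq"] * 10 + ["ChIP-seq"] * 4 + ["RNA-seq"] * 6
objects = [{"uuid": f"obj-{i}", "assay_term_name": assay} for i, assay in enumerate(assays)]
rules = [
    {"subsampling_min": 3, "subsampling_rate": 1e-05,
     "subsampling_cond": {"assay_term_name": "ATAC-seq"}},
    {"subsampling_min": 5, "subsampling_rate": 1e-05,
     "subsampling_cond": {"assay_term_name": "ChIP-seq"}},
]

picked = subsample_profile(objects, rules)
print(len(picked))  # 7
```

In this toy data only 4 ChIP-seq experiments exist, so the second rule keeps 4 rather than 5, which reflects the bound of subsampling_min by the actual number of objects.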

IMPORTANT: Some users and their access_keys are required to run a server. Therefore, the two examples keep ALL user and access_key objects after subsampling, i.e. "subsampling_min": 1000000 is defined for both user and access_key.

Requirements

Install the PostgreSQL development library on your system.

# apt
$ sudo apt-get install libpq-dev

# yum
$ sudo yum install postgresql-devel

Running the tool with an RDS database (IGVF)

See this document for details.

Running the tool on a running demo (ENCODE)

See this document for details.

Contributors

leepc12
