capice-resources's Introduction

THIS MOLGENIS VERSION IS IN ARCHIVE MODE. PLEASE USE NEXT GENERATION AT MOLGENIS-EMX2

Welcome to MOLGENIS

MOLGENIS is a collaborative open source project on a mission to generate great software infrastructure for life science research.

Develop

MOLGENIS has a frontend and a backend. You can develop on them separately. When you want to develop an API and an App simultaneously, you need to check out both.

Useful links

Deploy MOLGENIS

capice-resources's Issues

[compare-model-performance] Violinplot x axis labels seem flipped

Describe the bug

The x-axis labels of the violin plots, "benign" and "pathogenic", appear flipped, causing confusion with the legend.

How to Reproduce

Steps to reproduce the behavior:

  1. Run compare-model-performance with 2 score files and a label file.

Expected behavior

Pathogenic labelled variants should match the x-axis label "pathogenic" and benign labelled variants should match the x-axis label "benign". Both labels should match the number of variants in the legend.

train_data_creator: Better handling of duplicates.

In case a duplicate is found (identical '#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class'), the current approach simply removes all duplicates after the first. This means a duplicate that would yield a higher review score (i.e. a higher weight for training) may be removed.

A better approach would be to ensure the highest review score among all sources is kept, so that high-quality data is treated as such.
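The proposed fix can be sketched with pandas. The key columns are the ones named above; the 'review' column name for the review score is an assumption, and the frame below is a minimal made-up example. Sorting by review score before dropping duplicates keeps the highest-scoring entry per key:

```python
import pandas as pd

# Minimal hypothetical frame; key columns mirror the issue, 'review'
# (the review-score / training-weight column) is an assumed name.
df = pd.DataFrame({
    '#CHROM': ['1', '1'],
    'POS': [100, 100],
    'REF': ['A', 'A'],
    'ALT': ['T', 'T'],
    'gene': ['BRCA1', 'BRCA1'],
    'class': ['LP', 'LP'],
    'review': [1, 3],
})

key = ['#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class']
# Sort so the highest review score comes first, then keep that row
# per duplicate key instead of keeping whichever happened to be first.
deduplicated = (
    df.sort_values('review', ascending=False)
      .drop_duplicates(subset=key, keep='first')
)
```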

train_data_creator: Add conflict handling if sources disagree about pathogenicity classification.

Currently there is no check whether the sources agree on the pathogenicity classification; agreement is assumed (based on the fact that the public VKGL data is added to ClinVar). However, because non-public VKGL data is also used, it cannot be guaranteed that both sources will always agree. To ensure high data quality (especially since one of the duplicates will be dropped), it should be verified that the different sources align in their classification.
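A possible conflict check, sketched with pandas (the 'source' and 'class' column names and the example data are assumptions, not the module's actual schema), groups on the variant key and flags any key with more than one distinct classification:

```python
import pandas as pd

# Hypothetical combined frame from both sources; column names are
# assumed for illustration.
df = pd.DataFrame({
    '#CHROM': ['1', '1', '2'],
    'POS': [100, 100, 200],
    'REF': ['A', 'A', 'G'],
    'ALT': ['T', 'T', 'C'],
    'gene': ['BRCA1', 'BRCA1', 'TP53'],
    'class': ['LB', 'LP', 'LP'],
    'source': ['VKGL', 'ClinVar', 'VKGL'],
})

key = ['#CHROM', 'POS', 'REF', 'ALT', 'gene']
# Count distinct classifications per variant key; >1 means the
# sources disagree and the variant needs conflict handling.
n_classes = df.groupby(key)['class'].transform('nunique')
conflicts = df[n_classes > 1]
```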

compare-model-performance crashes on violinplot when empty dataframe is supplied

Describe the bug

When 2 model performances, each with their own validation file, are compared through compare-model-performance and one of the 2 models has no samples for a given consequence, seaborn throws an error stating that exactly two hue levels must be present.

System information

  • OS: Not applicable
  • Version: 5.0.0.dev0
  • Python version: Python3.10
  • Shell: ZSH

How to Reproduce

Steps to reproduce the behavior:

  1. Run compare-model-performance with supplied files: compare-model-performance -a capice_5.0.0-v2_score_grch37.tsv.gz -l validation_vep_processed.tsv.gz -b score_normal.tsv.gz -m validation.tsv.gz -o .

capice_5.0.0-v2_score_grch37.tsv.gz
score_normal.tsv.gz
validation.tsv.gz
validation_vep_processed.tsv.gz

Expected behavior

The violin plot shows only the single available hue. In the case that no samples are present at all, a plot reading "no data" should be shown instead of no plot at all.
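A minimal guard, assuming a hypothetical hue column name (the plotter's actual variable names differ), would be to pass `split=True` to `sns.violinplot` only when both hue levels are actually present in the subset:

```python
import pandas as pd

def can_split(frame: pd.DataFrame, hue_col: str) -> bool:
    """Return True only when exactly two hue levels are present, so
    seaborn's split=True violin plot will not raise ValueError."""
    return frame[hue_col].nunique() == 2

# Hypothetical per-consequence subset where only model 1 has samples:
subset = pd.DataFrame({'score': [0.1, 0.9], 'model': ['model 1', 'model 1']})
split = can_split(subset, 'model')
# Then call sns.violinplot(..., split=split) instead of a hardcoded
# split=True, falling back to an unsplit (or "no data") plot.
```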

Logs

Traceback (most recent call last):
  File "/home/robert/git/capice-resources/venv/bin/compare-model-performance", line 33, in <module>
    sys.exit(load_entry_point('capice-resources', 'console_scripts', 'compare-model-performance')())
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/__main__.py", line 367, in main
    CompareModelPerformance().run()
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/core/__init__.py", line 53, in run
    output = self.run_module(args)
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/__main__.py", line 174, in run_module
    plots = plotter.plot(model_1, model_2)
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/plotter.py", line 241, in plot
    self._plot_consequences(merged_model_1_data, merged_model_2_data)
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/plotter.py", line 311, in _plot_consequences
    self._plot_score_dist(subset_m1, m1_samples, subset_m2, m2_samples, consequence)
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/plotter.py", line 685, in _plot_score_dist
    self._create_violinplot_for_column(
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/plotter.py", line 880, in _create_violinplot_for_column
    sns.violinplot(
  File "/home/robert/git/capice-resources/venv/lib/python3.10/site-packages/seaborn/categorical.py", line 2305, in violinplot
    plotter = _ViolinPlotter(x, y, hue, data, order, hue_order,
  File "/home/robert/git/capice-resources/venv/lib/python3.10/site-packages/seaborn/categorical.py", line 920, in __init__
    raise ValueError(msg)
ValueError: There must be exactly two hue levels to use `split`.


[compare-model-performance] Model 1 and model 2 violinplot can be flipped

Describe the bug

In the violinplots, model 2 (benign) can be shown before model 1 (benign) in the plot, causing confusion.

System information

  • Version: 5.1.0
  • Python version: 3.11.3
  • Shell: ZSH

Expected behavior

Model 1 benign should always be shown first, then model 2 benign, then model 1 pathogenic, then model 2 pathogenic.

Screenshots

See the frameshift_variant panel of the attached score_differences_vio plot.

Additional context

Potential solution: Add sort on "dataset source" on line 903
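The suggested sort can be sketched as follows (the 'dataset_source' column name and the example frame are assumptions; the real fix sits in the plotter). Sorting the plotting frame, or fixing the order explicitly via seaborn's `hue_order` parameter, keeps model 1 ahead of model 2:

```python
import pandas as pd

# Hypothetical plotting frame where model 2 rows precede model 1 rows.
df = pd.DataFrame({
    'label': ['benign', 'benign', 'pathogenic', 'pathogenic'],
    'dataset_source': ['model 2', 'model 1', 'model 2', 'model 1'],
    'score': [0.1, 0.2, 0.8, 0.9],
})

# Sort so model 1 always comes first within the data...
df = df.sort_values('dataset_source')
# ...or pin the order explicitly when calling sns.violinplot:
hue_order = ['model 1', 'model 2']
```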

liftover_variants.sh not removing index file

How to reproduce:

sbatch --output=/path/to/train_test_liftover_grch38.log --error=/path/to/train_test_liftover_grch38.err ./liftover_variants.sh -i /path/to/train_test.vcf.gz -o /path/to/train_test_grch38

Stdout:

CLA passed
Loading Picard
Running Picard
Gzipping outputs
Removing indexing file if made
Done

The "Indexing file removed" echo is missing from the output, so the if statement is apparently not triggered.

Expected output files:

  • train_test_grch38.vcf.gz
  • train_test_grch38.vcf_rejected.vcf.gz

Actual output files:

  • train_test_grch38.vcf.gz
  • train_test_grch38.vcf.idx
  • train_test_grch38.vcf_rejected.vcf.gz

New ClinVar release (20230430) is not supported (error thrown)

Describe the bug

train-data-creator throws an exception when using the new ClinVar release (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20230430.vcf.gz). There appear to be additional contig names (starting with either NW_ or NT_) in the CHROM column that were not present in earlier releases.

How to Reproduce

Steps to reproduce the behavior:

  1. train-data-creator -v VKGL_public_consensus_202304.tsv -c clinvar_20230430.vcf.gz -o output_train_data

Expected behavior

  • New ClinVar release can be used.
  • Warning message is thrown stating how many of these rows were present in the input data.
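The expected behavior above can be sketched with pandas (the frame and the allowed-contig set are illustrative assumptions, not the module's actual implementation): filter out non-standard contigs and warn with the number of rows dropped.

```python
import pandas as pd

# Hypothetical ClinVar frame; release 20230430 introduced NW_ / NT_
# scaffold names in CHROM alongside the usual 1-22, X, Y, MT.
df = pd.DataFrame({'#CHROM': ['1', 'NW_009646201.1', 'X', 'NT_187636.1']})

allowed = {str(c) for c in range(1, 23)} | {'X', 'Y', 'MT'}
mask = df['#CHROM'].isin(allowed)
n_dropped = int((~mask).sum())
if n_dropped:
    # Warn instead of raising, so the new release stays usable.
    print(f'Warning: dropped {n_dropped} rows with non-standard contigs')
df = df[mask]
```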

Matplotlib fails to install

Describe the bug

Matplotlib fails to install when using Python 3.11

System information

  • OS: Manjaro 23.0.0 UltimaThule (Kernel: x86_64 Linux 6.1.31-2-MANJARO)
  • Version: 5.1.0
  • Python version: Python3.11.3
  • Shell: ZSH

How to Reproduce

Steps to reproduce the behavior:

  1. Install python 3.11.3
  2. Clone repository
  3. Checkout tag 5.1.0 (at the moment of writing, 5.1.0 is the latest version)
  4. Run command pip install .
  5. See error.

Expected behavior

Matplotlib installs properly.


Comparison scripts not functioning out-of-the-box

Currently, when following the steps on https://github.com/molgenis/capice-resources#tldr, neither comparison script (compare_build37_build38_models.py nor compare_old_model.py) allows comparison of 2 custom models. The vep_to_train.py script, which is used to generate data for creating the final train-test & validation .tsv.gz files, filters out a column needed for the comparison (%ID).

The flow should be reviewed to determine at which step adjustments are best made, so that a data file usable for the comparison is available. Additionally, a more generic comparison script would be preferred (even if it's just renaming an existing one), as it is currently very unclear which script should be used for comparison.

`utility_scripts/process_vep_tsv.py`: When purely giving output filename, an error is thrown instead of file being written in current directory.

When only the filename is used as output, without giving a relative/absolute path, a NotADirectoryError: Output file has to be placed in an directory that already exists! error is raised:

$ python3 ./utility_scripts/process_vep_tsv.py -i ../output/train_input_annotated.tsv.gz -o train_input_test.tsv.gz 
Parsing CLI

Validating input arguments.
Traceback (most recent call last):
  File "/path/to/capice-resources-main/./utility_scripts/process_vep_tsv.py", line 178, in <module>
    main()
  File "/path/to/capice-resources-main/./utility_scripts/process_vep_tsv.py", line 104, in main
    validator.validate_output_cla(output)
  File "/path/to/capice-resources-main/./utility_scripts/process_vep_tsv.py", line 82, in validate_output_cla
    raise NotADirectoryError('Output file has to be placed in an directory that already '
NotADirectoryError: Output file has to be placed in an directory that already exists!

If using ./train_input_test.tsv.gz as output argument, this error does not occur.

Expected behavior:

Giving only a filename should simply result in the output file being created in the current directory.
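A possible fix can be sketched with pathlib (the function name is hypothetical; the script's actual validator differs): treat a bare filename as a file in the current working directory before checking that the parent directory exists.

```python
from pathlib import Path

def resolve_output(cla: str) -> Path:
    """Interpret a bare filename as a file in the current working
    directory; keep explicit relative/absolute paths as-is."""
    path = Path(cla)
    if path.parent == Path('.'):  # bare filename (or ./name)
        path = Path.cwd() / path.name
    if not path.parent.is_dir():
        raise NotADirectoryError(
            'Output file has to be placed in a directory that already exists!')
    return path
```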

Train-data-creator doesn't take multiple SYMBOLs (IDs) into account

Describe the bug

A "GENEINFO" entry can contain multiple symbol:ID pairs:

##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">

Example:

1 1474871 1295591 G C . . ALLELEID=1285386;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.10:g.1474871G>C;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=TMEM240:339453|LOC121967044:121967044;MC=SO:0001627|intron_variant;ORIGIN=1

In this case, "LOC121967044" is discarded, which could lead to mapping problems in process-vep.
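Parsing all pairs instead of only the first is straightforward given the delimiter rules from the INFO header above (pairs separated by `|`, symbol and ID separated by `:`); this helper is an illustrative sketch, not the module's actual code:

```python
def parse_geneinfo(geneinfo: str) -> dict[str, str]:
    """Split a GENEINFO INFO value into {symbol: gene_id} pairs.
    Pairs are |-delimited; symbol and id are :-delimited."""
    return dict(pair.split(':', 1) for pair in geneinfo.split('|'))

# The GENEINFO value from the example variant above:
genes = parse_geneinfo('TMEM240:339453|LOC121967044:121967044')
```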

System information

  • OS: Not applicable
  • Version: 5.0.0.dev0
  • Python version: Not applicable
  • Shell: Not applicable

How to Reproduce

Steps to reproduce the behavior:

  1. Run train-data-creator with a VCF containing a sample with multiple GENEINFO entries.
  2. Run VEP.
  3. Convert VEP output VCF to TSV.
  4. Run process-vep and see that only 1 of the entries has been mapped.

Expected behavior

Currently, process-vep maps 1-to-1 from the initially supplied SYMBOL to the VEP output SYMBOL. This needs to be changed to a 1-to-many mapping (1 being the VEP output SYMBOL, many being the SYMBOLs in the "ID" column).


Additional context

#51 (comment)

Pip won't build on master!

Error message:

error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [16 lines of output]
      /<path-to>/capice-resources/venv/lib/python3.10/site-packages/setuptools/dist.py:530: UserWarning: Normalizing '1.0.0-dev' to '1.0.0.dev0'
        warnings.warn(tmpl.format(**locals()))
      error: Multiple top-level packages discovered in a flat-layout: ['validation', 'utility_scripts', 'train_data_creator'].

How to reproduce:

  1. git clone [email protected]:molgenis/capice-resources.git
  2. Open project in PyCharm.
  3. Add Python Interpreter (venv), used Python 3.10 myself.
  4. From the terminal (within PyCharm or after loading venv), run: pip install -e '.[test]'

See also warning message in https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#flat-layout
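Per the setuptools package-discovery documentation linked above, one way to resolve this is to list the top-level packages explicitly so flat-layout auto-discovery is never attempted. This is a hedged sketch of a possible pyproject.toml fragment, not the repository's actual configuration:

```toml
# Hypothetical pyproject.toml fragment: declare the packages that
# setuptools reported in the flat-layout error explicitly.
[tool.setuptools]
packages = ["validation", "utility_scripts", "train_data_creator"]
```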

Train-data-creator doesn't use SYMBOL ID's

Describe the bug

Currently, due to limitations of the supplied data, train-data-creator doesn't use SYMBOL IDs, but instead relies on the (non-normalized) SYMBOL to map back a variant after it has been annotated by VEP. This could potentially cause issues: SYMBOL IDs are fixed, but SYMBOL is not, so a SYMBOL could change at any moment.


Expected behavior

train-data-creator uses SYMBOL IDs instead of SYMBOL. This means that process-vep should also take this into account, although mapping 1-to-1 (or 1-to-many: #54 ) should not limit the use of SYMBOL IDs.


Additional context

#51 (comment)

[compare-model-performance] Supplying score file of different size than label file raises masking error

Describe the bug

When a score file is supplied that was scored on a different validation file (of a different size than the validation file supplied with the -l flag) and the -f flag is included, pandas raises the error "Cannot mask with non-boolean array containing NA / NaN values", indicating that something went wrong with the merge.

System information

  • Version: 5.0.0
  • Python version: 3.10.10

How to Reproduce

Steps to reproduce the behavior:

  1. score one validation file
  2. score a different validation file of different sample size
  3. Run compare-model-performance with -a and -b set to the 2 different sized score files, setting -l to the validation file used to generate the score file supplied with -a, include the -f flag.
  4. See error.

Expected behavior

Warnings should be raised when a consequence sample size mismatches between a score file and a label file, rather than a "Cannot mask with NaN" error.
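The mismatch can be made explicit before any masking happens by merging with pandas' `indicator=True` (the frames and the 'variant' join key below are illustrative assumptions): rows present in only one input become visible and can trigger a warning instead of a downstream NaN-masking error.

```python
import pandas as pd

# Hypothetical score and label frames of different sizes.
scores = pd.DataFrame({'variant': ['v1', 'v2'], 'score': [0.1, 0.9]})
labels = pd.DataFrame({'variant': ['v1', 'v2', 'v3'],
                       'label': ['B', 'P', 'P']})

# indicator=True adds a '_merge' column marking left_only/right_only
# rows, i.e. samples missing from one of the two inputs.
merged = scores.merge(labels, on='variant', how='outer', indicator=True)
mismatch = merged[merged['_merge'] != 'both']
if not mismatch.empty:
    print(f'Warning: {len(mismatch)} rows present in only one input')
```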

Logs

Traceback (most recent call last):
  File "/home/robert/git/capice-resources/venv/bin/compare-model-performance", line 33, in <module>
    sys.exit(load_entry_point('capice-resources', 'console_scripts', 'compare-model-performance')())
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/__main__.py", line 367, in main
    CompareModelPerformance().run()
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/core/__init__.py", line 53, in run
    output = self.run_module(args)
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/__main__.py", line 151, in run_module
    consequence_tools.validate_consequence_samples_equal(
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/consequence_tools.py", line 82, in validate_consequence_samples_equal
    m1 = merged_model_1[
  File "/home/robert/git/capice-resources/venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3797, in __getitem__
    if com.is_bool_indexer(key):
  File "/home/robert/git/capice-resources/venv/lib/python3.10/site-packages/pandas/core/common.py", line 135, in is_bool_indexer
    raise ValueError(na_msg)
ValueError: Cannot mask with non-boolean array containing NA / NaN values


Raise error instead of warning "unknown review status ClinVar"

Is your feature request related to a problem? Please describe.
Not yet, but as soon as ClinVar introduces a new star in their review status, this problem becomes a priority.

Describe the solution you'd like
Instead of raising a warning, raising an error would be preferred.
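A minimal sketch of the requested behavior (the status set and function name are hypothetical; the real module may track review statuses differently): fail loudly on an unknown status instead of warning.

```python
# Hypothetical set of known ClinVar review statuses.
KNOWN_REVIEW_STATUSES = {
    'no_assertion_provided',
    'criteria_provided,_single_submitter',
    'criteria_provided,_multiple_submitters,_no_conflicts',
    'reviewed_by_expert_panel',
    'practice_guideline',
}

def check_review_status(status: str) -> None:
    """Raise instead of warn, so a new ClinVar star level cannot
    silently produce mis-weighted training data."""
    if status not in KNOWN_REVIEW_STATUSES:
        raise ValueError(f'Unknown ClinVar review status: {status}')
```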


When running train_data_creator I get error

Traceback (most recent call last):
  File "capice-resources/train_data_creator/main.py", line 5, in <module>
    from train_data_creator.src.main.exporter import Exporter
ModuleNotFoundError: No module named 'train_data_creator'

Removing the train_data_creator prefix from the imports in main.py fixes it.
