capice-resources's Introduction

THIS MOLGENIS VERSION IS IN ARCHIVE MODE. PLEASE USE NEXT GENERATION AT MOLGENIS-EMX2

Welcome to MOLGENIS

MOLGENIS is a collaborative open source project on a mission to generate great software infrastructure for life science research.

Develop

MOLGENIS has a frontend and a backend. You can develop on them separately. When you want to develop an API and an App simultaneously, you need to check out both.

Useful links

Deploy MOLGENIS

capice-resources's Issues

[compare-model-performance] Violinplot x axis labels seem flipped

Describe the bug

The x-axis labels of the violin plots, "benign" and "pathogenic", appear flipped, causing confusion with the legend.

How to Reproduce

Steps to reproduce the behavior:

  1. Run compare-model-performance with 2 score files and a label file.

Expected behavior

Pathogenic labelled variants should match the x-axis label "pathogenic" and benign labelled variants should match the x-axis label "benign". Both labels should match the number of variants in the legend.

train_data_creator: Better handling of duplicates.

In case a duplicate is found (identical '#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class'), the current approach simply removes all duplicates after the first. This means a duplicate that would yield a higher review score (i.e. a higher weight for training) may be removed.

A better approach would be to ensure the highest review score among all sources is kept, so that high-quality data is treated as such.
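The proposed fix can be sketched with pandas. The key columns are the ones named above; the 'review' column name for the review score is an assumption, and the frame below is a minimal made-up example. Sorting by review score before dropping duplicates keeps the highest-scoring entry per key:

```python
import pandas as pd

# Minimal hypothetical frame; key columns mirror the issue, 'review'
# (the review-score / training-weight column) is an assumed name.
df = pd.DataFrame({
    '#CHROM': ['1', '1'],
    'POS': [100, 100],
    'REF': ['A', 'A'],
    'ALT': ['T', 'T'],
    'gene': ['BRCA1', 'BRCA1'],
    'class': ['LP', 'LP'],
    'review': [1, 3],
})

key = ['#CHROM', 'POS', 'REF', 'ALT', 'gene', 'class']
# Sort so the highest review score comes first, then keep that row
# per duplicate key instead of keeping whichever happened to be first.
deduplicated = (
    df.sort_values('review', ascending=False)
      .drop_duplicates(subset=key, keep='first')
)
```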

train_data_creator: Add conflict handling if sources disagree about pathogenicity classification.

Currently there is no check whether the sources agree on the pathogenicity classification; agreement is assumed (based on the fact that the public VKGL data is added to ClinVar). However, because non-public VKGL data is also used, it cannot be guaranteed that both sources will always agree. To ensure high data quality (especially since one of the duplicates will be dropped), it should be verified that the different sources align in their classification.
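A possible conflict check, sketched with pandas (the 'source' and 'class' column names and the example data are assumptions, not the module's actual schema), groups on the variant key and flags any key with more than one distinct classification:

```python
import pandas as pd

# Hypothetical combined frame from both sources; column names are
# assumed for illustration.
df = pd.DataFrame({
    '#CHROM': ['1', '1', '2'],
    'POS': [100, 100, 200],
    'REF': ['A', 'A', 'G'],
    'ALT': ['T', 'T', 'C'],
    'gene': ['BRCA1', 'BRCA1', 'TP53'],
    'class': ['LB', 'LP', 'LP'],
    'source': ['VKGL', 'ClinVar', 'VKGL'],
})

key = ['#CHROM', 'POS', 'REF', 'ALT', 'gene']
# Count distinct classifications per variant key; >1 means the
# sources disagree and the variant needs conflict handling.
n_classes = df.groupby(key)['class'].transform('nunique')
conflicts = df[n_classes > 1]
```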

compare-model-performance crashes on violinplot when empty dataframe is supplied

Describe the bug

When 2 model performances, each with their own validation file, are compared through compare-model-performance and one of the 2 models has no samples for a given consequence, seaborn throws an error stating that exactly two hue levels must be present.

System information

  • OS: Not applicable
  • Version: 5.0.0.dev0
  • Python version: Python3.10
  • Shell: ZSH

How to Reproduce

Steps to reproduce the behavior:

  1. Run compare-model-performance with supplied files: compare-model-performance -a capice_5.0.0-v2_score_grch37.tsv.gz -l validation_vep_processed.tsv.gz -b score_normal.tsv.gz -m validation.tsv.gz -o .

capice_5.0.0-v2_score_grch37.tsv.gz
score_normal.tsv.gz
validation.tsv.gz
validation_vep_processed.tsv.gz

Expected behavior

The violin plot shows only the single available hue. In the case that no samples are present at all, a plot reading "no data" should be shown instead of no plot at all.
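A minimal guard, assuming a hypothetical hue column name (the plotter's actual variable names differ), would be to pass `split=True` to `sns.violinplot` only when both hue levels are actually present in the subset:

```python
import pandas as pd

def can_split(frame: pd.DataFrame, hue_col: str) -> bool:
    """Return True only when exactly two hue levels are present, so
    seaborn's split=True violin plot will not raise ValueError."""
    return frame[hue_col].nunique() == 2

# Hypothetical per-consequence subset where only model 1 has samples:
subset = pd.DataFrame({'score': [0.1, 0.9], 'model': ['model 1', 'model 1']})
split = can_split(subset, 'model')
# Then call sns.violinplot(..., split=split) instead of a hardcoded
# split=True, falling back to an unsplit (or "no data") plot.
```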

Logs

Traceback (most recent call last):
  File "/home/robert/git/capice-resources/venv/bin/compare-model-performance", line 33, in <module>
    sys.exit(load_entry_point('capice-resources', 'console_scripts', 'compare-model-performance')())
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/__main__.py", line 367, in main
    CompareModelPerformance().run()
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/core/__init__.py", line 53, in run
    output = self.run_module(args)
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/__main__.py", line 174, in run_module
    plots = plotter.plot(model_1, model_2)
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/plotter.py", line 241, in plot
    self._plot_consequences(merged_model_1_data, merged_model_2_data)
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/plotter.py", line 311, in _plot_consequences
    self._plot_score_dist(subset_m1, m1_samples, subset_m2, m2_samples, consequence)
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/plotter.py", line 685, in _plot_score_dist
    self._create_violinplot_for_column(
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/plotter.py", line 880, in _create_violinplot_for_column
    sns.violinplot(
  File "/home/robert/git/capice-resources/venv/lib/python3.10/site-packages/seaborn/categorical.py", line 2305, in violinplot
    plotter = _ViolinPlotter(x, y, hue, data, order, hue_order,
  File "/home/robert/git/capice-resources/venv/lib/python3.10/site-packages/seaborn/categorical.py", line 920, in __init__
    raise ValueError(msg)
ValueError: There must be exactly two hue levels to use `split`.


[compare-model-performance] Model 1 and model 2 violinplot can be flipped

Describe the bug

In the violinplots, model 2 (benign) can be shown before model 1 (benign) in the plot, causing confusion.

System information

  • Version: 5.1.0
  • Python version: 3.11.3
  • Shell: ZSH

Expected behavior

Model 1 benign should always be shown first, then model 2 benign, then model 1 pathogenic, then model 2 pathogenic.

Screenshots

See the frameshift_variant panel of the attached score_differences_vio plot.

Additional context

Potential solution: Add sort on "dataset source" on line 903
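The suggested sort can be sketched as follows (the 'dataset_source' column name and the example frame are assumptions; the real fix sits in the plotter). Sorting the plotting frame, or fixing the order explicitly via seaborn's `hue_order` parameter, keeps model 1 ahead of model 2:

```python
import pandas as pd

# Hypothetical plotting frame where model 2 rows precede model 1 rows.
df = pd.DataFrame({
    'label': ['benign', 'benign', 'pathogenic', 'pathogenic'],
    'dataset_source': ['model 2', 'model 1', 'model 2', 'model 1'],
    'score': [0.1, 0.2, 0.8, 0.9],
})

# Sort so model 1 always comes first within the data...
df = df.sort_values('dataset_source')
# ...or pin the order explicitly when calling sns.violinplot:
hue_order = ['model 1', 'model 2']
```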

liftover_variants.sh not removing index file

How to reproduce:

sbatch --output=/path/to/train_test_liftover_grch38.log --error=/path/to/train_test_liftover_grch38.err ./liftover_variants.sh -i /path/to/train_test.vcf.gz -o /path/to/train_test_grch38

Stdout:

CLA passed
Loading Picard
Running Picard
Gzipping outputs
Removing indexing file if made
Done

The "Indexing file removed" echo is missing from the output, so the if statement is apparently not triggered.

Expected output files:

  • train_test_grch38.vcf.gz
  • train_test_grch38.vcf_rejected.vcf.gz

Actual output files:

  • train_test_grch38.vcf.gz
  • train_test_grch38.vcf.idx
  • train_test_grch38.vcf_rejected.vcf.gz

New ClinVar release (20230430) is not supported (error thrown)

Describe the bug

train-data-creator throws an exception when using the new ClinVar release (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20230430.vcf.gz). There appear to be additional contig names (starting with either NW_ or NT_) in the CHROM column that were not present in earlier releases.

How to Reproduce

Steps to reproduce the behavior:

  1. train-data-creator -v VKGL_public_consensus_202304.tsv -c clinvar_20230430.vcf.gz -o output_train_data

Expected behavior

  • New ClinVar release can be used.
  • Warning message is thrown stating how many of these rows were present in the input data.
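The expected behavior above can be sketched with pandas (the frame and the allowed-contig set are illustrative assumptions, not the module's actual implementation): filter out non-standard contigs and warn with the number of rows dropped.

```python
import pandas as pd

# Hypothetical ClinVar frame; release 20230430 introduced NW_ / NT_
# scaffold names in CHROM alongside the usual 1-22, X, Y, MT.
df = pd.DataFrame({'#CHROM': ['1', 'NW_009646201.1', 'X', 'NT_187636.1']})

allowed = {str(c) for c in range(1, 23)} | {'X', 'Y', 'MT'}
mask = df['#CHROM'].isin(allowed)
n_dropped = int((~mask).sum())
if n_dropped:
    # Warn instead of raising, so the new release stays usable.
    print(f'Warning: dropped {n_dropped} rows with non-standard contigs')
df = df[mask]
```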

Matplotlib fails to install

Describe the bug

Matplotlib fails to install when using Python 3.11

System information

  • OS: Manjaro 23.0.0 UltimaThule (Kernel: x86_64 Linux 6.1.31-2-MANJARO)
  • Version: 5.1.0
  • Python version: Python3.11.3
  • Shell: ZSH

How to Reproduce

Steps to reproduce the behavior:

  1. Install python 3.11.3
  2. Clone repository
  3. Checkout tag 5.1.0 (at the moment of writing, 5.1.0 is the latest version)
  4. Run command pip install .
  5. See error.

Expected behavior

Matplotlib installs properly.


Comparison scripts not functioning out-of-the-box

Currently, when following the steps on https://github.com/molgenis/capice-resources#tldr, neither comparison script (compare_build37_build38_models.py nor compare_old_model.py) allows comparison of 2 custom models. The vep_to_train.py script, which is used to generate data for creating the final train-test & validation .tsv.gz files, filters out a column needed for the comparison (%ID).

The flow should be reviewed to determine at which step adjustments are best made, so that a data file usable for the comparison is available. Additionally, a more generic comparison script would be preferred (even if it's just renaming an existing one), as it is currently very unclear which script should be used for comparison.

`utility_scripts/process_vep_tsv.py`: When purely giving output filename, an error is thrown instead of file being written in current directory.

When only the filename is used as output, without giving a relative/absolute path, a NotADirectoryError: Output file has to be placed in an directory that already exists! error is raised:

$ python3 ./utility_scripts/process_vep_tsv.py -i ../output/train_input_annotated.tsv.gz -o train_input_test.tsv.gz 
Parsing CLI

Validating input arguments.
Traceback (most recent call last):
  File "/path/to/capice-resources-main/./utility_scripts/process_vep_tsv.py", line 178, in <module>
    main()
  File "/path/to/capice-resources-main/./utility_scripts/process_vep_tsv.py", line 104, in main
    validator.validate_output_cla(output)
  File "/path/to/capice-resources-main/./utility_scripts/process_vep_tsv.py", line 82, in validate_output_cla
    raise NotADirectoryError('Output file has to be placed in an directory that already '
NotADirectoryError: Output file has to be placed in an directory that already exists!

If using ./train_input_test.tsv.gz as output argument, this error does not occur.

Expected behavior:

Giving only a filename should simply result in the output file being created in the current directory.
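A possible fix can be sketched with pathlib (the function name is hypothetical; the script's actual validator differs): treat a bare filename as a file in the current working directory before checking that the parent directory exists.

```python
from pathlib import Path

def resolve_output(cla: str) -> Path:
    """Interpret a bare filename as a file in the current working
    directory; keep explicit relative/absolute paths as-is."""
    path = Path(cla)
    if path.parent == Path('.'):  # bare filename (or ./name)
        path = Path.cwd() / path.name
    if not path.parent.is_dir():
        raise NotADirectoryError(
            'Output file has to be placed in a directory that already exists!')
    return path
```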

Train-data-creator doesn't take multiple SYMBOLs (IDs) into account

Describe the bug

A "GENEINFO" entry can contain multiple symbol:ID pairs:

##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">

Example:

1 1474871 1295591 G C . . ALLELEID=1285386;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.10:g.1474871G>C;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=TMEM240:339453|LOC121967044:121967044;MC=SO:0001627|intron_variant;ORIGIN=1

In this case, "LOC121967044" is discarded, which could lead to mapping problems in process-vep.
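Parsing all pairs instead of only the first is straightforward given the delimiter rules from the INFO header above (pairs separated by `|`, symbol and ID separated by `:`); this helper is an illustrative sketch, not the module's actual code:

```python
def parse_geneinfo(geneinfo: str) -> dict[str, str]:
    """Split a GENEINFO INFO value into {symbol: gene_id} pairs.
    Pairs are |-delimited; symbol and id are :-delimited."""
    return dict(pair.split(':', 1) for pair in geneinfo.split('|'))

# The GENEINFO value from the example variant above:
genes = parse_geneinfo('TMEM240:339453|LOC121967044:121967044')
```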

System information

  • OS: Not applicable
  • Version: 5.0.0.dev0
  • Python version: Not applicable
  • Shell: Not applicable

How to Reproduce

Steps to reproduce the behavior:

  1. Run train-data-creator with a VCF containing a sample with multiple GENEINFO entries.
  2. Run VEP.
  3. Convert VEP output VCF to TSV.
  4. Run process-vep and see that only 1 of the entries has been mapped.

Expected behavior

Currently, process-vep maps 1-to-1 from the initially supplied SYMBOL to the VEP output SYMBOL. This needs to be changed to a 1-to-many mapping (1 being the VEP output SYMBOL, many being the SYMBOLs in the "ID" column).


Additional context

#51 (comment)

Pip won't build on master!

Error message:

error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [16 lines of output]
      /<path-to>/capice-resources/venv/lib/python3.10/site-packages/setuptools/dist.py:530: UserWarning: Normalizing '1.0.0-dev' to '1.0.0.dev0'
        warnings.warn(tmpl.format(**locals()))
      error: Multiple top-level packages discovered in a flat-layout: ['validation', 'utility_scripts', 'train_data_creator'].

How to reproduce:

  1. git clone [email protected]:molgenis/capice-resources.git
  2. Open project in PyCharm.
  3. Add Python Interpreter (venv), used Python 3.10 myself.
  4. From the terminal (within PyCharm or after loading venv), run: pip install -e '.[test]'

See also warning message in https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#flat-layout
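Per the setuptools package-discovery documentation linked above, one way to resolve this is to list the top-level packages explicitly so flat-layout auto-discovery is never attempted. This is a hedged sketch of a possible pyproject.toml fragment, not the repository's actual configuration:

```toml
# Hypothetical pyproject.toml fragment: declare the packages that
# setuptools reported in the flat-layout error explicitly.
[tool.setuptools]
packages = ["validation", "utility_scripts", "train_data_creator"]
```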

Train-data-creator doesn't use SYMBOL ID's

Describe the bug

Currently, due to limitations of the supplied data, train-data-creator doesn't use SYMBOL IDs, but instead relies on the (non-normalized) SYMBOL to map back a variant after it has been annotated by VEP. This could potentially cause issues: SYMBOL IDs are fixed, but SYMBOL is not, so a SYMBOL could change at any moment.


Expected behavior

train-data-creator uses SYMBOL IDs instead of SYMBOL. This means that process-vep should also take this into account, although mapping 1-to-1 (or 1-to-many: #54 ) should not limit the use of SYMBOL IDs.


Additional context

#51 (comment)

[compare-model-performance] Supplying score file of different size than label file raises masking error

Describe the bug

When a score file is supplied that was scored on a different validation file (of a different size than the validation file supplied with the -l flag) and the -f flag is included, pandas raises the error "Cannot mask with non-boolean array containing NA / NaN values", indicating that something went wrong with the merge.

System information

  • Version: 5.0.0
  • Python version: 3.10.10

How to Reproduce

Steps to reproduce the behavior:

  1. score one validation file
  2. score a different validation file of different sample size
  3. Run compare-model-performance with -a and -b set to the 2 different sized score files, setting -l to the validation file used to generate the score file supplied with -a, include the -f flag.
  4. See error.

Expected behavior

Warnings should be raised when a consequence sample size mismatches between a score file and a label file, rather than a "Cannot mask with NaN" error.
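The mismatch can be made explicit before any masking happens by merging with pandas' `indicator=True` (the frames and the 'variant' join key below are illustrative assumptions): rows present in only one input become visible and can trigger a warning instead of a downstream NaN-masking error.

```python
import pandas as pd

# Hypothetical score and label frames of different sizes.
scores = pd.DataFrame({'variant': ['v1', 'v2'], 'score': [0.1, 0.9]})
labels = pd.DataFrame({'variant': ['v1', 'v2', 'v3'],
                       'label': ['B', 'P', 'P']})

# indicator=True adds a '_merge' column marking left_only/right_only
# rows, i.e. samples missing from one of the two inputs.
merged = scores.merge(labels, on='variant', how='outer', indicator=True)
mismatch = merged[merged['_merge'] != 'both']
if not mismatch.empty:
    print(f'Warning: {len(mismatch)} rows present in only one input')
```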

Logs

Traceback (most recent call last):
  File "/home/robert/git/capice-resources/venv/bin/compare-model-performance", line 33, in <module>
    sys.exit(load_entry_point('capice-resources', 'console_scripts', 'compare-model-performance')())
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/__main__.py", line 367, in main
    CompareModelPerformance().run()
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/core/__init__.py", line 53, in run
    output = self.run_module(args)
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/__main__.py", line 151, in run_module
    consequence_tools.validate_consequence_samples_equal(
  File "/home/robert/git/capice-resources/src/molgenis/capice_resources/compare_model_performance/consequence_tools.py", line 82, in validate_consequence_samples_equal
    m1 = merged_model_1[
  File "/home/robert/git/capice-resources/venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3797, in __getitem__
    if com.is_bool_indexer(key):
  File "/home/robert/git/capice-resources/venv/lib/python3.10/site-packages/pandas/core/common.py", line 135, in is_bool_indexer
    raise ValueError(na_msg)
ValueError: Cannot mask with non-boolean array containing NA / NaN values


Raise error instead of warning "unknown review status ClinVar"

Is your feature request related to a problem? Please describe.
Not yet, but as soon as ClinVar introduces a new star in their review status, this problem becomes a priority.

Describe the solution you'd like
Instead of raising a warning, raising an error would be preferred.
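A minimal sketch of the requested behavior (the status set and function name are hypothetical; the real module may track review statuses differently): fail loudly on an unknown status instead of warning.

```python
# Hypothetical set of known ClinVar review statuses.
KNOWN_REVIEW_STATUSES = {
    'no_assertion_provided',
    'criteria_provided,_single_submitter',
    'criteria_provided,_multiple_submitters,_no_conflicts',
    'reviewed_by_expert_panel',
    'practice_guideline',
}

def check_review_status(status: str) -> None:
    """Raise instead of warn, so a new ClinVar star level cannot
    silently produce mis-weighted training data."""
    if status not in KNOWN_REVIEW_STATUSES:
        raise ValueError(f'Unknown ClinVar review status: {status}')
```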


When running train_data_creator I get error

Traceback (most recent call last):
  File "capice-resources/train_data_creator/main.py", line 5, in <module>
    from train_data_creator.src.main.exporter import Exporter
ModuleNotFoundError: No module named 'train_data_creator'

Removing the train_data_creator prefix from the imports in main.py fixes it.
