capice's Introduction

THIS MOLGENIS VERSION IS IN ARCHIVE MODE. PLEASE USE NEXT GENERATION AT MOLGENIS-EMX2


Welcome to MOLGENIS

MOLGENIS is a collaborative open source project on a mission to generate great software infrastructure for life science research.

Develop

MOLGENIS has a frontend and a backend, which you can develop on separately. When you want to develop an API and an App simultaneously, you need to check out both.


capice's People

Contributors

bartcharbon, dennishendriksen, dependabot[bot], joerivandervelde, marikaris, mswertz, shuang1330, sietsmarj, svandenhoek

capice's Issues

Preprocessor processes irrelevant training features

Describe the bug

The preprocessor does not take train_features.json into account, so it processes all "object" type features in the supplied dataset for both train and predict, which adds unnecessary processing time and resource usage.

System information (command line)

Not applicable.

System information (web service)

Not applicable.

How to Reproduce

Steps to reproduce the behavior:
capice -v train -i /path/to/train.tsv.gz -m /path/to/train_features.json -o /path/to/out.pickle.dat
capice -v predict -i /path/to/predict.tsv.gz -m /path/to/model.pickle.dat -o /path/to/out.tsv.gz

Expected behavior

The preprocessor should take train_features.json into account and skip features that are present in the input dataset but not listed in train_features.json.
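
A minimal sketch of this idea, assuming train_features.json maps feature names to null values (as in the example elsewhere in this tracker) and that the preprocessor currently selects all "object" typed columns; names are illustrative:

import json
import pandas as pd

def features_to_preprocess(dataset: pd.DataFrame, train_features_path: str) -> list:
    """Return only the categorical columns that are actual training features."""
    with open(train_features_path) as handle:
        train_features = set(json.load(handle).keys())
    object_columns = dataset.select_dtypes(include='object').columns
    return [column for column in object_columns if column in train_features]

For predict, the same filter could be applied using the feature list stored in the model instead of train_features.json.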

Logs

2022-10-05 14:07:30     INFO: Preprocessor started.
2022-10-05 14:07:37     INFO: Training protocol, creating new categorical conversion identifiers.
2022-10-05 14:07:37     INFO: For feature: ref saved the following values: C, G, T, A, CT
2022-10-05 14:07:38     INFO: For feature: alt saved the following values: T, A, C, G, CT
2022-10-05 14:07:38     INFO: For feature: Allele saved the following values: A, T, G, C, -
2022-10-05 14:07:38     INFO: For feature: IMPACT saved the following values: LOW, MODIFIER, MODERATE, HIGH
2022-10-05 14:07:38     INFO: For feature: BIOTYPE saved the following values: protein_coding, lncRNA, RNase_MRP_RNA, misc_RNA, transcribed_pseudogene
2022-10-05 14:07:38     INFO: For feature: Exon saved the following values: 3/3, 2/2, 4/4, 1/1, 10/10
2022-10-05 14:07:38     INFO: For feature: Intron saved the following values: 2/2, 9/9, 8/9, 8/8, 3/5
2022-10-05 14:07:38     INFO: For feature: Codons saved the following values: gaC/gaT, gcC/gcT, gaG/gaA, ccC/ccT, aaC/aaT
2022-10-05 14:07:39     INFO: For feature: FLAGS saved the following values: cds_start_NF, cds_end_NF
2022-10-05 14:07:39     INFO: For feature: SpliceAI_pred_SYMBOL saved the following values: TTN, BRCA2, NF1, ATM, BRCA1
2022-10-05 14:07:39     INFO: For feature: gnomAD saved the following values: 18:55335787-55335787, 20:62046497-62046497, 12:21375307-21375307, 3:37067121-37067121, 7:82763559-82763559
2022-10-05 14:07:39     INFO: For feature: oAA saved the following values: L, A, R, P, S
2022-10-05 14:07:39     INFO: For feature: nAA saved the following values: L, S, X, A, T
2022-10-05 14:07:40     INFO: For feature: Type saved the following values: SNV, DEL, INS, DELINS
2022-10-05 14:07:40     INFO: For feature: PolyPhenCat saved the following values: benign, probably_damaging, possibly_damaging
2022-10-05 14:07:40     INFO: For feature: SIFTcat saved the following values: deleterious, tolerated
2022-10-05 14:07:41     INFO: Successfully preprocessed data.

'there shouldn't be any nulls' and 'feature from the model not in data' log messages

Running CAPICE on trio-filtered.vcf.gz using the CAPICE easybuild module on gearshift results in the log: cadd_capice.log.

The log file contains entries such as:

After imputation, there shouldn't be any nulls, but check below:

motifEName 46 0.96
GeneID 4 0.08
GeneName 4 0.08
CCDS 9 0.19
Intron 42 0.88
Exon 18 0.38

False
Categorical variables 10
In total, there are 48 samples

Feature from the model not in data:  Alt_AA
Feature from the model not in data:  Alt_other
Feature from the model not in data:  PolyPhenCat_possibly_damaging
Feature from the model not in data:  Ref_CT
Feature from the model not in data:  Segway_R5
Feature from the model not in data:  Segway_TF0
Feature from the model not in data:  Type_INS
Feature from the model not in data:  nAA_V
(48, 131)

What do these messages mean? Do they indicate a problem?

VEP command incomplete

Describe the bug

The VEP command described in the README is incomplete: it is missing --total_length. This causes all "real...Pos" values to be imputed to their standard imputation value, introducing bias and incorrect scores.

Feature "feature_type" is not checked upon loading in the input dataset in Predict

Describe the bug

The feature feature_type, required for exporting a CAPICE prediction, is not checked when the dataset is loaded. This results in a long wait for the user that only ends in an error when the exporter tries to write the final TSV.

System information (command line)

  • OS: Manjaro
  • Version: 3.1.0
  • Python version: Python3.10.4
  • Shell: ZSH

How to Reproduce

Steps to reproduce the behavior:

  1. capice predict -i ./resources/train_input.tsv.gz -m <path/to/GRCh37_3_0_0_model.pickle.dat>

Expected behavior

An error is raised early on, stating that the feature feature_type is not present in the supplied dataset.
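
A minimal sketch of such an early check, run right after loading the input; the required column set shown here is illustrative:

import pandas as pd

REQUIRED_EXPORT_COLUMNS = ('feature_type',)  # columns the exporter needs at the very end

def validate_required_columns(dataset: pd.DataFrame) -> None:
    """Fail fast instead of erroring in the exporter after a long prediction run."""
    missing = [column for column in REQUIRED_EXPORT_COLUMNS if column not in dataset.columns]
    if missing:
        raise KeyError(f"Input dataset is missing required column(s): {', '.join(missing)}")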

Test resources model out of date

Describe the bug

The test resources model is still versioned as 3.0.0rc3, while it should be 3.0.0. This does not cause any tests to fail, but it does raise errors when using the test resources for a quick manual test.

System information (command line)

  • OS: N.A.
  • Version: 3.1.0
  • Python version: N.A.
  • Shell: N.A.

How to Reproduce

Steps to reproduce the behavior:

  1. capice predict -i ./resources/predict_input.tsv.gz -m ./tests/resources/xgb_booster_poc.pickle.dat

Expected behavior

CAPICE should simply predict scores with a 3.0.0 model, even though those scores will be very poor.

Output path not determined correctly if containing dots in non-extension part

Describe the bug

When the input file is something like model_3.1.0.pickle.dat and no output path is given, the auto-generated output path will be something like model_3_capice.tsv.gz, whereas one would expect model_3.1.0_capice.tsv.gz.

How to Reproduce

Steps to reproduce the behavior:

  1. capice explain -i model_3.1.0.pickle.dat

Additional context

The best solution is probably rewriting the code so that it removes the file extension using the same tuple that is given to input_validator.validate_input_path(input_path, extension: tuple), as sketched below.
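
A minimal sketch of stripping a known extension with such a tuple instead of splitting on the first dot; the helper name is illustrative:

def strip_known_extension(filename: str, extensions: tuple) -> str:
    """Remove a known (possibly double) extension such as '.pickle.dat' or '.tsv.gz'."""
    for extension in sorted(extensions, key=len, reverse=True):
        if filename.endswith(extension):
            return filename[: -len(extension)]
    return filename

# strip_known_extension('model_3.1.0.pickle.dat', ('.pickle.dat', '.tsv.gz'))
# returns 'model_3.1.0', so the auto-generated output becomes model_3.1.0_capice.tsv.gz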

Additionally, the tests for this file should also be looked at and rewritten:

  • Additional tests should be added for the issue above.
  • It is cleaner to call self.processor = InputProcessor('/test/input/file.txt', output, True, '.txt') for each test individually with different parameters, instead of directly overriding variables that were set during initialization and re-running code that was run during initialization (unless there is a good reason to do otherwise).

Need a tutorial

Thanks for your wonderful tool.

I've been trying to understand how to use CAPICE to predict the pathogenicity of indels. I have two questions.

  1. I don't know how to get a gold-standard training dataset. How can I get or generate the pickled model and JSON files? How do I use your model, as described in the article, to predict on my own dataset (genome version: GRCh38)?

  2. I'm working on the GRCh38 genome. When using the online service https://capice.molgeniscloud.org/ , I found that some exonic coordinates were labeled as intergenic or intronic. Does the website use GRCh37 as its reference?

Thank you so much.

Informative debug print statements during training

Is your feature request related to a problem? Please describe.

  • It is not related to a problem.
  • During training (which takes a while) I would love to know whether a newly added feature actually ends up among the features that XGBoost can learn on.
  • I would love to know how many samples are going into the training as train samples and test samples.
  • Since we are going to support balancing again, I would love to know how many samples are used in training per consequence.

Describe the solution you'd like

  • A debug-level logging statement that describes self.processed_features right before the train method is called.
  • A debug-level logging statement that describes the number of samples in the training set and in the test set, again right before the train method is called.
  • A method that extracts all unique consequences present in the training set specifically and counts the number of rows containing each consequence, also at DEBUG level (see the sketch below). Command to extract all unique consequences within dataframe x (assuming the column names have been stripped of their % sign from the BCFTools conversion): pd.Series(x['Consequence'].str.split('&', expand=True).values.ravel()).dropna().sort_values(ignore_index=True).unique(), and to count the rows for a consequence c: x[x['Consequence'].str.contains(c)].shape[0]. EDIT: this point can also be implemented within the balancing script itself. As long as it is documented somewhere, it makes validating model performance easier, since you then know how many training samples are present per consequence and therefore how well the model should perform for each consequence.
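
A minimal sketch of the consequence counting described above, assuming x is the training-set DataFrame with a 'Consequence' column:

import pandas as pd

def count_consequences(x: pd.DataFrame) -> dict:
    """Count how many training rows contain each unique consequence."""
    consequences = (
        pd.Series(x['Consequence'].str.split('&', expand=True).values.ravel())
        .dropna()
        .sort_values(ignore_index=True)
        .unique()
    )
    return {c: x[x['Consequence'].str.contains(c)].shape[0] for c in consequences}

Logging this dictionary at DEBUG level right before training would document how many samples back each consequence.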

Describe alternatives you've considered

None.


Required command line arguments are shown as optional

When using CAPICE through the command line (as intended), calling the help for, for instance, predict shows all arguments as optional, while -i and -m should be required (although the truly optional arguments are shown between brackets in the usage line).
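
A minimal, hypothetical sketch of how the predict parser could mark -i and -m as required with argparse; with required=True the usage line no longer shows them between brackets:

import argparse

parser = argparse.ArgumentParser(prog='capice predict')
parser.add_argument('-i', '--input', required=True, help='input TSV (required)')
parser.add_argument('-m', '--model', required=True, help='model file (required)')
parser.add_argument('-o', '--output', help='output path (optional)')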

Synonymous input raises unclear error

Describe the bug

When CAPICE is supplied with an input file that contains only synonymous variants, the following error is raised:

  Traceback (most recent call last):
    File "/usr/local/bin/capice", line 8, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.10/dist-packages/molgenis/capice/capice.py", line 12, in main
      argument_handler.handle()
    File "/usr/local/lib/python3.10/dist-packages/molgenis/capice/core/args_handler.py", line 34, in handle
      args.func(args)
    File "/usr/local/lib/python3.10/dist-packages/molgenis/capice/cli/args_handler_parent.py", line 97, in _handle_args
      self._handle_module_specific_args(input_path, output_path, output_filename, output_given,
    File "/usr/local/lib/python3.10/dist-packages/molgenis/capice/cli/args_handler_predict.py", line 69, in _handle_module_specific_args
      CapicePredict(input_path, model, output_path, output_given).run()
    File "/usr/local/lib/python3.10/dist-packages/molgenis/capice/main_predict.py", line 29, in run
      capice_data = self.process(loaded_data=capice_data)
    File "/usr/local/lib/python3.10/dist-packages/molgenis/capice/main_predict.py", line 40, in process
      processed_data = super().process(loaded_data)
    File "/usr/local/lib/python3.10/dist-packages/molgenis/capice/main_capice.py", line 73, in process
      processed_data = processor.process(dataset=loaded_data)
    File "/usr/local/lib/python3.10/dist-packages/molgenis/capice/utilities/manual_vep_processor.py", line 34, in process
      dataset = processor.process(dataset)
    File "/usr/local/lib/python3.10/dist-packages/molgenis/capice/vep/template.py", line 45, in process
      return self._process(dataframe)
    File "/usr/local/lib/python3.10/dist-packages/molgenis/capice/vep/amino_acids.py", line 26, in _process
      dataframe[self.columns] = dataframe[self.name].str.split('/', expand=True)
    File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 3643, in __setitem__
      self._setitem_array(key, value)
    File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 3685, in _setitem_array
      check_key_length(self.columns, key, value)
    File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexers/utils.py", line 428, in check_key_length
      raise ValueError("Columns must be same length as key")
  ValueError: Columns must be same length as key

This is due to an error in processing the Amino_acids feature: the VEP processor amino_acids.py expects the field to be formatted as oAA/nAA (observed amino acid / new amino acid), but for synonymous variants the field contains a single amino acid that is both the observed and the alternative one.

System information (command line)

  • OS: Non-specific
  • Version: 4.0.0
  • Python version: non-specific
  • Shell: non-specific

How to Reproduce

Steps to reproduce the behavior:

  1. Supply the attached TSV to capice predict (any model)
  2. Error

Expected behavior

This VEP processor should not error when encountering synonymous-only variants, and it should not have to check whether the input contains only synonymous variants. Maybe first check if dataframe[self.oaa].str.contains('/').any() and only then split? The fillna step does not have to change.
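
A minimal sketch of that guard as a standalone function; the column names are illustrative, and only the str.contains('/') check comes from the suggestion above:

import pandas as pd

def split_amino_acids(dataframe: pd.DataFrame, source: str = 'Amino_acids',
                      columns: tuple = ('oAA', 'nAA')) -> pd.DataFrame:
    """Split 'oAA/nAA' into two columns, tolerating synonymous-only input without a '/'."""
    if dataframe[source].str.contains('/').any():
        dataframe[list(columns)] = dataframe[source].str.split('/', expand=True)
    else:
        # synonymous-only input: the single amino acid is both the observed and the new one
        for column in columns:
            dataframe[column] = dataframe[source]
    return dataframe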

Additional context

Attached input file: chunk_5_prepared_chunk5_capice_input.tsv.gz

No output argument given auto-forces overwriting of existing output

Describe the bug

When capice predict -i /path/to/file.tsv.gz -m /path/to/model.pickle.dat is run without the -o flag, CAPICE does not check whether the auto-generated output file already exists. As a result, an existing file is silently overwritten.

System information (command line)

  • OS: All
  • Version: 3.0.0rc3
  • Python version: 3.10.1
  • Shell: ZSH

How to Reproduce

Steps to reproduce the behavior:

  1. Install: pip install capice
  2. Run the command: capice predict -i /path/to/file.tsv.gz -m /path/to/model.pickle.dat twice.
  3. No error is raised that the file is already present.

Expected behavior

I expect a FileExistsError to be raised.
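
A minimal sketch of the expected check, applied to the auto-generated output path as well; names are illustrative and -f refers to the existing force flag:

import os

def validate_output_path(output_path: str, force: bool = False) -> None:
    """Refuse to overwrite an existing output file unless force (-f) is given."""
    if os.path.exists(output_path) and not force:
        raise FileExistsError(
            f'Output file {output_path} already exists, use -f to overwrite it.'
        )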

Additional information

This also holds for the train module.

[CAPICE2VCF] Updated CAPICE causes a "not a .tsv file" error.

Running the updated CAPICE through VIP causes the following error in CAPICE2VCF:

java.lang.IllegalArgumentException: Input file '/local/415774/tmp.poywevebbh/2_annotate/2_capice/3_capice_predict/test_vip.vcf.gz.tsv.gz' is not a .tsv file.
	at org.molgenis.capice.AppCommandLineOptions.validateInput(AppCommandLineOptions.java:90)
	at org.molgenis.capice.AppCommandLineOptions.validateCommandLine(AppCommandLineOptions.java:69)
	at org.molgenis.capice.AppCommandLineRunner.createSettings(AppCommandLineRunner.java:104)
	at org.molgenis.capice.AppCommandLineRunner.run(AppCommandLineRunner.java:74)
	at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:795)
	at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:779)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:322)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:1237)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:1226)
	at org.molgenis.capice.App.main(App.java:10)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
	at org.springframework.boot.loader.Launcher.launch(Launcher.java:109)
	at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
	at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:88)

The bug can be reproduced by running the new CAPICE and then trying to convert its output through CAPICE2VCF using the easybuild CAPICE/v1.3.0 module.

Upgrade pandas from 1.2.4 to 1.3.3

Upgrading pandas from 1.2.4 to 1.3.3 fails with ValueError: Columns must be same length as key:

When running CAPICE with pandas 1.3.3, the manual annotator domain.py throws an error stating that the number of columns must match the key in dataframe[self.columns] = subset. This happens with any dataset, regardless of whether --train is used.

Current solution: downgrade to pandas==1.2.4

Scipy version not specified

scikit-learn requires SciPy, but the SciPy version is not pinned in requirements.txt, so pip install -r requirements.txt installs the latest SciPy (currently 1.5.0). This causes a RuntimeError. Solution: specify scipy==1.0.1 in requirements.txt.

Imputing on input instead of derivative columns

Is your feature request related to a problem? Please describe.
Currently, imputing is done on the fields generated after processing the data. This creates a lot of "extra" fields that are imputed, and code changes could result in additional impute fields (instead of staying tied to the raw input itself). To keep the impute.json understandable and small, imputing should be done before columns are split up into different fields or otherwise transformed. The impute JSON file should not need to know how to deal with the data after all the processing steps done within CAPICE, only how to deal with missing data that is given to CAPICE directly.

Describe the solution you'd like
Example of current situation:
consequence = stop_gained -> is_stop_gained=1 & other 35 consequence columns=0
The impute json file has a default value for all 36 accepted consequences.

Example of solution:
The impute JSON has an impute value for when consequence is null (if that is even allowed). Imputing is done on the original data, so derivative columns never need imputing themselves but are instead created from already-imputed data. In scenarios where empty values should be treated as data (e.g. consequence is null -> is_unknown=1), this could be indicated with a specific impute label, or by removing the impute value and making sure the code knows how to handle every situation.

Describe alternatives you've considered
Leaving it as it is. This is low priority, but keeping the impute JSON columns equal to the input columns, rather than to all columns derived from the input, keeps things clearer when adding or removing features. Additionally, if unsupported changes are made to certain features (such as including more consequences), this can be fixed with a bugfix without also needing to create an updated impute JSON.

[CAPICE2VCF] Updated CAPICE causes a NumberFormatException for input "String"

Running the updated CAPICE through CAPICE2VCF, with gzipping of the CAPICE output temporarily disabled to avoid issue #30, raises the following error:

java.lang.IllegalStateException: java.lang.NumberFormatException: For input string: "FRAME_SHIFT"
	at org.molgenis.capice.vcf.TsvToVcfMapperImpl.getPrediction(TsvToVcfMapperImpl.java:104)
	at org.molgenis.capice.vcf.TsvToVcfMapperImpl.mapLine(TsvToVcfMapperImpl.java:66)
	at org.molgenis.capice.vcf.TsvToVcfMapperImpl.mapCapiceOutput(TsvToVcfMapperImpl.java:56)
	at org.molgenis.capice.vcf.TsvToVcfMapperImpl.map(TsvToVcfMapperImpl.java:43)
	at org.molgenis.capice.vcf.CapiceServiceImpl.mapPredictionsToVcf(CapiceServiceImpl.java:36)
	at org.molgenis.capice.AppCommandLineRunner.run(AppCommandLineRunner.java:87)
	at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:795)
	at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:779)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:322)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:1237)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:1226)
	at org.molgenis.capice.App.main(App.java:10)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
	at org.springframework.boot.loader.Launcher.launch(Launcher.java:109)
	at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
	at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:88)
Caused by: java.lang.NumberFormatException: For input string: "FRAME_SHIFT"
	at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
	at java.base/jdk.internal.math.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
	at java.base/java.lang.Float.parseFloat(Float.java:455)
	at org.molgenis.capice.vcf.TsvToVcfMapperImpl.getPrediction(TsvToVcfMapperImpl.java:101)
	... 19 common frames omitted

Reading the error message, this is likely due to changes in the output format of the updated CAPICE: getPrediction apparently selects the prediction column by position, so it now tries to parse the Consequence value ("FRAME_SHIFT") as a float.

Currently, the output of CAPICE looks like this:

chr_pos_ref_alt  GeneName  Consequence  PHRED  probabilities  prediction  combined_prediction
string           string    string       float  float          string      string

New CAPICE output looks like this:

chr_pos_ref_alt  GeneName  FeatureID  Consequence  probabilities
string           string    string     string       float

Overwrite gets ignored with multiple model files

When the config [OVERWRITES] model is set and multiple model files are present, there is a chance that the overwrite model file is not set correctly, due to incorrect breakpoints in both the imputer and preprocessor. This will be fixed with the PR for the new models.

Unexpected field values do not throw an error.

Describe the bug

While the data is adjusted so that it works best for machine learning, unexpected values are simply ignored when encountered. This could cause issues if, for example, the expected output changes over time.

Example: https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html contains a list of possible consequences. splice_donor_5th_base_variant, splice_donor_region_variant & splice_polypyrimidine_tract_variant are currently missing from the conversion here. If such a consequence were present in the input, it would be ignored for machine learning: no error is thrown and it is not actually converted.

Expected behavior

An error (or at least a warning) is thrown when the input contains unexpected values. This way there is feedback to a user when the current code might not function as expected (which hopefully will lead to feedback to developers that changes are required).
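
A minimal sketch of such a check for the consequence example; the supported set shown here is an illustrative subset, not the real conversion table:

import logging
import pandas as pd

SUPPORTED_CONSEQUENCES = {'stop_gained', 'missense_variant', 'synonymous_variant'}  # illustrative subset

def warn_on_unknown_consequences(dataframe: pd.DataFrame) -> None:
    """Warn (or raise) when the input contains consequences the conversion does not know about."""
    observed = set(
        pd.Series(dataframe['Consequence'].str.split('&', expand=True).values.ravel())
        .dropna()
        .unique()
    )
    unknown = observed - SUPPORTED_CONSEQUENCES
    if unknown:
        logging.warning('Input contains unexpected consequence values: %s', sorted(unknown))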

If output file is created while CAPICE is running, file will be overwritten even without -f

Describe the bug

CAPICE validates whether the output file exists during startup, but not when the file actually needs to be written. This can cause files to be overwritten without -f, especially since functionalities such as training can take several hours. If multiple runs are executed (or a file is created manually) while CAPICE is already running, the already existing file will be overwritten.


How to Reproduce

Steps to reproduce the behavior:

  1. capice train -i ./resources/train_input.tsv.gz -e ./resources/train_features.json -o ~/Desktop/test1.ubj
  2. while above is still running: touch ~/Desktop/test1.ubj

The created empty file will be overwritten by CAPICE, even without -f.

Expected behavior

The file is not overwritten when -f is not used (or a new file is created, with a warning, if the file appeared during the CAPICE run).

Incorrect model file raises incorrect input error

Describe the bug

When capice predict is run with an -m that points to a "model file" that does not exist, CAPICE raises the error capice predict: error: Input file does not exist!, which is inaccurate: the input file does exist, but the model file does not.

System information (command line)

  • OS: Manjaro 22.0.0 Sikaris (Kernel: x86_64 Linux 5.15.76-1-MANJARO)
  • Version: 4.0.0
  • Python version: Python3.10.8
  • Shell: ZSH

How to Reproduce

Steps to reproduce the behavior:

  1. source capice/venv/bin/activate
  2. Run the command capice predict -i capice/resources/predict_input.tsv.gz -m /path/to/model.pickle.dat -o some_output.tsv.gz
  3. See error.

Expected behavior

The following input error is raised instead:
capice predict: error: Input model file does not exist!
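
A minimal, illustrative sketch of validating both paths separately so the error names the missing file; the exact wording and exception type are placeholders:

import os

def validate_existing_path(path: str, label: str) -> None:
    """Raise a message that names which file is missing (input data vs. model)."""
    if not os.path.exists(path):
        raise FileNotFoundError(f'Input {label} file does not exist: {path}')

# validate_existing_path(args.input, 'data')   -> "Input data file does not exist: ..."
# validate_existing_path(args.model, 'model')  -> "Input model file does not exist: ..."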


ValueError raised with XGBoost v1.4.2

Running:
python3 capice.py -i ./CAPICE_example/train_dataset.tsv.gz -o ./test_output --train
will result in an XGBoost error: ValueError: too many values to unpack (expected 2). Likely cause: the input to fit changed between XGBoost 0.90 and XGBoost 1.4.2.

Broken .tsv to .vcf converter is outdated

The Java code to convert .tsv to .vcf is conceptually incorrect due to the following changes in CAPICE v2.0.0:

Output:
- Changed output of CAPICE to no longer remove duplicate entries upon 'chrom-pos-ref-alt'
- Exposed transcript identifier

Furthermore, the .tsv to .vcf converter no longer makes sense because, since CAPICE v2.0.0, CAPICE produces per-transcript scores instead of chr-pos-ref-alt scores.

The converter should be removed in favor of e.g. a VEP plugin that annotates per-transcript using the .tsv directly.

Decomplexifying CLI

Is your feature request related to a problem? Please describe.
The command line interface parsing is very complex right now, with argparse.add_parser(action='append') and methods built around that action to parse a single argument. This can be done much more cleanly with a custom action class: https://stackoverflow.com/questions/55430668/display-error-when-argument-is-specified-multiple-times . Making it more object-oriented would also allow more useful tests to be written.

Describe the solution you'd like
Custom action class instead of the action='append' and the methods around it.
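
A minimal sketch of such a custom action class, following the linked Stack Overflow idea; the class name is illustrative:

import argparse

class SingleUseAction(argparse.Action):
    """Error out when the same argument is specified more than once (assumes the default is None)."""
    def __call__(self, parser, namespace, values, option_string=None):
        if getattr(namespace, self.dest, None) is not None:
            parser.error(f'{option_string} may only be specified once')
        setattr(namespace, self.dest, values)

parser = argparse.ArgumentParser(prog='capice predict')
parser.add_argument('-i', '--input', action=SingleUseAction, required=True)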


Python3.10 support

The following packages are not available for Python3.10:

  • Scipy 1.6.2 (reason: Python3.10 not supported in setup.py of Scipy)
  • Numpy 1.21.0 (reason: Scipy 1.7.3 not compatible)
  • Pandas 1.2.4 (reason: numpy 1.22.0 deprecation warnings)

Advised new versions:

  • Scipy 1.7.3
  • Numpy 1.22.0
  • Pandas 1.3.5

The following tests will fail with Python3.10, Scipy 1.7.3, Numpy 1.22.0 and Pandas 1.3.5:

  • test_edge_cases_predict.py (all tests)
  • test_main_predict.py (all tests)
  • test_main_train.py (test_integration_training)
  • test_imputer.py (all tests)
  • test_manual_vep_processor.py (all tests)
  • test_predict.py (all tests)
  • test_predictor.py (Error on all tests)
  • test_preprocessing (all tests)
  • test_cons_detail.py (all tests)
  • test_domain.py (test_process_not_null)
  • test_length.py (all tests)
  • test_motif_e_score_change (all tests)
  • test_motif_ehi_pos.py (all tests)

Tests fail due to a "ValueError: Columns must be same length as key"

Ensure columns that are removed for training do not depend on the imputing json file

Is your feature request related to a problem? Please describe.
Currently this does not cause an issue. However, if CAPICE were to deprecate imputing, training would fail because it uses the imputing file to exclude certain fields. Additionally, if one were to include such a field in the imputing JSON (e.g. gene_name), that field would be used for training as well.

Describe the solution you'd like
Ensure certain fields are excluded by default instead of depending on the imputing JSON for their exclusion (see the sketch below).
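
A minimal sketch of an explicit default exclusion list that no longer depends on the imputing JSON; the column names listed are illustrative:

import pandas as pd

# columns that should never end up as training features, regardless of the imputing JSON
COLUMNS_EXCLUDED_FROM_TRAINING = ('chr', 'pos', 'ref', 'alt', 'gene_name')

def drop_non_training_columns(dataframe: pd.DataFrame) -> pd.DataFrame:
    present = [column for column in COLUMNS_EXCLUDED_FROM_TRAINING if column in dataframe.columns]
    return dataframe.drop(columns=present)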

Describe alternatives you've considered
Leaving it as it is. This is a low-priority issue, so if it is deemed unnecessary it can be marked as won't fix.

Sklearn/XGBoost n threads is hardcoded

Describe the bug

The number of parallel threads used by XGBoost is hardcoded instead of being based on the number of threads available on the system.

Expected behavior

Either:
A) The value is based on the available threads on the system. This should be fixable with n_jobs=-1 (except perhaps on the cluster, see below).
B) The value is set through a command line argument (with 1 or option A as the default); see the sketch below.
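
A minimal sketch of option B, exposing the thread count as a command line argument; the flag name is illustrative:

import argparse
import xgboost as xgb

parser = argparse.ArgumentParser(prog='capice train')
parser.add_argument('--threads', type=int, default=1,
                    help='number of parallel threads for XGBoost (-1 = all available)')
args = parser.parse_args()

# pass the requested thread count straight through to the XGBoost sklearn wrapper
model = xgb.XGBClassifier(n_jobs=args.threads)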

Additional context

Pre-1.3.0 versions of XGBoost have a bug and should use something like this instead.

XGBoost uses cgroups to determine whether the CPU limit should be set lower than the actual system resources (when using n_jobs=-1). It is unclear whether this works nicely with Slurm; something like this might be needed for that instead. Either way, some testing on the cluster is required no matter which solution is chosen.

[CAPICE2VCF] Unable to process duplicate entries

Running VIP with the legacy version of the updated CAPICE on the test VCF supplied within VIP shows different scores for some variants:

Chromosome  Pos          Ref  Alt    Gene    Old score  New score
19          17,451,997   GA   G/GA   GTPBP3  0.9838     0.7837
8           145,140,500  CAG  C/CAG  GPAA1   0.9733     0.0752

This happens for the following reason:
The updated CAPICE only discards duplicate entries when all columns within the input file are duplicated, to prevent variants from going unscored when they share a chromosomal position but, for example, lie on the reverse strand. This results in 15 more output variants from the updated CAPICE compared to the current CAPICE. Duplicated entries are therefore now present, which CAPICE2VCF cannot process correctly, since GeneName and Consequence now play a very important role.

Output "." raises extension error in Predict

Describe the bug

When CAPICE is run in predict mode with -o . supplied on the command line, an extension error is raised stating that the file should end with either .tsv or .tsv.gz.

System information (command line)

  • OS: Manjaro
  • Version: 3.1.0
  • Python version: Python3.10.4
  • Shell: ZSH

How to Reproduce

Steps to reproduce the behavior:

  1. capice predict -i ./resources/predict_input.tsv.gz -m <path/to/model.pickle.dat> -o .

Expected behavior

If -o . is supplied, either raise an error stating that -o . is the default and does not need to be supplied, OR rewrite the code so that -o . is treated as the directory from which the CAPICE call is made (see the sketch below).
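
A minimal sketch of the second option, treating an existing directory given to -o as the output location; the helper and the derived filename pattern are illustrative:

import os

def resolve_output_path(output_arg: str, input_path: str) -> str:
    """If -o points to a directory, derive the output filename from the input file."""
    if os.path.isdir(output_arg):
        base = os.path.basename(input_path)
        for extension in ('.tsv.gz', '.tsv'):
            if base.endswith(extension):
                base = base[: -len(extension)]
                break
        return os.path.join(output_arg, f'{base}_capice.tsv.gz')
    return output_arg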

README updates

Is your feature request related to a problem? Please describe.

Problem 1:
Currently the README points to a folder called "CAPICE_models", but it no longer exists. For a new user this causes confusion and frustration.

Problem 2:
Within the section Usage > CAPICE, the description of the force flag states that force does not work for log files. This is an artifact of CAPICE 2.0.0 and should be removed to avoid confusion.

Describe the solution you'd like

For problem 1: Redirect the user to the GitHub release assets for the latest and greatest CAPICE models.
For problem 2: Remove the misleading text.

Describe alternatives you've considered

No alternative is suggested.

Additional context

None.

Make features not case sensitive

Is your feature request related to a problem? Please describe.
Not now, but presumably in the future.

Describe the solution you'd like
Accept the train features in upper, lower and mixed ("Spongebob mocking") case, such as:

{
    "Ref": null,
    "alt": null,
    "feAtUrE_1": null,
    "FEATURE_2": null
}
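
A minimal sketch of matching such mixed-case feature names against the dataset columns case-insensitively; names are illustrative:

import pandas as pd

def match_features_case_insensitive(train_features: list, dataset: pd.DataFrame) -> dict:
    """Map each requested feature to the actual dataset column, ignoring case."""
    lookup = {column.lower(): column for column in dataset.columns}
    return {feature: lookup.get(feature.lower()) for feature in train_features}

# match_features_case_insensitive(['Ref', 'alt', 'feAtUrE_1'], dataset)
# -> {'Ref': 'ref', 'alt': 'alt', 'feAtUrE_1': 'feature_1'} when those columns exist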

