
dysgu's People

Contributors

brentp, jmencius, kcleal, kearseya


dysgu's Issues

Read Depth

Hello!

I've noticed in the FORMAT field there is a metric that reports the number of reads supporting the structural variant, but I was wondering if there is any metric or a way to report the total number of reads in that region.

Best regards,
Jonatan
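
A minimal pysam sketch, assuming a coordinate-sorted and indexed BAM plus hypothetical coordinates, that counts every read overlapping an SV interval; it could serve as a stop-gap for such a metric.

import pysam

# Hypothetical inputs; substitute your own BAM and SV coordinates.
bam_path = "sample.bam"
chrom, start, end = "chr1", 1_000_000, 1_000_500

with pysam.AlignmentFile(bam_path) as bam:
    # count() returns the number of reads overlapping the interval;
    # read_callback="all" skips unmapped, secondary, QC-fail and duplicate reads.
    total_reads = bam.count(chrom, start, end, read_callback="all")

print(f"{chrom}:{start}-{end}\ttotal overlapping reads: {total_reads}")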

Supporting reads

Hello,
I was wondering if it would be possible to add an option to list the supporting reads for each SV? Even if only for the PacBio/ONT reads, I think it would be useful information to have.
Thank you in advance,
Andrea
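
As a rough workaround sketch only (this is not a dysgu option; the BAM path, breakpoint and 50 bp clip cut-off are assumptions), pysam can at least list candidate supporting reads around a breakpoint:

import pysam

# Hypothetical inputs; a crude way to inspect candidate SV-supporting long reads near a breakpoint.
bam_path = "sample.pacbio.bam"
chrom, pos, pad = "chr2", 5_432_100, 500

with pysam.AlignmentFile(bam_path) as bam:
    for read in bam.fetch(chrom, max(0, pos - pad), pos + pad):
        # Reads with long soft/hard clips or split alignments (SA tag) are the usual candidates.
        clipped = any(op in (4, 5) and length >= 50 for op, length in (read.cigartuples or []))
        if clipped or read.has_tag("SA"):
            print(read.query_name, read.reference_start, read.cigarstring)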

Bug in re-mapping?

Hi,
I am currently testing dysgu on some chicken samples. For some samples, sequenced with 2 x 151 bp paired-end Illumina reads, I encounter an error when --remap=True:

...
2022-08-09 13:41:43,692 [INFO   ]  Number of matching SVs from --sites 101804
Traceback (most recent call last):
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/dysgu/main.py", line 444, in call_events
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1337, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 1192, in dysgu.cluster.pipe1
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/dysgu/re_map.py", line 427, in remap_soft_clips
    gstart, ref_seq_big, idx)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/dysgu/re_map.py", line 279, in process_contig
    e.ref_seq = ref_seq_clipped[500 - 1]
IndexError: string index out of range

The chicken reference genome actually has some small contigs < 500 bp, but I'm not sure whether ref_seq_clipped holds the reference contig. Further, I would then expect the error for all samples when force-calling, but it appears in only 4 out of 6 test samples.
Do you have any idea whether this could be causing the problem?
Thanks,
Johannes
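
One quick way to test the small-contig hypothesis is a short pysam sketch (assuming the reference has a .fai index; the path is a placeholder) that lists contigs shorter than 500 bp:

import pysam

ref_path = "chicken_reference.fa"  # placeholder path; index with samtools faidx first

with pysam.FastaFile(ref_path) as fa:
    short = [(name, length) for name, length in zip(fa.references, fa.lengths) if length < 500]

print(f"{len(short)} contigs shorter than 500 bp")
for name, length in short[:20]:
    print(name, length)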

Error with --max-cov auto

Hi @kcleal and thanks for the great tool!

I was playing around with a few settings and ran into the following error when trying out --max-cov auto.
Note that the pipeline runs perfectly when this is set to -1 or the default of 200, for instance.

2022-03-23 19:26:25,545 [INFO   ]  [dysgu-run] Version: 1.3.7
2022-03-23 19:26:25,546 [INFO   ]  run --mode nanopore --diploid True --min-support 3 --min-size 30 --max-cov auto -o output.vcf -p 18 -c genome.fa /scratch input.bam
2022-03-23 19:26:25,546 [INFO   ]  Destination: /scratch
[W::hts_idx_load3] The index file is older than the data file: input.bam.bai
Traceback (most recent call last):
  File "/opt/venv/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/dysgu/main.py", line 270, in run_pipeline
    max_cov_value = sv2bam.process(ctx.obj)
  File "dysgu/sv2bam.pyx", line 165, in dysgu.sv2bam.process
  File "dysgu/coverage.pyx", line 47, in dysgu.coverage.auto_max_cov
TypeError: 'bool' object is not callable

AttributeError: module 'numpy' has no attribute 'float'

Hi,

Would you please add dysgu to Bioconda?

After trying very hard to install dysgu, I got the following errors. Do you know why?
dysgu -h
Traceback (most recent call last):
File "/Bio/User/kxie/software/anaconda3/envs/dysgu/bin/dysgu", line 5, in
from dysgu.main import cli
File "/Bio/User/kxie/software/anaconda3/envs/dysgu/lib/python3.9/site-packages/dysgu/main.py", line 11, in
from dysgu import cluster, view, sv2bam
File "dysgu/cluster.pyx", line 15, in init dysgu.cluster
File "dysgu/coverage.pyx", line 5, in init dysgu.coverage
File "dysgu/io_funcs.pyx", line 19, in init dysgu.io_funcs
File "/Bio/User/kxie/software/anaconda3/envs/dysgu/lib/python3.9/site-packages/numpy/init.py", line 284, in getattr
raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'float'

Best,
Kun
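
For context: NumPy 1.24 removed the long-deprecated alias np.float, which older releases of some dysgu dependencies still reference. A minimal sketch of the usual workarounds, assuming you control the environment (pinning NumPy is the cleaner option):

# Option 1 (environment): pin NumPy below 1.24, e.g. pip install "numpy<1.24".
# Option 2 (blunt shim): restore the removed alias before importing dysgu.
import numpy as np

if not hasattr(np, "float"):
    np.float = float  # reinstates the alias that older code paths expect

from dysgu.main import cli  # import only after the shim is in place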

Segmentation fault

I ran three samples simultaneously. One of them processed to completion, but two failed due to a segmentation fault during the Building graph step.

2021-07-27 16:01:08,978 [INFO   ]  [dysgu-run] Version: 1.1.7
2021-07-27 16:01:08,980 [INFO   ]  run -p5 GRCh38_full_analysis_set_plus_decoy_hla.fa X_dysgu X.cram.bam
2021-07-27 16:01:08,980 [INFO   ]  Destination: X_dysgu
2021-07-27 20:52:22,501 [INFO   ]  dysgu fetch X.cram.bam written to X_dysgu/X.cram.dysgu_reads.bam, n=32435857, time=4:51:13 h:m:s
2021-07-27 20:52:22,501 [INFO   ]  Input file is: X_dysgu/X.cram.dysgu_reads.bam
[E::idx_find_and_load] Could not retrieve index file for 'X_dysgu/X.cram.dysgu_reads.bam'
2021-07-27 20:52:22,715 [INFO   ]  Input file has index False
2021-07-27 20:52:23,231 [WARNING]  Warning: more than one @RG, using first sample (SM) for output: X
2021-07-27 20:52:23,231 [INFO   ]  Sample name: X
2021-07-27 20:52:23,231 [INFO   ]  Writing vcf to stdout
2021-07-27 20:52:23,231 [INFO   ]  Running pipeline
2021-07-27 20:52:26,642 [INFO   ]  Removed 55 outliers with insert size >= 1661
2021-07-27 20:52:26,659 [INFO   ]  Inferred read length 151.0, insert median 444, insert stdev 202
2021-07-27 20:52:26,660 [INFO   ]  Max clustering dist 1454
2021-07-27 20:52:26,660 [INFO   ]  Minimum support 3
2021-07-27 20:52:26,660 [INFO   ]  Building graph with clustering distance 1454 bp, scope length 1454 bp
Segmentation fault

KeyError: "sequence '1' not present"

I am running dysgu installed on Linux WSL2 (Windows 10). The command is the default:

dysgu run -p4 GRCh38_full_analysis_set_plus_decoy_hla.fa sample.bam > sample_dysgu_sv.vcf

The data is 30x WGS (Illumina) PE 150. The output is an empty VCF; the folder contains 25 bin files and the reads.bam file, which is about 30 GB. The output log is too long to paste, but here are the last few lines of it.

  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/husamia/.local/lib/python3.8/site-packages/dysgu/main.py", line 220, in run_pipeline
    cluster.cluster_reads(ctx.obj)
  File "/home/husamia/.local/lib/python3.8/site-packages/dysgu/re_map.py", line 473, in drop_svs_near_reference_gaps
    logging.warning("Error fetching reference chromosome: {}".format(chrom), errors)
Message: 'Error fetching reference chromosome: Y'
Arguments: (KeyError("sequence 'Y' not present"),)
2021-07-13 15:33:00,848 [INFO   ]  N near gaps dropped 0

2021-07-13 15:33:45,354 [INFO   ]  Loaded n=25 chromosome coverage arrays from /mnt/e/20A0012672_Proband/dysgu
2021-07-13 15:35:37,356 [INFO   ]  Adding genotype
Traceback (most recent call last):
  File "/home/husamia/.local/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/husamia/.local/lib/python3.8/site-packages/dysgu/main.py", line 220, in run_pipeline
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1051, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 970, in dysgu.cluster.pipe1
  File "dysgu/post_call_metrics.pyx", line 465, in dysgu.post_call_metrics.ref_repetitiveness
  File "pysam/libcfaidx.pyx", line 303, in pysam.libcfaidx.FastaFile.fetch
KeyError: "sequence '1' not present"

Error - pip3 dysgu installation

Hi,
I am upgrading to the newest version of dysgu and I have the following error with pip3. I created a dedicated conda env especially for that.

Collecting dysgu
  Using cached dysgu-1.3.7-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (72.7 MB)
Collecting lightgbm
  Using cached lightgbm-3.3.2-py3-none-manylinux1_x86_64.whl (2.0 MB)
Requirement already satisfied: numpy>=1.16.5 in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from dysgu) (1.19.5)
Collecting networkx>=2.4
  Using cached networkx-2.7.1-py3-none-any.whl (2.0 MB)
Requirement already satisfied: pandas in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from dysgu) (1.3.3)
Requirement already satisfied: scipy in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from dysgu) (1.7.1)
Collecting click>=8.0
  Using cached click-8.0.4-py3-none-any.whl (97 kB)
Collecting scikit-learn>=0.22
  Using cached scikit_learn-1.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.4 MB)
Collecting edlib
  Using cached edlib-1.3.9-cp39-cp39-manylinux2010_x86_64.whl (327 kB)
Requirement already satisfied: pysam in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from dysgu) (0.17.0)
Collecting cython
  Using cached Cython-0.29.28-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
Collecting scikit-bio
  Using cached scikit-bio-0.5.6.tar.gz (8.4 MB)
  Preparing metadata (setup.py) ... done
Collecting sortedcontainers
  Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Collecting joblib>=0.11
  Using cached joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Requirement already satisfied: wheel in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from lightgbm->dysgu) (0.37.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from pandas->dysgu) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from pandas->dysgu) (2021.1)
Collecting lockfile>=0.10.2
  Using cached lockfile-0.12.2-py2.py3-none-any.whl (13 kB)
Collecting CacheControl>=0.11.5
  Using cached CacheControl-0.12.10-py2.py3-none-any.whl (20 kB)
Collecting decorator>=3.4.2
  Using cached decorator-5.1.1-py3-none-any.whl (9.1 kB)
Collecting IPython>=3.2.0
  Using cached ipython-8.1.1-py3-none-any.whl (750 kB)
Collecting matplotlib>=1.4.3
  Using cached matplotlib-3.5.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)
Collecting natsort>=4.0.3
  Using cached natsort-8.1.0-py3-none-any.whl (37 kB)
Collecting hdmedians>=0.13
  Using cached hdmedians-0.14.2.tar.gz (7.6 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: requests in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from CacheControl>=0.11.5->scikit-bio->dysgu) (2.26.0)
Collecting msgpack>=0.5.2
  Using cached msgpack-1.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (322 kB)
Collecting prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0
  Using cached prompt_toolkit-3.0.28-py3-none-any.whl (380 kB)
Requirement already satisfied: setuptools>=18.5 in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from IPython>=3.2.0->scikit-bio->dysgu) (60.5.0)
Collecting matplotlib-inline
  Using cached matplotlib_inline-0.1.3-py3-none-any.whl (8.2 kB)
Collecting pexpect>4.3
  Using cached pexpect-4.8.0-py2.py3-none-any.whl (59 kB)
Collecting stack-data
  Using cached stack_data-0.2.0-py3-none-any.whl (21 kB)
Collecting jedi>=0.16
  Using cached jedi-0.18.1-py2.py3-none-any.whl (1.6 MB)
Collecting traitlets>=5
  Using cached traitlets-5.1.1-py3-none-any.whl (102 kB)
Collecting backcall
  Using cached backcall-0.2.0-py2.py3-none-any.whl (11 kB)
Collecting pygments>=2.4.0
  Using cached Pygments-2.11.2-py3-none-any.whl (1.1 MB)
Collecting pickleshare
  Using cached pickleshare-0.7.5-py2.py3-none-any.whl (6.9 kB)
Collecting kiwisolver>=1.0.1
  Using cached kiwisolver-1.4.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)
Collecting fonttools>=4.22.0
  Using cached fonttools-4.31.2-py3-none-any.whl (899 kB)
Collecting cycler>=0.10
  Using cached cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting pillow>=6.2.0
  Using cached Pillow-9.0.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)
Collecting packaging>=20.0
  Using cached packaging-21.3-py3-none-any.whl (40 kB)
Collecting pyparsing>=2.2.1
  Using cached pyparsing-3.0.7-py3-none-any.whl (98 kB)
Requirement already satisfied: six>=1.5 in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas->dysgu) (1.15.0)
Collecting parso<0.9.0,>=0.8.0
  Using cached parso-0.8.3-py2.py3-none-any.whl (100 kB)
Collecting ptyprocess>=0.5
  Using cached ptyprocess-0.7.0-py2.py3-none-any.whl (13 kB)
Collecting wcwidth
  Using cached wcwidth-0.2.5-py2.py3-none-any.whl (30 kB)
Requirement already satisfied: idna<4,>=2.5 in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from requests->CacheControl>=0.11.5->scikit-bio->dysgu) (3.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from requests->CacheControl>=0.11.5->scikit-bio->dysgu) (1.26.7)
Requirement already satisfied: charset-normalizer~=2.0.0 in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from requests->CacheControl>=0.11.5->scikit-bio->dysgu) (2.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /home/kgagalova/.linuxbrew/lib/python3.9/site-packages (from requests->CacheControl>=0.11.5->scikit-bio->dysgu) (2021.5.30)
Collecting asttokens
  Using cached asttokens-2.0.5-py2.py3-none-any.whl (20 kB)
Collecting executing
  Using cached executing-0.8.3-py2.py3-none-any.whl (16 kB)
Collecting pure-eval
  Using cached pure_eval-0.2.2-py3-none-any.whl (11 kB)
Building wheels for collected packages: scikit-bio, hdmedians
  Building wheel for scikit-bio (setup.py) ... error
ERROR: Command errored out with exit status 1:
   command: /home/kgagalova/.linuxbrew/opt/[email protected]/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-22gwih1p/scikit-bio_fd5adfa3cbcc4059926676ebae73c81f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-22gwih1p/scikit-bio_fd5adfa3cbcc4059926676ebae73c81f/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-c_4psgrc
       cwd: /tmp/pip-install-22gwih1p/scikit-bio_fd5adfa3cbcc4059926676ebae73c81f/
  Complete output (663 lines):
  running bdist_wheel
  running build
[....]
running build_ext
  creating build/temp.linux-x86_64-3.9
  creating build/temp.linux-x86_64-3.9/skbio
  creating build/temp.linux-x86_64-3.9/skbio/metadata
  gcc-5 -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -O3 -Wall -fPIC -I/home/kgagalova/.linuxbrew/opt/[email protected]/lib/python3.9/site-packages/numpy/core/include -I/home/kgagalova/.linuxbrew/opt/[email protected]/include/python3.9 -c skbio/metadata/_intersection.c -o build/temp.linux-x86_64-3.9/skbio/metadata/_intersection.o
  error: command 'gcc-5' failed: No such file or directory
  ----------------------------------------
  ERROR: Failed building wheel for scikit-bio
  Running setup.py clean for scikit-bio
  Building wheel for hdmedians (pyproject.toml) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/kgagalova/.linuxbrew/opt/[email protected]/bin/python3.9 /home/kgagalova/.linuxbrew/opt/[email protected]/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmpke8nnicz
       cwd: /tmp/pip-install-22gwih1p/hdmedians_34f0c0b3631144b6aeec2bd7484540bd
  Complete output (9 lines):
  running bdist_wheel
  running build
  running build_py
  running build_ext
  cythoning hdmedians/geomedian.pyx to hdmedians/geomedian.c
  /tmp/pip-build-env-b8fpnjvp/overlay/lib/python3.9/site-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /tmp/pip-install-22gwih1p/hdmedians_34f0c0b3631144b6aeec2bd7484540bd/hdmedians/geomedian.pyx
    tree = Parsing.p_module(s, pxd, full_module_name)
  building 'hdmedians.geomedian' extension
  error: command 'gcc-5' failed: No such file or directory
  ----------------------------------------
  ERROR: Failed building wheel for hdmedians
Failed to build scikit-bio hdmedians
ERROR: Could not build wheels for hdmedians, which is required to install pyproject.toml-based projects

Looks like I have some issues installing gcc-5 on my system.

ERROR conda.core.link:_execute(699): An error occurred while installing package 'psi4::gcc-5-5.2.0-1'.
Rolling back transaction: done

LinkError: post-link script failed for package psi4::gcc-5-5.2.0-1
location of failed script: /home/kgagalova/miniconda3/envs/py3.7/bin/.gcc-5-post-link.sh
==> script messages <==
<None>
==> script output <==
stdout: Couldn't locate crtXXX.o in default library search paths. You may not have it  at all. It is usually packaged in libc6-dev/glibc-devel packages. We will try  to locate crtXXX.o with system installed gcc...
Installation failed: gcc is not able to compile a simple 'Hello, World' program.

stderr: ln: failed to create symbolic link '/home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-unknown-linux-gnu/5.2.0/crt1.o': File exists
ln: failed to create symbolic link '/home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-unknown-linux-gnu/5.2.0/crti.o': File exists
ln: failed to create symbolic link '/home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-unknown-linux-gnu/5.2.0/11.2.0': File exists
ln: failed to create symbolic link '/home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-unknown-linux-gnu/5.2.0/crtn.o': File exists
ln: failed to create symbolic link '/home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-unknown-linux-gnu/5.2.0/11.2.0': File exists
/home/kgagalova/miniconda3/envs/py3.7/bin/.gcc-5-post-link.sh: line 98: /home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-conda-linux-gnu/11.2.0 /home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-unknown-linux-gnu/5.2.0/specs: No such file or directory
sed: can't read /home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-conda-linux-gnu/11.2.0 /home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-unknown-linux-gnu/5.2.0/specs: No such file or directory
sed: can't read /home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-conda-linux-gnu/11.2.0 /home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-unknown-linux-gnu/5.2.0/specs: No such file or directory
sed: can't read /home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-conda-linux-gnu/11.2.0 /home/kgagalova/miniconda3/envs/py3.7/lib/gcc/x86_64-unknown-linux-gnu/5.2.0/specs: No such file or directory
/home/kgagalova/miniconda3/envs/py3.7/gcc/libexec/gcc/x86_64-unknown-linux-gnu/5.2.0/cc1: error while loading shared libraries: libisl.so.10: cannot open shared object file: No such file or directory

return code: 1

()

Do you have any suggestions on how to install dysgu with pip3? Thank you in advance.

Manual installation

Just a note for others.

Running manual installation (without conda) worked for me, but only after a little mucking around.

conda deactivate

Then from the github README

git clone --recursive https://github.com/kcleal/dysgu.git
cd dysgu/dysgu/htslib
make
cd ../../
pip install --user -r requirements.txt
pip install .

Successfully installed dysgu-1.3.6


$ dysgu

Traceback (most recent call last):
  File "/home/rcug/.local/bin/dysgu", line 5, in <module>
    from dysgu.main import cli
  File "/home/rcug/.local/lib/python3.8/site-packages/dysgu/main.py", line 106, in <module>
    version = pkg_resources.require("dysgu")[0].version
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 901, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 792, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (Click 7.0 (/usr/lib/python3/dist-packages), Requirement.parse('click>=8.0.0'), {'black'})

solved with

pip install --user click==8.0

Now it works:

Usage: dysgu [OPTIONS] COMMAND [ARGS]...

  Dysgu-SV is a set of tools calling structural variants from bam/cram files

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  call   Call structural variants from alignment file/stdin
  fetch  Filters input .bam/.cram for read-pairs that are discordant or...
  merge  Merge .vcf/csv variant files
  run    Run the dysgu pipeline.
  test   Run dysgu tests

RAM Issue - merging large cohorts

Hi
I am working with plant WGS 20X data from more than 300 accessions.
First, I call SVs with dysgu run and then combine them with dysgu merge.
I re-genotype at the sample level, and when I merge the re-genotyped samples I run into a RAM issue (basically, 500 GB of RAM are not enough).
What I find really weird is that the first merge completed with no problem.
Any idea how I can manage this issue?
Cheers

Does the merge command work for translocations?

Apologies if I missed this somewhere, but does the merge command work for merging translocations across input samples? I used the merge command and it didn't seem to merge any, though I could be missing something. If it does merge translocations, is there information on how it does so? Thanks for any assistance.

joint sample VCF

Hi,

Thanks for the nice tool. Do you have a method (or future plans) for joint calling?
I tried merge to combine multiple samples' VCFs, but I am not sure whether this is the correct way to obtain reference-homozygous genotypes.

Best protocol for dysgu run/merge with 35X WGS from multiple patients? Calling cohorts

Hi,

I'm working with Human WGS 35X data from multiple patients.

First, I call SVs with dysgu run (on 26 samples; below is the example for sample 1):

dysgu run \
  --procs 12 \
  --mode pe --pl pe \
  --diploid True \
  --drop-gaps True \
  --max-cov auto \
  --min-support 5 \
  --mq 20 \
  --exclude ${blacklist} \
  --min-size 50 \
  --verbosity 2 \
  ${hg38} \
  ${output}/tmp1 \
  ${input}/sample1.bam > ${output_run}/sample1.SV.vcf

Then, I merge the samples into a unified site list:

dysgu merge \
${output_run}/sample1.SV.vcf  \
${output_run}/sample2.SV.vcf ... \
> ${output_merge}/merged.vcf

I re-genotype at the sample level:

dysgu run --sites ${output_merge}/merged.vcf  ${hg38} tmp1 sample1.bam > ${output_geno}/sample1.re_geno.vcf
dysgu run --sites ${output_merge}/merged.vcf  ${hg38} tmp2 sample1.bam > ${output_geno}/sample2.re_geno.vcf
....

Finally, I merge re-genotyped samples:

dysgu merge \
${output_geno}/sample1.re_geno.vcf  \
${output_geno}/sample2.re_geno.vcf... \
> ${output_merge}/merged.re_geno.vcf

My questions are:
1) For 35X WGS (Illumina paired-end), what is the recommended value for --min-support?
2) Is the last step (merging the re-genotyped samples) correct? It is not mentioned in the docs: https://github.com/kcleal/dysgu/blob/master/README.rst
3) Do you have any specific recommendations for calling germline variants?

Thanks for any help you can provide me on this protocol,

Best,
Tarek

Large memory request

Hello,

I ran into an issue while running a basic test with dysgu that seems to be allocating an extremely large array for some reason. The error is below (redacting some file paths):

2021-07-28 08:59:57,769 [INFO   ]  [dysgu-run] Version: 1.2.7
2021-07-28 08:59:57,770 [INFO   ]  run -o output.vcf --mode pe --pl pe /cluster/home/jholt/reference/hg38_asm5_alt/hg38.fa ./working_dir <redacted>/pipeline/merged_alignments/hg38_asm5_alt/sentieon-202010.02/HALB3002753.bam
2021-07-28 08:59:57,770 [INFO   ]  Destination: ./working_dir
2021-07-28 09:43:45,827 [INFO   ]  dysgu fetch <redacted>/pipeline/merged_alignments/hg38_asm5_alt/sentieon-202010.02/HALB3002753.bam written to ./working_dir/HALB3002753.dysgu_reads.bam, n=65472206, time=0:43:48 h:m:s
2021-07-28 09:43:45,827 [INFO   ]  Input file is: ./working_dir/HALB3002753.dysgu_reads.bam
2021-07-28 09:43:48,444 [INFO   ]  Sample name: HALB3002753
2021-07-28 09:43:48,444 [INFO   ]  Writing SVs to output.vcf
2021-07-28 09:43:48,446 [INFO   ]  Running pipeline
2021-07-28 09:43:49,103 [INFO   ]  Removed 34 outliers with insert size >= 903.0
2021-07-28 09:43:49,124 [INFO   ]  Inferred read length 151.0, insert median 391, insert stdev 99
2021-07-28 09:43:49,125 [INFO   ]  Max clustering dist 886
2021-07-28 09:43:49,126 [INFO   ]  Minimum support 3
2021-07-28 09:43:49,126 [INFO   ]  Building graph with clustering distance 886 bp, scope length 886 bp
2021-07-28 10:01:48,141 [INFO   ]  Total input reads 63689262
2021-07-28 10:02:50,425 [INFO   ]  Graph constructed
(315,)
(132,)
(array([[1.320000e+02, 1.870000e+02],
       [1.230000e+02, 1.980000e+02],
       [1.400000e+02, 2.810000e+02],
       ...,
       [1.496422e+06, 1.496339e+06],
       [1.496425e+06, 1.496419e+06],
       [1.496474e+06, 1.496389e+06]]),)
Traceback (most recent call last):
  File "dysgu/call_component.pyx", line 663, in dysgu.call_component.partition_single
  File "<redacted>/miniconda3/envs/dysgu_test/lib/python3.9/site-packages/scipy/cluster/hierarchy.py", line 1064, in linkage
    if not np.all(np.isfinite(y)):
numpy.core._exceptions.MemoryError: Unable to allocate 41.6 GiB for an array with shape (44717395096,) and data type bool

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<redacted>/miniconda3/envs/dysgu_test/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "<redacted>/miniconda3/envs/dysgu_test/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "<redacted>/miniconda3/envs/dysgu_test/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "<redacted>/miniconda3/envs/dysgu_test/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "<redacted>/miniconda3/envs/dysgu_test/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "<redacted>/miniconda3/envs/dysgu_test/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "<redacted>/miniconda3/envs/dysgu_test/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "<redacted>/miniconda3/envs/dysgu_test/lib/python3.9/site-packages/dysgu/main.py", line 252, in run_pipeline
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1110, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 901, in dysgu.cluster.pipe1
  File "dysgu/cluster.pyx", line 653, in dysgu.cluster.component_job
  File "dysgu/call_component.pyx", line 1747, in dysgu.call_component.call_from_block_model
  File "dysgu/call_component.pyx", line 1755, in dysgu.call_component.call_from_block_model
  File "dysgu/call_component.pyx", line 1736, in dysgu.call_component.multi
  File "dysgu/call_component.pyx", line 874, in dysgu.call_component.single
  File "dysgu/call_component.pyx", line 668, in dysgu.call_component.partition_single
  File "<redacted>/miniconda3/envs/dysgu_test/lib/python3.9/site-packages/scipy/cluster/hierarchy.py", line 1060, in linkage
    y = distance.pdist(y, metric)
  File "<redacted>/miniconda3/envs/dysgu_test/lib/python3.9/site-packages/scipy/spatial/distance.py", line 2250, in pdist
    return pdist_fn(X, out=out, **kwargs)
numpy.core._exceptions.MemoryError: Unable to allocate 333. GiB for an array with shape (44717395096,) and data type float64

It seems like the memory requirements should be much lower according to the docs. Any suggestions?
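
For context on the numbers in the traceback (an illustrative calculation, not a fix): scipy's pdist materialises all n(n-1)/2 pairwise distances, so a single component of roughly 300,000 events already needs hundreds of GiB:

# The array length 44,717,395,096 is exactly n*(n-1)/2 for n = 299,057,
# i.e. scipy.cluster.hierarchy.linkage is being asked to cluster ~300k events at once.
n = 299_057
pairs = n * (n - 1) // 2
print(pairs)                      # 44717395096
print(pairs * 8 / 2**30, "GiB")   # ~333 GiB for the float64 distance array
print(pairs * 1 / 2**30, "GiB")   # ~41.6 GiB for the boolean isfinite mask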

Pickle Warning in run command

Towards the end of the SV run command I noticed the following line:
python3.7/site-packages/sklearn/base.py:338: UserWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to: https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations UserWarning

Are you aware of this? Should I be concerned about the unpickling?

Thanks in advance. Very easy-to-use tool, by the way.

Changing default arguments when read depth is high

Hi
I am using a high-coverage PacBio HiFi dataset with depth 300+.
I noticed that the default --mode pacbio includes --max-cov 150, which is less than my actual average depth:

pacbio: 
  --mq 20 
  --paired False 
  --min-support 2 
  --max-cov 150 
  --dist-norm 200 
  --trust-ins-len True

Can I add --max-cov 500 after --mode pacbio to override that parameter at runtime, or should I manually set all the arguments behind --mode pacbio and replace it altogether?

dysgu run -p 24 --mode pacbio --max-cov 500 -x -c --thresholds 0.45,0.45,0.45,0.45,0.45 Sc.R64-1-1.fa wd mappings.bam > var.vcf

Another point: what values other than 0.45 would one want to use for --thresholds, and what would they mean?

Thanks for your help.

Problem re-genotyping a sample using merged sites

Hi,

Nice tool! I like that it can run with both short and long reads.
In my case I have a few small families (duos or trios) sequenced using PacBio HiFi reads, for which I want to create family VCFs.

Following your suggestions I've first generated SV calls for each individual using dysgu run --mode pacbio, then for each family I've merged calls from the various family members using dysgu merge vcf_sample1.vcf vcf_sample2.vcf | bgzip -c > merged.vcf.gz

Finally, I want to re-call SVs, so I've tried something like the following:

dysgu run -p12 --mode pacbio --sites merged.vcf.gz --sites-prob 0.8 --all-sites True GRCh38.fa tmpdir sample1.bam

I've also tried the above using the uncompressed merged file and without --sites-prob, and the results are the same as described below.

In the beginning the tool seems to run fine, and I think it can read variants from my input. Here is an extract of the log:

2021-11-16 15:35:56,906 [INFO   ]  Writing vcf to stdout
2021-11-16 15:35:56,906 [INFO   ]  Running pipeline
2021-11-16 15:35:56,940 [INFO   ]  Minimum support 2
2021-11-16 15:35:56,940 [INFO   ]  Reading --sites
2021-11-16 15:35:57,623 [INFO   ]  Building graph with clustering 500000 bp
2021-11-16 15:36:31,347 [INFO   ]  Total input reads 244912
2021-11-16 15:36:31,417 [INFO   ]  Added 43336 variants from input sites
2021-11-16 15:36:31,504 [INFO   ]  Graph constructed

But then just after that I always get this error:

Traceback (most recent call last):
  File "/well/gel/HICF2/software/conda_envs/dysgu/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/gpfs3/well/gel/HICF2/software/conda_envs/dysgu/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/gpfs3/well/gel/HICF2/software/conda_envs/dysgu/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/gpfs3/well/gel/HICF2/software/conda_envs/dysgu/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/gpfs3/well/gel/HICF2/software/conda_envs/dysgu/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gpfs3/well/gel/HICF2/software/conda_envs/dysgu/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/gpfs3/well/gel/HICF2/software/conda_envs/dysgu/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/gpfs3/well/gel/HICF2/software/conda_envs/dysgu/lib/python3.7/site-packages/dysgu/main.py", line 280, in run_pipeline 
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1115, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 919, in dysgu.cluster.pipe1
  File "/gpfs3/well/gel/HICF2/software/conda_envs/dysgu/lib/python3.7/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/gpfs3/well/gel/HICF2/software/conda_envs/dysgu/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'dysgu.sites_utils.Site'>: attribute lookup Site on dysgu.sites_utils failed

Any ideas for possible solutions?

Thanks!

Zero Division Error During Call Component

Hello Kez,

Exciting work you've put together here with dysgu. Very happy to have come across your preprint as I've been looking for a tool that would generate such useful read metrics for the major SV classes.
I have been trying to get the software to complete on one of my samples and continue to run into this same error upon running the call step.

dysgu fetch working_dir PD26400a_T.final.bam
2021-12-17 11:44:15,327 [INFO   ]  [dysgu-fetch] Version: 1.3.0
2021-12-17 11:59:25,375 [INFO   ]  dysgu fetch PD26400a_T.final.bam written to working_dir/PD26400a_T.final.dysgu_reads.bam, n=6183390, time=0:15:10 h:m:s

dysgu call --ibam PD26400a_T.final.bam --sites PD26400a_T_vs_PD26400b_N.consensus.somatic.sv.vcf Homo_sapiens_assembly38.fasta working_dir working_dir/PD26400a_T.final.dysgu_reads.bam > svs.vcf
2021-12-17 12:01:57,321 [INFO   ]  [dysgu-call] Version: 1.3.0
2021-12-17 12:01:57,321 [INFO   ]  Input file is: working_dir/PD26400a_T.final.dysgu_reads.bam
2021-12-17 12:01:57,321 [INFO   ]  call --ibam PD26400a_T.final.bam --sites PD26400a_T_vs_PD26400b_N.consensus.somatic.sv.vcf Homo_sapiens_assembly38.fasta working_dir working_dir/PD26400a_T.final.dysgu_reads.bam
[W::hts_idx_load3] The index file is older than the data file: PD26400a_T.final.bai
2021-12-17 12:01:57,573 [INFO   ]  Sample name: PD26400a_T
2021-12-17 12:01:57,573 [INFO   ]  Writing vcf to stdout
2021-12-17 12:01:57,573 [INFO   ]  Running pipeline
2021-12-17 12:01:58,283 [INFO   ]  Calculating insert size. Removed 27 outliers with insert size >= 938.0
2021-12-17 12:01:58,299 [INFO   ]  Inferred read length 151.0, insert median 458, insert stdev 89
2021-12-17 12:01:58,301 [INFO   ]  Max clustering dist 903
2021-12-17 12:01:58,301 [INFO   ]  Minimum support 3
2021-12-17 12:01:58,301 [INFO   ]  Reading --sites
[W::vcf_parse_info] INFO/END=34143939 is smaller than POS at chr1:84232901
2021-12-17 12:01:58,351 [INFO   ]  Building graph with clustering 903 bp
2021-12-17 12:03:57,234 [INFO   ]  Total input reads 6183142
2021-12-17 12:03:59,688 [INFO   ]  Added 57 variants from input sites
2021-12-17 12:04:01,797 [INFO   ]  Graph constructed
Traceback (most recent call last):
  File "/conda/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/conda/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/conda/lib/python3.7/site-packages/dysgu/main.py", line 442, in call_events
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1115, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 898, in dysgu.cluster.pipe1
  File "dysgu/cluster.pyx", line 633, in dysgu.cluster.component_job
  File "dysgu/call_component.pyx", line 1914, in dysgu.call_component.call_from_block_model
  File "dysgu/call_component.pyx", line 1928, in dysgu.call_component.call_from_block_model
  File "dysgu/call_component.pyx", line 1880, in dysgu.call_component.multi
  File "dysgu/call_component.pyx", line 988, in dysgu.call_component.single
  File "dysgu/call_component.pyx", line 608, in dysgu.call_component.make_single_call
  File "dysgu/call_component.pyx", line 286, in dysgu.call_component.count_attributes2
ZeroDivisionError: float division

To troubleshoot, I first ran dysgu test to determine if my build was fully successful and this completed without error.
Next, I checked the output BAM from the fetch step with some samtools commands. It wasn't corrupt and had a reasonable number of reads (6183390).
Then I went into the script call_component.pyx and looked into the code block where the error originated:

if clipped_bases > 0 and aligned_bases > 0:
    er.clip_qual_ratio = (aligned_base_quals / aligned_bases) / (clipped_base_quals / clipped_bases)
else:
    er.clip_qual_ratio = 0

From that I can see it's one of aligned_bases, clipped_base_quals, or clipped_bases.
I couldn't see a reason why any of these values would be zero based on the BAM that was produced, so I wanted to present the issue here.
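
For reference, a hypothetical guard (a sketch only, not necessarily how dysgu addresses it) that also covers the one case the quoted condition does not, namely clipped_base_quals being zero while the read counts are positive:

def clip_qual_ratio(aligned_base_quals, aligned_bases, clipped_base_quals, clipped_bases):
    # Return 0 whenever any denominator would be zero, including clipped_base_quals == 0.
    if aligned_bases > 0 and clipped_bases > 0 and clipped_base_quals > 0:
        return (aligned_base_quals / aligned_bases) / (clipped_base_quals / clipped_bases)
    return 0.0

print(clip_qual_ratio(3000, 100, 0, 20))  # 0.0 instead of ZeroDivisionError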

Let me know if there is any other information or tests you would like me to try.
Best regards,
Patrick

Below are packages installed for build via Anaconda and pip:

Cython                 0.29.26
click                  8.0.3
numpy                  1.21.2
pandas                 1.3.5
pysam                  0.18.0
networkx               2.6.3
scikit-learn           1.0.1
ncls                   0.0.62
scikit-bio             0.5.6
edlib                  1.3.9
sortedcontainers       2.4.0
lightgbm               3.3.1

Merging genotype files

Hi,
I merged multiple samples and used the merged calls to genotype each sample. Then I merged the genotyped samples to get a population-level VCF. It has variants with a mapping quality of zero and a genotype quality of zero. These variants are not present in the original sample genotype file.

Chr: NC_001493.2
Position: 63036-63075
ID: 35

Genotype Information
Sample: ATCC.dysgu_reads
Genotype: T
Quality: 0
Type: HOM_REF
Is Filtered Out: No

Genotype Attributes
MAPQP: 0
SU: 0
PS: 0
BCC: 0
MS: 0
FCC: 0
Genotype Quality: 0
COV: 0
SC: 0
RED: 0
PROB: 0
PE: 0
ICN: 0
NEIGH10: 0
BND: 0
RMS: 0
WR: 0
OCN: 0
SR: 0
How can I filter them out? Thank you.
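
One possible post-hoc filter is a small pysam sketch (paths are placeholders; it assumes the per-sample FORMAT field SU carries the supporting-read count, as in the listing above) that drops records where no sample has non-zero support:

import pysam

vcf_in = pysam.VariantFile("merged.re_geno.vcf")
vcf_out = pysam.VariantFile("merged.re_geno.filtered.vcf", "w", header=vcf_in.header)

for rec in vcf_in:
    keep = False
    for name in rec.samples:
        try:
            su = rec.samples[name]["SU"]
        except KeyError:
            su = 0
        if su not in (None, 0):
            keep = True
            break
    if keep:
        vcf_out.write(rec)

vcf_out.close()
vcf_in.close()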

SV calling NOT working

Hello,
I have used dysgu for SV calling before without facing any errors, but now I am suddenly facing the following error. Can you please help me figure out what's going wrong? Google didn't show any possible solution.

command : dysgu call --mode nanopore -p 20 ../../hg38.fa work_dir/ --ibam HG00733.sortedmappeddedup.bam > HG00733_ONTlee_dysgu.vcf

error :
2022-07-11 12:11:04,948 [INFO ] [dysgu-call] Version: 1.3.11
Traceback (most recent call last):
File "/home/sachin/miniconda3/bin/dysgu", line 8, in
sys.exit(cli())
File "/home/sachin/miniconda3/lib/python3.8/site-packages/click/core.py", line 1128, in call
return self.main(*args, **kwargs)
File "/home/sachin/miniconda3/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/sachin/miniconda3/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/sachin/miniconda3/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/sachin/miniconda3/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/sachin/miniconda3/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/sachin/miniconda3/lib/python3.8/site-packages/dysgu/main.py", line 431, in call_events
raise ValueError("Could not find '{}'".format(kwargs["sv_aligns"]))
ValueError: Could not find 'None'

Merging samples VCFs

Hi,

Thanks for the nice tool

I am trying to merge multiple sample VCFs that came out of the dysgu run -v2 command into one combined file. I used the following command:

dysgu merge Sample1_SVs.vcf Sample2_SVs.vcf Sample3_SVs.vcf .... Sample8_SVs.vcf > Combined_file.vcf

However, if a variant exists in multiple samples it doesn't actually get merged. Instead, the same variant keeps being written on separate rows:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8
2 34200481 18843 A . PASS SVMETHOD=DYSGUv1.3.14;SVTYPE=DEL;END=34200511;CHR2=2;GRP=151067;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=30;CONTIGA=ACTATTGACAATAGTACATATATAATATACAGTATATACACTATTGACAATAGTGTATATAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTA;CONTIGB=agtatatacactattgacaatagtgtataTAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACACGTCTTTTTTATTGTT;GC=24.85;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=36;OL=0;SU=21;WR=7;PE=0;SR=0;SC=7;BND=0;LPREC=1;RT=pe;MeanPROB=0.892;MaxPROB=0.892 GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB 0/1:129.0:60.0:21:7:0:0:7:0:40.02:3:7:7:0:0:18:0.516:0.513:0.994:0.892 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
2 34200481 72815 A . PASS SVMETHOD=DYSGUv1.3.14;SVTYPE=DEL;END=34200511;CHR2=2;GRP=137688;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=30;CONTIGA=ATACTATTGACAATAGTACATATATAATATACAGTATATACACTATTGACAATAGTGTATATAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATA;CONTIGB=cagtatatacactattgacaatagtgtataTAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACACGTCTTTTTTATTGTTTCT;GC=24.85;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=36;OL=0;SU=21;WR=9;PE=0;SR=0;SC=3;BND=0;LPREC=1;RT=pe;MeanPROB=0.89;MaxPROB=0.89 GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0/1:101.0:60.0:21:9:0:0:3:0:38.48:3:6:6:0:0:20:0.444:0.421:0.947:0.89 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
2 34200481 127694 A . PASS SVMETHOD=DYSGUv1.3.14;SVTYPE=DEL;END=34200511;CHR2=2;GRP=138581;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=30;CONTIGA=ATACTATTGACAATAGTACATATATAATATACAGTATATACACTATTGACAATAGTGTATATAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACAC;GC=24.43;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=36;OL=0;SU=26;WR=10;PE=0;SR=0;SC=6;BND=0;LPREC=1;RT=pe;MeanPROB=0.896;MaxPROB=0.896 GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0/1:78.0:60.0:26:10:0:0:6:0:37.75:3:11:5:0:0:17:0.342:0.342:1.0:0.896 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
2 34200481 182265 A . PASS SVMETHOD=DYSGUv1.3.14;SVTYPE=DEL;END=34200511;CHR2=2;GRP=148809;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=30;CONTIGA=ATACTATTGACAATAGTACATATATAATATACAGTATATACACTATTGACAATAGTGTATATAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACA;CONTIGB=tatacactattgacaatagtgtataTAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACACGTCTTTTT;GC=25.08;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=36;OL=0;SU=22;WR=10;PE=0;SR=0;SC=2;BND=0;LPREC=1;RT=pe;MeanPROB=0.888;MaxPROB=0.888 GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0/1:134.0:60.0:22:10:0:0:2:0:38.75:3:8:4:0:0:20:0.509:0.568:1.115:0.888 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
2 34200481 236622 A . PASS SVMETHOD=DYSGUv1.3.14;SVTYPE=DEL;END=34200511;CHR2=2;GRP=149759;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=30;CONTIGA=TGACAATAGTACATATATAATATACAGTATATACACTATTGACAATAGTGTATATAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGT;CONTIGB=ctattgacaatagtgtataTAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACACGTCTTTTTTATTGTTT;GC=25.16;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=36;OL=0;SU=12;WR=3;PE=0;SR=0;SC=6;BND=0;LPREC=1;RT=pe;MeanPROB=0.865;MaxPROB=0.865 GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0/1:92.0:60.0:12:3:0:0:6:0:36.96:3:6:3:0:0:8:0.392:0.389:0.993:0.865 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
2 34200481 290902 A . PASS SVMETHOD=DYSGUv1.3.14;SVTYPE=DEL;END=34200511;CHR2=2;GRP=154353;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=30;CONTIGA=ATACTATTGACAATAGTACATATATAATATACAGTATATACACTATTGACAATAGTGTATATAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACA;CONTIGB=tatacactattgacaatagtgtataTAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACACGTCTTTTTTATT;GC=24.77;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=36;OL=0;SU=19;WR=8;PE=0;SR=0;SC=3;BND=0;LPREC=1;RT=pe;MeanPROB=0.827;MaxPROB=0.827 GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0/1:171.0:60.0:19:8:0:0:3:0:37.45:3:5:6:0:0:13:0.712:0.684:0.961:0.827 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
2 34200481 341988 A . PASS SVMETHOD=DYSGUv1.3.14;SVTYPE=DEL;END=34200511;CHR2=2;GRP=132226;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=30;CONTIGA=ATAGTACATATATAATATACAGTATATACACTATTGACAATAGTGTATATAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACAC;CONTIGB=agtatatacactattgacaatagtgtataTAGAGATATAGCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACACGTCTTTTTTATT;GC=25.31;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=36;OL=0;SU=8;WR=3;PE=0;SR=0;SC=2;BND=0;LPREC=1;RT=pe;MeanPROB=0.842;MaxPROB=0.842 GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0/1:71.0:60.0:8:3:0:0:2:0:32.49:3:2:3:0:0:8:0.344:0.333:0.97:0.842 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
2 34200481 391981 A . PASS SVMETHOD=DYSGUv1.3.14;SVTYPE=DEL;END=34200511;CHR2=2;GRP=154176;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=30;CONTIGA=TACTATTGACAATAGTACATATATAATATACAGTATATACACTATTGACAATAGTGTATATAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACAC;CONTIGB=atatacactattgacaatagtgtataTAGAGATATATCTCTATATTGATACATATGTAGAGATATATCTCTATATTGATATATATGTACACACACAGGAGATATATACGTATGTATCAAAACATGTAATATACGTATACACACGTCTTTTTTATTG;GC=25.23;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=36;OL=0;SU=30;WR=12;PE=0;SR=0;SC=6;BND=0;LPREC=1;RT=pe;MeanPROB=0.92;MaxPROB=0.92 GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 0/1:138.0:60.0:30:12:0:0:6:0:39.6:3:8:10:0:0:19:0.575:0.579:1.007:0.92

Is this what I should expect from the merge command, or am I doing something wrong?

Thanks

Issue writing REP metric to VCF

Hi Kez,

During the steps to write the output of the call function to the VCF, I encountered the following error:

dysgu call --ibam PD26400a_T.final.bam --sites PD26400a_T_vs_PD26400b_N.consensus.somatic.sv.vcf --all-sites True --hom-ref-sites True --clip-length 15 --metrics Homo_sapiens_assembly38.fasta working_dir working_dir/PD26400a_T.final.dysgu_reads.bam > svs.vcf
2021-12-21 13:12:33,777 [INFO   ]  [dysgu-call] Version: 1.3.3
2021-12-21 13:12:33,777 [INFO   ]  Input file is: working_dir/PD26400a_T.final.dysgu_reads.bam
2021-12-21 13:12:33,777 [INFO   ]  call --ibam PD26400a_T.final.bam --sites PD26400a_T_vs_PD26400b_N.consensus.somatic.sv.vcf --all-sites True --hom-ref-sites True --clip-length 15 --metrics Homo_sapiens_assembly38.fasta working_dir working_dir/PD26400a_T.final.dysgu_reads.bam
[W::hts_idx_load3] The index file is older than the data file: PD26400a_T.final.bai
2021-12-21 13:12:34,079 [INFO   ]  Sample name: PD26400a_T
2021-12-21 13:12:34,079 [INFO   ]  Writing vcf to stdout
2021-12-21 13:12:34,079 [INFO   ]  Running pipeline
2021-12-21 13:12:34,782 [INFO   ]  Calculating insert size. Removed 27 outliers with insert size >= 938.0
2021-12-21 13:12:34,798 [INFO   ]  Inferred read length 151.0, insert median 458, insert stdev 89
2021-12-21 13:12:34,799 [INFO   ]  Max clustering dist 903
2021-12-21 13:12:34,799 [INFO   ]  Minimum support 3
2021-12-21 13:12:34,799 [INFO   ]  Reading --sites
[W::vcf_parse_info] INFO/END=34143939 is smaller than POS at chr1:84232901
2021-12-21 13:12:34,826 [INFO   ]  Building graph with clustering 903 bp
2021-12-21 13:14:03,262 [INFO   ]  Total input reads 4894147
2021-12-21 13:14:04,966 [INFO   ]  Added 57 variants from input sites
2021-12-21 13:14:06,505 [INFO   ]  Graph constructed
2021-12-21 13:19:44,026 [INFO   ]  Number of components 2578332. N candidates 416974
2021-12-21 13:19:44,330 [INFO   ]  Number of matching SVs from --sites 62
2021-12-21 13:20:02,988 [INFO   ]  Re-alignment of soft-clips done. N candidates 300203
2021-12-21 13:20:07,483 [INFO   ]  Number of candidate SVs merged: 25814
2021-12-21 13:20:07,483 [INFO   ]  Number of candidate SVs after merge: 274389
2021-12-21 13:20:07,573 [INFO   ]  Number of candidate SVs dropped with sv-len < min-size or support < min support: 243030
2021-12-21 13:20:07,893 [INFO   ]  Number or SVs near gaps dropped 174
2021-12-21 13:20:09,897 [INFO   ]  Loaded n=24 chromosome coverage arrays from working_dir
/conda/lib/python3.7/site-packages/sklearn/base.py:333: UserWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.0.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
  UserWarning,
2021-12-21 13:20:29,021 [INFO   ]  Model: pe, diploid: True, contig features: True. N features: 43
Traceback (most recent call last):
  File "/conda/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/conda/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/conda/lib/python3.7/site-packages/dysgu/main.py", line 445, in call_events
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1190, in dysgu.cluster.cluster_reads
  File "dysgu/io_funcs.pyx", line 656, in dysgu.io_funcs.to_vcf
  File "dysgu/io_funcs.pyx", line 279, in dysgu.io_funcs.make_main_record
ValueError: could not convert string to float: '.'

There were no issues with the fetch command, and the beginning of svs.vcf appears to be output correctly (i.e., the full header was present, including all FILTER/INFO/FORMAT fields).

The issue stems from the REP metric calculation, specifically at the following code block:

if not small_output:
    info_extras += [
        f"REP={'%.3f' % float(rep)}",
        f"REPSC={'%.3f' % float(repsc)}",
    ]

I'm running the most recent push to pypi (v1.3.3).
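
A defensive guard, sketched here as a possible workaround rather than the official fix (the variable names are taken from the snippet above): fall back to a default when the repeat score is the missing-value placeholder '.' instead of passing it straight to float().

def _as_float(value, default=0.0):
    # Treat '.', '' or None as missing rather than crashing on float('.')
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

if not small_output:
    info_extras += [
        f"REP={_as_float(rep):.3f}",
        f"REPSC={_as_float(repsc):.3f}",
    ]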

VCF format for genome browser

Is the VCF output compatible with genome browsers? I am getting an error that the reference allele is missing when I try to open the VCF output with IGV. This would be very useful for visualizing the results.
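
One possible workaround, sketched with pysam (this is not a dysgu feature, and the file paths are placeholders): fill in any missing REF column with the reference base at POS so that IGV will accept the file.

import pysam

ref = pysam.FastaFile("reference.fa")                    # placeholder path
vcf_in = pysam.VariantFile("dysgu.vcf")                  # placeholder path
vcf_out = pysam.VariantFile("dysgu.igv.vcf", "w", header=vcf_in.header)
for rec in vcf_in:
    if rec.ref in (None, "", "."):                       # REF column missing
        # VCF POS is 1-based; pysam fetch() uses 0-based, half-open coordinates
        rec.ref = ref.fetch(rec.chrom, rec.pos - 1, rec.pos).upper()
    vcf_out.write(rec)
vcf_out.close()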

Multiprocessing error

I'm trying to run dysgu with the multiprocessing flag (-p). Some samples run smoothly, but once in a while I get errors like this one:

2022-07-20 18:39:33,269 [INFO   ]  [dysgu-run] Version: 1.3.11
2022-07-20 18:39:35,108 [INFO   ]  run -v1 -p4 -c --regions /home/mimmie/glob/dysgu/chr1_1.bed --regions-only True -I 393,95,151 parsed.pabies-2.0.fa P9904-117_temp dedup.ir.P9904-117.bam
2022-07-20 18:39:35,108 [INFO   ]  Destination: P9904-117_temp
2022-07-20 18:39:35,108 [INFO   ]  Searching regions from /home/mimmie/glob/dysgu/chr1_1.bed
2022-07-20 18:41:13,640 [INFO   ]  dysgu fetch dedup.ir.P9904-117.bam written to P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam, n=417399, time=0:01:38 h:m:s
2022-07-20 18:41:13,789 [INFO   ]  Input file is: P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam
2022-07-20 18:41:13,791 [INFO   ]  Input file has no index, but --include was provided, attempting to index
2022-07-20 18:41:14,439 [INFO   ]  Sample name: P9904-117
2022-07-20 18:41:14,439 [INFO   ]  Writing vcf to stdout
2022-07-20 18:41:14,439 [INFO   ]  Running pipeline
2022-07-20 18:41:14,441 [INFO   ]  Read length 151, insert_median 393, insert stdev 95
2022-07-20 18:41:14,441 [INFO   ]  Max clustering dist 868
2022-07-20 18:41:14,442 [INFO   ]  Minimum support 3
2022-07-20 18:41:14,442 [INFO   ]  Building graph with clustering 868 bp
2022-07-20 18:41:29,817 [INFO   ]  Total input reads 533293
2022-07-20 18:48:00,408 [INFO   ]  Graph constructed
Process Process-1:
Traceback (most recent call last):
  File "/home/mimmie/.pyenv/versions/3.9.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/mimmie/.pyenv/versions/3.9.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "dysgu/cluster.pyx", line 813, in dysgu.cluster.process_job
  File "dysgu/cluster.pyx", line 761, in dysgu.cluster.component_job
  File "dysgu/call_component.pyx", line 1972, in dysgu.call_component.call_from_block_model
  File "dysgu/call_component.pyx", line 1986, in dysgu.call_component.call_from_block_model
  File "dysgu/call_component.pyx", line 1933, in dysgu.call_component.multi
  File "dysgu/call_component.pyx", line 1822, in dysgu.call_component.get_reads
  File "pysam/libcalignmentfile.pyx", line 1876, in pysam.libcalignmentfile.AlignmentFile.__next__
OSError: error -4 while reading file

If I run the sample without multiprocessing, it runs through without any problems. Since I have over 1000 samples with massive genomes, running without multiprocessing is not an option, so I'm grateful for any ideas and suggestions as to what could cause this error and how to fix it.

Type error: '<' not supported between instances of 'bool' and 'str'

Hi,
First of all, thank you for the great tool, dysgu!
I am running it on my BAM file from 10x Genomics linked-read sequencing, and I found that this error keeps popping up. Could you give me some advice, please?

2022-12-04 15:31:26,573 [INFO ] Sample name: SHR_OlaIpcv
2022-12-04 15:31:26,575 [INFO ] Writing vcf to stdout
2022-12-04 15:31:26,575 [INFO ] Running pipeline
2022-12-04 15:31:27,337 [INFO ] Calculating insert size. Removed 735 outliers with insert size >= 1033.0
2022-12-04 15:31:27,345 [INFO ] Inferred read length 148.0, insert median 302, insert stdev 128
2022-12-04 15:31:27,362 [INFO ] Max clustering dist 942
2022-12-04 15:31:27,362 [INFO ] Minimum support 3
2022-12-04 15:31:27,372 [INFO ] Building graph with clustering 942 bp
Traceback (most recent call last):
File "dysgu/graph.pyx", line 754, in dysgu.graph.alignments_from_sa_tag
TypeError: '<' not supported between instances of 'bool' and 'str'
Exception ignored in: 'dysgu.graph.process_alignment'
(the same traceback is repeated many times)

ENH: support cram

Currently, if a CRAM file is sent, dysgu will try to use a fasta over the network according to the header (which fails in my case).
It would be faster to use:

result = hts_set_fai_filename(f_in, fasta);
...

after sam_open. This will work for bam as well (in that it will have no effect for bam files).
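
For reference, the same idea is reachable from Python via pysam (a sketch; the paths are placeholders), pointing the CRAM reader at a local fasta instead of the URL recorded in the header:

import pysam

# Open a CRAM against a local reference rather than the one in its header.
cram = pysam.AlignmentFile("sample.cram", "rc", reference_filename="reference.fa")
for aln in cram.head(5):   # peek at a few alignments to confirm decoding works
    print(aln.query_name)
cram.close()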

Invalid character '.' in 'GQ' FORMAT field

Hi,
I ran into a problem with dysgu when merging multiple samples and genotyping. I have twenty samples. First, SVs were called on each of them and merged. However, genotyping with the merged VCF file failed with the following error.
ATCC BCAHV C02-169 C96-152 C97-256 C98-172 LA87-305 LA90-390 S03-468 S09-363 S09-391 S09-398 S11-140 S11-366 S11-375 S11-462 S11-467 S98-108 S98-595 S99-1170
20
Dysgu is Processing ATCC
2022-07-24 09:03:54,699 [INFO ] [dysgu-call] Version: 1.3.11
2022-07-24 09:03:54,700 [INFO ] Input file is: /ssd2/av724/var_jul19/final2/minimap2_snf/ATCC.bam
2022-07-24 09:03:54,700 [INFO ] call --sites /ssd2/av724/var_jul19/final2/dysgu_np30x/ver2/dy_poplnmerg.vcf /ssd2/av724/varanalysis/reference/atcc_ref_ccv.fna /ssd2/av724/var_jul19/final2/dysgu_np30x/ver2/ATCC-temp /ssd2/av724/var_jul19/final2/minimap2_snf/ATCC.bam
2022-07-24 09:03:54,700 [WARNING] Warning: no @rg, using input file name as sample name for output: ATCC
2022-07-24 09:03:54,700 [INFO ] Sample name: ATCC
2022-07-24 09:03:54,700 [INFO ] Writing vcf to stdout
2022-07-24 09:03:54,700 [INFO ] Running pipeline
2022-07-24 09:03:54,764 [INFO ] Inferred read length 3398.0
2022-07-24 09:03:54,764 [INFO ] Max clustering dist 1050
2022-07-24 09:03:54,764 [INFO ] Minimum support 3
2022-07-24 09:03:54,764 [INFO ] Reading --sites
[E::vcf_parse_format] Invalid character '.' in 'GQ' FORMAT field at NC_001493.2:78
Traceback (most recent call last):
File "/home/av724/miniconda3/envs/dysgu2/bin/dysgu", line 8, in
sys.exit(cli())
File "/home/av724/miniconda3/envs/dysgu2/lib/python3.7/site-packages/click/core.py", line 1134, in call
return self.main(*args, **kwargs)
File "/home/av724/miniconda3/envs/dysgu2/lib/python3.7/site-packages/click/core.py", line 1059, in main
rv = self.invoke(ctx)
File "/home/av724/miniconda3/envs/dysgu2/lib/python3.7/site-packages/click/core.py", line 1665, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/av724/miniconda3/envs/dysgu2/lib/python3.7/site-packages/click/core.py", line 1401, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/av724/miniconda3/envs/dysgu2/lib/python3.7/site-packages/click/core.py", line 767, in invoke
return __callback(*args, **kwargs)
File "/home/av724/miniconda3/envs/dysgu2/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/av724/miniconda3/envs/dysgu2/lib/python3.7/site-packages/dysgu/main.py", line 444, in call_events
cluster.cluster_reads(ctx.obj)
File "dysgu/cluster.pyx", line 1337, in dysgu.cluster.cluster_reads
File "dysgu/cluster.pyx", line 917, in dysgu.cluster.pipe1
File "/home/av724/miniconda3/envs/dysgu2/lib/python3.7/site-packages/dysgu/sites_utils.py", line 61, in vcf_reader
for idx, r in enumerate(vcf):
File "pysam/libcbcf.pyx", line 4175, in pysam.libcbcf.VariantFile.next

I checked the VCF file. In the record at NC_001493.2:78, the GQ value equals the MAPQ value. In the header, GQ is defined as an Integer while MAPQ is a Float, and this is causing the problem. I changed the GQ datatype from Integer to Float, which solved the problem. I hope this will not affect the results. Please let me know your suggestion.
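
For anyone hitting the same error, a minimal sketch of that header edit (the file name is a placeholder for the merged --sites VCF, assumed to be uncompressed text):

# Rewrite the GQ FORMAT definition from Type=Integer to Type=Float so that
# htslib accepts float-valued GQ entries when the VCF is read back with --sites.
with open("merged_sites.vcf") as src, open("merged_sites.fixed.vcf", "w") as dst:
    for line in src:
        if line.startswith("##FORMAT=<ID=GQ,"):
            line = line.replace("Type=Integer", "Type=Float")
        dst.write(line)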

Reporting only SVs from --sites for genotyping

Hi,

I'm interested in seeing how dysgu performs in genotyping a set of SVs discovered from a consensus of SV callers, but when running with --sites and --all-sites True the output includes new SVs discovered by dysgu. Is there a way to report only the sites from the provided vcf?

Thanks,
Will
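
Until there is a built-in option, one post-hoc approach is to drop every output record whose position is absent from the --sites VCF. A sketch with pysam (file names are placeholders; it assumes the genotyped calls keep the same chromosome and position as the input sites, which may not hold if breakpoints are adjusted):

import pysam

sites = {(r.chrom, r.pos) for r in pysam.VariantFile("consensus_sites.vcf")}
vcf_in = pysam.VariantFile("dysgu_genotyped.vcf")
vcf_out = pysam.VariantFile("sites_only.vcf", "w", header=vcf_in.header)
for rec in vcf_in:
    if (rec.chrom, rec.pos) in sites:
        vcf_out.write(rec)
vcf_out.close()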

Error while trying to clone the repository

Hello, while I was trying to clone your repository (using git clone --recursive https://github.com/kcleal/dysgu.git), I got the following error:

fatal: remote error: upload-pack: not our ref b341a74e355bcf4ff295307ed22d1f4905facb11
fatal: Fetched in submodule path 'dysgu/htslib', but it did not contain b341a74e355bcf4ff295307ed22d1f4905facb11. Direct fetching of that commit failed.
fatal:

TRA variants

Hello,

Is there any option to output translocations according to the VCF specification, meaning two entries per translocation, with the alternate allele written in this fashion: ]chr18:53456042]N?

Best regards,
Jonatan

PacBio CLR reads failing with Nanopore setting

Based on another issue posted here, I am running Dysgu with PacBio CLR reads using the Nanopore model. The specific error that it is outputting is:

Exception ignored in: 'dysgu.assembler.topo_sort2'
ValueError: Graph contains a cycle. Please report this. n=10, w=10, v=0. Node info n was: 10, 0, 0, 32
ValueError: Graph contains a cycle. Please report this. n=22, w=22, v=0. Node info n was: 6, 1, 0, 32

This occurs thousands and thousands of times using both the run and call subcommands.

Any suggestions?

how to get actual inserted/deleted sequence

Hi, given a variant like this one (added some newlines for readability):

chr1    54712   5       .       <INS>   .       PASS
SVMETHOD=DYSGUv1.2.3;SVTYPE=INS;END=54712;CHR2=chr1;GRP=426691;NGRP=1;CT=5to3;CIPOS95=0;CIEND95=0;SVLEN=36;
CONTIGA=ttttttttctttctttctttctttctttctttctttctttcTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTCCTCCTTTTCTTTCCTTTTCTTTCTTTCATTCTTTCTTTCTTTTTTAAGTGGCAGGGTCTCACT;
KIND=extra-regional;GC=28.14;NEXP=28;STRIDE=4;
EXPSEQ=ctttctttctttctttctttctttcTTT;
RPOLY=84;OL=0;SU=4;WR=0;PE=0;SR=0;SC=4;BND=4;LPREC=1;RT=pe 
GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB
1/1:10:34.5:4:0:0:0:4:4:16.5:1:1:3:76:1:7:0.829:1.129:0.935:0.65

What is the actual inserted sequence?
SVLEN says 36, EXPSEQ has length 28, and the lowercase sequence in CONTIGA has length 41.

I understand that in some cases we can't get the full inserted sequence, but for those cases, can we get the left end of the inserted sequence from CONTIGA and the right end from CONTIGB? How?

It would also be nice to be able to get the deleted sequence for DELs:

chr1    10250   2       .       <DEL>   .       PASS    SVMETHOD=DYSGUv1.2.3;SVTYPE=DEL;END=10282;CHR2=chr1;GRP=212728;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=32;CONTIGA=CTAACCCTAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCTCACCCCCACCCCCACCCCCACCCCCACCCCCACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTAacCCCTAACCCTAACCCTAACCCTAACCCTAACC;KIND=extra-regional;GC=56.52;NEXP=0;STRIDE=0;EXPSEQ=;RPOLY=120;OL=0;SU=6;WR=2;PE=1;SR=0;SC=2;BND=1;LPREC=1;RT=pe

In this case, the sequence after the lower-case letters in CONTIGA has length 32. Is that the deleted sequence? Or should I look it up in a FASTA, with CONTIGA being the haplotype carrying the deletion?
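
For the DEL case, a small sketch (not a dysgu feature; the reference path is a placeholder) that pulls the removed bases out of the reference, since POS and END describe the deleted interval:

import pysam

ref = pysam.FastaFile("reference.fa")           # placeholder path
chrom, pos, end = "chr1", 10250, 10282          # POS/END from the DEL record above
# VCF coordinates are 1-based; fetch() is 0-based and half-open, so this returns
# the bases immediately after POS up to and including END.
deleted_seq = ref.fetch(chrom, pos, end)
print(len(deleted_seq), deleted_seq)            # 32 bases for this call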

About PacBio ccs.bam

Dear dysgu team,

Good afternoon.

I used a ccs.bam file as input to the dysgu program, as below:

dysgu call --mode pacbio NK.fasta pb.large.ccs.bam > PbNK.vcf

However, I got an error like this:

"Traceback (most recent call last):
File "/home/hsiang/miniconda3/envs/Python3_9/bin/dysgu", line 5, in
from dysgu.main import cli
File "/home/hsiang/miniconda3/envs/Python3_9/lib/python3.9/site-packages/dysgu/init.py", line 2, in
from dysgu.python_api import DysguSV,
File "/home/hsiang/miniconda3/envs/Python3_9/lib/python3.9/site-packages/dysgu/python_api.py", line 9, in
from dysgu.cluster import pipe1, merge_events
File "dysgu/cluster.pyx", line 1, in init dysgu.cluster
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 232 from C header, got 216 from PyObject"

Should I run pbmm2 align to get the new alignment bam?

Thank you very much

Sincerely yours,

Clarence

performance optimization

This is more of a request to optimize the code to run faster in a production environment, e.g. via multi-threading or GPU acceleration.

small duplication false negative

Why does dysgu not detect this 77 bp duplication? The reads look pretty good. It's a custom targeted NGS library sequenced on Illumina at high coverage. I tried the --keep-small option and it still wasn't called. Is this by design, a false negative, or something else?

[attached screenshot]

Dysgu merge function

Hi
Thank you for your great tools.
I had a question about the merge function.
I want to merge my different VCF files, which contain duplicates, and I would like to understand how this function behaves:
does it merge the duplicates that are shared across my different samples, or does it simply collect all of them at once?
Thanks in advance

Location on CHR2 for translocations

Hello,

Thanks for your tool. I was wondering if there is any way for translocations to get the location on both chromosomes. At the moment I don't see this information in the VCF file (only CHR2). Thank you.

P.S. I think the VCF format would encode these with SVTYPE=BND for each breakend. For my case I don't need the breakend information, but it would be great to get the location.

VCF filtering

The majority of the SV calls are low quality, so it's important to filter the results down to the highest-quality calls. Is there an easy way to filter calls based on the INFO column annotations and the genotype FORMAT keys?

for example:

SU > 10
PE > 5
SC > 10
PROB > 0.95
Filter = PASS

I know this can be done with awk. It would take me many hours to find the right command.
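
In the meantime, a hedged sketch with pysam (file names are placeholders; SU/PE/SC/PROB are read from the first sample's FORMAT fields, matching the dysgu records shown in other issues above, and are assumed to be scalar values):

import pysam

vcf_in = pysam.VariantFile("sample.dysgu.vcf")
vcf_out = pysam.VariantFile("sample.filtered.vcf", "w", header=vcf_in.header)
for rec in vcf_in:
    fmt = rec.samples[0]                 # first (only) sample in the record
    su = fmt.get("SU") or 0
    pe = fmt.get("PE") or 0
    sc = fmt.get("SC") or 0
    prob = fmt.get("PROB") or 0.0
    if "PASS" in rec.filter.keys() and su > 10 and pe > 5 and sc > 10 and prob > 0.95:
        vcf_out.write(rec)
vcf_out.close()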

AttributeError: module 'pysam.libcalignmentfile' has no attribute 'IteratorColumnAll'

Hi,

I wanted to test this out, I was packaging it up for nixpkgs, but I get this error when running the tests:

AttributeError: module 'pysam.libcalignmentfile' has no attribute 'IteratorColumnAll'

What version of pysam do you use? This is the error log:

Traceback (most recent call last):
  File "nix_run_setup", line 8, in <module>
    exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
  File "setup.py", line 136, in <module>
    setup(
  File "/nix/store/sbiym6y0nmyabnh6mz4xzy26l0fhyqy7-python3.8-setuptools-47.3.1/lib/python3.8/site-packages/setuptools/__init__.py", line 161, in setup
    return distutils.core.setup(**attrs)
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/nix/store/sbiym6y0nmyabnh6mz4xzy26l0fhyqy7-python3.8-setuptools-47.3.1/lib/python3.8/site-packages/setuptools/command/test.py", line 238, in run
    self.run_tests()
  File "/nix/store/sbiym6y0nmyabnh6mz4xzy26l0fhyqy7-python3.8-setuptools-47.3.1/lib/python3.8/site-packages/setuptools/command/test.py", line 256, in run_tests
    test = unittest.main(
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/unittest/main.py", line 100, in __init__
    self.parseArgs(argv)
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/unittest/main.py", line 124, in parseArgs
    self._do_discovery(argv[2:])
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/unittest/main.py", line 244, in _do_discovery
    self.createTests(from_discovery=True, Loader=Loader)
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/unittest/main.py", line 154, in createTests
    self.test = loader.discover(self.start, self.pattern, self.top)
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/unittest/loader.py", line 349, in discover
    tests = list(self._find_tests(start_dir, pattern))
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/unittest/loader.py", line 405, in _find_tests
    tests, should_recurse = self._find_test_path(
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/unittest/loader.py", line 483, in _find_test_path
    tests = self.loadTestsFromModule(package, pattern=pattern)
  File "/nix/store/sbiym6y0nmyabnh6mz4xzy26l0fhyqy7-python3.8-setuptools-47.3.1/lib/python3.8/site-packages/setuptools/command/test.py", line 55, in loadTestsFromModule
    tests.append(self.loadTestsFromName(submodule))
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/unittest/loader.py", line 154, in loadTestsFromName
    module = __import__(module_name)
  File "/build/source/dysgu/view.py", line 11, in <module>
    from dysgu import io_funcs, cluster
  File "dysgu/cluster.pyx", line 16, in init dysgu.cluster
    from dysgu import coverage, graph, call_component, assembler, io_funcs, re_map, post_call_metrics
  File "dysgu/graph.pyx", line 1, in init dysgu.graph
    #cython: language_level=3, boundscheck=False, c_string_type=unicode, c_string_encoding=utf8, infer_types=True
AttributeError: module 'pysam.libcalignmentfile' has no attribute 'IteratorColumnAll'

I see this is also an issue mentioned here:

Merging SV from Oxford Nanopore - expected runtime

Hi,
I am trying to merge SVs discovered from Oxford Nanopore data (plant species, 60 samples, 20,000-50,000 SVs/sample). It's been running for 10 hours now. Is that an expected runtime? Would it make sense to merge stepwise? For example, find the most closely related samples, merge those first (say, in groups of 5-10), and then merge the merged files to get the final non-redundant SVs.

minimap2 --MD -t 16 -ax map-ont ../short/Express617_v1.fa /vol/agcpgl/jlee/BreedPath_nanopore/${ID}.fq.gz | samtools sort -o ${ID}.bam
dysgu call -p 8 -v 2 --min-support 5 --mode nanopore ../short/Express617_v1.fa temp_dir.$ID $ID.bam > $ID.vcf
python flt_vcf.py $ID.vcf > $ID.pass.vcf
dysgu merge *pass.vcf > long.vcf
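
If stepwise merging is worth trying, here is a rough sketch (it assumes that dysgu merge will accept previously merged VCFs as input, which I have not verified):

import subprocess
from pathlib import Path

vcfs = sorted(str(p) for p in Path(".").glob("*pass.vcf"))
batch_size = 10
batches = []
for i in range(0, len(vcfs), batch_size):
    out = f"batch_{i // batch_size}.vcf"
    with open(out, "w") as fh:
        # Merge one batch of per-sample VCFs at a time
        subprocess.run(["dysgu", "merge", *vcfs[i:i + batch_size]], stdout=fh, check=True)
    batches.append(out)
with open("long.vcf", "w") as fh:
    # Merge the batch-level results into the final call set
    subprocess.run(["dysgu", "merge", *batches], stdout=fh, check=True)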

Coverage of reads

Hi,
This is more of a query than an issue. I have twenty samples of a DNA virus. I have both Illumina and nanopore reads. I would like to try Dysgu PE mode, LR mode, and hybrid mode. My samples are full genome data and have coverage of more than 1000x. I would like to know how to set the coverage for the input file. I can set the max coverage option to auto. Is this enough? Alternatively, I can use RASUSA (https://github.com/mbhall88/rasusa) to get 20x data then map with NGMLR or minimap2 or I can use samtools mpileup to reduce the coverage. What is the recommended coverage for PE mode, LR mode, and hybrid mode? My virus is a herpes virus. What do you suggest?

Dysgu Cluster

Hi,
Thank you for this tool.
I have a question: can dysgu run on a cluster? I would like to install it on our cluster.
Thank you.

redundant results and insertion variants recording

Dear @kcleal

I have several questions about the output of dysgu; I hope you can take some time to help me. Thank you in advance.

First, I found that the output of dysgu can be redundant; for example, the following deletions could, in my opinion, be merged.

1       1136857 78530   C       <DEL>   .       PASS    SVMETHOD=DYSGUv1.3.5;SVTYPE=DEL;END=1136943;CHR2=1;GRP=2350686;NGRP=1;CT=5to3;CIPOS95=0;CIEND95=0;SVLEN=86;CONTIGB=gcggggtttattctaagaatgattatttccCATAATTCCTGGTCCTGTGTGAGTGCCAGCCACCGTTTCCTCGTGTCCCTCTGGATGGGTCATTCCCTGGCCTCTGGCCTGTGTGCTGACCAGTCctgagcggccct;KIND=intra_regional;GC=55.47;NEXP=0;STRIDE=0;EXPSEQ;RPOLY=0;OL=0;SU=3;WR=0;PE=0;SR=0;SC=3;BND=3;LPREC=0;RT=pe      GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB   0/1:66:60:3:0:0:0:3:3:16.89:0:1:2:54:0:0:0.938:0.789:0.842:0.559
1       1136860 78532   C       <DEL>   .       PASS    SVMETHOD=DYSGUv1.3.5;SVTYPE=DEL;END=1136945;CHR2=1;GRP=1565924;NGRP=1;CT=3to5;CIPOS95=0;CIEND95=0;SVLEN=85;CONTIGA=GTTAATTTGCTTGCAAGAAGTTTGAGCCTTTCTGGTCTCGCTTTTACGATGCATTGAAAGTGAGCCTGGAGCGGGGTTTATTCTAAGAATGATTATTTCCCAtaattcctggtcctgtgtgagtgccagccaccgtttcctcgtgtccctctgg;KIND=intra_regional;GC=46.75;NEXP=0;STRIDE=0;EXPSEQ;RPOLY=0;OL=0;SU=6;WR=0;PE=0;SR=0;SC=6;BND=6;LPREC=0;RT=pe     GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB   0/1:28:60:6:0:0:0:6:6:16.89:0:0:6:76:28:0:0.938:0.789:0.842:0.572

Second, I found that the REF allele is not identical to the reference sequence at that POS.

Finally, for insertions I would expect the END tag to be the same as POS; however, they differ in the output, for example:

1       8058312 59      G       TAGCTAGCTAGCTAGCTAGATCTATAAATAGATAGATAG .       PASS    SVMETHOD=DYSGUv1.3.5;SVTYPE=INS;END=8058322;CHR2=1;GRP=2400;NGRP=1;CT=5to3;CIPOS95=0;CIEND95=0;SVLEN=40;KIND=extra-regional;GC=32.03;NEXP=0;STRIDE=0;EXPSEQ;RPOLY=42;OL=0;SU=13;WR=0;PE=1;SR=3;SC=12;BND=9;LPREC=0;RT=pe        GT:GQ:MAPQP:SU:WR:PE:SR:SC:BND:COV:NEIGH10:PS:MS:RMS:RED:BCC:FCC:ICN:OCN:PROB   0/1:147:60:13:0:1:3:12:9:16.65:0:11:6:0:0:13:0.655:1.526:1:0.627

Sincerely,
Zheng zhuqing

running docker container fails

Dear,

Thanks for this very promising tool.
I fetched the container, but when I run the following commands I get an error; can you please help me?
Other docker images do run on my machine (Ubuntu 20 server).

Thanks

$ sudo docker images
REPOSITORY                    TAG          IMAGE ID       CREATED         SIZE
kcleal/dysgu                  latest       5e3234d8d9f0   3 months ago    1.98GB
$ sudo docker run kcleal/dysgu test        # no output and no error

sudo docker run kcleal/dysgu --help
docker: Error response from daemon: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: exec: "--help": executable file not found in $PATH: unknown.
ERRO[0000] error waiting for container: context canceled 

sudo docker run kcleal/dysgu run --help
docker: Error response from daemon: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: exec: "run": executable file not found in $PATH: unknown.
ERRO[0000] error waiting for container: context canceled 

input vcf cannot be read

Hi,
And thanks for such a useful tool.
I'm exploring polymorphisms including previous data (in addition to new ones) and I'm trying to feed a VCF with --sites to dysgu run (conda env with python 3.9). I get the following error:
File "/opt/miniconda3/envs/dysgu/lib/python3.9/site-packages/dysgu/sites_utils.py", line 70, in vcf_reader
svt = r.info["SVTYPE"]
File "pysam/libcbcf.pyx", line 2581, in pysam.libcbcf.VariantRecordInfo.getitem
KeyError: 'Unknown INFO field: SVTYPE'
The VCF is v4.2 generated with bcftools 1.13
What can I do to fix the problem?
Thanks!

NP
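
A quick diagnostic sketch (the file name is a placeholder): the --sites reader looks up SVTYPE in INFO for every record, so listing the records that lack it shows what needs fixing or filtering before re-running:

import pysam

vcf = pysam.VariantFile("previous_svs.vcf")                   # placeholder path
print("SVTYPE declared in header:", "SVTYPE" in vcf.header.info)
for rec in vcf:
    if "SVTYPE" not in rec.info:
        print("record without SVTYPE:", rec.chrom, rec.pos, rec.id, rec.alts)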

PacBio long reads?

Hello, thank you for creating this great software. I look forward to using this software in my analyses.

However, I have a question: does dysgu support PacBio long reads? I know you mention that it supports PacBio HiFi reads, but I was wondering about PacBio long reads. Thank you in advance, and I apologize if I missed this information.

Installation Note

I was interested in trying Dysgu, and created a fresh conda environment with python 3.10. When I tried to install dysgu within that fresh env, I got the following error:

(dysgu)$ pip install dysgu
ERROR: Could not find a version that satisfies the requirement dysgu (from versions: none)
ERROR: No matching distribution found for dysgu

This seemed odd to me, and may have something to do with the new version released one day earlier.

However, when I dropped my python version down to 3.7 it went through (I did have to install numpy first). This is just for your documentation or others' troubleshooting.

Error running 1.3.10

Hello,

I've been trying to run dysgu v1.3.10 using a conda approach (which is what I've done for previous versions). However, I keep encountering a strange error:

Activating conda environment: <redacted>/CSL_pipeline_benchmark/scripts/.snakemake/conda/a45103408c16f97aed2118b0a7687435
2022-04-20 11:11:43,425 [INFO   ]  [dysgu-run] Version: 1.3.10
2022-04-20 11:11:43,429 [INFO   ]  run -o <redacted>/CSL_pipeline_benchmark/pipeline/structural_variant_calls/hg38_T2T_masked/sentieon_mm2-202112.01/dysgu-1.3.10/HALB3010452.vcf --mode pacbio --pl pacbio --procs 16 --search <redacted>/CSL_pipeline_benchmark/data_files/hg38_contigs.bed.gz <redacted>/reference/hg38_T2T_masked/hg38.fa <redacted>/CSL_pipeline_benchmark/pipeline/structural_variant_calls/hg38_T2T_masked/sentieon_mm2-202112.01/dysgu-1.3.10/HALB3010452_tmp <redacted>/CSL_pipeline_benchmark/pipeline/merged_alignments/hg38_T2T_masked/sentieon_mm2-202112.01/HALB3010452.bam
2022-04-20 11:11:43,430 [INFO   ]  Destination: <redacted>/CSL_pipeline_benchmark/pipeline/structural_variant_calls/hg38_T2T_masked/sentieon_mm2-202112.01/dysgu-1.3.10/HALB3010452_tmp
2022-04-20 11:11:43,430 [INFO   ]  Searching regions from <redacted>/CSL_pipeline_benchmark/data_files/hg38_contigs.bed.gz
Traceback (most recent call last):
  File "<redacted>/CSL_pipeline_benchmark/scripts/.snakemake/conda/a45103408c16f97aed2118b0a7687435/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "<redacted>/CSL_pipeline_benchmark/scripts/.snakemake/conda/a45103408c16f97aed2118b0a7687435/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "<redacted>/CSL_pipeline_benchmark/scripts/.snakemake/conda/a45103408c16f97aed2118b0a7687435/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "<redacted>/CSL_pipeline_benchmark/scripts/.snakemake/conda/a45103408c16f97aed2118b0a7687435/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "<redacted>/CSL_pipeline_benchmark/scripts/.snakemake/conda/a45103408c16f97aed2118b0a7687435/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "<redacted>/CSL_pipeline_benchmark/scripts/.snakemake/conda/a45103408c16f97aed2118b0a7687435/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "<redacted>/CSL_pipeline_benchmark/scripts/.snakemake/conda/a45103408c16f97aed2118b0a7687435/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "<redacted>/CSL_pipeline_benchmark/scripts/.snakemake/conda/a45103408c16f97aed2118b0a7687435/lib/python3.8/site-packages/dysgu/main.py", line 270, in run_pipeline
    max_cov_value = sv2bam.process(ctx.obj)
  File "dysgu/sv2bam.pyx", line 184, in dysgu.sv2bam.process
  File "dysgu/sv2bam.pyx", line 58, in dysgu.sv2bam.parse_search_regions
  File "dysgu/sv2bam.pyx", line 59, in dysgu.sv2bam.parse_search_regions
  File "<redacted>/CSL_pipeline_benchmark/scripts/.snakemake/conda/a45103408c16f97aed2118b0a7687435/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I'm not sure what's causing this; the BAM files have worked for all of my other tests. I am also happy to try the Docker image if you'd prefer that, but I would need it versioned on Docker Hub (I only saw "latest" earlier).

Thanks!
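
One observation, offered as a guess rather than a confirmed diagnosis: byte 0x8b at position 1 matches the gzip magic bytes, and the file passed to --search is hg38_contigs.bed.gz, so parse_search_regions may be reading a gzipped BED as plain text. Decompressing it first (a sketch; the output path is a placeholder) might avoid the UnicodeDecodeError:

import gzip
import shutil

# Write an uncompressed copy of the regions file and pass that to --search instead.
with gzip.open("hg38_contigs.bed.gz", "rb") as src, open("hg38_contigs.bed", "wb") as dst:
    shutil.copyfileobj(src, dst)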
