Comments (35)

kcleal avatar kcleal commented on June 12, 2024 1

Sorry, could you be more specific about which part you mean?

kcleal avatar kcleal commented on June 12, 2024 1

Hi @poddarharsh15,
I recommend duplicate marking rather than removal; it is enough to just mark reads as duplicates using e.g. samblaster or Picard MarkDuplicates.
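For anyone wanting a concrete starting point, a minimal duplicate-marking sketch might look like the following; the file names, thread count, and the choice of samblaster are assumptions, not commands taken from this thread:

    # align paired-end reads, mark (not remove) duplicates in-stream with samblaster,
    # then coordinate-sort and index
    bwa mem -t 8 GRCh37.p13.genome.fa reads_R1.fastq.gz reads_R2.fastq.gz \
        | samblaster \
        | samtools sort -o sample.markdup.bam -
    samtools index sample.markdup.bam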

kcleal avatar kcleal commented on June 12, 2024

Hi @poddarharsh15,

Those are quite low performance metrics, I agree. If you could provide the dysgu log output, I may be able to offer some more insight. I am guessing that the problem is due to the very high coverage (300X). Some potential issues:

You need to make sure the --max-cov parameter is either set to auto, or choose a suitably high number, otherwise much of the genome will be filtered out. Likewise, the --min-support parameter is usually inferred, but this might also be causing issues if it is too high.

The machine learning model that is used was trained on data in the 15 - 40X coverage range, so it is probable that the decision boundary is wrong for high-coverage samples. You can adjust the decision boundary using, for example, dysgu filter --pass-prob 0.8.

Variants can be selected using grep on the dysgu output VCF. The rationale was just to test the performance on different variant types.
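As a minimal sketch of the --pass-prob adjustment (the input/output file names and the redirection to stdout are assumptions):

    # re-apply the classifier threshold to an existing dysgu call set
    dysgu filter --pass-prob 0.8 dysgu.vcf > dysgu.pass_prob_0.8.vcf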

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

dysgu v1.6.2

/home/tigem/h.poddar/miniconda3/envs/dysgu/bin/dysgu run -p4 /home/tigem/h.poddar/short_reads_pipelines/ref_genome/GRCh37/GRCh37.p13.genome.fa temp_dir /home/tigem/h.poddar/short_reads_pipelines/NA12878_HG001_data/sample.bam
This was the basic command I ran on the machine; unfortunately, I am not able to fetch the log files.
Please let me know if this helps.

kcleal avatar kcleal commented on June 12, 2024

Thanks, the --max-cov parameter has not been set, so regions with >200 coverage will be filtered out! This is probably what is causing the poor performance. I recommend setting this to auto.

Also, for benchmarking with truvari, I recommend using the --dup-to-ins option, as many 'insertion' events are reported as DUP by dysgu.
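Roughly, that workflow might look like the sketch below; the paths, output names, and every truvari flag apart from --dup-to-ins are assumptions, and exact options can vary between truvari versions:

    # re-run dysgu with the coverage cutoff inferred automatically
    dysgu run -p4 --max-cov auto GRCh37.p13.genome.fa temp_dir sample.bam > dysgu.vcf
    bgzip dysgu.vcf && tabix -p vcf dysgu.vcf.gz

    # benchmark against the GIAB HG002 Tier 1 truth set, comparing DUP calls as insertions
    truvari bench -b HG002_SVs_Tier1_v0.6.vcf.gz -c dysgu.vcf.gz \
        --includebed HG002_SVs_Tier1_v0.6.bed -o truvari_out --passonly --dup-to-ins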

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Thank you for your previous response. As I was reviewing your paper, I observed that even with other pipelines, you achieved precision close to 90%. Did you employ specific, strict parameters to achieve these benchmarking results?
I tried with --max-cov auto but I am still getting low P and R; could you please look at the results?
dysgu run -p4 --max-cov auto /home/tigem/h.poddar/short_reads_pipelines/ref_genome/GRCh37/GRCh37.p13.genome.fa temp_dir_2 /home/tigem/h.poddar/short_reads_pipelines/NA12878_HG001_data/sample.bam > svs_2.vcf
Screenshot from 2024-01-05 13-30-51

Truvari results....... with HG002_bed region 6.0
log.txt
params.json

Truvari results....... with HG002_bed region 6.1
log.txt
params.json

kcleal avatar kcleal commented on June 12, 2024

Thanks for sharing the output log. The max-cov parameter was inferred to be 138x, which seems quite low - I am not sure what the coverage of your sample is, but max-cov should be 5 - 10 times higher than the median genome coverage. You can turn off this parameter by setting max-cov to -1, but this will slow the runtime. The rest of the output log looks normal to me.

I re-tested an HG002 sample with 20x coverage and analysed it with dysgu 1.4.2 and truvari - the log file is attached. The sample was analysed with default parameters; precision was 0.928 and recall was 0.440.

I'm a bit perplexed at the low P + R, so I would like to help you find out what the issue is.

I'm wondering if the aligner used may be causing an issue, or if supplementary alignments are missing. Dysgu has mostly been tested with bwa mem, and you need to make sure supplementary alignments are kept in the BAM file; if they are missing, this can hurt performance. You can check which reads were analysed by looking at the HG002.dysgu_reads.bam file in the working directory, which is often useful for debugging.
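As a quick sanity check (file names assumed), counting supplementary alignments and estimating depth from the BAM can help confirm the input is suitable:

    # flagstat reports a 'supplementary' count; zero suggests they were stripped upstream
    samtools flagstat sample.bam

    # per-chromosome mean depth, useful for choosing a sensible --max-cov
    samtools coverage sample.bam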

log.txt

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Thank you for your previous clarification. Currently, I'm attempting to run the command with max-cov set to -1. Meanwhile, I've tried to benchmark the results obtained from the Manta pipeline using the Tier1_plus_Tier2 bed files of the HG002 truth set. I managed to achieve P=0.95 and R=0.20 (F1=0.33), but as observed, the recall is notably low.
For your reference, I've attached the log files and a screenshot of the HG002.dysgu_reads.bam file stat. Your insights into debugging this issue would be immensely valuable.

I suspect the issues I'm encountering in benchmarking might be related to the samples I'm using. Specifically, I'm working with HG001 data from one flow cell. I downloaded all the fastq files and aligned them to the GRCh37 genome using the bwa mem aligner. To provide better context, I've included the link to the samples I'm referring to (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/).

Screenshot from 2024-01-05 15-07-57
log.txt
params.json

Please let me know; I appreciate your assistance.

kcleal avatar kcleal commented on June 12, 2024

Thanks for the link. Could you detail which fastq files you downloaded? I can try to repeat the analysis at my end. Alternatively, if you could share a few megabases from the original mapped HG002 file, that might be enough to get an insight into potential issues.
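For sharing a small slice of a mapped BAM, something along these lines works; the region and file names here are only placeholders:

    # extract a few megabases around a region of interest and index the subset
    samtools view -b sample.bam 1:10000000-15000000 > subset.bam
    samtools index subset.bam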

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

First I downloaded all of them (all six samples) and then merged them using 'cat'. For the final result I had one forward and one reverse read file (paired-end).
Please find the reference image below:
Screenshot from 2024-01-05 15-51-42

kcleal avatar kcleal commented on June 12, 2024

Hi @poddarharsh15,
Sorry if I'm missing the point, but shouldn't you be using the HG002 sample (AshkenazimTrio/) for benchmarking? It looks like you are using NA12878/HG001, which is a different sample.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Yes, I am using HG001 data for the benchmark analysis. Isn't it possible to benchmark different data using the HG002_SVs_Tier1_v0.6.vcf truth set files?
Am I making a mistake?

kcleal avatar kcleal commented on June 12, 2024

You need to use the HG002 sample to do the benchmarking, otherwise the results will be unreliable. NA12878 has its own set of germline variants that are not in the HG002 truth set and vice-versa. NA12878 variants that are not in the benchmark HG002 truth set will show up as false positives giving low precision. Nevertheless NA12878 and HG002 share quite a few of the same SVs, which explains the recall rates you are seeing.
It's interesting how manta shows high precision even though the wrong sample is being analysed, which implies that manta mostly detects variants that are common to both NA12878 and HG002.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

I have a hypothetical question regarding benchmarking real data generated by my university for the analysis of patients. After working on these datasets for two months, I assumed I could use the HG002_truth files to benchmark any example data to understand the presence of structural variation. However, I'm seeking guidance on whether this assumption is correct or if there are specific considerations for benchmarking real patient data.
I apologize for what might seem like general or "silly" questions.
Thank you for your time and assistance.

kcleal avatar kcleal commented on June 12, 2024

No problem.
For proper benchmarking your 'truth set' must match your sample, otherwise real variants that are present in your sample but not in the 'truth set' will show up as false positives. Also, variants in the 'truth set' that are absent from your sample will make it look like your variant caller is missing calls (giving low recall), when in reality it is because your sample doesn't have the same variants.
The way you could use the HG002 truth set to test other samples is to assess only the recall. Most samples will share some variants also found in the HG002 truth set. For example, callerA might identify 30% of calls from the truth set, while callerB might identify 20%. These rates could be used as a proxy for the actual recall rates, although the real recall rates cannot be known without a better truth set.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Hello @kcleal,

Thank you for your previous clarification. I have another question regarding benchmarking. Since there are no other structural-variant truth set files apart from GIAB HG002 v0.6, and to get more reliable results than with HG001 data, should I use only HG002 BAM files to produce VCFs and benchmark those? This approach might help identify which pipelines perform best for analysing structural variants on real data.
Thank you.

kcleal avatar kcleal commented on June 12, 2024

Yes @poddarharsh15, you are correct. Another benchmark to look at could be the syndip benchmark; it has a high level of completeness for DEL and INS variants.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Thanks. I will review the syndip benchmark data. I also had a question about using grep to separate SVs from VCF files. Did you perform separate benchmarking for these SV types (DEL, INS, DUP, TRA) to obtain more accurate results for precision (P), recall (R), and F1 score?
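For reference, splitting a call set by SV type before benchmarking each class separately can be done with plain grep along these lines (a sketch; the file names are assumptions):

    # write one VCF per SV type, keeping the header each time
    for t in DEL INS DUP TRA; do
        (grep '^#' dysgu.vcf; grep -v '^#' dysgu.vcf | grep "SVTYPE=${t}") > dysgu_${t}.vcf
    done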

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Hi @kcleal I have attempted once again to benchmark using HG002 truth sets, and I would greatly appreciate it if you could review the results. I have attached the log files, and I used the following parameters for reference. Also I have tried to benchmark DEL and INS separately.
Thank you for your time and assistance.

from_dysgu:-
Screenshot from 2024-01-17 11-33-55
All_svs.txt
DEL_log.txt
INS_log.txt

kcleal avatar kcleal commented on June 12, 2024

Hi @poddarharsh15,
I think these look like better performance metrics. The max coverage was inferred to be only 90X - is this a low coverage sample? You might get slightly higher recall using the default options.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Yes, unfortunately at the moment I am using fewer flow cells to check the pipelines, which is why I have low coverage. Thanks for the insight, I will try again.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Hello,
I had a specific query regarding your research article. In the article, it was mentioned that DUP calls were converted (annotated) to insertions using specific tools to achieve better precision (P) and recall (R). Could you please provide guidance on the tools or methods you employed for this conversion?
Thank you for your assistance.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

We utilized svbench to assess performance of dysgu compared to other SV callers. For benchmarking calls against the HG002 benchmark (34), we filtered query calls by a minimum size of 30 bp (whole genome benchmark), or 50 bp (Tier 1 benchmark). We utilized a reference interval size of 1000 bp, and a percent size similarity threshold of 15%. Deletion and insertion calls were analysed separately, filtering both query and reference calls by svtype before comparison. Additionally, only query calls on the 'normal' chromosomes were analysed. To match the definition of the GIAB benchmark, we converted DUP calls <500 bp to insertions.
This is the statement from your article; I was curious to know which tool you used to convert these DUP calls to insertions?
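For illustration only (this is not necessarily the method used in the paper), a small relabelling pass over the VCF could be sketched as below; the awk one-liner and file names are assumptions, and truvari's --dup-to-ins option achieves a similar effect at benchmark time:

    # relabel symbolic DUP records shorter than 500 bp as INS
    zcat dysgu.vcf.gz | awk 'BEGIN{FS=OFS="\t"}
      /^#/ {print; next}
      {
        len = ""
        if (match($8, /SVLEN=-?[0-9]+/)) len = substr($8, RSTART+6, RLENGTH-6)
        if ($8 ~ /SVTYPE=DUP/ && len != "" && (len < 0 ? -len : len) < 500) {
          gsub(/SVTYPE=DUP/, "SVTYPE=INS", $8)
          gsub(/<DUP>/, "<INS>", $5)
        }
        print
      }' > dysgu_dup2ins.vcf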

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

I've encountered challenges with dysgu when using minimap2 to align my data to the reference genome. Interestingly, when I use the same aligner with other SV callers (manta, delly, lumpy), I observe better Precision-Recall Curve (PRC) results.

Could you please review these results and provide insights into the observed differences? Your expertise on this matter would be greatly appreciated.

dysgu command line used :-

  1. minimap2
    Screenshot from 2024-02-07 14-33-46

Screenshot from 2024-02-07 14-33-40

SVbench results:-

Screenshot from 2024-02-07 14-37-26

kcleal avatar kcleal commented on June 12, 2024

Hi @poddarharsh15,
Do you mean you aligned paired-end reads to the genome using minimap2? The performance metrics look very bad indeed. I have not tested paired-end reads with minimap2 before, so I'm not sure what could be causing the poor performance.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Hi @kcleal, yes, I used paired-end reads. Interestingly, I've used the same combination of minimap2 and dysgu with different flow cells (HG002) in the past, and the performance metrics were notably good. Specifically, this combination yielded the highest F1 score. Thank you for the fast response.

kcleal avatar kcleal commented on June 12, 2024

Do you know if that was using a different version of dysgu? It's possible a bug has been introduced that only affects minimap2 paired-end data. If you could screenshot a couple of examples of the data (the bam file showing an SV call, either the dysgu_reads.bam in the temp directory, or the main bam file) using GW or IGV, that might give me some ideas. I noticed that the number of calls seems excessively high (>1 million), which indicates a problem.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

The versions were the same. I am running samtools flagstat on all the BAMs I have generated with the three aligners and will send you screenshots ASAP.

  1. bwa mem
    Screenshot from 2024-02-07 15-09-36

  2. dragen
    Screenshot from 2024-02-07 15-09-46

  3. minimap2

Screenshot from 2024-02-07 15-09-53

kcleal avatar kcleal commented on June 12, 2024

Thanks, they are quite useful. For bwa mem you are using duplicate marking, which will probably make a small difference. The main difference I can see is that for minimap2 only 91.02% of reads are properly paired, which is lower than expected; the others are >97%. This suggests a large number of reads are being flagged as discordant, which would probably lead to the high call count from dysgu. Dysgu relies on the proper-pair flag being roughly correct to filter out normal-looking read pairs. Perhaps for minimap2 the insert size is not being inferred correctly. I know in bwa mem the -I option can be used to define the insert size.
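As a quick sketch of how one might check and work around this (the file names and thread count are assumptions): confirm the proper-pair rate, and make sure minimap2 is run in its paired short-read mode, which estimates the insert-size distribution used to set the proper-pair flag.

    # proper-pair rate reported by flagstat
    samtools flagstat minimap2.bam | grep 'properly paired'

    # re-align paired-end short reads with the sr preset, then sort and index
    minimap2 -ax sr -t 8 GRCh37.p13.genome.fa reads_R1.fastq.gz reads_R2.fastq.gz \
        | samtools sort -o minimap2.sorted.bam -
    samtools index minimap2.sorted.bam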

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Thanks for the help. I will try changing some parameters and see if anything changes; otherwise I will stick with the other aligners.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Hi, good afternoon. I am using svbench for benchmarking my results; could you please suggest the best parameters for getting better results? For now I tried the command line below and I am getting good precision and recall. However, I am concerned about the duplication percentage and am not sure whether I should use the --no-duplicates parameter or not. Thanks again.
svbench --include HG002_SVs.bed --pass-ref HG002_SVs.vcf.gz --pass-query diploidSV.vcf.gz --min-size-ref 50 --slop 1000 --pass-only --min-size-query 50 --no-duplicates

Example screenshot: the first result is with --no-duplicates and the second is without the --no-duplicates parameter.

Screenshot from 2024-02-09 12-48-03

kcleal avatar kcleal commented on June 12, 2024

Hi @poddarharsh15,
Duplicates occur when there is a single true positive in the benchmark set, but in the dysgu output there are multiple calls that match that single true positive. It's up to you if you want to quantify duplicates or not, but I usually find it useful as it can help spot different sources of error. Also, a duplicate will still be in the correct genomic locus, so it is more useful than a genuine false positive. Hope that helps.

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Thanks again, I appreciate your help!

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

Hello again, I have a question about preprocessing BAM files before variant calling. Do you recommend removing duplicates using samtools before running structural variation (SV) callers?

poddarharsh15 avatar poddarharsh15 commented on June 12, 2024

okay, great thanks.
