Git Product home page Git Product logo

clairs's Introduction

ClairS

ClairS - a deep-learning method for long-read somatic small variant calling

License

Contact: Ruibang Luo, Zhenxian Zheng
Email: [email protected], [email protected]


Introduction

ClairS is a somatic variant caller designed for paired samples and primarily ONT long-read. It uses Clair3 to eliminate germline variants. It ensembles the pileup and full-alignment models in Clair3, trusts them equally, and decides on the result using a set of rules and post-processing filters. With 50-fold HCC1395 (tumor) and 25-fold HCC1395BL (normal) of ONT R10.4.1 data, benchmarking against the truth SNVs (Fang et al., 2021), ClairS achieved 86.86%/93.01% recall/precision rate for SNVs when targeting VAF ≥0.05. For variants with VAF ≥0.2, the numbers go up to 94.65%/96.63%. Detailed performance figures are shown below.

ClairS means Clair-Somatic, or the masculine plural of "Clair" in french (thus, 's' is silent).

The logo of ClairS was generated using DALL-E 2 with prompt "A DNA sequence with genetic variant that looks like a letter 'S'".

A preprint describing ClairS's algorithms and results is at bioRxiv.

For germline variant calling, please try Clair3.

For somatic variant calling using tumor only sample, please try ClairS-TO.


Performance figures

ONT Q20+ chemistry SNV performance

  • HCC1395/HCC1395BL tumor/normal of ONT R10.4.1 data
  • Truth:High confidence (HighConf) and medium confidence (MedConf) SNV from the SEQC2 HCC1395/BL truths (Fang et al., 2021), the TVAF (tumor variant allele frequency) of which is ≥0.05 in the above dataset

The precision-recall curve of different combinations of tumor and normal coverages

The performance of ClairS at multiple VAF ranges and multiple tumor coverages with the normal coverage fixed at 25x

The performance of ClairS at multiple VAF ranges and multiple normal coverages with the tumor coverage fixed at 50x

PacBio Revio SNV performance

  • HCC1395/HCC1395BL tumor/normal of PacBio Revio data, using SMRTbell prep kit 3.0
  • Truth:High confidence (HighConf) and medium confidence (MedConf) SNV from the SEQC2 HCC1395/BL truths (Fang et al., 2021), the TVAF (tumor variant allele frequency) of which is ≥0.05 in the above dataset

The performance of ClairS at multiple VAF ranges and multiple tumor coverages with the normal coverage fixed at 25x

The performance of ClairS at multiple VAF ranges and multiple normal coverages with the tumor coverage fixed at 50x

Illumina SNV performance

  • HCC1395/HCC1395BL tumor/normal of of Illumina NovaSeq 6000 and HiSeq 4000 data
  • Truth:High confidence (HighConf) and medium confidence (MedConf) SNV from the SEQC2 HCC1395/BL truths (Fang et al., 2021), the TVAF (tumor variant allele frequency) of which is ≥0.05 in the above dataset

The precision-recall curve of different tumor/normal purity combinations with tumor coverage fixed at 50x and normal coverage fixed at 25x


Contents


Latest Updates

v0.3.0 (Jul 5) : 1. Added a module called “verdict” (Option --enable_verdict) to statistically classify a called variant into either a germline, somatic, or subclonal somatic variant based on the CNV profile and tumor purity estimation. Please find out more technical details about the Verdict module here. 2. Improved model training speed, reduced model training time cost by about three times.

v0.2.0 (Apr 29) : 1. Added --use_heterozygous_snp_in_normal_sample_for_intermediate_phasing/--use_heterozygous_snp_in_tumor_sample_for_intermediate_phasing option to support using either heterozygous SNPs in the normal sample or tumor sample for intermediate phasing. The previous versions used in_tumor_sample for phasing. In this new version, when testing with ONT 4kkz HCC1395/BL and using in_normal_sample for intermediate phasing, the SNV precision improved ~2%, while recall remained unchanged. in_normal_sample becomes the default from this version. However, if the coverage of normal sample is low, please consider switching back to using in_tumor_sample (#22, idea contributed by the longphase team @sloth-eat-pudding). 2. Added --use_heterozygous_indel_for_intermediate_phasing to include high quality heterozygous Indels for intermediate phasing. With this new option, the haplotagged tumor reads increased by ~3% in ONT 4khz HCC1395/BL, the option becomes default from this version. 3. Added a model that might provide a slightly better performance for liquid tumor. In this release, only ONT Dorado 5khz HAC for liquid tumor (-p ont_r10_dorado_hac_5khz_liquid) is provided. The model was trained with slightly higher normal contamination. We are testing out the new model with collaborator. 4. Added --use_longphase_for_intermediate_haplotagging option to replace WhatsHap haplotagging by LongPhase haplotagging to speed up read haplotagging process, the option becomes default from this version. 5. Bumped up Clair3 dependency to version 1.0.7, LongPhase to version 1.7.

v0.1.7 (Jan 25, 2024) : 1. Added ONT Dorado 5khz HAC (-p ont_r10_dorado_hac_5khz) and Dorado 4khz HAC (-p ont_r10_dorado_hac_4khz) model, renamed all ONT Dorado SUP model, check here for more details. 2. Enabled somatic variant calling in sex chromosomes. 3. Added FAU, FCU, FGU, FTU, RAU, RCU, RGU, and RTU tags.

v0.1.6 (Sep 18) : 1. Fixed an output bug that caused no VCF output if no Indel candidate was found (contributor @Khi Pin). 2. Fixed showing incorrect reference allele depth at a deletion region. 3. Added PacBio HiFi quick demo.

v0.1.5 (Aug 2) : 1. Updated SNV calling using ONT Dorado 4kHz data with a new model trained using multiple-sample pairs (HG003/4); 2. Updated SNV calling using ONT Dorado 5kHz data with a new model trained using multiple-sample pairs (HG001/HG002, HG003/4); 3. Support somatic indel calling using ONT Dorado 4kHz data. 4. Support somatic indel calling using ONT Dorado 5kHz data.

v0.1.4 (Jul 15) : 1. Added reference depth in AD tag. 2. Added HiFi Sequel II Indel model.

v0.1.3 (Jul 5) : Added ONT Dorado 4khz (-p ont_r10_dorado_4khz) and 5khz (-p ont_r10_dorado_5khz) models, check here for more details. Renamed platform options ont_r10 to ont_r10_guppy and ont_r9 to ont_r9_guppy.

v0.1.2 (May 17) : Added HiFi Revio model, renamed HiFi Sequel II model from hifi to hifi_sequel2.

v0.1.1 (Apr 30) : 1. Added the "command line used" to VCF header. 2. Added NAU, NCU, NGU, and NTU tags (#reads supporting the four bases in normal) to the output. 3. Hybrid calling mode now outputs three VCFs, ClairS somatic variant calls, Clair3 normal germline variant calls, and Clair3 tumor germline variant calls. 4. Added the --enable_clair3_germline_output option to also output Clair3 normal germline variant calls, and Clair3 tumor germline variant calls (even when hybrid calling more is not enabled). Running time will increase by ~40%.

v0.1.0 (Mar 24) : 1. Added support for Indel calling. ClairS Indel calling currently only supports ONT R10 data. To enable, use the --enable_indel_calling option. The Indel F1-score is ~73% with 50x/50x HCC1395/BL data. 2. Added an experimental --normal_vcf_fn to skip germline variant calling on normal BAM (#7, contributor @Xingyao). 3. Added --hybrid_mode_vcf_fn option to enable hybrid calling mode that combines de novo calling results and genotyping results without running the tool twice. Renamed the --vcf_fn to --genotyping_mode_vcf_fn for clarification. 4. Fixed a memory issue, memory consumption is now sub 100G for high coverage samples. 5. Fixed a conda environment issue in Singularity (#3). 6. Fixed zero division when no SNV was found (#2, #5). 7. Added AD tag in the output.

v0.0.1 (Jan 29, 2023): Initial release for early access.


Quick Demo

Quick start

After following installation, you can run ClairS with one command:

./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10_guppy
## Final output file: output/output.vcf.gz

Check Usage for more options.


Pre-trained Models

ClairS trained both pileup and full-alignment models using GIAB samples, and carry on benchmarking on HCC1395-HCC1395BL pair dataset. All models were trained with chr20 excluded (including only chr1-19, 21, 22).

Platform Model name Chemistry /Instruments Basecaller Option (-p/--platform) Reference Aligner
ONT r1041_e82_400bps_sup_v420 R10.4.1, 5khz Dorado SUP ont_r10_dorado_sup_5khz GRCh38_no_alt Minimap2
ONT r1041_e82_400bps_sup_v410 R10.4.1, 4khz Dorado SUP ont_r10_dorado_sup_4khz GRCh38_no_alt Minimap2
ONT r1041_e82_400bps_hac_v420 R10.4.1, 5khz Dorado HAC ont_r10_dorado_hac_5khz GRCh38_no_alt Minimap2
ONT r1041_e82_400bps_hac_v410 R10.4.1, 4khz Dorado HAC ont_r10_dorado_hac_4khz GRCh38_no_alt Minimap2
ONT r104_e81_sup_g5015 R10.4/R10.4.1, 4khz Guppy5 SUP ont_r10_guppy GRCh38_no_alt Minimap2
ONT 1 r941_prom_sup_g5014 R9.4.1, 4khz Guppy5 SUP ont_r9_guppy GRCh38_no_alt Minimap2
Illumina ilmn NovaSeq/HiseqX - ilmn GRCh38 BWA-MEM
PacBio HiFi 2 hifi_sequel2 Sequel II with Chemistry 2.0 - hifi_sequel2 GRCh38_no_alt Minimap2
PacBio HIFI hifi_revio Revio with SMRTbell prep kit 3.0 - hifi_revio GRCh38_no_alt Minimap2

Caveats 1: Although the r9(r941_prom_sup_g5014) model was trained on synthetic samples with r9.4.1 real data, the minimal AF cutoff, minimal coverage, and post-calling filtering parameters for the r9 model are copied from the r10 model, and are not optimized due to lack of real r9 data on a cancer sample with known truths.

Caveats 2: The PacBio HiFi Sequel II model is experimental. It was trained but not tested with any real data with known truths. HG003 54x and HG004 52x were used, thus tumor depth coverage higher than 50x may suffer from lower recall rate. For testing, please downsample both tumor and normal to ~40x for the best performance of this experimental model.


Installation

Option 1. Docker pre-built image

A pre-built docker image is available at DockerHub.

Caution: Absolute path is needed for both INPUT_DIR and OUTPUT_DIR in docker.

docker run -it \
  -v ${INPUT_DIR}:${INPUT_DIR} \
  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \
  hkubal/clairs:latest \
  /opt/bin/run_clairs \
  --tumor_bam_fn ${INPUT_DIR}/tumor.bam \      ## use your tumor bam file name here
  --normal_bam_fn ${INPUT_DIR}/normal.bam \    ## use your normal bam file name here
  --ref_fn ${INPUT_DIR}/ref.fa \               ## use your reference file name here
  --threads ${THREADS} \                       ## maximum threads to be used
  --platform ${PLATFORM} \                     ## options: {ont_r10_dorado_sup_4khz, ont_r10_dorado_sup_5khz, ont_r10_guppy, ont_r9_guppy, ilmn, hifi_sequel2, hifi_revio, etc}
  --output_dir ${OUTPUT_DIR}                   ## output path prefix 

Check Usage for more options.

Option 2. Singularity

Caution: Absolute path is needed for both INPUT_DIR and OUTPUT_DIR in singularity.

INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. /home/user1/input (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. /home/user1/output (absolute path needed)
mkdir -p ${OUTPUT_DIR}

conda config --add channels defaults
conda create -n singularity-env -c conda-forge singularity -y
conda activate singularity-env

# singularity pull docker pre-built image
singularity pull docker://hkubal/clairs:latest

# run the sandbox like this afterward
singularity exec \
  -B ${INPUT_DIR},${OUTPUT_DIR} \
  clairs_latest.sif \
  hkubal/clairs:latest \
  /opt/bin/run_clairs \
  --tumor_bam_fn ${INPUT_DIR}/tumor.bam \      ## use your tumor bam file name here
  --normal_bam_fn ${INPUT_DIR}/normal.bam \    ## use your normal bam file name here
  --ref_fn ${INPUT_DIR}/ref.fa \               ## use your reference file name here
  --threads ${THREADS} \                       ## maximum threads to be used
  --platform ${PLATFORM} \                     ## options: {ont_r10_dorado_sup_4khz, ont_r10_dorado_sup_5khz, ont_r10_guppy, ont_r9_guppy, ilmn, hifi_sequel2, hifi_revio, etc}
  --output_dir ${OUTPUT_DIR} \                 ## output path prefix
  --conda_prefix /opt/conda/envs/clairs

Option 3. Build an anaconda virtual environment

Check here to install the tools step by step.

Anaconda install:

Please install anaconda using the official guide or using the commands below:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x ./Miniconda3-latest-Linux-x86_64.sh 
./Miniconda3-latest-Linux-x86_64.sh

Install ClairS using anaconda step by step:

# create and activate an environment named clairs
# install pypy and packages in the environemnt
conda create -n clairs -c bioconda -c pytorch -c conda-forge pytorch tqdm clair3-illumina python=3.9.0 -y
source activate clairs

git clone https://github.com/HKU-BAL/ClairS.git
cd ClairS

# make sure in conda environment
# download pre-trained models
echo ${CONDA_PREFIX}
mkdir -p ${CONDA_PREFIX}/bin/clairs_models
wget http://www.bio8.cs.hku.hk/clairs/models/clairs_models.tar.gz
tar -zxvf clairs_models.tar.gz -C ${CONDA_PREFIX}/bin/clairs_models/

./run_clairs --help

Option 4. Docker Dockerfile

This is the same as option 1 except that you are building a docker image yourself. Please refer to option 1 for usage.

git clone https://github.com/HKU-BAL/ClairS.git
cd ClairS

# build a docker image named hkubal/clairs:latest
# might require docker authentication to build docker image
docker build -f ./Dockerfile -t hkubal/clairs:latest .

# run the docker image like option 1
docker run -it hkubal/clairs:latest /opt/bin/run_clairs --help

Usage

General Usage

./run_clairs \
  --tumor_bam_fn ${INPUT_DIR}/tumor.bam \    ## use your tumor bam file name here
  --normal_bam_fn ${INPUT_DIR}/normal.bam \  ## use your bam file name here
  --ref_fn ${INPUT_DIR}/ref.fa \             ## use your reference file name here
  --threads ${THREADS} \                     ## maximum threads to be used
  --platform ${PLATFORM} \                   ## options: {ont_r10_dorado_sup_4khz, ont_r10_dorado_sup_5khz, ont_r10_guppy, ont_r9_guppy, ilmn, hifi_sequel2, hifi_revio, etc}
  --output_dir ${OUTPUT_DIR}                 ## output path prefix
 
## Final output file: ${OUTPUT_DIR}/output.vcf.gz

Options

Required parameters:

  -T, --tumor_bam_fn TUMOR_BAM_FN   Tumor BAM file input. The input file must be samtools indexed.
  -N, --normal_bam_fn NORMAL_BAM_FN Normal BAM file input. The input file must be samtools indexed.
  -R, --ref_fn FASTA                Reference file input. The input file must be samtools indexed.
  -o, --output_dir OUTPUT_DIR       VCF output directory.
  -t, --threads THREADS             Max #threads to be used.
  -p, --platform PLATFORM           Select the sequencing platform of the input. Possible options {ont_r10_dorado_sup_4khz, ont_r10_dorado_sup_5khz, ont_r10_dorado_hac_5khz, ont_r10_dorado_hac_4khz, ont_r10_guppy, ont_r9_guppy, ilmn, hifi_sequel2, hifi_revio}.

Miscellaneous parameters:

  -P PILEUP_MODEL_PATH, --pileup_model_path PILEUP_MODEL_PATH
                        Specify the path to your own somatic calling pileup model.
  -F FULL_ALIGNMENT_MODEL_PATH, --full_alignment_model_path FULL_ALIGNMENT_MODEL_PATH
                        Specify the path to your own somatic calling full-alignment model.
  -c CTG_NAME, --ctg_name CTG_NAME
                        The name of the contigs to be processed. Split by ',' for multiple contigs. Default: all contigs will be processed.
  -r REGION, --region REGION
                        A region to be processed. Format: `ctg_name:start-end` (start is 1-based).
  -b BED_FN, --bed_fn BED_FN
                        Path to a BED file. Call variants only in the provided BED regions.
  -G GENOTYPING_MODE_VCF_FN, --genotyping_mode_vcf_fn GENOTYPING_MODE_VCF_FN
                        VCF file input containing candidate sites to be genotyped. Variants will only be called at the sites in the VCF file if provided.
  -H HYBRID_MODE_VCF_FN, --hybrid_mode_vcf_fn HYBRID_MODE_VCF_FN  
                        Enable hybrid calling mode that combines the de novo calling results and genotyping results at the positions in the VCF file given.
  -q QUAL, --qual QUAL  If set, variants with >QUAL will be marked as PASS, or LowQual otherwise.
  --snv_min_af SNV_MIN_AF
                        Minimal SNV AF required for a variant to be called. Decrease SNV_MIN_AF might increase a bit of sensitivity, but in trade of precision, speed and accuracy. Default: 0.05.
  --min_coverage MIN_COVERAGE
                        Minimal coverage required for a variant to be called. Default: 4.
  --chunk_size CHUNK_SIZE
                        The size of each chuck for parallel processing. Default: 5000000.
  -s SAMPLE_NAME, --sample_name SAMPLE_NAME
                        Define the sample name to be shown in the VCF file. Default: SAMPLE.
  --output_prefix OUTPUT_PREFIX
                        Prefix for output VCF filename. Default: output.
  --remove_intermediate_dir
                        Remove intermediate directory before finishing to save disk space.
  --include_all_ctgs    Call variants on all contigs, otherwise call in chr{1..22} and {1..22}.
  --print_ref_calls     Show reference calls (0/0) in VCF file.
  --print_germline_calls
                        Show germline calls in VCF file.
  -d, --dry_run         Print the commands that will be ran.
  --python PYTHON       Absolute path of python, python3 >= 3.9 is required.
  --pypy PYPY           Absolute path of pypy3, pypy3 >= 3.6 is required.
  --samtools SAMTOOLS   Absolute path of samtools, samtools version >= 1.10 is required.
  --parallel PARALLEL   Absolute path of parallel, parallel >= 20191122 is required.
  --disable_phasing     EXPERIMENTAL: Disable phasing with longphase or whatshap. Usually leads to significant performance loss.
  --normal_vcf_fn NORMAL_VCF_FN
                        EXPERIMENTAL: Path to normal VCF file. Setting this will skip germline varaint calling on normal BAM file input.
  --enable_indel_calling
                        EXPERIMENTAL: Enable Indel calling, 'ont_r9_guppy' and 'ilmn' platforms are not supported. The calling time would increase significantly. default: disabled.
  --enable_clair3_germline_output
                        EXPERIMENTAL: Use Clair3 default calling settings than Clair3 fast calling setting for tumor and normal germline varaint calling. The calling time would increase ~40 percent, Default: disabled.
  --enable_verdict      EXPERIMENTAL: Use Verdict to tag the germline variant in CNV regions. We suggest using the parameter only for sample with tumor purity lower than 0.8, Default: disabled
  --use_heterozygous_snp_in_normal_sample_for_intermediate_phasing USE_HETEROZYGOUS_SNP_IN_NORMAL_SAMPLE_FOR_INTERMEDIATE_PHASING
                        EXPERIMENTAL: Use the heterozygous SNPs in normal VCF called by Clair3 for intermediate phasing. Option: {True, False}. Default: True.
  --use_heterozygous_snp_in_tumor_sample_for_intermediate_phasing USE_HETEROZYGOUS_SNP_IN_TUMOR_SAMPLE_FOR_INTERMEDIATE_PHASING
                        EXPERIMENTAL: Use the heterozygous SNPs in tumor VCF called by Clair3 for intermediate phasing. Option: {True, False}. Default: False.
  --use_heterozygous_indel_for_intermediate_phasing USE_HETEROZYGOUS_INDEL_FOR_INTERMEDIATE_PHASING
                        EXPERIMENTAL: Use the heterozygous Indels in normal and tumor VCFs called by Clair3 for intermediate phasing. Option: {True, False}. Default: True.
    --use_longphase_for_intermediate_haplotagging USE_LONGPHASE_FOR_INTERMEDIATE_HAPLOTAGGING
                        EXPERIMENTAL: Use the longphase instead of whatshap for intermediate haplotagging. Option: {True, False}. Default: True.
  --indel_output_prefix INDEL_OUTPUT_PREFIX
                        Prefix for Indel output VCF filename. Default: indel.
  --indel_pileup_model_path INDEL_PILEUP_MODEL_PATH
                        Specify the path to your own somatic calling indel pileup model.
  --indel_full_alignment_model_path INDEL_FULL_ALIGNMENT_MODEL_PATH
                        Specify the path to your own somatic calling indel full-alignment model.

Call SNVs in one or mutiple chromosomes using the -C/--ctg_name parameter

./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10_guppy -C chr21,chr22

Call SNVs in one specific region using the -r/--region parameter

./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10_guppy -r chr20:1000000-2000000

Call SNVs at interested variant sites (genotyping) using the -G/--genotyping_mode_vcf_fn parameter

./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10_guppy -G input.vcf

Call SNVs in the BED regions using the -B/--bed_fn parameter

We highly recommended using BED file to define multiple regions of interest like:

echo -e "${CTG1}\t${START_POS_1}\t${END_POS_1}" > input.bed
echo -e "${CTG2}\t${START_POS_2}\t${END_POS_2}" >> input.bed
...

Then:

./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10_guppy -B input.bed

Disclaimer

NOTE: the content of this research code repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.

clairs's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

clairs's Issues

Germline variants present in output.vcf

According to the ClairS documentation in the README, I expect the germline variants to be eliminated in the somatic vcf, and this is consistent with the demo outputs. However, running ClairS on my own data, I see overlaps between the clair3 germline callsets (both normal and tumor) and the final somatic callset. Is this behavior expected? If so, can you provide guidance on how to handle the overlapping calls?

Nondeterminism in ClairS output

This has already been resolved by I am sharing the email thread with @zhengzhenxian about this issue here.

ClairS returns slightly different results depending on the order in which bam files are being merged with samtools merge. While samtools merge does return merged bams with dissimilar ordering, I do not think it should affect variant calling. Could you help me understand a). how these functionally identical bam files are causing differences in ClairS output, and b). how significant are these differences?

Reply: Thank you for providing us with the data. Based on the data you provided, we believe that the issue is due to some of the merged BAMs sharing the same read names(read name and count are shown in the figure). We used "samtools mpileup" to parse all input alignments, and samtools gave different pileup results for different merged ordering, resulting in different variant calling results in Clair3 and ClairS.

Unfortunately, we are unable to fix the issue as it falls under the logic of samtools. However, we have two suggestions that may be helpful. The first is to add a unique prefix for each BAM that needs to be merged. The second is to sort the BAM file names in a specific order, such as alphabetical order, before merging. In our testing, if all read names are unique, the calling gave the same result in multiple runs.

Based on your benchmarking data, nearly 50% of the read names have more than one read support, which has contributed to 34 different calls. We suggest implementing one of the suggestions above to ensure that the merged BAMs have unique read names and therefore produce consistent variant calling results.

Please let us know if you have any further questions or concerns.

Thanks for your quick response to my issue. My team and I are taking your advice to try to force deterministic ClairS outputs in our pipeline. Taking your first suggestion, my team has filtered out non-unique reads and verified with BamUtil diff that two separate bam files are functionally identical -- multiple reads mapped to the same position are still ordered differently. However, calling ClairS on these two bam files again returned different results. Also, I was not able to detect any differences between the pileup files from either the output temp files or my own mpileup outputs. Can you please advise us on the source of the nondeterminism? We do believe that sorting the bam files before merging would work and we are in the process of implementing it in our pipeline, but my team would like to understand the root cause so that we can properly assess the validity of our results.

Regarding the issue of nondeterminism, I would like to clarify that it stems from samtools itself. Even after removing duplicated reads, we still observe differences in the read disorder. To investigate this further, I examined the Clair3 results in ${OUTPUT_DIR}/tmp/clair3_output/clair3_tumor_output/merge_output.vcf.gz and compared them with the samtools mplieup result of the first different record (chr1:143185622). Unfortunately, I found that the output is still different due to the different read order(figure 1 and figure2).

The issue with the read order affects the ClairS VCF result because we encode the alignment into a spatial read-level representation in our full-alignment model. Different read orders lead to different input arrangements and thus giving different outputs for some variants.

To address this issue, I would still recommend fixing the BAM order in the merging step to avoid distinct read orders. Please feel free to let us know if you have any further questions.

Empty VCF produced by ClairS

Hi,

I ran a set of 21 tumour-normal pairs. All jobs completed successfully but 5 runs returned empty VCFs. Looking into the log file I found the following:

image

I am running ClairS v0.0.1 from dockerhub, accessed on January 23rd.

Thanks,
Rowan

[Ask for insights on Illumina results regarding ClairS workflow/design choices]

Hello ClairS Team,

Firstly, I'd like to express my gratitude for your outstanding research and for making the code available to the community.

In Illumina dataset, I am curious on what point ClairS provides advantage over other baselines such as VarNet and NeuSomatic. (workflow/design or synthetic dataset construction or larger or better model architecture)

  1. Have you compared SNP/INDEL calling performance between ClairS and (VarNet or Neusomatic) when they are trained on identical datasets? I am curious whether the amount and quality of the training dataset have impacted the performance the most or not. (I saw the ablation study on your supplementary material regarding Nanopore dataset)

  2. What is the key factor that improved the performance over Neusomatic and VarNet?

Thank you for your time and insights.

Best regards,
Quito418

Option to call SNPs only

Similar to Clair3 (issue #40), is it possible to add an option to only call SNPs and ignore indels?
Thanks!

ClairS quits with exit code 0 but no output, no error logged

I'm using ClairS on three samples right now. Two have been running since last night, no issue yet. The third quits within seconds of launching the job with exit code 0 but no output is written. The directory structure is created in the output file. No errors are written in the log. This job is using a custom reference but I get the same behavior with these files if I substitute hg38 as the reference. Is there a debug flag or something I can use to tell me what is going wrong? This is the code I'm running. Any help would be greatly appreciated.

./run_clairs
-T /home/nparma/eee_bams_dsa/DSA_hifiasm_60x/ONT/COLO829-T_PS00361.all.sorted.bam
-N /home/nparma/eee_bams_dsa/DSA_hifiasm_60x/ONT/COLO829-BL_PS00342.all.sorted.bam
-R /home/nparma/eee_bams_dsa/references/dsa60/COLO829-BL_60X_hifisam_merged_haplotypes.fasta
-t 4
-p ont_r10_dorado_sup_5khz
-o /home/nparma/eee_bams_dsa/myanalysis/clairs/T/dsa/clairsout

Questions Regarding Heterozygous Variants, Somatic Mutations, and Phasing in ClairS Usage

Dear Clair Team,

I am a member of the longphase development team. Recently, while using ClairS and studying related literature, I have observed some results that raised a few questions I'd like to inquire about.

  1. During the use of ont_quick_demo.sh, I used IGV to observe the vcf in [Figure 1] process.After organizing the data, I noticed that only the tumor samples were heterozygous (as shown in Figure 2). My question is, during the Germline variant calling step, are only the tumor cells identified as confident heterozygous?

  2. From Figure 2, I observed that the SNPs used for longphase actually include some somatic mutations. I would like to confirm if the SNPs used for phasing are indeed a mix of somatic mutations and Germline variants.

  3. In Figure 3, the number of SNPs used for phasing seems insufficient to cover the entire range. Attempting other variant calling software, I found that Clair3 (v1.0.5) results could cover a broader range. After phasing and haplotagging with Clair3 and longphase (v1.6), I obtained the results as shown in Figure 4 (B) (only displaying SNPs that were phased). I noticed that different haplotypes could still be distinguished in ClairS's output (as seen in Figure 5). Have you considered using this method?

  4. For ClairS, tagging appears to be a crucial step. The ideal output would include H1/H2 and H2 (carrying somatic mutations), or H1, H2, H3 (tumor-specific), etc. Would such an approach be beneficial for the training or detection of the ClairS model? If you think it would be helpful, we are considering developing a new version focused on somatic mutations.

Figure 1
f1

Figure 2
f2

Figure 3
f3

Figure 4
f4

Figure 5
f5

Thank you for your time and assistance!

select_hetero_snp_for_phasing breaks for contigs where no variants are found

VCFs are not generated for contigs where no variants are found, which breaks select_hetero_snp_for_phasing.
clairs_output/vcf contains these files:

chr1.vcf chr16.vcf chr21.vcf chr22.vcf chr4.vcf chr5.vcf

Errors pasted from some of 1_select_hetero_snp_for_phasing.log:

[INFO] Total HET SNP calls selected: chr22: 119, not found:27, not match:0, low_qual_count:0. Total normal:57 Total tumor:146, pro: 0.815068493150685
Traceback (most recent call last):
File "/opt/bin/clairs.py", line 120, in
main()
File "/opt/bin/clairs.py", line 114, in main
submodule.main()
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 157, in main
select_hetero_snp_for_phasing(args)
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 118, in select_hetero_snp_for_phasing
pro = len(pass_variant_dict) / len(tumor_qual_dict)
ZeroDivisionError: division by zero
Traceback (most recent call last):
File "/opt/bin/clairs.py", line 120, in
main()
File "/opt/bin/clairs.py", line 114, in main
submodule.main()
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 157, in main
select_hetero_snp_for_phasing(args)
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 118, in select_hetero_snp_for_phasing
pro = len(pass_variant_dict) / len(tumor_qual_dict)
ZeroDivisionError: division by zero
[INFO] Total HET SNP calls selected: chr1: 220, not found:66, not match:0, low_qual_count:0. Total normal:73 Total tumor:286, pro: 0.7692307692307693
[INFO] Total HET SNP calls selected: chr21: 1, not found:1, not match:0, low_qual_count:0. Total normal:31 Total tumor:2, pro: 0.5
Traceback (most recent call last):
File "/opt/bin/clairs.py", line 120, in
main()
File "/opt/bin/clairs.py", line 114, in main
submodule.main()
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 157, in main
select_hetero_snp_for_phasing(args)
File "/opt/bin/src/select_hetero_snp_for_phasing.py", line 118, in select_hetero_snp_for_phasing
pro = len(pass_variant_dict) / len(tumor_qual_dict)
ZeroDivisionError: division by zero

samtools index: failed to open error

Good day,
I am using ClairS-TO using the following command:

singularity exec /apps/chpc/bio/clairs_to/clairs-to_latest.sif run_clairs_to -T BC1.sorted.bam -R GCF_000001405.40_GRCh38.p14_genomic.fna -t 8 -p ont_r10_dorado_hac_4khz --include_all_ctgs --output_dir output-BC1

I however get the following error (I don't get a final .vcf file but I do get many vcf files in the tmp folder):

samtools index: failed to open "/mnt/lustre/groups/CBBI1617/BreastCancer-fastq/SNV-analysis/input/output-BC1/tmp/phasing_output/phased_bam_output/tumor_NT_187421.1.bam": No such file or directory
ERROR in STEP 11, THE FOLLOWING COMMAND FAILED: ( parallel --joblog /mnt/lustre/groups/CBBI1617/BreastCancer-fastq/SNV-analysis/input/output-BC1/logs/phasing_log/parallel_3_haplotag_tumor.log -j 8 /opt/micromamba/envs/clairs-to/bin/longphase haplotag -o /mnt/lustre/groups/CBBI1617/BreastCancer-fastq/SNV-analysis/input/output-BC1/tmp/phasing_output/phased_bam_output/tumor_{1}.bam --reference /mnt/lustre/groups/CBBI1617/BreastCancer-fastq/SNV-analysis/input/GCF_000001405.40_GRCh38.p14_genomic.fna --region {1} -s /mnt/lustre/groups/CBBI1617/BreastCancer-fastq/SNV-analysis/input/output-BC1/tmp/phasing_output/phased_vcf_output/tumor_phased_{1}.vcf.gz -b /mnt/lustre/groups/CBBI1617/BreastCancer-fastq/SNV-analysis/input/BC1.sorted.bam :::: /mnt/lustre/groups/CBBI1617/BreastCancer-fastq/SNV-analysis/input/output-BC1/tmp/CONTIGS ) 2>&1 | tee /mnt/lustre/groups/CBBI1617/BreastCancer-fastq/SNV-analysis/input/output-BC1/logs/phasing_log/3_tumor_haplotag.log && parallel -j 8 samtools index -@8 /mnt/lustre/groups/CBBI1617/BreastCancer-fastq/SNV-analysis/input/output-BC1/tmp/phasing_output/phased_bam_output/tumor_{1}.bam :::: /mnt/lustre/groups/CBBI1617/BreastCancer-fastq/SNV-analysis/input/output-BC1/tmp/CONTIGS

Could I please get some assistance on this.

Question in training data label generation code - get_candidates.py

Dear ClairS Team,

While reviewing the code at the following link:

normal_af, vt, max_af = find_candidate_match(alt_info_dict=alt_dict[pos].alt_dict, ref_base=ref_base,

I noticed a potential issue related to the find_candidate_match function.
It appears that the function may be utilizing the tumor alt_dict instead of the normal alt_dict.
Could you please take a moment to verify this?

Thank you for your time and support!

Best regards,

Interpreting the QUAL column

Hi @aquaskyline and @zhengzhenxian,

Thank you for releasing an early access version of ClairS. Just playing around with this, I had a quick clarification question:

I was wondering what the QUAL column value signifies. Broadly, I believe the "QUAL value reflects how confident we are that a site displays some kind of variation considering the amount of data available", in this case, is it the confidence that a call is somatic? And how do you generate this QUAL score?

Thanks.

Docker latest version

The command docker pull hkubal/clairs:latest was pulling the 1.4 version I had to use specific tag docker pull hkubal/clairs:v0.1.5
to get the latest version after I checked the release number

ClairS crashing with spaces in input file name

Hello,
I come across an issue when the software is provided input files with spaces in their path.

$ ~/Software/ClairS/run_clairs --tumor_bam_fn Test\ data/demo_tumor.remap.bam --normal_bam_fn Test\ data/demo_tumor.remap.bam --bed_fn Test\ data/demo.bed --ref_fn Test\ data/GCA_000001405.15_GRCh38_no_alt_analysis_set_chr20.fna --threads 4 -o Test\ data/TESTING_BREAK/ClairS_test -p ont_r10_guppy

[INFO] Create folder Test data/TESTING_BREAK/ClairS_test

[COMMAND] /home/talentia/Software/ClairS/run_clairs --tumor_bam_fn Test data/demo_tumor.remap.bam --normal_bam_fn Test data/demo_tumor.remap.bam --ref_fn Test data/GCA_000001405.15_GRCh38_no_alt_analysis_set_chr20.fna --threads 4 --platform ont_r10_guppy --output_dir Test data/TESTING_BREAK/ClairS_test --bed_fn Test data/demo.bed

[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/logs
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/split_beds
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/candidates
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/pileup_tensor_can
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/fa_tensor_can
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/vcf_output
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/tmp_vcf_output
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/logs/clair3_log
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/clair3_output/phased_output
[INFO] Create folder /home/UID name/Test/Test data/TESTING_BREAK/ClairS_test/tmp/clair3_output/vcf
gzip: /home/UID.gz: No such file or directory
gzip: name/Test/Test.gz: No such file or directory
gzip: data/demo.bed.gz: No such file or directory
[INFO] --include_all_ctgs not enabled, use chr{1..22,X,Y} and {1..22,X,Y} by default
[WARNING] No contig in --bed_fn was found in the reference, contigs in BED /home/UID name/Test/Test data/demo.bed:

This seems to be related to how the workflow decompresses the input files, particularly the input BED and genotyping VCF here and here.
This is particularly relevant when the space is within the user name, making it impossible to run the software.

Thanks
Andrea

Haplotype filtering step keep stuck

Hi,
ClairS keep stuck at this command after completing phasing for hours on ONT data. No output in 4_HAP_FILTER.log log file either.
I am running second time, and seeing same issue.

[INFO] STEP 4: Haplotype filtering
[INFO] RUN THE FOLLOWING COMMAND:
( pypy3 /opt/bin/clairs.py haplotype_filtering --tumor_bam_fn /data/tahmad/devel/ClairS_HCC1437/tmp/clair3_output/phased_output/tumor_ --ref_fn /data/tahmad/alignment/grch38_chr.fasta --germline_vcf_fn /data/tahmad/devel/ClairS_HCC1437/tmp/clair3_output/clair3_tumor_output/merge_output.vcf.gz --pileup_vcf_fn /data/tahmad/devel/ClairS_HCC1437/tmp/vcf_output/pileup.vcf --full_alignment_vcf_fn /data/tahmad/devel/ClairS_HCC1437/tmp/vcf_output/full_alignment.vcf --output_dir /data/tahmad/devel/ClairS_HCC1437/tmp/vcf_output --samtools samtools --threads 56 ) 2>&1 | tee /data/tahmad/devel/ClairS_HCC1437/logs/4_HAP_FILTER.log

My clairS run command:

singularity run -B /data/tahmad /data/tahmad/images/clairs_latest.sif /opt/bin/run_clairs --threads 56 --phase_tumor True --longphase True --tumor_bam_fn /data/tahmad/devel/bams_haplotagged/HCC1437_ONT.bam --normal_bam_fn /data/tahmad/devel/bams_haplotagged/HCC1437BL_ONT.bam --ref /data/tahmad/alignment/grch38_chr.fasta --output_dir /data/tahmad/devel/ClairS_HCC1437 --platform ont_r10

I see following BAMs and phased VCFs:

ls -lh /data/tahmad/devel/ClairS_HCC1437/tmp/clair3_output/phased_output/tumor_
tumor_chr10.bam                tumor_chr17.bam.bai            tumor_chr3.bam                 tumor_phased_chr10.vcf.gz.tbi  tumor_phased_chr18.vcf.gz      tumor_phased_chr3.vcf.gz.tbi
tumor_chr10.bam.bai            tumor_chr18.bam                tumor_chr3.bam.bai             tumor_phased_chr11.vcf.gz      tumor_phased_chr18.vcf.gz.tbi  tumor_phased_chr4.vcf.gz
tumor_chr11.bam                tumor_chr18.bam.bai            tumor_chr4.bam                 tumor_phased_chr11.vcf.gz.tbi  tumor_phased_chr19.vcf.gz      tumor_phased_chr4.vcf.gz.tbi
tumor_chr11.bam.bai            tumor_chr19.bam                tumor_chr4.bam.bai             tumor_phased_chr12.vcf.gz      tumor_phased_chr19.vcf.gz.tbi  tumor_phased_chr5.vcf.gz
tumor_chr12.bam                tumor_chr19.bam.bai            tumor_chr5.bam                 tumor_phased_chr12.vcf.gz.tbi  tumor_phased_chr1.vcf.gz       tumor_phased_chr5.vcf.gz.tbi
tumor_chr12.bam.bai            tumor_chr1.bam                 tumor_chr5.bam.bai             tumor_phased_chr13.vcf.gz      tumor_phased_chr1.vcf.gz.tbi   tumor_phased_chr6.vcf.gz
tumor_chr13.bam                tumor_chr1.bam.bai             tumor_chr6.bam                 tumor_phased_chr13.vcf.gz.tbi  tumor_phased_chr20.vcf.gz      tumor_phased_chr6.vcf.gz.tbi
tumor_chr13.bam.bai            tumor_chr20.bam                tumor_chr6.bam.bai             tumor_phased_chr14.vcf.gz      tumor_phased_chr20.vcf.gz.tbi  tumor_phased_chr7.vcf.gz
tumor_chr14.bam                tumor_chr20.bam.bai            tumor_chr7.bam                 tumor_phased_chr14.vcf.gz.tbi  tumor_phased_chr21.vcf.gz      tumor_phased_chr7.vcf.gz.tbi
tumor_chr14.bam.bai            tumor_chr21.bam                tumor_chr7.bam.bai             tumor_phased_chr15.vcf.gz      tumor_phased_chr21.vcf.gz.tbi  tumor_phased_chr8.vcf.gz
tumor_chr15.bam                tumor_chr21.bam.bai            tumor_chr8.bam                 tumor_phased_chr15.vcf.gz.tbi  tumor_phased_chr22.vcf.gz      tumor_phased_chr8.vcf.gz.tbi
tumor_chr15.bam.bai            tumor_chr22.bam                tumor_chr8.bam.bai             tumor_phased_chr16.vcf.gz      tumor_phased_chr22.vcf.gz.tbi  tumor_phased_chr9.vcf.gz
tumor_chr16.bam                tumor_chr22.bam.bai            tumor_chr9.bam                 tumor_phased_chr16.vcf.gz.tbi  tumor_phased_chr2.vcf.gz       tumor_phased_chr9.vcf.gz.tbi
tumor_chr16.bam.bai            tumor_chr2.bam                 tumor_chr9.bam.bai             tumor_phased_chr17.vcf.gz      tumor_phased_chr2.vcf.gz.tbi   
tumor_chr17.bam                tumor_chr2.bam.bai             tumor_phased_chr10.vcf.gz      tumor_phased_chr17.vcf.gz.tbi  tumor_phased_chr3.vcf.gz  

tmp folders not being deleted after calling

The tmp folders created by ClairS are not deleted after the run is finished, or if they are, it is not consistent. They take up a significant amount of space and can unexpectedly fill up workspaces.

Adding Normal Sample GT to the VCF file

Would it be possible to add the Normal sample GT/DP/AO to the somatic vcf just for comparison -- for example you can imagine that you have a few "alt reads" in the normal sample compared to 5% or 10% in the tumor which might be much more... we use this to filter out possible FPs. Alternately could you spit out a normal VCF with the ref calls for the same positions in the somatic file. Those could be merged with BCFtools.

Thanks!

Bam files for SEQC2 samples

Hello,

Could you make it available SEQC2 ONT v10.4.1 full bams (or even fastq) for both Tumor and Normal samples available to run ClairS?

Keyur

Enhancing somatic variant calling and execution speed

Hello,

I am working with HCC1395 data, analyzing tumor samples at 75x coverage and normal samples at 45x coverage. I utilized Clair3 to process the normal.bam file, generating a normal.vcf. This file was then employed for phasing and haplotagging the tumor.bam, followed by using a somatic mutation caller. The results showed a notable decrease in false positives.

phase and haplotag Precision Recall F1-score TP FP FN
ClairS germline.vcf 67.12% 77.64% 72.00% 30626 15001 8821
Clair3 normal.vcf 72.50% 77.46% 74.90% 30556 11593 8891

In an instance where false positives were converted to true negatives, it was observed that the mutations in the normal sample were heterozygous, whereas in the tumor sample, they were homozygous. This suggests a loss of heterozygosity (LOH) event, making the strategy of phasing and tagging most reads into the same haplotype seem correct. Have you considered this method?

image

Moreover, I noted in literature that the primary reason for choosing Longphase for phasing is its speed. We still have a speed advantage in haplotagging. ClairS employs parallel acceleration at the chromosome level and we can introduce a feature to specify a range. Could this reduce the training costs for you? I also conducted a haplotag test, and the results do not seem to show any significant differences.

haplotag Precision Recall F1-score TP FP FN
whatshap v1.7 67.12% 77.64% 72.00% 30626 15001 8821
longphase v1.3 67.27% 77.62% 72.07% 30617 14897 8830

question: model for 5khz data

Hi Ruibang,

Since nanopore sequencing software updated to 5kHz sampling mode, I am wondering if there will be a pre-trained model for 5kHz data. Can I still use --platform ont_r10 (ont_r104_e81_sup_g5015) for now? Thank you.

Cheers
Jia

Running with singularity

Hi,
I am trying to run ClairS with singularity, but it keeps looking for a model and an executable in my ${CONDA_PREFIX} . First I had to tweak a bit the command from the README, since -v is docker syntax, not singularity, the SIF file goes after the options and you are specifying it twice. This is my first command:

singularity exec \
-B ${TUM_INPUT}:${TUM_INPUT},${GL_INPUT}:${GL_INPUT},${REF_DIR}:${REF_DIR},${OUT}:${OUT} \ 
clairs_latest.sif \
/opt/bin/run_clairs \
--tumor_bam_fn ${TUM_INPUT}/test.sorted.bam \
--normal_bam_fn ${GL_INPUT}/test.gl.sorted.bam \
--ref_fn ${REF_DIR}/Homo_sapiens_assembly38.fasta \
--threads 32 \
--platform ont_r9 \
--output_dir ${OUT}

After which I get an error:
[ERROR] file ${CONDA_PREFIX}/bin/clairs_models/ont_r9 not found
I then unpacked the models in the folder where ClairS is looking for them but I still get an error:
[ERROR] Cannot find clair3 main entry in ${CONDA_PREFIX}/bin
So is the docker/singularity image self contained? I also tried installing through conda but the environment does not solve with incompatible specifications.
I am not an advanced user of singularity, so might be doing something wrong.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.