Git Product home page Git Product logo

mocha's People

Contributors

freeseek avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

mocha's Issues

no mLOY result but normal mLOX and mCA of male in autosomes

Hi,

Thank you for this wonderful tools! As I am applying this tool on my project and I have a question that cannot find the answer by myself and hope to get some clues from you.

I have successfully received the results from mocha and I would like to get the mLOY and mLOX. I can get numbers of mLOX from the results but never get an any mLOY samples (I have ~half of the samples are male in the original genotype data). Are there any potential reasons you this may cause this happen or do you have any suggestions how can I trouble shoot this?

Thank you for any of your incoming suggestions!

query about calling CNV with multi-samples

Dear developer

I want to call CNV among multi-samples.

So when i run mocha, is there any difference for the output CNV results between inputting samples one by one or merging these samples?

Thank you!

mochatools input format

Getting this error:
Error: input VCF file BAF format field is not of float type

When I try to run this script file:
bcftools +mochatools test.vcf -Ob -o output.bcf -t ALLELE_A,ALLELE_B,GC

My vcf like this:
......
.......
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 207052560007_R01C01
1 629241 rs10458597 C T . PASS . GT:GS:BAF:LRR 0/0:0.3807:0:0.0061176764
1 632287 rs9651229 C T . PASS . GT:GS:BAF:LRR 0/0:0.3100:0:0.092227906
1 632828 rs9701872 T C . PASS . GT:GS:BAF:LRR 0/0:0.3479:0:0.033441655

Any suggestions on filters for downstream analysis?

Hi! Below are a few reflections on the different filters used to identify mLOY, mLOX and autosomal mCAs in the tool.

  1. We generated the phenotype files from the filtered MoChA calls.
  2. #33 clarifies that mLOY and mLOX event counts obtained using LRR classification would differ from counts obtained from filters used to generate the pheno.filtered.tsv file (f["computed_gender"])=="M" && $(f["chrom"])~"X" && $(f["length"])>2e6 && $(f["rel_cov"])<2.5) because using classification by LRR (type column) would not be reliable for chromosomes X and Y. However, we found 36.8% mLOY events in the male subset of our data based on filters applied to calls.filters.tsv which is substantially larger than what we identified for mLOY in males from the European-ancestry subset of the UK Biobank (12.6%).
  3. Autosomal mCA event counts are substantially different when using LRR classification (mca.calls.filtered.tsv) compared to filters used to generate the pheno.filtered.tsv file .

Isn’t LRR supposed to be reliable enough for chromosomes 1:22? Which filter would you recommend for downstream analysis? Thanks for your help, I'm happy to clarify any points if needed.

Working with BAM files

Hello, how are you?

I am trying to use MoChA to identify somatic mosaicism in BAM files. I would like suggestions on how I can convert the BAM files to VCF so that I can use MoChA. I have tried using the command:

samtools mpileup -uf GCA_000001405.15_GRCh38_no_alt_analysis_set.fna myfilename.bam | bcftools call -mv > var.raw.vcf

But when I run the following code:

bcftools +mocha -g GRCh38 -o output.bcf -Ob -c calls.tsv -z stats.tsv var.raw.vcf

I get the following error:

Failed to open /path/to/fie/var.raw.vcf: not compressed with bgzip

So I used bgzip to compress the file, with the following command:

bgzip < var.raw.vcf > var.raw.vcf.gz

and I ran the command again, and it returned the following error:

Error: input VCF file must contain either the AD format field or the BAF and LRR format fields

I would like help in finding the best way to process the bam files.

Cheers, Felipe.

Crash before/during call-gtc_tsv, difficult to interpret the error message...

Error message:

[2022-04-27 11:50:34,79] [info] BackgroundConfigAsyncJobExecutionActor [e38dbc61mocha.gtc_tsv:NA:1]: job id: 22366 [2022-04-27 11:50:34,79] [info] BackgroundConfigAsyncJobExecutionActor [e38dbc61mocha.gtc_tsv:NA:1]: Status change from - to Done [2022-04-27 11:50:35,16] [info] WorkflowManagerActor: Workflow e38dbc61-b064-4141-9c6a-e5d54d6dd2f3 failed (during ExecutingWorkflowState): java.lang.RuntimeException: Failed to evaluate 'if_condition' (reason 1 of 1): Evaluating (length(idx2gtc2vcf_files[idx]) > 1) failed: Bad Map access idx2gtc2vcf_files[idx]: This Map[Int, Array[String]] does not have a Int index value [5] at cromwell.engine.workflow.lifecycle.execution.keys.ExpressionKey.processRunnable(ExpressionKey.scala:29) at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.$anonfun$startRunnableNodes$7(WorkflowExecutionActor.scala:563) at scala.collection.immutable.List.map(List.scala:293) at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.cromwell$engine$workflow$lifecycle$execution$WorkflowExecutionActor$$startRunnableNodes(WorkflowExecutionActor.scala:557) at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:211) at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:209) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172) at akka.actor.FSM.processEvent(FSM.scala:710) at akka.actor.FSM.processEvent$(FSM.scala:704) at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.akka$actor$LoggingFSM$$super$processEvent(WorkflowExecutionActor.scala:54) at akka.actor.LoggingFSM.processEvent(FSM.scala:847) at akka.actor.LoggingFSM.processEvent$(FSM.scala:829) at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.processEvent(WorkflowExecutionActor.scala:54) at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:701) at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:695) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$receive$1.applyOrElse(WorkflowExecutionActor.scala:507) at akka.actor.Actor.aroundReceive(Actor.scala:539) at akka.actor.Actor.aroundReceive$(Actor.scala:537) at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.akka$actor$Timers$$super$aroundReceive(WorkflowExecutionActor.scala:54) at akka.actor.Timers.aroundReceive(Timers.scala:51) at akka.actor.Timers.aroundReceive$(Timers.scala:40) at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.aroundReceive(WorkflowExecutionActor.scala:54) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:614) at akka.actor.ActorCell.invoke(ActorCell.scala:583) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268) at akka.dispatch.Mailbox.run(Mailbox.scala:229) at akka.dispatch.Mailbox.exec(Mailbox.scala:241) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

imp5Converter Output file format is not supported

Hi there,
I've recently encountered an issue while trying to download resources for GRCh37 while follow the installation steps. When running the imp5Converter command to convert BCF files. I encountered the following error message: "Output file format is not supported". Here is the command I used: "for chr in {{1..22},X}; do imp5Converter --h $pfx$chr$sfx.bcf --o $pfx$chr$sfx --r $chr; done". Could you offer any help or guidance on this matter?
Thank you in advance for your time and assistance.

question about calling CNV with tumor and normal-pair sample

Dear developer

I want to call CNV with my tumor sample(array data), and the normal-pair sample i had.

When I run the MoChA pipeline, some hg19/hg39 related files was needed.

Should I use the nomal sample related files as the input instead of that hg19/hg39 related files? Could you tell me which files I should change?

Thank you!

segfault at basic qc step

I'm using the WGS and array example posted on the README as my VCF then converted to BCF. When I run the data QC step (excluding dups, etc) using the command on either the WGS or array bcf

$HOME/bin/bcftools annotate --no-version -Ou -a dup.grch37.bed.gz -c CHROM,FROM,TO,JK -h jk.header.txt  wgs.unphased.bcf

I get the following segfault

(gdb) run annotate --no-version -Ou -a dup.grch37.bed.gz -c CHROM,FROM,TO,JK -h jk.header.txt  wgs.unphased.bcf
Starting program: /rsrch2/epi/kchang3/bin/bcftools annotate --no-version -Ou -a dup.grch37.bed.gz -c CHROM,FROM,TO,JK -h jk.header.txt  wgs.unphased.bcf
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x000000000044cb4d in setter_info_real ()
(gdb)

I have attached the bcf and header files.
example.zip

Question about call rate

Hi,

I am working with WGS data, and I am wondering how the call rate should be determined. Thanks in advance for your help!

Best regards,
Yifan

404 Error for downloading bcftools patches

$ wget -P bcftools https://raw.githubusercontent.com/freeseek/gtc2vcf/master/{Makefile.patch,main.patch,norm.patch,vcfmocha.c,sumlog.c}
--2018-07-12 11:05:38--  https://raw.githubusercontent.com/freeseek/gtc2vcf/master/Makefile.patch
Resolving raw.githubusercontent.com... 151.101.24.133
Connecting to raw.githubusercontent.com|151.101.24.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-07-12 11:05:38 ERROR 404: Not Found.

--2018-07-12 11:05:38--  https://raw.githubusercontent.com/freeseek/gtc2vcf/master/main.patch
Reusing existing connection to raw.githubusercontent.com:443.
HTTP request sent, awaiting response... 404 Not Found
2018-07-12 11:05:38 ERROR 404: Not Found.

--2018-07-12 11:05:38--  https://raw.githubusercontent.com/freeseek/gtc2vcf/master/norm.patch
Reusing existing connection to raw.githubusercontent.com:443.
HTTP request sent, awaiting response... 404 Not Found
2018-07-12 11:05:38 ERROR 404: Not Found.

--2018-07-12 11:05:38--  https://raw.githubusercontent.com/freeseek/gtc2vcf/master/vcfmocha.c
Reusing existing connection to raw.githubusercontent.com:443.
HTTP request sent, awaiting response... 404 Not Found
2018-07-12 11:05:38 ERROR 404: Not Found.

--2018-07-12 11:05:38--  https://raw.githubusercontent.com/freeseek/gtc2vcf/master/sumlog.c
Reusing existing connection to raw.githubusercontent.com:443.
HTTP request sent, awaiting response... 404 Not Found
2018-07-12 11:05:39 ERROR 404: Not Found.

However this seems to work:

$ wget -P bcftools https://raw.githubusercontent.com/freeseek/mocha/master/{Makefile.patch,main.patch,norm.patch,vcfmocha.c,sumlog.c}

Is the above command the correct one?

question with call cnv for array data

hi there,

I have successfully run the phase pipeline.

but it's seem that the Chromosomal alterations pipeline guileline was only described for the WGS data but not array data? Could you send me some example as reference.

Conceptual Confusion about CNV and mCA

Hi Giulio,

we know mocha can call mosaic chromosomal alterations from array data with BAF and LRR information, and pennCNV is a useful tool to call CNV from array data with BAF and LRR information, too. The frameworks of mocha and penncnv are based on HMM.

So I'm a little confused about the concept of CNV and mCA.

Can you simply help me distinguish the two and tell me the difference between penncnv and mocha in methods.

Thanks.

Some questions when plotting results

Q1: Getting Execution halted in mocha_plot.R script

Command:

mocha_plot.R   --mocha   --stats $dir/$pfx.stats.tsv   --vcf $dir/$pfx.bdev.bcf   --png MH0145622.png  --cytoband $cyto --samples 205058610011_R03C02 --regions all

Error

Attaching package: ‘reshape2’

The following objects are masked from ‘package:data.table’:

    dcast, melt

mocha_plot.R 2021-05-14 https://github.com/freeseek/mocha

Command: bcftools query --format "[%CHROM\t%POS\t%REF\t%ALT\t%SAMPLE\t%GT\t%INFO/ADJ_COEFF{0}\t%INFO/ADJ_COEFF{1}\t%INFO/ADJ_COEFF{2}\t%INFO/ADJ_COEFF{3}\t%INFO/ADJ_COEFF{4}\t%INFO/ADJ_COEFF{5}\t%INFO/ADJ_COEFF{6}\t%INFO/ADJ_COEFF{7}\t%INFO/ADJ_COEFF{8}\t%INFO/GC\t%INFO/ALLELE_A\t%INFO/ALLELE_B\t%BAF\t%LRR\t%Ldev\t%Bdev\n]" /home/ubuntu/snp_analysis/mocha//mocha_.bdev.bcf --samples 205058610011_R03C02
Error in fread(cmd, sep = "\t", header = FALSE, na.strings = ".", colClasses = list(character = c(1,  : 
  File is empty: /dev/shm/file193e37cafc20
Calls: setNames -> fread
Execution halted

Troubleshooting: Viewing contents in $dir/$pfx.bdev.bcf

  • Command:
bcftools view --header-only $dir/$pfx.bdev.bcf

Output

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
.....
##EGT=GSA-24v3-0_A1_ClusterFile.egt
##INFO=<ID=ADJ_COEFF,Number=9,Type=Float,Description="Adjust coefficients (order=AA_BAF0,AA_BAF1,AA_LRR0,AB_BAF0,AB_BAF1,AB_LRR0,BB_BAF0,BB_BAF1,BB_LRR0)">
##FORMAT=<ID=Ldev,Number=1,Type=Float,Description="LRR deviation due to chromosomal alteration">
##FORMAT=<ID=Bdev,Number=1,Type=Float,Description="BAF deviation due to chromosomal alteration">
##FORMAT=<ID=AS,Number=1,Type=Integer,Description="Allelic shift (1/-1 if the alternate allele is over/under represented)">
##bcftools_viewVersion=1.13-44-g3a918b7+htslib-1.13-23-g3eada2f
##bcftools_viewCommand=view --header-only mocha//mocha_.bdev.bcf; Date=Sun Oct  3 22:46:04 2021
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	205058610011_R07C01	205058610011_R03C02	205058610011_R05C02	205058610011_R07C02	205058610011_R06C02	205058610011_R04C01	205058610011_R10C01	205058610011_R01C02	205058610011_R08C02	205058610011_R03C01	205058610011_R01C01	205058610011_R05C01	205058610011_R04C02	205058610011_R02C02	205058610011_R11C01	205058610011_R02C01	205058610011_R06C01	205058610011_R10C02	205058610011_R12C01	205058610011_R08C01	205058610011_R09C01	205058610011_R09C02	205058610011_R12C02	205058610011_R11C02

  • Although $dir/$pfx.bdev.bcf is not empty, running below command does not output anything to the terminal.. I think this is why in the first command I get an error. Can you please have a look at this?
bcftools query -f '%CHROM\t%POS\t%REF\n' $dir/$pfx.bdev.bcf --samples 205058610011_R03C02

Q2: From where can I get the age_plot.R script?

Thanks,
Rashindrie

Format of the file with call rate information

Hi,

Im trying to use mocha on my genotype array data. After creating the minimal BCF, I tried the following command

awk -F"\t" '$2<.97 {print $1}' $crt > samples_xcl_list.txt; \
echo '##INFO=<ID=JK,Number=1,Type=Float,Description="Jukes Cantor">' | \
  bcftools annotate --no-version -Ou -a $dup -c CHROM,FROM,TO,JK -h /dev/stdin $dir/$pfx.unphased.bcf | \
  bcftools view --no-version -Ou -S ^samples_xcl_list.txt | \
  bcftools +fill-tags --no-version -Ou -t ^Y,MT,chrY,chrM -- -t ExcHet,F_MISSING | \
  bcftools view --no-version -Ou -G | \
  bcftools annotate --no-version -Ob -o $dir/$pfx.xcl.bcf \
    -i 'FILTER!="." && FILTER!="PASS" || INFO/JK<.02 || INFO/ExcHet<1e-6 || INFO/F_MISSING>1-.97' \
    -x ^INFO/JK,^INFO/ExcHet,^INFO/F_MISSING && \
  bcftools index -f $dir/$pfx.xcl.bcf;
/bin/rm samples_xcl_list.txt

This gave me the below error:

Could not parse JK at chr1:632287 .. [-]
Error: exclude called for sample that does not exist in header: "Sample ID,Call Rate". Use "--force-samples" to ignore this error.
Failed to read from standard input: unknown file type
Failed to read from standard input: unknown file type
Failed to read from standard input: unknown file type

The format of the file I currently use for $crt

Sample ID,Call Rate
06-056,0.9990433
06-061,0.9992602
06-044,0.9994463
06-011,0.9994555
06-012,0.9993386
.....

I wasn't sure how the $crt file should look like, so I created a CSV in the following format. I was wondering if there is a set specification for this file?

Thanks,
Rashindrie

Portability patches

Are you open to doing away with nested functions for portability?

Besides being a gcc-only extension, nested functions simply won't work on some platforms, even if compiling with GCC, due to stack underlying protections. There's a thorough explanation here:

https://thephd.dev/lambdas-nested-functions-block-expressions-oh-my

The code in its current form won't compile with clang and seg faults in kmin_brent() on FreeBSD when compiled with gcc10, presumably due to stack protections that vanilla Linux does not enforce.

Below are patches converting some of the nested functions to standard C. Each one I add kicks the seg fault down the road to the next nested function. This does fluff up the code a bit, but it need only be done once and it's pretty simple. Just takes some time and care. If you're open to standardizing the code this way, I'll finish up and test.

--- plugins/mocha.c.orig        2021-10-15 02:37:57 UTC
+++ plugins/mocha.c
@@ -705,6 +705,44 @@ static double baf_phase_lod(const float *baf_arr, cons
     return (double)ret * M_LOG10E;
 }
 
+typedef struct
+{
+    const float *baf;
+    const int8_t *gt_phase;
+    int n;
+    const int *imap;
+    int8_t *path;
+    float err_log_prb;
+    float baf_sd;
+}   f1_data_t;
+
+double f1(double x, void *data)
+{
+    f1_data_t  *d = data;
+
+    return -baf_phase_lod(d->baf, d->gt_phase, d->n, d->imap, d->path,
+                          d->err_log_prb, d->baf_sd, x);
+}
+
+static inline f1_data_t *f1_pack(const float *baf, const int8_t *gt_phase, int n,
+                  const int *imap, int8_t *path, float err_log_prb,
+                  float baf_sd)
+{
+    // Switch to malloc() and free() if more than one object must exist at
+    // any given time
+    static f1_data_t   d;
+
+    d.baf = baf;
+    d.gt_phase = gt_phase;
+    d.n = n;
+    d.imap = imap;
+    d.path = path;
+    d.err_log_prb = err_log_prb;
+    d.baf_sd = baf_sd;
+
+    return &d;
+}
+
 // TODO find a better title for this function
 static float compare_models(const float *baf, const int8_t *gt_phase, int n, const int *imap, float xy_log_prb,
                             float err_log_prb, float flip_log_prb, float tel_log_prb, float baf_sd, const float *bdev,
@@ -716,8 +754,10 @@ static float compare_models(const float *baf, const in
     int n_flips = 0;
     for (int i = 1; i < n; i++)
         if (path[i - 1] && path[i] && path[i - 1] != path[i]) n_flips++;
-    double f(double x, void *data) { return -baf_phase_lod(baf, gt_phase, n, imap, path, err_log_prb, baf_sd, x); }
-    double x, fx = kmin_brent(f, 0.1, 0.2, NULL, KMIN_EPS, &x);
+    double x;
+    f1_data_t *f1_data = f1_pack(baf, gt_phase, n, imap, path, err_log_prb,
+                                baf_sd);
+    double fx = kmin_brent(f1, 0.1, 0.2, f1_data, KMIN_EPS, &x);
     free(path);
     return -(float)fx + (float)n_flips * flip_log_prb * (float)M_LOG10E;
 }
@@ -801,10 +841,40 @@ static double ad_log_lkl(const int16_t *ad0_arr, const
     return (double)ret * M_LOG10E;
 }
 
+typedef struct
+{
+    const int16_t *ad0_arr;
+    const int16_t *ad1_arr;
+    int n;
+    const int *imap;
+    float ad_rho;
+}   f5_data_t;
+
+double f5(double x, void *data)
+{
+    f5_data_t *d = data;
+
+    return -ad_log_lkl(d->ad0_arr, d->ad1_arr, d->n, d->imap, d->ad_rho, x);
+}
+
+static inline f5_data_t *f5_pack(const int16_t *ad0_arr,
+    const int16_t *ad1_arr, int n, const int *imap, float ad_rho)
+{
+    static f5_data_t d;
+
+    d.ad0_arr = ad0_arr;
+    d.ad1_arr = ad1_arr;
+    d.n = n;
+    d.imap = imap;
+    d.ad_rho = ad_rho;
+
+    return &d;
+}
+
 static float get_ad_bdev(const int16_t *ad0_arr, const int16_t *ad1_arr, int n, const int *imap, float ad_rho) {
     double bdev = 0.0;
-    double f(double x, void *data) { return -ad_log_lkl(ad0_arr, ad1_arr, n, imap, ad_rho, x); }
-    kmin_brent(f, 0.1, 0.2, NULL, KMIN_EPS, &bdev);
+    f5_data_t *f5_data = f5_pack(ad0_arr, ad1_arr, n, imap, ad_rho);
+    kmin_brent(f5, 0.1, 0.2, f5_data, KMIN_EPS, &bdev);
     return (float)bdev < 1e-4 ? (float)NAN : (float)bdev;
 }
 
@@ -986,6 +1056,44 @@ static double ad_phase_lod(const int16_t *ad0_arr, con
     return (double)ret * M_LOG10E;
 }
 
+typedef struct
+{
+    const int16_t *ad0;
+    const int16_t *ad1;
+    const int8_t *gt_phase;
+    int n;
+    const int *imap;
+    int8_t *path;
+    float err_log_prb;
+    float ad_rho;
+}   f4_data_t;
+
+double f4(double x, void *data)
+{
+    f4_data_t *d = data;
+
+    return -ad_phase_lod(d->ad0, d->ad1, d->gt_phase, d->n, d->imap, d->path,
+        d->err_log_prb, d->ad_rho, x);
+}
+
+static inline f4_data_t *f4_pack(const int16_t *ad0, const int16_t *ad1,
+    const int8_t *gt_phase, int n, const int *imap, int8_t *path,
+    float err_log_prb, float ad_rho)
+{
+    static f4_data_t d;
+
+    d.ad0 = ad0;
+    d.ad1 = ad1;
+    d.gt_phase = gt_phase;
+    d.n = n;
+    d.imap = imap;
+    d.path = path;
+    d.err_log_prb = err_log_prb;
+    d.ad_rho = ad_rho;
+
+    return &d;
+}
+
 // TODO find a better title for this function
 static float compare_wgs_models(const int16_t *ad0, const int16_t *ad1, const int8_t *gt_phase, int n, const int *imap,
                                 float xy_log_prb, float err_log_prb, float flip_log_prb, float tel_log_prb,
@@ -998,8 +1106,8 @@ static float compare_wgs_models(const int16_t *ad0, co
     int n_flips = 0;
     for (int i = 1; i < n; i++)
         if (path[i - 1] && path[i] && path[i - 1] != path[i]) n_flips++;
-    double f(double x, void *data) { return -ad_phase_lod(ad0, ad1, gt_phase, n, imap, path, err_log_prb, ad_rho, x); }
-    double x, fx = kmin_brent(f, 0.1, 0.2, NULL, KMIN_EPS, &x);
+    f4_data_t *f4_data = f4_pack(ad0, ad1, gt_phase, n, imap, path, err_log_prb, ad_rho);
+    double x, fx = kmin_brent(f4, 0.1, 0.2, f4_data, KMIN_EPS, &x);
     free(path);
     return -(float)fx + (float)n_flips * flip_log_prb * (float)M_LOG10E;
 }
@@ -1485,6 +1593,51 @@ static float get_path_segs(const int8_t *path, const f
     return 0;
 }
 
+typedef struct
+{
+    float *lrr;
+    int a;
+    int16_t *ad0;
+    int16_t *ad1;
+    mocha_t mocha;
+    const model_t *model;
+    sample_t *self;
+    float *baf;
+}   f3_data_t;
+    
+double f3(double x, void *data)
+{
+    f3_data_t *d = data;
+
+    if (d->model->flags & WGS_DATA)
+        return -lrr_ad_lod(d->lrr + d->a, d->ad0 + d->a, d->ad1 + d->a,
+            d->mocha.n_sites, NULL, NAN, d->model->lrr_bias,
+            d->model->lrr_hap2dip, d->self->adjlrr_sd,
+            d->self->stats.dispersion, x);
+    else
+        return -lrr_baf_lod(d->lrr + d->a, d->baf + d->a, d->mocha.n_sites,
+            NULL, NAN, d->model->lrr_bias, d->model->lrr_hap2dip,
+            d->self->adjlrr_sd, d->self->stats.dispersion, x);
+}
+
+static inline f3_data_t *f3_pack(float *lrr, int a, int16_t *ad0, int16_t *ad1,
+    mocha_t mocha, const model_t *model, sample_t *self, float *baf)
+
+{
+    static f3_data_t d;
+
+    d.lrr = lrr;
+    d.a = a;
+    d.ad0 = ad0;
+    d.ad1 = ad1;
+    d.mocha = mocha;
+    d.model = model;
+    d.self = self;
+    d.baf = baf;
+
+    return &d;
+}
+
 // process one contig for one sample
 static void sample_run(sample_t *self, mocha_table_t *mocha_table, const model_t *model) {
     // do nothing if chromosome Y or MT are being tested
@@ -1735,16 +1888,10 @@ static void sample_run(sample_t *self, mocha_table_t *
             mocha.ldev = get_median(lrr + a, b + 1 - a, NULL);
             get_mocha_stats(pos, lrr, baf, gt_phase, n, a, b, cen_beg, cen_end, length, self->stats.baf_conc, &mocha);
 
-            double f(double x, void *data) {
-                if (model->flags & WGS_DATA)
-                    return -lrr_ad_lod(lrr + a, ad0 + a, ad1 + a, mocha.n_sites, NULL, NAN, model->lrr_bias,
-                                       model->lrr_hap2dip, self->adjlrr_sd, self->stats.dispersion, x);
-                else
-                    return -lrr_baf_lod(lrr + a, baf + a, mocha.n_sites, NULL, NAN, model->lrr_bias, model->lrr_hap2dip,
-                                        self->adjlrr_sd, self->stats.dispersion, x);
-            }
             double bdev_lrr_baf;
-            kmin_brent(f, -0.15, 0.15, NULL, KMIN_EPS, &bdev_lrr_baf);
+            f3_data_t *f3_data = f3_pack(lrr, a, ad0, ad1, mocha, model,
+                 self, baf);
+            kmin_brent(f3, -0.15, 0.15, f3_data, KMIN_EPS, &bdev_lrr_baf);
             if (model->flags & WGS_DATA)
                 mocha.lod_lrr_baf =
                     lrr_ad_lod(lrr + a, ad0 + a, ad1 + a, mocha.n_sites, NULL, model->err_log_prb, model->lrr_bias,
@@ -1923,6 +2070,32 @@ static float get_lrr_cutoff(const float *v, int n) {
     return cutoff;
 }
 
+typedef struct
+{
+    int16_t *ad0;
+    int16_t *ad1;
+    int n;
+}   f2_data_t;
+
+double f2(double x, void *data)
+{
+    f2_data_t *d = data;
+
+    return -lod_lkl_beta_binomial(d->ad0, d->ad1, d->n, NULL, x);
+}
+
+static inline f2_data_t *f2_pack(int16_t *ad0, int16_t *ad1, int n)
+
+{
+    static f2_data_t d;
+
+    d.ad0 = ad0;
+    d.ad1 = ad1;
+    d.n = n;
+
+    return &d;
+}
+
 // this function computes several contig stats and then clears the contig data from the sample
 static void sample_stats(sample_t *self, const model_t *model) {
     int n = self->n;
@@ -1995,9 +2168,8 @@ static void sample_stats(sample_t *self, const model_t
         hts_expand(stats_t, self->n_stats, self->m_stats, self->stats_arr);
 
         if (model->flags & WGS_DATA) {
-            double f(double x, void *data) { return -lod_lkl_beta_binomial(ad0, ad1, n, NULL, x); }
             double x;
-            kmin_brent(f, 0.1, 0.2, NULL, KMIN_EPS, &x); // dispersions above 0.5 are not allowed
+            kmin_brent(f2, 0.1, 0.2, f2_pack(ad0, ad1, n), KMIN_EPS, &x); // dispersions above 0.5 are not allowed
             self->stats_arr[self->n_stats - 1].dispersion = (float)x;
         } else {
             self->stats_arr[self->n_stats - 1].dispersion = get_sample_sd(baf, n, NULL);

genetic map grch38

Hi!

I just noticed that the genetic map has a numeric chromosome notation (1,2,3..X), while the reference panel files include the "chr" prefix. Does this have any effect?

Also, I have compared the genetic files provided by shapeit5_resources and the one provided by mocha, and there seem to be some differences, e.g:

grep -c ^21  genetic_map_hg38_withX.txt
42949

zcat chr21.b38.gmap.gz | wc -l
44619

Any guess of why this might be the case?

One last question, do shapeit5 and impute5 use the same genetic map?

Thank you in advance,

Irantzu

Written 0 variants for all contigs

Hi,

Really scratching my head here with my mocha input. I have the phased annotated of my unphased file and I don't get any variants written to the allelic shift *as.bcf output. It worked in the WDL but I was trying to run on an input file I made and it no longer works. Using us.gcr.io/mccarroll-mocha/r_mocha:1.15.1-20220518 image. It's as if bcf_sr_next_line isn't able to read any lines. This is example of first line of input file along with a phased line:

chr1    629119  .       T       C       .       .       GC=0.4325;ALLELE_A=0;ALLELE_B=1 GT:BAF:LRR      0/0:0.0249828:0.0337044 0/0:0.00921902:-0.0882683
chr1    629990  .       G       A       .       .       GC=0.43;ALLELE_A=1;ALLELE_B=0   GT:BAF:LRR      0|0:0.975216:0.387888   0|0:1.03519:0.455783

And this is input:

batch=Batch_b001
bcftools +mocha \
      --genome GRCh38 \
      --input-stats /storage1/fs1/bolton/Active/projects/mocha/UKBB/arrays/$batch.ukb.sample.tsv \
      --output allelic_shift_mocha_vcfs/ukbb.$batch.as100.bcf \
      --output-type b \
      mocha_in/$batch.phased.100.min.bcf 

and output

...
Using 61951 out of 61952 read variants from contig chr1
Written 0 variants for contig chr1
Using 60212 out of 60212 read variants from contig chr2
Written 0 variants for contig chr2
... (all contigs have "Written 0 variants for ..."

Incomplete LOH call(s) on Chr12p

Hi there,

Thanks for developing such a great tool!

My group is currently using MoChA to scan for mosaic CNVs in our cell cultures using data from our Illumina microarray. We have encountered some false negative calls (in which MoChA seems to miss a clear LOH signal) and I'm hoping you can clear up why MoChA seems to be truncating the calls.

We've implemented MoChA using the directions you outlined (including all the upstream steps). Here is the MoChA code we are implementing:

$BCFTOOLS/bcftools +mocha \ --genome $GENOME_NAME \ --no-version \ --output - \ --output-type b \ --variants ^${OUTPUT_PREFIX}.xcl.bcf \ --calls ${OUTPUT_PREFIX}.calls.tsv \ --stats ${OUTPUT_PREFIX}.stats.tsv \ --ucsc-bed ${OUTPUT_PREFIX}.ucsc.bed \ --mhc $MHC_REG \ --kir $KIR_REG \ --cnp $CNP \ --thread 12 \ ${OUTPUT_PREFIX}.bcf | \ tee ${OUTPUT_PREFIX}.as.bcf | \ $BCFTOOLS/bcftools index --force --output ${OUTPUT_PREFIX}.as.bcf.csi

MoChA detects an LOH signal on Chr12p, however the call seems to miss the half of the signal. There are not additional calls made on the Chr12 (no filtering applied to the callset).

Here are the BAF and LRR plots, the yellow box indicates what MoChA detected:

Chr12_LOH

We are curious if there is any way to adjust MoChA so that it captures the entire LOH event, rather than just a portion. This has happened in 2 other cases, at the same loci in separate samples. Any clarifications would be greatly appreciated.

Cheers,

Ryan

Question regarding setting up MoCha on Google Cloud

Hi Giulio,
Thanks for this well documented WDL pipeline! I am trying to set up my Google VM to run MoCha. And I followed your suggestions up to the step for starting the Cromwell server on the Google VM: "(java -Xmx3500m -Dconfig.file=PAPIv2.conf -jar cromwell-58.jar server &)". However, I got the error message as below -

Google Pipelines API configuration is not valid: Errors:
google configuration stanza does not contain an auth named '[email protected]'. Known auth names: application-default
google configuration stanza does not contain an auth named '[email protected]'. Known auth names: application-default

Do you know how I can fix this?

Thanks!
Jiayi

1.15?

bcftools 1.15 has been released. Are you planning a corresponding release for MoCha, or is 1.14 compatible with bcftools 1.15?

Problem with chr X XTR and Non_par regions

Dear Sir,

after preparing files and preprocessing my VCF file (GC content, phasing, etc...) , I'm trying to run bcftools +mocha and I encounter the following error :

WARNING: bcftools version mismatch .. bcftools at 1.14, the plugin "mocha" at 1.11
MoChA 2021-10-15 https://github.com/freeseek/mocha
Genome reference: GRCh37
Variants to exclude:NG.xcl.bcf
Regions to genotype: /mocha/grch37/cnps.bed
Genome-wide statistics: ng.stats
MHC region to exclude from analysis: 6:27486711-33448264
KIR region to exclude from analysis: 19:54574747-55504099
BAF deviations for LRR+BAF model: -2.0,-4.0,-6.0,10.0,6.0,4.0
BAF deviations for BAF+phase model: 6.0,8.0,10.0,15.0,20.0,30.0,50.0,80.0,100.0,150.0,200.0
Minimum base pair distance between consecutive sites: 400
Order of polynomial in local GC content to be used to regress LRR against GC: 2
Major transition phred-scaled likelihood: 65.00
Minor transition phred-scaled likelihood: 35.00
Autosomal telomeres phred-scaled likelihood: 20.00
Chromosome X telomeres phred-scaled likelihood: 8.00
Chromosome Y telomeres phred-scaled likelihood: 6.00
Uniform error phred-scaled likelihood: 15.00
Phase flip phred-scaled likelihood: 20.00
List of short arms: 13,14,15,21,22,chr13,chr14,chr15,chr21,chr22
Use variants in short arms: FALSE
Use variants in centromeres: FALSE
Use chromosomes without centromere rules: FALSE
Relative contribution from LRR for LRR+BAF model: 0.20
Difference between LRR for haploid and diploid: 0.69
Chromosome X XTR and nonPAR regions declared on different contigs: chrX:2699520-154930289 X_nonpar

Surely a mismatch in references preparation but I don't know where to look to fix this.

Many thanks for your help and an happy new year.

Regards.

Emmanuel

My command is
bcftools +mocha
--genome $assembly
--input-stats $tsv
--no-version
--output -
--output-type b
--variants ^$dir/$pfx.xcl.bcf
--calls $dir/$pfx.calls.tsv
--stats $dir/$pfx.stats.tsv
--ucsc-bed $dir/$pfx.ucsc.bed
--cnp $cnp
--mhc $mhc_reg
--kir $kir_reg
$dir/$pfx.bcf |
tee $dir/$pfx.as.bcf |
bcftools index --force --output $dir/$pfx.as.bcf.csi

bcftools +mocha generates an empty file.as.bcf

Dear Sir,

when runnnig the command in paragraph 'Call mosaic chromosomal alterations with MoChA'
the final as.bcf file has a header only, no record.
the variant and input bcf share the same position naming convention and have SNPs in common.
All the preceding preparation steps generate non empty files.

Part of the stderr below:
Estimated 0 sample(s) of unknown gender, 35 male(s) and 136 female(s)
Using 1094665 out of 1952213 read variants from contig 1
Written 0 variants for contig 1
Using 1199026 out of 2012443 read variants from contig 2
Written 0 variants for contig 2
Using 1008329 out of 1639568 read variants from contig 3
Written 0 variants for contig 3
Using 1020835 out of 1693554 read variants from contig 4
Written 0 variants for contig 4
Using 904851 out of 1497824 read variants from contig 5
Written 0 variants for contig 5
Using 846007 out of 1523692 read variants from contig 6
Written 0 variants for contig 6
Using 788063 out of 1416287 read variants from contig 7
Written 0 variants for contig 7
Using 783002 out of 1293342 read variants from contig 8
Written 0 variants for contig 8
Using 604281 out of 1103144 read variants from contig 9
Written 0 variants for contig 9
Using 699756 out of 1204898 read variants from contig 10
Written 0 variants for contig 10
Using 687167 out of 1157117 read variants from contig 11
Written 0 variants for contig 11
Using 654055 out of 1130795 read variants from contig 12
Written 0 variants for contig 12
Using 520295 out of 844782 read variants from contig 13
Written 0 variants for contig 13
Using 454277 out of 789672 read variants from contig 14
Written 0 variants for contig 14
Using 401875 out of 738255 read variants from contig 15
Written 0 variants for contig 15
Using 433851 out of 801022 read variants from contig 16
Written 0 variants for contig 16
Using 369430 out of 709297 read variants from contig 17
Written 0 variants for contig 17
Using 404774 out of 662388 read variants from contig 18
Written 0 variants for contig 18
Using 279673 out of 607853 read variants from contig 19
Written 0 variants for contig 19
Using 314723 out of 528443 read variants from contig 20
Written 0 variants for contig 20
Using 188329 out of 380762 read variants from contig 21
Written 0 variants for contig 21
Using 181159 out of 360997 read variants from contig 22
Written 0 variants for contig 22
Using 434476 out of 942713 read variants from contig X
Written 0 variants for contig X
Using 56979 out of 56979 read variants from contig Y
Written 0 variants for contig Y
Using 801 out of 801 read variants from contig MT
Written 0 variants for contig MT
Adjusted LRR at CNP losses: -0.5832
Adjusted LRR at CNP gains: 0.3853
Adjusted LRR at rare losses: -0.6343
Adjusted LRR at rare gains: 0.4073

Thanks for your support.

List of segmental duplications

When creating the list of segmental duplications, it looks like the resulting file has a column missing (when following your instructions).

The input file downloaded at this step looks fine (genomicSuperDups.bed).

$ head -n 3 genomicSuperDups.bed 
chr1	10000	87112	0.00713299
chr1	10000	20818	0.0186603
chr1	10000	19844	0.0173215

It looks like I'm losing the last column at the cut part of your command (in the README file).

$ awk '{print $1,$2; print $1,$3}' genomicSuperDups.bed | \
>  sort -k1,1 -k2,2n | uniq | \
>  awk 'chrom==$1 {print chrom"\t"pos"\t"$2} {chrom=$1; pos=$2}' | \
>  bedtools intersect -a genomicSuperDups.bed -b - | \
>  bedtools sort | \
>  bedtools groupby -c 4 -o min | \
>  awk 'BEGIN {i=0; s[0]="+"; s[1]="-"} {if ($4!=x) i=(i+1)%2; x=$4; print $0"\t0\t"s[i]}' | \
>  bedtools merge -s -c 4 -o distinct | \
>  cut -f1-3,5 | head -n 3
chr1	10000	10485
chr1	10485	18392
chr1	18392	87112

Is it possible that order of the fields might have changed with newer versions of bedtools (at the merge part of the command)? Removing the cut command gives me what looks like the proper content.

$ zcat dup.grch37.bed.gz | head -n 3
1	10000	10485	0.00713299
1	10485	18392	0.00579252
1	18392	87112	0.00457824

I'm using bedtools version 2.27.1 (so it shouldn't be affected by the groupby bug).

Question compiling with Homebrew GCC 10.2.0_4

Hey Giulio, have you even seen this gcc compile issue before?

plugins/mocha.c:718:36: error: function definition is not allowed here
    double f(double x, void *data) { return -baf_phase_lod(baf, gt_phase, n, imap, path, err_log_prb, baf_sd, x); }
                                   ^
plugins/mocha.c:719:31: error: use of undeclared identifier 'f'
    double x, fx = kmin_brent(f, 0.1, 0.2, NULL, KMIN_EPS, &x);

Could not parse gender (0/1/2) in the sample statistics file

Dear developer,

I ran into this issue when running the mCA detection pipeline. If the coding for gender is 0/1/2 (in the sample statistics file), I will get the following error "Could not parse gender 1 xxxx(call rate)". I've tried to convert it to "M/F/U" (but the coding for gender is still 0/1/2 in the $sex file), the analysis can be completed but the "baf_auto" in the stats file are NaN for all samples.
My question are,
(1) What are the potential causes of the NaN values? (I suspect the "NaN" might be due to the coding for the gender).
(2) Do you have any suggestions to overcome this problem?
Thank you for your help in advance!!

image

Imputation Error - terminate called after throwing an instance of std::length_error

I am running the phasing pipeline on a VCF of 5 samples and running into some errors with the imputation step for chromosomes 1-4.

The imputation output for chromosome 1 is here:

Recombination rates given by genetic map
________________________________

Input read              [123.639s]
    Imputation region           [10177-249240543] [249.23Mb, 293.40cM]
    Markers     (ref/both/targ) [3740965/506109/0]
    Samples     (ref/targ)      [2504/5]
    Ploidy      (ref/targ)      [N_rd=2504, N_rh=0] [N_td=5, N_th=0]

    Buffer region               [10177-249240543] [249.23Mb, 293.40cM]
    Markers     (ref/both/targ) [3740965/506109/0]

________________________________

PBWT select             [10.89s]
    States      (avg/max)       [4817/4836]

ESC[33mWARNING: ESC[0mPossible performance decrease: one or more haplotypes have a large number of copying states [4836 > 2000].
While this could increase imputation accuracy, consider using smaller imputation regions to increase IMPUTE5 time/memory performances.
________________________________

  * HMM imputation [0%]^Mterminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
/var/spool/slurmd/job6647381/slurm_script: line 88:  3635 Aborted                 (core dumped) impute5 --h $panel_pfx${chr}$panel_sfx.bcf --m $dir/genetic_map.chr$chr.txt --g $dir/$pfx.pgt.bcf --r $chr --o $dir/$pfx.chr$chr.imp.bcf --l $dir/$pfx.chr$chr.log --threads $thr

Firstly, it gives a warning about the fact that one of the haplotypes have a large number of copying states, which could lead to worse time/memory performances. Right now, this pipeline is burning through a lot of resources, so if you could suggest how I modify your code to tackle this warning and issue, it would be helpful.

But then, more importantly, as soon as the HMM imputation starts, it throws a std::length_error. Why might this error be taking place, and what can I do to fix it?

Thank you for your help.

Can't find file to patch at input line 3

Hi,

I've tried to run this command:
cd bcftools && patch < Makefile.patch && patch < main.patch && patch < vcfnorm.patch && cd ..

And I got this message:

can't find file to patch at input line 3
Perhaps you should have used the -p or --strip option?
The text leading up to this was:

|--- Makefile 2018-08-26 12:00:00.000000000 -0500
|+++ Makefile 2018-08-26 12:00:00.000000000 -0500

File to patch:

Thanks.

No releases

Commit messages contain phrases like new release or new version, but there are no versioned releases/tags for this repo. That makes it hard to create a reproducible deployment for reproducible science...

Unable to infer the A and B alleles while parsing the site...

I have a batch of VCF files from an array that I am trying to add ALLELE_A and ALLELE_B into to be able to run them through MoChA. I used the mochatools command shown below to do so:

bcftools +mochatools $input -- -t ALLELE_A,ALLELE_B,GC -f $reference > $output

I am getting the error: Unable to infer the A and B alleles while parsing the site: for all non 0/0 sites.
Can you please offer some advice on why this might be the case and how to fix it?

P.S. Not sure if this would have anything to do with it, but the VCFs were pre-generated by the org that provides our dataset, but we had to add in the LRR and BAF fields manually afterwards.

FreeBSD port available

FYI:

bio-mocha has been committed to the FreeBSD ports collection.
It might be helpful to users if you could post a message like
the following on your website:

Thanks!

bio-mocha can be installed on FreeBSD via the FreeBSD ports system.

To install via the binary package, simply run:

pkg install bio-mocha

This will very quickly install a prebuilt binary using only highly-portable
optimizations, much like apt, yum, etc.

FreeBSD ports can just as easily be built and installed from source,
although it will take longer (for the computer, not for you):

cd /usr/ports/biology/bio-mocha
make install

Building from source allows installing to a different prefix, compiling with
native optimizations, and in some cases, building with non-default options
such as different compilers or dependencies. For example, adding

CFLAGS+=-march=native

to /etc/make.conf will cause ports built from source to use all native
optimizations known to the compiler for the local CPU, resulting in faster
but less portable binaries.

To report issues with a FreeBSD port, please submit a PR at:

https://www.freebsd.org/support/bugreports.html

For more information, visit https://www.freebsd.org/ports/index.html.

Error [E::bgzf_read_block] Invalid BGZF header at offset 85619

Dear Giulio,

When I run this command from the Mocha page on my data:

bcftools isec --no-version -Ou --complement --exclude "N_ALT>1" --write 1 $dir/$pfx.unphased.bcf $dir/$pfx.xcl.bcf | \
  bcftools annotate --no-version -Ou --remove ID,QUAL,INFO,^FMT/GT  | \
  bcftools +scatter --no-version -Ob --output $dir --scatter $(echo {{1..22},X} | tr ' ' ',') --prefix $pfx

I get these errors:

[E::bgzf_read_block] Invalid BGZF header at offset 85619
[E::bgzf_read] Read block operation failed with error 2 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 102544
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 102544
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 119330
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 134975
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 134975
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 152196
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 152196
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 169203
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 169203
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 185200
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 185200
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 201927
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 201927
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 201927
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 218412
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes
[E::bgzf_read_block] Invalid BGZF header at offset 218412
[E::bgzf_read] Read block operation failed with error 3 after 0 of 32 bytes

We use genome build 37 for our project and the previous steps succeeded without errors.

Do you know why I get this error?

Kind regards,
Priscilla

check bpm or not

Hi, Giulio, I have a question regarding the "mocha.do_not_check_bpm": true in the MoCha configuration.
Should I change it to false if I use a more recent Illumina array, say Infinium Global Screening Array?
It will be great if you can share some insights into the logic behind this parameter.

Thanks

mocha on sequencing vcf

Hi,
I have a question about input. Mocha is described as running off of VCFs with either:
phased genotypes (GT) and B-allele frequency (BAF) and Log R Ratio intensity (LRR)
or
phased genotypes (GT) and allelic depth coverage (AD) (for sequencing WGS)

When I run on a prepped BCF file with lines like the line below (the position below is only an example and may not match the ref base) :
chr1 817001 . G A . PASS GC=0.50143;CpG=0.06392;AC_Het=1 GT:AD 0|1:18,20

I get this error
Error: input VCF missing the ALLELE_A info field. Use bcftools +mochatools -- --infer-BAF-alleles to infer ALLELE_A

when I run :

bcftools +mocha \
--rules $rule \
--no-version \
--output-type b \
--output ${dir}/${pfx}.mocha.bcf \
--threads $thr \
--variants ^${dir}/${pfx}.xcl.bcf \
--mosaic-calls ${dir}/${pfx}.mocha.tsv \
--genome-stats ${dir}/${pfx}.stats.tsv \
--ucsc-bed ${dir}/${pfx}.ucsc.bed \
--cnp ${cnp} \
 ${dir}/${pfx}.bcf

What is the cause of this message if the BCF file already includes the GT and AD?

Thanks

impute5

Hi,

I understand this might not be the best place to ask about impute5, but I am wondering if you have any clue how the error below could be resolved?

I was trying to run the imputation part of the pipeline, i.e.,

for chr in {1..22} X; do
  zcat $map | sed 's/^23/X/' | awk -v chr=$chr '$1==chr {print chr,".",$4,$2}' > $dir/genetic_map.chr$chr.txt
  impute5 \
    --h $panel_pfx${chr}$panel_sfx.bcf \
    --m $dir/genetic_map.chr$chr.txt \
    --g $dir/$pfx.pgt.bcf \
    --r chr$chr \
    --o $dir/$pfx.chr$chr.imp.bcf \
    --l $dir/$pfx.chr$chr.log \
    --threads $thr
done

And had the following error for every chromosome:

IMPUTE v5
        Simone Rubinacci and Jonathan Marchini, 2020

Version: 1.1.4
Started: 04/01/2022 - 16:00:23
________________________________

Options:
--b 250
--g 5219.pgt.bcf
--h ALL.chr6.phase3_integrated.20130502.genotypes.bcf
--l 5219.chr6.log
--m genetic_map.chr6.txt
--ne 20000
--o 5219.chr6.imp.bcf
--pbwt-cm 0.02
--pbwt-depth 4
--r chr6
--threads 4

ERROR: No in common between reference and target panels.

Yifan

Issues with the number of heterozygous sites

Hi, we have encountered some problems with counting the number of heterozygous sites at the final step of MoChA (Version 1.14-20220112).

Take chr1 as an example, when we combined all samples (~450,000) in one bcf file for analysis, only about ten heterozygous sites can be counted (as shown in the n_hets column) and used for the following analyses. However, the number of n_hets increases greatly when we split the combined bcf file and included only 10,000 samples or even 100 samples for analysis.

I cannot figure out the problem, and there is no error message. Could you please give me some suggestions for dealing with this?

Thanks.

Can I Use CEL Files from Axiom Precision Medicine Research Array (PMRA) Release 3 to Generate CHP Files and Obtain VCF Files Annotated with GC, BAF, and LRR Content?

Hello,

I am a beginner in the field of mosaic chromosome alterations and I'm trying to use the Mocha pipeline with original CEL files sequenced in Axiom Precision Medicine Research Array (PMRA) release 3 to generate CHP files. According to the pipeline documentation, for Affymetrix data, a CSV manifest file is required, and for CEL files, an XML option file and a ZIP file containing all files listed within the XML option file are needed. The documentation also mentions that only analysis types AxiomGT1 and birdseed are supported, and Affymetrix arrays before the Genome-Wide Human SNP Array 6.0 are not supported.

My question is, can I input CEL files from Axiom PMRA release 3 to generate CHP files and obtain VCF files annotated with GC content, BAF, and LRR?

If this is not possible, I have three separate files containing genotype data, GC content for variants, and BAF and LRR data for variants and individuals. Do you have any suggestions or scripts for merging these files to annotate the genotype data with GC content, BAF, and LRR to generate a VCF file for use with Mocha?

I apologize if my questions seem naive; I'm still learning about this field.

Thank you for your assistance.

The sequence "hs37d5" not found and "No BGZF EOF marker" errors

I have a single test VCF file from WGS that I am running through your data preparation pipeline. The original BAM was aligned on hs37d5 (an alternate form/version of GRCh37 from what I understand), and hence the VCF is also aligned to that version.

When running this first command with the VCF and the input files you indicated/provided for GRCh37 in the following command to "Create the minimal binary VCF":

bcftools view --no-version -h $vcf | sed 's/^\(##FORMAT=<ID=AD,Number=\)\./\1R/' | \
        bcftools reheader -h /dev/stdin $vcf | \
        bcftools filter --no-version -Ou -e "FMT/DP<10 | FMT/GQ<20" --set-GT . | \
        bcftools annotate --no-version -Ou -x ID,QUAL,^INFO/GC,^FMT/GT,^FMT/AD | \
        bcftools norm --no-version -Ou -m -any --keep-sum AD | \
        bcftools norm --no-version -Ob -f $ref | \
        tee $dir/$pfx.unphased.bcf | \
        bcftools index --force --output $dir/$pfx.unphased.bcf.csi

gives the errors:

[E::faidx_adjust_position] The sequence "hs37d5" was not found
faidx_fetch_seq failed at hs37d5:91
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
index: failed to create index for "-"

Subsequently, the following command to "generate list of variants to be excluded from modelling by both eagle and mocha":

awk -F"\t" '$2<.97 {print $1}' $crt > samples_xcl_list.txt; \
echo '##INFO=<ID=JK,Number=1,Type=Float,Description="Jukes Cantor">' | \
        bcftools annotate --no-version -Ou -a $dup -c CHROM,FROM,TO,JK -h /dev/stdin $dir/$pfx.unphased.bcf | \
        bcftools view --no-version -Ou -S ^samples_xcl_list.txt | \
        bcftools +fill-tags --no-version -Ou -t ^Y,MT,chrY,chrM -- -t ExcHet,F_MISSING | \
        bcftools view --no-version -Ou -G | \
        bcftools annotate --no-version -Ob \
                -i 'FILTER!="." && FILTER!="PASS" || INFO/JK<.02 || INFO/ExcHet<1e-6 || INFO/F_MISSING>1-.97' \
                -x ^INFO/JK,^INFO/ExcHet,^INFO/F_MISSING | \
        tee $dir/$pfx.xcl.bcf | \
        bcftools index --force --output $dir/$pfx.xcl.bcf.csi; \
/bin/rm samples_xcl_list.txt

gives the error:
[W::bcf_sr_add_reader] No BGZF EOF marker; file '/home/adarbar/projects/rrg-rauhm/adarbar/MoChA_test/MoChA_Run/MoChA_output/008ca9e31add6629e40396fdf930a2b7.MoChA.unphased.bcf' may be truncated

It seems that this also stops it from generating the index files, preventing Phasing from being able to run.

Can you shed some light on what the problem here might be, and what I can do to fix it?

Questions about how to filter callset

Hi! I have two more questions about the filtering strategies for the mCA callset,

(1) Based on README.md, the mLOX events were not required to be identified as a loss by MoChA. I just want to confirm that whether it is the same for mLOY events. p.s.: I assume inference of mLOY is independent of the event type in the output file given either mLOY is inferred based on LRR median statistics (as what has been used by MoChA to infer event types).

(2) When generating the list of samples with mCAs, one mCA phenotype to be generated is the $pfx.cll.lines file by using
cat $pfx.{{3,4,6,8,11,17,18}p_loss,{1,6,11,13,14,16,17,22}q_loss,13q_cnloh,{2,3,4,5,12,17,18,19}_gain}.lines | \ awk -F"\t" -v OFS="\t" 'NR==FNR {x[$1]++} NR>FNR && $1 in x {print $1}' - $pfx.stats.tsv > $pfx.cll.lines
Since not all events on p/q arm are summarized, I am wondering whether it just summarises the events enriched in samples with CLL (chronic lymphocytic leukaemia).

Thank you again for taking the time to answer my questions!!

Missing GC info field

Hi All,

I am trying to run mocha, but I got the following error.

Error: input VCF has no GC info field: use "--LRR-GC-order 0/-1" to disable LRR adjustment through GC correction

I can find GC info in the $pfx.unphased.bcf, but not $pfx.bcf.... (after phasing by Eagle)

Could you help me?

GC content in VCF file

Forgive me for the naïve question, but I am new to genetic analysis and struggling with one of the requirements of this pipeline.

I have WGS data in BAM formats that I am currently running through GATK's HaplotypeCaller in order to get the VCF file. However, upon inspection of this file, while I have all the other requirements, I don't seem to have the GC content.

I have tried searching it up online but wasn't able to find any information on this.
As I understand, the original, unphased VCF file that runs through the remainder of your tutorial should have this information, so do you have any recommendations?

Why HMM?

Hi, I read both papers referenced in README.md in the default branch however I could not find the justification for choosing Hidden Markov Model (HMM) for this modeling. Although you provide proof of its performance and its downsides (alongside what you did to correct for these issues), would be great to know why HMM.

Is there any reason or was it just due to previous publication (such as these:

There are a few differences between MoChA and the HMM model used in Loh et al. 2018, Thompson et al. 2019, Loh et al. 2020, and Terao et al. 2020.

)

(This is not a critic, just curiosity)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.