genepi / imputationserver

Michigan Imputation Server: A new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity

Home Page: https://imputationserver.sph.umich.edu/

License: GNU Affero General Public License v3.0

Java 100.00%
imputation gwas software-as-a-service minimac4

imputationserver's Introduction


This repository includes the complete source code for the Michigan Imputation Server workflow based on Minimac4. The workflow itself is executed with the Cloudgene workflow system for Hadoop MapReduce.

Michigan Imputation Server consists of several parallelized pipeline steps:

  • Quality Control
  • QC Report
  • Phasing and Imputation
  • Compression and Encryption

Documentation

The documentation is available at http://imputationserver.readthedocs.io.

Citation

Please cite this paper if you use Michigan Imputation Server:

Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, Vrieze S, Chew EY, Levy S, McGue M, Schlessinger D, Stambolian D, Loh PR, Iacono WG, Swaroop A, Scott LJ, Cucca F, Kronenberg F, Boehnke M, Abecasis GR, Fuchsberger C. Next-generation genotype imputation service and methods. Nature Genetics 48, 1284–1287 (2016).

Contact

Feel free to contact us in case of any problems.

Contributors

  • Lukas Forer
  • Sebastian Schönherr
  • Sayantan Das
  • Christian Fuchsberger

Contributing

Project contributions are more than welcome! See our CONTRIBUTING.md file for details.

imputationserver's People

Contributors: agrueneberg, cfuchsberger, dependabot[bot], getconor, jdpleiness, lukfor, seppinho

imputationserver's Issues

hg38 reference: strand flips

Hi.

I am trying to use the MIS Minimac4 web server to impute samples with hg38-aligned variants. I have double-checked that there are no strand flips; however, I keep getting the error that more than 100 obvious strand flips were detected. The QC output is below. Is there another way to check for strand flips so I can complete imputation? Any help would be appreciated.

Quality Control
Uploaded data is hg38 and reference is hg19.

Lift Over

Skip allele frequency check.

Calculating QC Statistics

Statistics:
Alternative allele frequency > 0.5 sites: 0
Reference Overlap: 92.71 %
Match: 4,849,194
Allele switch: 1,205,271
Strand flip: 74
Strand flip and allele switch: 52
A/T, C/G genotypes: 850,564
Filtered sites:
Filter flag set: 0
Invalid alleles: 0
Multiallelic sites: 0
Duplicated sites: 657
NonSNP sites: 0
Monomorphic sites: 136
Allele mismatch: 42,917
SNPs call rate < 90%: 90,636

Excluded sites in total: 134,472
Remaining sites in total: 6,814,393
See snps-excluded.txt for details
Typed only sites: 546,309
See typed-only.txt for details

Warning: 1 Chunk(s) excluded: < 3 SNPs (see chunks-excluded.txt for details).
Warning: 2 Chunk(s) excluded: reference overlap < 50.0% (see chunks-excluded.txt for details).
Remaining chunk(s): 161
Error: More than 100 obvious strand flips have been detected. Please check strand. Imputation cannot be started!

Thanks,
Meghana
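The QC categories above (match, allele switch, strand flip, strand flip and allele switch) can be reproduced outside the server for debugging. The sketch below is not the server's code, just the standard comparison logic for biallelic SNPs; note that for A/T and C/G sites the strand complement is indistinguishable from an allele switch, which is why those are counted as their own category in the report.

```python
# Sketch of the standard allele-comparison logic behind the QC counts
# above. Illustrative only -- not the imputation server's actual code.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def classify_site(study_ref, study_alt, panel_ref, panel_alt):
    """Classify a biallelic SNP against the reference panel alleles."""
    flipped = (COMPLEMENT[study_ref], COMPLEMENT[study_alt])
    if (study_ref, study_alt) == (panel_ref, panel_alt):
        return "match"
    if (study_ref, study_alt) == (panel_alt, panel_ref):
        return "allele switch"
    if flipped == (panel_ref, panel_alt):
        return "strand flip"
    if flipped == (panel_alt, panel_ref):
        return "strand flip and allele switch"
    return "allele mismatch"
```

Running this over the study VCF against the panel's site list would show which sites drive the strand-flip count.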

MIS not generating the additional EmpiricalDose file

The additional EmpiricalDose file that should be generated by the Minimac 4 command called in the MIS pipeline (--meta) is missing. This file is required for meta-imputation downstream.

This is a known issue with some versions of the Minimac 4 binaries, discussed in their GitHub repository, so I assume one of those versions is the one deployed in the MIS.

Execution failed. Unable to find resource '/data/apps/imputationserver/minimac4.yaml'

I get the error below when running the test example in Docker. Any suggestions?

2018-03-10 23:53:25,764 [pool-3-thread-2] INFO  cloudgene.mapred.jobs.CloudgeneJob - Installation of application hapmap2 finished.
2018-03-10 23:53:25,764 [pool-3-thread-2] INFO  cloudgene.mapred.jobs.AbstractJob - Job job-20180310-235325-575: executing setups...
2018-03-10 23:53:25,765 [pool-3-thread-2] ERROR cloudgene.mapred.jobs.AbstractJob - Job job-20180310-235325-575: execution failed. Unable to find resource '/data/apps/imputationserver/minimac4.yaml'
2018-03-10 23:53:25,794 [pool-3-thread-2] INFO  cloudgene.mapred.jobs.AbstractJob - Job job-20180310-235325-575: cleanup successful.
2018-03-10 23:53:25,794 [pool-3-thread-2] INFO  cloudgene.mapred.jobs.WorkflowEngine - Input Validation for job job-20180310-235325-575 finished. Result: false

Installing reference panel on the command line

I'm working on a secure platform and it doesn't have a web browser so I'm not able to use the recommended method of using the admin panel on the web server. Is it possible to install reference panels on the command line? I looked at the options and the install option seems to only be for github repositories. What can I do?

Problem of .dose.vcf.gz files as input for bcftools for annotation purposes

Hi,

I am trying to use bcftools to annotate chromosome:position IDs with rsids. Normally this code should work with .vcf.gz files, but it doesn't, and reports the following error: "Failed to open -: not compressed with bgzip". Has this been encountered previously with Michigan-imputed files? I would appreciate your advice!

bcftools annotate
--output-type v
--remove ID
--set-id +'%CHROM:%POS:%REF:%ALT'
chr21.dose.vcf.gz
| bcftools annotate
--annotations All_20170710.vcf.gz
--columns ID
--output-type v
| plink
--vcf /dev/stdin
--keep-allele-order
--double-id
--make-bed
--out chr21

Thanks
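The "not compressed with bgzip" error usually means the file was re-compressed with plain gzip somewhere along the way (e.g. by a browser or transfer tool). htslib/bcftools requires BGZF, which is gzip with an extra 'BC' subfield in each block header. A quick check, sketched under the assumption that the 'BC' subfield comes first in the header, as htslib writes it:

```python
def is_bgzf(path):
    """Return True if the file starts with a BGZF block header.

    BGZF (what bcftools/htslib expects) is gzip with an extra 'BC'
    subfield; plain gzip lacks it, which triggers the
    "not compressed with bgzip" error. Assumes the 'BC' subfield is
    first in the extra field, as htslib-produced files have it.
    """
    with open(path, "rb") as fh:
        head = fh.read(18)
    return (
        len(head) >= 18
        and head[:2] == b"\x1f\x8b"   # gzip magic bytes
        and head[3] & 4 != 0          # FEXTRA flag set
        and head[12:14] == b"BC"      # BGZF subfield identifier
    )
```

If it returns False, recompressing with `zcat chr21.dose.vcf.gz | bgzip -c > chr21.dose.bgzf.vcf.gz` (and re-indexing with tabix) normally resolves the bcftools error.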

Compile Failure

On the imputation server docker,

I tried to compile this project with maven, but it failed.

I haven't edited any code in the project, but, it doesn't work.

It failed in the test course, failures are following.

Results:

Failed tests:
  testPipelineWithEagle(genepi.imputationserver.steps.ImputationTest)
  testCompareInfoAndDosageSize(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithEaglePhasingOnly(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithPhasedHg19ToHg38(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithEagleHg19ToHg38(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithSFTP(genepi.imputationserver.steps.ImputationTest): expected:<true> but was:<false>
  testPipelineWithPhasedAndNoPhasingSelected(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithPhased(genepi.imputationserver.steps.ImputationTest)
  testWriteTypedSitesOnly(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithPhasedHg38ToHg19(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithEagleHg38ToHg19(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithEagleHg38ToHg38(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithEagleAndR2Filter(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithHttpUrl(genepi.imputationserver.steps.ImputationTest)
  testPipelineWithPhasedAndEmptyPhasing(genepi.imputationserver.steps.ImputationTest)

Tests run: 65, Failures: 15, Errors: 0, Skipped: 0

Chunk exclusion: at least one sample has a call rate < 90.0%

I get a warning that 41 chunk(s) were excluded because at least one sample has a call rate < 90.0%.
But when I check the pipeline, it says "Chunk exclusion: if (#variants < 3 || overlap < 50% || sampleCallRate < 50%)". Is this just a documentation typo, i.e. is the 50% supposed to be 90.0%?

Thanks!

Typed only sites

Hi, thanks for providing the TOPMed imputation service.

In the QC of the .vcf file I uploaded, I found that some SNPs are labeled as typed-only sites. What does that mean exactly? Does that indicate some kind of error/mal-format of my input file? Thanks!

API: Add Rsq filter

Hi,
I was wondering whether it would be possible to add the minimum Rsq filter found in the pretty WebUI option to the API? This would cut down some major downstream analysis time and storage space for us, since we are mainly interested in higher confidence imputed variants.

"This reference panel doesn't support chromosome..."

Dear fellows,

Running the docker image of genepi/imputationserver:latest, which works beautifully with 1000G or HapMap.

However, with our custom panel, created using 1000G as a template, we get the following in the attached log (job-20210224-153346-657_log.txt):

Task 'Calculating QC Statistics' failed.
Exception: java.lang.InterruptedException: This reference panel doesn't support chromosome 22.

Happy to share headers of the reference files if you think they may help get to the bottom of this.

Help is much appreciated, thank you!
Anca.

Allele mismatch, where reference genome has REF = ALT. How to handle this case?

My VCF file contains
1 1116188 rs140187110 C T . . . GT 0/0

However, this SNP gets excluded. snps-excluded.txt:

"1:1116188:C:T Allele mismatch Ref:C/C"

If I adjust my VCF to match the reference, like this:

1 1116188 rs140187110 C C . . . GT 0/0

then the job fails early, saying:

The provided VCF file is malformed at variation rs140187110: reference allele (C) and alternate allele (C) are the same. (see Help).

What is the suggested fix in this case if I want to minimize as much filtering as possible (under the assumption that imputation works best if we have as much usable information as possible)? Or should I not restrict filtering, but instead allow these rows to be removed then re-insert them later manually? Is there any edge case where this workaround could cause any conflicts in the imputed data?

Imputation quality score (rsq_hat)

Hi,

Thanks for setting up the TOPMed imputation server.

Because of the limitation in sample size (n=25K), we are imputing our large cohort in 2 batches but intend to merge the datasets post-imputation. We would like to calculate an overall imputation quality score (rsq_hat) and we have been advised to use the following code:

https://github.com/statgen/Minimac4/blob/5e6f3cc91d166cd2298c296e46c9f428e6e0f3aa/src/ImputationStatistics.cpp#L50-L63
https://github.com/statgen/r2-estimator/blob/7e2162a0e9db6e2b56d0e036cfbc43b58a977ef6/src/main.cpp#L211-L223

However, when we compared the imputation rsq provided by the TOPMed imputation server vs. the rsq we calculated (on the same dataset, using the code above), we noted an inflation (improvement) of the quality metric, but only for the rare variants. This suggests that the rsq calculation implemented in the TOPMed imputation server handles rare variants somewhat differently than the code you shared. Is that possible?

We would be curious to have your thoughts.

Best,
Guillaume
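For comparison, the textbook form of the estimator in those links — observed dosage variance over the variance expected at the estimated allele frequency — fits in a few lines. This is a hedged sketch of the general formula, not a claim about what the TOPMed server actually runs (which is exactly what the question above is probing, especially for rare variants):

```python
def rsq_hat(dosages):
    """Naive imputation-quality estimate: Var(dosage) / (2 * p * (1 - p)),
    where p is the estimated ALT allele frequency (mean dosage / 2) and
    dosages are diploid expected counts in [0, 2]. Monomorphic sites are
    reported as 0.0 by convention here."""
    n = len(dosages)
    p = sum(dosages) / (2 * n)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    mean = 2 * p
    variance = sum((d - mean) ** 2 for d in dosages) / n
    return variance / (2 * p * (1 - p))
```

Differences between this simple ratio and the server's reported Rsq for rare variants could come from estimator details (e.g. how haplotype-level dosages or small counts are handled), which the maintainers would have to confirm.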

Michigan Imputation Server (minimac3 Vs minimac4)

Dear developers,

Lately on the Michigan Imputation Server I have been using minimac4 instead of minimac3. As a test, I imputed one chromosome (e.g. 21) with both, using the same VCF file.

After imputation I am a bit confused because the results differ when I compare the two info files (attached): globally, the numbers of variants passing the Rsq (0.3 or 0.8) and MAF (0.01 or 0.05) filters are different.
3C_chr21_HRC_minimac3_test1.info.gz
3C_chr21_HRC_minimac4_test1.info.gz


info.3C_chr21_hrc_minimac3_test1<-fread("3C_chr21_HRC_minimac3_test1.info.gz")
info.3C_chr21_hrc_minimac4_test1<-fread("3C_chr21_HRC_minimac4_test1.info.gz")

sum(info.3C_chr21_hrc_minimac3_test1$Rsq>=0.8)
[1] 143924
sum(info.3C_chr21_hrc_minimac4_test1$Rsq>=0.8)
[1] 145839
sum(info.3C_chr21_hrc_minimac3_test1$Rsq>=0.3)
[1] 249565
sum(info.3C_chr21_hrc_minimac4_test1$Rsq>=0.3)
[1] 250199
sum(info.3C_chr21_hrc_minimac3_test1$MAF>=0.01)
[1] 108967
sum(info.3C_chr21_hrc_minimac4_test1$MAF>=0.01)
[1] 109515
sum(info.3C_chr21_hrc_minimac3_test1$MAF>=0.05)
[1] 79217
sum(info.3C_chr21_hrc_minimac4_test1$MAF>=0.05)
[1] 79365
sum(info.3C_chr21_hrc_minimac3_test1$MAF>=0.05 & info.3C_chr21_hrc_minimac3_test1$Rsq>=0.8)
[1] 76157
sum(info.3C_chr21_hrc_minimac4_test1$MAF>=0.05 & info.3C_chr21_hrc_minimac4_test1$Rsq>=0.8)
[1] 74932
sum(info.3C_chr21_hrc_minimac3_test1$MAF>=0.01&info.3C_chr21_hrc_minimac3_test1$Rsq>=0.8)
[1] 96414
sum(info.3C_chr21_hrc_minimac4_test1$MAF>=0.01&info.3C_chr21_hrc_minimac4_test1$Rsq>=0.8)
[1] 94884

When I extended the imputation to all chromosomes, after filtering (MAF ≥ 0.05 & Rsq ≥ 0.8) fewer SNPs remained with minimac4 (~4.7 million) than with minimac3 (~5 million).

Can you provide any explanation for these differences?

Thank you in advance for all responses and suggestions.

Kind regards,

Takiy Berrandou

The aesEncryption flag does not work

When I start an imputation job on the command line it always tries to encrypt the files despite me explicitly setting the --aesEncryption flag to "no". I currently have to change the minimac4.yaml file and set the "temp" line to "false" in order to keep the output files after the job is done.

There is something that seems to override the aesEncryption flag, because the job.txt file says "AES 256 Encryption: yes" despite the flag being set to no. There is also a zip-related error message, shown in the attached screenshot; I don't know whether it's related.

Input genotypes are expect to come from array genotypes with no more than 20000 SNPs expected per chunk.

Hi,
I'm trying to impute HLA in TOPMed-imputed data using the Michigan Imputation Server, but the job fails stating "Calculating QC Statistics failed". Likewise, when I try to impute Michigan-Server-imputed data using the TOPMed Imputation Server, the job fails stating "Input genotypes are expect to come from array genotypes with no more than 20000 SNPs expected per chunk". In both cases I have specified the correct genome build. Could you let me know what might be causing this problem? I appreciate your help. Thanks in advance.

BK

test

I have cloned genepi/imputationserver and found there is no genepi.base package.

Missing existing variant data in the dosage file

Hi,
I imputed GWAS data on chr2 and found that a variant present in the unimputed file (position 73641201) is missing from the dosage file. I also can't find any information on that variant in the .info file. Any leads are much appreciated.

Many thanks!

Imputation of Chromosome X - Execution failed without specific error message

To whom it may concern,

I would like to impute Chromosome X data but the imputation process stopped without any specific error messages. I tried to run it again but got the same result. I also emailed the administrator twice but there's no response, so I post my question here. Any comments would be appreciated.

Details as below:

Input Validation
1 valid VCF file(s) found.

Samples: 2419
Chromosomes: 23
SNPs: 5694
Chunks: 8
Datatype: unphased
Build: hg19
Reference Panel: apps@1000g-phase-3-v5 (hg19)
Population: eas
Phasing: eagle
Mode: imputation


Quality Control
Calculating QC Statistics

Statistics:
Alternative allele frequency > 0.5 sites: 1,911
Reference Overlap: 100.00 %
Match: 5,694
Allele switch: 0
Strand flip: 0
Strand flip and allele switch: 0
A/T, C/G genotypes: 0
Filtered sites:
Filter flag set: 0
Invalid alleles: 0
Multiallelic sites: 0
Duplicated sites: 0
NonSNP sites: 0
Monomorphic sites: 0
Allele mismatch: 0
SNPs call rate < 90%: 0

Excluded sites in total: 0
Remaining sites in total: 5,694


Quality Control (Report)
Execution failed. Please contact the server administrators for help if you believe this job should have completed successfully.

Allele mismatch: "<" character is expected for the ALT allele ("Ref:T/<")

My VCF contains
1 81524136 rs72682552 T A . . . GT 0/0

However, this SNP gets excluded. snps-excluded.txt:
1:81524136:T:A Allele mismatch Ref:T/<

If I adjust my VCF to match the reference, like this:
1 81524136 rs72682552 T < . . . GT 0/0

then the SNP still gets excluded. snps-excluded.txt:

1:81524136:T:< Invalid Alleles

What is the suggested fix in this case if I want to minimize as much filtering as possible (under the assumption that imputation works best if we have as much usable information as possible)? Or should I not restrict filtering, but instead allow these rows to be removed then re-insert them later manually? Is there any edge case where this workaround could cause any conflicts in the imputed data?

Jobs do not start

I am currently trying to impute some VCF files, via the web interface, and everything works as intended, until the VCF file is uploaded.
The documentation says that “Input Validation and Quality Control are executed immediately to give you feedback about the data-format and its quality”.
However, for my jobs this is not the case: the site simply stays blank (see the attached screenshot).

I thought that maybe there are currently no resources available, but this job has now been waiting for quite a long time (over a day) with no change.
I am only writing because a couple of months ago I ran a job on the Michigan Imputation Server and everything worked flawlessly (I am using the same VCF files now), so I thought that maybe there is a problem on the server side of things.

Any ideas what might be the problem here?
Any help is much appreciated.

A/T C/G strand problem

My genotype data have some strand problems (A/T, G/C sites). I found the problem was resolved after the imputation steps (QC + phasing + imputation). Could you please tell me which step resolved it, and how? Thanks!
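For A/T and C/G SNPs the two strands carry the same allele pair, so strand cannot be decided from the alleles alone; servers typically resolve (or drop) these during QC by comparing the study allele frequency to the panel frequency — note the "Skip allele frequency check" switch in the QC logs above. A sketch of that common heuristic, with illustrative thresholds that are assumptions, not the server's actual settings:

```python
def atcg_strand_ok(study_af, panel_af, max_diff=0.2):
    """Heuristic strand check for A/T and C/G SNPs, where allele codes
    cannot distinguish strands. Returns 'keep', 'flip', or 'drop'.
    max_diff is an illustrative threshold, not the server's setting."""
    if abs(study_af - panel_af) <= max_diff:
        return "keep"   # frequencies agree: assume same strand
    if abs(study_af - (1 - panel_af)) <= max_diff:
        return "flip"   # mirrored frequency: assume opposite strand
    return "drop"       # ambiguous (e.g. near 0.5) or mismatched
```

Sites with frequencies near 0.5 cannot be resolved either way, which is why ambiguous A/T, C/G genotypes are reported as their own QC category.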

Not receiving a password for my imputation job

Dear developer,

I did not receive a password for my imputation job. I double-checked my email spam folder; it's not there either. In fact, I have only received emails when a job failed, never when a job succeeded. Please advise.

Thank you!

Imputation X-chromosome

I'm trying to impute chromosome X using the private instance on Docker, but I get an error in the Pre-phasing and Imputation phase.
After the QC report is created, the job stops.
In the logs, I find the following error:
[ERROR] Minimac reference panel cloudgene/apps/[email protected]/2.0.0/m3vcfs/X.nonPAR.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz not found.
Job chr_X.nonPAR (null) failed.
Apparently, it can't find the specific reference file.

ERROR: Missing required option: population

No combination of suggested population parameters executes.

sudo docker exec -t -i impute cloudgene run [email protected] --files https://imputationserver.sph.umich.edu/static/downloads/hapmap300.chr1.recode.vcf.gz --refpanel apps@hapmap2 
Cloudgene 2.0.3
http://www.cloudgene.io
(c) 2009-2019 Lukas Forer and Sebastian Schoenherr
Built by null on null
Built by null on null

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudgene/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

Genotype Imputation (Minimac4) 1.2.4
https://imputationserver.readthedocs.io


ERROR: Missing required option: population

usage: input parameters:
    --aesEncryption <checkbox>   AES 256 encryption
                                 (default: no)
    --build <build>              Array Build
                                 hg38: GRCh38/hg38
                                 hg19: GRCh37/hg19
                                 (default: hg19)
    --conf <arg>                 Hadoop configuration folder
    --files <local_folder>       Input Files (<a
                                 href="http://www.1000genomes.org/wiki/Ana
                                 lysis/Variant%20Call%20Format/vcf-variant
                                 -call-format-version-41"
                                 target="_blank">VCF</a>)
    --force                      Force Cloudgene to reinstall application
                                 in HDFS even if it already installed.
    --mode <mode>                Mode
                                 qconly: Quality Control Only
                                 imputation: Quality Control & Imputation
                                 phasing: Quality Control & Phasing Only
                                 (default: imputation)
    --output <arg>               Output folder
    --phasing <phasing>          Phasing
                                 no_phasing: No phasing
                                 eagle: Eagle v2.4 (phased output)
                                 (default: eagle)
    --population <population>    Population
                                 bind: refpanel
                                 property: populations
                                 category: RefPanel
    --r2Filter <r2Filter>        rsq Filter
                                 0: off
                                 0.1: 0.1
                                 0.2: 0.2
                                 0.3: 0.3
                                 0.001: 0.001
                                 (default: 0)
    --refpanel <app_list>        Reference Panel (<a
                                 href="https://imputationserver.sph.umich.
                                 edu/start.html#!pages/refpanels"
                                 target="_blank">Details</a>)
    --show-log                   Stream logging messages to stdout
    --show-output                Stream output to stdout
    --user <arg>                 Hadoop username [default: cloudgene]
sudo docker exec -t -i impute cloudgene run [email protected] --files https://imputationserver.sph.umich.edu/static/downloads/hapmap300.chr1.recode.vcf.gz --refpanel apps@hapmap2  --show-log --population bind
Cloudgene 2.0.3
http://www.cloudgene.io
(c) 2009-2019 Lukas Forer and Sebastian Schoenherr
Built by null on null
Built by null on null

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/opt/cloudgene/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

Genotype Imputation (Minimac4) 1.2.4
https://imputationserver.readthedocs.io

[INFO]  Submit job job-20200107-010222...
        Input values: 
          refpanel: apps@hapmap2
          files: https://imputationserver.sph.umich.edu/static/downloads/hapmap300.chr1.recode.vcf.gz
          build: hg19
          r2Filter: 0
          phasing: eagle
          population: bind
          mode: imputation
          aesEncryption: yes
          myseparator: 
        Results:
          Quality-Control Report: /opt/cloudgene/job-20200107-010222/qcreport/qcreport
          QC Statistics: /opt/cloudgene/job-20200107-010222/statisticDir
          Imputation Results: /opt/cloudgene/job-20200107-010222/local
          Logs: /opt/cloudgene/job-20200107-010222/logfile
[LOG]   Executing Job installation....
[LOG]   Cleaning up uploaded local files...
[LOG]   Cleaning up temporary local files...
[LOG]   Cleaning up temporary hdfs files...
[LOG]   Cleaning up hdfs files...

Error: Execution failed.

Unable to change the navigation of the "Help" tab.

Hi,

I am trying to change the navigation of the "Help" to our own documentation created using readthedocs but I am unable to do so.

I have tried changing the navigation in the settings.yaml file in config but I can't see the change in action.

This is the current one

navigation:
  - id: help
    name: Help
    link: http://docs.cloudgene.io

This is the change I am trying to implement

navigation:
  - id: help
    name: Help
    link: https://qatarimputationserver.readthedocs.io/en/latest/

Can the server return rsid?

Hi!

I'm wondering if the server can return rsid instead of the ID in chr:pos:alt:ref format in the output files? My SNP IDs in the input files are rsid.

Thank you!!
Xiaotong

How to get the docker to push jobs to Hadoop

Hi there!

We managed to get the docker to run inside a local VM but when we tried to link it to our Hadoop clusters, it is not connecting. Instead, the jobs are running WITHIN the docker/VM environment.

When we tried to ping our Hadoop cluster's IP address from within the docker environment, we were able to. So the docker can "see" the Hadoop cluster, but for some reason the jobs are not being pushed to it.

It seems that the Imputation Server is ignoring the IPs in the config files (/etc/hadoop/conf).

What we did:

Changed the IP in mapred-site.xml

<property>
     <name>mapred.job.tracker</name>
     <value>172.32.62.64:8021</value>
</property> 

Changed the IP in core-site.xml

<property>
      <name>fs.defaultFS</name>
      <value>hdfs://172.17.0.3:8020</value>
</property>

Changed the values in hdfs-site.xml

<property>
       <name>dfs.datanode.data.dir</name>
       <value>hdfs:///user/imputation/data</value>
</property>

Are there other config files we need to edit so that jobs are sent to our Hadoop environment instead of running locally within the docker container?

Could you please help?

Thanks,
Rozaimi

Jobs fail with no helpful error message

The output from the job is in the pastebin link below.

I'm testing out the docker version of the server now, this is how I start it:
docker run -d -p 8080:80 -v $(pwd):/data/ genepi/imputationserver:v1.2.4

I'm starting a job from the web interface with these files. I get the same error if I start a job manually with docker exec -ti name-of-container cloudgene run [email protected] --files /data/test-data/data/simulated-chip-1chr-imputation/ --conf /etc/hadoop/conf --population eur

The files I attached have worked before when I tested the server in March last year. I have also tried the supplied test files in the test-data directory but they produce the same error message. What could be wrong?

https://pastebin.com/Dvc1XrJW

Password mismatch

How can I run a command line job with no password for the output zip file? or to specify the password beforehand?

I ran 115 jobs on my locally installed Hadoop-cluster MIS (latest version).
I unzipped them using a script I wrote to retrieve the passwords from the log files, but 5 of them rejected the password. Manually entering the passwords also failed, and I noticed that all 5 failing passwords are very short (e.g. "t" or "BiPx"). I suspect they contained special characters and were only partially printed to the log file. I can't use these 5/115 output files now and will have to re-run them. I would rather avoid such issues by having the outputs either not password-protected, or protected with a password I set myself.
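Until password handling is configurable, one client-side mitigation is to never pass the recovered password through a shell string, so characters like &, $ or spaces cannot be reinterpreted. A minimal sketch using Python's argv-list form of subprocess (`unzip -P` is the standard Info-ZIP flag; the helper name is ours):

```python
import shlex

def build_unzip_command(zip_path, password, dest="."):
    """Build an unzip invocation as an argv list (no shell involved),
    so special characters in the password reach unzip unchanged.
    Execute it with subprocess.run(cmd, check=True)."""
    cmd = ["unzip", "-P", password, "-d", dest, zip_path]
    # Shell-safe rendering, for logging/debugging only.
    printable = " ".join(shlex.quote(part) for part in cmd)
    return cmd, printable
```

This does not fix passwords that were truncated in the log files, but it rules out shell quoting as a cause of the rejections.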

[Feature request] - Installing reference panel from local files

In my previous issue I had the problem of not being able to install a reference panel that wasn't hosted online. The panel is sensitive and is not allowed to be hosted on a public server, so at first I had no way of installing it. To work around the problem I started a local Python-based web server, copied the link to the reference panel, and used it as input to the installation function.

This shouldn't be necessary, could you implement a function that lets the user install a reference panel from a local copy?

Are indels excluded ?

According to the documentation, the exclusion criteria are: [a] invalid alleles, [b] duplicates, [c] indels, [d] monomorphic sites, [e] allele mismatch, [f] SNP call rate < 90%.

The HRC panel does not contain indels, but 1000G does.
Are all indels removed in the QC step even when 1000G imputation is selected?
Is it possible to skip the indel exclusion?

Thanks.

Too long INFO field fails quality control

Hi, I am using the server to impute a VCF file with many (20-30) INFO fields. The Quality Control step always fails, without any log files or error messages for troubleshooting. I then used VCFtools to remove all the INFO fields, and the imputation now runs fine.

I am wondering why having too many INFO fields in my VCF file would cause QC to fail.
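For anyone hitting the same limit, the workaround described (stripping INFO before upload) does not require VCFtools; blanking the 8th column is enough. A minimal sketch for an uncompressed (or already-decompressed) VCF stream:

```python
def strip_info(lines):
    """Yield VCF lines with the INFO column (8th field) replaced by '.'.
    Header lines (starting with '#') pass through untouched."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        fields[7] = "."   # INFO is the 8th tab-separated column
        yield "\t".join(fields) + "\n"
```

Pipe the output through bgzip afterwards, since the server expects block-gzipped VCF input.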

Server API requests

I have been doing some work with the Michigan Imputation Server API lately and there are a couple of feature requests I would like to suggest:

  1. Passwords for output files should be able to be requested via the API or the user should be able to set the password for a job during job submission. Being forced to pull the password out of your email does not allow the user to easily automate the unlocking of the resulting dosage files.

  2. Allow the user to specify the length of time that an API token is valid for. At the moment it looks like the API token stops working after about a day, forcing the user to revoke it and create a new token. This also breaks any efforts at automating interactions with the imputation server.

Thanks!

Max number of cores per chromosome file on the docker version

I am running the docker version (1.16) on a machine with 26 cores, and set the number of cores to 24 when I first ran the container.

On a test run, I have one chromosome (22) and 7600 samples, but only up to 3 CPU cores are being utilized at any given moment.

Is there a limit per job/file? Should I split my chromosomes into smaller files?
Will it behave differently when pushed to a Hadoop cluster?

Thank you!

Issue regarding the access to UCSC database

Hi,
I tried to generate output files for the example data using the $ GenEpi -g example -p example -o ./ command, but I got an error:
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'genome-mysql.cse.ucsc.edu' (timed out)")

I searched for 'genome-mysql.cse.ucsc.edu', but it does not appear to be available.

Any suggestions for overcoming this problem would be really appreciated.

Thank you in advance.

Imported new panel, imputation fails on QC

Hi! We are attempting to perform imputation using a Haplotype Reference Consortium panel (with GRCh37/hg19 data). We have installed the panel and are able to see it as an option when running a job. However, the imputation fails for some reason when generating the QC report.

Here are our logs from the job:

job.txt:

22/01/18 13:55:24 Executing Job installation....
22/01/18 13:55:24   Preparing Application 'Genotype Imputation (Minimac4)'...
22/01/18 13:55:24   Application 'Genotype Imputation (Minimac4)'is already installed.
22/01/18 13:55:24   Preparing Application 'Haplotype Reference Consortium Panel'...
22/01/18 13:55:24   Application 'Haplotype Reference Consortium Panel'is already installed.
22/01/18 13:55:24 Executing Job setups....
22/01/18 13:55:24 Planner: WDL evaluated.
22/01/18 13:55:24 Planner: DAG created.
22/01/18 13:55:24   Nodes: 3
22/01/18 13:55:24     Input Validation
22/01/18 13:55:24       Inputs: 
22/01/18 13:55:24       Outputs: 
22/01/18 13:55:24     Quality Control
22/01/18 13:55:24       Inputs: 
22/01/18 13:55:24       Outputs: mafFile chunkFileDir statisticDir 
22/01/18 13:55:24     Quality Control (Report)
22/01/18 13:55:24       Inputs: mafFile r2Filter myseparator check1 check2 
22/01/18 13:55:24       Outputs: qcreport 
22/01/18 13:55:24   Dependencies: 2
22/01/18 13:55:24     Input Validation->Quality Control
22/01/18 13:55:24     Quality Control->Quality Control (Report)
22/01/18 13:55:24 Executor: execute DAG...
22/01/18 13:55:24 ------------------------------------------------------
22/01/18 13:55:24 Input Validation
22/01/18 13:55:24 ------------------------------------------------------
22/01/18 13:55:24 Versions:
22/01/18 13:55:24   Pipeline: michigan-imputationserver-1.5.7
22/01/18 13:55:24   Imputation-Engine: minimac4-1.0.2
22/01/18 13:55:24   Phasing-Engine: eagle-2.4
22/01/18 13:55:24 Configuration file '/data/apps/imputationserver/1.5.7/job.config' not available. Use default values.
22/01/18 13:55:24   Input Validation [0 sec]
22/01/18 13:55:24 ------------------------------------------------------
22/01/18 13:55:24 Quality Control
22/01/18 13:55:24 ------------------------------------------------------
22/01/18 13:55:24 Configuration file '/data/apps/imputationserver/1.5.7/job.config' not available. Use default values.
22/01/18 13:55:24 Reference Panel Ranges: genome-wide
22/01/18 13:55:25   Quality Control [0 sec]
22/01/18 13:55:25   Exporting parameter statisticDir...
22/01/18 13:55:25   Added 3 downloads.
22/01/18 13:55:25   Added 0 custom downloads.
22/01/18 13:55:25 ------------------------------------------------------
22/01/18 13:55:25 Quality Control (Report)
22/01/18 13:55:25 ------------------------------------------------------
22/01/18 13:55:25 Running script /data/apps/imputationserver/1.5.7/qc-report.Rmd...
22/01/18 13:55:25 Working Directory: /data/apps/imputationserver/1.5.7
22/01/18 13:55:25 Output: /data/jobs/job-20220118-135524-523/qcreport/qcreport.html
22/01/18 13:55:25 Parameters:
22/01/18 13:55:25   /data/jobs/job-20220118-135524-523/temp/mafFile/mafFile
22/01/18 13:55:25 Creating RMarkdown report from /data/apps/imputationserver/1.5.7/qc-report.Rmd...
22/01/18 13:55:30   Quality Control (Report) [ERROR]
22/01/18 13:55:30   Exporting parameter qcreport...
22/01/18 13:55:30   Added 0 downloads.
22/01/18 13:55:30   Added 0 custom downloads.
22/01/18 13:55:30 Executing onFailure... 
22/01/18 13:55:30 ------------------------------------------------------
22/01/18 13:55:30 Send Notification on Failure
22/01/18 13:55:30 ------------------------------------------------------
22/01/18 13:55:30 Configuration file '/data/apps/imputationserver/1.5.7/job.config' not available. Use default values.
22/01/18 13:55:30   Send Notification on Failure [ERROR]
22/01/18 13:55:30 onFailure execution failed.
22/01/18 13:55:30 Job execution failed: Job Execution failed.
22/01/18 13:55:30 Cleaning up...
22/01/18 13:55:30 Cleaning up uploaded local files...
22/01/18 13:55:30 Cleaning up temporary local files...
22/01/18 13:55:30 Cleaning up temporary hdfs files...
22/01/18 13:55:30 Cleaning up hdfs files...
22/01/18 13:55:30 Cleanup successful.


std.out:

Export parameter 'statisticDir'...


processing file: qc-report.Rmd
Quitting from lines 5-16 (qc-report.Rmd) 
Error in read.table(input, header = FALSE, sep = "\t") : 
  no lines available in input
Calls: <Anonymous> ... withCallingHandlers -> withVisible -> eval -> eval -> read.table
Execution halted


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |..........                                                            |  14%
  ordinary text without R code


  |                                                                            
  |....................                                                  |  29%
label: unnamed-chunk-1 (with options) 
List of 1
 $ echo: logi FALSE



Export parameter 'qcreport'...
No action required. Email notification has been disabled in job.config
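
The std.out above points at the actual failure: read.table aborts with "no lines available in input", i.e. the mafFile written by the Quality Control step is empty, so the report has nothing to render. A quick way to confirm this before digging into the QC step itself (the helper is hypothetical; the path is the one from the log):

```python
def count_data_lines(path):
    """Count non-empty lines in a delimited stats file such as mafFile."""
    with open(path) as fh:
        return sum(1 for line in fh if line.strip())

# A count of 0 for
#   /data/jobs/job-20220118-135524-523/temp/mafFile/mafFile
# reproduces the "no lines available in input" error in qc-report.Rmd,
# and usually means no input variants overlapped the reference panel
# (e.g. a chromosome-naming or build mismatch).
```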

And here is our YAML file, in case it is useful:

name: Haplotype Reference Consortium Panel
description: Panel built from hrc-EGAD00001002729
category: RefPanel
version: 1.1

properties:
  id: HRC
  hdfs: ${app_hdfs_folder}/m3vcfs/HRC.r1-1.EGA.GRCh37.chr$chr.m3vcf.gz
  legend: ${app_local_folder}/legend/HRC.r1-1.EGA.GRCh37.chr$chr.legend.gz
  mapEagle: ${app_hdfs_folder}/map/genetic_map_hg19_withX.txt.gz
  refEagle: ${app_hdfs_folder}/bcfs/HRC.r1-1.EGA.GRCh37.chr$chr.bcf
  build: hg19
  samples:
    all: 27165
    mixed: -1
  populations:
    all: ALL
    mixed: Other/Mixed

installation:

  - import:
      source: ${app_local_folder}/bcfs
      target: ${app_hdfs_folder}/bcfs

  - import:
      source: ${app_local_folder}/m3vcfs
      target: ${app_hdfs_folder}/m3vcfs

  - import:
      source: ${app_local_folder}/map
      target: ${app_hdfs_folder}/map


Imputation on chromosome 14 failed. Imputation was stopped.

I am trying to impute a dataset aligned to the HRC reference panel using Minimac4, with that same HRC panel. However, every job I run reaches the pre-phasing/imputation phase and stops with the error:

Imputation on chromosome 14 failed. Imputation was stopped.

There are no error files or much else to diagnose the problem from my end. I am unsure whether chr14 is the first chromosome to be imputed (which would suggest an issue with my dataset as a whole), or whether chromosomes are imputed in order and something is wrong with chr14 specifically.
If this is something you've come across, or you know what the problem could be, any help would be appreciated!
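
In the meantime, one thing worth checking locally is whether the chr14 input file actually contains usable records: an empty or truncated per-chromosome VCF, or a chromosome name the panel does not expect, can make a single chromosome fail while the rest impute fine. A small sketch (the helper is hypothetical, assuming bgzipped VCF input as uploaded to the server; bgzip files are readable with Python's gzip module):

```python
import gzip
from collections import Counter

def records_per_chrom(vcf_path):
    """Count non-header records per chromosome in a (b)gzipped VCF."""
    counts = Counter()
    with gzip.open(vcf_path, "rt") as fh:
        for line in fh:
            if not line.startswith("#"):
                counts[line.split("\t", 1)[0]] += 1
    return counts

# e.g. records_per_chrom("chr14.vcf.gz") returning an empty Counter, or a
# chromosome labelled "chr14" where the panel expects "14", would both
# explain a single-chromosome failure.
```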
