paleomix's Introduction

The PALEOMIX pipelines

The PALEOMIX pipelines are a set of pipelines and tools designed to aid the rapid processing of High-Throughput Sequencing (HTS) data:

  • The BAM pipeline processes de-multiplexed reads from one or more samples, through sequence processing and alignment, to generate BAM alignment files useful in downstream analyses.
  • The Phylogenetic pipeline carries out genotyping and phylogenetic inference on BAM alignment files, either produced using the BAM pipeline or generated elsewhere.
  • The Zonkey pipeline carries out a suite of analyses on low-coverage equine alignments, in order to detect the presence of F1-hybrids in archaeological assemblages.

In addition, PALEOMIX aids in metagenomic analysis of the extracts.

The pipelines have been designed with ancient DNA (aDNA) in mind and include several features especially useful for the analysis of ancient samples, but they can all be used for the processing of modern samples as well, in order to ensure consistent data processing.

Installation and usage

Detailed instructions can be found in the documentation for PALEOMIX. For questions, bug reports, and/or suggestions, please use the GitHub tracker or contact Mikkel Schubert at [email protected].

Citations

The PALEOMIX pipelines have been published in Nature Protocols; if you make use of PALEOMIX in your work, then please cite

Schubert M, Ermini L, Sarkissian CD, Jónsson H, Ginolhac A, Schaefer R, Martin MD, Fernández R, Kircher M, McCue M, Willerslev E, and Orlando L. "Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX". Nat Protoc. 2014 May;9(5):1056-82. doi: 10.1038/nprot.2014.063. Epub 2014 Apr 10. PubMed PMID: 24722405.

The Zonkey pipeline has been published in Journal of Archaeological Science; if you make use of this pipeline in your work, then please cite

Schubert M, Mashkour M, Gaunitz C, Fages A, Seguin-Orlando A, Sheikhi S, Alfarhan AH, Alquraishi SA, Al-Rasheid KAS, Chuang R, Ermini L, Gamba C, Weinstock J, Vedat O, and Orlando L. "Zonkey: A simple, accurate and sensitive pipeline to genetically identify equine F1-hybrids in archaeological assemblages". Journal of Archaeological Science. 2017 Feb; 78:147-157. doi: 10.1016/j.jas.2016.12.005.

Related tools

  • DamMet: Probabilistic modelling of ancient methylomes using sequencing data underlying an ancient specimen.
  • gargammel: Simulations of ancient DNA datasets.
  • mapDamage: Tracking and quantifying damage patterns in ancient DNA sequences.
  • nf-core/eager: A fully reproducible and state-of-the-art ancient DNA analysis pipeline.

paleomix's People

Contributors

beeso018, ginolhac, grahamgower, jfy133, mikkelschubert


paleomix's Issues

Should the .rmdup.collapsed.bam and .rmdup.normal.bam be merged?

Hi Mikkel,

I have used your pipeline in my Master's thesis to trim and map my target-capture reads. The pipeline ran without errors, and as a result I got .rmdup.collapsed.bam and .rmdup.normal.bam files for each individual. My next goal is to calculate coverage per gene. I have a question concerning the further treatment of the mentioned files: would you recommend merging the two files somehow, or treating them separately, or is this not an appropriate way to deal with PCR duplicates?
Our data is of very high quality, with 150 bp PE reads. So, if you think it's more appropriate, we could also bypass collapsing overlapping reads and use them as normal PE reads.

Best regards,
Vitali Razumov
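For context on the question above: this is not an official recommendation, just a sketch of how combining the two per-library BAMs could be invoked if one decides merging is appropriate. The helper only builds the argv; the file names are placeholders, and samtools is assumed to be on PATH when the command is actually run.

```python
def samtools_merge_argv(output_bam, input_bams):
    """Return the argv for 'samtools merge OUTPUT INPUT...'."""
    return ["samtools", "merge", output_bam, *input_bams]

# Hypothetical file names matching the naming scheme in the question:
argv = samtools_merge_argv(
    "sample.rmdup.merged.bam",
    ["sample.rmdup.collapsed.bam", "sample.rmdup.normal.bam"],
)
print(" ".join(argv))
```

The resulting command line could then be run via subprocess.run(argv, check=True), followed by samtools index on the merged file.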

Error running mapDamage in example bam_pipeline

Hi,
I'm trying to install the PALEOMIX pipeline inside a Singularity container to be able to run it on an HPC.
The example pipeline runs almost completely successfully, but two errors occur:

[root@singularity-builder]/vagrant/Paleomix/bam_pipeline# singularity run ../paleomix.img/ bam_pipeline run 000_makefile.yaml 
Reading makefiles ...
  - Validating prefixes ...
Building BAM pipeline .
Running BAM pipeline ...
  - Checking file dependencies ...
  - Checking for required executables ...
  - Checking version requirements ...
    - Checking version of 'Rscript' ...
    - Checking version of 'AdapterRemoval' ...
    - Checking version of 'GenomeAnalysisTK' ...
    - Checking version of 'Picard tools' ...
    - Checking version of 'R module: Rcpp' ...
    - Checking version of 'R module: RcppGSL' ...
    - Checking version of 'R module: gam' ...
    - Checking version of 'R module: ggplot2' ...
    - Checking version of 'R module: inline' ...
    - Checking version of 'bwa' ...
    - Checking version of 'mapDamage' ...
    - Checking version of 'samtools' ...
  - Determining states ...
  - Ready ...


08:04:18 Running 2 tasks using ~2 of max 2 threads; 146 done of 166 tasks in 0s; press 'h' for help.
  - <mapDamage (plots): 2 files in 'ExampleProject/rCRS/Synthetic_Sample_1' -> 'ExampleProject.rCRS.mapDamage/ACGATA'>
  - <mapDamage (plots): 2 files in 'ExampleProject/rCRS/Synthetic_Sample_1' -> 'ExampleProject.rCRS.mapDamage/TGCTCA'>

<mapDamage (plots): 2 files in 'ExampleProject/rCRS/Synthetic_Sample_1' -> 'ExampleProject.rCRS.mapDamage/TGCTCA'>
  Error ('NodeError') occurred running command:
    Error(s) running Node:
    	Temporary directory: '/tmp/root/bam_pipeline/2b3eef21-e42a-4a9a-8ef6-5c294f91eca4'
    
    Parallel processes:
      Process 1:
        Command = java -server -Djava.io.tmpdir=/tmp/root/bam_pipeline -Djava.awt.headless=true \
                      -XX:+UseSerialGC -Xmx4g -jar /root/install/jar_root/picard.jar MergeSamFiles \
                      SO=coordinate COMPRESSION_LEVEL=0 OUTPUT=input.bam \
                      VALIDATION_STRINGENCY=LENIENT \
                      I=/vagrant/Paleomix/bam_pipeline/ExampleProject/rCRS/Synthetic_Sample_1/TGCTCA.rmdup.collapsed.bam \
                      I=/vagrant/Paleomix/bam_pipeline/ExampleProject/rCRS/Synthetic_Sample_1/TGCTCA.rmdup.normal.bam
        Status  = Exited with return-code 1
        STDOUT* = 'pipe_java_140259777747728.stdout'
        STDERR* = 'pipe_java_140259777747728.stderr'
        CWD     = '/tmp/root/bam_pipeline/2b3eef21-e42a-4a9a-8ef6-5c294f91eca4'
    
      Process 2:
        Command = mapDamage --no-stats --merge-reference-sequences -t \
                      'mapDamage plot for library '"'"'TGCTCA'"'"'' -i \
                      /tmp/root/bam_pipeline/2b3eef21-e42a-4a9a-8ef6-5c294f91eca4/input.bam -d \
                      /tmp/root/bam_pipeline/2b3eef21-e42a-4a9a-8ef6-5c294f91eca4 -r \
                      000_prefixes/rCRS.fasta --downsample 100000
        Status  = Exited with return-code 1
        STDOUT* = '/tmp/root/bam_pipeline/2b3eef21-e42a-4a9a-8ef6-5c294f91eca4/pipe_mapDamage.stdout'
        STDERR* = '/tmp/root/bam_pipeline/2b3eef21-e42a-4a9a-8ef6-5c294f91eca4/pipe_mapDamage.stderr'
        CWD     = '/vagrant/Paleomix/bam_pipeline'


08:04:29 Running 1 task using ~1 of max 2 threads; 14 failed, 146 done of 166 tasks in 12s; press 'h' for help.
  Log-file located at '/tmp/root/bam_pipeline/bam_pipeline.20200529_080403_00.log'
  - <mapDamage (plots): 2 files in 'ExampleProject/rCRS/Synthetic_Sample_1' -> 'ExampleProject.rCRS.mapDamage/ACGATA'>

<mapDamage (plots): 2 files in 'ExampleProject/rCRS/Synthetic_Sample_1' -> 'ExampleProject.rCRS.mapDamage/ACGATA'>
  Error ('NodeError') occurred running command:
    Error(s) running Node:
    	Temporary directory: '/tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb'
    
    Parallel processes:
      Process 1:
        Command = java -server -Djava.io.tmpdir=/tmp/root/bam_pipeline -Djava.awt.headless=true \
                      -XX:+UseSerialGC -Xmx4g -jar /root/install/jar_root/picard.jar MergeSamFiles \
                      SO=coordinate COMPRESSION_LEVEL=0 OUTPUT=input.bam \
                      VALIDATION_STRINGENCY=LENIENT \
                      I=/vagrant/Paleomix/bam_pipeline/ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.normal.bam \
                      I=/vagrant/Paleomix/bam_pipeline/ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.collapsed.bam
        Status  = Exited with return-code 1
        STDOUT* = 'pipe_java_140259776994960.stdout'
        STDERR* = 'pipe_java_140259776994960.stderr'
        CWD     = '/tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb'
    
      Process 2:
        Command = mapDamage --no-stats --merge-reference-sequences -t \
                      'mapDamage plot for library '"'"'ACGATA'"'"'' -i \
                      /tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/input.bam -d \
                      /tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb -r \
                      000_prefixes/rCRS.fasta --downsample 100000
        Status  = Exited with return-code 1
        STDOUT* = '/tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/pipe_mapDamage.stdout'
        STDERR* = '/tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/pipe_mapDamage.stderr'
        CWD     = '/vagrant/Paleomix/bam_pipeline'

Done; but errors were detected ...

  Number of nodes:             166
  Number of done nodes:        146
  Number of runable nodes:     0
  Number of queued nodes:      0
  Number of outdated nodes:    0
  Number of failed nodes:      20
  Pipeline runtime:            12s

The content of the logfiles is:

 # /tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/pipe_mapDamage.stderr

started with the command: /usr/bin/mapDamage --no-stats --merge-reference-sequences -t mapDamage plot for library 'ACGATA' -i /tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/input.bam -d /tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb -r 000_prefixes/rCRS.fasta --downsample 100000
[E::idx_find_and_load] Could not retrieve index file for '/tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/input.bam'
alignment file must be single-ended
alignment file must be single-ended

and

# /tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/pipe_java_140259776994960.stderr

INFO    2020-05-29 08:04:21     MergeSamFiles
  
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MergeSamFiles -SO coordinate -COMPRESSION_LEVEL 0 -OUTPUT input.bam -VALIDATION_STRINGENCY LENIENT -I /vagrant/Paleomix/bam_pipeline/ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.normal.bam -I /vagrant/Paleomix/bam_pipeline/ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.collapsed.bam
**********


08:04:23.061 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/root/install/jar_root/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri May 29 08:04:23 GMT 2020] MergeSamFiles INPUT=[/vagrant/Paleomix/bam_pipeline/ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.normal.bam, /vagrant/Paleomix/bam_pipeline/ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.collapsed.bam] OUTPUT=input.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT COMPRESSION_LEVEL=0    ASSUME_SORTED=false MERGE_SEQUENCE_DICTIONARIES=false USE_THREADING=false VERBOSITY=INFO QUIET=false MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri May 29 08:04:23 GMT 2020] Executing as root@singularity-builder on Linux 3.10.0-1062.9.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1ubuntu1-b09; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.22.4
INFO    2020-05-29 08:04:23     MergeSamFiles   Input files are in same order as output so sorting to temp directory is not needed.
[Fri May 29 08:04:29 GMT 2020] picard.sam.MergeSamFiles done. Elapsed time: 0.11 minutes.
Runtime.totalMemory()=12455936
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: Write error; BinaryCodec in writemode; streamed file (filename not available)
        at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:222)
        at htsjdk.samtools.util.BinaryCodec.writeByteBuffer(BinaryCodec.java:188)
        at htsjdk.samtools.util.BinaryCodec.writeShort(BinaryCodec.java:266)
        at htsjdk.samtools.util.BlockCompressedOutputStream.writeGzipBlock(BlockCompressedOutputStream.java:445)
        at htsjdk.samtools.util.BlockCompressedOutputStream.deflateBlock(BlockCompressedOutputStream.java:415)
        at htsjdk.samtools.util.BlockCompressedOutputStream.write(BlockCompressedOutputStream.java:305)
        at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:220)
        at htsjdk.samtools.util.BinaryCodec.writeByteBuffer(BinaryCodec.java:188)
        at htsjdk.samtools.util.BinaryCodec.writeInt(BinaryCodec.java:234)
        at htsjdk.samtools.BAMRecordCodec.encode(BAMRecordCodec.java:160)
        at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:144)
        at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:185)
        at picard.sam.MergeSamFiles.doWork(MergeSamFiles.java:224)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
        at java.nio.channels.Channels.writeFully(Channels.java:101)
        at java.nio.channels.Channels.access$000(Channels.java:61)
        at java.nio.channels.Channels$1.write(Channels.java:174)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:220)
        ... 15 more
# Runtime_log.txt

2020-05-29 08:04:29,032 INFO    main: Started with the command: /usr/bin/mapDamage --no-stats --merge-reference-sequences -t mapDamage plot for library 'ACGATA' -i /tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/input.bam -d /tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb -r 000_prefixes/rCRS.fasta --downsample 100000
2020-05-29 08:04:29,287 ERROR   main: alignment file must be single-ended
# pipe.errors

Command          = '/usr/local/bin/paleomix bam_pipeline run 000_makefile.yaml'
CWD              = '/vagrant/Paleomix/bam_pipeline'
PATH             = '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin'
Node             = <mapDamage (plots): 2 files in 'ExampleProject/rCRS/Synthetic_Sample_1' -> 'ExampleProject.rCRS.mapDamage/ACGATA'>
Threads          = 1
Input files      = 000_prefixes/rCRS.fasta
                   ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.collapsed.bam
                   ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.normal.bam
Output files     = ExampleProject.rCRS.mapDamage/ACGATA/3pGtoA_freq.txt
                   ExampleProject.rCRS.mapDamage/ACGATA/5pCtoT_freq.txt
                   ExampleProject.rCRS.mapDamage/ACGATA/Fragmisincorporation_plot.pdf
                   ExampleProject.rCRS.mapDamage/ACGATA/Length_plot.pdf
                   ExampleProject.rCRS.mapDamage/ACGATA/Runtime_log.txt
                   ExampleProject.rCRS.mapDamage/ACGATA/dnacomp.txt
                   ExampleProject.rCRS.mapDamage/ACGATA/lgdistribution.txt
                   ExampleProject.rCRS.mapDamage/ACGATA/misincorporation.txt
Auxiliary files  = /root/install/jar_root/picard.jar
Executables      = java
                   mapDamage

Errors =
Parallel processes:
  Process 1:
    Command = java -server -Djava.io.tmpdir=/tmp/root/bam_pipeline -Djava.awt.headless=true \
                  -XX:+UseSerialGC -Xmx4g -jar /root/install/jar_root/picard.jar MergeSamFiles \
                  SO=coordinate COMPRESSION_LEVEL=0 OUTPUT=input.bam \
                  VALIDATION_STRINGENCY=LENIENT \
                  I=/vagrant/Paleomix/bam_pipeline/ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.normal.bam \
                  I=/vagrant/Paleomix/bam_pipeline/ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.collapsed.bam
    Status  = Exited with return-code 1
    STDOUT* = 'pipe_java_140259776994960.stdout'
    STDERR* = 'pipe_java_140259776994960.stderr'
    CWD     = '/tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb'

  Process 2:
    Command = mapDamage --no-stats --merge-reference-sequences -t \
                  'mapDamage plot for library '"'"'ACGATA'"'"'' -i \
                  /tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/input.bam -d \
                  /tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb -r \
                  000_prefixes/rCRS.fasta --downsample 100000
    Status  = Exited with return-code 1
    STDOUT* = '/tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/pipe_mapDamage.stdout'
    STDERR* = '/tmp/root/bam_pipeline/5a5a5535-df57-4bc3-92d6-8fa2bd6bb0eb/pipe_mapDamage.stderr'
    CWD     = '/vagrant/Paleomix/bam_pipeline'

I'm using the following version:

[root@singularity-builder]/vagrant/Paleomix/bam_pipeline# singularity run ../paleomix.img/                                   
PALEOMIX - pipelines and tools for NGS data analyses.
Version: 1.2.14

And I read in your release notes that a similar thing was fixed in a minor version (1.2.6):

mapDamage plots should not require indexed BAMs; this fixed missing file
errors for some makefile configurations.

What could be wrong?
Thanks in advance,
Best,
Nadine
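Background on the "alignment file must be single-ended" message in the logs above: the SAM FLAG field marks paired reads with bit 0x1, and this mapDamage version rejects BAM records carrying that bit. A minimal illustration of the check, with made-up flag values:

```python
PAIRED = 0x1  # SAM FLAG bit: 'template having multiple segments'

def is_paired(flag):
    """True if a SAM/BAM record's FLAG marks it as part of a read pair."""
    return bool(flag & PAIRED)

# 0/16 are typical single-end forward/reverse flags; 99/147 a proper pair.
for flag in (0, 16, 99, 147):
    print(flag, is_paired(flag))
```

Collapsed reads are single-ended, so in this setup the paired records in the .rmdup.normal.bam half of the merged input are what trips the check.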

Phylo pipeline example not working

Hello,

I have been experimenting with PALEOMIX and have had success with BAM pipeline.

I decided to run the phylogenetic pipeline, but at the step
$phylo_pipeline genotype+msa+phylogeny 000_makefile.yaml

I get an error message:

Reading makefile(s):
Error reading makefiles:
MakefileError:
Makefile requirement not met at 'root:chimpanzee':
Expected value(s): key in 'Project', 'PhylogeneticInference', 'Genotyping', 'PAML', or 'MultipleSequenceAlignment'
Observed value(s): 'chimpanzee'
Observed type: str

Has something changed since the example was made or what might be causing the issue?
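The MakefileError quoted above is a validation failure: keys at the root of the phylogenetic makefile must come from a fixed set, and 'chimpanzee' (presumably a sample or target name — an assumption) has ended up at the wrong indentation level. A minimal sketch of that kind of check, with the allowed keys copied from the error message (the real validator is more elaborate):

```python
# Allowed root-level keys, as listed in the error message above:
ALLOWED_ROOT_KEYS = {
    "Project",
    "PhylogeneticInference",
    "Genotyping",
    "PAML",
    "MultipleSequenceAlignment",
}

def validate_root(makefile):
    """Raise if a root-level key is not in the allowed set."""
    for key in makefile:
        if key not in ALLOWED_ROOT_KEYS:
            raise ValueError("Makefile requirement not met at 'root:%s'" % key)

validate_root({"Project": {}, "Genotyping": {}})  # passes silently
try:
    validate_root({"chimpanzee": {}})  # misplaced name triggers the error
except ValueError as err:
    print(err)
```

So the first thing to check is whether 'chimpanzee' should be nested under one of the allowed sections rather than sitting at the root.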

readthedocs Documentation incorrectly renders double dash with a 'long' single dash

https://paleomix.readthedocs.io/en/latest/bam_pipeline/configuration.html

First instance:

The BAM pipeline exposes a number options, including the maximum number of threads used, and the maximum number of threads used for individual programs, the location of JAR files, and more. These may be set using the corresponding command-line options (e.g. –max-threads). However, it is also possible to set default values for such options, including on a per-host bases. This is accomplished by executing the following command, in order to generate a configuration file at ~/.paleomix/bam_pipeline.ini:

However, source (https://github.com/MikkelSchubert/paleomix/blob/master/docs/bam_pipeline/configuration.rst) correctly shows double dash.

This could be confusing for users who are not familiar with looking for sources of docs.
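Assuming the documentation is built with Sphinx (an assumption; the .rst sources suggest it), the usual culprit is the docutils SmartQuotes transform, which converts a bare `--` into an en-dash outside literal markup. Two common fixes: wrap options in double backticks in the source, or disable the transform globally in conf.py:

```python
# conf.py (Sphinx) -- disabling the docutils SmartQuotes transform stops
# '--max-threads' from being rendered with an en-dash in the HTML output.
smartquotes = False
```

Wrapping each option in literal markup is the more targeted fix, since disabling SmartQuotes also affects quotation marks document-wide.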

Errors running node

When running the bam pipeline, I quickly hit a confusing error (see below). I have tried running this script on every available partition, with variable memory and CPU allocations, but the error remains the same. How do I fix this?


15:34:04 INFO [1/270] Started trimming PE adapters from '/global/scratch/users/laurenhamm/thesis/aim1/Alignment_v3/fastq_original/1_5_S12_ID_[12].fastq'
15:34:04 ERROR NodeError while trimming PE adapters from '/global/scratch/users/laurenhamm/thesis/aim1/Alignment_v3/fastq_original/1_5_S12_ID_[12].fastq':
15:34:04 INFO Saving error logs to '/global/scratch/users/laurenhamm/thesis/aim1/Alignment_v3/paleomix/bamOuts/bam_pipeline.20230717_153402_01.log'
15:34:04 ERROR Error(s) running Node:
15:34:04 ERROR Temporary directory: '/global/scratch/users/laurenhamm/thesis/aim1/Alignment_v3/fastq_original/5d71e6f0-32c6-41cc-b1f3-bbd9efca5eca'
15:34:04 ERROR
15:34:04 ERROR Command = AdapterRemoval --gzip --trimns --trimqualities --threads 12 --basename
15:34:04 ERROR /global/scratch/users/laurenhamm/thesis/aim1/Alignment_v3/fastq_original/5d71e6f0-32c6-41cc-b1f3-bbd9efca5eca/reads
15:34:04 ERROR --collapse --file1
15:34:04 ERROR /global/scratch/users/laurenhamm/thesis/aim1/Alignment_v3/fastq_original/1_5_S12_ID_1.fastq
15:34:04 ERROR --file2
15:34:04 ERROR /global/scratch/users/laurenhamm/thesis/aim1/Alignment_v3/fastq_original/1_5_S12_ID_2.fastq
15:34:04 ERROR --mm 3 --minlength 25 --qualitybase 33 --qualitybase-output 33
15:34:04 ERROR Status = Exited with return-code 1
15:34:04 ERROR STDOUT* = '/global/scratch/users/laurenhamm/thesis/aim1/Alignment_v3/fastq_original/5d71e6f0-32c6-41cc-b1f3-bbd9efca5eca/pipe_AdapterRemoval_47368337539856.stdout'
15:34:04 ERROR STDERR* = '/global/scratch/users/laurenhamm/thesis/aim1/Alignment_v3/fastq_original/5d71e6f0-32c6-41cc-b1f3-bbd9efca5eca/pipe_AdapterRemoval_47368337539856.stderr'
15:34:04 ERROR CWD = '/global/scratch/users/laurenhamm/thesis/aim1/Alignment_v3/paleomix/bamOuts'

write-config

Hello,
I am currently working through your pipeline for future projects and am running into a problem when I try to configure the bam_pipeline. "--write-config" does not seem to work in configuring the pipeline, and I've tried many other possible variations. Every time I input "paleomix bam_pipeline --write-config", the output is just "BAM Pipeline v1.2.13.2" followed by "Usage", with the only possible commands following bam_pipeline being "help, example, makefile, dryrun, run and remap". How should I work around this?

Thank you so much,

Sarah

'empty header' error

Having problems with pysam I think... Any ideas about what's going on here?

Traceback (most recent call last):
  File "/panfs/roc/groups/3/mccuem/beesons/WGS/paleomix/bin/paleomix", line 250, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/panfs/roc/groups/3/mccuem/beesons/WGS/paleomix/bin/paleomix", line 246, in main
    return module.main(argv)
  File "/home/mccuem/beesons/.local/lib/python2.7/site-packages/pypeline/tools/cleanup.py", line 313, in main
    return _pipe_to_bam()
  File "/home/mccuem/beesons/.local/lib/python2.7/site-packages/pypeline/tools/cleanup.py", line 85, in _pipe_to_bam
    with pysam.Samfile("-", "r") as input_handle:
  File "pysam/calignmentfile.pyx", line 311, in pysam.calignmentfile.AlignmentFile.__cinit__ (pysam/calignmentfile.c:4929)
  File "pysam/calignmentfile.pyx", line 510, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:7138)
ValueError: file header is empty (mode='r') - is it SAM/BAM format?
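Background on the traceback above: as the `pysam.Samfile("-", "r")` call shows, `paleomix cleanup` reads SAM text from stdin, so an empty header usually means the upstream command (typically the aligner) exited before writing anything. A pure-Python sketch of the sanity check pysam is effectively performing:

```python
import io

def has_sam_header(stream):
    """True if the first line of a text-mode SAM stream is a header line."""
    return stream.readline().startswith("@")

print(has_sam_header(io.StringIO("@HD\tVN:1.6\n")))  # True
print(has_sam_header(io.StringIO("")))               # False: empty stream
```

When this fails, the stderr of the upstream process is the place to look for the real error.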

thank you, but this message appears with PALEOMIX v.1.2.14 too

Hi,

You appear to be using a version of PALEOMIX that is very, very old:
The particular issue you are running into was fixed four years ago, and the version check was then completely removed two years ago because of related issues. So upgrading to a newer version of PALEOMIX (e.g. v1.2.14) should fix it. If it doesn't, then feel free to open another issue.

Best,
Mikkel

Originally posted by @MikkelSchubert in #33 (comment)

I repost the original message.

Hi, I run BAM pipeline but it returns:

Building BAM pipeline ...
  - Validating prefixes ...
Running BAM pipeline ...
  - Checking file dependencies ...
  - Checking for required executables ...
  - Checking version requirements ...
    - Checking version of 'JAVA Runtime Environment' ...
      Program may be broken or a version not supported by the
      pipeline; please refer to the PALEOMIX documentation.
      Required:       at least v1.6.x
      Search string:  'java version "(\d+)\.(\d+)'
      ---------------------- Command output ----------------------
      openjdk version "1.8.0_252"
      OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~16.04-b09)
      OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

I assume this version is recent enough: how can I resolve this issue without changing the OpenJDK version?

Thanks
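The version-check failure quoted above can be reproduced with a short regex test. The pattern approximates the search string shown in the log; it anchors on the literal text 'java version', which never occurs in OpenJDK's banner ('openjdk version ...'), so the match fails even though the runtime itself is new enough:

```python
import re

# Pattern modeled on the version check's search string:
pattern = r'java version "(\d+)\.(\d+)'

print(re.search(pattern, 'openjdk version "1.8.0_252"'))        # prints None
print(re.search(pattern, 'java version "1.8.0_252"').groups())  # ('1', '8')
```

This is consistent with Mikkel's reply above: the version check was overly strict and was later removed, so upgrading PALEOMIX fixes it without touching the JDK.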

ImportError: libhts.so.2: cannot open shared object file: No such file or directory

I just tried to install paleomix using conda, as instructed in the documentation, but this is what I got when running a script independently:

python rmdup_collapsed.py 
Traceback (most recent call last):
 File "rmdup_collapsed.py", line 27, in <module>
  import pysam
 File "/home/btx638/software/miniconda3/envs/paleomix/lib/python3.6/site-packages/pysam/__init__.py", line 5, in <module>
  from pysam.libchtslib import *
ImportError: libhts.so.2: cannot open shared object file: No such file or directory

I also cannot run the example pipeline

paleomix bam example .
Traceback (most recent call last):
  File "/home/btx638/software/miniconda3/envs/paleomix/bin/paleomix", line 5, in <module>
    from paleomix.main import entry_point
  File "/home/btx638/software/miniconda3/envs/paleomix/lib/python3.6/site-packages/paleomix/main.py", line 25, in <module>
    import pysam
  File "/home/btx638/software/miniconda3/envs/paleomix/lib/python3.6/site-packages/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
ImportError: libhts.so.2: cannot open shared object file: No such file or directory

I have installed htslib manually and included its path in my .bashrc, and it still cannot find libhts.so.2. Could you point to a possible solution? There is nothing about this on the troubleshooting page. Thanks in advance.
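One likely cause (an assumption, not a confirmed diagnosis): shared libraries such as libhts.so.2 are resolved through LD_LIBRARY_PATH or the ldconfig cache, not through PATH, so adding the htslib directory to PATH in ~/.bashrc has no effect. The variable must be exported in the shell before Python starts; the snippet below only illustrates its colon-separated format, with a hypothetical install prefix:

```python
import os

# Hypothetical htslib install prefix; the real one depends on the system.
prefix = "/home/btx638/software/htslib/lib"

# LD_LIBRARY_PATH is a colon-separated list, searched left to right.
current = os.environ.get("LD_LIBRARY_PATH", "")
ld_path = prefix + (":" + current if current else "")
print(ld_path.split(":")[0])
```

In practice the fix is usually `export LD_LIBRARY_PATH=/path/to/htslib/lib:$LD_LIBRARY_PATH` in the shell profile, or installing htslib into the same conda environment so no manual path is needed.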

Issue When Employing `RegionsOfInterest`.

Dear @MikkelSchubert,

I am trying to run PaleoMix v2.0.0a0 but I am getting an error when trying to enter a BED file:

paleomix bam dryrun --max-threads 8 --adapterremoval-max-threads 8 --bwa-max-threads 8 --log-level debug --log-file UkData_PaleoMix.log UKData_PaleoMix.yaml
INFO Writing debug log to 'UkData_PaleoMix.log'
INFO Reading makefile 'UKData_PaleoMix.yaml'
WARNING option has been deprecated and will be removed in the future: Options :: Features :: Coverage
WARNING option has been deprecated and will be removed in the future: Options :: Features :: Depths
WARNING option has been deprecated and will be removed in the future: Options :: Features :: Summary
ERROR Error reading makefiles: Makefile requirement not met at 'Genomes :: RockDove_DoveTail_ReRun :: RegionsOfInterest':
ERROR   Expected value: key in 'Path'
ERROR   Observed value: 'RegionsOfInterest'
ERROR   Observed type:  str

The dry run completes normally when I comment out the two RegionsOfInterest lines:

#    RegionsOfInterest:
#      Loci: /projects/mjolnir1/people/dqm353/Pigeons/Reference/FPGP_FinalRun.EcoT22I_Extended.bed
paleomix bam dryrun --max-threads 8 --adapterremoval-max-threads 8 --bwa-max-threads 8 --log-level debug --log-file UkData_PaleoMix.log UKData_PaleoMix.yaml
INFO Writing debug log to 'UkData_PaleoMix.log'
INFO Reading makefile 'UKData_PaleoMix.yaml'
WARNING option has been deprecated and will be removed in the future: Options :: Features :: Coverage
WARNING option has been deprecated and will be removed in the future: Options :: Features :: Depths
WARNING option has been deprecated and will be removed in the future: Options :: Features :: Summary
INFO Validating FASTA files
INFO Building BAM pipeline for 'UKData_PaleoMix.yaml'
INFO Checking file dependencies
INFO Checking for auxiliary files
INFO Determining state of pipeline tasks
INFO Checking required software on localhost
INFO   [✓] AdapterRemoval v2.3.3
INFO   [✓] BWA v0.7.17
INFO   [✓] Python v3.9.12
INFO   [✓] ln
INFO   [✓] samtools v1.15.1
INFO Number of tasks:             30
INFO Number of done tasks:        3
INFO Number of runable tasks:     3
INFO Number of queued tasks:      24
INFO Number of outdated tasks:    0
INFO Number of failed tasks:      0
INFO Dry run completed successfully
DEBUG Shutting down local worker
INFO Log-file written to '/maps/projects/mjolnir1/people/dqm353/Pigeons/FPG/UKData/UkData_PaleoMix.log'

I am attaching here my .YAML and my .BED files.

PaleoMixFiles.zip

Any ideas of what could be happening? I reckon I am overlooking something but I cannot really tell what that would be.

Many thanks in advance, George.

Trimming first 6-8 bp of read 1

Hi Mikkel, I hope you are doing well!

I'm processing some new data and I wondered if there is a way to remove the first N bp of read 1. In this experiment the adapters have an index directly next to the insert, so the first 6 or 8 bp of read 1 are library-specific. Ideally I'd like to check that the index matches the expected sequence, as an extra precaution on patterned flowcells, but simply trimming the first N bp could work for now. I see AdapterRemoval's demultiplexing option gets close to the idea, but it seems it would not work with the current implementation in PALEOMIX. If it's not possible or worth including, I could use the pre-trimming option.

Thanks! Nathan
P.S. I saw zonkey is listed as published in 2007...
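On the trimming question above: newer AdapterRemoval releases expose a `--trim5p` option for exactly this kind of fixed-prefix removal (whether it can be passed through the PALEOMIX makefile depends on the versions in use — worth checking). What the operation amounts to, with made-up sequence and quality strings:

```python
def trim_5p(seq, qual, n):
    """Drop the first n bases of a read and their quality scores."""
    return seq[n:], qual[n:]

# Hypothetical read with a 6 bp library index at the 5' end:
seq, qual = trim_5p("ACGATAACGTACGT", "IIIIIIFFFFFFFF", 6)
print(seq, qual)
```

Verifying that the trimmed prefix matches the expected index sequence would be a small extension of the same idea, comparing `seq[:n]` against the known barcode before discarding it.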

About the MinQuality setting

Hi Mikkel,

I am running a set of stringency trials to calculate the number of hits under different "MinQuality" settings using the paleomix pipeline.
The makefile and paleomix scripts worked properly when the "MinQuality" setting was below 60.
However, once "MinQuality" exceeded 60, an error occurred while writing the "summary" file.
The error message is below:

#14:29:06 INFO [3/6] Started writing summary to ./Sample_1.summary
#14:29:06 ERROR NodeUnhandledException while writing summary to ./Sample_1.summary:
#14:29:06 INFO Saving error logs to '/mnt/shared/scratch/ztang/project/324_Fix/quality_test/MinQ_60-100/MinQ_70/bam_pipeline.20210330_110345_01.log'
#14:29:06 ERROR Error(s) running Node:

This issue occurs in all trials with "MinQuality" 60+, and the size of the output BAM file was also abnormal.
Could you please give an idea of what the potential reason for this error is? Thanks!
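A plausible explanation (an assumption, not a confirmed diagnosis): BWA caps mapping qualities at 60 (bwa aln tops out even lower, at 37), so a MinQuality cutoff above 60 discards every read, leaving an empty BAM for the summary step to choke on. Illustration with hypothetical per-read MAPQs:

```python
# Hypothetical mapping qualities for five reads:
mapq_values = [0, 23, 37, 60, 60]

# Once the cutoff exceeds the aligner's maximum MAPQ, nothing passes.
for cutoff in (30, 60, 70):
    kept = sum(1 for q in mapq_values if q >= cutoff)
    print(cutoff, kept)
```

If that is the cause, cutoffs above the aligner's maximum are not meaningful stringency levels — the filter saturates at the cap.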

checkpointing

Hi Mikkel,

Thanks for a great pipeline! I've conda-installed paleomix on a cluster and am running the BAM pipeline. Sometimes when my run is interrupted, it resumes from where it stopped, but sometimes it re-runs commands that already finished, and I'm not sure why.

Could you point me to information on how checkpointing on paleomix bam pipeline works?

Thank you!
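For context, pipelines of this kind typically decide whether a task is up to date by comparing file timestamps: a task re-runs if any output is missing or older than any input. This is a sketch of the general idea, not PALEOMIX's actual code, but it explains why files restored or rewritten with fresh mtimes (e.g. on scratch storage) can trigger re-runs after an interruption:

```python
import os
import tempfile
import time

def outdated(inputs, outputs):
    """True if any output is missing or older than the newest input."""
    if not all(os.path.exists(path) for path in outputs):
        return True
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return newest_input > oldest_output

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "reads.fastq")
    dst = os.path.join(tmp, "aligned.bam")
    open(src, "w").close()
    print(outdated([src], [dst]))            # output missing -> re-run
    open(dst, "w").close()
    os.utime(src, (time.time() + 60,) * 2)   # input 'touched' after output
    print(outdated([src], [dst]))            # input newer -> re-run
```

So a run that "starts over" is often one where input or intermediate files acquired newer timestamps than the outputs, even if their contents are unchanged.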

"Could not retrieve index file"

Hi there, paleomix team! Thanks as always for such a great pipeline 👍
I'm running into an odd error that I've never seen before, and I'm not sure if I need to worry about it. I've successfully run paleomix dozens of times on my old lab server, but I'm running it on a new server now and am getting the non-fatal error:

[E::idx_find_and_load] Could not retrieve index file for '/net/harris/vol1/home/beichman/bears/paleomix/paleomix/004_UARC_IT_APP2/004_UARC_IT_APP2/brown_bear/004_UARC_IT_APP2/Lib_S2.rmdup.collapsed.bam'
[E::idx_find_and_load] Could not retrieve index file for '/net/harris/vol1/home/beichman/bears/paleomix/paleomix/004_UARC_IT_APP2/004_UARC_IT_APP2/brown_bear/004_UARC_IT_APP2/Lib_S2.rmdup.normal.bam'

The pipeline does not fail that node and appears to complete successfully (final BAM files are output); however, I wanted to check with you that nothing insidious is going on due to the lack of a BAM index at this stage. Looking back at past runs of paleomix, I believe .bai files were output at that stage when I ran it previously.

Any ideas of what might be going on and if I need to worry about it? Thanks so much for your advice! :)

Software versions:
paleomix: 1.2.14
samtools 1.9
bwa 0.7.17
R 3.6.1
picard 2.21.7

Makefile pasted below:


# -*- mode: Yaml; -*-
# Timestamp: 2018-07-02T10:11:43.849578
#
# Default options.
# Can also be specific for a set of samples, libraries, and lanes,
# by including the "Options" hierarchy at the same level as those
# samples, libraries, or lanes below. This does not include
# "Features", which may only be specified globally.
Options:
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33
  # Split a lane into multiple entries, one for each (pair of) file(s)
  # found using the search-string specified for a given lane. Each
  # lane is named by adding a number to the end of the given barcode.
  SplitLanesByFilenames: yes
  # Compression format for FASTQ reads; 'gz' for GZip, 'bz2' for BZip2
  CompressionFormat: bz2

  # Settings for trimming of reads, see AdapterRemoval man-page
  AdapterRemoval:
     # Adapter sequences, set and uncomment to override defaults
#     --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
#     --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
     # Some BAM pipeline defaults differ from AR defaults;
     # To override, change these value(s):
     --mm: 3
     --minlength: 25
     # Extra features enabled by default; change 'yes' to 'no' to disable
     --collapse: yes
     --trimns: yes
     --trimqualities: yes

  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

    # Settings for mappings performed using BWA
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: mem
      # Filter aligned reads with a mapping quality (Phred) below this value
      # 20180702: AB changed MinQuality from 0 --> 30 (recommended by paleomix)
      MinQuality: 30
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
      # Post-mortem damage localizes to the seed region, which BWA expects to
      # have few errors (sets "-l"). See http://pmid.us/22574660
      # 20180702: AB changed UseSeed to 'no' for aDNA
      UseSeed: yes
      # Additional command-line options may be specified for the "aln"
      # call(s), as described for Bowtie2 below.

    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
#      --trim5: 5
#      --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
#      --rg:
#        - CN:SequencingCenterNameHere
#        - DS:DescriptionOfReadGroup

  # Mark / filter PCR duplicates. If set to 'filter', PCR duplicates are
  # removed from the output files; if set to 'mark', PCR duplicates are
  # flagged with bit 0x400, and not removed from the output files; if set to
  # 'no', the reads are assumed to not have been amplified. Collapsed reads
  # are filtered using the command 'paleomix rmdup_collapsed', while "normal"
  # reads are filtered using Picard MarkDuplicates.
  PCRDuplicates: filter

  # Command-line options for mapDamage; note that the long-form
  # options are expected; --length, not -l, etc. Uncomment the
  # "mapDamage" line adding command-line options below.
  mapDamage:
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

  # Set to 'yes' to exclude a type of trimmed reads from alignment / analysis;
  # possible read-types reflect the output of AdapterRemoval
  ExcludeReads:
    # Exclude single-end reads (yes / no)?
    Single: no
    # Exclude non-collapsed paired-end reads (yes / no)?
    Paired: no
    # Exclude paired-end reads for which the mate was discarded (yes / no)?
    Singleton: no
    # Exclude overlapping paired-ended reads collapsed into a single sequence
    # by AdapterRemoval (yes / no)?
    Collapsed: no
    # Like 'Collapsed', but only for collapsed reads truncated due to the
    # presence of ambiguous or low quality bases at read termini (yes / no).
    CollapsedTruncated: no

  # Optional steps to perform during processing.
  Features:
    # Generate BAM without realignment around indels (yes / no)
    RawBAM: yes
    # Generate indel-realigned BAM using the GATK Indel realigner (yes / no)
    RealignedBAM: no
    # To disable mapDamage, write 'no'; to generate basic mapDamage plots,
    # write 'plot'; to build post-mortem damage models, write 'model',
    # and to produce rescaled BAMs, write 'rescale'. The 'model' option
    # includes the 'plot' output, and the 'rescale' option includes both
    # 'plot' and 'model' results. All analyses are carried out per library.
    mapDamage: no
    # Generate coverage information for the raw BAM (wo/ indel realignment).
    # If one or more 'RegionsOfInterest' have been specified for a prefix,
    # additional coverage files are generated for each alignment (yes / no)
    Coverage: yes
    # Generate histogram of number of sites with a given read-depth, from 0
    # to 200. If one or more 'RegionsOfInterest' have been specified for a
    # prefix, additional histograms are generated for each alignment (yes / no)
    Depths: yes
    # Generate summary table for each target (yes / no)
    Summary: yes
    # Generate histogram of PCR duplicates, for use with PreSeq (yes / no)
    DuplicateHist: no


# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an optional
# set of regions for which additional statistics are produced.
Prefixes:
  # Replace 'NAME_OF_PREFIX' with name of the prefix; this name
  # is used in summary statistics and as part of output filenames.
  # southern sea otter:
  polar_bear:
    # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the
    # references against which reads are to be mapped. Using the same name
    # as filename is strongly recommended (e.g. /path/to/Human_g1k_v37.fasta
    # should be named 'Human_g1k_v37').
    Path: /net/harris/vol1/home/beichman/reference_genomes/polar_bear/polar_bear.fasta
  # northern sea otter
  brown_bear:
    # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the
    # references against which reads are to be mapped. Using the same name
    # as filename is strongly recommended (e.g. /path/to/Human_g1k_v37.fasta
    # should be named 'Human_g1k_v37').
    Path: /net/harris/vol1/home/beichman/reference_genomes/brown_bear/brown_bear.fasta

    # (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a
    # .bed file listing extra regions for which coverage / depth statistics
    # should be calculated; if no names are specified for the BED records,
    # results are named after the chromosome / contig. Change 'NAME' to the
    # name to be used in summary statistics and output filenames.
#    RegionsOfInterest:
#      NAME: PATH_TO_BEDFILE



# Mapping targets are specified using the following structure. Uncomment and
# replace 'NAME_OF_TARGET' with the desired prefix for filenames.
004_UARC_IT_APP2:
  #Uncomment and replace 'NAME_OF_SAMPLE' with the name of this sample.
  004_UARC_IT_APP2:
    #Uncomment and replace 'NAME_OF_LIBRARY' with the name of this library.
    Lib_S14:
      #Uncomment and replace 'NAME_OF_LANE' with the name of this lane,
      #and replace 'PATH_WITH_WILDCARDS' with the path to the FASTQ files
      #to be trimmed and mapped for this lane (may include wildcards).
      SRR5878348: /net/harris/vol1/home/beichman/bears/fastqs.fromENA.nobackup/SRR5878348_{Pair}.fastq.gz
    Lib_S2:
      #Uncomment and replace 'NAME_OF_LANE' with the name of this lane,
      #and replace 'PATH_WITH_WILDCARDS' with the path to the FASTQ files
      #to be trimmed and mapped for this lane (may include wildcards).
      SRR5878360: /net/harris/vol1/home/beichman/bears/fastqs.fromENA.nobackup/SRR5878360_{Pair}.fastq.gz

picard tools are not individual jars anymore

The documentation says that Picard's version should be 1.85+.
However, Picard now ships as a single JAR with one main command, and individual tools are invoked as subcommands.
How should this be handled in the jar_root folder?

Soft clipped reads are not correctly deduplicated

Hi Mikkel,

We've noticed a problem with deduplication after trying out bwa-mem on some of our data. The underlying reason seems to be that PCR duplicates can exhibit different sequencing errors, and thus be soft-clipped at different positions. Deduplication based on the length and cigar string is thus problematic. Further, if the reads are soft-clipped at the beginning of a forward-mapping sequence (or the end of a reverse-mapping sequence), the reported mapping positions will also be different. The attached file shows 4 reads which should be marked as duplicates but are not.

-Graham

z.txt
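The positional effect described above can be illustrated with a short editorial sketch (not PALEOMIX code): grouping reads by their 5' unclipped position, as Picard MarkDuplicates does, brings such duplicates back together, whereas comparing reported positions or lengths/CIGAR strings does not.

```python
import re

def unclipped_start(pos, cigar):
    """5' unclipped start of a forward-strand read: the reported position
    minus any leading soft (S) or hard (H) clipping in the CIGAR string."""
    clip = re.match(r"(\d+)[SH]", cigar)
    return pos - int(clip.group(1)) if clip else pos

# Two PCR duplicates of the same fragment; the second read is soft-clipped
# at its start due to sequencing errors, so its reported POS differs.
read_a = (100, "50M")
read_b = (105, "5S45M")
print(unclipped_start(*read_a), unclipped_start(*read_b))  # 100 100
```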

Phylo pipeline "unknown command"

Hello,

I am trying to run the phylo pipeline with the latest version of paleomix - 1.3.7

I've tried lots of variants on the same command, including using example documentation as well as my own own makefile, but consistently get the same error message:

ERROR unknown command 'genotyping'

Just one example of the commands yielding this error is:
paleomix phylo_pipeline genotyping+msa+phylogeny makefile.yaml

I've checked and I believe I have all the correct development files installed.
Could you advise what might be causing this error?

Many thanks,
Ben

Failure in build time test

Hi,
I tried to upgrade the Debian package of paleomix to the latest release 1.2.12 and noticed that there is a build-time issue. I skipped the test test_requirementobj__version__command_not_executable, and in the comment of the corresponding patch I recorded the error that results when keeping the test.
Am I missing something or is this test just wrong?
Kind regards, Andreas.

paleomix spawning zombies, waiting in futex_

I've noticed a number of zombie processes sitting on one of our servers, which are attributable to paleomix runs (see below). The system has been up for 147 days now, so these may have been undead for some time. I can see that some are attributable to an old version of paleomix (from April 2014) and some to a more recent version (git 2fc2560).

$ ps -lA|awk '$1=="F"||$2~/Z/'
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 Z 1186063 3238 72611  0 80   0 -     0 exit   ?        00:00:00 mapDamage <defunct>
0 Z 1186063 3321 72615  0 80   0 -     0 exit   ?        00:00:00 mapDamage <defunct>
0 Z 1186063 4337 72628  0 80   0 -     0 exit   ?        00:00:00 mapDamage <defunct>
0 Z 1186063 4372 72630  0 80   0 -     0 exit   ?        00:00:00 mapDamage <defunct>
0 Z 1186063 4744 72636  0 80   0 -     0 exit   ?        00:00:00 mapDamage <defunct>
0 Z 1186063 18703 8808  0 80   0 -     0 exit   ?        00:00:54 paleomix <defunct>
0 Z 1186063 27678 20614  0 80  0 -     0 exit   ?        00:05:44 bwa <defunct>
0 Z 1186063 27698 20614  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1186063 27737 20614  0 80  0 -     0 exit   ?        00:00:00 bwa <defunct>
0 Z 1186063 27746 20614  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1186063 29232 20625  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1186063 36627 20751  0 80  0 -     0 exit   ?        00:13:10 bwa <defunct>
0 Z 1186063 36631 20751  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1186063 36634 20751  0 80  0 -     0 exit   ?        00:00:00 bwa <defunct>
0 Z 1186063 36636 20751  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1186063 39950 20763  0 80  0 -     0 exit   ?        01:34:43 bwa <defunct>
0 Z 1186063 39952 20763  0 80  0 -     0 exit   ?        00:00:00 bwa <defunct>
0 Z 1186063 39953 20763  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1158147 44627 43895  0 80  0 -     0 exit   ?        00:01:26 bwa <defunct>
0 Z 1158147 44659 43895  0 80  0 -     0 exit   ?        00:01:14 bwa <defunct>
0 Z 1158147 44695 43895  0 80  0 -     0 exit   ?        00:00:00 bwa <defunct>
0 Z 1158147 44699 43895  0 80  0 -     0 exit   ?        00:00:00 bam_cleanup <defunct>
0 Z 1186063 56136 34878  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1186063 57149 34883  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1186063 59789 34900  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1186063 68004 43865  0 80  0 -     0 exit   ?        00:12:44 bwa <defunct>
0 Z 1186063 68006 43865  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1186063 68028 43865  0 80  0 -     0 exit   ?        00:00:00 bwa <defunct>
0 Z 1186063 68045 43865  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>
0 Z 1186063 68973 43868  0 80  0 -     0 exit   ?        00:00:00 paleomix <defunct>

$ ps -l 43895 72611
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY        TIME CMD
1 S 1158147 43895   1  0  80   0 - 118817 futex_ ?         0:00 /opt/shared/python/gcc4.4.4/2.7.2/bin/python -3 /opt/shared/paleomix/git20140411/bin/bam_pipeline run makefile.yaml
1 S 1186063 72611   1  0  80   0 - 83977 futex_ pts/24     0:00 /opt/shared/python/gcc4.8.0/2.7.5/bin/python /localscratch/Programs/paleomix-git-2fc2560-20150601/opt/shared/python/gcc4.8.0/2.7.5//bin/bam_pipeline run --max-threads 34 --bwa

$ uptime
 13:27:03 up 147 days, 21:44,  1 user,  load average: 0.00, 0.00, 0.00
$ cat /etc/redhat-release
Red Hat Enterprise Linux ComputeNode release 6.6 (Santiago)
$ uname -r
2.6.32-504.el6.x86_64

[info] conda instructions for paleomix

I don't know if it would be of interest, but as conda is very popular nowadays, I thought I'd share my notes on how to create a conda environment that can run PALEOMIX (I have only tested the bam_pipeline so far, though).

If you're interested I can make a PR into the docs, otherwise you could just leave this open for people to find.


This assumes you've already installed conda, and set it up to scan the bioconda channel.

Make the conda environment; note that this adds the missing GATK and R requirement(s) that are not listed explicitly in the current PALEOMIX documentation.

conda create -n paleomix python=2.7 pip adapterremoval=2.3.1 samtools=1.9 picard=2.22.9 bowtie2=2.3.5.1 bwa=0.7.17 mapdamage2=2.0.9 gatk=3.8 r-base=3.5.1 r-rcpp=1.0.4.6 r-rcppgsl=0.3.7 r-gam=1.16.1 r-inline=0.3.15

conda activate paleomix

Then while in the paleomix environment, install paleomix

pip install --user paleomix

Now fix the 'difficult' GATK recipe by downloading the last GATK v3 JAR, and putting it and the conda version of Picard in the place that paleomix requires.

wget https://storage.googleapis.com/gatk-software/package-archive/gatk/GenomeAnalysisTK-3.8-1-0-gf15c1c3ef.tar.bz2
## not completely necessary, but might as well
gatk3-register GenomeAnalysisTK-3.8-1-0-gf15c1c3ef.tar.bz2

mkdir -p /home/<user>/install/jar_root/
ln -s /<path>/<to>/miniconda2/envs/paleomix/opt/gatk-3.8/GenomeAnalysisTK.jar /home/<user>/install/jar_root/
ln -s /<path>/<to>/miniconda2/envs/paleomix/share/picard-2.22.9-0/picard.jar /home/<user>/install/jar_root/

Finally, to test that it worked properly:

cd ~
paleomix bam_pipeline example .
cd ~/bam_pipeline
paleomix bam_pipeline run 000_makefile.yaml

Once completed, you can deactivate the PALEOMIX environment with

conda deactivate

I also made an environment file to make this slightly easier (you need to remove the .txt suffix before running the command), but this only shortens the create command; the rest of the setup is still required.

To create it, you can run:

conda env create -f paleomix_environment.yaml

paleomix_environment.yaml.txt

GATKv4 not supported?

Hello,
I've been trying to install and use paleomix and always get an error when trying to use GATK (it can't locate the file GenomeAnalysisTK.jar).
I've found that in GATKv4 there is no longer a jar file named GenomeAnalysisTK.jar, but two jar files launched by a gatk wrapper. I managed to work around this, but it still results in an error (see below):

"Version could not be determined for GenomeAnalysisTK:

Attempted to run command:
$ java -server -Djava.io.tmpdir=/tmp/doctorant/bam_pipeline -Djava.awt.headless=true -XX:+UseSerialGC -jar /home/doctorant/install/jar_root/GenomeAnalysisTK.jar --version

Program may be broken or a version not supported by the
pipeline; please refer to the PALEOMIX documentation.

Required:       any version
Search string:  ^(\d+)\.(\d+)

---------------------- Command output ----------------------
The Genome Analysis Toolkit (GATK) v4.1.4.0
HTSJDK Version: 2.20.3
Picard Version: 2.21.1"

If I run the command myself, there is no error; as you can see, the output is the version. So the problem occurs only when using paleomix.
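The cause is visible from the quoted check alone: the search string anchors on a leading digit, while GATK4 prefixes its version with descriptive text. A minimal reproduction, using the regex quoted in the error:

```python
import re

# Search string quoted in the error message above.
pattern = re.compile(r"^(\d+)\.(\d+)")

gatk3_output = "3.8-1-0-gf15c1c3ef"                           # matches
gatk4_output = "The Genome Analysis Toolkit (GATK) v4.1.4.0"  # does not

print(bool(pattern.match(gatk3_output)))  # True
print(bool(pattern.match(gatk4_output)))  # False -> "could not be determined"
```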

Plus, I've read that the indel realigner is no longer supported in GATK4.
It seems to me that this version isn't supported by paleomix; updating the software recommendations for users accordingly would be great.

(Sorry if I got something wrong!)

Regards

Change in default tmp dir?

Is it possible to change the default tmp dir? I think I've been able to do this in the past with the .conf file, but was wondering if this is still possible and easy to answer before I go diving in and breaking things. Right now we are running out of tmp space on the cluster we are running on.
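For reference, the v1.x per-user configuration (~/.paleomix/bam_pipeline.ini, quoted in full in a later issue in this collection) includes a temp_root setting; a minimal sketch, assuming that .ini format (the path is hypothetical):

```
[Defaults]
temp_root = /scratch/<user>/bam_pipeline
```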

paleomix does not work

I have installed paleomix and have been running the example for a few days, but there is no result. According to top, it uses zero CPU.
The output only shows:
Reading makefiles ...

  • Validating prefixes ...

(screenshot attached)
Your kind reply would be very helpful; thank you!

Please try to drop code copy of pyyaml

Hi,
I'm working on Debian packages for paleomix. The packaging is done in the Debian Med team, which packages free software in life sciences for official Debian. When doing the packaging I stumbled upon a code copy of pyyaml with some slight changes. If these changes are relevant for paleomix, it would be better to file an issue with upstream pyyaml to merge them rather than maintaining a code copy.
Kind regards, Andreas.

Paleomix can not find picard even though it is there

Hi,
I am having some problems with paleomix. I recently uninstalled paleomix and reinstalled the latest version (with conda), created the symlink to picard at ~/install/jar_root/picard.jar, and even changed the ~/.paleomix/bam_pipeline.ini option to point paleomix directly to my picard.jar file. If I run "java -server -Djava.io.tmpdir=/tmp/gthemudo/bam_pipeline -Djava.awt.headless=true -XX:+UseSerialGC -Xmx4g -jar /home/gthemudo/install/jar_root/picard.jar MarkDuplicates --version" on the command line, I get "Version:2.27.2" as the response.
Any idea what is causing this?
Cheers,
Gonçalo

paleomix bam_pipeline: error: unrecognized arguments: --gatk_max_threads=1 --progress_ui=running --jre_options=

Hi,

thank you for porting to python3!

I immediately tried to create a singularity container.
Although I had to work around the conda environment, it seems to work, apart from two things. The installation with pip only worked when providing --ignore-installed ruamel-yaml, because it was somehow already installed beforehand (by conda create, I assume).

But the main reason I write this ticket is something else. When running the example workflow for testing I get the following error message:

$ singularity run paleomix_v1.3.sif bam example . 
13:41:13 INFO Copying example project to '.'
13:41:13 INFO Sucessfully saved example in './bam_pipeline'
$ cd bam_pipeline 
$ singularity run ../paleomix_v1.3.sif bam run makefile.yaml
usage: paleomix bam_pipeline [-h] [--version] command ...
paleomix bam_pipeline: error: unrecognized arguments: --gatk_max_threads=1 --progress_ui=running --jre_options== --ui_colors=on

I didn't change the ~/.paleomix/bam_pipeline.ini. Do I have to add a value for --jre_options to get it working? And if yes, what exactly goes there? The path to java?

[Defaults]
max_threads = 2
log_level = warning
gatk_max_threads = 1
jar_root = /root/install/jar_root
bwa_max_threads = 1
progress_ui = running
temp_root = /tmp/root/bam_pipeline
jre_options = 
bowtie2_max_threads = 1
adapterremoval_max_threads = 1
ui_colors = on


EDIT:
What I've noticed is the double "=" after jre_options; I don't know where that is coming from.

The version is of course 1.3:

$ singularity run paleomix_v1.3.sif bam --version
paleomix bam_pipeline v1.3.0

BWA terminated by SIGKILL, PALEOMIX in BAM pipeline

I have recently tried running the BAM pipeline on PE reads. But once I run it against a nuclear reference genome, each sample generates a long error and an empty BAM file. The STDOUT and STDERR* files are almost all empty, too.
The errors all look something like this; the first process is terminated by SIGKILL and the second by PALEOMIX:

PALEOMIX         = v1.3.8
Command          = '/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/bin/paleomix bam run makefile_PE_allbig_bubals_nuclear_mito_21.01.24.YAML'
CWD              = '/scratch300/uriw1/Bubals'
PATH             = '/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/bin:/powerapps/share/centos7/miniconda/miniconda3-2023/condabin:/powerapps/share/ExaML/examl:/powerapps/share/ExaML/parser:/powerapps/share/centos7/miniconda/miniconda3-2023/etc/profile.d/miniconda3-2023-environmentally/condabin:/powerapps/share/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local.cc/bin:/mathematica/vers/11.2'
Node             = alignment of './bubal_NM231/reads/Sample_NM231/Seq_NM231/Lane_2/V350180618_L02_NGS444_s2_Seq_NM231_x.fq.gz/reads.collapsed.gz' onto GCA_006408545.1_HBT_genomic using BWA samse
Threads          = 1
Input files      = ./bubal_NM231/bubal_genome_reference/Sample_NM231/Seq_NM231/Lane_2/V350180618_L02_NGS444_s2_Seq_NM231_x.fq.gz/collapsed.sai
                   ./bubal_NM231/reads/Sample_NM231/Seq_NM231/Lane_2/V350180618_L02_NGS444_s2_Seq_NM231_x.fq.gz/reads.collapsed.gz
                   /scratch300/uriw1/Bubals/Bubal_reference/ncbi_dataset/data/GCA_006408545.1/GCA_006408545.1_HBT_genomic.fasta
                   /scratch300/uriw1/Bubals/Bubal_reference/ncbi_dataset/data/GCA_006408545.1/GCA_006408545.1_HBT_genomic.fasta.amb
                   /scratch300/uriw1/Bubals/Bubal_reference/ncbi_dataset/data/GCA_006408545.1/GCA_006408545.1_HBT_genomic.fasta.ann
                   /scratch300/uriw1/Bubals/Bubal_reference/ncbi_dataset/data/GCA_006408545.1/GCA_006408545.1_HBT_genomic.fasta.bwt
                   /scratch300/uriw1/Bubals/Bubal_reference/ncbi_dataset/data/GCA_006408545.1/GCA_006408545.1_HBT_genomic.fasta.pac
                   /scratch300/uriw1/Bubals/Bubal_reference/ncbi_dataset/data/GCA_006408545.1/GCA_006408545.1_HBT_genomic.fasta.sa
Output files     = ./bubal_NM231/bubal_genome_reference/Sample_NM231/Seq_NM231/Lane_2/V350180618_L02_NGS444_s2_Seq_NM231_x.fq.gz/collapsed.bam
Auxiliary files  = /powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/main.py
Executables      = /powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/bin/python
                   bwa

Errors =
Parallel processes:
  Process 1:
    Command = bwa samse \
                  /scratch300/uriw1/Bubals/Bubal_reference/ncbi_dataset/data/GCA_006408545.1/GCA_006408545.1_HBT_genomic.fasta \
                  ./bubal_NM231/bubal_genome_reference/Sample_NM231/Seq_NM231/Lane_2/V350180618_L02_NGS444_s2_Seq_NM231_x.fq.gz/collapsed.sai \
                  ./bubal_NM231/reads/Sample_NM231/Seq_NM231/Lane_2/V350180618_L02_NGS444_s2_Seq_NM231_x.fq.gz/reads.collapsed.gz
    Status  = Terminated with signal SIGKILL
    STDOUT  = Piped to process 2
    STDERR* = '/scratch300/uriw1/Bubals/temp/e19d64cc-c3d4-4ded-89d5-3d8ada5945d5/pipe_bwa_139916491756176.stderr'
    CWD     = '/scratch300/uriw1/Bubals'

  Process 2:
    Command = /powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/bin/python \
                  /powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/main.py \
                  cleanup --fasta \
                  /scratch300/uriw1/Bubals/Bubal_reference/ncbi_dataset/data/GCA_006408545.1/GCA_006408545.1_HBT_genomic.fasta \
                  --temp-prefix \
                  /scratch300/uriw1/Bubals/temp/e19d64cc-c3d4-4ded-89d5-3d8ada5945d5/bam_cleanup \
                  --rg-id Seq_NM231 --rg SM:Sample_NM231 --rg LB:Seq_NM231 --rg PU:Lane_2 --rg \
                  PL:ILLUMINA --rg PG:bwa --rg \
                  'DS:/scratch300/uriw1/Bubals/Bubal_raw_data/V350180618_L02_NGS444_s2_Seq_NM231_[12].fq.gz' \
                  -q 25 -F 0x4
    Status  = Automatically terminated by PALEOMIX
    STDIN   = Piped from process 1
    STDOUT  = '/scratch300/uriw1/Bubals/temp/e19d64cc-c3d4-4ded-89d5-3d8ada5945d5/collapsed.bam'
    STDERR* = '/scratch300/uriw1/Bubals/temp/e19d64cc-c3d4-4ded-89d5-3d8ada5945d5/pipe_python_139916491756752.stderr'
    CWD     = '/scratch300/uriw1/Bubals'

I've also attached my makefile as a .txt:

makefile_PE_allbig_bubals_nuclear.txt

maybe the issue is the wildcard token following the {Pair}*?

Best regards,
Uri Wolkowski

GATK error

Hello,

I'm getting the following error when trying to run Paleomix:

Version could not be determined for GenomeAnalysisTK:
 
Attempted to run command:
    $ java -server -Djava.io.tmpdir=/pool/genomics/xxx/xxxxx/xxxxxxx/xxxxxxxxxx/temp -Djava.awt.headless=true -XX:+UseSerialGC -Xmx2g -jar /home/xxxxxxxx/install/jar_root/GenomeAnalysisTK.jar --version
 
Program may be broken or a version not supported by the
pipeline; please refer to the PALEOMIX documentation.
 
    Required:       prior to v4.0.x
    Search string:  ^(?:The Genome Analysis Toolkit \(GATK\) v)?(\d+)\.(\d+)

But when I try to run the command showed above:
$ java -server -Djava.io.tmpdir=/pool/genomics/xxx/xxxxx/xxxxxxx/xxxxxxxxxx/temp -Djava.awt.headless=true -XX:+UseSerialGC -Xmx2g -jar /home/xxxxxxxx/install/jar_root/GenomeAnalysisTK.jar --version

I get: 3.8-1-0-gf15c1c3ef

So, I'm not sure why it is not working.
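Curiously, the search string quoted in this error does match the quoted output, which suggests the check is failing before the regex is applied (for example, a different java or memory limit in effect when invoked by the pipeline). A quick check of the regex itself:

```python
import re

# Search string and output quoted in the error above.
pattern = re.compile(r"^(?:The Genome Analysis Toolkit \(GATK\) v)?(\d+)\.(\d+)")
match = pattern.match("3.8-1-0-gf15c1c3ef")
print(match.groups())  # ('3', '8')
```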

Thank you.

Java issue with an updated openjdk version

Hi, I run BAM pipeline but it returns:

Building BAM pipeline ...

  • Validating prefixes ...
    Running BAM pipeline ...

  • Checking file dependencies ...

  • Checking for required executables ...

  • Checking version requirements ...

    • Checking version of 'JAVA Runtime Environment' ...
      Program may be broken or a version not supported by the
      pipeline; please refer to the PALEOMIX documentation.

    Required: at least v1.6.x
    Search string: 'java version "(\d+)\._'
    ---------------------- Command output ----------------------
    openjdk version "1.8.0_252"
    OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~16.04-b09)
    OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

I guess this OpenJDK version is newer than expected: how can I resolve this issue without changing the openjdk version?
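The mismatch is visible in the quoted output: the version check searches for a banner beginning with 'java version', while OpenJDK prints 'openjdk version'. A minimal reproduction (editorial sketch; the second pattern shows one way such a check could accept both banners):

```python
import re

banner = 'openjdk version "1.8.0_252"'

# A search string expecting the Oracle-style banner prefix never matches:
print(re.search(r'java version "(\d+)', banner))  # None -> check fails

# Accepting either banner prefix works:
print(re.search(r'(?:java|openjdk) version "(\d+)', banner).group(1))  # '1'
```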

Thanks

BWA backtrack additional options added to samse, not aln

Hi Mikkel,

I added additional options (-n and -o) for bwa backtrack in the yaml file (bam pipeline), expecting them to be applied to the bwa aln command (as stated in the yaml file), but instead they seem to be applied to the bwa samse command (see log file), which then results in an error (I assume because bwa samse doesn't have a -o option). I've attached the yaml and log files, and I'm using paleomix v1.3.7. Please can you have a look to see what's going wrong here?

Cheers,
Deon
SCR_086_RmA009.yaml.txt
bam_pipeline.20230515_151100_01.log

biobambam2 instead of picard-tools?

Is it feasible to switch to biobambam2 for the duplicate marking and BAM validation steps?

One of the main problems we have with large paleomix runs (e.g. 100 samples) is that it begins hanging and failing at these steps, especially the BAM validation steps. We have sort of hacked it to manually do the BAM validation steps using biobambam2. In my experience it is more reliable and MUCH faster -- 10-20X faster, and it never fails!

Just a suggestion for you Mikkel. Best wishes from Norway

bam_pipeline error

Hi,
I'm trying to use paleomix on my data, but I'm getting this error:

Reading makefiles ...
Error reading makefiles:
MakefileError:
Inconsistency between makefile specification and current makefile at root:CGF279_FagioloAntico:CGF279_FagioloAntico:Lane_1:
Expected dict, found str '/home/denise.lavezzari/Projects/CGF279_FagioloAntico/fastq/Fagiolo_Antico_ID_S72_R{Pair}_*.fastq.gz'!
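The error path (target : sample : Lane_1) suggests one nesting level is missing: 'Lane_1' sits where a library name is expected, so the pipeline expects a dict of lanes there rather than a path string. A sketch of the expected four-level hierarchy (the library name is hypothetical):

```
CGF279_FagioloAntico:        # target
  CGF279_FagioloAntico:      # sample
    Library_1:               # library level, apparently missing
      Lane_1: /home/denise.lavezzari/Projects/CGF279_FagioloAntico/fastq/Fagiolo_Antico_ID_S72_R{Pair}_*.fastq.gz
```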

Here is my yaml file:
000_makefile_template.txt

Thank you,
Denise

Tox tests do not work

Hi,
when trying to upgrade the paleomix Debian package to version 1.2.12, I also wanted to run the tox test suite. Unfortunately I get:

$ tox
GLOB sdist-make: /home/tillea/debian-maintain/alioth/debian-med_git/paleomix/setup.py
ERROR: invocation failed (exit code 1), logfile: /home/tillea/debian-maintain/alioth/debian-med_git/paleomix/.tox/log/tox-0.log
ERROR: actionid: tox
msg: packaging
cmdargs: ['/usr/bin/python3', local('/home/tillea/debian-maintain/alioth/debian-med_git/paleomix/setup.py'), 'sdist', '--formats=zip', '--dist-dir', local('/home/tillea/debian-maintain/alioth/debian-med_git/paleomix/.tox/dist')]
env: None

ERROR: Python version 2.7.x required!
       Current version is v3.5.3 (default, Jan 19 2017, 14:11:04)  [GCC 6.3.0 20170118]

ERROR: FAIL could not package project - v = InvocationError('/usr/bin/python3 .../debian-med_git/paleomix/setup.py sdist --formats=zip --dist-dir .../debian-med_git/paleomix/.tox/dist (see .../debian-med_git/paleomix/.tox/log/tox-0.log)', 1)

The machine where I tried to run this test has

$ python --version
Python 2.7.13
$ python3 --version
Python 3.5.3

Any hint as to what might have gone wrong here?

BTW, in Debian it is considered a good idea to move to Python3 sooner rather than later - but the interpreter that you call simply as python will remain the 2.x series of Python.

Kind regards, Andreas.

Issue running the example bam_pipeline

Hi,

we are trying to run the latest paleomix version. Everything is installed, but we cannot troubleshoot the error we get when trying to run:

paleomix bam_pipeline run 000_makefile.yaml

pipe_python_46912708195856.stderr.txt

error.log

Basically we get a truncated BAM out of BWA (see the error log and attached .stderr file).

Could you maybe help us with that? Thank you so much for your help!

Ludovic Dutoit & Alex Verry
University of Otago, New Zealand

time elapsed reset by using function keys

In v1.2.5, the elapsed time in the screen output is continuously
updated, a feature I like:
14:02:34 Running 3 tasks using ~3 of max 6 threads; 143 done of 166 tasks in 2:03s

After pressing l to list tasks (or increasing/decreasing the number of cores to use), the elapsed time is reset:

14:11:11 Running 3 tasks using ~3 of max 6 threads; 143 done of 166 tasks in 0s

Would it be possible to keep this time counter unaffected by key presses?

documentation for paleomix phylo pipeline

Dear Mikkel,

After successfully using your fantastic paleomix bam_pipeline to generate BAM files for my genomes, I am trying to use the phylo pipeline. However, I am struggling to make it run because I am unsure how to set up the makefile.yaml properly (e.g. I want a genome-wide analysis, not specific regions, so I don't know what to write under prefix). Do you perhaps have extended documentation for the phylo package (it says it is under construction on readthedocs), or perhaps a reference makefile.yaml for the phylo pipeline?

Many thanks in advance!

Oscar

Received a NodeError while running the pipeline

Hello! I ran into the following error while trying to run my sample.

38 INFO Validating FASTA files
16:44:38 INFO Building BAM pipeline for 'N_chinensis-novo.yaml'
16:44:38 INFO Running BAM pipeline
16:44:38 INFO Checking file dependencies
16:44:38 INFO Checking for auxiliary files
16:44:38 INFO Checking required software
16:44:38 INFO  - Found Rscript v3.4.4
16:44:38 INFO  - Found AdapterRemoval v2.3.1
16:44:38 INFO  - Found BWA v0.7.17
16:44:38 INFO  - Found Picard tools v2.23
16:44:39 INFO  - Found R module: Rcpp v0.12.15
16:44:39 INFO  - Found R module: RcppGSL v0.3.3
16:44:39 INFO  - Found R module: gam v1.14.4
16:44:39 INFO  - Found R module: ggplot2 v2.2.1
16:44:39 INFO  - Found R module: inline v0.3.14
16:44:39 INFO  - Found mapDamage v2.2.1
16:44:39 INFO  - Found samtools v1.10.0
16:44:39 INFO Determining states
16:44:39 INFO Ready
16:44:39 INFO [1/22] Started trimming SE adapters from '/media/birg/Disk_2/student/farah/paleomix2/data/O1_interleaved.fastq.gz'
16:44:39 INFO [2/22] Started validating '/media/birg/Disk_2/student/farah/paleomix2/prefixes/N_chinensis_novo.fasta'
16:44:39 ERROR NodeError while validating '/media/birg/Disk_2/student/farah/paleomix2/prefixes/N_chinensis_novo.fasta':
16:44:39 INFO Saving error logs to '/media/birg/Disk_2/student/farah/paleomix2/stats/bam_pipeline.20231227_164438_01.log'
16:44:39 ERROR     Error(s) running Node:
16:44:39 ERROR     	Temporary directory: '/media/birg/Disk_2/student/farah/paleomix2/stats/temp/044fc85f-e3fb-4604-9661-7d6922492089'
16:44:39 ERROR     
16:44:39 ERROR     FASTA sequence contains invalid characters
16:44:39 ERROR         Filename = '/media/birg/Disk_2/student/farah/paleomix2/prefixes/N_chinensis_novo.fasta'
16:44:39 ERROR         Line = 106
16:44:39 ERROR         Invalid characters = '*'
16:48:35 INFO [0/0] Finished trimming SE adapters from '/media/birg/Disk_2/student/farah/paleomix2/data/O1_interleaved.fastq.gz'

What could this possibly mean, and what can I do to avoid this error? My guess is that the FASTA file I'm using has some '*' characters in it. Thank you in advance!
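If '*' characters in the reference are indeed the problem, one minimal sketch is to strip them from sequence lines before indexing (the function and file names below are illustrative, not part of PALEOMIX):

```python
def strip_asterisks(src_path, dst_path):
    """Copy a FASTA file, removing '*' from sequence lines only."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(line)  # keep headers untouched
            else:
                dst.write(line.replace("*", ""))
```

Note that '*' often marks stop codons in protein FASTA files, so it is worth checking that the reference really is a nucleotide FASTA before cleaning it.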

Error with trimming SE adapters from sample

Hello! I ran into a node running error with a sample I'm working with.
I would add that this sample is different from my samples of interest, which are PE - those seem to be working fine through the pipeline, but the results look suspiciously empty, as if there is no match at all between sample and reference. That is why I elected to run another sample, one which had already been mapped by other methods outside the pipeline, to check whether the issue is with my data or with my makefile.

On my first attempt, the STDERR file includes this text:

Trimming single ended reads ...
Opening FASTQ file 'Vole_test_raw/MM1000.fastq.gz', line numbers start at 1

Processed 1,000,285 reads in 14.0s; 71,000 reads per second ...
Processed 2,001,875 reads in 27.4s; 72,000 reads per second ...
Processed 3,003,467 reads in 40.4s; 74,000 reads per second ...
Processed 4,005,354 reads in 53.9s; 74,000 reads per second ...
Processed 5,007,149 reads in 1:07.8s; 73,000 reads per second ...
Processed 6,008,780 reads in 1:20.8s; 74,000 reads per second ...
Processed 7,009,170 reads in 1:34.3s; 74,000 reads per second ...
Processed 8,010,903 reads in 1:48.1s; 74,000 reads per second ...
Processed 9,012,598 reads in 2:01.9s; 73,000 reads per second ...
Processed 10,014,250 reads in 2:15.1s; 74,000 reads per second ...
Processed 11,015,965 reads in 2:28.8s; 74,000 reads per second ...
Processed 12,017,599 reads in 2:42.0s; 74,000 reads per second ...
Processed 13,017,802 reads in 2:55.2s; 74,000 reads per second ...
Processed 14,019,601 reads in 3:09.4s; 73,000 reads per second ...
Processed 15,021,310 reads in 3:22.6s; 74,000 reads per second ...
Processed 16,022,832 reads in 3:35.7s; 74,000 reads per second ...
Processed 17,024,547 reads in 3:49.1s; 74,000 reads per second ...
Processed 18,026,280 reads in 4:03.0s; 74,000 reads per second ...
Processed 19,027,986 reads in 4:16.3s; 74,000 reads per second ...
Processed 20,028,236 reads in 4:29.8s; 74,000 reads per second ...
Processed 21,029,958 reads in 4:43.6s; 74,000 reads per second ...
Processed 22,031,616 reads in 4:56.7s; 74,000 reads per second ...
Processed 23,033,374 reads in 5:10.4s; 74,000 reads per second ...
Processed 24,035,170 reads in 5:23.1s; 74,000 reads per second ...
Processed 25,035,579 reads in 5:37.2s; 74,000 reads per second ...
Processed 26,037,423 reads in 5:50.7s; 74,000 reads per second ...
Processed 27,039,254 reads in 6:04.8s; 73,000 reads per second ...
Processed 28,040,910 reads in 6:17.7s; 74,000 reads per second ...
Processed 29,041,372 reads in 6:31.9s; 73,000 reads per second ...
Processed 30,043,188 reads in 6:46.1s; 73,000 reads per second ...
Processed 31,044,794 reads in 6:59.9s; 73,000 reads per second ...
Processed 32,046,528 reads in 7:14.2s; 72,000 reads per second ...
Processed 33,048,283 reads in 7:27.5s; 73,000 reads per second ...
Processed 34,050,017 reads in 7:41.4s; 72,000 reads per second ...
Processed 35,050,334 reads in 7:54.7s; 72,000 reads per second ...
ERROR: Unhandled exception in thread:
    line_reader::refill_buffers_gzip: unknown error ('incorrect data check'):
    iostream error
ERROR: AdapterRemoval did not run to completion;
       do NOT make use of resulting trimmed reads! 

Later, I got a different error when I edited the targets in my makefile to be specifically SE using the "Single:" key, as indicated in the documentation. This is the STDERR it produced - this time an OSError, in the file validation stage:

Traceback (most recent call last):
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/main.py", line 122, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/main.py", line 114, in main
    return module.main(argv[1:])
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/tools/validate_fastq.py", line 48, in main
    for record in FASTQ.from_file(filename):
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/common/formats/fastq.py", line 107, in from_file
    yield from FASTQ.from_lines(handle)
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/site-packages/paleomix/common/formats/fastq.py", line 82, in from_lines
    separator = next(lines_iter).rstrip()
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/gzip.py", line 289, in read1
    return self._buffer.read1(size)
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/gzip.py", line 454, in read
    self._read_eof()
  File "/powerapps/share/centos7/miniconda/miniconda3-2023/envs/paleomix_new_env/lib/python3.7/gzip.py", line 501, in _read_eof
    hex(self._crc)))
OSError: CRC check failed 0xffffffff != 0xfb10cbb7

Any suggestions or ideas on what could be done?
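Both tracebacks point at the gzip stream itself (AdapterRemoval reports "incorrect data check", Python reports "CRC check failed") rather than at the pipeline. A minimal sketch (function name illustrative) that forces the same trailing CRC/length check by decompressing the file end-to-end:

```python
import gzip


def gzip_is_intact(path):
    """Return True if the gzip stream decompresses fully and passes its
    trailing CRC/length check; False if it is corrupt or truncated."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(1 << 20):  # read in 1 MiB chunks
                pass
        return True
    except (OSError, EOFError):
        return False
```

If the check fails, re-transferring the file (and comparing checksums against those provided by the sequencing facility, e.g. with md5sum) is the usual remedy.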
Best wishes,
Uri

"Positional data is too large for BAM format"

Hi all,

I'm currently trying to align a set of FASTQs against GRCh37.

$ samtools --version
samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.

Some steps are failing with the following message in the log file.

Reading SAM file from STDIN ...
Joinining subprocesses:
[E::bam_write1] Positional data is too large for BAM format
samtools fixmate: Couldn't write to output file: No such file or directory
[E::bgzf_flush] File write failed (wrong size)
[E::bgzf_close] File write failed
Traceback (most recent call last):
  File "/tmp/pip-unpacked-wheel-VmAfDD/paleomix/main.py", line 242, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/tmp/pip-unpacked-wheel-VmAfDD/paleomix/main.py", line 235, in main
    return module.main(argv[1:])
  File "/sandbox/users/lindenbaum-p/.local/lib/python2.7/site-packages/paleomix/tools/cleanup.py", line 388, in main
    return _pipe_to_bam()
  File "/sandbox/users/lindenbaum-p/.local/lib/python2.7/site-packages/paleomix/tools/cleanup.py", line 93, in _pipe_to_bam
    output_handle.write(record)
  File "pysam/libcalignmentfile.pyx", line 1704, in pysam.libcalignmentfile.AlignmentFile.write
  File "pysam/libcalignmentfile.pyx", line 1736, in pysam.libcalignmentfile.AlignmentFile.write
IOError: sam_write1 failed with error code -1
  - Command finished: /sandbox/users/lindenbaum-p/packages/anaconda3/envs/PIP/bin/python /sandbox/users/lindenbaum-p/.local/lib/python2.7/site-packages/paleomix/main.pyc cleanup --fasta /sandbox/resources/species/human/cng.fr/hs37d5/hs37d5_all_chr.fasta --temp-prefix /sandbox/shares/u1087/lindenb/work/20200204.paleomix/tmp/c2e3b794-df63-4621-acd6-c932118126e4/bam_cleanup --min-quality 25 --exclude-flags 0x4 --samtools1x yes --rg-id ACTGGAC --rg SM:B00IK32 --rg LB:ACTGGAC --rg PU:Lane_1 --rg PL:ILLUMINA --rg PG:bwa pipe
    - Return-code:    1
  - Command finished: samtools sort -l 0 -O bam -T /sandbox/shares/u1087/lindenb/work/20200204.paleomix/tmp/c2e3b794-df63-4621-acd6-c932118126e4/bam_cleanup
    - Return-code:    0
  - Command finished: samtools calmd -b - /sandbox/resources/species/human/cng.fr/hs37d5/hs37d5_all_chr.fasta
    - Return-code:    0
  - Command finished: /sandbox/users/lindenbaum-p/packages/anaconda3/envs/PIP/bin/python /sandbox/users/lindenbaum-p/.local/lib/python2.7/site-packages/paleomix/main.pyc cleanup --fasta /sandbox/resources/species/human/cng.fr/hs37d5/hs37d5_all_chr.fasta --temp-prefix /sandbox/shares/u1087/lindenb/work/20200204.paleomix/tmp/c2e3b794-df63-4621-acd6-c932118126e4/bam_cleanup --min-quality 25 --exclude-flags 0x4 --samtools1x yes --rg-id ACTGGAC --rg SM:B00IK32 --rg LB:ACTGGAC --rg PU:Lane_1 --rg PL:ILLUMINA --rg PG:bwa cleanup
    - Return-code:    0
  - Command finished: samtools fixmate -O bam - -
    - Return-code:    1
Errors occured during processing!

Can you help me, please?
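One hedged guess at the cause: htslib refuses to write alignments whose coordinates do not fit in BAM's signed 32-bit fields. The sketch below (illustrative helper, not PALEOMIX code) scans a .fai index produced by `samtools faidx` for contigs that exceed that limit:

```python
BAM_MAX_POS = 2**31 - 1  # BAM stores positions as signed 32-bit integers


def oversized_contigs(fai_path):
    """Yield (name, length) for contigs too long for BAM coordinates."""
    with open(fai_path) as fai:
        for line in fai:
            name, length = line.split("\t")[:2]
            if int(length) > BAM_MAX_POS:
                yield name, int(length)
```

For GRCh37/hs37d5 no contig should exceed this limit, so if nothing is flagged, the input SAM records themselves would be worth inspecting for corrupt positional fields.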

Passing path to reference sequence containing spaces

Looks like at present you cannot pass a path to the reference sequence that contains spaces when using Bowtie2. The path is split on the space, and the subsequent parts of the path are treated as extra arguments.

I have tried many combinations of escaping and none work e.g.

Path: "/DATA/Reference\ genome/myref.fasta"
Path: "/DATA/Reference\\ genome/myref.fasta"
Path: "/DATA/Reference' 'genome/myref.fasta"
Path: /DATA/Reference\\ genome/myref.fasta

Is there a way to include a space in the path to the reference sequence?
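Until this is fixed, one workaround is to place (or symlink) the reference under a path without spaces and point the makefile at that path instead. A sketch (directory names illustrative):

```python
import os


def link_to_space_free_path(reference, link_dir):
    """Symlink a reference FASTA into a directory whose path has no
    spaces, and return the path of the symlink."""
    os.makedirs(link_dir, exist_ok=True)
    link = os.path.join(link_dir, os.path.basename(reference))
    if not os.path.islink(link):
        os.symlink(os.path.abspath(reference), link)
    return link
```

The makefile's Path entry would then reference the symlink, e.g. ~/refs/myref.fasta.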

Paleomix expecting gatk.jar to be in a specific folder?

One of my users tried to launch paleomix after a successful pip install and got the following error:

paleomix bam_pipeline run makefile.yaml
Reading makefiles ...
- Validating prefixes ...
Building BAM pipeline .
Running BAM pipeline ...
- Checking file dependencies ...
Errors detected during graph construction (max 20 shown):
Required file does not exist, and is not created by a node:
Filename: /home/genouest/ecobio/mollivier/install/jar_root/GenomeAnalysisTK.jar
Dependent node(s): <GATK Indel Realigner (aligning): 2 files in 'SAMPLE/BDD_METAZOA/Sample_1' -> 'SAMPLE.BDD_METAZOA.realigned.bam'>
<GATK Indel Realigner (training): 2 files in 'SAMPLE/BDD_METAZOA/Sample_1' -> 'SAMPLE/BDD_METAZOA.intervals'>
Required file does not exist, and is not created by a node:
Filename: /home/genouest/ecobio/mollivier/install/jar_root/picard.jar
Dependent node(s): <DepthHistogram: 2 files in 'SAMPLE/BDD_METAZOA/Sample_1' -> 'SAMPLE.BDD_METAZOA.depths'>
<MarkDuplicates: 3 files in 'SAMPLE/BDD_METAZOA/Sample_1/SL383339/Lane_1'>
<SequenceDictionary: 'prefixes/BDD_METAZOA.fasta'>
<Validate BAM: 'SAMPLE.BDD_METAZOA.realigned.bam'>
<Validate BAM: 'SAMPLE/BDD_METAZOA/Sample_1/SL383339.rmdup.collapsed.bam'>
and 11 more nodes ... 

I read paleomix/nodes/gatk.py and wonder why the search for gatk.jar is not replaced by a simple gatk call, which could be managed by a conda env installation...

Could you otherwise provide a step-by-step guide on how to install the various dependencies into the folders that paleomix expects?

Thank you.
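For what it's worth, the search location can usually be overridden without moving files: the bam_pipeline accepts a --jar-root option pointing at the folder holding the JARs (an assumption about the installed release; check `paleomix bam_pipeline run --help` for your version):

```
paleomix bam_pipeline run makefile.yaml --jar-root=/path/to/folder/with/jars
```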

skip adapter-trimming

If I want to map the reads as they are in the raw FASTQ, how should I do that?
In the makefile, I think I can use the following syntax (for single-end reads):

Single: path/raw.fastq.gz

so that paleomix considers them already trimmed. But could making trimming optional be an additional feature?
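For reference, a sketch of how such a pre-trimmed single-end lane would be declared, based on the syntax above (target/sample/library names are placeholders; the exact key names should be checked against the current makefile documentation):

```yaml
MyTarget:
  MySample:
    MyLibrary:
      Lane_1:
        # Reads listed under a read-type key are treated as already trimmed
        Single: path/raw.fastq.gz
```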

conda environment perpetually solving during installation

When using the instructions for installing paleomix in a conda environment, the "solving environment" step takes many hours and never finishes, despite many workarounds. Is there something I am missing? I would prefer to set up everything in a conda environment since I am doing remote HTC work.

BOWTIE2 errors in the pipeline

Dear Mikkel,

I hope you're well! I'm working on my MPhil dissertation, using paleomix to investigate ancient millet DNA. I've been able to trim adapters using the pipeline, but the alignment step is causing problems. I have been able to align with both Bowtie2 and BWA outside of paleomix. Below is my current makefile; there seems to be an issue with MinQuality: 30. If I put 30 in brackets, the error changes slightly to "Observed value: true". I'm new to bioinformatics and I'm not sure what I'm doing wrong! Best, Roz.

# -- mode: Yaml; --
# Default options.
# Can also be specific for a set of samples, libraries, and lanes,
# by including the "Options" hierarchy at the same level as those
# samples, libraries, or lanes below.
Options:
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33

  # Settings for trimming of reads, see AdapterRemoval man-page
  AdapterRemoval:
    # Set and uncomment to override default adapter sequences
    --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    # Some BAM pipeline defaults differ from AR defaults;
    # To override, change these value(s):
    --mm: 3
    --minlength: 25
    # Extra features enabled by default; change 'yes' to 'no' to disable
    --collapse: no
    --trimns: yes
    --trimqualities: yes

# Target name; all output files use this name as a prefix
Z2_align:
  # Sample name; used to tag data for segregation in downstream analyses
  Z2_align:
    # Library name; used to tag data for segregation in downstream analyses
    Z2_align:
      # Lane / run names and paths to FASTQ files
      Z2_align: Z2_S5_L001_R1_001.fastq.gz

    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 30
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
      # --trim5: 5
      # --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
      # --rg:
      #   - CN:SequencingCenterNameHere
      #   - DS:DescriptionOfReadGroup

Prefixes:
  millet_genome:
    Path: ~/align.bowtie2/GCA_002895445.2_ASM289544v2_genomic.fna.gz

Error:
ERROR Error reading makefiles: Makefile requirement not met at 'Z2_align :: Z2_align :: Bowtie2 :: MinQuality':
Expected value: (a non-empty string) or ({(a non-empty string) : (a non-empty string)})
Observed value: 30
Observed type: int
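The error path ("Z2_align :: Z2_align :: Bowtie2 :: MinQuality") suggests the Bowtie2 section has ended up nested under the target and sample, where the pipeline only expects sample/library entries. In the stock makefile template, aligner settings live under the Options hierarchy instead; a sketch of that layout (worth double-checking against a freshly generated makefile for the installed version):

```yaml
Options:
  Aligners:
    Program: Bowtie2
    Bowtie2:
      MinQuality: 30
      FilterUnmappedReads: yes
```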

Please switch to Python3

Hi,
Debian is currently debating the support of Python2, which is marked end-of-life in 2020 (only one release cycle away). I'd recommend switching to Python3 in the near future. I have had good experiences using 2to3 to port code from Python2 to Python3.
Kind regards, Andreas.

Paleomix bam_pipeline error

paleomix_eg
Hi, I tried to run paleomix using the bam_pipeline and I got this error (refer to the attached image). Could you help me troubleshoot the problem? Thank you.

pip install seems broken

Hello @MikkelSchubert

I'm trying to make a conda recipe for your tool here:
bioconda/bioconda-recipes#20339

The pip installation is broken (the issue comes from the coverage>=4.0 dependency). I think it would be possible to drop pip for dependency management and let conda do the work, since all of paleomix's dependencies are available in conda.

similar bams with rescale and non rescale

Hello,

I am using the paleomix BAM pipeline on target-capture DNA sequences from herbarium samples. I used the rescale (mapDamage: rescale) and no-rescale (mapDamage: no) options, but I am not seeing differences in the general numbers of the output BAM files with samtools coverage - would you say this could be due to there not being damage in my samples? The output graphs do not show patterns as clear as the ones in the documentation.

Thank you very much for any thoughts.
