
neat-genreads's People

Contributors

johanneskoester, joshfactorial, lsmainzer, meridith-e, morgantaschuk, mrweber2, niemasd, thondeboer, zstephens


neat-genreads's Issues

Coverage and VAF

I'm trying to simulate normal and tumor samples with high levels of coverage, but the coverage is much lower than the given value.

Normal

I generated samples with the command below, which resulted in two 6.2G FASTQ files, one 5.2G BAM file, and an empty VCF file.

for i in $(seq 1 100)
do
  ( python /neat-genreads/genReads.py \
    -r /refs/ensembl/ensembl_build_75.fa \
    -R 126 \
    -o /data/neat/normal \
    --pe 300 30 \
    -t /data/regions.bed \
    -c 100 \
    -M 0 \
    --bam --vcf \
    --job $i 100 ) &
done
wait
python /neat-genreads/mergeJobs.py \
    -i /data/neat/normal -o /data/neat/normal_merged -s /bin/samtools

Calling samtools bedcov with the region file showed that several thousand reads were aligning within each 2-7 Mbp region, and samtools mpileup showed that the BAM was averaging about 17 reads per bp (for a specified region).

# bedcov
...
1   XXX YYY 2885
1   XXX YYY 11359
1   XXX YYY 3256
...

# mpileup (within region with 11359 reads)
...
1   XXX N   16  NNNN    NNNN
1   XXX N   16  NNNN    NNNN
1   XXX N   18  NNNN    NNNN
1   XXX N   18  NNNN    NNNN
1   XXX N   17  NNNN    NNNN
...

Tumor

Similarly, I tried to generate a synthetic tumor file with the following:

for i in $(seq 1 80)
do
  ( python /neat-genreads/genReads.py \
    -r /refs/ensembl/ensembl_build_75.fa \
    -R 126 \
    -o /data/neat/tumor \
    --pe 300 30 \
    -t /data/regions.bed \
    -c 80 \
    -M 0 \
    --bam --vcf \
    --job $i 100 ) &
done

for i in $(seq 81 100)
do
  ( python /neat-genreads/genReads.py \
    -r /refs/ensembl/ensembl_build_75.fa \
    -R 126 \
    -o /data/neat/tumor \
    --pe 300 30 \
    -t /data/regions.bed \
    -c 20 \
    -M 0 \
    --bam --vcf \
    -v /refs/cosmic/cosmic.vcf.gz \
    --job $i 100 ) &
done
wait
python /neat-genreads/mergeJobs.py \
    -i /data/neat/tumor -o /data/neat/tumor_merged -s /bin/samtools

This resulted in two 11G FASTQ files, one 9.3G BAM file, and another empty VCF. The subsequent bedcov and mpileup showed better coverage:

# bedcov
1   XXX YYY 4887
1   XXX YYY 20249
1   XXX YYY 7470

# mpileup (for 20249)
1   XXX N   25  
1   XXX N   25  
1   XXX N   27  
1   XXX N   27  
1   XXX N   27

I jumped up the coverage to -c 1000 (still using 100 jobs), and (not unexpectedly) bash killed the jobs.

Questions

  1. Is this a valid way to simulate a mixed tumor / normal sample?
  2. Why was the file size of the tumor sample 2x the normal despite the same coverage parameter?
  3. Any thoughts on how to increase the coverage?
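Regarding question 3, a quick sanity check is to compare the observed depth against the read count the coverage parameter implies. This is a back-of-the-envelope sketch that assumes -c is the mean depth over the targeted regions and that --pe produces paired reads of the -R length; it says nothing about how NEAT divides work across --job shards.

```python
# Hedged sketch: read pairs needed for a given mean depth, assuming
# depth = (num_reads * read_len) / region_len and paired-end reads.
def expected_read_pairs(coverage, region_len, read_len):
    """Read pairs needed for mean depth `coverage` over `region_len` bp."""
    return int(coverage * region_len / (2 * read_len))

# e.g. a 5 Mbp targeted region at 100x with 2x126bp reads:
pairs = expected_read_pairs(100, 5000000, 126)
```

If bedcov reports only a few thousand reads per multi-Mbp region, the output is orders of magnitude below this estimate, which points at the coverage parameter being interpreted per job or per window rather than as the final merged depth.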

Empty Golden VCF

The commands below create empty VCFs. (In the tumor run, an input VCF was given only for the last 6 runs.)

I imagine this is because the normal run (a) had no VCF given and (b) no mutation rate (-M 0), while the tumor run (c) also had no mutation rate and (d) only included mutations present in the given VCF. It makes sense in this scenario that there is no need to create a VCF. Is this correct and the expected behavior?

python /neat-genreads/genReads.py \
  -R 126 -c 100 -M 0  --pe 300 30 \
  -o /data/neat/vaf_15/normal \
  -r /refs/ensembl/ensembl_build_75.fa \
  -t /data/regions.bed \
  --bam --vcf --job N 36

python /neat-genreads/genReads.py \
  -R 126 --pe 300 30 -c 100 -M 0 \
  -o /data/neat/vaf_15/tumor11 \
  -t /data/regions.bed \
  -r /refs/ensembl/ensembl_build_75.fa \
  -v /refs/exac/exac.vcf \
  --bam --vcf --job 1 36

Here's the directory output:

-rw-r--r-- 1 root root         821 Oct 13 12:24 tumor_golden.vcf.job01of36
-rw-r--r-- 1 root root           0 Oct 13 09:36 tumor_golden.vcf.job02of36
-rw-r--r-- 1 root root           0 Oct 13 09:30 tumor_golden.vcf.job03of36
-rw-r--r-- 1 root root           0 Oct 13 09:32 tumor_golden.vcf.job04of36
-rw-r--r-- 1 root root           0 Oct 13 09:29 tumor_golden.vcf.job05of36
-rw-r--r-- 1 root root           0 Oct 13 09:35 tumor_golden.vcf.job06of36
-rw-r--r-- 1 root root           0 Oct 13 09:32 tumor_golden.vcf.job07of36
-rw-r--r-- 1 root root           0 Oct 13 09:35 tumor_golden.vcf.job08of36
-rw-r--r-- 1 root root           0 Oct 13 09:29 tumor_golden.vcf.job09of36
-rw-r--r-- 1 root root           0 Oct 13 09:35 tumor_golden.vcf.job10of36
-rw-r--r-- 1 root root           0 Oct 13 09:34 tumor_golden.vcf.job11of36
-rw-r--r-- 1 root root           0 Oct 13 09:34 tumor_golden.vcf.job12of36
-rw-r--r-- 1 root root           0 Oct 13 09:35 tumor_golden.vcf.job13of36
-rw-r--r-- 1 root root           0 Oct 13 09:29 tumor_golden.vcf.job14of36
-rw-r--r-- 1 root root           0 Oct 13 09:33 tumor_golden.vcf.job15of36
-rw-r--r-- 1 root root           0 Oct 13 09:35 tumor_golden.vcf.job16of36
-rw-r--r-- 1 root root           0 Oct 13 09:29 tumor_golden.vcf.job17of36
-rw-r--r-- 1 root root           0 Oct 13 09:33 tumor_golden.vcf.job18of36
-rw-r--r-- 1 root root           0 Oct 13 09:30 tumor_golden.vcf.job19of36
-rw-r--r-- 1 root root           0 Oct 13 09:35 tumor_golden.vcf.job20of36
-rw-r--r-- 1 root root           0 Oct 13 09:36 tumor_golden.vcf.job21of36
-rw-r--r-- 1 root root           0 Oct 13 09:29 tumor_golden.vcf.job22of36
-rw-r--r-- 1 root root           0 Oct 13 09:30 tumor_golden.vcf.job23of36
-rw-r--r-- 1 root root           0 Oct 13 09:35 tumor_golden.vcf.job24of36
-rw-r--r-- 1 root root           0 Oct 13 09:34 tumor_golden.vcf.job25of36
-rw-r--r-- 1 root root           0 Oct 13 09:29 tumor_golden.vcf.job26of36
-rw-r--r-- 1 root root           0 Oct 13 09:35 tumor_golden.vcf.job27of36
-rw-r--r-- 1 root root           0 Oct 13 09:35 tumor_golden.vcf.job28of36
-rw-r--r-- 1 root root           0 Oct 13 09:29 tumor_golden.vcf.job29of36
-rw-r--r-- 1 root root 22125196019 Oct 13 15:06 tumor_merged_golden.bam
-rw-r--r-- 1 root root        3284 Oct 13 15:06 tumor_merged_golden.vcf

struct.error: ushort format requires 0 <= number <= USHRT_MAX

Hi,

I'm trying to use the genReads script and am running into the following error:

Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (150), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
index not found, creating one... 5.100 (sec)
Traceback (most recent call last):
  File "neat-genreads-master/genReads.py", line 791, in <module>
    main()
  File "neat-genreads-master/genReads.py", line 353, in main
    OFW = OutputFileWriter(OUT_PREFIX,paired=PAIRED_END,BAM_header=bamHeader,VCF_header=vcfHeader,gzipped=GZIPPED_OUT,jobTuple=(MYJOB,corrected_nJobs),noFASTQ=NO_FASTQ)
  File "/Users/paulhager/Documents/SmartPhase/Publication/Comparison/neat-genreads-master/py/OutputFileWriter.py", line 120, in __init__
    self.bam_file.write(n[0]+'\0')
  File "/Users/paulhager/Documents/SmartPhase/Publication/Comparison/neat-genreads-master/py/biopython_modified_bgzf.py", line 73, in write
    self._write_block(self._buffer[:65536])
  File "/Users/paulhager/Documents/SmartPhase/Publication/Comparison/neat-genreads-master/py/biopython_modified_bgzf.py", line 59, in _write_block
    bsize = struct.pack("<H", len(compressed) + 25) # includes -1
struct.error: ushort format requires 0 <= number <= USHRT_MAX

I'm using macOS High Sierra.

My command was:

python neat-genreads-master/genReads.py -r reference/human_g1k_v37.fasta -R 150 -o simNEAT/simulated.NEAT.R150.rg1kv37.c150.pe300_30 -c 150 --pe 300 30 --bam --vcf

Any ideas why it's giving me this error? From my limited tests, it seems to occur with the --vcf flag.
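For context on the error itself: the failing line packs the BGZF block size into an unsigned 16-bit field, so the error fires whenever a compressed block's size exceeds 65535 (USHRT_MAX). A minimal reproduction of just the struct behavior (the values here are illustrative, not taken from a real BGZF block):

```python
import struct

# BGZF stores each block's size in an unsigned 16-bit little-endian
# field ("<H"), so any value above 65535 raises exactly this error.
struct.pack("<H", 65535)        # fits in a ushort: fine

try:
    struct.pack("<H", 70000)    # too large for a ushort
except struct.error as exc:
    message = str(exc)          # "ushort format requires 0 <= number <= ..."
```

This suggests the writer is producing an over-sized block (e.g. a too-long header line written through the BGZF path) rather than anything wrong with your command.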

Circular dependency import

I think you have a circular dependency in your imports.

Using a fresh clone of the repo:

python genReads.py -r test.fasta -R 101 -o simulated_data
Traceback (most recent call last):
  File "genReads.py", line 38, in <module>
    from SequenceContainer      import SequenceContainer, ReadContainer, parseInputMutationModel
  File "/neat-genreads/py/SequenceContainer.py", line 10, in <module>
    from cigar import CigarString
ImportError: cannot import name CigarString

Also, any plans to make this an actual package with a setup.py, etc.?
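One common workaround for an import cycle, while waiting on a proper fix, is to defer one of the module-level imports into the function that needs it, so it resolves only after both modules have finished loading. The sketch below is illustrative of the pattern, not the project's actual fix; the deferred stdlib import stands in for the real `from cigar import CigarString`.

```python
# Sketch of breaking an import cycle by deferring an import to call time.
def make_cigar(ops):
    # Deferred import: runs at first call, after module loading is done,
    # so a module-level cycle never bites. (Placeholder for the real
    # cyclic import, e.g. `from cigar import CigarString`.)
    import itertools  # placeholder; unused beyond demonstrating placement
    return "".join("{}{}".format(n, op) for n, op in ops)

cigar = make_cigar([(85, "M"), (2, "I"), (63, "M")])  # "85M2I63M"
```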

Possible Error

In genReads.py, around lines 764-766:

                    if myChr not in inputRegions:
                        mutRateRegions[myChr] = [-1]
                        mutRateValues[myChr] = [0.0]

I think inputRegions is supposed to be mutRateRegions.

Masked reference genome Error

When I try to simulate reads for the masked version of A.thaliana genome, after some progress, I experience the following error:

Error: weight or value vector given to DiscreteDistribution() are 0-length.

Data set:
ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna_rm.toplevel.fa.gz

Parameters:
-R 101 --pe 300 30 -c 20 --vcf --bam

It does not happen with the unmasked version.

Generating paired tumor/normal sequence

Hi
Thanks to all the authors for such a well-designed tool!
I'm a novice in the Bioinformatics field. I want to generate paired tumor/normal sequence for testing variant callers but I have no idea what parameters should be set.
Also, based on the default Mutation Models, would it be possible to do this?

Bug in vcf-only generation

The following bug is present in 5b8bc84.

I want to generate a VCF file only. I am using the following command:

python $NEAT2_PATH/genReads.py -r "$HG38_CHR1" -R 50 -c 1 -o "$WORKDIR/common-vcf/" -t "$TARGET_BED" -to 0 --no-fastq --vcf --rng 0

The reference sequence is chr1 from HG38.
Target BED content is as follows: chr1 10021 914744 1.797693135e+308 0 0

I get a "variable referenced before assignment" error:

Only producing VCF output, that should speed things up a bit...
Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (50), rescaling model...
Using default gc-bias model.
Bypassing FASTQ generation...
found index ...
reading chr1... 60.868 (sec)
--------------------------------
generating vcf...
Traceback (most recent call last):
  File ".../neat-genreads2-wip/genReads.py", line 789, in <module>
    main()
  File ".../neat-genreads2-wip/genReads.py", line 616, in main
    if sum(coverage_dat[2]) >= LOW_COV_THRESH:
UnboundLocalError: local variable 'coverage_dat' referenced before assignment

Error distribution adjusted with -E lower than expected

I simulated ~1 million 2x150bp reads with rescaled error rates using the -E parameter. Using the golden BAM file, I ran samtools calmd to generate the MD and NM (edit distance to the reference) tags, because it doesn't contain them by default.

full set of options:

--pe 500 100 -E 0.1 -c 40 --bam -M 0 -p 1 --rng 1 -R 150

Here is a plot of the NM values for -E 0.1 (expecting a median of 150*0.1 = 15): the median is 6.

And for -E 0.3 (the maximum): the median is 10, but I would have expected 45.

This is using the default error model. Because -E does at least rescale the error rate at higher values, this isn't a bug per se, but it could be critically misleading. That is, if you are benchmarking a tool, it could mislead you into thinking the tool can tolerate much higher error rates than it can in reality.
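To make the reporter's expectation explicit: if -E set the realized per-base error rate directly, the per-read edit distance would be roughly Binomial(R, E) with mean and median near R*E. The sketch below only quantifies that expectation; it makes no claim about how NEAT actually rescales its default error model.

```python
import random

random.seed(1)

def simulated_nm_median(read_len, err_rate, n_reads=20000):
    """Median edit distance if each base erred independently at err_rate."""
    counts = sorted(
        sum(random.random() < err_rate for _ in range(read_len))
        for _ in range(n_reads)
    )
    return counts[n_reads // 2]

# If -E applied directly: R=150, E=0.1 gives a median near 15, and
# E=0.3 near 45 -- versus the observed medians of 6 and 10.
```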

Memory leak

In py/vcfFunc.py, line 88: for line in open(vcfPath,'r'): leaves an unclosed file handle. There is also a mixture of the context manager (with open(...) as fh:) and bare calls to open (fh = open(...)). The latter is fine so long as the handle is closed even when exceptions occur, but the former is better (e.g., refFunc.py:83,160).

Also, OutputFileWriter should open files as needed (just append) rather than keeping them open; opening a file is very low cost. Another option would be to add a context manager to OutputFileWriter, e.g., with ofw.openVcf(...) as fh:, so the files remain open only as long as you intend.
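A sketch of what the two suggestions could look like. Note `open_vcf` is a hypothetical method name, not the project's actual API; the point is the pattern of always-closed handles.

```python
import contextlib

def read_vcf_lines(vcf_path):
    # `with` guarantees the handle closes even if parsing raises,
    # fixing the vcfFunc.py-style leak.
    with open(vcf_path, "r") as fh:
        for line in fh:
            if not line.startswith("#"):
                yield line.rstrip("\n")

class OutputFileWriter(object):
    # Illustrative stand-in for the real class, showing the suggested
    # context-manager approach; `open_vcf` is hypothetical.
    def __init__(self, prefix):
        self.vcf_path = prefix + "_golden.vcf"

    @contextlib.contextmanager
    def open_vcf(self):
        # append mode: reopen per use instead of holding the handle open
        fh = open(self.vcf_path, "a")
        try:
            yield fh
        finally:
            fh.close()
```

Usage would then be `with ofw.open_vcf() as fh: fh.write(record)`, keeping the file open only for the duration of the write.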

Using the --no-fastq option with --bam

Hello Zach,

I have been generating simulated data using different options. I noticed that when I specify the --no-fastq option with --bam, the BAM file generated is very small and does not contain any data. However, when I run the very same command without --no-fastq, the BAM file generated looks good.

Shaheen

Attempting to insert variant out of window bounds

Hello,

(the following error is present on commit e96b43f)

I am trying to generate only the VCF file using the HG19 reference (chr1 only), given a target file (attached to the issue).

The command is:

python $PATH_TO_NEAT/genReads.py -r $PATH_TO_HG19_CHR1_FASTA -R 10 -o $OUTPUT_PREFIX -t target.bed -to 0 --vcf --no-fastq --rng 0

I get the following error:

generating vcf...
1% 2% 
Error: Attempting to insert variant out of window bounds:
(-984, 'C', 'T') --> blackList[0:1000]

target.bed.txt

Error if '--bam' missing

Getting an error that I think is due to not using --bam.

sampling reads...
Traceback (most recent call last):
  File "/neat-genreads/genReads.py", line 637, in <module>
    main()
  File "/neat-genreads/genReads.py", line 585, in main
    myRefIndex = indices_by_refName[bamHeader[0][RI][0]]
TypeError: 'NoneType' object has no attribute '__getitem__'
40.290 (sec)

Monoploid issues with VCF

Hi,

I was trying to run genReads.py and provided it with a VCF to add in variants. I also specified that this is a monoploid genome. None of my variants were present in the output. I then tried adjusting my VCF to have two alleles, and then set the ploidy to diploid. In this case, the variants were present in the output. So there seems to be something wrong with the monoploid settings and how it pulls in variants from a VCF file.

Thanks,
Jessica

Error adding read groups to golden bam file

After sorting the golden bam I encounter this message:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: WARNING: Read name A.tha_01-1-35477/2, No M or N operator between pair of I operators in CIGAR
	at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
	at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:253)
	at htsjdk.samtools.SAMRecord.getAlignmentEnd(SAMRecord.java:606)
	at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1575)
	at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2087)
	at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:811)
	at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
	at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
	at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
	at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
	at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:182)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
This is the read:

A.tha_01-1-35477/2 147 1 24491 70 85M2I1D6I57M = 24142 -499 AAAATTGGTTACCTATTGGTTTTGGTTGGTTGATTTATCTTACGTTATCAGCAAGCAGAAGTGATCCGTAATCAAACTTGTTTTCAATTGGACATTATTGTCCGAGGGGCGGTGAGATGGGACAGGACTTTTGGGATTCTCGAAGGTGGC 06FFFE>FC>F?FF?@FFFF9GAEGGGEGGGGD5FD)5FFGFGEFE@DGGFG3@FAGFGFGA>GG3GGGEFBFGG4G><;GG>G:F&GGGGBGGE2GFF6F9*B85GE6?FCEEBFG@GGGFE@3DCDGG;E<;FEDG:GEFA-3GG<FD

Also:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 9111, Read name A.tha_02-1-6417/1, CIGAR covers 148 bases but the sequence is 150 read bases
	at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
	at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:812)
	at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
	at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
	at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
	at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
	at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:182)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

there is no AF information in golden_vcf

Hi

I downloaded the latest NEAT from this site and tried to simulate reads containing variants from a VCF, but the output VCF contains no AF information. Here is the command I used:

python neat-genreads-master/genReads.py -r human_g1k_v37.fasta -t test.bed -v test.vcf --bam --vcf -R 50 -o test --pe 300 30 -M 0.1 -p 10

Runtime warning and NameError

Just came across the following error in the master branch. I haven't had a chance to track it down, but I'll update if I do.

Command

for i in $(seq 1 100)
do
  python genReads.py \
    -r /refs/ensembl/ensembl_build_75.fa \
    -R 126 \
    -o /data/neat/normal \
    --pe 300 30 \
    -t /data/regions.bed \
    -c 100 \
    -M 0 \
    --bam --vcf \
    --job $i 100
done

Error

A single command is giving the following error:

Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (126), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index /refs/ensembl/ensembl_build_75.fa.fai
enumerating all non-N regions in reference sequence...
reading 1... 33.343 (sec)
/neat-genreads/py/probability.py:86: RuntimeWarning: divide by zero encountered in log
  w_range = [np.exp(k*np.log(l) - l - logFactorial_list[k]) for k in k_range]
/neat-genreads/py/probability.py:86: RuntimeWarning: invalid value encountered in double_scalars
  w_range = [np.exp(k*np.log(l) - l - logFactorial_list[k]) for k in k_range]
Traceback (most recent call last):
  File "/neat-genreads/genReads.py", line 622, in <module>
    main()
  File "/neat-genreads/genReads.py", line 497, in main
    sequences = SequenceContainer(start,refSequence[start:end],PLOIDS,overlap,READLEN,[MUT_MODEL]*PLOIDS,MUT_RATE,coverage_dat,onlyVCF=ONLY_VCF)
  File "/neat-genreads/py/SequenceContainer.py", line 33, in __init__
    self.indelsToAdd = [n.sample() for n in self.ind_pois]
  File "/neat-genreads/py/probability.py", line 63, in sample
    return degenerateVal
NameError: global name 'degenerateVal' is not defined

-t targeted region bed file doesn't work

python genReads.py -r hg19.fa -R 150 -c 3000 -t target.bed -o sim1_norm --bam --vcf --pe 350 30

The program was still sampling reads from the whole genome, not limiting them to the target.bed regions.

Error while generating Tumor Normal Pair

Hello Zach,

I was trying to generate a tumor/normal pair using the method suggested in the tutorial. I used the same script with the slightest changes (file names, directories, path names, etc.). However, I am running into the following error (copied from the error log file):

Traceback (most recent call last):
  File "neat-genreads/genReads.py", line 150, in <module>
    MUT_MODEL = parseInputMutationModel(MUT_MODEL,1)
  File "/u/sbundhoo/neat-genreads/py/SequenceContainer.py", line 725, in parseInputMutationModel
    outModel[8][i][j][l] /= float(sum(outModel[8][i][j]))
ZeroDivisionError: float division by zero

When I ran the script with the files you provided in the tutorial folder, it seemed to work fine, though. I am not sure what may be causing this error.

Thanks,

Shaheen

IndexError in SequenceContainer.py line 67

Hello,

I'm trying to perform read generation on sample data from https://github.com/slzhao/QC3#sExample. I am using two FASTQ files (3_1.example.fastq, 3_2.example.fastq) and a BAM file, 3.example.bam.

I used genSeqErrorModel.py (providing both FASTQ files via the -i and -i2 options) to generate a sequencing error model.
I am using hg38 reference file truncated to contain only chr1.
I am also using a simple .bed file to simulate targeted sequencing:
chr1 10021 914744 1.797693135e+308 0 0

I am trying to generate single-end read data, with different values of read length and coverage. The command looks as follows:

python /path/to/neat-genreads/genReads.py -r '/path/to/hg38/chr1.fa' -R 76 -c 4 -o '/path/to/out' -t '/path/to/target_region.bed' -to 0 -e '/path/to/sequencing_error_model.pickle' --bam --vcf --rng 0

For read length of 125, read generation executes successfully (for coverage of 4X, and for coverage of 10X).
For read length of 76, read generation fails (both for 4X and 10X coverage), with an exception:

Using default gc-bias model.
found index /path/to/hg38/chr1.fa.fai
reading chr1... 54.779 (sec)
--------------------------------
sampling reads...
1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% Traceback (most recent call last):
  File "/path/to/neat-genreads/genReads.py", line 749, in <module>
    main()
  File "/path/to/neat-genreads/genReads.py", line 550, in main
    sequences.update(start,refSequence[start:end],PLOIDS,overlap,READLEN,[MUT_MODEL]*PLOIDS,MUT_RATE)
  File "/path/to/neat-genreads/py/SequenceContainer.py", line 233, in update
    self.init_basicVars(xOffset, sequence, ploidy, windowOverlap, readLen)
  File "/path/to/neat-genreads/py/SequenceContainer.py", line 67, in init_basicVars
    self.blackList[p][-self.winBuffer]   = 3
IndexError: index -76 is out of bounds for axis 0 with size 75

Threads exit when given VCF

For some reason, when I generate reads using a VCF, most of the threads exit silently a few minutes after starting. This seems to happen before the reads are assigned parts of the genome. It is not always the same thread, nor is it always the same number of threads (8-10 still remain).

Command:

REGIONS="/data/regions.bed"
REFERENCE="/refs/ensembl/ensembl_build_75.fa"
VARIANTS="/scripts/cosmic_fix.vcf"
OUT=/data
VAF=15

TUMOR_PRCNT=$(( ${VAF} * 2 ))
G_COV=$(echo "scale=2; (100 - ${TUMOR_PRCNT}) * 4" | bc -l | xargs printf "%.0f")
S_COV=$(echo "scale=2; ${TUMOR_PRCNT} * 4" | bc -l | xargs printf "%.0f")

G_PREFIX=${OUT_DIR}/tumor_g
S_PREFIX=${OUT_DIR}/tumor_s

  for i in $(seq 1 ${MAX_JOB})
  do
    set -x
    python /neat-genreads/genReads.py \
      -r "${REFERENCE}" \
      -R 126 \
      -o "${S_PREFIX}" \
      --pe 300 30 \
      -t "${REGIONS}" \
      -c ${S_COV} \
      -M 0 \
      -v "${VARIANTS}" \
      --vcf --bam \
      --job $i ${MAX_JOB} 2>&1 | tee ${S_PREFIX}_${i}.log &
      set +x
  done
  wait

  for i in $(seq 1 ${MAX_JOB})
  do
    set -x
    python /neat-genreads/genReads.py \
      -r "${REFERENCE}" \
      -R 126 \
      -o ${G_PREFIX} \
      --pe 300 30 \
      -t "${REGIONS}" \
      -c ${G_COV} \
      -M 0 \
      --vcf --bam \
      --job $i ${MAX_JOB} 2>&1 | tee ${G_PREFIX}_${i}.log &
    set +x
  done
  wait

Output (before fail)

Using default sequencing error model.
Warning: Read length of error model (101) does not match -R value (126), rescaling model...
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index /refs/ensembl/ensembl_build_75.fa.fai
--------------------------------
reading input VCF...

Edit: This may be due to memory requirements. parseVCF contains the unclosed open() mentioned in a different issue; however, this still occurs with that fixed. The VCF file I'm using is 800MB.

No reverse strand?

After aligning and SNP calling using mpileup, I realized no (-) strand reads were being produced (reverse complement of the provided fasta). Is there an option for this?

Thanks

Any way to compare bam?

Hi,

Just wondering if there's any option to compare BAM files. I tried to write a script for that purpose. With the NEAT version of the FASTQ, there is a huge difference (in the absolute distance of mapped positions) between the golden BAM given by NEAT and the BAM from bwa. But with ART-simulated FASTQ as input, the BAM from bwa is very close to the true BAM.

Thanks a lot!

-Han

Issue with targeted region simulation

Hi folks,

First of all, many thanks for providing this simulator.

I am trying to generate targeted region reads and I get the following error:

reading chr1... 39.328 (sec)
--------------------------------
sampling reads...
Traceback (most recent call last):
  File "../genReads.py", line 662, in <module>
    main()
  File "../genReads.py", line 508, in main
    tCov = not(bisect.bisect(inputRegions[refIndex[RI][0]],rInd)%2) or not(bisect.bisect(inputRegions[refIndex[RI][0]],rInd+GC_WINDOW_SIZE)%2)
KeyError: 'chr1'

Would be grateful if you could please help me to resolve this.

Best,
MH

--pe parameter question

Hi zstephens,

I am new to simulation; I came across your paper in the PLOS ONE journal,

so I decided to try your simulator to generate simulated normal/tumor data to evaluate CNV tools.

Regarding the --pe parameter, can I interpret it as the "mean insert size"?

In our lab's real data, the expected insert size is 500 bp,

so can I set this parameter to "--pe 500 30"?

Many thanks!
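For what it's worth, the log line "Using artificial fragment length distribution. mean=300, std=30" seen in other issues suggests --pe <mean> <std> parameterizes the full fragment (outer insert) length, not the inner gap between mates. If your lab's "expected insert size" of 500 bp means the full fragment, --pe 500 30 would be the matching setting; the sketch below just quantifies what that implies under that reading.

```python
import random

random.seed(0)

# Assuming --pe 500 30 draws fragment lengths from roughly N(500, 30):
read_len = 150
frags = [random.gauss(500, 30) for _ in range(10000)]
mean_frag = sum(frags) / len(frags)          # ~500 bp per fragment
mean_inner_gap = mean_frag - 2 * read_len    # ~200 bp between mate ends
```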

Parsing errors with samtools

when using the --bam options I keep getting the following error with the resultant file:

[E::bam_read1] CIGAR and query sequence lengths differ for simulatedReads-FN677756.1-74597/1

on samtools 1.6

I'm pretty sure this is a recent thing, because it seemed to work before, though perhaps on another version of samtools.

Something went wrong

At higher coverage parameters (e.g., -c 500), I am getting the following quite often:

Error: Something went wrong!
(14487, 'TCC', 'TTT') TTC

from this command

  for i in $(seq 1 10)
  do
    ( python /neat-genreads/genReads.py \
      -r ${REFERENCE} \
      -R 126 \
      -o "${prefix}" \
      --pe 300 30 \
      -t "${REGIONS}" \
      -c 500 \
      -M 0 \
      --bam --vcf \
      --job $i 10 ) &
  done

I am unsure what the exact error is, but since the script exits when this occurs, the resulting files are incomplete.

-rw-r--r-- 1 root root 151M Oct 12 16:55 normal_golden.bam.job01of10
-rw-r--r-- 1 root root 264M Oct 12 17:06 normal_golden.bam.job03of10
-rw-r--r-- 1 root root  45M Oct 12 16:43 normal_golden.bam.job04of10
-rw-r--r-- 1 root root 172M Oct 12 16:57 normal_golden.bam.job05of10
-rw-r--r-- 1 root root 164M Oct 12 16:56 normal_golden.bam.job06of10
-rw-r--r-- 1 root root 1.1M Oct 12 16:39 normal_golden.bam.job07of10
-rw-r--r-- 1 root root  85M Oct 12 16:48 normal_golden.bam.job08of10
-rw-r--r-- 1 root root  30M Oct 12 16:42 normal_golden.bam.job09of10
-rw-r--r-- 1 root root 137M Oct 12 16:53 normal_golden.bam.job10of10
-rw-r--r-- 1 root root  821 Oct 12 16:55 normal_golden.vcf.job01of10
-rw-r--r-- 1 root root 843K Oct 12 17:06 normal_golden.vcf.job03of10
-rw-r--r-- 1 root root    0 Oct 12 16:38 normal_golden.vcf.job04of10
-rw-r--r-- 1 root root    0 Oct 12 16:38 normal_golden.vcf.job05of10
-rw-r--r-- 1 root root    0 Oct 12 16:38 normal_golden.vcf.job06of10
-rw-r--r-- 1 root root    0 Oct 12 16:38 normal_golden.vcf.job07of10
-rw-r--r-- 1 root root    0 Oct 12 16:38 normal_golden.vcf.job08of10
-rw-r--r-- 1 root root  89K Oct 12 16:42 normal_golden.vcf.job09of10
-rw-r--r-- 1 root root  16K Oct 12 16:53 normal_golden.vcf.job10of10
-rw-r--r-- 1 root root 1.7G Oct 12 17:09 normal_merged_golden.bam
-rw-r--r-- 1 root root 947K Oct 12 17:09 normal_merged_golden.vcf
-rw-r--r-- 1 root root 2.0G Oct 12 17:07 normal_merged_read1.fq
-rw-r--r-- 1 root root 2.0G Oct 12 17:08 normal_merged_read2.fq
-rw-r--r-- 1 root root 180M Oct 12 16:55 normal_read1.fq.job01of10
-rw-r--r-- 1 root root 312M Oct 12 17:06 normal_read1.fq.job03of10
-rw-r--r-- 1 root root  53M Oct 12 16:43 normal_read1.fq.job04of10
-rw-r--r-- 1 root root 203M Oct 12 16:57 normal_read1.fq.job05of10
-rw-r--r-- 1 root root 193M Oct 12 16:56 normal_read1.fq.job06of10
-rw-r--r-- 1 root root 1.3M Oct 12 16:39 normal_read1.fq.job07of10
-rw-r--r-- 1 root root 100M Oct 12 16:48 normal_read1.fq.job08of10
-rw-r--r-- 1 root root  35M Oct 12 16:42 normal_read1.fq.job09of10
-rw-r--r-- 1 root root 165M Oct 12 16:53 normal_read1.fq.job10of10
-rw-r--r-- 1 root root 180M Oct 12 16:55 normal_read2.fq.job01of10
-rw-r--r-- 1 root root 312M Oct 12 17:06 normal_read2.fq.job03of10
-rw-r--r-- 1 root root  53M Oct 12 16:43 normal_read2.fq.job04of10
-rw-r--r-- 1 root root 203M Oct 12 16:57 normal_read2.fq.job05of10
-rw-r--r-- 1 root root 193M Oct 12 16:56 normal_read2.fq.job06of10
-rw-r--r-- 1 root root 1.3M Oct 12 16:39 normal_read2.fq.job07of10
-rw-r--r-- 1 root root 100M Oct 12 16:48 normal_read2.fq.job08of10
-rw-r--r-- 1 root root  35M Oct 12 16:42 normal_read2.fq.job09of10
-rw-r--r-- 1 root root 165M Oct 12 16:53 normal_read2.fq.job10of10

How to run genReads.py with multiple threads using --job?

Hi Zachary,
Thank you for making this! Would it be possible to make the documentation clearer about running genReads.py with multiple threads? I couldn't quite understand what arguments the --job option expects in order to achieve this.

Error indexing golden.bam

When I try to build the index of the golden.bam I encounter an error:

java -jar $PICARD BuildBamIndex I=golden.bam O=golden.bam.bai
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. @RG line missing SM tag. Line:
@RG	ID:NEAT; File /home/sequentia1/Alvaro/todo/ggg/masked_2n_control_golden.bam; Line number 9
	at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:265)
	at htsjdk.samtools.SAMTextHeaderCodec.access$200(SAMTextHeaderCodec.java:43)
	at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.requireTag(SAMTextHeaderCodec.java:346)
	at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(SAMTextHeaderCodec.java:175)
	at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:108)
	at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:667)
	at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
	at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
	at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
	at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:208)
	at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:137)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

When I try to correct it by AddOrReplaceReadGroups, there is another error:

java -jar $PICARD AddOrReplaceReadGroups I=golden.bam O=golden_RG.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20 

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. @RG line missing SM tag. Line:
@RG	ID:NEAT; File /home/sequentia1/Alvaro/todo/ggg/masked_2n_control_golden.bam; Line number 9
	at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:265)
	at htsjdk.samtools.SAMTextHeaderCodec.access$200(SAMTextHeaderCodec.java:43)
	at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.requireTag(SAMTextHeaderCodec.java:346)
	at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(SAMTextHeaderCodec.java:175)
	at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:108)
	at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:667)
	at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
	at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
	at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
	at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:152)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
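
One workaround for the missing SM tag is to patch the offending @RG header line before indexing (e.g. dump the header with `samtools view -H`, edit it, and apply it with `samtools reheader`). The helper below is a minimal, hypothetical sketch of the string fix itself; the default sample name "NEAT" is an assumption, not anything NEAT itself writes.

```python
# Hypothetical sketch: append an SM tag to an @RG header line that lacks one.
# The default sample name "NEAT" is an assumption for illustration only.
def fix_rg_line(header_line, sample="NEAT"):
    if header_line.startswith("@RG") and "SM:" not in header_line:
        return header_line + "\tSM:" + sample
    return header_line
```

Applied line by line to the dumped header, the patched text can then be fed back to the BAM with `samtools reheader`.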


Exact Coverage

Hi,

How can I determine the exact coverage, and not just the average, using your tool?

Thanks for your attention!
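
For context, -c targets an *average* depth; the realized per-base depth fluctuates. A back-of-envelope estimate of how many fragments a target mean depth implies (illustrative only, not NEAT's internal logic):

```python
# Rough sizing sketch: fragments needed for a desired *mean* depth.
def reads_for_coverage(coverage, region_len, read_len, paired=True):
    bases_needed = coverage * region_len
    bases_per_fragment = read_len * (2 if paired else 1)
    return int(round(bases_needed / float(bases_per_fragment)))
```

If an exact, uniform depth is required, it generally has to be enforced by downsampling after alignment rather than at simulation time.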

The -t parameter doesn't seem to be working

Hey there!
I run the following line

python genReads.py -r /shared/seq/hg38.fa -R 126 -o /home/eap/simulated_reads/phenyl2 -t /home/eap/docencia/biocomp/phenylk.bed

but I get reads from the whole genome! Any help? I attach the bed file.
phenylk.bed.txt

ZeroDivisionError: float division by zero when running computeGC.py

Dear Zach,
I trying to computeGC.py and I have generated a genomecov file from bedtools as fellow:
bedtools genomecov -d -ibam id.sorted.bam -g ${ref} > id.genomecov
then I took id.genomecov to computeGC by neat:
python ${neat}utilities/computeGC.py -r ${ref} -i id.genomecov -w 50000 -o id_GC_models.p
I got this result:
line 103, in
avgCov = allMean/float(runningTot)
ZeroDivisionError: float division by zer

Any suggestions on how I can fix it or maybe change the sliding window? Any help is appreciated !
Thank you in advance.

Regards
Shatha
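
The crash means runningTot ended up 0, i.e. no windows were accumulated from the genomecov file (a common cause is a chromosome-name mismatch between the reference FASTA and the BAM). A hedged sketch of the guard; allMean/runningTot mirror the names in the report, but this is not the actual computeGC.py code:

```python
def average_coverage(all_mean, running_tot):
    # Fail with a readable message instead of ZeroDivisionError when the
    # genomecov input contributed no windows at all.
    if running_tot == 0:
        raise ValueError("no coverage windows read; check that the genomecov "
                         "chromosome names match the reference FASTA")
    return all_mean / float(running_tot)
```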

Coverage too low

The coverage across most of the sites specified in the given BED file is extremely low, typically 10 or less.

This is not always true, as some regions are extremely saturated.

VCF file from COSMIC

Hi Zachary,
Thank you for making this! Is it possible to make it clearer in the documentation how to run genReads.py with a VCF? What is the exact VCF format that is expected? I am trying to use a VCF from COSMIC, however I am getting errors, presumably due to missing GT/AF fields. I did modify the parseVCF function to set "INCLUDE_FAIL = True", however I am still getting errors. Any suggestions on how to modify the COSMIC VCF for compatibility would be greatly appreciated!

Example from COSMIC VCF:

##fileformat=VCFv4.1
##source=COSMICv77
##reference=GRCh37
##fileDate=20160407
##comment="Missing nucleotide details indicate ambiguity during curation process"
##comment="URL stub for COSM ID field (use numeric portion of ID)='http://grch37-cancer.sanger.ac.uk/cosmic/mutation/overview?id='"
##comment="REF and ALT sequences are both forward strand
##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">
##INFO=<ID=STRAND,Number=1,Type=String,Description="Gene strand">
##INFO=<ID=CDS,Number=1,Type=String,Description="CDS annotation">
##INFO=<ID=AA,Number=1,Type=String,Description="Peptide annotation">
##INFO=<ID=CNT,Number=1,Type=Integer,Description="How many samples have this mutation">
##INFO=<ID=SNP,Number=0,Type=Flag,Description="classified as SNP">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
7   55242448    COSM3734669 A   G   .   .   GENE=EGFR;STRAND=+;CDS=c.2218A>G;AA=p.I740V;CNT=1
7   55242449    COSM85892   T   TTCCCGTCGCTATTAAAAT .   .   GENE=EGFR;STRAND=+;CDS=c.2219_2220ins18;AA=p.K745_E746insIPVAIK;CNT=1
7   55242449    COSM52934   T   C   .   .   GENE=EGFR;STRAND=+;CDS=c.2219T>C;AA=p.I740T;CNT=1
7   55242463    COSM18420   AAGG    A   .   .   GENE=EGFR;STRAND=+;CDS=c.2234_2236delAGG;AA=p.E746delE;CNT=2
7   55242463    COSM1190791 AAGGAATTAAGAGAAG    A   .   .   GENE=EGFR;STRAND=+;CDS=c.2234_2248del15;AA=p.K745_A750>T;CNT=4
7   55242464    COSM13549   AGGAATTAAGAGAAGCAA  AAG .   .   GENE=EGFR;STRAND=+;CDS=c.2235_2251>AG;AA=p.E746_T751>A;CNT=1
7   55242464    COSM255152  A   AAACTCCCGTCGCTATCAA .   .   GENE=EGFR;STRAND=+;CDS=c.2234_2235ins18;AA=p.K745_E746insTPVAIK;CNT=1
7   55242464    COSM13550   AGGAATTAAGAGAAG AAATTC  .   .   GENE=EGFR;STRAND=+;CDS=c.2235_2248>AATTC;AA=p.E746_A750>IP;CNT=3
7   55242464    COSM24869   AGGAATTAAGAGAAGCAAC A   .   .   GENE=EGFR;STRAND=+;CDS=c.2235_2252del18;AA=p.E746_T751delELREAT;CNT=1
7   55242464    COSM6223    AGGAATTAAGAGAAGC    A   .   .   GENE=EGFR;STRAND=+;CDS=c.2235_2249del15;AA=p.E746_A750delELREA;CNT=1039
7   55242464    COSM51504   A   AAATTCCCGTCGCTATCAA .   .   GENE=EGFR;STRAND=+;CDS=c.2234_2235ins18;AA=p.K745_E746insIPVAIK;CNT=6

Regards,
Palak Sheth

This job id has no regions to process, exiting...

I keep getting this error, but I cannot tell if this is actually a problem. Looking at the code, the regions should be spread evenly across the threads, and if there are too many threads, the thread count should be lowered automatically.

This job id has no regions to process, exiting...

I wonder--is this check working? (And does it happen before the reference is loaded in memory?)
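
For reference, a round-robin split like the one described would look roughly like this (illustrative sketch with hypothetical names; NEAT's actual partitioning may differ). Any job index past the number of regions legitimately receives an empty slice and should exit early:

```python
def regions_for_job(regions, job_id, n_jobs):
    """Round-robin subset of regions for a 1-based job_id out of n_jobs."""
    return regions[job_id - 1::n_jobs]
```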

The parameter setting

Thank you so much for the prompt reply!

I'd also like to ask:

(1) Why is the read length in the "Usage" description different from the command line? Is it a typo or a tricky thing I should be cautious dealing with?
For example, I want to set read length with 150 bps, but in the command line -R should be 151?
image

(2) After generating the sequence with this command, I got the golden VCF:
python genReads.py -r chr21.fa -R 150 -o /home/phoebe/neat-genreads/0728_chr21_simulated --bam --vcf --pe 300 30

What does the WP in the column INFO stand for?

image

Memory error (coverage/threads optimization)

I've generally been setting the number of threads equal to the coverage, but if I set the coverage too high (200x), this results in MemoryErrors.

I need about 100x coverage (a single sample with -c 100 nets me about 30-40x). I've been generating two sets of samples at 100x and combining them, but I'm not sure what sorts of side effects this may have, plus it takes 2x the time.

Do you have an estimate of the optimal coverage / thread settings?
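
Since coverage from independent runs is additive, merging k runs at roughly c-x each yields about k*c-x in total. A trivial sizing sketch, taking the empirical per-run yield (e.g. the 30-40x observed above) as an input assumption:

```python
import math

def runs_needed(target_cov, per_run_cov):
    """How many independent runs must be merged to reach target coverage."""
    return int(math.ceil(target_cov / float(per_run_cov)))
```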

Bug in genSeqErrorModel.py -i2 option

Hi,

I was using the genSeqErrorModel.py script with two input FASTQ files (using the -i2 option), and what caught my attention is that the script actually reads the same file twice (looking at the reading xxx.fastq log line).

A brief investigation of the script showed that the parseFQ function takes an inf argument, but what it actually uses is the module-level INF variable, which corresponds to the first FASTQ file.
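
The bug pattern, reduced to a toy example (hypothetical names; the real fix is simply for parseFQ to read from its inf parameter instead of the global INF):

```python
INF = ["r1_a", "r1_b"]        # records from the first FASTQ (-i)

def parse_fq_buggy(inf):
    return list(INF)          # bug: ignores the argument, re-reads -i

def parse_fq_fixed(inf):
    return list(inf)          # fix: use the handle that was passed in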

Is there a galaxy tool-shed version of this somewhere?

I would really like to be able to use your tool inside of Galaxy, but before I embark on a custom install, I was wondering if there is a version already out there.

I tried the usual places (Tool Shed, Google) but did not find anything, so I am guessing I should start?

Reversed base qualities in the second generated FASTQ file

Hello,

I've performed reads generation on a sample data coming from https://github.com/slzhao/QC3#sExample.

I have used two FASTQ files (3_1.example.fastq, 3_2.example.fastq) and a BAM file 3.example.bam. I want to generate paired-end data.

I have used the following utils to provide additional models for the generation:

  • computeFraglen.py
  • genSeqErrorModel.py (providing both FASTQ files using -i and -i2 options)

Then, I've run FastQC on both input and output (generated) files.
For the first generated FASTQ file, results were quite similar to the original FASTQ file.
However, for the second generated FASTQ file, the quality values appear to be reversed (with respect to position in read); see the attached FastQC graphs below:

Original:
reversed_base_qualities_in_the_second_generated_fastq_file_original
Generated:
reversed_base_qualities_in_the_second_generated_fastq_file_generated

Is it a bug in neat-genreads? Or is this behavior expected? Am I missing some knowledge about how R2 reads are stored?
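
For context: if R2 qualities were written in reference (forward-strand) orientation rather than sequencing orientation, reversing each read's quality string would produce exactly the mirrored FastQC profile shown above. A one-line sketch of that transform (illustrative; not a claim about NEAT's internals):

```python
def to_read_orientation(qual):
    # Reverse a per-base quality string, e.g. reference -> read orientation.
    return qual[::-1]
```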

ZeroDivisionError when specifying -to option to be 0

Hello Zach,

I was generating simulated reads using the following command:
"python /u/sbundhoo/neat-genreads/genReads.py -r /.mounts/labs/PDE/data/reference/hg19/fasta/UCSC/chr21.fa -R 101 -o /scratch2/users/sbundhoo/02_02 --bam --vcf --gz --pe 300 30 -c 50 -to 0 -t /u/sbundhoo/bed_files/sorted_offtarget500b_chr21_all.bed"

However, when I specify -to to 0, I get an error as such:
"ZeroDivisionError: float division by zero"

Isn't the off-target coverage option supposed to work with 0 as the argument?

Thanks,

Shaheen
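
A hedged guess at the failure mode (hypothetical names, not NEAT's actual code): with -to 0, some per-read probability ends up dividing by a zero total, so the zero case needs an explicit early-out that simply emits no off-target reads:

```python
def offtarget_read_fraction(offtarget_cov, ontarget_cov):
    total = offtarget_cov + ontarget_cov
    if offtarget_cov == 0 or total == 0:
        return 0.0            # emit no off-target reads at all
    return offtarget_cov / float(total)
```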

PEP 8 and Python standards

For the most part the code is not too difficult to follow, but readability could be improved by limiting the line length to 80 columns and using better whitespace. I would recommend using a linter and following PEP 8 standards.

This is less an issue and more of a suggestion. Whether or not these recommendations are implemented is, of course, entirely up to you. Either way, thank you for taking the time to write this software!

For example, line 58 in probability.py:

# current
return str(self.weights)+' '+str(self.values)+' '+self.method

# better
return str(self.weights) + ' ' + str(self.values) + ' ' + self.method

# best
return '{} {} {}'.format(self.weights, self.values, self.method)

# another option (wrap in str, since the attributes may not be strings)
return ' '.join(map(str, [self.weights, self.values, self.method]))

Similarly, lines 22-24 in probability.py:

if len(self.values) != len(self.weights):
    print '\nError: length and weights and values vectors must be the same.\n'
    exit(1)

could be more pythonic by using exceptions:

if len(self.values) != len(self.weights):
    msg = 'Length and weight vectors must be the same length'
    raise ValueError(msg)

(also, take a look at the logging package instead of using print).

Finally, tabs. I personally prefer using spaces (pep8 also recommends spaces), but the decision is ultimately yours; however, please try to be consistent. For example, lines 98-101 in SequenceContainer.py are indented using tabs up to the normal indentation, but the arguments are aligned to the [ using spaces.
