bedtools2's People

Contributors

38, agordon, alanhoyle, arjanvandervelde, arq5x, brentp, bwlang, cbrueffer, charles-plessy, daler, davidrichardson, ghuls, gtamazian, hisakatha, jakebiesinger, jayhesselberth, jmarshall, lbthrice, lindenb, lukegoodsell, mvdbeek, nkindlon, portah, rmzelle, ryan-williams, skotchandsoda, timflutre-perso, wookietreiber, xhsien, yesimon

bedtools2's Issues

multicov: Could not open input BAM files.

For certain BED files, multicov outputs "Could not open input BAM files." when given two BAM files. However, when run with either of these files on its own, it runs without problems. Below is a script to reproduce the problem, tested on ed71c8e.

The BAMs show no errors with samtools flagstat $BAM.

wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam
wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27me3bUcdAlnRep1.bam
echo $'chr1\t100\t200' | ./bin/bedtools multicov -bams \
                wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam  \
                wgEncodeSydhHistoneMcf7H3k27me3bUcdAlnRep1.bam \
                -bed -

Make BAM fields accessible by fieldNum.

This will allow the map and merge tools to apply operations to numbered columns. Relatively straightforward for the core BAM fields. Optional BAM tags could be tricky, especially with the BamTools tag interface. One solution is to start with support just for the core fields.

intersectBed with standard input, and -sorted

Hi,

I am doing an intersectBed with two files, bedA.bed and bedB.bed. Both are presorted, and one is being piped through standard input. I notice the following two commands give different results:

  1. cat bedA.bed | intersectBed -sorted -a stdin -b bedB.bed
  2. cat bedA.bed | intersectBed -sorted -a bedB.bed -b stdin

Namely, option #2 seems to give the 'wrong' answer. Running without the -sorted option gives the correct answer no matter which is in -a and -b.

I realize this may be because of how the memory sweep algorithm works, but could you put this in the intersectBed documentation directly as a warning?

fastaFromBed 0-based retrieval question.

Hi Guys

This might be a stupid question, but in UCSC (a 0-based coordinate system browser), the coordinates:

chr14:79498951-79499010

correspond to the sequence:

CTAAGCCACACCATAACTGACTTCTAGGCATTCATCTTTCTTCCACTTAAATTCATTCTC

however, using fastaFromBed to retrieve this sequence from the hg19 assembly fasta file with this command:

bedtools getfasta -name -tab -bed coordinates.bed -fi hg19.fasta -fo output_coords.txt

trims the first base from the sequence:

TAAGCCACACCATAACTGACTTCTAGGCATTCATCTTTCTTCCACTTAAATTCATTCTC

so in order to get the correct sequence I need to subtract 1 from the start coordinate.

Since both bedtools and UCSC use 0-based coordinates, why is this?

Thank you for your help.

Duarte

PS: Using bedtools v2.17.0
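
For readers comparing the two conventions, here is a minimal sketch of the conversion. It assumes that a position typed into the UCSC browser is 1-based and fully closed, while a BED interval is 0-based and half-open. The function name `ucsc_to_bed` is hypothetical and not part of bedtools.

```python
def ucsc_to_bed(position):
    """Convert a 'chrom:start-end' string (assumed 1-based, closed)
    into a BED-style (chrom, start, end) tuple (0-based, half-open)."""
    chrom, span = position.split(":")
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    # Only the start shifts: a closed 1-based end equals a half-open 0-based end.
    return chrom, start - 1, end


chrom, start, end = ucsc_to_bed("chr14:79498951-79499010")
print(chrom, start, end)   # chr14 79498950 79499010
print(end - start)         # 60, the length of the sequence shown above
```

Under that assumption, subtracting 1 from the start is the expected conversion rather than a workaround.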

intersect no longer reports original data with -wa

When running bedtools intersect -a fileA.bed -b fileB.bed -wa, the output forces proper BED format for column 6 (strand) instead of reporting the contents of the original file.

For example if the fileA input region were:
chr1 16110 16390 CTCF 227 K562,HeLa-S3,MCF-7,HCPEpiC

The output if it overlapped a feature is this:
chr1 16110 16390 CTCF 227 .

whereas in 2.17 it was this:
chr1 16110 16390 CTCF 227 K562,HeLa-S3,MCF-7,HCPEpiC

Allow to read whole sequence names in FASTA files

Currently, it seems that, for sequence names in FASTA files, only the first word after ">" is taken into account. The code even contains a comment "just write the first component of the name, for compliance with other tools".

Would you agree to change this behavior? Or add an option for it? Would it break many things in the rest of the code? (And out of curiosity, which tools are referred to in the comment?)

Although it is easy to let the user fix this, it's a hassle to do that for genome reference sequences containing thousands (or more) of contigs.
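
To make the two behaviors concrete, here is a small sketch. The `full` switch is a hypothetical option for illustration only, not an existing bedtools flag, and the header line is made up.

```python
def fasta_name(header_line, full=False):
    """Return the sequence name from a FASTA header line.

    full=False keeps only the first whitespace-delimited word (the behavior
    described above); full=True keeps everything after '>'.
    """
    text = header_line.rstrip("\r\n")
    if text.startswith(">"):
        text = text[1:]
    if full:
        return text
    parts = text.split()
    return parts[0] if parts else ""


header = ">contig_42 length=1500 source=assembly\n"  # made-up header
print(fasta_name(header))             # contig_42
print(fasta_name(header, full=True))  # contig_42 length=1500 source=assembly
```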

groupBy ops difference

Hi There,

I am looking at the groupBy usage and was wondering how to use it with an operation that is not on the list of supported ops, namely "difference" (subtraction).

Is that possible ?

Thanks

subtractBed on vcfs duplicates the indels from the first file

Discovered in 2.17, and replicated in the latest version from git (v2.19.1-10-g377f1b1):

If an indel in the first file overlaps multiple variants from the second file, the indel is repeated multiple times in subtractBed output. Here is an example:

rvijaya@ubuntu:~/misc/temp$ cat testbed1.vcf
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chr1 5935162 rs1287637 AAAAAAAAAAAAAAA T 14677.8 PASS AC=2;AF=1.00;AN=2;DP=625;Dels=0.00 GT:AD:DP:GQ:PL 1/1:0,625:625:99:14706,1035,0
rvijaya@ubuntu:~/misc/temp$ cat testbed2.vcf
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chr1 5935162 rs1287637 A T 14677.8 PASS AC=2;AF=1.00;AN=2;DP=625;Dels=0.00 GT:AD:DP:GQ:PL 1/1:0,625:625:99:14706,1035,0
chr1 5935168 rs3747992;COSM426517 G A 20400.8 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=24.992;DP=2096;Dels=0.00 GT:AD:DP:GQ:PL 0/1:1048,1046:2096:99:20429,0,19945
rvijaya@ubuntu:~/misc/temp$ ~/software/bedtools2/bin/subtractBed -a testbed1.vcf -b testbed2.vcf
chr1 5935162 rs1287637 AAAAAAAAAAAAAAA T 14677.8 PASS AC=2;AF=1.00;AN=2;DP=625;Dels=0.00 GT:AD:DP:GQ:PL 1/1:0,625:625:99:14706,1035,0
chr1 5935162 rs1287637 AAAAAAAAAAAAAAA T 14677.8 PASS AC=2;AF=1.00;AN=2;DP=625;Dels=0.00 GT:AD:DP:GQ:PL 1/1:0,625:625:99:14706,1035,0

slop broken with big -b value

Newest github repo.

Contents of test_hg19.bed:

chr1    36337   36537
chr1    780737  781137
chr1    948337  949337

Command:

slopBed -i test_hg19.bed -g hg19.genome -b 3000000000 > test_slop.bed

and the content of test_slop.bed:

chr1    0   -2147447111
chr1    0   -2146702511
chr1    0   -2146534311

I want to use slop in a loop over different cutoffs. It works with small -b values but fails with large ones.
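
The reported end coordinates are consistent with 32-bit signed arithmetic: each one equals the original end plus -2147483648, the smallest 32-bit signed integer. How the slop value ended up as that number is an assumption here (an out-of-range conversion is one possibility), not something confirmed from the bedtools source. The sketch below checks the arithmetic and shows a clamped alternative. `slop_clamped` is a hypothetical helper.

```python
INT32_MIN = -2**31  # -2147483648

ends = [36537, 781137, 949337]                       # end column of test_hg19.bed
reported = [-2147447111, -2146702511, -2146534311]   # end column of test_slop.bed

# Each reported end equals the original end plus INT32_MIN.
for end, bad in zip(ends, reported):
    print(end, end + INT32_MIN, end + INT32_MIN == bad)


def slop_clamped(start, end, slop, chrom_size):
    """Extend an interval by `slop` on both sides, clamped to [0, chrom_size].
    Python integers do not overflow, so a slop of 3,000,000,000 is safe here."""
    return max(0, start - slop), min(chrom_size, end + slop)


CHR1_HG19 = 249250621  # hg19 chr1 length
print(slop_clamped(36337, 36537, 3_000_000_000, CHR1_HG19))  # (0, 249250621)
```

With clamping, a very large -b value would simply yield the whole chromosome for every record.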

bedtools custom include for zlib

Right now it is not possible to build bedtools cleanly unless zlib is installed system-wide (as root). The Makefile should allow custom include and library paths to be passed in through $CXXFLAGS.

This is useful for tools like conda.

export CXXFLAGS = -Wall -O2 -D_FILE_OFFSET_BITS=64 -fPIC $(INCLUDE)

to compile like this:

export INCLUDE="-I$PREFIX/include -L$PREFIX/lib"
make

sample outputs BED12 when input is BAM

bedtools sample -i NA18152.bam -n 100 \
| head -n 2
chrX    70381952    70382329    NA18152-SRR007381.808759    46  -   70381952    70382329    0,0,0   1   377,    0,
chr7    100594607   100595034   NA18152-SRR007381.795852    15  +   100594607   100595034   0,0,0   1   427,    0,

This is confusing because it should just output BAM unless the -bed option is used. Also, the -ubam option is displayed in the help, indicating that the default output for BAM is compressed BAM.

File type checking not recognized during context validation

The validation stage is supposed to make sure all options given are allowed with the detected file types. However, it doesn't appear to be recognizing file types at that point. For example:
bedtools2-release/bin/bedtools intersect -a a.bam -b a.bed -header
Should give a warning that the header option doesn't work with a BAM query unless BED output is specified with the -bed option. However, this warning does not occur.

Incorrect binary output with intersect -c -sorted with BAM as database

Using >=2.19.0 with a BED query and a BAM database, binary output is added at the end of the (supposedly BED) output when using the -c option (and possibly other options). The BED output appears to be correct, but binary output is incorrectly tacked onto the end. I suspect this is a BAM EOF footer that is incorrectly being added based on the fact that the database is a BAM file.

bedtools intersect -a hg19.100K.windows.gatk.bed -b ../bam/s05-R-2379-AJ-0009.bwamem.sort.dedup.bam -c -sorted > s05-R.100k.bedg
tail -20  s05-R.100k.bedg
Y   58800000    58900000    1115
Y   58900000    59000000    11942
Y   59000000    59100000    3851
Y   59100000    59200000    0
Y   59200000    59300000    0
Y   59300000    59373566    0

�BC
mUKh^U���S)
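
If the trailing bytes are indeed the BGZF end-of-file marker, they can be recognized, because that marker is a fixed 28-byte empty BGZF block given in the SAM/BAM specification. The sketch below is a post-processing workaround under that assumption, not a fix for bedtools itself. `strip_bgzf_eof` is a hypothetical helper.

```python
# The 28-byte empty BGZF block that terminates a BAM file (SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 "
    "02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def strip_bgzf_eof(data):
    """Return `data` without a trailing BGZF EOF block, if one is present."""
    if data.endswith(BGZF_EOF):
        return data[: -len(BGZF_EOF)]
    return data


text = b"Y\t59300000\t59373566\t0\n"   # last BED line from the output above
polluted = text + BGZF_EOF             # simulated output with the marker appended
print(strip_bgzf_eof(polluted) == text)
```

Note that bytes 13 and 14 of the marker are the ASCII characters "BC", which would match the "BC" visible in the garbled tail.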

overlap reported with -wo does not honor -split

Basically, -split detects overlaps with blocked records just fine, but when -wo is used, the number of base pairs reported reflects what would be reported were -split not used.

For example, if you had a two block SAM record with blocks from 0:10 and 20:30 and you also had a BED from 5:15, -wo with -split would return 10 instead of 5.

Also, when -f is also requested, the result is incorrect owing to the fact that the number of overlapping bases is miscomputed prior to testing on -f.

We need to fix this and put in appropriate unit tests.

Details here: https://code.google.com/p/bedtools/issues/detail?id=165
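
The difference between the two counts in the example above can be sketched as follows. This only illustrates the arithmetic with half-open intervals; it is not the bedtools implementation.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of overlapping bases between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


blocks = [(0, 10), (20, 30)]   # the two blocks of the blocked record
bed = (5, 15)

# Treating the record as one span from 0 to 30 (what is currently reported).
span_overlap = overlap(blocks[0][0], blocks[-1][1], *bed)

# Summing the overlap block by block (what -split should report).
block_overlap = sum(overlap(s, e, *bed) for s, e in blocks)

print(span_overlap, block_overlap)   # 10 5
```

Any -f test would presumably need to use the block-wise count as its numerator as well.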

Ignore DOS end-line characters in PFM

The old version of bedtools was able to recognize and ignore the DOS end-of-line characters, represented by "\r\n", or "^M" on the command line. PFM needs to be able to do this as well. Currently it throws errors on files containing these, saying it can not recognize the format.
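
As a rough illustration of the requested behavior, the sketch below strips either kind of line ending before splitting a record into fields. The helper name `strip_eol` is hypothetical.

```python
def strip_eol(line):
    """Remove a trailing Unix (LF) or DOS (CR LF) line ending."""
    return line.rstrip("\r\n")


dos_line = "chr1\t100\t200\r\n"
unix_line = "chr1\t100\t200\n"

print(strip_eol(dos_line).split("\t"))    # ['chr1', '100', '200']
print(strip_eol(dos_line) == strip_eol(unix_line))
```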

Add unit tests for GFF files

I was looking at the print method's code for GFF records, and believe it may be incorrect. We should double check, and add unit tests for it if we don't already have it.

Imprecise Structural Variants

Bedtools does not correctly find intersections for imprecise structural variants. Under recent VCF format standards, the ALT allele of such a record is encoded symbolically (as an angle-bracketed ID rather than a sequence). Instead of using the ALT text to determine the length, bedtools should use the SVLEN INFO field, when available, to determine the length of the SV.
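
A minimal sketch of the proposed approach follows. It assumes the record carries an SVLEN entry in its INFO column and that the span starts at POS and covers abs(SVLEN) bases. Whether the base at POS itself should be counted is a convention question this sketch does not settle. The helper names and the record values are made up for illustration.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flag entries map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def sv_span(pos, info):
    """Return a 0-based, half-open (start, end) for a symbolic-allele record.

    Assumption: the span starts at POS (1-based) and covers abs(SVLEN) bases.
    Returns None when SVLEN is absent or has no value.
    """
    svlen = parse_info(info).get("SVLEN")
    if not isinstance(svlen, str):
        return None
    length = abs(int(svlen.split(",")[0]))  # SVLEN may list one value per ALT
    return pos - 1, pos - 1 + length


print(sv_span(1000, "IMPRECISE;SVTYPE=DEL;SVLEN=-300"))   # (999, 1299)
print(sv_span(1000, "IMPRECISE;SVTYPE=DEL"))              # None
```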

IO Size is 4KB

Hi Guys,

We have users running bedtools on our compute clusters, and we have noticed that the IO requests issued are 4KB. This is a breakdown of the IO sizes as seen by the Lustre llite VFS layer on Lustre clients for a run of bedtools over a couple of minutes:

    extents            calls    % cum%  |          calls    % cum%

   0K -    4K :       14099274   99   99  |              0    0    0
   4K -    8K :              0    0   99  |              0    0    0
   8K -   16K :              0    0   99  |              0    0    0
  16K -   32K :              0    0   99  |              0    0    0
  32K -   64K :              0    0   99  |              0    0    0
  64K -  128K :              0    0   99  |              0    0    0
 128K -  256K :              0    0   99  |              0    0    0
 256K -  512K :              0    0   99  |              0    0    0
 512K - 1024K :              0    0   99  |              0    0    0
   1M -    2M :              0    0   99  |              0    0    0
   2M -    4M :             88    0  100  |              0    0    0

Could you employ input and output buffering, or make the IO size tunable? Running thousands of bedtools jobs at the same time is causing a lot of small IO traffic to our Lustre servers.

Anyone seen something similar?

Intersect BED with VCF with genotype data

After the 2.17 release I get an error when intersecting a BED file with a VCF that contains genotype information (i.e. sample VCF file)

In my query VCF, containing genotypes for 100 samples, I get the following error in 2.19 release:
intersectBed -a query.vcf.gz -b target.bed.gz -wa -wb
Error: line number 1 of file query.vcf.gz has 109 fields, but 10 were expected.

This worked fine in previous releases.

Fails to build with clang on Mac OS X

$ make
Building BEDTools:
=========================================================
DETECTED_VERSION = v2.18.0
CURRENT_VERSION  = 
Updating version file.
 * Creating BamTools API
- Building in src/utils/bedFile
- Building in src/utils/BinTree
- Building in src/utils/version
- Building in src/utils/bedGraphFile
  * compiling BinTree.cpp
  * compiling bedFile.cpp
  * compiling version.cpp
  * compiling bedGraphFile.cpp
- Building in src/utils/chromsweep
  * compiling chromsweep.cpp
In file included from BinTree.cpp:2:
In file included from ../../utils//FileRecordTools/FileRecordMgr.h:16:
../../utils//general/DualQueue.h:48:35: error: declaration of 'T' shadows
      template parameter
template <class T, template<class T> class CompareFunc> class DualQueue {
                                  ^
../../utils//general/DualQueue.h:48:17: note: template parameter is declared
      here
template <class T, template<class T> class CompareFunc> class DualQueue {
                ^
- Building in src/utils/Contexts
  * compiling Context.cpp
1 error generated.
make[1]: *** [../../../obj//BinTree.o] Error 1
make: *** [src/utils/BinTree] Error 2
make: *** Waiting for unfinished jobs....

Remove A and B file from merge help

Uma noticed that I had references to the A and B file in the merge help in the sections explaining the -c and -o options. This was a cut and paste from the map help. Just need to remove those bits.

Can we reduce memory usage with the R-tree for unsorted data?

The memory usage has jumped significantly from 2.17 to 2.18 for unsorted data. This is primarily owing to the fact that both numeric and string versions of many fields are stored for each record. We should look into opportunities for reducing this footprint if at all possible to facilitate better scaling for unsorted datasets.

The counter-argument, of course, is that for large datasets, one should really be using pre-sorted data.

See this thread:
https://groups.google.com/forum/#!topic/bedtools-discuss/D04h7-o91_o

Speed up genomecov

Currently, even with sorted data, the algorithm is two-pass. It can really be done in one pass, which would reduce both memory and runtime.

bedtools python bindings

Hello,

I'd like to know what the most up-to-date python bindings are and whether they work with bedtools2.

Thanks.

-- Arjan

Build error on CentOS/RedHat systems with 2.18.0

Neil and Aaron;
We're running into build errors with 2.18.0 on RedHat/CentOS systems. It builds cleanly on Ubuntu, but we got two separate reports of failures on RedHat systems (bcbio/bcbio-nextgen#219 (comment) and bcbio/bcbio-nextgen#220 (comment)).

I can replicate this on a CentOS 5 system:

$ uname -a
Linux hsph01.rc.fas.harvard.edu 2.6.18-274.12.1.el5 #1 SMP Tue Nov 29 13:37:46 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
hsph01:~/bio/cloudbiolinux $ cat /etc/redhat-release 
CentOS release 5.7 (Final)

The final error message is:

obj/InputStreamMgr.o: In function `InputStreamMgr::detectBamOrBgzip(int&, int)':
InputStreamMgr.cpp:(.text+0x534): undefined reference to `BamTools::BamReader::OpenStream(std::istream*)'

and here's a full build log:

https://gist.github.com/chapmanb/7981795

Happy to provide any more information that could help. Thanks much,
Brad

How to incorporate part of bedtools2 (under GPLv2) into a project under GPLv3+?

Hello,

I am developing a C++ package doing eQTL detection and I licensed it under GPLv3+ (the "+" meaning "or any later version"). By looking at the README.md or the source code (say "bedFile.h"), bedtools is licensed under GPLv2. Thus, I can't use code from bedtools2 into my package, see here.

Maybe you could switch to GPLv3+. But I can understand that you want to keep GPLv2, as switching can disrupt things for other developers. However, would it be possible to unambiguously specify in bedtools2 that you allow usage of the code under GPLv2 as well as later versions? That is, to switch to GPLv2+?

In fact, this is suggested in the LICENSE file itself, in section 9. Practically, this would require adding "or any later version" in the README.md file (and also in the header of the source files, to avoid confusion).

Thanks,
Tim

making man file fails for 2.18.1

On OS X 10.7, building the documentation for 2.18.1, I get the following failure for the man file (the html target succeeds):

make -j1 -C docs man
sphinx-build -b man -d _build/doctrees   . _build/man
Making output directory...
Running Sphinx v1.1.3
loading pickled environment... done
loading intersphinx inventory from http://docs.python.org/objects.inv...
building [man]: all manpages
updating environment: 0 added, 48 changed, 0 removed
reading sources... [100%] index                                                                                                                  
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/bedtools-suite.rst:16: WARNING: toctree contains reference to nonexisting document u'content/tools/igv'
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/general-usage.rst:25: WARNING: Enumerated list ends without a blank line; unexpected unindent.

[ a bunch more warnings ]

/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/general-usage.rst:166: ERROR: Unexpected indentation.
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/general-usage.rst:184: ERROR: Unexpected indentation.
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/general-usage.rst:161: ERROR: Unknown target name: "aza-z0-9".
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/related-tools.rst:3: WARNING: Duplicate explicit target name: "here".
=======================================================================================
``-both`` Report both the count of hits and the fraction covered from the annotation files
=======================================================================================
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/tools/bamtobed.rst:149: ERROR: Unexpected indentation.
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/tools/bed12tobed6.rst:14: ERROR: Unexpected indentation.
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/tools/bed12tobed6.rst:30: ERROR: Unexpected indentation.
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/tools/bedtobam.rst:12: ERROR: Unexpected indentation.
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/tools/bedtobam.rst:30: ERROR: Unexpected indentation.
/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/tools/bedtobam.rst:57: ERROR: Unexpected indentation.

[ tons more similar errors ]

/sw/build.build/bedtools-2.18.1-1/bedtools2/docs/content/tools/unionbedg.rst:148: ERROR: Unexpected indentation.
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
writing... bedtools.1 { content/overview content/installation content/quick-start content/general-usage content/bedtools-suite content/tools/annotate content/tools/bamtobed content/tools/bamtofastq content/tools/bed12tobed6 content/tools/bedpetobam content/tools/bedtobam content/tools/closest content/tools/cluster content/tools/complement content/tools/coverage content/tools/expand content/tools/flank content/tools/genomecov content/tools/getfasta content/tools/groupby content/tools/intersect content/tools/jaccard content/tools/links content/tools/makewindows content/tools/map content/tools/maskfasta content/tools/merge content/tools/multicov content/tools/multiinter content/tools/nuc content/tools/overlap content/tools/pairtobed content/tools/pairtopair content/tools/random content/tools/reldist content/tools/shuffle content/tools/slop content/tools/sort content/tools/subtract content/tools/tag content/tools/unionbedg content/tools/window content/example-usage content/advanced-usage content/tips-and-tricks content/faq content/related-tools } 
Exception occurred:
  File "/sw/lib/python2.7/site-packages/docutils/writers/manpage.py", line 865, in dedent
    self._indent.pop()
IndexError: pop from empty list

This is the sphinx error log file:

# Sphinx version: 1.1.3
# Python version: 2.7.6
# Docutils version: 0.10 release
# Jinja2 version: 2.7.1
Traceback (most recent call last):
  File "/sw/lib/python2.7/site-packages/sphinx/cmdline.py", line 189, in main
    app.build(force_all, filenames)
  File "/sw/lib/python2.7/site-packages/sphinx/application.py", line 204, in build
    self.builder.build_update()
  File "/sw/lib/python2.7/site-packages/sphinx/builders/__init__.py", line 191, in build_update
    self.build(['__all__'], to_build)
  File "/sw/lib/python2.7/site-packages/sphinx/builders/__init__.py", line 252, in build
    self.write(docnames, list(updated_docnames), method)
  File "/sw/lib/python2.7/site-packages/sphinx/builders/manpage.py", line 88, in write
    docwriter.write(largetree, destination)
  File "/sw/lib/python2.7/site-packages/docutils/writers/__init__.py", line 80, in write
    self.translate()
  File "/sw/lib/python2.7/site-packages/sphinx/writers/manpage.py", line 35, in translate
    self.document.walkabout(visitor)
  File "/sw/lib/python2.7/site-packages/docutils/nodes.py", line 174, in walkabout
    if child.walkabout(visitor):
  File "/sw/lib/python2.7/site-packages/docutils/nodes.py", line 174, in walkabout
    if child.walkabout(visitor):
  File "/sw/lib/python2.7/site-packages/docutils/nodes.py", line 174, in walkabout
    if child.walkabout(visitor):
  File "/sw/lib/python2.7/site-packages/docutils/nodes.py", line 174, in walkabout
    if child.walkabout(visitor):
  File "/sw/lib/python2.7/site-packages/docutils/nodes.py", line 174, in walkabout
    if child.walkabout(visitor):
  File "/sw/lib/python2.7/site-packages/docutils/nodes.py", line 174, in walkabout
    if child.walkabout(visitor):
  File "/sw/lib/python2.7/site-packages/docutils/nodes.py", line 174, in walkabout
    if child.walkabout(visitor):
  File "/sw/lib/python2.7/site-packages/docutils/nodes.py", line 187, in walkabout
    visitor.dispatch_departure(self)
  File "/sw/lib/python2.7/site-packages/docutils/nodes.py", line 1640, in dispatch_departure
    return method(node)
  File "/sw/lib/python2.7/site-packages/docutils/writers/manpage.py", line 411, in depart_admonition
    self.depart_block_quote(node)
  File "/sw/lib/python2.7/site-packages/docutils/writers/manpage.py", line 449, in depart_block_quote
    self.dedent()
  File "/sw/lib/python2.7/site-packages/docutils/writers/manpage.py", line 865, in dedent
    self._indent.pop()
IndexError: pop from empty list

intersectBed 2.19.1 hangs with empty vcf

Running intersectBed on a VCF that has a header but no variant calls hangs forever.

$ cat test.vcf
##fileformat=VCFv4.1
##contig=<ID=chr1,assembly=hg19,length=249250621>

this command never exits:
$intersectBed -a test.vcf -b any.bed

Make fields accessible by fieldNum rather than name

From Aaron:
As you can see in this example: http://bedtools.readthedocs.org/en/latest/content/tools/map.html#mean-compute-the-mean-of-a-column-from-overlapping-intervals, map allows users to detect intersections between A and B and specify a column from the B file that should be used to summarize the hits in B for each record in A.

So, if the user said -c 8 -o mean, it would indicate that the mean of the 8th column (whatever it may be) in B should be calculated across all the records in B that overlap the current record in A.

Thus, it requires that we be able to access columns in Records by position, rather than by name. In 2.17, I had a "fields" vector of strings that stored each column.
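
To illustrate what access by position means here, the sketch below keeps each record as a list of column strings and applies a named operation to a 1-based column number. The `OPS` table, the `apply_op` helper and the records are hypothetical, and only a few operations are shown.

```python
# A few operations keyed by name, each taking a list of floats.
OPS = {
    "mean": lambda values: sum(values) / len(values),
    "sum": sum,
    "max": max,
}


def apply_op(records, column, op):
    """Apply `op` to a 1-based `column` of tab-delimited records.
    Returns None when there are no records to summarize."""
    if not records:
        return None
    values = [float(rec.split("\t")[column - 1]) for rec in records]
    return OPS[op](values)


# Made-up B records overlapping one A record; column 5 holds a score.
hits = ["chr1\t10\t20\tb1\t4", "chr1\t15\t30\tb2\t8"]
print(apply_op(hits, 5, "mean"))   # 6.0
```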

Question: bedtools genomecov -d output from BAM files does not include chromosomes with no coverage

After creating a BAM file by aligning RNA sequences to a small file of coding sequences, bedtools genomecov -d did not report sequences with zero coverage over their entire length.

Is this a feature or a bug ?

$ samtools idxstats ARC10_.unique.bam > ARC10_.unique.idxstats # get output of idxstats
$ cat ARC10_.unique.idxstats | grep -v "^*" | wc -l # count the number of entries in idxstats
233
$ cat ARC10_.unique.idxstats | cut -f3 | grep -v '^0' | wc -l # count the number of entries in idxstats with non-zero coverage
158
$ ./bedtools --version # show we are using the current version of bedtools
bedtools v2.19.1-41-g656fd84
$ ./bedtools genomecov -dz -ibam ARC10_.unique.bam > ARC10_.unique.depth # output all coverage of bedtools
$ cat ARC10_.unique.depth | cut -f1 | uniq | wc -l # count the number of different sequences in the bedtools output
158

I hope to put together a small example FASTA and SAM file to demonstrate this better. I hope the above is understandable.

empty bed files failing again

The previous arq5x/bedtools#30 issue has resurfaced in the new repository:

$ bedtools sort -i /dev/null
Error: The requested file (/dev/null) could not be opened. Error message: (Success). Exiting!

It appears that the isGzipFile() part of ca50f59 has been reverted by 69d82a7.

closest skips line from A file

I have 2 files
a.bed
chrUn_gl000220 130081 130187 2364_IonXpress_010 36 +

b.bed
chrUn_gl000220 5406 5414 XYZ 8 +
chrUn_gl000220 12203 12211 XYZ 8 -
chrUn_gl000220 25451 25459 XYZ 8 -
chrUn_gl000220 28956 28964 XYZ 8 -
chrUn_gl000220 37582 37590 XYZ 8 -
chrUn_gl000220 90417 90425 XYZ 8 +
chrUn_gl000220 120950 120958 XYZ 8 -

When running closest with the following options:
bedtools closest -S -D a -a a.bed -b b.bed
I got the expected:
chrUn_gl000220 130081 130187 2364_IonXpress_010 36 + chrUn_gl000220 120950 120958 XYZ 8 - -9124

However with this additional option:
bedtools closest -S -D a -iu -a a.bed -b b.bed
I got nothing. The problem seems to be that the interval in a.bed is close to the end of the chrUn contig and not a single interval in b.bed fits the -iu constraint. But instead of keeping the original a.bed entry and adding "none" or another negative marker, closest drops it.

Is it a bug or the expected behavior of this function?

PS: I use Bedtools 2.17.0

bedtools-2.19.1: Fasta.cpp:29:12: warning: converting to non-pointer type 'int' from NULL [-Wconversion-null]

While compiling the package on Gentoo Linux I get a warning about your coding style: ;-)

  • QA Notice: Package triggers severe warnings which indicate that it
  •        may exhibit random runtime failures.
    
  • Fasta.cpp:29:12: warning: converting to non-pointer type 'int' from NULL [-Wconversion-null]
  • Fasta.cpp:32:15: warning: converting to non-pointer type 'int' from NULL [-Wconversion-null]
  • Fasta.cpp:33:15: warning: converting to non-pointer type 'int' from NULL [-Wconversion-null]
  • Please do not file a Gentoo bug and instead report the above QA
  • issues directly to the upstream developers of this software.
  • Homepage: http://code.google.com/p/bedtools/

merge's output for BAM files needs to be modified.

The BED3 intervals reported when merging BAM input files appear to be correct. However, the remaining 9 columns of the full BED12 records emitted don't make sense.

Specifically, the read names and strands assigned to the merged record are odd because it appears that they are just chosen from the first record in the merged block. Perhaps until -c and -o functionality are possible for BAM files we should just emit BED3.

$ samtools view  ../BEDTools/testingData/NA18152.bam | cut -f 1-6 | head
NA18152-SRR007381.35051 16  chr1    554305  16  9M2D2M2D2M2D314M
NA18152-SRR007381.637219    16  chr1    554305  16  9M2D2M1D3M2D295M
NA18152-SRR007381.730912    16  chr1    554305  16  9M2D2M2D2M2D333M
NA18152-SRR007381.1166916   16  chr1    554305  15  9M2D2M2D2M2D15M1I16M1I415M
NA18152-SRR007381.1281127   16  chr1    554305  4   9M2D2M2D2M2D52M1D236M
NA18152-SRR007381.287200    16  chr1    554310  15  5M1D1M1D6M1I41M1I58M1D347M
NA18152-SRR007381.1466069   16  chr1    554324  24  8M1I6M1I217M1I109M
NA18152-SRR007381.1339811   1040    chr1    554335  39  106M1I178M
NA18152-SRR007381.591703    16  chr1    554338  29  132M1I33M
NA18152-SRR007381.437387    16  chr1    554346  31  42M1I56M1D347M

$ bin/bedtools merge -i ../BEDTools/testingData/NA18152.bam | head
chr1    554304  560167  NA18152-SRR007381.35051     -   -1  560167  0,0,0   1   0,  0,
chr1    714220  714373  NA18152-SRR007381.251923        +   -1  714373  0,0,0   1   0,  0,
chr1    780202  780556  NA18152-SRR007381.1452392   36  -   780202  780556  0,0,0   1   354,    0,
chr1    810530  810706  NA18152-SRR007381.1012740   55  -   810530  810706  0,0,0   1   176,    0,
chr1    825299  825688  NA18152-SRR007381.317662    37  -   825299  825688  0,0,0   1   389,    0,
chr1    847668  848139  NA18152-SRR007381.158556        +   -1  848139  0,0,0   1   0,  0,
chr1    860064  860310  NA18152-SRR007381.329161    10  -   860064  860310  0,0,0   1   246,    0,
chr1    863981  864228  NA18152-SRR007381.1450839   48  -   863981  864228  0,0,0   1   247,    0,
chr1    875876  876203  NA18152-SRR007381.732122    9   +   875876  876203  0,0,0   1   327,    0,
chr1    888310  888749  NA18152-SRR007381.814042        -   -1  888749  0,0,0   1   0,  0,

merge adds extra tab in some cases.

"mergeBed -s -nms -i z0 will put two consecutive tabs in the output that causes problems in the later processes"

z0 contents:
chr1 1102483 1102578 MIR200B 0 + 1102578 1102578 0 1 95, 0,
chr1 1103242 1103332 MIR200A 0 + 1103332 1103332 0 1 90, 0,
chr1 1104384 1104467 MIR429 0 + 1104467 1104467 0 1 83, 0,
chr1 3044538 3044599 MIR4251 0 + 3044599 3044599 0 1 61, 0,
chr1 3477259 3477354 MIR551A 0 - 3477354 3477354 0 1 95, 0,
chr1 5624130 5624203 MIR4417 0 + 5624203 5624203 0 1 73, 0,

Optimize VectorOps to construct from a vector or a HitSet

Many of the VectorOps methods (e.g., mean) can be computed in one pass over the data instead of the two passes currently used (i.e., one pass to loop through the hits in a HitSet and extract the relevant columns, and another to compute the result). This should be optimized, and we should be loading pointers to strings, not strings.

Secondly, optionally constructing from a HitSet will allow us to quickly compute results in one pass for many cases.
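
The one-pass idea can be sketched as below, with a plain list of records standing in for a HitSet. This is an illustration in Python rather than the C++ VectorOps code, and the names and records are hypothetical.

```python
def one_pass_mean(hits, column):
    """Mean of a 1-based column computed in a single loop over the hits,
    without first building a separate list of extracted values."""
    total = 0.0
    count = 0
    for fields in hits:                      # each hit is a list of column strings
        total += float(fields[column - 1])   # convert and accumulate in place
        count += 1
    return total / count if count else None


# Made-up hits; column 5 holds a score.
hits = [
    ["chr1", "10", "20", "b1", "4"],
    ["chr1", "15", "30", "b2", "8"],
    ["chr1", "18", "40", "b3", "9"],
]
print(one_pass_mean(hits, 5))   # 7.0
```

Operations such as median still need all values at once, so this shortcut only applies to operations that can be accumulated.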

Empty .bam file from bedtools intersect

Hi,

I am running a Duplex-Seq pipeline that requires using bedtools.

All the sequence comes from a relatively small amplicon (90 bp), which is a p53 sequence. I have a sorted, indexed .bam file as the -abam input and a one-line .bed file for the p53 region as the -b input, but the resulting .bam file (which should have the overlapping reads) is empty. I have checked the alignment, and the coordinates are correct. In fact, I also tried using a .bed file specifying the entire chr17, and I still don't get any reads in the resulting .bam file.

Can you please help me troubleshoot what might be going on?

Thanks,
Charles

Empty files and just headers with -h and -v

We need to ensure that empty files give no output, rather than errors, as mentioned in another bug.

If there is a header, and -h is used, the header should still be output.

Lastly, the -v option needs to work correctly if the database file is empty.

Resolve issue with segfault on pybedtools test BAM files.

Bedtools intersect segfaults with the gdb.bam test file in pybedtools. However, this file works fine with bedtools 2.17 and samtools. Interestingly, if one converts it to SAM and then back to BAM with samtools, the file works fine. It appears that something is odd with the way 2.18 is interpreting the header in this file, or perhaps that the stream being created becomes corrupt.
