
BGZFStreams


BGZF is a compression format that supports random access using virtual file offsets.

See the SAM/BAM file format specs for the details of BGZF: https://samtools.github.io/hts-specs/SAMv1.pdf.

Synopsis

using BGZFStreams

# The first argument is a filename or an IO object (e.g. IOStream).
stream = BGZFStream("data.bgz")

# BGZFStream is a subtype of IO and works like a usual IO object.
while !eof(stream)
    byte = read(stream, UInt8)
    # do something...
end

# BGZFStream is also seekable with a VirtualOffset.
seek(stream, VirtualOffset(0, 2))

# The current virtual file offset is available.
virtualoffset(stream)

close(stream)

Usage

The BGZFStreams.jl package exports three types and a function to the package user:

  • Types:
    • BGZFStream: an IO stream of the BGZF file format
    • VirtualOffset: data offset in a BGZF file
    • BGZFDataError: an error type thrown when reading a malformed byte stream
  • Function:
    • virtualoffset(stream): returns the current virtual file offset of stream

The BGZFStream type wraps an underlying IO object and transparently inflates (for reading) or deflates (for writing) data. Since it is a subtype of IO, an instance of it behaves like other IO objects, but its seek method takes a virtual offset instead of a normal file offset as its second argument.
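For example, a BGZFStream can wrap an in-memory IO object just as well as a file (a minimal sketch; the byte written is arbitrary, and any writable IO should behave the same way):

using BGZFStreams

buf = IOBuffer()
stream = BGZFStream(buf, "w")
write(stream, 0x01)   # data are deflated into BGZF blocks transparently
close(stream)         # flushes the final block to the underlying buffer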

The VirtualOffset type represents a 64-bit virtual file offset as described in the specification of the SAM file format. The most significant 48 bits are a byte offset into the BGZF file pointing to the beginning of a BGZF block, and the least significant 16 bits are a byte offset into the uncompressed data of that block.
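The packing can be illustrated with plain bit arithmetic (a sketch of the layout itself, not the package API; the offsets are arbitrary):

coffset = UInt64(1234)               # byte offset of a BGZF block in the file
uoffset = UInt64(56)                 # byte offset within the uncompressed block
voffset = (coffset << 16) | uoffset  # the packed 64-bit virtual offset
@assert voffset >> 16 == coffset
@assert voffset & 0xffff == uoffset

The exported constructor takes the same pair of offsets, as VirtualOffset(1234, 56).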

The BGZFDataError type is a subtype of Exception and is thrown when a malformed byte stream is read.

The virtualoffset(stream::BGZFStream) function returns the current virtual file offset: while reading, it is the virtual offset of the next byte to be read; while writing, it is the virtual offset of the next byte to be written.
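A common pattern is to record a virtual offset while writing and seek back to it when reading (a minimal sketch, assuming "data.bgz" is a writable path):

using BGZFStreams

stream = BGZFStream("data.bgz", "w")
write(stream, b"hello")
voff = virtualoffset(stream)       # offset of the next byte to be written
write(stream, b"world")
close(stream)

stream = BGZFStream("data.bgz")
seek(stream, voff)                 # jump directly to the recorded position
@assert read(stream, UInt8) == UInt8('w')
close(stream)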

Parallel Processing

A major bottleneck in processing a BGZF file is the inflation and deflation of blocks. The throughput of reading is ~100 MiB/s, which is considerably slower than raw reading from a file. To alleviate this problem, this package supports parallelized inflation when reading compressed data. This requires the multi-threading support introduced in Julia 0.5; the JULIA_NUM_THREADS environment variable sets the number of threads used for processing.

bash-3.2$ JULIA_NUM_THREADS=2 julia -q
julia> using Base.Threads

julia> nthreads()
2

Contributing

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.

Financial contributions

We also welcome financial contributions in full transparency on our open collective. Anyone can file an expense. If the expense makes sense for the development of the project, it will be approved by the core contributors and the person who filed the expense will be reimbursed.

Backers & Sponsors

Thank you to all our backers and sponsors!

Love our work and community? Become a backer.


Does your company use BioJulia? Help keep BioJulia feature rich and healthy by sponsoring the project. Your logo will show up here with a link to your website.

Questions?

If you have a question about contributing or using BioJulia software, come on over and chat to us on the Julia Slack workspace, or you can try the Bio category of the Julia discourse site.

Contributors

bicycle1885, ciaranomara, cth, dcjones, femtocleaner[bot], staticfloat, tkelman


Issues

Attempts at multithreading lead to various errors

I have a tool (https://github.com/IGS/FADU) that I would like to enable multithreading for. In an earlier iteration of the tool, I was using the BioAlignments package for BAM-file processing, and I believe multithreading was working properly then (at least implicitly, where I would not have to call @threads or @spawn in my code). Since the BAM functionality was moved to XAM.jl, I updated packages, and that seems to have broken the multithreading functionality. Currently I am trying to follow some tutorials (https://julialang.org/blog/2023/07/PSA-dont-use-threadid/#better_fix_work_directly_with_tasks) in an effort to get multithreading working again, but am not having much luck.

I wasn't sure which BioJulia repo to put this in, but judging from (BioJulia/XAM.jl#44) and (BioJulia/BioAlignments.jl#4) it seems this has not been implemented. The error I cite is from the BGZFStreams package, so I am writing this here, but I can copy this over to XAM if it should be there.

Expected Behavior

If I run the following code without --threads=N, I expect the code to run successfully without erroring. Same if I run with julia --threads=N.

Current Behavior

If my GFF3 file has exactly one feature to process, then the test script runs as expected (with or without the --threads command-line argument).

When there are 2+ features to process and --threads is not used (or --threads=1), I get the following:

ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:349 [inlined]
 [2] fetch
   @ ./task.jl:369 [inlined]
 [3] _broadcast_getindex_evalf
   @ ./broadcast.jl:683 [inlined]
 [4] _broadcast_getindex
   @ ./broadcast.jl:656 [inlined]
 [5] getindex
   @ ./broadcast.jl:610 [inlined]
 [6] copy
   @ ./broadcast.jl:912 [inlined]
 [7] materialize
   @ ./broadcast.jl:873 [inlined]
 [8] process_all_features(features::Vector{GFF3.Record}, bam_file::String)
   @ Main ~/git/FADU/tst2.jl:92
 [9] top-level scope
   @ ~/git/FADU/tst2.jl:103

    nested task error: ArgumentError: too large in-block offset
    Stacktrace:
     [1] seek(stream::BGZFStreams.BGZFStream{IOStream}, voffset::BGZFStreams.VirtualOffset)
       @ BGZFStreams ~/.julia/packages/BGZFStreams/aVKm9/src/bgzfstream.jl:232
     [2] seek
       @ ~/.julia/packages/XAM/l6LT2/src/bam/reader.jl:68 [inlined]
     [3] iterate(iter::XAM.BAM.OverlapIterator{IOStream})
       @ XAM.BAM ~/.julia/packages/XAM/l6LT2/src/bam/overlap.jl:61
     [4] process_feature(feature::GFF3.Record, reader::XAM.BAM.Reader{IOStream})
       @ Main ~/git/FADU/tst2.jl:62
     [5] macro expansion
       @ ~/git/FADU/tst2.jl:85 [inlined]
     [6] (::var"#4#6"{SubArray{GFF3.Record, 1, Vector{GFF3.Record}, Tuple{UnitRange{Int64}}, true}, XAM.BAM.Reader{IOStream}})()
       @ Main ./threadingconstructs.jl:410
in expression starting at /Users/sadkins/git/FADU/tst2.jl:103

If --threads=2 is used, I get this:

ERROR: LoadError: TaskFailedException
Stacktrace:
    ### same "wait" stacktrace as above
nested task error: zlib failed to inflate a compressed block
    Stacktrace:
     [1] error(s::String)
       @ Base ./error.jl:35
     [2] read_blocks!(stream::BGZFStreams.BGZFStream{IOStream})
       @ BGZFStreams ~/.julia/packages/BGZFStreams/aVKm9/src/bgzfstream.jl:402
     [3] seek(stream::BGZFStreams.BGZFStream{IOStream}, voffset::BGZFStreams.VirtualOffset)
       @ BGZFStreams ~/.julia/packages/BGZFStreams/aVKm9/src/bgzfstream.jl:229
     [4] seek
       @ ~/.julia/packages/XAM/l6LT2/src/bam/reader.jl:68 [inlined]
     [5] iterate(iter::XAM.BAM.OverlapIterator{IOStream})
       @ XAM.BAM ~/.julia/packages/XAM/l6LT2/src/bam/overlap.jl:61
     [6] process_feature(feature::GFF3.Record, reader::XAM.BAM.Reader{IOStream})
       @ Main ~/git/FADU/tst2.jl:196
     [7] macro expansion
       @ ~/git/FADU/tst2.jl:219 [inlined]
     [8] (::var"#4#6"{SubArray{GFF3.Record, 1, Vector{GFF3.Record}, Tuple{UnitRange{Int64}}, true}, XAM.BAM.Reader{IOStream}})()
       @ Main ./threadingconstructs.jl:410
in expression starting at /Users/sadkins/git/FADU/tst2.jl:103

Possible Solution / Implementation

I'm not sure of one at the moment.

Steps to Reproduce (for bugs)

using Base.Threads: nthreads, @spawn
using Base.Iterators: partition

using XAM
using GFF3
using GenomicFeatures

function process_feature(feature::GFF3.Record, reader::BAM.Reader)
    count = 0
    for alignment in eachoverlap(reader, feature)   # from XAM.jl   (BGZFStreams fails here)
        # Process the alignment here...
        count += 1
    end

    println("Processed feature: ", GFF3.attributes(feature, "ID"), " with ", count, " overlaps")
end

# this is inefficient, but it's just an example
function process_all_features(features::Vector{GFF3.Record}, bam_file::String)
    bai_file = bam_file * ".bai"
    reader = open(BAM.Reader, bam_file, index = bai_file)

    tasks_per_thread = 2

    chunk_size = max(1, length(features) ÷ (tasks_per_thread * nthreads()))
    data_chunks = partition(features, chunk_size) # partition your data into chunks that
                                                  # individual tasks will deal with
    tasks = map(data_chunks) do chunk
        # Each chunk of your data gets its own spawned task that does its own local, sequential work
        # and then returns the result
        @spawn begin
            for feature in chunk
                process_feature(feature, reader)
            end
            return
        end
    end

    fetch.(tasks) # get all the values returned by the individual tasks. fetch is type
                  # unstable, so you may optionally want to assert a specific return type.
    close(reader)
end

# Example usage:
gff_file = "test_data/small.gff3" #"path_to_your_gff_file" with 2 or more gene features
bam_file = "test_data/small.bam" #"path_to_your_bam_file" for my test, I just took the first 100 lines of a larger bam file

features = open(collect, GFF3.Reader, gff_file)    # Array of GFF3.Record objects
filter!(x -> GFF3.featuretype(x) == "gene", features)   # Filter to only gene features
process_all_features(features, bam_file)

I also realize I may just be doing this incorrectly. It has been a while since I have used Julia.

Context

Currently with the FADU tool I have worked on, we are seeing an increase in cases where the run times are super long (i.e. IGS/FADU#12) due to huge amounts of BAM read/GFF feature overlaps that need to be processed, and fixing the multithreading would go a long way to remedying some of this. I don't want to make it seem like it's the sole reason, as I am actively looking for other ways to speed up the execution as well.

Your Environment

  • Package Version used:
  • Julia Version used: v1.9.3
  • Operating System and version (desktop or mobile): Mac OS 14.2.1
  • Link to your project: https://github.com/IGS/FADU, though the code I am editing is not committed. My minimal test example for how I would use the multithreading is above.

The packages I am working with on my desktop are a bit newer than the ones I list in my tool. It is stable... just the multithreading isn't working.

[c7e460c6] ArgParse v1.1.4
[8e4a8c10] BED v0.3.0
[28d598bf] BGZFStreams v0.3.2
[af1dc308] GFF3 v0.2.3
⌅ [899a7d2d] GenomicFeatures v2.1.0
⌅ [4ffb77ac] Indexes v0.1.3
[09ab397b] StructArrays v0.6.17
[28d57a85] Transducers v0.4.81
⌃ [d759349c] XAM v0.3.1 - I'm aware there is a XAM v0.4 but Pkg.up does not update it.

BoundsError: attempt to access 1-element Array{BGZFStreams.Block,1} at index [2]

Encountered in BioJulia/XAM.jl#31
and reproduced by line 286 in https://github.com/BioJulia/XAM.jl/tree/hotfix/issue-31.

It appears there is a case where ensure_buffered_data increments the stream.block_index past the end of the stream.blocks array. I haven't had the time to understand how this might occur.

@inline function ensure_buffered_data(stream)
    #@assert stream.mode == READ_MODE
    @label doit
    while stream.block_index ≤ lastindex(stream.blocks)
        @inbounds block = stream.blocks[stream.block_index]
        if block.position ≤ block.size
            return stream.block_index
        end
        stream.block_index += 1
    end
    if !eof(stream.io)
        read_blocks!(stream)
        @goto doit
    end
    return 0
end

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

issue with concatenated bgzip

Hitting an error when analysing a bgzipped BED file that has been concatenated.

test_390_JI.bed.gz
test_389_JI.bed.gz

In bash:
cat test_389_JI.bed.gz test_390_JI.bed.gz > test_JI.bed.gz
tabix -p bed test_JI.bed.gz

In Julia:

using GenomicFeatures
using BGZFStreams

JI_bed_reader = BED.Reader(BGZFStream("./test_JI.bed.gz"),
                           index = "./test_JI.bed.gz.tbi")

for i in eachoverlap(JI_bed_reader, Interval("chr6", 61374400, 61374500))
    println(i)
end

ERROR: ArgumentError: too large in-block offset
Stacktrace:
 [1] seek(::BGZFStream{IOStream}, ::VirtualOffset) at /nfs/users/nfs_r/ra11/.julia/packages/BGZFStreams/qApsr/src/bgzfstream.jl:225
 [2] done(::GenomicFeatures.Indexes.TabixOverlapIterator{GenomicFeatures.BED.Reader}, ::GenomicFeatures.Indexes.TabixOverlapIteratorState{GenomicFeatures.BED.Record}) at /nfs/users/nfs_r/ra11/.julia/packages/GenomicFeatures/VNoux/src/indexes/overlap.jl:46
 [3] iterate at /nfs/users/nfs_r/ra11/.julia/packages/GenomicFeatures/VNoux/src/indexes/overlap.jl:72 [inlined]
 [4] iterate(::GenomicFeatures.Indexes.TabixOverlapIterator{GenomicFeatures.BED.Reader}) at /nfs/users/nfs_r/ra11/.julia/packages/GenomicFeatures/VNoux/src/indexes/overlap.jl:35
 [5] top-level scope at ./REPL[5]:1

Expected Behavior

It shouldn't bring up that error. It should output intersected intervals just like tabix does.

tabix test_JI.bed.gz chr6:61374400-61374500

chr6	61374417	61374418	0,0,0,0,0,0,0,0,8,6,6,2,0,...
chr6	61374448	61374449	0,0,0,0,0,0,0,0,5,5,6,3,0,...

Current Behavior

Produces the "too large in-block offset" error.

Possible Solution / Implementation

Exclude the case block.size == 0 && inblock_offset == 0 from the error check, for example:

if inblock_offset ≥ block.size && inblock_offset > 0
    throw(ArgumentError("too large in-block offset"))
end

Missing `Project.toml`?

This project appears to be missing a Project.toml. Thus it is not understood by Pkg.jl properly.

Expected Behavior

Pkg.jl test, develop, and similar operations should all work.

Current Behavior

pkg> develop BGZFStreams
     Cloning git-repo `https://github.com/BioJulia/BGZFStreams.jl.git`
   Resolving package versions...
ERROR: could not find project file for package `BGZFStreams [28d598bf]` at `/home/jooa/.julia/dev/BGZFStreams`

Possible Solution / Implementation

Add a Project.toml to the repository.

Your Environment

  • Package Version used: current?
  • Julia Version used: 1.6
  • Operating System and version (desktop or mobile): Arch Linux
