Comments (5)

maximilianh commented on May 27, 2024

Hi @mdehoon, can you tell us why you think that? A leaf has a size; an index doesn't have a size.

Also, the bigBed format was published almost two decades ago, so we certainly cannot change the byte counts anymore.

I'm curious: why are you looking into this in such detail? Did you find a particular bug or a test case for which the format failed?

from kent.

mdehoon commented on May 27, 2024

@maximilianh Thank you for your quick reply.

There is a bigBed parser implementation in Biopython, and I was looking at the source code of bedToBigBed to confirm it works correctly.

I understand that the bigBed format cannot be changed at this point. I am just trying to understand if this really is a bug. If it is, it may leave some parts of the output bigBed file undefined.

The first loop in rWriteLeaves is executed countOne times; each iteration writes 32 bytes.
The second loop fills the remaining itemsPerSlot - countOne slots with zeros. I would expect it to also write 32 bytes in each iteration, so that itemsPerSlot * 32 bytes are written in total.

For comparison, see writeIndexLevel in bPlusTree.c. In the first loop:

    for (j=i; j<endIx; j += slotSizePer)
        {
        void *item = items + j*itemSize;
        memset(keyBuf, 0, keySize);
        (*fetchKey)(item, keyBuf);
        mustWrite(f, keyBuf, keySize);
        writeOne(f, nextChild);
        nextChild += bytesInNextLevelBlock;
        ++slotsUsed;
        }

we write keySize + sizeof(nextChild) = keySize + sizeof(bits64) bytes in each iteration; in the second loop

    int slotSize = keySize + sizeof(bits64);
    for (j=countOne; j<blockSize; ++j)
        repeatCharOut(f, 0, slotSize);
    }

we fill each of the remaining blockSize - countOne slots with keySize + sizeof(bits64) bytes, i.e. the same number of bytes in each iteration as in the first loop.


maximilianh commented on May 27, 2024

I think you're right. I just struggle to understand why this has never led to problems with billions of requests over this time on our website... I'll run it by more senior engineers here and get back to you.


mdehoon commented on May 27, 2024

Thank you.

> I just struggle to understand why this has never led to problems with billions of requests over this time on our website...

Maybe it's because the bigBed file is interpreted in the same way regardless of whether indexSlotSize or leafSlotSize is used in rWriteLeaves, at least by bigBedToBed, so it may just affect performance.

Using the original bedToBigBed:

$ bedToBigBed -as=bed12.as ucsc.bed hg38.chrom.sizes ucsc.old.bb
pass1 - making usageList (47 chroms): 94 millis
pass2 - checking and writing primary data (251295 records, 12 fields): 2068 millis

After replacing indexSlotSize by leafSlotSize in rWriteLeaves:

$ bedToBigBed -as=bed12.as ucsc.bed hg38.chrom.sizes ucsc.new.bb
pass1 - making usageList (47 chroms): 76 millis
pass2 - checking and writing primary data (251295 records, 12 fields): 2076 millis

This increases the size of the bigBed file a little:

$ ls -l ucsc.old.bb 
-rw-r--r--  1 mdehoon  staff  11892583 Apr  7 08:32 ucsc.old.bb
$ ls -l ucsc.new.bb
-rw-r--r--  1 mdehoon  staff  11909687 Apr  7 08:32 ucsc.new.bb

Running bigBedToBed (without modifications) on these bigBed files results in the exact same bed file:

$ bigBedToBed ucsc.old.bb ucsc.old.bed

$ bigBedToBed ucsc.new.bb ucsc.new.bed

$ ls -l ucsc.old.bed ucsc.new.bed
-rw-r--r--  1 mdehoon  staff  33084176 Apr  7 08:33 ucsc.new.bed
-rw-r--r--  1 mdehoon  staff  33084176 Apr  7 08:33 ucsc.old.bed

$ md5 ucsc.old.bed ucsc.new.bed
MD5 (ucsc.old.bed) = 982fac74c700bc69ce85735a93bbab9c
MD5 (ucsc.new.bed) = 982fac74c700bc69ce85735a93bbab9c

$ diff ucsc.old.bed ucsc.new.bed 

$


maximilianh commented on May 27, 2024

It looks like this will be backwards compatible. I still want to ask others here, but to me it looks very likely that we'll make this change. This is as good as an error report can be. Thank you!
