In kent/src/lib/cirTree.c, in the function …

Incorrect number of bytes for empty slots in rWriteLeaves (kent issue, open)

mdehoon commented on May 27, 2024

Incorrect number of bytes for empty slots in rWriteLeaves

Comments (5)

maximilianh commented on May 27, 2024

Hi @mdehoon, can you tell us why you think that? A leaf has a size, an index doesn't have a size.

Also, the bigBed format was published almost two decades ago, so we certainly cannot play with the byte counts anymore.

I'm curious: why are you looking into this at this level of detail? Did you find a particular bug or a test case where the format failed for you?

mdehoon commented on May 27, 2024

@maximilianh Thank you for your quick reply.

There is a bigBed parser implementation in Biopython, and I was looking at the source code of bedToBigBed to confirm that the parser works correctly.

I understand that the bigBed format cannot be changed at this point. I am just trying to understand if this really is a bug. If it is, it may leave some parts of the output bigBed file undefined.

In rWriteLeaves, the first loop is executed countOne times; each iteration writes 32 bytes.
The second loop fills the remaining itemsPerSlot - countOne slots with zeros. I would expect it to also write 32 bytes in each iteration, so that itemsPerSlot * 32 bytes are written in total. Instead, the fill uses indexSlotSize rather than leafSlotSize, so each empty slot gets fewer bytes than a filled one.
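To make the expected and actual byte counts concrete, here is a minimal sketch. It is not code from the kent library: the function name is hypothetical, and the constants assume that leafSlotSize is 32 (as stated above) and that indexSlotSize is 24, which should be checked against cirTree.c.

```c
#include <assert.h>

/* Assumed slot sizes; verify against the definitions in cirTree.c. */
enum { LEAF_SLOT_SIZE = 32, INDEX_SLOT_SIZE = 24 };

/* Hypothetical helper: bytes written for the slots of one leaf node
 * (node header excluded). The first countOne slots are filled and take
 * LEAF_SLOT_SIZE bytes each; the remaining slots are zero-padded with
 * padSlotSize bytes each. */
static long leafSlotBytes(int itemsPerSlot, int countOne, int padSlotSize)
{
    assert(countOne >= 0 && countOne <= itemsPerSlot);
    return (long)countOne * LEAF_SLOT_SIZE
         + (long)(itemsPerSlot - countOne) * padSlotSize;
}
```

Under these assumptions, a node with itemsPerSlot = 256 and countOne = 200 gives leafSlotBytes(256, 200, LEAF_SLOT_SIZE) = 8192, while leafSlotBytes(256, 200, INDEX_SLOT_SIZE) = 7744. The node comes up 448 bytes short, which is 8 bytes for each of its 56 empty slots.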

For comparison, see writeIndexLevel in bPlusTree.c. In the first loop:

    for (j=i; j<endIx; j += slotSizePer)
        {
        void *item = items + j*itemSize;
        memset(keyBuf, 0, keySize);
        (*fetchKey)(item, keyBuf);
        mustWrite(f, keyBuf, keySize);
        writeOne(f, nextChild);
        nextChild += bytesInNextLevelBlock;
        ++slotsUsed;
        }

we write keySize + sizeof(nextChild) = keySize + sizeof(bits64) bytes in each iteration; in the second loop

    int slotSize = keySize + sizeof(bits64);
    for (j=countOne; j<blockSize; ++j)
        repeatCharOut(f, 0, slotSize);
    }

we fill each of the remaining blockSize - countOne slots with keySize + sizeof(bits64) bytes, i.e. the same number of bytes in each iteration as in the first loop.
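The pattern in writeIndexLevel (pad with the same slot size that the filled slots use) can be sketched in isolation. The helper below is hypothetical and is not part of the kent library. It only illustrates the invariant that a block always occupies blockSize * slotSize bytes, however many slots are in use.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not from the kent source: write one fixed-size
 * block of blockSize slots. The first slotsUsed slots are copied from
 * `slots`; the rest are zero-filled using the SAME slotSize, so the
 * block always takes blockSize * slotSize bytes.
 * Returns the number of bytes written, or -1 on a write error. */
static long writePaddedBlock(FILE *f, const unsigned char *slots,
                             int slotsUsed, int blockSize, int slotSize)
{
    assert(slotsUsed >= 0 && slotsUsed <= blockSize && slotSize > 0);
    long written = 0;
    if (slotsUsed > 0) {
        size_t n = fwrite(slots, (size_t)slotSize, (size_t)slotsUsed, f);
        if (n != (size_t)slotsUsed)
            return -1;
        written += (long)slotsUsed * slotSize;
    }
    /* Pad each unused slot with slotSize zero bytes. */
    for (int j = slotsUsed; j < blockSize; ++j) {
        for (int k = 0; k < slotSize; ++k)
            if (fputc(0, f) == EOF)
                return -1;
        written += slotSize;
    }
    return written;
}
```

With three 32-byte slots used out of eight, the call writes 8 * 32 = 256 bytes, the same as a full block.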

maximilianh commented on May 27, 2024

I think you're right, I just struggle to see why this has never led to problems with billions of requests over this time on our website... I'll run it by more senior engineers here and get back to you.

mdehoon commented on May 27, 2024

Thank you.

> I just struggle to see why this has never led to problems with billions of requests over this time on our website...

Maybe it's because the bigBed file is interpreted in the same way regardless of whether indexSlotSize or leafSlotSize is used in rWriteLeaves, at least by bigBedToBed, so it may just affect performance.

Using original bedToBigBed:

    $ bedToBigBed -as=bed12.as ucsc.bed hg38.chrom.sizes ucsc.old.bb
    pass1 - making usageList (47 chroms): 94 millis
    pass2 - checking and writing primary data (251295 records, 12 fields): 2068 millis

After replacing indexSlotSize by leafSlotSize in rWriteLeaves:

    $ bedToBigBed -as=bed12.as ucsc.bed hg38.chrom.sizes ucsc.new.bb
    pass1 - making usageList (47 chroms): 76 millis
    pass2 - checking and writing primary data (251295 records, 12 fields): 2076 millis

This increases the size of the bigBed file a little:

    $ ls -l ucsc.old.bb
    -rw-r--r--  1 mdehoon  staff  11892583 Apr  7 08:32 ucsc.old.bb
    $ ls -l ucsc.new.bb
    -rw-r--r--  1 mdehoon  staff  11909687 Apr  7 08:32 ucsc.new.bb
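As a rough sanity check on these two sizes, the sketch below works out how many empty leaf slots the difference would correspond to. It assumes the only change is the padding width, and that each padded slot grows by leafSlotSize - indexSlotSize = 8 bytes (again assuming indexSlotSize is 24). The function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: number of padded slots implied by a file-size
 * difference, assuming every padded slot grows by exactly
 * bytesPerSlotDelta bytes and nothing else in the file changes.
 * Returns -1 if the difference is not a whole multiple of the delta. */
static long impliedEmptySlots(long oldSize, long newSize, long bytesPerSlotDelta)
{
    assert(bytesPerSlotDelta > 0 && newSize >= oldSize);
    long diff = newSize - oldSize;
    return (diff % bytesPerSlotDelta == 0) ? diff / bytesPerSlotDelta : -1;
}
```

For the files above, 11909687 - 11892583 = 17104 bytes, and impliedEmptySlots(11892583, 11909687, 8) gives 2138. The difference is an exact multiple of 8, which is at least consistent with the padding being the only thing that changed.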

Running bigBedToBed (without modifications) on these bigBed files results in the exact same bed file:

    $ bigBedToBed ucsc.old.bb ucsc.old.bed
    $ bigBedToBed ucsc.new.bb ucsc.new.bed
    $ ls -l ucsc.old.bed ucsc.new.bed
    -rw-r--r--  1 mdehoon  staff  33084176 Apr  7 08:33 ucsc.new.bed
    -rw-r--r--  1 mdehoon  staff  33084176 Apr  7 08:33 ucsc.old.bed
    $ md5 ucsc.old.bed ucsc.new.bed
    MD5 (ucsc.old.bed) = 982fac74c700bc69ce85735a93bbab9c
    MD5 (ucsc.new.bed) = 982fac74c700bc69ce85735a93bbab9c
    $ diff ucsc.old.bed ucsc.new.bed
    $

maximilianh commented on May 27, 2024

It looks like this will be backwards compatible. I still want to ask others here, but to me it looks very likely that we'll make this change. This is as good as an error report can be. Thank you!
