Comments (5)
Hi @mdehoon, can you tell us why you think that? A leaf has a size, an index doesn't have a size.
Also, the bigBed format was published almost two decades ago, we can certainly not play with the byte counts anymore.
I'm curious: why are you looking into this in such detail? Did you find a particular bug, or a test case where the format failed for you?
from kent.
@maximilianh Thank you for your quick reply.
There is a bigBed parser implementation in Biopython, and I was looking at the source code of bedToBigBed to confirm it works correctly.
I understand that the bigBed format cannot be changed at this point. I am just trying to understand if this really is a bug. If it is, it may leave some parts of the output bigBed file undefined.
The first loop is executed countOne times; each iteration writes 32 bytes.
The second loop fills the remaining itemsPerSlot - countOne slots with zeros. I would expect this loop to also write 32 bytes in each iteration, so that in total itemsPerSlot * 32 bytes are written.
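The byte counts described above can be sketched as follows. This is only an illustration and not code from the kent tree: the function blockBytes and its parameter names are hypothetical, and the 32-byte slot size is the figure quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the kent source.
 * Total bytes emitted for one block when `used` slots are written at
 * usedSlotBytes each and the remaining (slotsPerBlock - used) slots are
 * zero-filled at padSlotBytes each. */
size_t blockBytes(size_t slotsPerBlock, size_t used,
                  size_t usedSlotBytes, size_t padSlotBytes)
{
    assert(used <= slotsPerBlock);
    return used * usedSlotBytes + (slotsPerBlock - used) * padSlotBytes;
}
```

If padSlotBytes equals usedSlotBytes, the result is always slotsPerBlock * usedSlotBytes, whatever the value of used. If the padding loop uses a smaller slot size, the block comes up short by (slotsPerBlock - used) times the difference.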
For comparison, see writeIndexLevel in bPlusTree.c. In the first loop:
    for (j=i; j<endIx; j += slotSizePer)
        {
        void *item = items + j*itemSize;
        memset(keyBuf, 0, keySize);
        (*fetchKey)(item, keyBuf);
        mustWrite(f, keyBuf, keySize);
        writeOne(f, nextChild);
        nextChild += bytesInNextLevelBlock;
        ++slotsUsed;
        }
we write keySize + sizeof(nextChild) = keySize + sizeof(bits64) bytes in each iteration; in the second loop
    int slotSize = keySize + sizeof(bits64);
    for (j=countOne; j<blockSize; ++j)
        repeatCharOut(f, 0, slotSize);
    }
we fill each of the remaining blockSize - countOne slots with keySize + sizeof(bits64) bytes, i.e. the same number of bytes in each iteration as in the first loop.
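A minimal, self-contained sketch of that pattern, where used and unused slots share one slot size, might look like the following. It is not the kent implementation: writePaddedBlock and its parameters are hypothetical names, and it uses plain stdio calls rather than mustWrite or repeatCharOut.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sketch, not kent code: write `used` slots of slotSize bytes
 * each from `slots`, then zero-fill the unused slots with the SAME slotSize,
 * so every block occupies slotsPerBlock * slotSize bytes.
 * Returns the number of bytes written, or -1 on an I/O error. */
long writePaddedBlock(FILE *f, const unsigned char *slots,
                      int used, int slotsPerBlock, int slotSize)
{
    assert(used >= 0 && used <= slotsPerBlock && slotSize > 0);
    long start = ftell(f);
    if (start < 0)
        return -1;
    /* First loop: the slots that carry data. */
    for (int j = 0; j < used; ++j)
        if (fwrite(slots + (size_t)j * slotSize, 1, slotSize, f) != (size_t)slotSize)
            return -1;
    /* Second loop: pad each remaining slot with slotSize zero bytes. */
    for (int j = used; j < slotsPerBlock; ++j)
        for (int k = 0; k < slotSize; ++k)
            if (fputc(0, f) == EOF)
                return -1;
    long end = ftell(f);
    return end < 0 ? -1 : end - start;
}
```

Because both loops use the same slotSize, the block length does not depend on how many slots are actually in use.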
I think you're right. I just struggle to understand why this has never led to problems with billions of requests over this time on our website... I'll run it by more senior engineers here and get back to you.
Thank you.
I just struggle to understand why this has never led to problems with billions of requests over this time on our website...
Maybe it's because the bigBed file is interpreted in the same way regardless of whether indexSlotSize or leafSlotSize is used in rWriteLeaves, at least by bigBedToBed, so it may just affect performance.
Using original bedToBigBed:
$ bedToBigBed -as=bed12.as ucsc.bed hg38.chrom.sizes ucsc.old.bb
pass1 - making usageList (47 chroms): 94 millis
pass2 - checking and writing primary data (251295 records, 12 fields): 2068 millis
After replacing indexSlotSize by leafSlotSize in rWriteLeaves:
$ bedToBigBed -as=bed12.as ucsc.bed hg38.chrom.sizes ucsc.new.bb
pass1 - making usageList (47 chroms): 76 millis
pass2 - checking and writing primary data (251295 records, 12 fields): 2076 millis
This increases the size of the bigBed file a little:
$ ls -l ucsc.old.bb
-rw-r--r-- 1 mdehoon staff 11892583 Apr 7 08:32 ucsc.old.bb
$ ls -l ucsc.new.bb
-rw-r--r-- 1 mdehoon staff 11909687 Apr 7 08:32 ucsc.new.bb
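As a rough sanity check on the size increase, the growth can be divided by the per-slot difference. The sketch below is hypothetical (paddedSlotsFromGrowth is not a kent function), and it assumes that each padded slot grows by 8 bytes, i.e. that indexSlotSize is 24 and leafSlotSize is 32. Those two values are not stated in this thread and should be checked against the source.

```c
#include <assert.h>

/* Hypothetical helper, not part of the kent source.
 * Number of padded slots implied by a file-size increase, assuming every
 * padded slot grew by exactly bytesPerSlot bytes (an assumption; see above).
 * Returns -1 if the growth is not a whole multiple of bytesPerSlot. */
long long paddedSlotsFromGrowth(long long oldSize, long long newSize,
                                long long bytesPerSlot)
{
    assert(bytesPerSlot > 0 && newSize >= oldSize);
    long long growth = newSize - oldSize;
    return (growth % bytesPerSlot == 0) ? growth / bytesPerSlot : -1;
}
```

With the sizes listed above, 11909687 - 11892583 = 17104 bytes, which under the 8-byte assumption corresponds to 2138 padded slots.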
Running bigBedToBed (without modifications) on these bigBed files results in the exact same bed file:
$ bigBedToBed ucsc.old.bb ucsc.old.bed
$ bigBedToBed ucsc.new.bb ucsc.new.bed
$ ls -l ucsc.old.bed ucsc.new.bed
-rw-r--r-- 1 mdehoon staff 33084176 Apr 7 08:33 ucsc.new.bed
-rw-r--r-- 1 mdehoon staff 33084176 Apr 7 08:33 ucsc.old.bed
$ md5 ucsc.old.bed ucsc.new.bed
MD5 (ucsc.old.bed) = 982fac74c700bc69ce85735a93bbab9c
MD5 (ucsc.new.bed) = 982fac74c700bc69ce85735a93bbab9c
$ diff ucsc.old.bed ucsc.new.bed
$
It looks like this will be backwards compatible, and I still do want to ask others here, but to me, it looks very likely that we'll make this change. This is as good as an error report can be - thank you!