Describe the bug In using the wigToBigWig utility, an error is th

wigToBigWig doesn't work for chromosomes with spaces in name (ucscgenomebrowser/kent)

Comments (3)

NullModel commented on May 27, 2024

Good Evening: I'm very curious about this request. Supporting such names would be an interesting challenge, because in these file formats whitespace is almost always the delimiter between fields. The usual practice when a multi-word name is wanted in an identifier is to join the words with an underscore (_) or to use CamelCase. Some genomic annotation formats are tab-separated, which can allow spaces in identifiers, but that is not universal. Please note the definition of the sequence name format in the NCBI assembly submission guidelines: https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/#files

Each sequence has a definition line beginning with a '>' and a unique identifier (SeqID), e.g. contig001, contig002. The SeqIDs must:

Be <50 characters
Can only include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
Be unique within a genome
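
For anyone who wants to check a set of names against the three rules quoted above before building files, here is a minimal sketch. The function name `seqid_problems` is made up for illustration and is not part of kent or any NCBI tool; it only encodes the rules as quoted (fewer than 50 characters, the listed character set, uniqueness).

```python
import re

# Character set taken from the guideline text quoted above.
_ALLOWED = re.compile(r"[A-Za-z0-9_.:*#-]+")

def seqid_problems(seq_ids):
    """Hypothetical helper: map each offending SeqID to a list of reasons."""
    problems = {}
    seen = set()
    for seq_id in seq_ids:
        reasons = []
        if len(seq_id) >= 50:
            reasons.append("50 characters or longer")
        if not _ALLOWED.fullmatch(seq_id):
            reasons.append("empty or contains a disallowed character")
        if seq_id in seen:
            reasons.append("duplicate")
        seen.add(seq_id)
        if reasons:
            problems[seq_id] = reasons
    return problems

if __name__ == "__main__":
    # A name containing a space is reported; the plain contig names pass.
    print(seqid_problems(["contig001", "contig002", "chr 1"]))
```

A name with a space in it fails the character-set rule, which is the case this thread is about.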


philippguevorguian commented on May 27, 2024

Thanks for your interest. I generally agree with the current standards for identifier nomenclature and have no problem with them in principle. In practice, however, spaces in identifiers are common, even when downloading genomes from NCBI.

See here: https://www.ncbi.nlm.nih.gov/nuccore/NC_060925.1?report=fasta
Where the chromosome 1 name is: NC_060925.1 Homo sapiens isolate CHM13 chromosome 1, alternate assembly T2T-CHM13v2.0

I hope this explains my motivation for this change and why I think it would be valuable. As for the technical side, my understanding is that in the bbiWrite.c file the most flexible approach would be to parse the values from the end of the line backwards to a delimiter, and to store the remainder of the line (from the start of the line to that delimiter) as the chromosome name. It seems to me this should be implemented in the struct hash *bbiChromSizesFromFile(char *fileName) function. Granted, a change might be necessary in the parsing of the wig file as well.
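
To make the proposal concrete, here is a rough sketch of the "parse from the end of the line" idea, written in Python rather than C for brevity. It is not the existing kent code; `parse_chrom_sizes_line` is a hypothetical name, and the sizes in the example are made-up values. It assumes a chrom.sizes-style line whose last whitespace-separated field is the size.

```python
def parse_chrom_sizes_line(line):
    """Hypothetical sketch: treat the last field as the size and
    everything before it as the chromosome name (spaces allowed)."""
    parts = line.strip().rsplit(None, 1)  # split once, starting from the right
    if len(parts) != 2:
        raise ValueError("expected '<name> <size>', got: %r" % line)
    name, size_text = parts
    return name, int(size_text)  # int() raises ValueError on a non-numeric size

if __name__ == "__main__":
    # Sizes below are illustrative only.
    print(parse_chrom_sizes_line("chr 1\t1000\n"))
    print(parse_chrom_sizes_line("NC_060925.1 Homo sapiens chromosome 1 2000"))
```

Splitting from the right only works where the name is the first field and the number of trailing fields is fixed, so data lines in the wig input would need their own handling.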

Let me know what you think and thanks again!


maximilianh commented on May 27, 2024

Hi Phillip, thanks for your suggestion. Genomics is a field with a long history. Identifiers were defined by GenBank a long time ago, and the FASTA format was defined almost 40 years ago (wow!). Sequence identifiers cannot contain spaces by definition, see https://www.ncbi.nlm.nih.gov/genbank/fastaformat/.

The identifier here is "NC_060925.1"; the rest is just a human-readable description, not part of the identifier. The splitting on space in bbiWrite is intentional. Our binary formats do not support human-readable descriptions in fields that are supposed to hold only identifiers, and we have no plan to change that in the bigWig or bigBed formats, because it would break a lot of old tools.

If you want human-readable descriptions in your pipeline, you can either use underscores or, better, store the descriptions elsewhere (text file, SQL database, etc.) and show them when you display the identifiers to the user.
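
As a sketch of the "store the descriptions elsewhere" option: the snippet below rewrites FASTA headers so they carry only the identifier (the first whitespace-delimited word after '>') and collects the descriptions in a dictionary that could then be written to a text file or database. The function name `strip_descriptions` is hypothetical and not part of kent.

```python
def strip_descriptions(fasta_lines):
    """Hypothetical helper: return (cleaned_lines, descriptions), where
    headers keep only the identifier and descriptions maps id -> text."""
    cleaned = []
    descriptions = {}
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].strip().split(None, 1)  # identifier, then the rest
            if not fields:
                raise ValueError("empty FASTA header")
            seq_id = fields[0]
            descriptions[seq_id] = fields[1] if len(fields) > 1 else ""
            cleaned.append(">" + seq_id)
        else:
            cleaned.append(line.rstrip("\r\n"))  # sequence lines pass through
    return cleaned, descriptions

if __name__ == "__main__":
    header = (">NC_060925.1 Homo sapiens isolate CHM13 chromosome 1, "
              "alternate assembly T2T-CHM13v2.0\n")
    cleaned, descriptions = strip_descriptions([header, "ACGT\n"])
    print(cleaned)
    print(descriptions)
```

With headers cleaned this way, the chrom.sizes and wig files can use NC_060925.1 directly, and the description is looked up by identifier at display time.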

