
Comments (3)

NullModel commented on May 27, 2024

Good evening. I'm very curious about this request. Supporting such names would be an interesting challenge, because in these file formats a space is almost always the delimiter between elements. The usual practice when a multi-word name is wanted in an identifier is to join the words with an underscore (_) or to use CamelCase notation. Some genomic annotation file formats are tab-separated, which can allow spaces in identifiers, but that is not a universal rule. Please note the definition of the sequence name format in the NCBI assembly submission guidelines: https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/#files

`Each sequence has a definition line beginning with a '>' and a unique identifier (SeqID), eg contig001, contig002. The SeqIDs must:

Be <50 characters
Can only include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
Be unique within a genome

`
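The per-identifier rules quoted above can be checked mechanically. The following is a minimal sketch, not code from the kent tree: the function name `seqIdIsValid` is hypothetical, and treating an empty string as invalid is an assumption, since the guidelines do not mention it. Uniqueness within a genome needs a separate pass over all identifiers and is not covered here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of the kent source tree.
 * Returns true if seqId is shorter than 50 characters and contains only
 * letters, digits, and the punctuation listed in the quoted rules.
 * Assumption: an empty identifier is rejected. */
bool seqIdIsValid(const char *seqId)
{
    assert(seqId != NULL);
    size_t len = strlen(seqId);
    if (len == 0 || len >= 50)
        return false;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)seqId[i];
        /* Allowed punctuation: hyphen, underscore, period, colon,
         * asterisk, number sign. */
        if (!isalnum(c) && strchr("-_.:*#", c) == NULL)
            return false;
    }
    return true;
}
```

Under these rules an identifier such as NC_060925.1 passes, while any name containing a space fails.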

from kent.

philippguevorguian commented on May 27, 2024

Thanks for your interest. In principle I agree with the current standards for identifier nomenclature and have no problem with them. In practice, however, spaces in identifiers are common, even in genomes downloaded from NCBI.

See here: https://www.ncbi.nlm.nih.gov/nuccore/NC_060925.1?report=fasta
Where the chromosome 1 name is: NC_060925.1 Homo sapiens isolate CHM13 chromosome 1, alternate assembly T2T-CHM13v2.0

I hope this explains my motivation for this change and why I think it would be valuable. On the technical side, my understanding is that in bbiWrite.c the most flexible approach would be to parse values from the end of the line backwards to a delimiter, and to store the remainder of the line (from the start of the line to that delimiter) as the chromosome name. It seems to me this should be implemented in the struct hash *bbiChromSizesFromFile(char *fileName) function. Granted, a change to the wig file parsing might be necessary as well.
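To make the parse-from-the-end idea concrete, here is a rough sketch for a single two-column "name size" line. It is not the existing bbiChromSizesFromFile() code. The helper name `splitNameAndSizeFromEnd` is hypothetical, and it assumes the size is a run of decimal digits at the end of the line, preceded by at least one whitespace character.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the actual kent implementation.
 * Splits "name<whitespace>size" at the LAST whitespace run, so the name
 * may itself contain spaces. Modifies line in place.
 * Returns false if the line has no trailing size or no name. */
bool splitNameAndSizeFromEnd(char *line, char **retName, unsigned long *retSize)
{
    assert(line != NULL && retName != NULL && retSize != NULL);
    size_t end = strlen(line);
    /* Trim trailing whitespace, including the newline. */
    while (end > 0 && isspace((unsigned char)line[end - 1]))
        end--;
    line[end] = '\0';
    /* Walk backwards over the digits of the size field. */
    size_t sizeStart = end;
    while (sizeStart > 0 && isdigit((unsigned char)line[sizeStart - 1]))
        sizeStart--;
    if (sizeStart == end)
        return false;   /* no digits at the end of the line */
    if (sizeStart == 0 || !isspace((unsigned char)line[sizeStart - 1]))
        return false;   /* no delimiter in front of the size */
    /* Walk backwards over the delimiter to find the end of the name. */
    size_t nameEnd = sizeStart;
    while (nameEnd > 0 && isspace((unsigned char)line[nameEnd - 1]))
        nameEnd--;
    if (nameEnd == 0)
        return false;   /* nothing left for the name */
    line[nameEnd] = '\0';
    *retName = line;
    *retSize = strtoul(line + sizeStart, NULL, 10);
    return true;
}
```

With this approach a line such as "Homo sapiens chromosome 1", a tab, and 248387328 yields the full multi-word name and the size, while a line with only one field is rejected.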

Let me know what you think and thanks again!


maximilianh commented on May 27, 2024

Hi Philipp, thanks for your suggestion. Genomics is a field with a long history. Identifiers were defined by GenBank a long time ago, and the FASTA format was defined almost 40 years ago (wow!). Sequence identifiers cannot contain spaces by definition; see https://www.ncbi.nlm.nih.gov/genbank/fastaformat/.

The identifier here is "NC_060925.1"; the rest is just a human-readable description, not part of the identifier. The splitting on space in bbiWrite is intentional. Our binary formats do not support human-readable descriptions in fields that are supposed to hold only identifiers, and we have no plan to change that in the bigWig or bigBed formats, because it would break a lot of old tools.
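As an illustration of the identifier/description split, here is a minimal sketch that cuts a FASTA definition line at the first whitespace character. The function name `splitFastaDefLine` is hypothetical and this is not the bbiWrite code; it only shows the convention that everything before the first space or tab is the identifier.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not taken from the kent source tree.
 * Splits a FASTA definition line in place. A leading '>' is skipped.
 * Returns the identifier (text up to the first space, tab, or line end)
 * and sets *retDescription to the rest of the line, or to an empty
 * string if there is no description. */
char *splitFastaDefLine(char *defLine, char **retDescription)
{
    assert(defLine != NULL && retDescription != NULL);
    char *id = defLine;
    if (*id == '>')
        id++;
    /* The identifier ends at the first whitespace character. */
    char *rest = id + strcspn(id, " \t\r\n");
    if (*rest != '\0') {
        *rest++ = '\0';                      /* terminate the identifier */
        rest += strspn(rest, " \t");         /* skip extra delimiters */
        rest[strcspn(rest, "\r\n")] = '\0';  /* drop the line ending */
    }
    *retDescription = rest;
    return id;
}
```

Applied to the definition line of the record linked earlier in the thread, this returns NC_060925.1 as the identifier and leaves "Homo sapiens isolate CHM13 chromosome 1, alternate assembly T2T-CHM13v2.0" as the description, which can then be stored separately from the binary file.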

If you want human-readable descriptions in your pipeline, you can either use underscores or, better, store the descriptions elsewhere (text file, SQL database, etc.) and show them when you display the identifiers to the user.
