Comments (3)
Good Evening: I'm very curious about this request. Supporting such names would be an interesting challenge in these file formats, where space is almost always the delimiter character between elements. The usual practice when multi-word names are desired in an identifier is to separate the words with the underscore character (_) or to use CamelCase notation. Some genomic annotation file formats are tab-separated, which can allow spaces in identifiers, but this does not hold for every format. Please note the definition of the sequence name format in the NCBI assembly submission guidelines: https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/#files
`Each sequence has a definition line beginning with a '>' and a unique identifier (SeqID), eg contig001, contig002. The SeqIDs must:
Be <50 characters
Can only include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
Be unique within a genome
`
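The per-identifier rules quoted above can be checked mechanically. The following is a minimal sketch in C; the function name `seqIdIsValid` is hypothetical (it is not part of the kent source tree or any NCBI tool), and rejecting the empty string is an added assumption. Uniqueness within a genome is not checked here.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of the kent source tree.
 * Returns true if id satisfies the per-identifier SeqID rules quoted above:
 * fewer than 50 characters, drawn only from letters, digits and "-_.:*#".
 * Assumption: an empty identifier is rejected.
 * Uniqueness within a genome must be checked separately by the caller. */
bool seqIdIsValid(const char *id)
{
    size_t len = strlen(id);
    if (len == 0 || len >= 50)
        return false;
    for (size_t i = 0; i < len; i++)
    {
        unsigned char c = (unsigned char)id[i];
        if (!isalnum(c) && strchr("-_.:*#", c) == NULL)
            return false;
    }
    return true;
}
```

Under these rules `contig001` and `NC_060925.1` pass, while any name containing a space fails.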
from kent.
Thanks for your interest. In principle, I generally agree with the current standards for identifier nomenclature and have no problems with them. In practice, however, spaces in identifiers are common, even when downloading genomes from NCBI.
See here: https://www.ncbi.nlm.nih.gov/nuccore/NC_060925.1?report=fasta
Where the chromosome 1 name is: NC_060925.1 Homo sapiens isolate CHM13 chromosome 1, alternate assembly T2T-CHM13v2.0
I hope this explains the nature of my motivation for this change and why I think it would be valuable. As for the technical aspect of things, my understanding is that in the `bbiWrite.c` file, the most flexible approach would involve parsing values from the end of the line backwards to a delimiter, and storing the remainder of the line (from start of the line to that delimiter) as the chromosome name. It seems to me this should be implemented in the `struct hash *bbiChromSizesFromFile(char *fileName)` block. Granted, a change might be necessary in parsing the wig file as well.
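A minimal sketch of the end-of-line parsing idea, in C. This is not the existing `bbiChromSizesFromFile` code; the function name `splitNameAndSizeFromEnd` is hypothetical, and it assumes each line ends with an unsigned decimal size preceded by whitespace.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch, not the bbiChromSizesFromFile code in bbiWrite.c.
 * Splits one "name<whitespace>size" line at its LAST whitespace run, so the
 * name may itself contain spaces. Modifies line in place.
 * On success stores the name and size and returns true. */
bool splitNameAndSizeFromEnd(char *line, char **retName, unsigned long *retSize)
{
    size_t end = strlen(line);
    /* Trim trailing whitespace, including the newline. */
    while (end > 0 && isspace((unsigned char)line[end - 1]))
        end--;
    line[end] = '\0';

    /* Walk backwards over the digits of the size field. */
    size_t sizeStart = end;
    while (sizeStart > 0 && isdigit((unsigned char)line[sizeStart - 1]))
        sizeStart--;
    if (sizeStart == end || sizeStart == 0
        || !isspace((unsigned char)line[sizeStart - 1]))
        return false;   /* no size field, or no delimiter before it */

    /* Drop the delimiter run between the name and the size. */
    size_t nameEnd = sizeStart;
    while (nameEnd > 0 && isspace((unsigned char)line[nameEnd - 1]))
        nameEnd--;
    if (nameEnd == 0)
        return false;   /* empty name */
    line[nameEnd] = '\0';

    *retName = line;
    *retSize = strtoul(line + sizeStart, NULL, 10);
    return true;
}
```

One caveat with this approach: a line whose size is missing but whose name ends in a number, such as `chromosome 1`, would be read as name `chromosome` with size 1, so the input would still need to be well formed.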
Let me know what you think and thanks again!
Hi Phillip, thanks for your suggestion. Genomics is a field with a long history. Identifiers were defined by Genbank a long time ago, and the FASTA format was defined almost 40 years ago (wow!). By definition, sequence identifiers cannot contain spaces; see https://www.ncbi.nlm.nih.gov/genbank/fastaformat/.
The identifier here is "NC_060925.1". The rest is just a human-readable description, not part of the identifier. The splitting on space in bbiWrite is intentional. Our binary formats do not support human-readable descriptions in fields that are supposed to hold only identifiers, and we have no plan to change that in the bigWig or bigBed formats, as it would break a lot of old tools.
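To illustrate the distinction between identifier and description, here is a minimal C sketch that extracts the identifier from a FASTA definition line by stopping at the first whitespace. The function name `fastaIdFromDefLine` is hypothetical and is not taken from the kent source tree.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from the kent source tree.
 * Copies the identifier of a FASTA definition line into out: the text after
 * an optional leading '>' up to the first space, tab, or line end.
 * Returns the identifier length, or 0 if it is empty or does not fit. */
size_t fastaIdFromDefLine(const char *defLine, char *out, size_t outSize)
{
    if (*defLine == '>')
        defLine++;
    size_t len = strcspn(defLine, " \t\r\n");
    if (len == 0 || len >= outSize)
        return 0;
    memcpy(out, defLine, len);
    out[len] = '\0';
    return len;
}
```

Applied to the definition line from the link above, this yields `NC_060925.1` and drops everything from "Homo sapiens" onward.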
If you want human-readable descriptions in your pipeline, you can either use underscores or - better - store the descriptions elsewhere (text file, SQL database, etc.) and show them when you display the identifiers to the user.
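The "store the descriptions elsewhere" option could look roughly like the following C sketch, assuming a small in-memory table loaded from wherever the pipeline keeps its descriptions. The struct `seqDescription` and the function `displayLabelForId` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical side table mapping identifiers to descriptions. */
struct seqDescription
{
    const char *id;           /* identifier as stored in bigWig/bigBed */
    const char *description;  /* human-readable text shown to the user */
};

/* Returns a display label for id: its description when the table has one,
 * otherwise the identifier itself. Uses a simple linear search. */
const char *displayLabelForId(const struct seqDescription *table, size_t count,
                              const char *id)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(table[i].id, id) == 0)
            return table[i].description;
    return id;
}
```

The binary files keep plain identifiers, and the description is looked up only at display time. For many sequences a hash table would be a better fit than a linear search.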