dnu's Introduction

This is the specification for the DNU file format. 'DNU' stands for 'Do Not Use.' It is an alternative to Comma Separated Value (CSV) and similar file formats.

Please note that this project is in the planning stage.

Description

Whereas CSV files use commas and newlines to delimit fields and records, DNU files use the delimiters specified in the ASCII standard. Instead of a comma (or tab or |) to separate fields within a record, DNU files use ASCII 31 (the Unit Separator character). Instead of newlines to separate records, DNU files use ASCII 30 (the Record Separator character).
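As a rough illustration of the two delimiters described above, here is a minimal Python sketch. The format is still in the planning stage, so the function and constant names (`encode_dnu`, `decode_dnu`, `UNIT_SEPARATOR`, `RECORD_SEPARATOR`) are hypothetical and not part of any specification; the sketch also assumes that fields never contain the delimiter characters themselves.

```python
# Hypothetical sketch only: the DNU spec is unfinished and these names are illustrative.
UNIT_SEPARATOR = "\x1f"    # ASCII 31, separates fields within a record
RECORD_SEPARATOR = "\x1e"  # ASCII 30, separates records


def encode_dnu(records):
    """Join rows of string fields into one DNU-style string."""
    return RECORD_SEPARATOR.join(
        UNIT_SEPARATOR.join(fields) for fields in records
    )


def decode_dnu(text):
    """Split a DNU-style string back into rows of string fields."""
    return [record.split(UNIT_SEPARATOR) for record in text.split(RECORD_SEPARATOR)]


# Commas and newlines inside fields need no quoting or escaping here.
rows = [["name", "address"], ["Doe, J.Q.", "12 Main St\nApt 4"]]
encoded = encode_dnu(rows)
assert decode_dnu(encoded) == rows
```

Note that the comma and the newline in the second row pass through untouched, which is the situation that forces quoting in CSV.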

The ASCII standard also specifies characters 28 (File Separator) and 29 (Group Separator). This standard will probably make use of those characters as well. On the off-chance that any of these characters actually needs to be escaped, ASCII character 16 (Data Link Escape) will be reserved for that purpose.
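How the File and Group Separators would be used has not been decided. One plausible reading, shown here purely as an assumption, is a strict nesting of files, groups, records, and fields. The function name `parse_nested` is hypothetical, and the sketch leaves Data Link Escape undefined because no escaping rules exist yet.

```python
# Assumed hierarchy (not confirmed by the spec): file > group > record > field.
FILE_SEPARATOR = "\x1c"    # ASCII 28
GROUP_SEPARATOR = "\x1d"   # ASCII 29
RECORD_SEPARATOR = "\x1e"  # ASCII 30
UNIT_SEPARATOR = "\x1f"    # ASCII 31
DATA_LINK_ESCAPE = "\x10"  # ASCII 16, reserved; escaping rules are not yet defined


def parse_nested(text):
    """Split text into a list of files, each a list of groups of records of fields."""
    return [
        [
            [record.split(UNIT_SEPARATOR) for record in group.split(RECORD_SEPARATOR)]
            for group in file_part.split(GROUP_SEPARATOR)
        ]
        for file_part in text.split(FILE_SEPARATOR)
    ]


# Two "files": the first holds two groups, the second holds one.
sample = "a\x1fb\x1ec\x1fd\x1de\x1ff\x1cg\x1fh"
parsed = parse_nested(sample)
assert parsed[0][0] == [["a", "b"], ["c", "d"]]
assert parsed[1] == [[["g", "h"]]]
```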

Inspiration

This project was inspired by this excellent bit of fact-trolling. After reading it, I had the idea of making a file standard which made use of those ASCII characters in order to address the drawbacks of the CSV format. Further inspiration comes from the Hacker News and Reddit commentary on Ronald Duncan's original weblog entry.

Here's the basic idea. Long ago, everyone felt the need to transfer text in a database-like format between user-level programs. So someone (likely, several someones) came up with the quick & dirty hack of putting this information in a text file where each record was its own line and each field was separated by commas. Such a simple convention was trivial to encode by the sending application and trivial to decode by the receiving application.

Unfortunately, commas and newline characters may appear in legitimate textual data, so they must be escaped if they appear in a field. The character used to escape them must itself be escaped if it appears in a field literally. Furthermore, many locales treat commas differently than the United States does; several European locales, for example, use the comma as the decimal separator. Then there's the fact that line breaks have never been standardised. Unix uses ASCII 10 (LF for Line Feed) as a newline character while MS-Windows adds the Carriage Return character to make the ASCII sequence '13 10' (CR LF).
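To make the escaping problem concrete, the following sketch uses Python's standard `csv` module with its default dialect. It shows that a field containing a comma, a quote, or a newline has to be wrapped in quotes (and embedded quotes doubled), and that a naive split on commas gives the wrong number of fields. The sample row is made up for illustration.

```python
import csv
import io

# One record whose fields contain a comma, double quotes, and a newline.
row = ["Doe, J.Q.", 'He said "hi"', "line one\nline two"]

# Write the record with the csv module's default dialect.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
encoded = buffer.getvalue()
print(repr(encoded))  # the quoting added by the writer is visible here

# A naive split on commas miscounts the fields.
naive_fields = encoded.split(",")
assert len(naive_fields) != len(row)

# A real CSV parser is needed to undo the quoting.
decoded = next(csv.reader(io.StringIO(encoded, newline="")))
assert decoded == row
```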

RFC 4180 attempted to retroactively bring some order to the chaos, but it was released in late 2005, too late to turn a quick & dirty hack into a truly standard file format. As one Hacker News commenter put it, CSV is less a file format and more a hypothetical construct like "peace on Earth".

The Goal

Imagine if programmers in the 1960s & 1970s had actually paid attention to the ASCII standard instead of making the quick hack called CSV. That's the point of this project. I don't expect this file format to sweep all before it and settle the issues with CSV files once and for all. I'm doing this to teach myself how to make a real file format, however simple.

In addition to the file format, I will also make some simple command line tools to manipulate files in that format. They will perform functions similar to the Unix find, sort, wc, and other tools.
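None of these tools exist yet. As a sketch of what a `wc`-like tool might compute, the hypothetical function below counts records and fields in a DNU-style string. It assumes ASCII 30 and 31 as the record and field delimiters, no trailing Record Separator, and that empty input means zero records; all of these are illustrative guesses rather than decisions of the specification.

```python
def count_dnu(text):
    """Return (record_count, field_count) for a DNU-style string.

    Hypothetical helper: assumes ASCII 30 separates records and
    ASCII 31 separates fields, with no trailing separator.
    """
    if not text:
        return (0, 0)
    records = text.split("\x1e")
    field_count = sum(len(record.split("\x1f")) for record in records)
    return (len(records), field_count)


# Two records: the first has two fields, the second has three.
counts = count_dnu("a\x1fb\x1ec\x1fd\x1fe")
assert counts == (2, 5)
```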

The Name

Of course, making such a format would put it in competition with CSV, TSV, and so forth. If it were to become common, then pressure would be brought to bear on the developers of standard tools to handle it. The maintainers of those tools would be most cross with me for forcing them to make even more special cases in their code. Therefore, I make this standard only for my own amusement. That means do not use this in production. To further mark this as an amateur project of the 'scratch an itch' variety, I have given it an extension of DNU, for Do Not Use.

In keeping with this intention, I deliberately make this file format gratuitously incompatible with other character-separated file formats, including both comma- and tab-separated value files. In particular, I will take pains to make this format unreadable by Microsoft Excel.

And if you don't like how this Readme looks, never fear. I will consult this page in order to make a Readme that will help you.

Foundational Issues

Now it's time to be a bit more precise and rigorous. All of the characters discussed in this document are part of both the ASCII and Unicode standards. That means that they are either text or are meant to facilitate the display or transmission of text. The purpose of text is primarily to be read by humans and only secondarily to be processed by computer.

Text is stored as strings. Therefore, DNU files may not contain non-string data. Since text is meant for people to read, DNU files are meant to store data from or transfer data between user-level programs. The Cold War-era ARPAnet is no more; that means that user data is guilty until proven innocent. Therefore, applications which use this format are at liberty (and may even be required) to throw away user data which does not conform to this standard.

Consider some of the comments on this Reddit thread. Several people reach Slashdot-levels of obtuseness. Keeping the fundamentals in mind allows one to see through the objections made against using native ASCII characters in tabular data files.

Objection: “Just because you consider these characters (i.e., ASCII 29-31) to be invalid in text data doesn't mean that you won't find them there. Therefore, they will still have to be escaped, reintroducing all of the complexities of comma-separated values.”

Answer: The whole point of having control codes in the ASCII standard is that they facilitate the display or transfer of text without themselves being text. Nobody says “My name is J.Q. Doe (the ASCII DEL is silent).” Nobody has a home address whose street name is interspersed with form feed characters. In such cases, the program dealing with such (possibly maliciously) corrupted data is required to strip out the offending control codes.
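Stripping could look something like the sketch below. Exactly which control codes count as offending has not been specified, so this example assumes only the five characters named earlier in this document (ASCII 16 and 28 through 31); the name `sanitize_field` is hypothetical.

```python
# Assumption: only the delimiter and escape characters named in this
# document are stripped. A real implementation might strip more.
RESERVED_CHARACTERS = "\x10\x1c\x1d\x1e\x1f"
_STRIP_TABLE = {ord(ch): None for ch in RESERVED_CHARACTERS}


def sanitize_field(value):
    """Return the field with the reserved control characters removed."""
    return value.translate(_STRIP_TABLE)


# Stray separators are dropped; ordinary text, commas, and newlines survive.
cleaned = sanitize_field("J.Q.\x1f Doe\x1e, Esq.\n")
assert cleaned == "J.Q. Doe, Esq.\n"
```

Running each field through such a function before encoding is what makes it safe to split on the delimiters when decoding.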

Objection: “So that would make this a data storage format that features silent data loss. That means that I could feed it, say, a hundred bytes but I would have no guarantee of getting those same hundred bytes back. I don't think so.”

Answer: That objection may mean something in the realm of system software, but if you're sticking a CSV file in the very implementation of a device driver, or the runtime system of a programming language, or the guts of a file system, then you deserve whatever you get.

Don't forget that the purpose of any character-separated data format is to transfer text between user-level applications. Inevitably, some user data will be corrupt, nonsensical, or even maliciously formed. In other words, user data may contain illegal characters. Robust applications must take such situations into account. In fact, even system software must do so (e.g., to guard against buffer overflow attacks).

Objection: “What if I want to make a delimited list of delimited lists?” (Note: Even the person who raised this objection admitted that this was an esoteric use case.)

Answer: Any character-delimited text file is ill-suited to such a use case, including CSV files. As the folks at DataHub point out, any CSV file “Works best for tabular data—not good for data with nesting or where structure is not especially tabular.” The Wikipedia entry for CSV files agrees: “CSV cannot naturally represent hierarchical or object-oriented database or other data.”

Objection: “Try this: Make a table of the ASCII character set using these delimiters. Use one column for the ASCII numeric code, another column for a description of that character, and the last column for the literal character, itself. You'll run into the escaping problem once you reach the columns that list these delimiters.” (Note: This is closely related to the previous objection.)

Answer: In other words, this table would describe the ASCII standard in the ASCII standard to a program which already uses the ASCII standard. Apparently, this is a pressing problem which happens all the time in The Year of Our Lord Two Thousand and Whatevs.

This objection, the previous objection, and so many other objections against proper ASCII-delimited text can be, not just answered, but obviated by noting the typical use cases of any character-delimited file format.

Objection: “There are no dedicated keys for these delimiters on any keyboard and there is no way to easily type these delimiters using a text editor.”

Answer: Granted, most editors cannot do so, but several can. More to the point, the problem of quoting delimiters is evidence that text editors were always the wrong tool for the job. As much as some of us nerds look down on them, spreadsheet applications were always the proper way to deal with tabular data, even if they were invented fifteen years too late. One of the aims of this project is to make appropriate tools for editing and viewing files in the DNU format.

Links

Possibly-Related Software:

Problems with CSV Files:

Other Links:

Contributors

winestock