Comments (6)

simonmichael commented on June 5, 2024

That's very clear! Thank you.

I also found:

We do want hledger to just work on real-world data where possible, so we should be permissive where it doesn't add complications. But I'm not sure we need to go as far as ignoring BOMs appearing anywhere in the input. It seems like an unusual niche case, and one that's easy to solve with preprocessing. Is it really valid for files to change encoding in the middle? I can't imagine many tools that would handle that properly.

from hledger.

simonmichael commented on June 5, 2024

Our BOM handling should be mentioned at https://hledger.org/dev/hledger.html#text-encoding .

simonmichael commented on June 5, 2024

Related, https://www.unicode.org/faq/utf_bom.html#BOM says:

Q: What should I do with U+FEFF in the middle of a file?

  • In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur.
  • For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string.
  • When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.
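As a rough illustration of that guidance (not hledger's actual behaviour), the following Python sketch drops U+FEFF only at the very start of decoded text and merely reports any later occurrences, leaving them in the content. The helper names strip_leading_bom and find_interior_feff are hypothetical, as is the sample data.

```python
BOM = "\ufeff"  # U+FEFF: a BOM at the start of a stream, ZWNBSP anywhere else


def strip_leading_bom(text: str) -> str:
    """Drop a single U+FEFF at the very start of the text; leave any others."""
    return text[1:] if text.startswith(BOM) else text


def find_interior_feff(text: str) -> list:
    """Return the offsets of U+FEFF characters that are not at position 0."""
    return [i for i, ch in enumerate(text) if ch == BOM and i > 0]


# Made-up CSV bytes: a UTF-8 BOM at the start and a stray one mid-file.
raw = b"\xef\xbb\xbfdate,amount\n2024-06-05,\xef\xbb\xbf10\n"
decoded = raw.decode("utf-8")
cleaned = strip_leading_bom(decoded)    # leading BOM removed
interior = find_interior_feff(cleaned)  # the stray U+FEFF is still in the content
```

Whether an interior U+FEFF should then be kept, rejected, or dropped is the policy question being discussed here; the sketch only locates it.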

PSLLSP commented on June 5, 2024

The BOM is a troublemaker... ;-) We use extended ASCII, and banks produced CSV files in CP-1250 in the past. Some of them upgraded their software and moved to UTF-8, and I believe that is why they produce UTF-8 files with a BOM: to signal clearly that the CSV file is in UTF-8, not CP-1250.

It is possible to create a file that starts with a UTF-8 BOM and has a UTF-16LE BOM in the middle: just concatenate a UTF-8 file with a UTF-16LE file. But the result is invalid, because the BOM is a single code point (U+FEFF) that is encoded differently in each UTF. I thought it might be possible to start with UTF-8 and use a BOM in the middle of the file to switch the encoding to UTF-16LE, but that does not work, because the UTF-16LE BOM (bytes FF FE) is an invalid sequence in UTF-8... Well, it could be made to work, but the software would have to inspect each decoding error and check whether the offending bytes are a BOM for another UTF variant... The good news is that UTF-16LE files are rare; UTF-8 is used in most cases.
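A small Python check of the point above: U+FEFF encodes to different bytes in UTF-8 and UTF-16LE, and a UTF-8 decoder fails at the first byte of a UTF-16LE BOM. The joined data is a made-up example.

```python
import codecs

bom = "\ufeff"
utf8_bom = bom.encode("utf-8")         # bytes EF BB BF
utf16le_bom = bom.encode("utf-16-le")  # bytes FF FE

# Concatenate a UTF-8 "file" (with BOM) and a UTF-16LE "file" (with BOM).
joined = (
    utf8_bom + "a,b\n".encode("utf-8")
    + utf16le_bom + "c,d\n".encode("utf-16-le")
)

try:
    joined.decode("utf-8")
    decode_failed = False
    error_offset = None
except UnicodeDecodeError as err:
    decode_failed = True
    error_offset = err.start  # offset of the first undecodable byte (the 0xFF)
```

A decoder that wanted to support switching encodings mid-file would have to catch this error and test whether the bytes at error_offset are a BOM for another encoding, which is the extra work described above.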

simonmichael commented on June 5, 2024

PSLLSP commented on June 5, 2024

What about ignoring ZWNBSP characters during CSV import? I do not see how these invisible troublemakers could be useful in an hledger journal... Another way to handle them would be to treat them as EOL, which would help when a CSV file does not end with an EOL... An exception would be a ZWNBSP used as the field separator. I do not know whether there is a way to define an invisible ZWNBSP as the field separator, maybe separator \uFEFF or separator ZWNBSP. I do not know of any such CSV file...

Or maybe this could be addressed by adding a new command that maps one character to another, like the UNIX command tr. I could use it to translate a CSV file from CP-1250 to UTF-8 by defining a translation table in the hledger import rules. Several such mapping commands could appear in a rules file, one per line. The problem is that hledger reads the input file as UTF-8, and extended ASCII characters are invalid when the file is read as a UTF-8 stream (hledger reports the error invalid byte sequence). To address this, a new command to disable UTF-8 parsing would be needed too, maybe encoding utf-8 (the default) and encoding binary (to parse the CSV in 8-bit mode).
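Until something like that exists, the same effect can be had by preprocessing outside hledger. The Python sketch below is only an illustration under that assumption: cp1250_to_utf8 and drop_zwnbsp are hypothetical helper names, and the sample row is made up. It re-encodes CP-1250 bytes as UTF-8 and, separately, removes every U+FEFF with a tr-style character mapping.

```python
def cp1250_to_utf8(data: bytes) -> bytes:
    """Re-encode CP-1250 bytes as UTF-8 so a UTF-8-only reader accepts them."""
    return data.decode("cp1250").encode("utf-8")


def drop_zwnbsp(text: str) -> str:
    """tr-like mapping: delete every U+FEFF (BOM / ZWNBSP) from the text."""
    return text.translate({0xFEFF: None})


# A made-up CSV row as a bank might have exported it in CP-1250.
sample = "Příjem;100\n".encode("cp1250")

# Read as UTF-8, these bytes are an invalid sequence.
try:
    sample.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

converted = cp1250_to_utf8(sample)            # now valid UTF-8
cleaned = drop_zwnbsp("\ufeffa,\ufeffb\n")    # all U+FEFF removed
```

Note that drop_zwnbsp removes U+FEFF everywhere, including inside field values, so it would not suit the hypothetical case where ZWNBSP is the field separator.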
