
Comments (10)

tfmorris commented on July 1, 2024

This happens on both macOS and Windows? I'm unable to reproduce it on macOS 14.4.1 (Sonoma) with either the 3.8.0 tag or the current development head.

Did it guess UTF-16LE or did you specify it? (It's correct, but I'm curious because my system is guessing Windows-1252)

If you try switching to a different encoding (e.g. UTF-8) and then back to UTF-16LE, do the symptoms change?

It looks like it might be using the system default encoding instead of the correctly guessed encoding, but I'll have to dig further to understand where things are going awry.
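To make that hypothesis concrete, here is a minimal Python sketch (not OpenRefine code; the sample text is made up) of what happens when UTF-16LE bytes are decoded with a single-byte encoding such as Windows-1252. Each ASCII letter is stored as two bytes in UTF-16LE, so the high byte surfaces as a NUL after every letter, and the BOM shows up as two stray characters.

```python
import codecs

# Hypothetical sample: four "N" characters stored as UTF-16LE with a BOM
text = "NNNN"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Decoding with the wrong (single-byte) encoding keeps every byte as a character
wrong = raw.decode("cp1252")

# The BOM-aware "utf-16" codec reads the BOM, picks little-endian, and drops the BOM
right = raw.decode("utf-16")

print(repr(wrong))  # 'ÿþN\x00N\x00N\x00N\x00'
print(repr(right))  # 'NNNN'
```

If the preview shows a NUL after each letter, the bytes are most likely still being read with a single-byte encoding, whatever the encoding field displays.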

from openrefine.

tfmorris commented on July 1, 2024

I just double checked with the binary 3.8.0 kit and bundled JVM (OpenJDK 64-Bit Server VM version 11.0.23+9) to be sure it wasn't some kind of binary kit build issue and got the same results as with my previous two tests using builds from source.

thadguidry commented on July 1, 2024

The "same" results of reproducing the issue? or not reproducing the issue?

It did not guess the UTF-16LE encoding (it does on 3.7.9). On version 3.8, it guesses Windows-1252 when I open the import preview on the ZIP file itself. I tried again from scratch just to double-check, and it looks exactly like this:

image

If I change the character encoding to UTF-16LE, it refreshes the preview, but the NUL characters still show between the letters, as in the image above.

Also interesting, though I don't think it applies to this issue: the checkbox for "Use character " to enclose cells containing column separators" is now unchecked by default in 3.8?

thadguidry commented on July 1, 2024

I've been helping Mike Gordon with this, emailing back and forth, and we finally discovered that this is the real issue behind https://forum.openrefine.org/t/trouble-with-sql-export/1485

tfmorris commented on July 1, 2024

> The "same" results of reproducing the issue? or not reproducing the issue?

None of my three tests reproduced the issue on macOS.

The bug report says:

Operating System: Mac OS and Windows 11

Did you actually test on both? What version of macOS? Which operating system are the screenshots above from?

thadguidry commented on July 1, 2024

I tested and reproduced on Windows 11.
Mike Gordon in the forum is on a MacBook Pro mid 2012 on macOS Catalina running OpenRefine 3.8.

tfmorris commented on July 1, 2024

Also not reproducible with the binary 3.8.0 kit on macOS Big Sur 11.7.10 (20G1427). I'll see if I can find a Windows machine.

tfmorris commented on July 1, 2024

Actually, it looks like my browsers are rendering things differently than yours and there are actually NULs in the data, so let me dig in more now that I can reproduce it.

thadguidry commented on July 1, 2024

Ah, in the data: reading the bytes directly out of the compressed file, I also see NULs (\u0000, or char(0)).
For example, there is a NUL after each char(78), which is "N", where the text should read NNNNNNNN.
Since NUL is a non-printable character, the script below prints the numeric value of each byte instead of the characters themselves.

import zipfile

# Raw string so the backslashes in the Windows path are not treated as escapes
zip_file_path = r'F:\Downloads\IDENTIFIANTS_AIFM.zip'
with zipfile.ZipFile(zip_file_path, 'r') as myzip:
    file_name = 'IDENTIFIANTS_AIFM.csv'
    # Open the CSV member inside the archive as a binary stream
    with myzip.open(file_name, 'r') as f:
        # Read the first 32 bytes of the file and print their decimal values
        f_list = list(f.read(32))
        print(f_list)

image

tfmorris commented on July 1, 2024

There are a number of different problems here:

  1. The fix (#6025) for the wonky Microsoft UTF-8 + BOM problem (#1241) broke encoding guessing for the other BOM-based encodings (including the UTF-16LE used in this file): all BOMs were being stripped, which made them unavailable to the character encoding guesser.
  2. The manual character encoding override is being ignored. This might have been broken for a long time, but wasn't critical as long as the encoding guessing was working correctly.
  3. There are no tests for encodings with BOM, other than the recently added UTF-8 + BOM test, so there was no test coverage for this failure.

I have fixes for all of these that I'll put up in a PR.
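The ordering problem in item 1 can be sketched in Python. This is an illustration of the idea only, not the actual OpenRefine fix (OpenRefine is written in Java); the names `sniff_bom` and `decode_with_bom` and the Windows-1252 fallback are assumptions made for the example. The point is that the BOM is inspected before it is stripped, so it can still inform the choice of encoding.

```python
import codecs

# Check longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return (encoding name, BOM length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(raw: bytes, fallback: str = "cp1252") -> str:
    """Pick the encoding from the BOM first, and only then strip the BOM."""
    encoding, bom_len = sniff_bom(raw)
    return raw[bom_len:].decode(encoding or fallback)


sample = codecs.BOM_UTF16_LE + "NNNN".encode("utf-16-le")
print(sniff_bom(sample))        # ('utf-16-le', 2)
print(decode_with_bom(sample))  # NNNN
```

Inputs like `sample`, one per BOM-based encoding, are also the kind of fixtures that item 3 says were missing from the test suite.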
