Comments (10)
This happens on both macOS and Windows? I'm unable to reproduce it on macOS 14.4.1 (Sonoma) with either the 3.8.0 tag or the current development head.
Did it guess UTF-16LE or did you specify it? (It's correct, but I'm curious because my system is guessing Windows-1252)
If you try switching to a different encoding (e.g. UTF-8) and then back to UTF-16LE, do the symptoms change?
It looks like it might be using the system default encoding instead of the correctly guessed encoding, but I'll have to dig further to understand where things are going awry.
from openrefine.
I just double checked with the binary 3.8.0 kit and bundled JVM (OpenJDK 64-Bit Server VM version 11.0.23+9) to be sure it wasn't some kind of binary kit build issue and got the same results as with my previous two tests using builds from source.
from openrefine.
The "same" results of reproducing the issue? or not reproducing the issue?
It did not guess the UTF-16LE encoding (it does on 3.7.9). On version 3.8 it guesses Windows-1252. When I choose import preview on the ZIP file itself, it looks exactly like this; I tried again from scratch just to double-check:
If I change the character encoding to UTF-16LE it refreshes the preview, but the NUL characters still show between the letters, as in the image above.
Also interesting, though I don't think it applies to this issue: the checkbox for "Use character " to enclose cells containing column separators" is now unchecked by default in 3.8?
from openrefine.
I've been helping Mike Gordon with this, emailing back and forth, and we finally discovered the real issue behind https://forum.openrefine.org/t/trouble-with-sql-export/1485
from openrefine.
The "same" results of reproducing the issue? or not reproducing the issue?
None of my three tests reproduced the issue on macOS.
The bug report says:
Operating System: Mac OS and Windows 11
Did you actually test on both? What version of macOS? Which operating system are the screenshots above from?
from openrefine.
I tested and reproduced on Windows 11.
Mike Gordon in the forum is on a MacBook Pro mid 2012 on macOS Catalina running OpenRefine 3.8.
from openrefine.
Also not reproducible with the binary 3.8.0 kit on macOS Big Sur 11.7.10 (20G1427). I'll see if I can find a Windows machine.
from openrefine.
Actually, it looks like my browsers are rendering things differently than yours and there are actually NULs in the data, so let me dig in more now that I can reproduce it.
from openrefine.
Ah, in the data, just reading the bytes directly out of the compressed file, I also see NULs, or rather \u0000 or char(0).
For example, I see the gaps within all the char(78), which is "N", where it should be NNNNNNNN.
Since NUL is a non-printable character, you have to use unicode_escape to print any non-printable characters.
import zipfile

# Windows path to the archive (raw string, so the backslashes are not treated as escapes)
zip_file_path = r'F:\Downloads\IDENTIFIANTS_AIFM.zip'

with zipfile.ZipFile(zip_file_path, 'r') as myzip:
    file_name = 'IDENTIFIANTS_AIFM.csv'
    with myzip.open(file_name, 'r') as f:
        # read the first 32 bytes of the file and print the Unicode decimal codepoints
        f_list = [ord(i) for i in f.read(32).decode('unicode_escape')]
        print(f_list)
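A minimal sketch of why the NULs show up, assuming the CSV is UTF-16LE with a BOM as discussed above. The sample text is made up and this is not OpenRefine's code. In UTF-16LE each ASCII letter is followed by a 0x00 byte, so decoding the same bytes as Windows-1252 turns every other byte into a NUL character:

```python
# Hypothetical sample standing in for the start of the CSV file.
sample = "NNNN"

# BOM (FF FE) followed by the text as UTF-16LE, mirroring the file in this issue.
raw = b"\xff\xfe" + sample.encode("utf-16-le")

# Decoding with the wrong single-byte encoding keeps every 0x00 byte as a NUL.
wrong = raw.decode("cp1252")

# Decoding with "utf-16" consumes the BOM and recovers the original text.
right = raw.decode("utf-16")

print([ord(c) for c in wrong])
print(repr(right))
```

The first list starts with 255 and 254 (the BOM bytes read as characters) and then alternates 78 and 0, which matches the gaps between the letters in the preview.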
from openrefine.
There are a number of different problems here:
- The fix (#6025) for the wonky Microsoft UTF-8 + BOM problem (#1241) broke encoding guessing for other BOM-based encodings (including the UTF-16LE used in this file): all BOMs were being stripped, which made them unavailable to the character encoding guesser.
- The manual character encoding override is being ignored. This might have been broken for a long time, but wasn't critical as long as the encoding guessing was working correctly.
- There are no tests for encodings with BOM, other than the recently added UTF-8 + BOM test, so there was no test coverage for this failure.
I have fixes for all of these that I'll put up in a PR.
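To illustrate the first point, here is a small Python sketch of BOM-based encoding detection. It is not OpenRefine's implementation (which is in Java), and `guess_encoding_from_bom` is a hypothetical helper. It only shows that once the BOM has been stripped, a BOM-based guesser has nothing left to work with:

```python
import codecs

# Longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding_from_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None


# Made-up sample: a UTF-16LE BOM followed by UTF-16LE text.
raw = codecs.BOM_UTF16_LE + "NNNN".encode("utf-16-le")

with_bom = guess_encoding_from_bom(raw)
without_bom = guess_encoding_from_bom(raw[2:])  # BOM stripped before guessing

print(with_bom)
print(without_bom)
```

With the BOM intact the helper reports UTF-16LE; with the BOM removed it returns `None`, at which point a real importer would have to fall back to some other guess or default.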
from openrefine.