Comments (7)

tfmorris commented on June 11, 2024

We actually already use two different separator guessers, but when they disagreed we simply gave up, which caused the format to default to TSV. That in turn disabled quote handling, which defaults to ON for CSV and OFF for TSV. For this file, the Univocity parser guessed " " (the space character), which isn't completely unreasonable, and our own guesser picked "," (comma), which is correct. I've created a patch that uses our internal guess in cases of disagreement, which preserves backward compatibility.
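
To make the fallback rule concrete, here is a minimal Python sketch of that kind of decision logic. It is not OpenRefine's actual code (which is Java): the function names, the `None` convention for "no guess", and the tab default are all assumptions made for illustration.

```python
from typing import Optional


def resolve_separator(internal_guess: Optional[str],
                      library_guess: Optional[str],
                      default: str = "\t") -> str:
    """Pick one separator from two independent guesses.

    Hypothetical helper: when both guessers produce an answer and they
    disagree, the internal guess wins instead of falling back to the default.
    """
    if internal_guess is not None:
        # Covers both agreement and disagreement: the internal guess is kept.
        return internal_guess
    if library_guess is not None:
        # Only the library produced a guess, so use it.
        return library_guess
    # Neither guesser produced anything: fall back to the default (tab here).
    return default


def quotes_enabled_by_default(separator: str) -> bool:
    """Mirror the defaults described above: ON for CSV, OFF for TSV."""
    return separator == ","


# The case from this thread: internal guess is a comma, library guess is a space.
sep = resolve_separator(",", " ")
print(repr(sep), quotes_enabled_by_default(sep))  # ',' True
```

With this rule a disagreement no longer silently turns the file into TSV, so quote handling keeps its CSV default.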

from openrefine.

wetneb commented on June 11, 2024

OK, let's do that.

wetneb commented on June 11, 2024

Yes, you're not the first one to be surprised by that! We are aware that the records mode is pretty unintuitive for newcomers, especially because it gets activated by default in a case like yours.
You can read more about it here:
https://openrefine.org/docs/manual/exploring#rows-vs-records

We have some ideas for making that more transparent: #5174 and #5175. Do you think they would have helped?

clombion commented on June 11, 2024

Wow, I did not consider records mode because I've never used it, despite years of using OpenRefine! Is it for dealing with JSON and other hierarchical data?

My bad then! I really like the suggestion in #5175 as it would have made clear to me that OR is warning me about something, even if I don't know yet what it is. It would then be easy to have a Q&A page with "why is my first column red?" for people looking for answers.

A tutorial would not have helped, because it would probably be forgotten by the time you end up activating records mode by chance. As I mentioned, I've been using OpenRefine for years, and I raised the issue with two colleagues who use it regularly; everybody thought it was a bug.

What about the double quotes though?

wetneb commented on June 11, 2024

> What about the double quotes though?

Ah yes, I had missed that part. That's something we should definitely investigate; I don't see any reason why this would be by design.

tfmorris commented on June 11, 2024

I think just changing the defaults to never start in records mode might be the best solution.

There's no "hallucination" here because a) we don't use LLMs and b) the quotes are in the original data. Selecting the option 'Use character " to enclose cells containing column separators' will make them go away.
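
As a rough illustration of what that option changes, the sketch below parses the same text with quote processing on and off. It uses Python's standard `csv` module as a stand-in, not the parser OpenRefine uses, and the sample data and the `parse` helper are made up for the example.

```python
import csv
import io

# Made-up sample: the second row has a cell that contains the separator.
SAMPLE = 'id,name\n1,"Doe, Jane"\n'


def parse(text: str, process_quotes: bool) -> list:
    """Parse comma-separated text with quote handling switched on or off."""
    if process_quotes:
        # Quotes enclose cells and are stripped from the parsed values.
        reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    else:
        # Quotes are treated as ordinary characters and stay in the data.
        reader = csv.reader(io.StringIO(text), delimiter=",",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


print(parse(SAMPLE, process_quotes=True))
# [['id', 'name'], ['1', 'Doe, Jane']]
print(parse(SAMPLE, process_quotes=False))
# [['id', 'name'], ['1', '"Doe', ' Jane"']]
```

With quote handling off, the quote characters survive into the cells and the quoted value is split at its embedded comma, which is why the quotes show up in the grid even though nothing invented them.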

I see three things that could be improved with this dataset:

  1. Guess CSV for the initial format, instead of TSV as it currently does
  2. Enable quote stripping by default when it's indicated
  3. Start in rows mode instead of records mode

The CSV package that we use has a format "sniffer" which may be able to help with 1 & 2.

tfmorris commented on June 11, 2024

@wetneb I recommend including the fix for this in 3.8, since the regression was introduced in that release by #6098 and this is, in my opinion, a low-risk fix that restores the prior behavior.
