Git Product home page Git Product logo

Comments (10)

quinnj avatar quinnj commented on June 29, 2024 1

There shouldn't be any cases where a Vector{WeakRefString{UInt8}} is returned from parsing a CSV file. Careful attention is paid to ensure that if the user requests a non-NullableVector, then a Vector{String} is returned. Please let me know if you have cases where you are seeing otherwise, but I've tried to be careful in testing that WeakRefStrings are never used without NullableVectors.

from csv.jl.

davidanthoff avatar davidanthoff commented on June 29, 2024

@quinnj Could you elaborate a bit more on this? I don't understand why the problem goes away if you use NullableVector. For example, when I simply do a CSV.read("data.csv"), I get back a DataFrame and one column might have elements of type Nullable{WeakRefString{UInt8}}. To what memory are those instances pointing, and who is keeping that alive?

from csv.jl.

quinnj avatar quinnj commented on June 29, 2024

The NullableArray itself holds a reference to the memory that it's WeakRefStrings are pointing to. See the member field of NullableArray here.

from csv.jl.

davidanthoff avatar davidanthoff commented on June 29, 2024

Ah, ok. What about this:

df = CSV.read("data.csv")
s = df[1,1]
df = nothing

Now s is of type Nullable{WeakRefString{UInt8}}. Wouldn't the original memory disappear at this point, once the DataFrame that was holding the NullableArray is gone?

from csv.jl.

quinnj avatar quinnj commented on June 29, 2024

Yes.

from csv.jl.

davidanthoff avatar davidanthoff commented on June 29, 2024

I almost think weakrefstrings=false should be the default in that case for the various read functions, and then folks can opt in if they understand that design and need the speed. I think it would also just make the whole thing more approachable for novice users, right now I see them quite confused when they see their first WeakRefString in a simple situation where they are just reading a file. Would that be an option?

from csv.jl.

quinnj avatar quinnj commented on June 29, 2024

So far, we've gotten many more perf-related issues filed than "what is a weakrefstring" type of issues. I'm inclined to leave it as is, due to the # of issues like "why is CSV.read is 5x slower than...." issues.

from csv.jl.

davidanthoff avatar davidanthoff commented on June 29, 2024

I have to admit that I'm quite worried about this... I think this is a recipe for incredibly subtle bugs. For example the code above, where the original buffer is freed, does actually not crash or show any weird behavior at first, probably because that memory is just not being reclaimed yet. So folks can happily write code like that, everything works, until all of a sudden it doesn't, because the memory manager actually used that area of memory for something else. This might also introduce serious security concerns, right? I'm not an expert for that, but I do know that pointers to wrong memory locations can be a huge problem.

I guess my gut feeling here is that CSV is a package that will be used by a lot of "normal" users, and I don't think they should be exposed to such a subtle and difficult memory management story.

from csv.jl.

dmbates avatar dmbates commented on June 29, 2024

I agree with @davidanthoff. To me the name WeakRefString would mean a pointer to a string with a weak reference to the memory, not a pointer that requires an external weak reference to the memory. It seems that returning such a type by default is a "fail faulty" design.

from csv.jl.

nalimilan avatar nalimilan commented on June 29, 2024

I tend to agree that safety should win over speed by default. The option to use WeakRefString could be advertised prominently in the docs. It will be easier to point people to that option when they complain about speed than to explain them why they get subtle issues by default.

BTW, since most CSV string data is categorical, would an option to convert the columns to CategoricalArray improve performance over creating a Vector{String}? That could be another, safer alternative.

from csv.jl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.