
jdf.jl's Introduction

What is JDF.jl?

JDF is a DataFrames serialization format with the following goals

  • Fast save and load times
  • Compressed storage on disk
  • Enable disk-based data manipulation (not yet achieved)
  • Supports machine learning workloads, e.g. mini-batch, sampling (not yet achieved)

JDF.jl is the Julia package for all things related to JDF.

JDF stores a DataFrame in a folder with each column stored as a separate file. There is also a metadata.jls file that stores metadata about the original DataFrame. Collectively, the column files, the metadata file, and the folder are called a JDF "file".
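For example, after a save (see the Quick Start below) the folder can be inspected with plain filesystem calls. A minimal sketch, assuming the column files are named after the columns (exact file names may differ between JDF versions):

using JDF, DataFrames

JDF.save("mydf.jdf", DataFrame(a = 1:3, b = ["x", "y", "z"]))

# one file per column plus the metadata file
readdir("mydf.jdf")  # e.g. ["a", "b", "metadata.jls"]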

JDF.jl is a pure-Julia solution, which makes it possible to do nifty things like compression and encapsulating the underlying structure of the arrays, things that are hard to do in R and Python. E.g. Python's numpy arrays are C objects, whereas all the vector types used in JDF are Julia data types.

Please note

The next major version of JDF will contain breaking changes. But don't worry, I am fully committed to providing an automatic upgrade path. This means that you can safely use JDF.jl to save your data without worrying that the impending breaking changes will break your existing JDF files.

Example: Quick Start

using RDatasets, JDF, DataFrames

a = dataset("datasets", "iris");

first(a, 2)
2×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼───────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa

Saving and Loading data

By default JDF loads and saves DataFrames using multiple threads starting from Julia 1.3. For Julia < 1.3, it saves and loads using one thread only.

@time jdffile = JDF.save("iris.jdf", a)
@time a2 = DataFrame(JDF.load("iris.jdf"))
  0.091923 seconds (157.33 k allocations: 9.226 MiB, 98.89% compilation time)
  0.165332 seconds (197.31 k allocations: 11.476 MiB, 98.49% compilation time)
150×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼─────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa
   6 │         5.4         3.9          1.7         0.4  setosa
   7 │         4.6         3.4          1.4         0.3  setosa
   8 │         5.0         3.4          1.5         0.2  setosa
  ⋮  │      ⋮           ⋮            ⋮           ⋮           ⋮
 144 │         6.8         3.2          5.9         2.3  virginica
 145 │         6.7         3.3          5.7         2.5  virginica
 146 │         6.7         3.0          5.2         2.3  virginica
 147 │         6.3         2.5          5.0         1.9  virginica
 148 │         6.5         3.0          5.2         2.0  virginica
 149 │         6.2         3.4          5.4         2.3  virginica
 150 │         5.9         3.0          5.1         1.8  virginica
                                                   135 rows omitted

Simple checks for correctness

all(names(a2) .== names(a)) # true
all(skipmissing([all(a2[!,name] .== Array(a[!,name])) for name in names(a2)])) #true
true

Loading only certain columns

You can load only a few columns from the dataset by specifying cols = [:column1, :column2]. For example

a2_selected = DataFrame(JDF.load("iris.jdf", cols = [:Species, :SepalLength, :PetalWidth]))
150×3 DataFrame
 Row │ SepalLength  PetalWidth  Species
     │ Float64      Float64     Cat…
─────┼────────────────────────────────────
   1 │         5.1         0.2  setosa
   2 │         4.9         0.2  setosa
   3 │         4.7         0.2  setosa
   4 │         4.6         0.2  setosa
   5 │         5.0         0.2  setosa
   6 │         5.4         0.4  setosa
   7 │         4.6         0.3  setosa
   8 │         5.0         0.2  setosa
  ⋮  │      ⋮           ⋮           ⋮
 144 │         6.8         2.3  virginica
 145 │         6.7         2.5  virginica
 146 │         6.7         2.3  virginica
 147 │         6.3         1.9  virginica
 148 │         6.5         2.0  virginica
 149 │         6.2         2.3  virginica
 150 │         5.9         1.8  virginica
                          135 rows omitted

Compared with loading the whole dataset and then subsetting the columns, this saves time because only the selected columns are read from disk.
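As a rough illustration (timings are machine-dependent, and compilation dominates the first call):

@time DataFrame(JDF.load("iris.jdf"));                     # reads all 5 columns
@time DataFrame(JDF.load("iris.jdf", cols = [:Species]));  # reads only 1 column from disk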

Some DataFrame-like convenience syntax/functions

To take advantage of some of these convenience functions, you need to create a variable of type JDFFile that points to the JDF file on disk. For example

jdf"path/to/JDF.jdf"
JDFFile{String}("path/to/JDF.jdf")

or

path_to_JDF = "path/to/JDF.jdf"
JDFFile(path_to_JDF)
JDFFile{String}("path/to/JDF.jdf")

Using df[col::Symbol] syntax

You can load an arbitrary column using the df[col] syntax. However, some of these operations are not yet optimized and hence may not be efficient.

afile = JDFFile("iris.jdf")

afile[:Species] # load Species column
150-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

JDFFile is Tables.jl column-accessible

using Tables
ajdf = JDFFile("iris.jdf")
Tables.columnaccess(ajdf)
true
Tables.columns(ajdf)
JDFFile{String}("iris.jdf")
Tables.schema(ajdf)
Tables.Schema:
 :SepalLength  Float64
 :SepalWidth   Float64
 :PetalLength  Float64
 :PetalWidth   Float64
 :Species      CategoricalVector (alias for CategoricalArrays.CategoricalArray{T, 1} where T)
getproperty(Tables.columns(ajdf), :Species)
150-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

Load each column from disk

You can load each column of a JDF file from disk by iterating over the file

jdffile = jdf"iris.jdf"
for col in eachcol(jdffile)
  # do something to col
  # where `col` is the content of one column of iris.jdf
end

To iterate through the column names and the columns together

jdffile = jdf"iris.jdf"
for (name, col) in zip(names(jdffile), eachcol(jdffile))
  # `name::Symbol` is the name of the column
  #  `col` is the content of one column of iris.jdf
end
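For example, the iteration above can be used to build a per-column summary without materializing the whole table at once. A sketch (Base.summarysize is just a stand-in for whatever per-column work you need):

jdffile = jdf"iris.jdf"
col_sizes = Dict{Symbol, Int}()
for (name, col) in zip(names(jdffile), eachcol(jdffile))
    # only one column is in memory at a time
    col_sizes[name] = Base.summarysize(col)
end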

Metadata Names & Size from disk

You can obtain the column names (names) and the number of columns (ncol) of a JDF file, for example:

using JDF, DataFrames
df = DataFrame(a = 1:3, b = 1:3)
JDF.save(df, "plsdel.jdf")

names(jdf"plsdel.jdf") # [:a, :b]

# clean up
rm("plsdel.jdf", force = true, recursive = true)

Additional functionality: In-memory DataFrame compression

DataFrame sizes are out of control. A 2GB CSV file can easily take up 10GB in RAM. One can use the function type_compress!(df) to compress any df::DataFrame. E.g.

type_compress!(df)
3×2 DataFrame
 Row │ a     b
     │ Int8  Int8
─────┼────────────
   1 │    1     1
   2 │    2     2
   3 │    3     3

The function checks each Int* column to see if it can be safely "downgraded" to an Int* type with a smaller bit size. It will also convert Float64 to Float32 if compress_float = true. E.g.

type_compress!(df, compress_float = true)
3×2 DataFrame
 Row │ a     b
     │ Int8  Int8
─────┼────────────
   1 │    1     1
   2 │    2     2
   3 │    3     3
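A minimal sketch of the idea behind the integer downgrade (not JDF's actual implementation): check the column's extrema against the range of each narrower integer type and take the first one that fits.

function smallest_int_type(v::AbstractVector{<:Integer})
    lo, hi = extrema(v)
    for T in (Int8, Int16, Int32, Int64)
        # pick the first type wide enough to hold every value
        if typemin(T) <= lo && hi <= typemax(T)
            return T
        end
    end
    return eltype(v)  # already as small as it gets
end

smallest_int_type(1:3)  # Int8, so a 1:3 column can be stored as Vector{Int8}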

String compression is planned and will likely employ categorical encoding combined with RLE encoding.
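To illustrate the RLE part, run-length encoding stores each run of repeated values once, together with the run's length. A sketch, not JDF's internal representation:

function rle_encode(v::AbstractVector)
    values, lengths = eltype(v)[], Int[]
    for x in v
        if !isempty(values) && isequal(values[end], x)
            lengths[end] += 1                     # extend the current run
        else
            push!(values, x); push!(lengths, 1)   # start a new run
        end
    end
    values, lengths
end

rle_encode(["a", "a", "a", "b", "b"])  # (["a", "b"], [3, 2])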

Benchmarks

Here are some benchmarks using the Fannie Mae Mortgage Data. Please note that a reading of zero means that the method has failed to read or write.

JDF is a decent performer on both read and write and can achieve comparable performance to R's {fst}, once compiled. The JDF format also results in much smaller file size vs Feather.jl in this particular example (probably due to Feather.jl's inefficient storage of Union{String, Missing}).

Please note that the benchmarks were obtained on Julia 1.3+. On earlier versions of Julia, where multi-threading isn't available, JDF is roughly 2x slower than shown in the benchmarks.

Supported data types

I believe that restricting the types that JDF supports is vital for simplicity and maintainability.

There is support for

  • WeakRefStrings.StringVector
  • Vector{T}, Vector{Union{Missing, T}}, Vector{Union{Nothing, T}}
  • CategoricalArrays.CategoricalVector{T} and PooledArrays.PooledVector

where T can be String, Bool, Symbol, Char, SubString{String}, TimeZones.ZonedDateTime (experimental), and isbits types, i.e. UInt*, Int*, Float*, and Date* types, etc.

RLEVectors support will be considered in the future when missing support arrives for RLEVectors.jl.

Resources

@bkamins's excellent DataFrames.jl tutorial contains a section on using JDF.jl.

How does JDF work?

When saving a JDF, each vector is Blosc compressed (using the default settings) if possible; this includes all T and Union{Missing, T} types where T is isbits. String vectors are first converted to a Run Length Encoding (RLE) representation, and the lengths component of the RLE is Blosc compressed.
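For isbits data the round trip looks roughly like this. A sketch that calls Blosc.jl directly; JDF's on-disk format stores additional metadata around the compressed bytes:

using Blosc

v = rand(Float64, 10^6)
compressed = compress(v)              # Vector{UInt8}, Blosc default settings
v2 = decompress(Float64, compressed)  # recover the original values
v == v2                               # true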

Development Plans

I fully intend to develop JDF.jl into a language-neutral format by version v0.4. However, I have other OSS commitments, including R's {disk.frame}, and hence new features may be slow to arrive. But I am fully committed to making JDF files created using JDF.jl v0.2 or higher loadable in all future JDF.jl versions.

Notes

  • Parallel read and write support is only available from Julia 1.3.
  • The design of JDF was inspired by fst in terms of using compression and allowing random access to columns.

jdf.jl's People

Contributors

bkamins, github-actions[bot], juliatagbot, kristofferc, tisztamo, xiaodaigh, zjpi


jdf.jl's Issues

Combining files in the folder into one file?

Hi,

Thanks for writing this package. I have just tried using it; it's pretty efficient at both writing and reading. Just one question, maybe not so important: right now it stores all the columns in a folder, with each file corresponding to one column. Would you consider combining all the files into one file, so that, say, xxx.jdf is just a single binary file? This is basically for storage reasons. Forgive me if this sounds stupid. Thanks.

syntax error in savejdf

first of all: JDF is a great package. well done. unfortunately I get a syntax error for the example mentioned on the GitHub page:
using VegaDatasets, JDF, DataFrames

a = dataset("iris") |> DataFrame
@time metadatas = savejdf("iris.jdf", a)
@time a2 = loadjdf("iris.jdf")

here savejdf raises a syntax error:

ERROR: MethodError: no method matching getindex(::DataFrame, ::typeof(!), ::Int64)
Closest candidates are:
getindex(::DataFrame, ::Integer, ::Union{Signed, Unsigned}) at /home/user/.julia/packages/DataFrames/0Em9Q/src/dataframe/dataframe.jl:311
getindex(::DataFrame, ::AbstractArray{T,1} where T, ::Union{Signed, Symbol, Unsigned}) at /home/user/.julia/packages/DataFrames/0Em9Q/src/dataframe/dataframe.jl:337
getindex(::DataFrame, ::Colon, ::Union{Signed, Symbol, Unsigned}) at /home/user/.julia/packages/DataFrames/0Em9Q/src/dataframe/dataframe.jl:358
...
Stacktrace:
[1] ssavejdf(::String, ::DataFrame) at /home/user/.julia/packages/JDF/Saot5/src/JDF.jl:105
[2] savejdf(::String, ::DataFrame) at /home/user/.julia/packages/JDF/Saot5/src/JDF.jl:57
[3] top-level scope at util.jl:156

on julia 1.2 as well as on julia 1.4

hope this info helps for further optimizing JDF

best regards from Germany

error when trying to save df that has array of substrings as column

Hi,

Nice work on this package. Just wanted to report that when I tried saving a df that had Array{SubString{String},1} for one of its columns, I got the following error:

ERROR: TaskFailedException:
ArgumentError: buffer eltype must be isbitstype
Stacktrace:
 [1] compress!(::Array{UInt8,1}, ::Ptr{SubString{String}}, ::Int64; level::Int64, shuffle::Bool, itemsize::Int64) at /users/yh31/.julia/packages/Blosc/Six7M/src/Blosc.jl:60
 [2] compress!(::Array{UInt8,1}, ::Ptr{SubString{String}}, ::Int64) at /users/yh31/.julia/packages/Blosc/Six7M/src/Blosc.jl:58
 [3] compress(::Ptr{SubString{String}}, ::Int64; kws::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /users/yh31/.julia/packages/Blosc/Six7M/src/Blosc.jl:86
 [4] compress(::Ptr{SubString{String}}, ::Int64) at /users/yh31/.julia/packages/Blosc/Six7M/src/Blosc.jl:85
 [5] compress(::Array{SubString{String},1}; kws::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /users/yh31/.julia/packages/Blosc/Six7M/src/Blosc.jl:92
 [6] compress(::Array{SubString{String},1}) at /users/yh31/.julia/packages/Blosc/Six7M/src/Blosc.jl:91
 [7] compress_then_write(::Array{SubString{String},1}, ::BufferedStreams.BufferedOutputStream{IOStream}) at /users/yh31/.julia/packages/JDF/Na1Xj/src/compress_then_write.jl:8
 [8] macro expansion at /users/yh31/.julia/packages/JDF/Na1Xj/src/savejdf.jl:69 [inlined]
 [9] (::JDF.var"#49#52"{String,DataFrame,String,Int64})() at ./threadingconstructs.jl:169
Stacktrace:
 [1] wait at ./task.jl:267 [inlined]
 [2] fetch(::Task) at ./task.jl:282
 [3] _broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
 [4] _broadcast_getindex at ./broadcast.jl:621 [inlined]
 [5] getindex at ./broadcast.jl:575 [inlined]
 [6] copyto_nonleaf!(::Array{NamedTuple{(:string_compressed_bytes, :string_len_bytes, :rle_bytes, :rle_len, :type, :len),Tuple{Int64,Int64,Int64,Int64,DataType,Int64}},1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(fetch),Tuple{Base.Broadcast.Extruded{Array{Any,1},Tuple{Bool},Tuple{Int64}}}}, ::Base.OneTo{Int64}, ::Int64, ::Int64) at ./broadcast.jl:1026
 [7] copy at ./broadcast.jl:880 [inlined]
 [8] materialize at ./broadcast.jl:837 [inlined]
 [9] savejdf(::String, ::DataFrame; verbose::Bool) at /users/yh31/.julia/packages/JDF/Na1Xj/src/savejdf.jl:76
 [10] savejdf at /users/yh31/.julia/packages/JDF/Na1Xj/src/savejdf.jl:48 [inlined]
 [11] savejdf(::DataFrame, ::String) at /users/yh31/.julia/packages/JDF/Na1Xj/src/savejdf.jl:45
 [12] top-level scope at REPL[290]:1

This error went away when I converted the array to an array of strings before saving the df.
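For reference, the workaround is a one-liner (with a hypothetical column name :col):

df.col = String.(df.col)  # materialize the SubStrings into plain Strings before saving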

Missing values in categorical arrays turn into #undef

julia> using DataFrames, CategoricalArrays

julia> using JDF

julia> df1 = DataFrame(sex=["Male", missing, "Female"])
3×1 DataFrame  
 Row │ sex     
     │ String? 
─────┼─────────
   1 │ Male    
   2 │ missing 
   3 │ Female  

julia> df2 = DataFrame(sex=categorical(["Male", missing, "Female"]))
3×1 DataFrame  
 Row │ sex     
     │ Cat…?   
─────┼─────────
   1 │ Male    
   2 │ missing 
   3 │ Female  

julia> JDF.save("df1.jdf", df1)
JDFFile{String}("df1.jdf")

julia> JDF.save("df2.jdf", df2)
JDFFile{String}("df2.jdf")

julia> JDF.load("df1.jdf")
JDF.Table((sex = Union{Missing, String}["Male", missing, "Female"],))

julia> JDF.load("df2.jdf")
JDF.Table((sex = CategoricalValue{String, UInt32}["Male", #undef, "Female"],))


Type Array does not have a definite size.

According to the readme, JDF supports vectors of floats, but if I try to have a Vector{Float64} as an element type, it fails.

MWE

using DataFrames, JDF

df = (;a = range(1,10), b=[rand(2) for _ in 1:10])
JDF.save(tempdir() * "/test.jdf", df)

A way for users to convert any type to a type that is supported

Sometimes a user would want to save a type that is not supported.

We could require the user to provide a function f to transform that type into a supported type, and a function f' to transform the data back.

This way the user can support arbitrary types (see the sketch after the checklist below).

  • implementation
  • documentation
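A minimal sketch of the proposed round trip, with hypothetical user-supplied functions f and f_inv (here: an unsupported Complex column carried through as String):

# f maps the unsupported eltype to a supported one; f_inv undoes it
f(x::ComplexF64) = string(x)
f_inv(s::AbstractString) = parse(ComplexF64, s)

col = [1.0 + 2.0im, 3.0 - 4.0im]
col == f_inv.(f.(col))  # true; save f.(col) with JDF and apply f_inv after loading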

Date Vector with missing entries

I ran into the issue below with missing Date entries.
Is this expected on JDF v0.4.0?


julia> col = [Date(1999,1,1),missing,today()]
3-element Array{Union{Missing, Date},1}:
 1999-01-01
 missing
 2021-04-14

julia> df = DataFrame(d = col)
3×1 DataFrame
 Row │ d
     │ Date?      
─────┼────────────
   1 │ 1999-01-01
   2 │ missing    
   3 │ 2021-04-14

julia> JDF.save(raw"c:\temp\fifi", df)
ERROR: TaskFailedException:
Abstract type AbstractTime does not have a definite size.
Stacktrace:
 [1] sizeof at .\essentials.jl:449 [inlined]
 [2] compress!(::Array{UInt8,1}, ::Ptr{Dates.AbstractTime}, ::Int64) at C:\Users\bernhard.konig.ROOT_MILLIMAN\.julia\packages\Blosc\Six7M\src\Blosc.jl:58
 [3] compress(::Ptr{Dates.AbstractTime}, ::Int64; kws::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\bernhard.konig.ROOT_MILLIMAN\.julia\packages\Blosc\Six7M\src\Blosc.jl:86
 [4] compress(::Ptr{Dates.AbstractTime}, ::Int64) at C:\Users\bernhard.konig.ROOT_MILLIMAN\.julia\packages\Blosc\Six7M\src\Blosc.jl:85
 [5] compress(::Array{Dates.AbstractTime,1}; kws::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\bernhard.konig.ROOT_MILLIMAN\.julia\packages\Blosc\Six7M\src\Blosc.jl:92
 [6] compress(::Array{Dates.AbstractTime,1}) at C:\Users\bernhard.konig.ROOT_MILLIMAN\.julia\packages\Blosc\Six7M\src\Blosc.jl:91
 [7] compress_then_write(::Array{Dates.AbstractTime,1}, ::BufferedStreams.BufferedOutputStream{IOStream}) at C:\Users\bernhard.konig.ROOT_MILLIMAN\.julia\packages\JDF\jOO9d\src\compress_then_write.jl:8
 [8] compress_then_write(::Array{Union{Missing, Date},1}, ::BufferedStreams.BufferedOutputStream{IOStream}) at C:\Users\bernhard.konig.ROOT_MILLIMAN\.julia\packages\JDF\jOO9d\src\type-writer-loader\Missing.jl:13
 [9] macro expansion at C:\Users\bernhard.konig.ROOT_MILLIMAN\.julia\packages\JDF\jOO9d\src\savejdf.jl:73 [inlined]
 [10] (::JDF.var"#61#64"{String,DataFrame,Symbol})() at .\threadingconstructs.jl:169
Stacktrace:
 [1] wait at .\task.jl:267 [inlined]
 [2] fetch(::Task) at .\task.jl:282
 [3] _broadcast_getindex_evalf at .\broadcast.jl:648 [inlined]
 [4] _broadcast_getindex at .\broadcast.jl:621 [inlined]
 [5] getindex at .\broadcast.jl:575 [inlined]
 [6] copy at .\broadcast.jl:876 [inlined]
 [7] materialize at .\broadcast.jl:837 [inlined]
 [8] save(::String, ::DataFrame; verbose::Bool) at C:\Users\bernhard.konig.ROOT_MILLIMAN\.julia\packages\JDF\jOO9d\src\savejdf.jl:80
 [9] save(::String, ::DataFrame) at C:\Users\bernhard.konig.ROOT_MILLIMAN\.julia\packages\JDF\jOO9d\src\savejdf.jl:50
 [10] top-level scope at REPL[49]:1

julia>

type_compress

  • Maybe you could rename type_compress to compress (and import it from CategoricalArrays, which also exports a compress function).
  • Maybe you could compress strings to PooledVector, not CategoricalVector (which requires a very different syntax).

Implement a pipeline system for compression

Thinking about JDF, it can actually be thought of as a pipeline

raw data -> blosc compressed/rle compressed -> written

Conceivably, we can have more elaborate pipelines like

raw data -> rle compress -> blosc compress -> written

and before we start, we do not know which compression is better, e.g.

raw string data -> string array -> blosc compress -> written
or
raw string data -> rle compress -> blosc compress -> written

so we can have this comparison pipeline concept: we can compare both compressions before writing to disk, and all we have to do is remember the operations in a pipeline and then retry them.

DateTime vector with missing entries

Issue #62 indicated a problem with Date entries. Unfortunately, DateTime is still broken as of v0.4.4.

The problem was first reported at: chipkent/DataFrameTools.jl#14

The problem can be reproduced by running docker run -it julia:1.6:

using Pkg
Pkg.add("DataFrames")
using DataFrames
Pkg.add("Dates")
using Dates

df = DataFrame()
df[!, :test] = [DateTime(2000,1,1,1,1,1), missing]
df

Pkg.add("JDF")
using JDF
JDF.savejdf("test.jdf", df)

Error:

   nested task error: Abstract type AbstractTime does not have a definite size.
    Stacktrace:
      [1] sizeof
        @ ./essentials.jl:455 [inlined]
      [2] compress!(dest::Vector{UInt8}, src::Ptr{Dates.AbstractTime}, src_size::Int64)
        @ Blosc ~/.julia/packages/Blosc/vjmKP/src/Blosc.jl:58
      [3] compress(src::Ptr{Dates.AbstractTime}, src_size::Int64; kws::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
        @ Blosc ~/.julia/packages/Blosc/vjmKP/src/Blosc.jl:86
      [4] compress(src::Ptr{Dates.AbstractTime}, src_size::Int64)
        @ Blosc ~/.julia/packages/Blosc/vjmKP/src/Blosc.jl:85
      [5] compress(src::Vector{Dates.AbstractTime}; kws::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
        @ Blosc ~/.julia/packages/Blosc/vjmKP/src/Blosc.jl:92
      [6] compress(src::Vector{Dates.AbstractTime})
        @ Blosc ~/.julia/packages/Blosc/vjmKP/src/Blosc.jl:91
      [7] compress_then_write(b::Vector{Dates.AbstractTime}, io::BufferedStreams.BufferedOutputStream{IOStream})
        @ JDF ~/.julia/packages/JDF/SMfQY/src/compress_then_write.jl:8
      [8] compress_then_write(b::Vector{Union{Missing, DateTime}}, io::BufferedStreams.BufferedOutputStream{IOStream})
        @ JDF ~/.julia/packages/JDF/SMfQY/src/type-writer-loader/Missing.jl:13
      [9] macro expansion
        @ ~/.julia/packages/JDF/SMfQY/src/savejdf.jl:75 [inlined]
     [10] (::JDF.var"#57#60"{String, DataFrame, Symbol})()
        @ JDF ./threadingconstructs.jl:169

Read a JDFFile in chunks

Is it possible (or will someday be possible) to read a JDFFile in chunks?

I.e., something similar to:

df = DataFrame(JDF.load("iris.jdf", start=2, length=2))

to read just the 2nd and 3rd row.

Thanks!

Sync with DataFrames.jl 0.22

Can you please check if JDF.jl works with DataFrames.jl master (soon 0.22)?
For example, it seems that you rely on DataFrames.jl re-exporting CategoricalVector, which it no longer does.
Thank you!

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If JDF folder is overwritten, data of 'old' columns is persistent

See example.

using JDF 
using DataFrames 

df = DataFrame(a=rand(3),b=rand(3),c=rand(3))
fi = raw"C:\temp\fi.jdf"

@time JDF.save(fi, df)

df = DataFrame(a=rand(3),b=rand(3))

@time JDF.save(fi, df)
#the jdf folder on disk will 'keep' the data for column c
#I wonder if this is intentional, or if 'c' should be cleaned up (i.e. deleted)

This seems like an oversight that should be improved.
What is your view?
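Until this is resolved, a defensive workaround is to delete the target folder before saving. A sketch using only Base:

isdir(fi) && rm(fi; force = true, recursive = true)  # drop any stale column files first
JDF.save(fi, df)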

Column names mis-match when loading certain columns

Discussed in #78

Originally posted by Ossifragus January 26, 2022
The relative order of the loaded columns seems to always be the same as the relative order in the whole data (is the order specified in the metadata?). For example, with data consisting of two columns "col1" and "col2", the code JDF.load(data; cols = ["col1", "col2"]) loads the data correctly, but with JDF.load(data; cols = ["col2", "col1"]) the data columns and their names do not match.

Please see the following example:

using RDatasets, JDF, DataFrames

JDF.save("iris.jdf", dataset("datasets", "iris"))
a = DataFrame(JDF.load("iris.jdf"))

c1 = ["Species", "PetalWidth"]
a1 = DataFrame(JDF.load("iris.jdf"; cols = c1)) # column mis-match

julia> first(a1, 5)
5×2 DataFrame
 Row │ Species  PetalWidth
     │ Float64  Cat…
─────┼─────────────────────
   1 │     0.2  setosa
   2 │     0.2  setosa
   3 │     0.2  setosa
   4 │     0.2  setosa
   5 │     0.2  setosa


c2 = ["PetalWidth", "Species"]
a2 = DataFrame(JDF.load("iris.jdf"; cols = c2)) # column match

julia> first(a2, 5)
5×2 DataFrame
 Row │ PetalWidth  Species
     │ Float64     Cat…
─────┼─────────────────────
   1 │        0.2  setosa
   2 │        0.2  setosa
   3 │        0.2  setosa
   4 │        0.2  setosa
   5 │        0.2  setosa

Error when writing DataFrame with column containing Vector{Int64}

Hi there,

would love to use JDF.jl with my data :-)
but:

JDF.save("df_with_array.jdf", DataFrame(a=[[1,2]]))

results in

nested task error: Type Array does not have a definite size.
Stacktrace:
 [1] sizeof(x::Type)
   @ Base ./essentials.jl:473
 [2] compress!(dest::Vector{UInt8}, src::Ptr{Vector{Int64}}, src_size::Int64)
   @ Blosc ~/.julia/packages/Blosc/jk4Np/src/Blosc.jl:74
 [3] compress(src::Ptr{Vector{Int64}}, src_size::Int64; kws::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Blosc ~/.julia/packages/Blosc/jk4Np/src/Blosc.jl:111
 [4] compress(src::Ptr{Vector{Int64}}, src_size::Int64)
   @ Blosc ~/.julia/packages/Blosc/jk4Np/src/Blosc.jl:109
 [5] compress(src::Vector{Vector{Int64}}; kws::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Blosc ~/.julia/packages/Blosc/jk4Np/src/Blosc.jl:117
 [6] compress(src::Vector{Vector{Int64}})
   @ Blosc ~/.julia/packages/Blosc/jk4Np/src/Blosc.jl:115
 [7] compress_then_write(b::Vector{Vector{Int64}}, io::BufferedStreams.BufferedOutputStream{IOStream})
   @ JDF ~/.julia/packages/JDF/SL7sz/src/compress_then_write.jl:7
 [8] macro expansion
   @ ~/.julia/packages/JDF/SL7sz/src/savejdf.jl:71 [inlined]
 [9] (::JDF.var"#47#50"{String, DataFrame, Symbol})()
   @ JDF ./threadingconstructs.jl:258

using

JDF v0.5.1

and

julia 1.8.5

Is there a workaround?

Greetings
Para

Saving and then loading a JDF 'breaks' pooledarrays

See example below.

I understand that you compress the data on disk anyway.
But as the data is not pooled anymore, the memory footprint of the DataFrame is considerably larger after loadjdf.

Is this something that you can (easily) improve?
You mention that you store some metadata already, so it should be fairly simple to re-pool the data (maybe with a keyword argument?)

using JDF 
using CSV
using DataFrames 

#generate data 
n=20_000
df0=DataFrame(v=repeat(vcat("abc"),n));
allowmissing(df0)
df0.v = convert(Vector{Union{Missing,String}},df0.v)
df0.v[end] = missing

#write file 
csvfile = raw"C:\temp\afile.csv"
CSV.write(csvfile, df0)

#read file 
csvSep=','
df =  CSV.read(csvfile, DataFrame,threaded=true, delim=csvSep, pool=0.05,strict=true, lazystrings=false);

#save jdf 
jdffi = raw"C:\temp\df.jdf"
jdffile = JDF.savejdf(jdffi, df)

#load jdf 
dfloaded = JDF.loadjdf(jdffi)

df.v #<- this one is pooled as expected
dfloaded.v #<- not pooled anymore 

Base.summarysize(df)/1024/1024
Base.summarysize(dfloaded)/1024/1024

JDF for Tables.jl table

An idea - maybe JDF could drop dependency on DataFrames.jl altogether and just work with any Tables.jl table?

Odd issue: ambiguous MethodError

I got the following error while trying to save some files; I'll upload a MWE later.

ERROR: TaskFailedException:
MethodError: compress_then_write(::Array{Any,1}, ::BufferedStreams.BufferedOutputStream{IOStream}) is ambiguous. Candidates:
  compress_then_write(b::Array{Union{Missing, T},1}, io) where T in JDF at /users/yh31/.julia/packages/JDF/jDvZp/src/type-writer-loader/Missing.jl:7
  compress_then_write(b::Array{Union{Nothing, T},1}, io) where T in JDF at /users/yh31/.julia/packages/JDF/jDvZp/src/type-writer-loader/Nothing.jl:10
To resolve the ambiguity, try making one of the methods more specific, or adding a new method more specific than any of the existing applicable methods.
Stacktrace:
 [1] macro expansion at /users/yh31/.julia/packages/JDF/jDvZp/src/savejdf.jl:69 [inlined]
 [2] (::JDF.var"#49#52"{String,DataFrame,String,Int64})() at ./threadingconstructs.jl:169
Stacktrace:
 [1] wait at ./task.jl:267 [inlined]
 [2] fetch(::Task) at ./task.jl:282
 [3] _broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
 [4] _broadcast_getindex at ./broadcast.jl:621 [inlined]
 [5] getindex at ./broadcast.jl:575 [inlined]
 [6] copyto_nonleaf!(::Array{NamedTuple,1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(fetch),Tuple{Base.Broadcast.Extruded{Array{Any,1},Tuple{Bool},Tuple{Int64}}}}, ::Base.OneTo{Int64}, ::Int64, ::Int64) at ./broadcast.jl:1026
 [7] restart_copyto_nonleaf!(::Array{NamedTuple,1}, ::Array{NamedTuple{(:string_compressed_bytes, :string_len_bytes, :rle_bytes, :rle_len, :type, :len),Tuple{Int64,Int64,Int64,Int64,DataType,Int64}},1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(fetch),Tuple{Base.Broadcast.Extruded{Array{Any,1},Tuple{Bool},Tuple{Int64}}}}, ::NamedTuple{(:len, :type),Tuple{Int64,DataType}}, ::Int64, ::Base.OneTo{Int64}, ::Int64, ::Int64) at ./broadcast.jl:1017
 [8] copyto_nonleaf!(::Array{NamedTuple{(:string_compressed_bytes, :string_len_bytes, :rle_bytes, :rle_len, :type, :len),Tuple{Int64,Int64,Int64,Int64,DataType,Int64}},1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(fetch),Tuple{Base.Broadcast.Extruded{Array{Any,1},Tuple{Bool},Tuple{Int64}}}}, ::Base.OneTo{Int64}, ::Int64, ::Int64) at ./broadcast.jl:1033
 [9] copy at ./broadcast.jl:880 [inlined]
 [10] materialize at ./broadcast.jl:837 [inlined]
 [11] savejdf(::String, ::DataFrame; verbose::Bool) at /users/yh31/.julia/packages/JDF/jDvZp/src/savejdf.jl:76
 [12] savejdf(::String, ::DataFrame) at /users/yh31/.julia/packages/JDF/jDvZp/src/savejdf.jl:48
 [13] (::var"#51#52")(::File) at ./REPL[18]:2
 [14] (::FileTrees.var"#saver#89"{var"#51#52"})(::File, ::DataFrame) at /users/yh31/.julia/packages/FileTrees/sx5xd/src/values.jl:121
 [15] (::Dagger.var"#47#48"{FileTrees.var"#saver#89"{var"#51#52"},Tuple{File,DataFrame}})() at ./threadingconstructs.jl:169
wait at ./task.jl:267 [inlined]
fetch at ./task.jl:282 [inlined]
execute!(::Dagger.ThreadProc, ::Function, ::File, ::Vararg{Any,N} where N) at /users/yh31/.julia/packages/Dagger/U857J/src/processor.jl:222
do_task(::Dagger.Context, ::Dagger.OSProc, ::Int64, ::Function, ::Tuple{File,Dagger.Chunk{Any,MemPool.DRef,Dagger.ThreadProc}}, ::Bool, ::Bool, ::Bool, ::Dagger.Sch.ThunkOptions) at /users/yh31/.julia/packages/Dagger/U857J/src/scheduler.jl:340
#137 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:354 [inlined]
run_work_thunk(::Distributed.var"#137#138"{typeof(Dagger.Sch.do_task),Tuple{Dagger.Context,Dagger.OSProc,Int64,FileTrees.var"#saver#89"{var"#51#52"},Tuple{File,Dagger.Chunk{Any,MemPool.DRef,Dagger.ThreadProc}},Bool,Bool,Bool,Dagger.Sch.ThunkOptions},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}}, ::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
remotecall_fetch(::Function, ::Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:379
remotecall_fetch(::Function, ::Distributed.LocalProcess, ::Dagger.Context, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:379
remotecall_fetch(::Function, ::Int64, ::Dagger.Context, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421 [inlined]
macro expansion at /users/yh31/.julia/packages/Dagger/U857J/src/scheduler.jl:353 [inlined]
(::Dagger.Sch.var"#26#27"{Dagger.Context,Dagger.OSProc,Int64,FileTrees.var"#saver#89"{var"#51#52"},Tuple{File,Dagger.Chunk{Any,MemPool.DRef,Dagger.ThreadProc}},Channel{Any},Bool,Bool,Bool,Dagger.Sch.ThunkOptions})() at ./task.jl:356
Stacktrace:
 [1] compute_dag(::Dagger.Context, ::Dagger.Thunk; options::Nothing) at /users/yh31/.julia/packages/Dagger/U857J/src/scheduler.jl:137
 [2] compute(::Dagger.Context, ::Dagger.Thunk; options::Nothing) at /users/yh31/.julia/packages/Dagger/U857J/src/compute.jl:32
 [3] #compute#70 at /users/yh31/.julia/packages/Dagger/U857J/src/compute.jl:5 [inlined]
 [4] compute at /users/yh31/.julia/packages/Dagger/U857J/src/compute.jl:5 [inlined]
 [5] exec(::Dagger.Thunk) at /users/yh31/.julia/packages/FileTrees/sx5xd/src/parallelism.jl:68
 [6] save(::var"#51#52", ::FileTree; lazy::Nothing, exec::Bool) at /users/yh31/.julia/packages/FileTrees/sx5xd/src/values.jl:128
 [7] save(::Function, ::FileTree) at /users/yh31/.julia/packages/FileTrees/sx5xd/src/values.jl:111
 [8] top-level scope at REPL[18]:1

type_compress!() no longer works

Hi, first of all, congratulations on the birth of your daughter, I hope your family is all safe and healthy. And thanks for this great JDF package. :-)

Using JDF 0.3.0 and 0.2.2, I can't get type_compress! to work. I'm running some of my old code that I'm 90% sure used to work. Every time I call type_compress! on a DataFrame I get this error message:

UndefVarError: compress not defined
 Stacktrace:
   [1] type_compress(::CategoricalArrays.CategoricalArray{Int64,1,UInt16,Int64,CategoricalArrays.CategoricalValue{Int64,UInt16},Union{}}) at E:\develop\julia\depot\packages\JDF\JhtJ5\src\type_compress.jl:182
   [2] type_compress!(::DataFrame; compress_float::Bool, verbose::Bool) at E:\develop\julia\depot\packages\JDF\JhtJ5\src\type_compress.jl:24
   [3] type_compress! at E:\develop\julia\depot\packages\JDF\JhtJ5\src\type_compress.jl:19 [inlined]
   [4] load_join_csvs(::Array{FilePathsBase.WindowsPath,1}; add_row::Bool, join_on::Array{String,1}, ignore_cols::Array{String,1}, enable_compress::Bool) at e:\develop\julia\depot\dev\Mperf\src\load_pal.jl:52

I'm not a Julia expert, maybe I'm doing something dumb, but I tried to run your package's tests (using dev JDF, ] test). There were no errors, but then I noticed that most (all?) of the tests that use type_compress! aren't run by default (I guess they take too long and they require access to your C:\data folder).

Thank you!

Version Stable Metadata Format

Hi Xiaodaigh,

Currently, the metadata of a JDF file is stored using Serialization.jl. Reading the metadata is fast and performant. However, it is not guaranteed to be readable across Julia versions. From my understanding, the metadata file is crucial for reading the JDF file. Therefore, it would make sense to use a version-stable serialization for the metadata file.

The main benefit of Serialization.jl is that it already ships with Julia. There is no need to use an additional package for reading the metadata. Using JSON/BSON for that purpose would probably decrease speed and increase compile times and dependencies. As an alternative, I made some experiments:

julia> metadata = (field1 = v"1.1", field2 = "HelloWorld", field3 = 42)
(field1 = v"1.1.0", field2 = "HelloWorld", field3 = 42)

julia> const metadata_t = typeof(metadata)
NamedTuple{(:field1, :field2, :field3),Tuple{VersionNumber,String,Int64}}

julia> open("C:\\Users\\me\\Desktop\\metadata.jl", "w") do io
           write(io, repr(metadata))
       end;

julia> metadata2 = include("C:\\Users\\me\\Desktop\\metadata.jl")
(field1 = v"1.1.0", field2 = "HelloWorld", field3 = 42)

julia> isa(metadata2, metadata_t)
true

What do you think about that? It uses NamedTuple syntax and would most probably be version stable.
I am curious about how this would perform compared to JSON/BSON.

TaskFailedException

hello,

it seems that JDF does not cope so well with very large data. When running the following snippet:

a = DataFrame(:a=>1:1000000000, :b=>rand(1:5,1000000000))
metadatas = savejdf("iris.jdf", a)

it raises the exception:

ERROR: TaskFailedException:
ArgumentError: data > 2147483631 bytes is not supported by Blosc
Stacktrace:
[1] compress!(::Array{UInt8,1}, ::Ptr{Int64}, ::Int64; level::Int64, shuffle::Bool, itemsize::Int64) at /home/user/.julia/packages/Blosc/lzFr0/src/Blosc.jl:86
[2] compress! at /home/user/.julia/packages/Blosc/lzFr0/src/Blosc.jl:75 [inlined]
[3] compress(::Ptr{Int64}, ::Int64; kws::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/user/.julia/packages/Blosc/lzFr0/src/Blosc.jl:103
[4] compress at /home/user/.julia/packages/Blosc/lzFr0/src/Blosc.jl:102 [inlined]
[5] macro expansion at ./gcutils.jl:105 [inlined]
[6] #compress#7 at /home/user/.julia/packages/Blosc/lzFr0/src/Blosc.jl:109 [inlined]
[7] compress at /home/user/.julia/packages/Blosc/lzFr0/src/Blosc.jl:108 [inlined]
[8] compress_then_write(::Array{Int64,1}, ::BufferedStreams.BufferedOutputStream{IOStream}) at /home/user/.julia/packages/JDF/BMdXX/src/compress_then_write.jl:13
[9] macro expansion at /home/user/.julia/packages/JDF/BMdXX/src/savejdf.jl:69 [inlined]
[10] (::JDF.var"#47#50"{String,DataFrame,Symbol,Int64})() at ./threadingconstructs.jl:113

maybe it is a good idea to hand over maximally sized chunks of data to the Blosc compressor so that JDF is robust against very large data.

have a great day ahead!
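A minimal sketch of the chunking idea from this issue, compressing a large vector in slices that each stay under Blosc's ~2 GiB limit (hypothetical helper, not part of JDF):

using Blosc

function compress_chunked(v::Vector{T}; max_bytes = 2^30) where T
    chunk_len = max(1, max_bytes ÷ sizeof(T))  # elements per chunk
    [compress(v[i:min(i + chunk_len - 1, end)]) for i in 1:chunk_len:length(v)]
end

chunks = compress_chunked(rand(Int64, 10^8))  # Vector of independently compressed blocks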
