Comments (11)

MarcusKlik commented on July 26, 2024

Hi @xiaodaigh, thanks! Yes, I definitely need to spend time on documenting the format and, perhaps more importantly, the fstlib API, so that new connectors can be built!

It's not complicated, but the API will grow as computational features are added (which will run in parallel with the file IO). Support for methods that can only be run on the master thread (such as R methods) will also have to be reflected in the API. Perhaps Rust, with its better concurrency, could provide a faster connector for fstlib; that would be very interesting!

Just a question: why would you prefer a native implementation in Rust or Julia over calling the fstlib library from a Rust or Julia wrapper? Julia in particular will probably take a performance hit if used for the low-level operations that fstlib requires. Or are you referring to a native binding instead of a binding through the R-Julia interface package?

Is there an example package in Julia which could be used to model a native binding, a package using a simple C++ library for example? The binding could be made to a C or C++ API to fstlib using packages like Clang.jl or Cpp.jl, Cxx or CxxWrap perhaps? Starting a toy package early would certainly help a lot to create a uniform API that's suitable for different languages!

from fstlib.

xiaodaigh commented on July 26, 2024

Anyway, the first thing I would do is to use Cxx.jl to call into fstlib. But I might experiment with a pure Julia implementation at some point given the fst format is stable.

Julia has some low-level control as well, but its multi-threading is not good at the moment. I think fstlib is good for scripting languages like Julia, R, and Python, so it would be nice to actually write it in a scripting language as well. Given that the format is stable, a pure Julia implementation would allow Julia programmers to contribute, not just those with C++ knowledge. But overall it's better to have all resources contribute to one library, in this case the C++ one, fstlib; I wish I knew enough C++ to contribute. Learning...

Once the multi-threading story is better in Julia and there is better interop between R-Julia and Python-Julia, then you may be tempted to switch to Julia as well as the syntax is nice and simple, and it can be as fast as C/C++ in many cases.

MarcusKlik commented on July 26, 2024

Hi @xiaodaigh, thanks. I think using Cxx.jl would be a nice solution, as you would only have a single code base. It would be hard to keep versions and new features in sync across two distinct libraries in different languages (and it would cost a lot of time, currently the most valuable resource for fstlib development :-))

I would be very interested in trying to set up a fst package in Julia, please let me know if and how I can help with that!

davidanthoff commented on July 26, 2024

In general, is there a chance that fstlib might expose a pure C API rather than a C++ one? That would make integration in other languages a lot easier.

E.g. for Julia, Cxx.jl is great, but at this point installation is so tricky that it is really not an option for a widely used package. On the other hand, if fstlib just exposed a C API, one could integrate it super easily into Julia.

MarcusKlik commented on July 26, 2024

Hi @davidanthoff, thanks for your question. Basically, the fst package in R also has a C-only interface when looked at from the R side (that's all R understands), so that's similar to your request. In R, the Rcpp package is used for convenience, and one of the things it does is generate a C interface that can be used by R. Those C wrappers then call the underlying C++ code from fstlib. Would it be possible to have a setup like that for Julia?
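
To make that kind of setup concrete, here is a minimal sketch of C wrappers in front of C++ code. `IntSummer` and the `demo_summer_*` functions are made-up names for illustration only and are not part of fstlib or Rcpp. The sketch assumes the usual pattern: an opaque handle, integer status codes, and no C++ exception escaping across the C boundary.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical C++ class standing in for library internals (not fstlib code).
class IntSummer {
public:
    void add(std::int32_t value) { values_.push_back(value); }
    std::int64_t total() const {
        std::int64_t sum = 0;
        for (std::int32_t v : values_) sum += v;
        return sum;
    }
private:
    std::vector<std::int32_t> values_;
};

extern "C" {

// Opaque handle: callers only ever see a pointer to an incomplete type.
typedef struct demo_summer demo_summer;

demo_summer* demo_summer_create(void) {
    try {
        return reinterpret_cast<demo_summer*>(new IntSummer());
    } catch (...) {
        return nullptr;  // allocation failed
    }
}

// Returns 0 on success and -1 on failure.
int demo_summer_add(demo_summer* handle, std::int32_t value) {
    if (handle == nullptr) return -1;
    try {
        reinterpret_cast<IntSummer*>(handle)->add(value);
        return 0;
    } catch (...) {
        return -1;  // e.g. std::bad_alloc thrown by the vector
    }
}

// Writes the running total to *out; returns 0 on success and -1 on failure.
int demo_summer_total(const demo_summer* handle, std::int64_t* out) {
    if (handle == nullptr || out == nullptr) return -1;
    *out = reinterpret_cast<const IntSummer*>(handle)->total();
    return 0;
}

void demo_summer_destroy(demo_summer* handle) {
    delete reinterpret_cast<IntSummer*>(handle);
}

}  // extern "C"

// The call sequence a host language would perform through its C FFI.
std::int64_t demo_usage() {
    demo_summer* summer = demo_summer_create();
    assert(summer != nullptr);
    demo_summer_add(summer, 1);
    demo_summer_add(summer, 2);
    demo_summer_add(summer, 3);
    std::int64_t total = 0;
    demo_summer_total(summer, &total);
    demo_summer_destroy(summer);
    return total;
}
```

Because only plain C types cross the boundary, functions like these could in principle be called from Julia with `ccall` once they are compiled into a shared library.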

For a full implementation of fstlib in Julia, you would need:

  • DLL with a C API for write_fst, read_fst, threads_fst, compress_fst, decompress_fst and metadata_fst (or similar names) to be called from Julia
  • Implementation of fstlib's column types (defined in ifstcolumn)
  • Implementation of fstlib's IFstTable, which will be a wrapper for a table in Julia.
  • Implementation of fstlib's IColumnFactory and ITypeFactory for generating new column vectors or data types natively in Julia.

These are all abstract classes which would need an implementation based on the Julia API. So you should be able to have access to the Julia API from the DLL.

The reason for that is that fstlib is a zero-copy library. So any data structure (such as columns) needed to hold data should be created directly in Julia and not copied from an existing memory buffer. That reduces memory requirements and increases the speed.
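
A rough sketch of that zero-copy idea is shown below. The interface names (`IIntColumn`, `IIntColumnFactory`) and their signatures are invented for this example and are not the actual declarations from fstlib; `HostIntColumn` merely stands in for a vector owned by the host language. The point is that the library asks the host to allocate the column and then writes the values straight into host-owned memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical interfaces, loosely modelled on the idea of a column factory.
// They are illustrative only and NOT fstlib's real API.
class IIntColumn {
public:
    virtual ~IIntColumn() = default;
    virtual std::int32_t* data() = 0;      // pointer into host-owned memory
    virtual std::size_t size() const = 0;
};

class IIntColumnFactory {
public:
    virtual ~IIntColumnFactory() = default;
    virtual std::unique_ptr<IIntColumn> create(std::size_t length) = 0;
};

// Host side: std::vector stands in for a vector allocated by Julia or R.
class HostIntColumn : public IIntColumn {
public:
    explicit HostIntColumn(std::size_t length) : storage_(length) {}
    std::int32_t* data() override { return storage_.data(); }
    std::size_t size() const override { return storage_.size(); }
private:
    std::vector<std::int32_t> storage_;
};

class HostFactory : public IIntColumnFactory {
public:
    std::unique_ptr<IIntColumn> create(std::size_t length) override {
        return std::make_unique<HostIntColumn>(length);
    }
};

// Library side: the column is created by the host and filled in place,
// so no library-owned result buffer has to be copied afterwards.
// `source` stands in for values being decoded from a file.
std::unique_ptr<IIntColumn> read_int_column(IIntColumnFactory& factory,
                                            const std::int32_t* source,
                                            std::size_t length) {
    std::unique_ptr<IIntColumn> column = factory.create(length);
    assert(column != nullptr && column->size() == length);
    std::int32_t* dest = column->data();
    for (std::size_t i = 0; i < length; ++i) {
        dest[i] = source[i];  // placeholder for decompress/decode work
    }
    return column;
}
```

In a real binding, the factory would call into the host runtime to allocate a native vector, which is why the DLL needs access to the Julia API.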

Perhaps when you have a basic setup, I could assist you in implementing the abstract classes for Julia. It would be very interesting to see an implementation of fstlib in other languages than C++ and R!

xiaodaigh commented on July 26, 2024

For a little bit of context: we have shown via benchmarking that fst has the fastest read/write speed in the Julia/R/Python-verse. Parquet and R's serialization are the only other major ones we haven't tested.

So I would be extremely keen to be able to use fst in Julia.

davidanthoff commented on July 26, 2024

To be fair, you didn’t measure Feather perf with the R or Python packages, those might be faster than the Julia implementation (or not, who knows).

MarcusKlik commented on July 26, 2024

Hi @xiaodaigh and @davidanthoff, that's great to hear. It would be nice to compare the various serialization options with a wide range of parameters. For example, for fstlib, the speed depends on a lot of factors:

  • The column type being serialized. Logicals and integers are very fast but character columns are much slower in general.
  • The compressibility of the (column) data. Highly compressible data can be compressed faster and leaves fewer bytes to serialize (increasing speed).
  • Compiler flags. Using -O2 or -O3 flags (for GCC and Clang) matters a lot for the speeds measured.
  • Number of threads, obviously, but also the type of CPU used (I have found Xeon CPUs to process data faster than i5s, for example, which makes higher compression settings relatively faster).
  • Disk speed and IOPS. Some serializers are optimized for disks with low IOPS and fstlib is optimized for disks with high IOPS.
  • Memory bandwidth. For some operations the memory bandwidth is the limiting factor. The choice of the system used for benchmarking sets the memory bandwidth. Different serializers have different dependencies on that limit.

Testing many systems is very labor intensive, but it would be very interesting to set up a benchmark that uses generated samples with various characteristics:

  • various types
  • various compressibility levels (e.g. factors can have 2 levels or 2000 and a small range of integers is easier to compress than larger ranges)
  • various sizes (fstlib shines more for large datasets, csv writers mostly scale linearly with size).

That way we could really learn about the strong and weak points of different serializers and how they relate to each other. Are your benchmarks published somewhere (or do you have plans for that)?
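
As a starting point, generating such samples could look roughly like the sketch below. The helper names `make_int_column` and `time_seconds` are made up for this example. It only covers integer columns, and it assumes that the number of distinct values is a reasonable proxy for compressibility. The serializer under test is not part of the sketch.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Generate `n` integers drawn uniformly from `distinct` possible values.
// A small `distinct` (e.g. 2) should compress better than a large one (e.g. 2000).
std::vector<std::int32_t> make_int_column(std::size_t n,
                                          std::int32_t distinct,
                                          unsigned seed) {
    assert(distinct > 0);
    std::mt19937 rng(seed);  // fixed seed keeps samples reproducible
    std::uniform_int_distribution<std::int32_t> dist(0, distinct - 1);
    std::vector<std::int32_t> column(n);
    for (std::int32_t& value : column) value = dist(rng);
    return column;
}

// Run `work` once and return the elapsed wall-clock time in seconds.
template <typename Work>
double time_seconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A benchmark driver would then loop over sizes and `distinct` values, pass each column to the serializer under test inside `time_seconds`, and record the timings per system.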

thanks!

xiaodaigh commented on July 26, 2024

Obviously that is going to be a lot of work. I think ultimately we can set up a website where people can submit benchmarks from their own systems by running some Julia and/or R code. For now I am slowly adding benchmarking code to the DataBench.jl repo.

MarcusKlik commented on July 26, 2024

Hi @davidanthoff, on your question about a Julia implementation. Perhaps it would be possible to create a package using small steps:

  • Milestone 1: a Julia package using a compiled library with a C API that returns a hello world string.

  • Milestone 2: a Julia package using a compiled library that returns meta-data about a table provided to the C API.

After milestone 2, we know that we can call the Julia API from the compiled library, which means we can implement the abstract classes from fstlib.

  • Milestone 3: the table wrapper and a single column type are implemented (for example integer columns). A one-column table can be serialized from Julia to disk using fstlib (and read from R, for example). Initial speed measurements can be taken for comparison.

  • Milestone 4: implement the other types one by one. Think about how to map the special types like Date or nanotime to the Julia world.

Would that be doable? If any special code is necessary to accommodate the Julia API, I can provide that from the fstlib library (for example, some API calls might only be allowed from the master thread like in R).
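
For illustration, the library side of milestones 1 and 2 might look something like the sketch below. All names (`demo_hello`, `demo_table_api`, `demo_table_metadata`, `HostTable`) are hypothetical and not part of fstlib. Milestone 2 is modelled here by letting the host pass function pointers that describe its table, which is only one possible way to reach the host API from the compiled library.

```cpp
#include <cassert>
#include <cstddef>

extern "C" {

// Milestone 1: a C function returning a static string.
const char* demo_hello(void) { return "hello world"; }

// Milestone 2: the host describes its table through callbacks.
typedef struct demo_table_api {
    void* table;                          // opaque pointer to the host table
    std::size_t (*ncols)(void* table);    // number of columns
    std::size_t (*nrows)(void* table);    // number of rows
} demo_table_api;

typedef struct demo_metadata {
    std::size_t ncols;
    std::size_t nrows;
} demo_metadata;

// Returns 0 on success and -1 if an argument or callback is missing.
int demo_table_metadata(const demo_table_api* api, demo_metadata* out) {
    if (api == nullptr || out == nullptr) return -1;
    if (api->ncols == nullptr || api->nrows == nullptr) return -1;
    out->ncols = api->ncols(api->table);
    out->nrows = api->nrows(api->table);
    return 0;
}

}  // extern "C"

// Stand-in for the host side; in practice these callbacks would be
// provided by the host language rather than written in C++.
struct HostTable {
    std::size_t cols;
    std::size_t rows;
};

extern "C" std::size_t host_ncols(void* table) {
    assert(table != nullptr);
    return static_cast<HostTable*>(table)->cols;
}

extern "C" std::size_t host_nrows(void* table) {
    assert(table != nullptr);
    return static_cast<HostTable*>(table)->rows;
}
```

Once the library can ask the host for its dimensions this way, the same mechanism could be extended to column names and types, which is what the abstract table wrapper would need.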

xiaodaigh commented on July 26, 2024

Milestone 1 can be easily achieved; see https://github.com/JuliaInterop/CxxWrap.jl

I don't know anything about C++, and that's the issue. I want to help here, and I traced the code to _fst_fstretrieve for reading a fst file, but I can't figure out how to go any further.

What would help is someone familiar with C++ doing this, but if it's me, I need some specific directions on how to compile fstlib into a .so file and which C++ functions I can call in this manner.

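#include <string>
#include "jlcxx/jlcxx.hpp"

// Function exposed to Julia under the name "greet".
std::string greet() { return "hello, world"; }

JLCXX_MODULE define_julia_module(jlcxx::Module& mod)
{
  mod.method("greet", &greet);
}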
