Comments (11)
Hi @xiaodaigh, thanks! Yes, I definitely need to spend time on documenting the format and, perhaps more importantly, the `fstlib` API, so new connectors can be built!
It's not complicated, but the API will grow as computational features are added (which will run in parallel with the file IO). Methods that can only be run on the master thread (such as `R` methods) will also have to be reflected in the API. Perhaps `Rust`, with its better concurrency, could provide a faster connector for `fstlib`; that would be very interesting!
Just a question: why would you prefer a native implementation in `Rust` or `Julia` over calling the `fstlib` library from a `Rust` or `Julia` wrapper? Especially `Julia` will probably take a performance hit if used for the low-level operations that `fstlib` requires. Or are you referring to a native binding instead of a binding through the `R`-`Julia` interface package?
Is there an example package in Julia which could be used to model a native binding, a package using a simple `C++` library for example? The binding could be made to a `C` or `C++` API to `fstlib` using packages like Clang.jl or Cpp.jl, Cxx or CxxWrap perhaps? Starting a toy package early would certainly help a lot to create a uniform API that's suitable for different languages!
from fstlib.
Anyway, the first thing I would do is to use Cxx.jl to call into fstlib. But I might experiment with a pure Julia implementation at some point given the fst format is stable.
Julia has some low-level control as well, but not good multi-threading at the moment. I think fstlib is good for scripting languages like Julia, R, and Python, so it would be nice to actually write it in a scripting language as well. Given the format is stable, a pure Julia implementation would allow Julia programmers to contribute, not just those with C++ knowledge. But overall it's better to have all resources contribute to one library, in this case the C++ one, fstlib; I wish I knew enough C++ to contribute. Learning...
Once the multi-threading story is better in Julia and there is better interop between R-Julia and Python-Julia, then you may be tempted to switch to Julia as well as the syntax is nice and simple, and it can be as fast as C/C++ in many cases.
Hi @xiaodaigh, thanks. I think using `Cxx.jl` would be a nice solution where you only have a single code-base. It would be hard to maintain versioning and new features across two distinct libraries in different languages (and it would cost a lot of time, currently the most valuable resource for `fstlib` development :-))
I would be very interested in trying to set up a `fst` package in `Julia`, please let me know if and how I can help with that!
In general, is there a chance that fstlib might expose a pure C API, not a C++ one? That would make integration in other languages a lot easier.
E.g. for Julia, Cxx.jl is great, but at this point installation is so tricky that it is really not an option for a widely used package. On the other hand, if fstlib just exposed a C API, one could integrate it super easily into Julia.
Hi @davidanthoff, thanks for your question. Basically, the `fst` package in `R` also has a `C`-only interface when looked at from the `R` side (that's all `R` understands), so that's similar to your request. In `R`, the `Rcpp` package is used for convenience, and one of the things it does is generate a `C` interface that can be used by `R`.
From those `C` wrappers, the underlying `C++` code from `fstlib` is used. Would it be possible to have a setup like that for `Julia`?
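The layering described above (plain `C` wrappers in front of `C++` code) could look roughly like the sketch below. It assumes nothing about the real `fstlib` classes: `Table` is a stand-in for a `C++` library class and every `demo_table_*` name is hypothetical. The point is only that C types and an opaque handle cross the boundary, which is what a `ccall`-style binding needs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a C++ library class; NOT part of fstlib.
class Table {
public:
    explicit Table(std::size_t n_cols) : names_(n_cols) {
        for (std::size_t i = 0; i < n_cols; ++i) names_[i] = "col" + std::to_string(i);
    }
    std::size_t n_cols() const { return names_.size(); }
    const std::string& name(std::size_t i) const { return names_.at(i); }
private:
    std::vector<std::string> names_;
};

// C-callable wrappers (hypothetical names). Only C types cross the boundary,
// and C++ exceptions are caught so they never reach the caller's runtime.
extern "C" {

typedef struct demo_table demo_table;  // opaque handle

demo_table* demo_table_create(std::size_t n_cols) {
    try { return reinterpret_cast<demo_table*>(new Table(n_cols)); }
    catch (...) { return nullptr; }
}

std::size_t demo_table_ncols(const demo_table* t) {
    assert(t != nullptr);
    return reinterpret_cast<const Table*>(t)->n_cols();
}

// Returns a pointer owned by the table, or nullptr if the index is out of range.
const char* demo_table_colname(const demo_table* t, std::size_t i) {
    assert(t != nullptr);
    const Table* table = reinterpret_cast<const Table*>(t);
    return i < table->n_cols() ? table->name(i).c_str() : nullptr;
}

void demo_table_destroy(demo_table* t) { delete reinterpret_cast<Table*>(t); }

}  // extern "C"
```

A caller in any language with a C foreign-function interface would create the handle, query it, and destroy it, without ever seeing the `C++` class behind it.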
For a full implementation of `fstlib` in `Julia`, you would need:
- DLL with a `C` API for `write_fst`, `read_fst`, `threads_fst`, `compress_fst`, `decompress_fst` and `metadata_fst` (or similar names) to be called from `Julia`.
- Implementation of `fstlib`'s column types (defined in ifstcolumn).
- Implementation of `fstlib`'s `IFstTable`, which will be a wrapper for a table in `Julia`.
- Implementation of `fstlib`'s `IColumnFactory` and `ITypeFactory` for generating new column vectors or data types natively in `Julia`.
These are all abstract classes which would need an implementation based on the `Julia` API. So you should be able to have access to the `Julia` API from the DLL.
The reason for that is that `fstlib` is a zero-copy library. So any data structure (such as columns) needed to hold data should be created directly in `Julia` and not copied from an existing memory buffer. That reduces memory requirements and increases the speed.
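As an illustration of the zero-copy idea, here is a minimal sketch in which the host supplies the column storage through a factory and the library decodes directly into it. The interfaces `IIntColumn` and `IIntColumnFactory`, the classes `HostColumn` and `HostFactory`, and the function `read_ints` are all hypothetical and much simpler than the real `fstlib` abstract classes; a `std::vector` stands in for an array allocated by the host language.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical interfaces, loosely modelled on the idea of a column factory.
struct IIntColumn {
    virtual ~IIntColumn() = default;
    virtual int* data() = 0;               // host-owned buffer the library writes into
    virtual std::size_t size() const = 0;
};

struct IIntColumnFactory {
    virtual ~IIntColumnFactory() = default;
    virtual std::unique_ptr<IIntColumn> create(std::size_t n) = 0;
};

// Host-side implementation. In a real binding the storage would be an array
// allocated by the host language; a std::vector stands in for it here.
class HostColumn : public IIntColumn {
public:
    explicit HostColumn(std::size_t n) : storage_(n) {}
    int* data() override { return storage_.data(); }
    std::size_t size() const override { return storage_.size(); }
private:
    std::vector<int> storage_;
};

class HostFactory : public IIntColumnFactory {
public:
    std::unique_ptr<IIntColumn> create(std::size_t n) override {
        return std::make_unique<HostColumn>(n);
    }
};

// "Library" side: decodes n ints from a serialized byte buffer (standing in
// for data read from disk) straight into memory obtained from the host
// factory, so no library-owned column is built and copied afterwards.
std::unique_ptr<IIntColumn> read_ints(const unsigned char* bytes, std::size_t n,
                                      IIntColumnFactory& factory) {
    assert(bytes != nullptr || n == 0);
    std::unique_ptr<IIntColumn> col = factory.create(n);
    if (n > 0) std::memcpy(col->data(), bytes, n * sizeof(int));
    return col;
}
```

The design choice is that the library never owns the result: it only asks the factory for memory and fills it, so the host ends up with a native column without a second copy.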
Perhaps when you have a basic setup, I could assist you in implementing the abstract classes for `Julia`. It would be very interesting to see an implementation of `fstlib` in other languages than `C++` and `R`!
Basically, for a little bit of context, we have shown via benchmarking that fst has the fastest read/write speed in the Julia/R/Python-verse. Parquet and R's serialization are the only other major ones we haven't tested.
So I would be extremely keen to be able to use fst in Julia.
To be fair, you didn’t measure Feather perf with the R or Python packages, those might be faster than the Julia implementation (or not, who knows).
Hi @xiaodaigh and @davidanthoff, that's great to hear. It would be nice to compare the various serialization options with a wide range of parameters. For example, for `fstlib`, the speed depends on a lot of factors:
- The column type being serialized. Logicals and integers are very fast, but character columns are much slower in general.
- The compressibility of the (column) data. Highly compressible data can be compressed faster and leads to fewer bytes to serialize (increasing speed).
- Compiler flags. Using -O2 or -O3 flags (for GCC and Clang) matters a lot for the speeds measured.
- Number of threads, obviously, but also the type of CPU used (I have found Xeon CPUs to process data faster than i5s, for example, which makes higher compression settings relatively faster).
- Disk speed and IOPS. Some serializers are optimized for disks with low IOPS and `fstlib` is optimized for disks with high IOPS.
- Memory bandwidth. For some operations the memory bandwidth is the limiting factor. The choice of the system used for benchmarking sets the memory bandwidth. Different serializers have different dependencies on that limit.
Testing many systems is very labor intensive, but it would be very interesting to set up a benchmark that uses generated samples with various characteristics:
- various types
- various compressibility levels (e.g. factors can have 2 levels or 2000, and a small range of integers is easier to compress than larger ranges)
- various sizes (`fstlib` shines more for large datasets, `csv` writers mostly scale linearly with size).
That way we could really learn about the strong and weak points of different serializers and how they relate to each other. Are your benchmarks published somewhere (or do you have plans for that)?
Thanks!
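A generated sample with a controllable compressibility level could be produced along these lines. This is only a sketch under my own assumptions: the function names `make_int_column` and `count_distinct` are made up for illustration, and the number of distinct values is used as a crude proxy for compressibility, not as a measure any serializer actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Generates n integers drawn uniformly from [0, n_levels).
// Fewer levels means more repetition, hence a more compressible column.
std::vector<std::int32_t> make_int_column(std::size_t n, std::int32_t n_levels,
                                          std::uint32_t seed) {
    assert(n_levels > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::int32_t> dist(0, n_levels - 1);
    std::vector<std::int32_t> col(n);
    for (auto& v : col) v = dist(rng);
    return col;
}

// Counts distinct values, as a crude proxy for how compressible a column is.
std::size_t count_distinct(const std::vector<std::int32_t>& col) {
    std::vector<std::int32_t> sorted(col);
    std::sort(sorted.begin(), sorted.end());
    return static_cast<std::size_t>(
        std::unique(sorted.begin(), sorted.end()) - sorted.begin());
}
```

Sweeping `n` and `n_levels` over a grid (for example 2 versus 2000 levels) would give the size and compressibility axes mentioned above; the seed keeps the samples reproducible across systems.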
Obviously that is going to be a lot of work. I think ultimately we can set up a website where people can submit benchmarks from their system by running some Julia and/or R code. For now I am slowly adding benchmarking code to the DataBench.jl repo.
Hi @davidanthoff, on your question about a `Julia` implementation: perhaps it would be possible to create a package in small steps:
- Milestone 1: a `Julia` package using a compiled library with a `C` API that returns a hello world string.
- Milestone 2: a `Julia` package using a compiled library that returns meta-data about a table provided to the `C` API.
After milestone 2, we know that we can call the `Julia` API from the compiled library, which means we can implement the abstract classes from `fstlib`.
- Milestone 3: the table wrapper and a single column type are implemented (for example integer columns). A 1-column table can be serialized from `Julia` to disk using `fstlib` (and read from `R`, for example). Initial speed measurements can be taken for comparison.
- Milestone 4: implement the other types one by one. Think about how to map the special types like `Date` or `nanotime` to the `Julia` world.
Would that be doable? If any special code is necessary to accommodate the `Julia` API, I can provide that from the `fstlib` library (for example, some API calls might only be allowed from the master thread, like in `R`).
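The first two milestones could be prototyped with something like the sketch below. None of this is `fstlib` code: `demo_hello`, `demo_host_table` and `demo_table_cells` are hypothetical names, and `FakeTable` with its two callbacks only stands in for what the host language would provide (in `Julia`, presumably function pointers created on the host side). The second part shows the library querying a host table purely through callbacks, which is the "call back into the host API" step that milestone 2 is meant to prove.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

extern "C" {

// Milestone 1: copy a greeting into a caller-provided buffer.
// Returns the number of characters written (excluding the terminator).
std::size_t demo_hello(char* buf, std::size_t len) {
    static const char msg[] = "hello from the compiled library";
    if (buf == nullptr || len == 0) return 0;
    std::size_t n = sizeof(msg) - 1;
    if (n > len - 1) n = len - 1;  // truncate to fit the buffer
    std::memcpy(buf, msg, n);
    buf[n] = '\0';
    return n;
}

// Milestone 2: the host describes its table through callbacks, so the
// library can query it without knowing the host's data structures.
typedef struct {
    void* table;                      // opaque pointer to the host's table
    int (*n_cols)(void* table);
    long long (*n_rows)(void* table);
} demo_host_table;

// Returns rows * columns, computed purely through the host callbacks.
long long demo_table_cells(const demo_host_table* t) {
    assert(t != nullptr && t->n_cols != nullptr && t->n_rows != nullptr);
    return static_cast<long long>(t->n_cols(t->table)) * t->n_rows(t->table);
}

// Stand-in for the host side; a real binding would implement these
// callbacks in the host language instead.
typedef struct { int cols; long long rows; } FakeTable;

static int fake_n_cols(void* t) { return static_cast<FakeTable*>(t)->cols; }
static long long fake_n_rows(void* t) { return static_cast<FakeTable*>(t)->rows; }

}  // extern "C"
```

Filling a `demo_host_table` with a `FakeTable` and the two `fake_*` callbacks and passing it to `demo_table_cells` exercises the round trip from library to host and back.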
Milestone 1 can be easily achieved; see https://github.com/JuliaInterop/CxxWrap.jl
I don't know anything about C++ and that's the issue. I want to help here, and I traced the code to `_fst_fstretrieve` for reading a fst file, but I can't figure out how to go any further.
What would help is someone familiar with C++ to do this, but if it's me, I need some specific directions on how to compile fstlib into a `.so` file and which C++ functions I can call in this manner.
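#include <string>
#include "jlcxx/jlcxx.hpp"

// Function exposed to Julia; it was missing from the snippet as posted.
std::string greet() { return "hello, world"; }

JLCXX_MODULE define_julia_module(jlcxx::Module& mod)
{
  mod.method("greet", &greet);
}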