
fstlib's People

Contributors

marcusklik


fstlib's Issues

Fix coveralls

The Coveralls coverage report should include only the lib subdirectory. Also, add a Coveralls banner to the homepage.

Linux support

Is Linux supported out of the box?
If yes, what is the recommended way to compile on Linux?

Project readme file

The readme should explain the goals behind the fstlib library and how they differ from the arrow/parquet philosophy.

Can the small factor level limit be increased from <128?

I was looking into lib/factor/factor_v7.cpp and see code like if (*nrOfLevels < 128). The comment there says

// use 1 byte per int (Na encoding takes 1 bit)

This seems to be "wasting" the other 7 bits once that one bit is used; technically, one byte can support up to 256 distinct values (including NA and NaN).

Without much background, I assume this has to do with how R encodes the values, so they are always stored as int instead of unsigned int. I don't know if it's too expensive in terms of performance to relax this to 256 by converting to unsigned int. I do know Julia supports unsigned values with its UInt8 type.
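To make the arithmetic in this question concrete, here is a minimal sketch of an alternative encoding. It is not taken from fstlib: it assumes a hypothetical scheme that reserves one byte value (0) as the NA marker instead of spending a whole bit on it, and all names (`EncodeLevel`, `DecodeLevel`, the `k...` constants) are made up for illustration. R factor codes are 1-based integers and an integer NA is stored as INT_MIN, so with a sentinel byte 255 level codes fit, whereas a dedicated NA bit leaves 7 payload bits, i.e. 128 codes.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical constants for illustration; not part of fstlib.
constexpr int kNaInt = INT_MIN;          // R stores an integer NA as INT_MIN
constexpr std::uint8_t kNaByte = 0;      // assumed sentinel byte value for NA
constexpr int kMaxLevelsSentinel = 255;  // byte values 1..255 remain for levels
constexpr int kMaxLevelsFlagBit = 128;   // 7 payload bits when 1 bit flags NA

// Pack a 1-based factor code (or NA) into one unsigned byte.
inline std::uint8_t EncodeLevel(int code) {
    if (code == kNaInt) return kNaByte;
    assert(code >= 1 && code <= kMaxLevelsSentinel);
    return static_cast<std::uint8_t>(code);
}

// Restore the 1-based factor code (or NA) from the packed byte.
inline int DecodeLevel(std::uint8_t packed) {
    return packed == kNaByte ? kNaInt : static_cast<int>(packed);
}
```

Whether the extra comparison against the sentinel costs measurable read or write speed compared with the current layout would have to be benchmarked; the sketch only shows that the value range is available.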

Looking forward to full description of fst format

I know it's going to be a bit of work, but a full description of the fst format will help build connectors into it from Julia, Python, and any other programming language. The potential is huge for such an awesome on-disk data manipulation framework!

I will try to help when I know enough C++. I secretly hope that once the format is well known, there can be independent implementations in Julia and Rust (at the risk of running out of sync with C++); native implementations would be fun. But calling into C++ is also a good option.

Documentation on setting up a `C++` project using the `fstlib` library

Currently, there is no clear documentation on how to set up a C++ project using the fstlib library.

A sample C++ project would be a good starting point for potential users. Some ideas:

  • command line tool that can read meta-data of a fst file
  • command line tool that can provide statistics on the contents of a fst file
  • command line tool that can do some (parallel) calculations on a selected column from a fst file
  • command line tool that can convert parquet files to fst files using the arrow library as the in-memory tabular representation. That would really showcase the flexibility of the fstlib library.

Also, it would be interesting to compare the read and write speed of a pure C++ consumer to that of the fst package.

Scan multi-threaded code for false sharing

See for example here. To lower memory requirements, fstlib allocates larger blocks of memory that are written to by several threads. In such cases, cache line pollution must be avoided.

A solution is to make sure each sub-block has a size that is a multiple of the cache line size (64 bytes on most modern Intel processors).
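The padding rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not fstlib code: the 64-byte line size is hard-coded, and the helper names `PaddedBlockSize` and `BlockOffset` are hypothetical.

```cpp
#include <cstddef>

// Assumed cache line size; 64 bytes on most modern Intel processors.
constexpr std::size_t kCacheLineSize = 64;

// Round a sub-block size up to the next multiple of the cache line size.
constexpr std::size_t PaddedBlockSize(std::size_t bytes) {
    return (bytes + kCacheLineSize - 1) / kCacheLineSize * kCacheLineSize;
}

// Byte offset of the sub-block owned by thread `threadIndex` inside the
// shared buffer, given the unpadded number of bytes each thread needs.
constexpr std::size_t BlockOffset(std::size_t threadIndex,
                                  std::size_t bytesPerThread) {
    return threadIndex * PaddedBlockSize(bytesPerThread);
}
```

For example, with 100 bytes needed per thread, each sub-block is padded to 128 bytes and thread 3 starts at offset 384. Note that the padding only prevents two threads from sharing a line if the start of the shared buffer is itself aligned to a cache line boundary.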

Large datasets

First, thanks for the library!

What is the recommended approach for writing large datasets (e.g. 20+ GB CSV files)? Is there any way to stream reads and writes?

I have a hard time finding documentation on how to use it. The only documentation I found uses data frames. I am not an expert on R, but I think those are in-memory only.

Ideally, I would also like to use it in a Rust program, which means I'll probably need to write a Rust binding for the required parts. Happy to share it if you want!
