
flatdata's People

Contributors

artemnikitin, boxdot, fermeise, gferon, gmarti, hallahan, imagovrn, infinitybyten, lupax, stiar, tianchengli, veaac, vladbologa

flatdata's Issues

Define supported & tested target platforms

I think it might make sense to test multiple platforms in CI, in order to be able to support them officially and to make sure flatdata builds correctly on each of them.

I suppose testing on Ubuntu 16.04 LTS and CentOS 7 would already be enough to begin with.

Annotations for vectors with sentinels

Oftentimes flatdata vectors contain a sentinel at the end. It is easy to forget to exclude it when iterating.

One idea would be to add an annotation @sentinel that the generator could use to create two getters for the archive (sketched after the list):

  • get_xyz() const and get_xyz_with_sentinel() const
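
For illustration, a minimal self-contained Rust-flavored sketch of the two accessors. The names, the slice-based representation, and the archive type are made up for this example; the real generated code would differ per language.

// Hypothetical archive-like type whose `xyz` vector stores a trailing sentinel.
struct Xyz { value: u32 }

struct Archive { xyz: Vec<Xyz> }

impl Archive {
    // Elements without the trailing sentinel -- the common iteration case.
    fn get_xyz(&self) -> &[Xyz] {
        let all = self.get_xyz_with_sentinel();
        &all[..all.len().saturating_sub(1)]
    }

    // All elements, including the trailing sentinel.
    fn get_xyz_with_sentinel(&self) -> &[Xyz] {
        &self.xyz
    }
}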

What do you think @boxdot @imagovrn ?

Implicit bind of vector and multivector fails

When trying to implicitly bind a vector and a multivector, the generator fails. Example:

@bound_implicitly( characters: vertices, character_data )
archive Graph {
    vertices : vector< Character >;
    character_data: multivector< 32, Nickname, Description >;
}

Error:

12:41:29 - CRITICAL - app:67 - Error reading schema: Referring to incorrect type. Expected <class 'generator.tree.nodes.resources.vector.Vector'>, actual <class 'generator.tree.nodes.resources.multivector.Multivector'>

Is this expected? As far as I understand, it should be possible to bind both. In my example, I am referencing the elements of vertices and character_data with the same index.

A small side note: the error should probably refer to the offending line in the flatdata schema, instead of the line in the source code of the generator.

The generated archive should be thread-safe

In the Rust implementation, flatdata::ResourceStorage is not thread-safe and is missing a std::marker::Send implementation.

There are usages of RefCell and rc::Rc which probably need to be assessed for thread safety.
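
For context, a minimal sketch (generic Rust, not flatdata's actual types) of why Rc/RefCell block Send/Sync and what a thread-safe alternative could look like:

use std::cell::RefCell;
use std::rc::Rc;
use std::sync::{Arc, Mutex};

// Rc and RefCell are neither Send nor Sync, so storage built on them cannot
// be shared across threads.
struct SingleThreadedStorage {
    streams: Rc<RefCell<Vec<u8>>>,
}

// Swapping in Arc + Mutex (or RwLock) makes the equivalent type Send + Sync.
struct ThreadSafeStorage {
    streams: Arc<Mutex<Vec<u8>>>,
}

// Compile-time check that the thread-safe variant can cross thread boundaries.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    assert_send_sync::<ThreadSafeStorage>();
    // assert_send_sync::<SingleThreadedStorage>(); // does not compile
}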

C++ library: Add class owning memory for a single mutable struct

flatdata supports storing single structs in an archive (usually headers). Currently those are generated/written in C++ by creating a flatdata::Vector< T > v( 1 );. Adding a class that represents this concept more closely would be more intuitive and readable.

Proposal:

  • Add a template< typename T > flatdata::Struct{ ... } for this concept

Flatdata > flatdata generator loses information

I came across this bug while generating the Rust code from a flatdata schema that was itself the dumped schema of a previously built flatdata archive.

Here's a wrongly generated macro:

define_variadic_struct!(RelationMembers, RelationMembersItemBuilder,
    IndexType32,
    0 => (.osm.NodeMember, add_.osm._nodemember),
    1 => (.osm.WayMember, add_.osm._waymember),
    2 => (.osm.RelationMember, add_.osm._relationmember));

when you should get:

define_variadic_struct!(RelationMembers, RefRelationMembers, BuilderRelationMembers,
    IndexType32,
    0 => (NodeMember, add_node_member),
    1 => (WayMember, add_way_member),
    2 => (RelationMember, add_relation_member));

Add How-To Articles

  • How to flatten nested structures
  • How to serialize heterogeneous vectors
  • How to build an archive in C++
  • How to build an archive in Rust

Create centralized list of flatdata testcases

Since we now have 4 languages, and each has its own schemas for tests, I was thinking we should maybe create a central place for test cases:

  • for each feature of flatdata a sample schema
  • for each corner case a sample schema
  • each with a comment describing what should be tested

That would make it easy to test compliance for new language implementations, and reduce redundancy of the tests.

Opinions @boxdot @imagovrn @gferon ?

flatdata-rs does not handle namespaces

Reproducible:

namespace test_structures {

struct A {
    x : u32;
}

} // namespace test_structures

namespace test_structures2 {

struct A {
    x: u32;
}

} // namespace test_structures2

Fixing this is most likely not backwards compatible, so should we prioritize it?

@boxdot I can easily fix this, but I need some guidance on what the most natural way to handle this in Rust would be. A sub-module in the generated file (users could re-export to get rid of it if they do not like it)?
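
One possible shape, as a hedged sketch only (the module layout and the plain field representation are assumptions, not current generator output): one Rust sub-module per flatdata namespace, so both A structs can coexist.

// Hypothetical generated layout: one sub-module per flatdata namespace.
pub mod test_structures {
    pub struct A { pub x: u32 }
}

pub mod test_structures2 {
    pub struct A { pub x: u32 }
}

// Users who dislike the extra level can re-export under a different name.
pub use self::test_structures::A as TestStructuresA;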

Support includes in flatdata schema

Right now, a flatdata schema is a single file which cannot be split into separate parts. Especially since we support optional subarchives, it makes sense to be able to serialize a subarchive standalone and at the same time include it in a bigger archive.

Generate long constants with separators

Example:

Instead of

pub const INFINITE: u32 = 4294967295;

generate

pub const INFINITE: u32 = 4_294_967_295;

This would remove a clippy warning in Rust.

Another possibility would be to use #![allow(clippy::all)]. In the end, this is generated code and is not meant to be very readable.

Reading enums in Rust can lead to UB in case of incorrect file content

Unlike C/C++, Rust enums are exhaustive: all possible values need to be explicitly listed.

That means that casting from the underlying type (e.g. a 9-bit integer) to the enumeration can cause UB if the value stored in flatdata does not exist in the enumeration (e.g. because of a corrupted file).

Rust doesn't currently support open C-like enums, where any value of the underlying integer type is valid. This means that our options are a bit limited. Some ideas:

  • Create an enum with an additional (hidden?) variant Unknown, and check in the reader (speed impact?)
  • Check enum values every time data is read and panic (speed impact)
  • Do not expose as an enum, but as a newtype integer plus constants (see the sketch after this list)
  • If speed is impacted, a raw reading function might be needed for fast access
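
To make the last two options concrete, a self-contained sketch with made-up names: a checked conversion that panics on unknown raw values, and a newtype integer with associated constants.

// Option "check and panic": exhaustive match on the raw value read from disk.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Color { Red = 0, Green = 1, Blue = 2 }

fn color_from_raw(raw: u8) -> Color {
    match raw {
        0 => Color::Red,
        1 => Color::Green,
        2 => Color::Blue,
        other => panic!("corrupted archive: unknown Color value {}", other),
    }
}

// Option "newtype integer plus constants": every raw value is representable,
// so there is no UB; callers decide how to treat unknown values.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ColorRaw(pub u8);

impl ColorRaw {
    pub const RED: ColorRaw = ColorRaw(0);
    pub const GREEN: ColorRaw = ColorRaw(1);
    pub const BLUE: ColorRaw = ColorRaw(2);
}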

Insufficient tests for Python reader and writer

Right now, we have generated reader tests (in Python) for the C++ byte reader. The problem is that they always try to read the value 1 from the beginning of some data segment, never deeper in the data. In particular, they test that we read correctly with varying bit_size but not with varying offset. Note that when reading 1, the bit_size is not really significant.

Generated writer tests are missing completely.
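
To illustrate why reading the value 1 at offset 0 is a weak test, here is a generic little-endian bit reader (an assumption made for illustration; not flatdata's actual reader) together with the kind of cases the tests should also cover:

// Generic little-endian bit reader, used only to illustrate the gap.
fn read_bits(data: &[u8], bit_offset: usize, bit_size: usize) -> u64 {
    let mut value = 0u64;
    for i in 0..bit_size {
        let bit = bit_offset + i;
        if (data[bit / 8] >> (bit % 8)) & 1 == 1 {
            value |= 1 << i;
        }
    }
    value
}

fn main() {
    let data = [0b0000_0001u8, 0xFF];
    // Reading the value 1 at offset 0 looks the same for every bit_size ...
    assert_eq!(read_bits(&data, 0, 1), 1);
    assert_eq!(read_bits(&data, 0, 7), 1);
    // ... whereas a non-zero offset actually exercises the shifting logic.
    assert_eq!(read_bits(&data, 7, 3), 0b110);
}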

Add support for dynamically sized bitsets

flatdata supports statically sized bitsets as packed members (e.g. bool:1). We do not support a use case, though, in which the user needs to store a large set of bits (e.g. one per entity) in one blob, except manually through raw_data.

Should we consider adding something like a bitset to archives?
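
For reference, this is roughly the manual code users currently have to write on top of a raw_data blob (plain Rust, illustrative only; not part of flatdata):

// Minimal bitset view over a byte blob, e.g. one obtained from raw_data.
struct BitsetView<'a> { bytes: &'a [u8] }

impl<'a> BitsetView<'a> {
    fn get(&self, index: usize) -> bool {
        (self.bytes[index / 8] >> (index % 8)) & 1 == 1
    }

    fn len(&self) -> usize {
        self.bytes.len() * 8
    }
}

fn main() {
    let blob = [0b1010_0000u8];
    let bits = BitsetView { bytes: &blob };
    assert!(bits.get(5) && bits.get(7));
    assert_eq!(bits.len(), 8);
}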

Revamp generator tests

In addition to #96, I was thinking that we might want to simplify the generator tests: each one currently tests input schemas against some generated output, all in code.

Would it make sense to instead use the test schemas from #96 and just store a bunch of expectations as files?

E.g.

test-case tests/schemas/signed_enums.flatdata
expectation tests/generator/rust/generated/signed_enums.rs

Similar to the current tests, we might not require a 1:1 match, just a "contains" check.

The Python test runner would then just enumerate all test schemas and run them against the expectations.

Would that make sense? Is that something you normally do when testing python code, or is the current way preferable @boxdot @imagovrn ?

Optional struct member via sentinel values

Often we have optional struct fields implemented through special constants, e.g.:

const u32 INVALID_XYZ = 0xffff;

struct A {
    // INVALID_XYZ in case it's empty
    x : u32;
}

How about supporting that in the generated code properly: exposing an Option< T > instead of T?

Schema suggestion:

const u32 INVALID_XYZ = 0xffff;

struct A {
   @null_value( INVALID_XYZ )
    x : u32;
}

What do you think @boxdot @imagovrn ?

Support nested structures

Currently flatdata is truly "flat", but we could benefit from supporting nested structures.

Pros:

  • Easier to group data
  • Reusable grouping of variables

Cons:

  • Nested structures need implicit byte alignment, which would be less efficient in the case of bitfields.

Properly support multiple namespaces in Go

namespace n{
const i8 FOO = 0;
}

namespace m{
const i8 FOO = 1;
}

Currently results in

const (
...
    FOO int8 = 0
    FOO int8 = 1
)

Similarly for other constructs. This needs to be fixed if namespaces are to be properly supported by the Go backend.

FYI: @artemnikitin

Performance benchmarks

Benchmark the different flatdata implementations, perhaps also against similar libraries?

Strategy for closing when destructing ExternalVector (MultiVector).

Behavior right now: when an ExternalVector is destructed, we have a debug assert that the user has closed it. The same holds for MultiVector.

In a situation where a user who has multiple resources (that need to be closed) encounters an error and wants to make an early return, the resources are destructed, which terminates the program in debug mode. This is obviously not ideal. We should rather try to close in the destructor. The problem is that closing may fail, and we have two possible strategies for what to do then.

  1. On failure inside close, fail hard. This will terminate the program, since we are failing in a destructor.
  2. On failure inside close, do nothing (except for producing a log message).

I tend towards the second approach. Our contract specifies that the user should close resources explicitly. Otherwise, we just try to do it for them, and when we can't, we don't want to kill the whole program. This also follows the practice in https://doc.rust-lang.org/std/io/struct.BufReader.html.
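
Transliterated to Rust for illustration (the resource type and error handling are simplified; flatdata's C++ classes would apply the same idea in their destructors):

// Strategy 2 with a hypothetical resource type: Drop tries to close and only
// logs on failure instead of aborting the program.
struct ExternalResource { closed: bool }

impl ExternalResource {
    fn close(&mut self) -> Result<(), String> {
        self.closed = true;
        Ok(())
    }
}

impl Drop for ExternalResource {
    fn drop(&mut self) {
        if !self.closed {
            if let Err(e) = self.close() {
                // Never panic in a destructor, just report.
                eprintln!("failed to close resource in destructor: {}", e);
            }
        }
    }
}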

@VeaaC, @imagovrn: If you agree, I would implement this.

Simplify importing flatdata-rs

At the moment, to import flatdata in Rust, we need to write:

#[macro_use]
extern crate flatdata;

  • We should try to remove this requirement.
  • Also, the nightly compiler fails due to a missing flatdata_intersperse macro.
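
With the Rust 2018 edition, the generated code could import the macros it needs via plain use paths instead. This is only a sketch and assumes the macros are #[macro_export]ed so they resolve as items of the crate root:

// Rust 2018 style: no `#[macro_use] extern crate flatdata;` required.
use flatdata::define_variadic_struct;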

Expose sort on Vector

Often flatdata data needs to be sorted. Since Vector does not expose sort methods, the user has to sort another structure first before serializing. Exposing sort needs a special wrapper, since the Vector classes only expose proxies / proxy iterators.

This is distinct from external memory sort on ExternalVector: All the data is in memory.

FYI: @boxdot

Python: high performance backend

More Efficient Python Implementation

The current flatdata-py implementation is pure Python. So far we have used it only for processing smaller datasets and for inspection/debugging. It was noticed that on large datasets it performs quite slowly. It would be useful to have an implementation with performance not too far from the C++ one. In order to achieve that, we could do the following:

  • Benchmark two implementations on the same data, to know the gap, monitor the benchmarks in CI. #9
  • Optimize pure-python implementation.
  • Introduce parallel processing in pure python implementation (or ease integration with a library that would do it for us, like dask).
  • As an alternative approach, create flatdata-py-ext implementation which would build and use binary extensions to improve performance.

Make references more user friendly in Rust

Rust has the problem that all of flatdata's references are strongly typed: they use fixed-size integer types, not size_t/usize.

This means Rust code is often littered with:

g.strings().substring(vertices.at(data.to_ref() as usize).name_ref() as usize)?;
g.edges().slice(vertex.edges().start as usize .. vertex.edges().end as usize);

I propose adding the following:

  • at method for u8, u16, u32, u64
  • slice method for Range<u8, u16, u32, u64>
  • substring method for u8, u16, u32, u64

This means flatdata will not support architectures that have a smaller address space... but since flatdata uses memory-mapped files, it cannot do so anyway.
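
A minimal sketch of how the accessors could accept any fixed-width unsigned index (the trait and function names are made up; this is not flatdata's API):

// Hypothetical helper trait so that `at`, `slice` and `substring` can take
// u8/u16/u32/u64 directly instead of forcing `as usize` at every call site.
trait IntoIndex {
    fn into_index(self) -> usize;
}

macro_rules! impl_into_index {
    ($($t:ty),*) => {
        $(impl IntoIndex for $t {
            fn into_index(self) -> usize {
                // Assumes usize is at least as wide as the stored reference,
                // which holds on the 64-bit targets memory mapping needs.
                self as usize
            }
        })*
    };
}

impl_into_index!(u8, u16, u32, u64);

fn at<I: IntoIndex>(data: &[u32], index: I) -> u32 {
    data[index.into_index()]
}

fn main() {
    let data = vec![10, 20, 30];
    let idx: u16 = 2; // e.g. a fixed-width reference read from an archive
    assert_eq!(at(&data, idx), 30);
}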

Thoughts @boxdot ?

When closing ExternalVector return a view to its data.

ExternalVector does not provide indexed access to its data, since the data may not yet be written to disk. However, after the vector is closed, its data is fully written and constant from that point on. Often one wants to access the data after it was closed, so it would make sense for close to return an ArrayView to the data for convenience.
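
A rough sketch of the proposed API shape in Rust (types are heavily simplified and a plain slice stands in for ArrayView; the real ExternalVector differs):

// close() flushes the pending elements and returns a read-only view over the
// now-immutable data.
struct Storage { bytes: Vec<u32> }

struct ExternalVector<'a> {
    storage: &'a mut Storage,
    pending: Vec<u32>,
}

impl<'a> ExternalVector<'a> {
    fn close(self) -> &'a [u32] {
        let storage = self.storage;
        storage.bytes.extend_from_slice(&self.pending);
        storage.bytes.as_slice()
    }
}

fn main() {
    let mut storage = Storage { bytes: Vec::new() };
    let v = ExternalVector { storage: &mut storage, pending: vec![1, 2, 3] };
    let view = v.close();
    assert_eq!(view, &[1, 2, 3]);
}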

What do you think, @imagovrn, @VeaaC?
CC: @eikesr.

Add @optional( <name> ) annotation for optional fields

A common pattern used in flatdata is to use a fixed value to indicate a missing field, e.g.:

const u32 INVALID_INDEX = 0xFFFFFFFF;
struct MyStruct {
    @const( INVALID_INDEX )
     index : u32;
}

I propose making this a use case that we should support properly:

  • add @optional( <const_name> ) annotation
  • generate Option<T>/boost::optional/whatever getters/setters
  • Allow fast access to raw data via special getter (e.g. field_name_raw()?)

Publish 0.3 to crates.io

0.2 still has multiple issues (lifetimes, etc), so it would be nice to update to the latest version.

Also: can we set it up so that publishing is done either:

  • automatically, or
  • by a group of people?

Make unittesting more stable

Right now, make test does not work if make was not executed before. This is due to a limitation of how the external project is downloaded, and needs to be fixed.

I would also like to use the opportunity to start a discussion about changing the unit testing framework. Our unit tests are fairly simple and do not require mocking, so we have no use for GMock. On the other hand, GTest needs to be downloaded and built, which is OK for external dependencies that are really needed for the library, but is cumbersome for unit testing, which should just work out of the box.

I propose to either:

  1. Integrate GTest into our source code, or
  2. To use something less heavy like Catch2.

The latter is a header-only, single-file unit testing library which has already proven to work well for me many times.

What do you think, @imagovrn, @VeaaC?
