
flatdata's People

Contributors

artemnikitin, boxdot, fermeise, gferon, gmarti, hallahan, imagovrn, infinitybyten, lupax, stiar, tianchengli, veaac, vladbologa

flatdata's Issues

Define supported & tested target platforms

I think it might make sense to test multiple platforms in CI, in order to be able to support them officially and to make sure flatdata builds correctly on each of them.

I suppose testing on Ubuntu 16.04 LTS and CentOS 7 would already be enough to begin with.

Annotations for vectors with sentinels

Oftentimes flatdata vectors contain a sentinel at the end. It is easy to forget to exclude it when iterating.

One idea would be to add an annotation @sentinel that the generator could use to create two getters for the archive (sketched after the list):

  • get_xyz() const and get_xyz_with_sentinel() const
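
For illustration, a minimal self-contained Rust-flavored sketch of the two accessors. The names, the slice-based representation, and the archive type are made up for this example; the real generated code would differ per language.

// Hypothetical archive-like type whose `xyz` vector stores a trailing sentinel.
struct Xyz { value: u32 }

struct Archive { xyz: Vec<Xyz> }

impl Archive {
    // Elements without the trailing sentinel -- the common iteration case.
    fn get_xyz(&self) -> &[Xyz] {
        let all = self.get_xyz_with_sentinel();
        &all[..all.len().saturating_sub(1)]
    }

    // All elements, including the trailing sentinel.
    fn get_xyz_with_sentinel(&self) -> &[Xyz] {
        &self.xyz
    }
}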

What do you think @boxdot @imagovrn ?

Implicit bind of vector and multivector fails

When trying to implicitly bind a vector and a multivector, the generator fails. Example:

@bound_implicitly( characters: vertices, character_data )
archive Graph {
    vertices : vector< Character >;
    character_data: multivector< 32, Nickname, Description >;
}

Error:

12:41:29 - CRITICAL - app:67 - Error reading schema: Referring to incorrect type. Expected <class 'generator.tree.nodes.resources.vector.Vector'>, actual <class 'generator.tree.nodes.resources.multivector.Multivector'>

Is this expected? As far as I understand, it should be possible to bind both. In my example, I am referencing the elements of vertices and character_data with the same index.

A small side note: the error should probably refer to the offending line in the flatdata schema, instead of the line in the source code of the generator.

The generated archive should be thread-safe

In the Rust implementation, flatdata::ResourceStorage is not thread-safe and is missing a std::marker::Send implementation.

There are usages of RefCell and rc::Rc which probably need to be assessed for thread safety.
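
For context, a minimal sketch (generic Rust, not flatdata's actual types) of why Rc/RefCell block Send/Sync and what a thread-safe alternative could look like:

use std::cell::RefCell;
use std::rc::Rc;
use std::sync::{Arc, Mutex};

// Rc and RefCell are neither Send nor Sync, so storage built on them cannot
// be shared across threads.
struct SingleThreadedStorage {
    streams: Rc<RefCell<Vec<u8>>>,
}

// Swapping in Arc + Mutex (or RwLock) makes the equivalent type Send + Sync.
struct ThreadSafeStorage {
    streams: Arc<Mutex<Vec<u8>>>,
}

// Compile-time check that the thread-safe variant can cross thread boundaries.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    assert_send_sync::<ThreadSafeStorage>();
    // assert_send_sync::<SingleThreadedStorage>(); // does not compile
}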

C++ library: Add class owning memory for a single mutable struct

flatdata supports storing single structs in an archive (usually headers). Currently those are generated/written in C++ by creating a flatdata::Vector< T > v( 1 );. Adding a class that represents this concept more closely would be more intuitive and readable.

Proposal:

  • Add a template< typename T > flatdata::Struct{ ... } for this concept

Flatdata > flatdata generator loses information

I came across this bug while generating the Rust code from a flatdata schema that was itself the dumped schema of a previously built flatdata archive.

Here's a wrongly generated macro:

define_variadic_struct!(RelationMembers, RelationMembersItemBuilder,
    IndexType32,
    0 => (.osm.NodeMember, add_.osm._nodemember),
    1 => (.osm.WayMember, add_.osm._waymember),
    2 => (.osm.RelationMember, add_.osm._relationmember));

when you should get:

define_variadic_struct!(RelationMembers, RefRelationMembers, BuilderRelationMembers,
    IndexType32,
    0 => (NodeMember, add_node_member),
    1 => (WayMember, add_way_member),
    2 => (RelationMember, add_relation_member));

Add How-To Articles

  • How to flatten nested structures
  • How to serialize heterogeneous vectors
  • How to build an archive in C++
  • How to build an archive in Rust

Create centralized list of flatdata testcases

Since we now have 4 languages, and each has its own schemas for tests, I was thinking we should maybe create a central place for test cases:

  • for each feature of flatdata a sample schema
  • for each corner case a sample schema
  • each with a comment describing what should be tested

That would make it easy to test compliance for new language implementations, and reduce redundancy of the tests.

Opinions @boxdot @imagovrn @gferon ?

flatdata-rs does not handle namespaces

Reproducible:

namespace test_structures {

struct A {
    x : u32;
}

} // namespace test_structures

namespace test_structures2 {

struct A {
    x: u32;
}

} // namespace test_structures2

Fixing this is most likely not backwards compatible, so should we prioritize it?

@boxdot I can easily fix this, but I need some guidance on what the most natural way to handle this in Rust would be. A sub-module in the generated file (users could re-export to get rid of it if they do not like it)?
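
One possible shape, as a hedged sketch only (the module layout and the plain field representation are assumptions, not current generator output): one Rust sub-module per flatdata namespace, so both A structs can coexist.

// Hypothetical generated layout: one sub-module per flatdata namespace.
pub mod test_structures {
    pub struct A { pub x: u32 }
}

pub mod test_structures2 {
    pub struct A { pub x: u32 }
}

// Users who dislike the extra level can re-export under a different name.
pub use self::test_structures::A as TestStructuresA;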

Support includes in flatdata schema

Right now, a flatdata schema is a single file which cannot be split into separate parts. Especially since we support optional subarchives, it makes sense to be able to serialize a subarchive standalone and at the same time include it in a bigger archive.

Generate long constants with separators

Example:

Instead of

pub const INFINITE: u32 = 4294967295;

generate

pub const INFINITE: u32 = 4_294_967_295;

This would remove a clippy warning in Rust.

Another possibility would be to use #![allow(clippy::all)]. In the end, this is generated code and is not meant to be very readable.

Reading enums in Rust can lead to UB in case of incorrect file content

Unlike C/C++, Rust enums are exhaustive: all possible values need to be explicitly listed.

That means that casting from the underlying type (e.g. a 9-bit integer) to the enumeration can cause UB if the value stored in flatdata does not exist in the enumeration (e.g. because of a corrupted file).

Rust doesn't currently support open C-like enums, where any value of the underlying integer type is valid. This means that our options are a bit limited. Some ideas:

  • Create an enum with an additional (hidden?) variant Unknown, and check in the reader (speed impact?)
  • Check enum values every time data is read and panic (speed impact)
  • Do not expose as an enum, but as a newtype integer plus constants (see the sketch after this list)
  • If speed is impacted, a raw reading function might be needed for fast access
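
To make the last two options concrete, a self-contained sketch with made-up names: a checked conversion that panics on unknown raw values, and a newtype integer with associated constants.

// Option "check and panic": exhaustive match on the raw value read from disk.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Color { Red = 0, Green = 1, Blue = 2 }

fn color_from_raw(raw: u8) -> Color {
    match raw {
        0 => Color::Red,
        1 => Color::Green,
        2 => Color::Blue,
        other => panic!("corrupted archive: unknown Color value {}", other),
    }
}

// Option "newtype integer plus constants": every raw value is representable,
// so there is no UB; callers decide how to treat unknown values.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ColorRaw(pub u8);

impl ColorRaw {
    pub const RED: ColorRaw = ColorRaw(0);
    pub const GREEN: ColorRaw = ColorRaw(1);
    pub const BLUE: ColorRaw = ColorRaw(2);
}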

Insufficient tests for Python reader and writer

Right now, we have generated reader tests (in Python) for the C++ byte reader. The problem is that they always try to read the value 1 from the beginning of some data segment, never deeper in the data. In particular, they test that we read correctly with varying bit_size but not with varying offset. Note that when reading 1, the bit_size is not really significant.

Generated writer tests are missing completely.
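
To illustrate why reading the value 1 at offset 0 is a weak test, here is a generic little-endian bit reader (an assumption made for illustration; not flatdata's actual reader) together with the kind of cases the tests should also cover:

// Generic little-endian bit reader, used only to illustrate the gap.
fn read_bits(data: &[u8], bit_offset: usize, bit_size: usize) -> u64 {
    let mut value = 0u64;
    for i in 0..bit_size {
        let bit = bit_offset + i;
        if (data[bit / 8] >> (bit % 8)) & 1 == 1 {
            value |= 1 << i;
        }
    }
    value
}

fn main() {
    let data = [0b0000_0001u8, 0xFF];
    // Reading the value 1 at offset 0 looks the same for every bit_size ...
    assert_eq!(read_bits(&data, 0, 1), 1);
    assert_eq!(read_bits(&data, 0, 7), 1);
    // ... whereas a non-zero offset actually exercises the shifting logic.
    assert_eq!(read_bits(&data, 7, 3), 0b110);
}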

Add support for dynamically sized bitsets

flatdata supports statically sized bitsets as packed members (e.g. bool:1). We do not support a use case, though, in which the user needs to store a large set of bits (e.g. one per entity) in one blob, except manually through raw_data.

Should we consider adding something like a bitset to archives?
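
For reference, this is roughly the manual code users currently have to write on top of a raw_data blob (plain Rust, illustrative only; not part of flatdata):

// Minimal bitset view over a byte blob, e.g. one obtained from raw_data.
struct BitsetView<'a> { bytes: &'a [u8] }

impl<'a> BitsetView<'a> {
    fn get(&self, index: usize) -> bool {
        (self.bytes[index / 8] >> (index % 8)) & 1 == 1
    }

    fn len(&self) -> usize {
        self.bytes.len() * 8
    }
}

fn main() {
    let blob = [0b1010_0000u8];
    let bits = BitsetView { bytes: &blob };
    assert!(bits.get(5) && bits.get(7));
    assert_eq!(bits.len(), 8);
}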

Revamp generator tests

In addition to #96, I was thinking that we might want to simplify the generator tests: each one currently tests input schemas against some generated output, all in code.

Would it make sense to instead use the test schemas from #96 and just store a bunch of expectations as files?

E.g.

test-case tests/schemas/signed_enums.flatdata
expectation tests/generator/rust/generated/signed_enums.rs

Similar to the current tests, we might not require a 1:1 match, just a "contains" check.

The Python test runner would then just enumerate all test schemas and run them against the expectations.

Would that make sense? Is that something you normally do when testing python code, or is the current way preferable @boxdot @imagovrn ?

Optional struct member via sentinel values

Often we have optional struct fields implemented through special constants, e.g.:

const u32 INVALID_XYZ = 0xffff;

struct A {
    // INVALID_XYZ in case it's empty
    x : u32;
}

How about supporting that in the generated code properly: exposing an Option< T > instead of T?

Schema suggestion:

const u32 INVALID_XYZ = 0xffff;

struct A {
   @null_value( INVALID_XYZ )
    x : u32;
}

What do you think @boxdot @imagovrn ?

Support nested structures

Currently flatdata is truly "flat", but we could benefit from supporting nested structures.

Pros:

  • Easier to group data
  • Reusable grouping of variables

Cons:

  • Nested structures need implicit byte alignment, which would be less efficient in the case of bitfields.

Properly support multiple namespaces in Go

namespace n{
const i8 FOO = 0;
}

namespace m{
const i8 FOO = 1;
}

Currently results in

const (
...
    FOO int8 = 0
    FOO int8 = 1
)

Similarly for other constructs. This needs to be fixed if namespaces are to be properly supported by the Go backend.

FYI: @artemnikitin

Performance benchmarks

Benchmark the different flatdata implementations, perhaps also against similar libraries?

Strategy for closing when destructing ExternalVector (MultiVector).

Behavior right now: when an ExternalVector is destructed, we have a debug assert that the user has closed it. The same holds for MultiVector.

In a situation where a user who has multiple resources (that need to be closed) encounters an error and wants to make an early return, the resources are destructed, which terminates the program in debug mode. This is obviously not ideal. We should rather try to close in the destructor. The problem is that closing may fail, and we have two possible strategies for what to do then.

  1. On failure inside close, fail hard. This will terminate the program, since we are failing in a destructor.
  2. On failure inside close, do nothing (except for producing a log message).

I tend towards the second approach. Our contract specifies that the user should close resources explicitly. Otherwise, we just try to do it for them, and when we can't, we don't want to kill the whole program. This also follows the practice in https://doc.rust-lang.org/std/io/struct.BufReader.html.
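
Transliterated to Rust for illustration (the resource type and error handling are simplified; flatdata's C++ classes would apply the same idea in their destructors):

// Strategy 2 with a hypothetical resource type: Drop tries to close and only
// logs on failure instead of aborting the program.
struct ExternalResource { closed: bool }

impl ExternalResource {
    fn close(&mut self) -> Result<(), String> {
        self.closed = true;
        Ok(())
    }
}

impl Drop for ExternalResource {
    fn drop(&mut self) {
        if !self.closed {
            if let Err(e) = self.close() {
                // Never panic in a destructor, just report.
                eprintln!("failed to close resource in destructor: {}", e);
            }
        }
    }
}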

@VeaaC, @imagovrn: If you agree, I would implement this.

Simplify importing flatdata-rs

At the moment, to import flatdata in Rust, we need to write:

#[macro_use]
extern crate flatdata;

  • We should try to remove this requirement.
  • Also, the nightly compiler fails due to a missing flatdata_intersperse macro.
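
With the Rust 2018 edition, the generated code could import the macros it needs via plain use paths instead. This is only a sketch and assumes the macros are #[macro_export]ed so they resolve as items of the crate root:

// Rust 2018 style: no `#[macro_use] extern crate flatdata;` required.
use flatdata::define_variadic_struct;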

Expose sort on Vector

Often flatdata data needs to be sorted. Since Vector does not expose sort methods, the user has to sort another structure first before serializing. Exposing sort needs a special wrapper, since the Vector classes only expose proxies / proxy iterators.

This is distinct from external memory sort on ExternalVector: All the data is in memory.

FYI: @boxdot

Python: high performance backend

More Efficient Python Implementation

The current flatdata-py implementation is pure Python. So far we have used it only for processing smaller datasets and for inspection/debugging. It was noticed that on large datasets it performs quite slowly. It would be useful to have an implementation with performance not too far from the C++ one. In order to achieve that, we could do the following:

  • Benchmark two implementations on the same data, to know the gap, monitor the benchmarks in CI. #9
  • Optimize pure-python implementation.
  • Introduce parallel processing in pure python implementation (or ease integration with a library that would do it for us, like dask).
  • As an alternative approach, create flatdata-py-ext implementation which would build and use binary extensions to improve performance.

Make references more user friendly in Rust

Rust has the problem that all of flatdata's references are strongly typed: they use fixed-size integer types, not size_t/usize.

This means Rust code is often littered with:

g.strings().substring(vertices.at(data.to_ref() as usize).name_ref() as usize)?;
g.edges().slice(vertex.edges().start as usize .. vertex.edges().end as usize);

I propose adding the following:

  • at method for u8, u16, u32, u64
  • slice method for Range<u8, u16, u32, u64>
  • substring method for u8, u16, u32, u64

This means flatdata will not support architectures that have a smaller address space... but since flatdata uses memory-mapped files, it cannot do so anyway.
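
A minimal sketch of how the accessors could accept any fixed-width unsigned index (the trait and function names are made up; this is not flatdata's API):

// Hypothetical helper trait so that `at`, `slice` and `substring` can take
// u8/u16/u32/u64 directly instead of forcing `as usize` at every call site.
trait IntoIndex {
    fn into_index(self) -> usize;
}

macro_rules! impl_into_index {
    ($($t:ty),*) => {
        $(impl IntoIndex for $t {
            fn into_index(self) -> usize {
                // Assumes usize is at least as wide as the stored reference,
                // which holds on the 64-bit targets memory mapping needs.
                self as usize
            }
        })*
    };
}

impl_into_index!(u8, u16, u32, u64);

fn at<I: IntoIndex>(data: &[u32], index: I) -> u32 {
    data[index.into_index()]
}

fn main() {
    let data = vec![10, 20, 30];
    let idx: u16 = 2; // e.g. a fixed-width reference read from an archive
    assert_eq!(at(&data, idx), 30);
}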

Thoughts @boxdot ?

When closing ExternalVector return a view to its data.

ExternalVector does not provide indexed access to its data, since the data may not yet be written to disk. However, after the vector is closed, its data is fully written and constant from that point on. Often one wants to access the data after it was closed, so it would make sense for close to return an ArrayView to the data for convenience.
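
A rough sketch of the proposed API shape in Rust (types are heavily simplified and a plain slice stands in for ArrayView; the real ExternalVector differs):

// close() flushes the pending elements and returns a read-only view over the
// now-immutable data.
struct Storage { bytes: Vec<u32> }

struct ExternalVector<'a> {
    storage: &'a mut Storage,
    pending: Vec<u32>,
}

impl<'a> ExternalVector<'a> {
    fn close(self) -> &'a [u32] {
        let storage = self.storage;
        storage.bytes.extend_from_slice(&self.pending);
        storage.bytes.as_slice()
    }
}

fn main() {
    let mut storage = Storage { bytes: Vec::new() };
    let v = ExternalVector { storage: &mut storage, pending: vec![1, 2, 3] };
    let view = v.close();
    assert_eq!(view, &[1, 2, 3]);
}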

What do you think, @imagovrn, @VeaaC?
CC: @eikesr.

Add @optional( <name> ) annotation for optional fields

A common pattern used in flatdata is to use a fixed value to indicate a missing field, e.g.:

const u32 INVALID_INDEX = 0xFFFFFFFF;
struct MyStruct {
    @const( INVALID_INDEX )
     index : u32;
}

I propose making this a use case that we should support properly:

  • add @optional( <const_name> ) annotation
  • generate Option<T>/boost::optional/whatever getters/setters
  • Allow fast access to raw data via special getter (e.g. field_name_raw()?)

Publish 0.3 to crates.io

0.2 still has multiple issues (lifetimes, etc), so it would be nice to update to the latest version.

Also: can we set it up so that publishing is done either:

  • automatically, or
  • by a group of people?

Make unittesting more stable

Right now, make test does not work if make was not executed before. This is due to a limitation of how the external project is downloaded, and needs to be fixed.

I would also like to use the opportunity to start a discussion about changing the unit testing framework. Our unit tests are fairly simple and do not require mocking, so we have no use for GMock. On the other hand, GTest needs to be downloaded and built, which is OK for external dependencies that are really needed for the library, but is cumbersome for unit testing, which should just work out of the box.

I propose to either:

  1. Integrate GTest into our source code, or
  2. To use something less heavy like Catch2.

The latter is a header-only, single-file unit testing library which has already proven to work well for me many times.

What do you think, @imagovrn, @VeaaC?
