
borsh's People

Contributors

ailisp, alecmocatta, alexfilatov, bowenwang1996, dhruvja, etodanik, evgenykuzyakov, frol, gagdiez, ilblackdragon, k06a, lexfrl, magicrb, maksymzavershynskyi, marcus-pousette, mikedotexe, mikhailok, nhynes, shadeglare, snjax, stolkerve, vgrichina, volovyks, zicklag


borsh's Issues

Deserialization security

We should add a suite of tests for deserialization security, specifically:

  • Pass an invalid enum value
  • Pass a MAX INT length for arrays / strings / hashmaps (cc @frol; the memory allocation issues are preventable by a simple check: if len > buf.len() { return Err() })
  • Pass a non-UTF-8 string
  • Omit part of the message
  • Append extra bytes after the message
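The length check mentioned above could look roughly like this (a minimal sketch with a hypothetical helper name, assuming a little-endian u32 length prefix):

```rust
// Sketch of the "len > buf.len()" guard; helper name is illustrative,
// not part of borsh's actual API.
fn read_len_checked(buf: &[u8]) -> Result<(usize, &[u8]), String> {
    if buf.len() < 4 {
        return Err("missing length prefix".to_string());
    }
    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let rest = &buf[4..];
    // A declared length larger than the remaining input can never be valid,
    // so we reject it before allocating anything.
    if len > rest.len() {
        return Err("declared length exceeds input".to_string());
    }
    Ok((len, rest))
}
```

This rejects a pathological MAX INT length immediately instead of attempting a huge allocation.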

Code coverage

borsh-derive-internals is designed for testability. It'd be great to see how well the tests are doing by including code coverage (e.g. using tarpaulin + codecov/coveralls).

UTF-8 Consistency

As suggested by @vgrichina, we need to make sure the Rust and JS implementations fail and succeed on exactly the same UTF-8 sequences.

Also, we need to make sure the specification explains that implementations must reject invalid UTF-8 to allow for deterministic round-trips.
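On the Rust side, strict validation is essentially what String::from_utf8 already performs; a deserializer that simply propagates its error (a sketch, not borsh's actual code) gets the deterministic round-trip for free:

```rust
// Sketch: rejecting invalid UTF-8 at deserialization time. String::from_utf8
// validates the bytes and returns an error for any illegal sequence.
fn deserialize_string(bytes: Vec<u8>) -> Result<String, String> {
    String::from_utf8(bytes).map_err(|e| format!("invalid UTF-8: {}", e))
}
```

The JS implementation would then need to fail on exactly the set of byte sequences this rejects.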

Set up testing with miri to catch undefined behavior bugs

Since Borsh is heavily focused on security, we should use all the available tooling to ensure that we catch as many corner cases as possible.

Miri is an interpreter for Rust's mid-level intermediate representation.

Using Miri is as simple as cargo miri test, but there are a few quirks:

  • Miri is only available with the nightly toolchain (not a problem, just saying)
  • Miri does not support workspaces, so we need to run it against specific crates (not a problem either)
  • The compilation step with Miri is quite RAM-hungry: I could not compile the Borsh tests even with 23 GB available (16 GB RAM + 7 GB swap)

Thoughts on borrowing from the input

Is it within the purview of this library to support borrowing byte slices from the deserialization input? Zero-copy would be nice, but it would make codegen ever so slightly more complicated. What is borsh-rs's stance on features vs. size creep?

Any chance of supporting Golang?

I'm looking for something to replace bincode. It would be great if borsh had first-class Go support, since that is my only blocker! Thanks

Add `#[borsh_optional]` marker for backwards compatibility

Motivation

Suppose we have rust structure:

#[derive(BorshSerialize, BorshDeserialize)]
struct A {
  f1: T1,
  f2: T2
}

Suppose we have serialized into some data (e.g. on disk in rocksdb, in contract state, or circulating in network). Then we want to upgrade this structure by adding another field:

#[derive(BorshSerialize, BorshDeserialize)]
struct A {
  f1: T1,
  f2: T2,
  f3: T3
}

It would be extremely convenient for upgradability if we could deserialize old data using new Rust type.

Proposal

We can introduce #[borsh_optional] decorator that can be used like this:

#[derive(BorshSerialize, BorshDeserialize)]
struct A {
  f1: T1,
  f2: T2,
  #[borsh_optional]
  f3: Option<T3>
}

Then, when we deserialize old data into this structure, f3 will be None; when we deserialize new data, it will be Some.

It will only work if the optional fields come last:

#[derive(BorshSerialize, BorshDeserialize)]
struct A {
  f1: T1,
  f2: T2,
  #[borsh_optional]
  f3: Option<T3>,
  #[borsh_optional]
  f4: Option<T4>,
  #[borsh_optional]
  f5: Option<T5>
}

And compilation should fail in situations like the following:

#[derive(BorshSerialize, BorshDeserialize)]
struct A {
  f1: T1,
  f2: T2,
  #[borsh_optional]
  f3: Option<T3>,
  f4: Option<T4>,
  #[borsh_optional]
  f5: Option<T5>
}

CC @mfornet Since it might be relevant to near/NEPs#95
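The intended decoding behavior can be sketched as a hand-written equivalent of what the derive might generate (hypothetical code, with concrete u32 fields standing in for T1/T2/T3):

```rust
use std::io::{Error, ErrorKind, Read, Result};

// Little-endian u32 reader, the shape borsh uses for integers.
fn read_u32(reader: &mut impl Read) -> Result<u32> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}

struct A {
    f1: u32,
    f2: u32,
    f3: Option<u32>, // conceptually marked #[borsh_optional]
}

fn deserialize_a(mut input: &[u8]) -> Result<A> {
    let f1 = read_u32(&mut input)?;
    let f2 = read_u32(&mut input)?;
    // Old data simply ends here; new data carries the Option's tag byte.
    let f3 = if input.is_empty() {
        None
    } else {
        let mut tag = [0u8; 1];
        input.read_exact(&mut tag)?;
        match tag[0] {
            0 => None,
            1 => Some(read_u32(&mut input)?),
            _ => return Err(Error::new(ErrorKind::InvalidData, "invalid Option tag")),
        }
    };
    Ok(A { f1, f2, f3 })
}
```

Old serialized data (8 bytes) decodes with f3 = None, while new data with the extra tagged field decodes with f3 = Some(..).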

[Discussion] General workflow (common in all borsh implementations) of borsh

I propose a common set of operations that should be implemented by all borsh implementations across programming languages. Some implementations, like Rust's (which has macros), can offer additional features such as #[derive(BorshSerialize)].

From the user's perspective, borsh would be used this way:

  • Write a borsh schema definition that is common across languages. Currently it's in JSON, but JSON isn't very human-friendly to write, so we may consider YAML, TOML, or a Rust-like type-definition DSL. These are all equivalent and can be trivially converted to each other. This defines the types users want to serialize/deserialize.

  • borsh should be able to generate the following from a schema:

    • language-native type definitions (classes for Python/JS, structs for C, Go, Rust, Solidity)
    • functions/methods to serialize and deserialize these types
  • People then use the generated source code.

With this schema-based approach, each language's borsh implementation consists of:

  • (optional) conversion between the JSON schema and other formats of the borsh definition
  • a CLI to generate source code from a schema definition
  • small utility functions to serialize/deserialize certain primitive types: strings, vecs, unsigned integers, etc.
  • (for dynamically typed languages) a dynamic version of the ser/deser functions that takes a schema and bytes/an object and returns the deserialized object/serialized bytes

@nearmax WDYT? Is this how the borsh schema is supposed to work?

Investigate safety of trait objects

Make sure Borsh either does not work with trait objects at all (because we don't know the type that we need to deserialize into), or, if it does work, that it works correctly.

Rewrite `BorshSchema` to use const functions

Currently BorshSchema has static but not constant methods. When compiled to Wasm, these methods occupy significant space. They also create significant execution overhead when self-described borsh serialization/deserialization is invoked via https://docs.rs/borsh/0.6.2/borsh/schema_helpers/index.html

To fix this we need to implement a const version of BorshSchema::schema_container(). Unfortunately, that means two things:

  • We can only use types allowed by the const context;
  • schema_container() cannot return a type that requires allocation.

We currently intend to serialize BorshSchemaContainer using either borsh or JSON. Therefore we can have two versions of schema_container():

  • schema_container_borsh() -> &[u8];
  • schema_container_json() -> &str;

Both return the container in already-serialized form. We can then allow reconstructing BorshSchemaContainer from it, if necessary.
schema_container_json would internally define const variables for each type and recursively concatenate them using std::concat. As a result, each type will have a compile-time computed schema. A similar technique can be used with byte slices for schema_container_borsh.

This will improve performance in the following way:

  • If we want to serialize a self-described borsh type using https://docs.rs/borsh/0.6.2/borsh/schema_helpers/index.html, the helper will use schema_container_borsh to prepend an already-generated sequence of bytes in front of the borsh-serialized object. Upon deserialization of that object, the helper will check that the schema in the self-described type matches the schema of the type it deserializes into.
  • Wasm will have hardcoded schemas instead of the code that generates them.

Single byte copies slowing things down

Disclaimer: The following performance tests were done using Rust on BPF which is under development.

I noticed that upgrading Borsh from 2.4 to 2.5 causes a large increase in the number of BPF instructions it takes to serialize and deserialize byte arrays: from a few thousand instructions to 20k+ for a 32-byte array. The performance went from beating bincode to being far worse than it.

It looks like the single-byte copies introduced in this PR are the culprit: #20

Instead of copying the entire array with a single extend_from_slice, extend_from_slice is performed once per byte. It's possible that other rustc targets optimize this better than the BPF backend does.
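For illustration, the two strategies side by side (a minimal sketch, not borsh's actual serializer code):

```rust
// Bulk copy: one extend_from_slice over the whole array, which can lower
// to a single memcpy-style operation.
fn serialize_bytes_bulk(out: &mut Vec<u8>, data: &[u8]) {
    out.extend_from_slice(data);
}

// Per-byte copy: one extend_from_slice call per element, which is what the
// regression amounts to on targets that fail to optimize the loop away.
fn serialize_bytes_per_byte(out: &mut Vec<u8>, data: &[u8]) {
    for b in data {
        out.extend_from_slice(&[*b]);
    }
}
```

Both produce identical output; only the number of copy operations differs.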

Implement a customizable borsh_serialize in derive macros

Currently we can't do much to affect how derive(BorshSerialize, BorshDeserialize) works besides borsh_skip; however, one more capability would be very useful. Consider this big structure in wasmer:

pub struct ModuleInfo {
    pub memories: Map<LocalMemoryIndex, MemoryDescriptor>,
    pub globals: Map<LocalGlobalIndex, GlobalInit>,
    pub tables: Map<LocalTableIndex, TableDescriptor>,
    pub imported_functions: Map<ImportedFuncIndex, ImportName>,
    pub imported_memories: Map<ImportedMemoryIndex, (ImportName, MemoryDescriptor)>,
    pub imported_tables: Map<ImportedTableIndex, (ImportName, TableDescriptor)>,
    pub imported_globals: Map<ImportedGlobalIndex, (ImportName, GlobalDescriptor)>,
    pub exports: IndexMap<String, ExportIndex>,
    pub data_initializers: Vec<DataInitializer>,
    pub elem_initializers: Vec<TableInitializer>,
    pub start_func: Option<FuncIndex>,
    pub func_assoc: Map<FuncIndex, SigIndex>,
    pub signatures: Map<SigIndex, FuncSig>,
    pub backend: String,
    pub namespace_table: StringTable<NamespaceIndex>,
    pub name_table: StringTable<NameIndex>,
    pub em_symbol_map: Option<HashMap<u32, String>>,
    pub custom_sections: HashMap<String, Vec<Vec<u8>>>,
    pub generate_debug_info: bool,
    #[borsh_skip]
    pub(crate) debug_info_manager: jit_debug::JitCodeDebugInfoManager,
}

Every field in this giant struct can derive BorshSerialize and BorshDeserialize except one: IndexMap<String, ExportIndex>. IndexMap is not a type defined in this crate, nor in std or borsh, so you cannot impl BorshSerialize for IndexMap; and because of this one field, you cannot derive BorshSerialize for the giant struct. There are two workarounds:

  1. Change pub exports: IndexMap<String, ExportIndex> to pub exports: ExportsMap and define a struct ExportsMap enclosing IndexMap, so you can impl BorshSerialize/BorshDeserialize on ExportsMap and make ModuleInfo Borsh-derivable. But this turns every reference to exports into exports.inner or exports.0.
  2. Implement BorshSerialize/BorshDeserialize on ModuleInfo manually.

Either one is a big inconvenience or causes structural hacks. So I propose borsh_serializer/borsh_deserializer attributes:

fn borsh_serialize_index_map<K: BorshSerialize, V: BorshSerialize, W: Write>(index_map: &IndexMap<K, V>, writer: &mut W) -> std::io::Result<()> {
    ...
}

#[borsh_serializer(borsh_serialize_index_map)]
#[borsh_deserializer(borsh_deserialize_index_map)]
pub exports: IndexMap<String, ExportIndex>,

With the help of these attributes, users can specify a custom borsh serializer/deserializer for a field of a struct, making the whole struct borsh-derivable.
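A sketch of what the generated code could look like for one annotated field (the helper, the expansion, and the String field are all hypothetical stand-ins; a real IndexMap field would work the same way):

```rust
use std::io::{Result, Write};

// User-provided serializer for a foreign type the derive cannot handle.
// Here a length-prefixed string stands in for the foreign type.
fn serialize_with_len<W: Write>(value: &str, writer: &mut W) -> Result<()> {
    writer.write_all(&(value.len() as u32).to_le_bytes())?;
    writer.write_all(value.as_bytes())
}

struct Wrapper {
    name: String, // imagine: #[borsh_serializer(serialize_with_len)]
}

impl Wrapper {
    // Instead of emitting self.name.serialize(writer), the derive would
    // emit a call to the user-provided function for the annotated field.
    fn serialize<W: Write>(&self, writer: &mut W) -> Result<()> {
        serialize_with_len(&self.name, writer)
    }
}
```

The derive only swaps the call site for the annotated field; all other fields keep their normal BorshSerialize path.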

Implement borsh-c

Besides #84, borsh-c should generate a header file and serialization/deserialization C source code given an existing schema file. The resulting C source code can then be compiled into a C project. Note that borsh-c itself does not necessarily need to be implemented in C; implementing it in Rust would probably be faster.

Optimize borsh deserialization of Vec<T>

Vec seems to take too much gas. An easy optimization would be a Vec implementation with buffered reads, similar to strings, but it's unclear how to handle non-fixed-size element types, e.g. Vec
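For the fixed-size case, the buffered-read idea is straightforward (illustrative sketch, not borsh's actual code; assumes the length has already been validated against the input size):

```rust
use std::io::{Read, Result};

// Read a whole Vec<u8> payload with one read_exact call instead of one
// read per element.
fn read_byte_vec(reader: &mut impl Read, len: usize) -> Result<Vec<u8>> {
    let mut data = vec![0u8; len];
    reader.read_exact(&mut data)?; // single bulk read
    Ok(data)
}
```

Non-fixed-size element types cannot be read this way because the byte width of each element is only known after decoding it, which is exactly the open question above.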

Schema generation for JS and TS, and other languages

Borsh is not a self-descriptive format, so some languages like JS need to either generate a schema (which is currently consumed by borsh-js) or generate the full class implementation in JS.

We could take the following approach: write a cargo extension that provides a command like cargo generate-borsh js, which generates JS classes from Rust types by walking over the crate, looking for types decorated with #[derive(BorshSerialize, BorshDeserialize)], and dumping the generated JS analog while preserving the directory structure. We need to decide what we want to do with types that implement BorshSerialize/BorshDeserialize explicitly; I suggest we skip them, delegating the JS code to the user. We can use https://crates.io/crates/syn for that.

Another question to discuss, CC @ilblackdragon @vgrichina: what is the advantage of generating a schema that is later consumed by BinaryReader over generating the full class implementation in JS? That it is more human-readable? Is there any performance disadvantage?

Fix BorshDeserialize for bool and Option

BorshDeserialize for bool and Option accepts arbitrary values where it should only allow 0 or 1.
This makes it possible for an object to have multiple representations, which can potentially enable attacks on our usage of borsh.
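A strict decoder would reject anything other than 0 or 1, so each bool has exactly one byte representation (a sketch, not the actual borsh internals):

```rust
// Consume one byte from the input and require it to be exactly 0 or 1.
fn deserialize_bool(input: &mut &[u8]) -> Result<bool, String> {
    let (first, rest) = input.split_first().ok_or("unexpected end of input")?;
    *input = rest; // advance past the consumed byte
    match *first {
        0 => Ok(false),
        1 => Ok(true),
        b => Err(format!("invalid bool byte: {}", b)),
    }
}
```

The same tag check applies to Option's discriminant byte.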

Capacity allocation for vector deserialization

Currently, because we don't know the size of the buffer inside the deserialize method, we can't detect when the declared length is way too large and error out (right now it would segfault due to a memory allocation error).

See test_invalid_length for example.
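One mitigation when the total input size is unknown is to cap the initial allocation and let the vector grow as elements are actually read (sketch; the helper name and the 1 MiB cap are arbitrary assumptions, not borsh's behavior):

```rust
// Allocate at most a bounded amount of memory up front, regardless of the
// length declared in the (possibly malicious) input.
fn vec_with_safe_capacity<T>(declared_len: usize) -> Vec<T> {
    const MAX_PREALLOC_BYTES: usize = 1 << 20; // 1 MiB cap, an assumption
    let elem_size = std::mem::size_of::<T>().max(1); // guard against ZSTs
    let cap = declared_len.min(MAX_PREALLOC_BYTES / elem_size);
    Vec::with_capacity(cap)
}
```

A truncated or hostile input then fails with a normal "unexpected end of input" error during element reads, instead of aborting on a giant allocation.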

Fuzz testing

We need to write the following fuzz tests for borsh:

A) Generate a random type. Create an object of that type filled with random data. Then serialize and deserialize it, and check that the structures before and after are the same;
B) Generate a random type. Create an object of that type filled with random data. Serialize it. Randomly flip a subset of bits in the serialized structure. Try deserializing it and assert that it does not panic, but instead either deserializes or returns an error.

The two difficult things to implement would be:

  • Generating a random type;
  • Creating an object of the type filled with random data.

As an option, I suggest we do both using procedural macros. We can have a macro random_type!(Name, X, Y, seed) that generates a token stream corresponding to a declaration of some type Name using https://doc.rust-lang.org/reference/procedural-macros.html#function-like-procedural-macros where X would be the max depth (e.g. if we have nested structures) and Y is the max width of each node (e.g. max number of fields in a struct or max number of variants in an enum).

Each type would also be decorated with #[derive(RandomInit)] which implements trait

trait RandomInit {
    fn random_init() -> Self;
}

for the type, just like we do with serializers. We then would implement RandomInit for basic types and collections, just like we do with serializers.

Then our test would be something like:

random_type!(T0, 1, 1, 42);
...
random_type!(T42, 10, 12, 42);

#[test]
fn test0() {
  for _ in 0..100 {
   let t0 = T0::random_init();
   let out_t0: T0 = try_from_slice(&t0.try_to_vec().unwrap()).unwrap();
   assert_eq!(t0, out_t0);
  }
}

Note: we should also look at the fuzzing tools that Sigma Prime wrote for borsh; we might not need to write them ourselves.
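A self-contained sketch of the RandomInit idea, here threading an explicit seed so runs are reproducible (the trait shape differs slightly from the proposal above, which passes the seed via the macro; the generator is a toy):

```rust
// Deterministic random initialization for test values.
trait RandomInit {
    fn random_init(seed: &mut u64) -> Self;
}

// One xorshift64 step: tiny, deterministic, dependency-free. Not suitable
// for anything but generating test data.
fn next(seed: &mut u64) -> u64 {
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    *seed
}

impl RandomInit for u32 {
    fn random_init(seed: &mut u64) -> Self {
        next(seed) as u32
    }
}

// Collections build on the element impls, mirroring how the serializers
// are layered.
impl<T: RandomInit> RandomInit for Vec<T> {
    fn random_init(seed: &mut u64) -> Self {
        let len = (next(seed) % 8) as usize;
        (0..len).map(|_| T::random_init(seed)).collect()
    }
}
```

With a fixed seed, the round-trip test generates the same objects on every run, which makes failures reproducible.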

Byte Arrays

use borsh::{BorshSerialize, BorshDeserialize};

#[derive(BorshSerialize, BorshDeserialize, PartialEq, Debug)]
struct B {
    x: [u8; 20],
    y: [u8; 100],
    z: String,
}

fn test_simple_struct() {
    let b = B {
        x: [0; 20],
        y: [0; 100],
        z: "liber primus".to_string(),
    };
    let encoded_b = b.try_to_vec().unwrap();
    let decoded_b = B::try_from_slice(&encoded_b).unwrap();
    assert_eq!(b, decoded_b);
}

fn main() {
    test_simple_struct();
}
~/BORSH/borsh-test/src$ cargo run
   Compiling borsh-test v0.1.0 (/Users/mrsmith/BORSH/borsh-test)
error[E0277]: the trait bound `[u8; 100]: borsh::BorshDeserialize` is not satisfied
 --> src/main.rs:4:26
  |
4 | #[derive(BorshSerialize, BorshDeserialize, PartialEq, Debug)]
  |                          ^^^^^^^^^^^^^^^^ the trait `borsh::BorshDeserialize` is not implemented for `[u8; 100]`
  |
  = help: the following implementations were found:
            <[T; 0] as borsh::BorshDeserialize>
            <[T; 1024] as borsh::BorshDeserialize>
            <[T; 10] as borsh::BorshDeserialize>
            <[T; 11] as borsh::BorshDeserialize>
          and 36 others
  = help: see issue #48214
  = note: this error originates in a derive macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0277]: the trait bound `[u8; 100]: borsh::BorshSerialize` is not satisfied
 --> src/main.rs:4:10
  |
4 | #[derive(BorshSerialize, BorshDeserialize, PartialEq, Debug)]
  |          ^^^^^^^^^^^^^^ the trait `borsh::BorshSerialize` is not implemented for `[u8; 100]`
  |
  = help: the following implementations were found:
            <[T; 0] as borsh::BorshSerialize>
            <[T; 1024] as borsh::BorshSerialize>
            <[T; 10] as borsh::BorshSerialize>
            <[T; 11] as borsh::BorshSerialize>
          and 37 others
  = help: see issue #48214
  = note: this error originates in a derive macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0277`.
error: could not compile `borsh-test`.

To learn more, run the command again with --verbose.

Arrays: serialization does not match schema

I think this line:

impl_arrays!(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 32 64 65);

Should be changed to be the same as this line:

impl_arrays!(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 64 65 128 256 512 1024 2048);

but with the addition of '0' as the first element, as per the schema line.

Derive failure for `&[T]`

If I have the following structs

#[derive(BorshSerialize)]
struct A {}
#[derive(BorshSerialize)]
struct B<'a> {
  a: &'a [A]
}

The derive for B will fail even though we have implementations for [T] and &T when T implements BorshSerialize. It fails with "size for value cannot be known at compile time". If I change [A] to Vec<A>, it works.

Write borsh spec suite

Write a full test suite and supplementary borsh spec documentation. All borsh implementations should pass this suite.

Serde compatibility

Hi, I'm investigating using the Near network for a project I'm working on. Looking through the examples on smart contracts and seeing Borsh, it looks like a really cool serialization format. I'm a bit curious if it plays nicely with Serde-based structs at all.

My use-cases are for using things like chrono::Datetime and url::Url which come with serde implementations. I suppose I could wrap these in newtypes and implement Borsh by hand, but I think it would be much easier if Borsh could work on top of serde (as well as having its own derive macros). This would make it easier to use the format with existing libraries and projects. I understand that Borsh layers some new features on top of its own implementation so obviously those would not be available in a serde-driven version.

I'm curious if this is a possibility for the project's near future. Thank you!

Enforce code formatting in CI

Ideally, there would be

  • clippy and rustfmt for Rust
  • eslint (with typescript plugin) for [tj]s

and that'd be checked in CI

Deserialization fails if the slice is longer than the serialized data

The following check assumes the length of the serialized slice equals the length of the serialized data: https://github.com/nearprotocol/borsh/blob/c5693fcb8af4636878fa13e8fc622953cf9b4e1e/borsh-rs/borsh/src/de/mod.rs#L15

There are cases where the serialized data may only occupy the first x bytes of a slice (fixed-size data packets, for example). In these cases, deserialization will fail, and it's impossible for the receiver to know what size to prune the slice to (how much of the slice contains serialized data). For comparison, bincode allows passing slices that are larger than the serialized data.

Can this restriction be lifted?
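One possible shape for a relaxed API is to return the number of bytes consumed alongside the value, so the caller can ignore trailing bytes (hypothetical sketch with a plain u32 payload; this is not borsh's actual API):

```rust
// Deserialize from the front of a slice, tolerating trailing bytes, and
// report how many bytes were consumed.
fn try_from_slice_prefix(input: &[u8]) -> Result<(u32, usize), String> {
    if input.len() < 4 {
        return Err("not enough bytes".to_string());
    }
    let value = u32::from_le_bytes([input[0], input[1], input[2], input[3]]);
    Ok((value, 4)) // value plus the number of bytes consumed
}
```

A fixed-size packet with padding then decodes cleanly, and callers who want the strict behavior can still check that consumed == input.len().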
