
borsh's People

Contributors

ailisp, alecmocatta, alexfilatov, bowenwang1996, dhruvja, etodanik, evgenykuzyakov, frol, gagdiez, ilblackdragon, k06a, lexfrl, magicrb, maksymzavershynskyi, marcus-pousette, mikedotexe, mikhailok, nhynes, shadeglare, snjax, stolkerve, vgrichina, volovyks, zicklag


borsh's Issues

Deserialization security

We should add a suite of tests for deserialization security, specifically:

  • Pass an invalid enum value
  • Pass a MAX INT length for arrays / strings / hashmaps (cc @frol; the memory allocation issues are preventable by a simple check: if len > buf.len() { return Err() })
  • Pass a non-UTF-8 string
  • Omit part of the message
  • Append extra bytes after the message
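The length check mentioned above could look roughly like this (a minimal sketch with a hypothetical helper name, assuming a little-endian u32 length prefix):

```rust
// Sketch of the "len > buf.len()" guard; helper name is illustrative,
// not part of borsh's actual API.
fn read_len_checked(buf: &[u8]) -> Result<(usize, &[u8]), String> {
    if buf.len() < 4 {
        return Err("missing length prefix".to_string());
    }
    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let rest = &buf[4..];
    // A declared length larger than the remaining input can never be valid,
    // so we reject it before allocating anything.
    if len > rest.len() {
        return Err("declared length exceeds input".to_string());
    }
    Ok((len, rest))
}
```

This rejects a pathological MAX INT length immediately instead of attempting a huge allocation.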

Code coverage

borsh-derive-internals is designed for testability. It'd be great to see how well the tests are doing by including code coverage (e.g. using tarpaulin + codecov/coveralls).

UTF-8 Consistency

As suggested by @vgrichina, we need to make sure the Rust and JS implementations fail and succeed on exactly the same UTF-8 sequences.

Also, we need to make sure the specification explains that implementations must reject invalid UTF-8 to allow for deterministic round-trips.
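On the Rust side, strict validation is essentially what String::from_utf8 already performs; a deserializer that simply propagates its error (a sketch, not borsh's actual code) gets the deterministic round-trip for free:

```rust
// Sketch: rejecting invalid UTF-8 at deserialization time. String::from_utf8
// validates the bytes and returns an error for any illegal sequence.
fn deserialize_string(bytes: Vec<u8>) -> Result<String, String> {
    String::from_utf8(bytes).map_err(|e| format!("invalid UTF-8: {}", e))
}
```

The JS implementation would then need to fail on exactly the set of byte sequences this rejects.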

Set up testing with miri to catch undefined behavior bugs

Since Borsh is heavily focused on security, we should use all the available tooling to ensure that we catch as many corner cases as possible.

Miri is an interpreter for Rust's mid-level intermediate representation.

Using Miri is as simple as cargo miri test, but there are a few quirks:

  • Miri is only available with the nightly toolchain (not a problem, just saying)
  • Miri does not support workspaces, so we need to run it against specific crates (not a problem either)
  • The compilation step with Miri is quite RAM-hungry: I could not compile the Borsh tests even with 23 GB available (16 GB RAM + 7 GB swap)

Thoughts on borrowing from the input

Is it within the purview of this library to support borrowing byte slices from the deserialization input? Zero-copy would be nice, but it would make codegen ever so slightly more complicated. What is borsh-rs's stance on features vs. size creep?

Any chance of supporting Golang?

I'm looking for something to replace bincode. It would be great if borsh had first-class Go support, since that is my only blocker! Thanks

Add `#[borsh_optional]` marker for backwards compatibility

Motivation

Suppose we have rust structure:

#[derive(BorshSerialize, BorshDeserialize)]
struct A {
  f1: T1,
  f2: T2
}

Suppose we have serialized into some data (e.g. on disk in rocksdb, in contract state, or circulating in network). Then we want to upgrade this structure by adding another field:

#[derive(BorshSerialize, BorshDeserialize)]
struct A {
  f1: T1,
  f2: T2,
  f3: T3
}

It would be extremely convenient for upgradability if we could deserialize old data using new Rust type.

Proposal

We can introduce #[borsh_optional] decorator that can be used like this:

#[derive(BorshSerialize, BorshDeserialize)]
struct A {
  f1: T1,
  f2: T2,
  #[borsh_optional]
  f3: Option<T3>
}

Then, when we deserialize old data into this structure, f3 will be None; when we deserialize new data, it will be Some.

It will only work if the optional fields come last:

#[derive(BorshSerialize, BorshDeserialize)]
struct A {
  f1: T1,
  f2: T2,
  #[borsh_optional]
  f3: Option<T3>,
  #[borsh_optional]
  f4: Option<T4>,
  #[borsh_optional]
  f5: Option<T5>
}

And compilation should fail in situations like the following:

#[derive(BorshSerialize, BorshDeserialize)]
struct A {
  f1: T1,
  f2: T2,
  #[borsh_optional]
  f3: Option<T3>,
  f4: Option<T4>,
  #[borsh_optional]
  f5: Option<T5>
}

CC @mfornet Since it might be relevant to near/NEPs#95
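The intended decoding behavior can be sketched as a hand-written equivalent of what the derive might generate (hypothetical code, with concrete u32 fields standing in for T1/T2/T3):

```rust
use std::io::{Error, ErrorKind, Read, Result};

// Little-endian u32 reader, the shape borsh uses for integers.
fn read_u32(reader: &mut impl Read) -> Result<u32> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}

struct A {
    f1: u32,
    f2: u32,
    f3: Option<u32>, // conceptually marked #[borsh_optional]
}

fn deserialize_a(mut input: &[u8]) -> Result<A> {
    let f1 = read_u32(&mut input)?;
    let f2 = read_u32(&mut input)?;
    // Old data simply ends here; new data carries the Option's tag byte.
    let f3 = if input.is_empty() {
        None
    } else {
        let mut tag = [0u8; 1];
        input.read_exact(&mut tag)?;
        match tag[0] {
            0 => None,
            1 => Some(read_u32(&mut input)?),
            _ => return Err(Error::new(ErrorKind::InvalidData, "invalid Option tag")),
        }
    };
    Ok(A { f1, f2, f3 })
}
```

Old serialized data (8 bytes) decodes with f3 = None, while new data with the extra tagged field decodes with f3 = Some(..).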

[Discussion] General workflow (common in all borsh implementations) of borsh

I propose a common set of operations that should be implemented by all borsh implementations across programming languages. Some implementations, like Rust's (which has macros), can offer additional features such as #[derive(BorshSerialize)].

From the user's perspective, borsh would be used this way:

  • Write a borsh schema definition that is common across languages. Currently it's in JSON, but JSON isn't very human-friendly to write, so we may consider YAML, TOML, or a Rust-like type-definition DSL. These are all equivalent and can be trivially converted to each other. This defines the types users want to serialize/deserialize.

  • borsh should be able to generate the following from a schema:

    • language-native type definitions (classes for Python/JS, structs for C, Go, Rust, Solidity)
    • functions/methods to serialize and deserialize these types
  • People then use the generated source code.

With this schema-based approach, each language's borsh implementation consists of:

  • (optional) conversion between the JSON schema and other formats of the borsh definition
  • a CLI to generate source code from a schema definition
  • small utility functions to serialize/deserialize certain primitive types: strings, vecs, unsigned integers, etc.
  • (for dynamically typed languages) a dynamic version of the ser/deser functions that takes a schema and bytes/an object and returns the deserialized object/serialized bytes

@nearmax WDYT? Is this how the borsh schema is supposed to work?

Investigate safety of trait objects

Make sure Borsh either does not work with trait objects at all (because we don't know the type that we need to deserialize into), or, if it does work, that it works correctly.

Rewrite `BorshSchema` to use const functions

Currently BorshSchema has static but not constant methods. When compiled to Wasm, these methods occupy significant space. They also create significant execution overhead when self-described borsh serialization/deserialization is invoked via https://docs.rs/borsh/0.6.2/borsh/schema_helpers/index.html

To fix this we need to implement a const version of BorshSchema::schema_container(). Unfortunately, that means two things:

  • We can only use types allowed by the const context;
  • schema_container() cannot return a type that requires allocation.

We currently intend to serialize BorshSchemaContainer using either borsh or JSON. Therefore we can have two versions of schema_container():

  • schema_container_borsh() -> &[u8];
  • schema_container_json() -> &str;

Both return the container in already-serialized form. We can then allow reconstructing BorshSchemaContainer from it, if necessary.
schema_container_json would internally define const variables for each type and recursively concatenate them using std::concat. As a result, each type will have a compile-time computed schema. A similar technique can be used with byte slices for schema_container_borsh.

This will improve performance in the following way:

  • If we want to serialize a self-described borsh type using https://docs.rs/borsh/0.6.2/borsh/schema_helpers/index.html, the helper will use schema_container_borsh to prepend an already-generated sequence of bytes in front of the borsh-serialized object. Upon deserialization of that object, the helper will check that the schema in the self-described type matches the schema of the type it deserializes into.
  • Wasm will have hardcoded schemas instead of the code that generates them.

Single byte copies slowing things down

Disclaimer: The following performance tests were done using Rust on BPF which is under development.

I noticed that upgrading Borsh from 2.4 to 2.5 causes a large increase in the number of BPF instructions it takes to serialize and deserialize byte arrays: from a few thousand instructions to 20k+ for a 32-byte array. The performance went from beating bincode to being far worse than it.

It looks like the single-byte copies introduced in this PR are the culprit: #20

Instead of copying the entire array with a single extend_from_slice, extend_from_slice is performed once per byte. It's possible that other rustc targets optimize this better than the BPF backend does.
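For illustration, the two strategies side by side (a minimal sketch, not borsh's actual serializer code):

```rust
// Bulk copy: one extend_from_slice over the whole array, which can lower
// to a single memcpy-style operation.
fn serialize_bytes_bulk(out: &mut Vec<u8>, data: &[u8]) {
    out.extend_from_slice(data);
}

// Per-byte copy: one extend_from_slice call per element, which is what the
// regression amounts to on targets that fail to optimize the loop away.
fn serialize_bytes_per_byte(out: &mut Vec<u8>, data: &[u8]) {
    for b in data {
        out.extend_from_slice(&[*b]);
    }
}
```

Both produce identical output; only the number of copy operations differs.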

Implement a customizable borsh_serialize in derive macros

Currently we can't do much to affect how derive(BorshSerialize, BorshDeserialize) works besides borsh_skip; however, one more capability would be very useful. Consider this big structure in wasmer:

pub struct ModuleInfo {
    pub memories: Map<LocalMemoryIndex, MemoryDescriptor>,
    pub globals: Map<LocalGlobalIndex, GlobalInit>,
    pub tables: Map<LocalTableIndex, TableDescriptor>,
    pub imported_functions: Map<ImportedFuncIndex, ImportName>,
    pub imported_memories: Map<ImportedMemoryIndex, (ImportName, MemoryDescriptor)>,
    pub imported_tables: Map<ImportedTableIndex, (ImportName, TableDescriptor)>,
    pub imported_globals: Map<ImportedGlobalIndex, (ImportName, GlobalDescriptor)>,
    pub exports: IndexMap<String, ExportIndex>,
    pub data_initializers: Vec<DataInitializer>,
    pub elem_initializers: Vec<TableInitializer>,
    pub start_func: Option<FuncIndex>,
    pub func_assoc: Map<FuncIndex, SigIndex>,
    pub signatures: Map<SigIndex, FuncSig>,
    pub backend: String,
    pub namespace_table: StringTable<NamespaceIndex>,
    pub name_table: StringTable<NameIndex>,
    pub em_symbol_map: Option<HashMap<u32, String>>,
    pub custom_sections: HashMap<String, Vec<Vec<u8>>>,
    pub generate_debug_info: bool,
    #[borsh_skip]
    pub(crate) debug_info_manager: jit_debug::JitCodeDebugInfoManager,
}

Every field in this giant struct can derive BorshSerialize and BorshDeserialize except one: IndexMap<String, ExportIndex>. IndexMap is not a type defined in this crate, nor in std or borsh, so you cannot impl BorshSerialize for IndexMap; and because of this one field, you cannot derive BorshSerialize for the giant struct. There are two workarounds:

  1. Change pub exports: IndexMap<String, ExportIndex> to pub exports: ExportsMap and define a struct ExportsMap enclosing IndexMap, so you can impl BorshSerialize/BorshDeserialize on ExportsMap and make ModuleInfo Borsh-derivable. But this turns every reference to exports into exports.inner or exports.0.
  2. Implement BorshSerialize/BorshDeserialize on ModuleInfo manually.

Either one is a big inconvenience or causes structural hacks. So I propose borsh_serializer/borsh_deserializer attributes:

fn borsh_serialize_index_map<K: BorshSerialize, V: BorshSerialize, W: Write>(index_map: &IndexMap<K, V>, writer: &mut W) -> std::io::Result<()> {
    ...
}

#[borsh_serializer(borsh_serialize_index_map)]
#[borsh_deserializer(borsh_deserialize_index_map)]
pub exports: IndexMap<String, ExportIndex>,

With the help of these attributes, users can specify a custom borsh serializer/deserializer for a field of a struct, making the whole struct borsh-derivable.
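A sketch of what the generated code could look like for one annotated field (the helper, the expansion, and the String field are all hypothetical stand-ins; a real IndexMap field would work the same way):

```rust
use std::io::{Result, Write};

// User-provided serializer for a foreign type the derive cannot handle.
// Here a length-prefixed string stands in for the foreign type.
fn serialize_with_len<W: Write>(value: &str, writer: &mut W) -> Result<()> {
    writer.write_all(&(value.len() as u32).to_le_bytes())?;
    writer.write_all(value.as_bytes())
}

struct Wrapper {
    name: String, // imagine: #[borsh_serializer(serialize_with_len)]
}

impl Wrapper {
    // Instead of emitting self.name.serialize(writer), the derive would
    // emit a call to the user-provided function for the annotated field.
    fn serialize<W: Write>(&self, writer: &mut W) -> Result<()> {
        serialize_with_len(&self.name, writer)
    }
}
```

The derive only swaps the call site for the annotated field; all other fields keep their normal BorshSerialize path.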

Implement borsh-c

Besides #84, borsh-c should generate a header file and serialization/deserialization C source code given an existing schema file. The resulting C source code can then be compiled into a C project. Note that borsh-c itself does not necessarily need to be implemented in C; implementing it in Rust would probably be faster.

Optimize borsh deserialization of Vec<T>

Vec seems to take too much gas. An easy optimization would be a Vec implementation with buffered reads, similar to strings, but it's unclear how to handle non-fixed-size element types, e.g. Vec
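For the fixed-size case, the buffered-read idea is straightforward (illustrative sketch, not borsh's actual code; assumes the length has already been validated against the input size):

```rust
use std::io::{Read, Result};

// Read a whole Vec<u8> payload with one read_exact call instead of one
// read per element.
fn read_byte_vec(reader: &mut impl Read, len: usize) -> Result<Vec<u8>> {
    let mut data = vec![0u8; len];
    reader.read_exact(&mut data)?; // single bulk read
    Ok(data)
}
```

Non-fixed-size element types cannot be read this way because the byte width of each element is only known after decoding it, which is exactly the open question above.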

Schema generation for JS and TS, and other languages

Borsh is not a self-descriptive format, so some languages like JS need to either generate a schema (which is currently consumed by borsh-js) or generate the full class implementation in JS.

We could take the following approach: write a cargo extension that provides a command like cargo generate-borsh js, which generates JS classes from Rust types by walking over the crate, looking for types decorated with #[derive(BorshSerialize, BorshDeserialize)], and dumping the generated JS analog while preserving the directory structure. We need to decide what we want to do with types that implement BorshSerialize/BorshDeserialize explicitly; I suggest we skip them, delegating the JS code to the user. We can use https://crates.io/crates/syn for that.

Another question to discuss, CC @ilblackdragon @vgrichina: what is the advantage of generating a schema that is later consumed by BinaryReader over generating the full class implementation in JS? That it is more human-readable? Is there any performance disadvantage?

Fix BorshDeserialize for bool and Option

BorshDeserialize for bool and Option accepts arbitrary values where it should only allow 0 or 1.
This makes it possible for an object to have multiple representations, which can potentially enable attacks on our usage of borsh.
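A strict decoder would reject anything other than 0 or 1, so each bool has exactly one byte representation (a sketch, not the actual borsh internals):

```rust
// Consume one byte from the input and require it to be exactly 0 or 1.
fn deserialize_bool(input: &mut &[u8]) -> Result<bool, String> {
    let (first, rest) = input.split_first().ok_or("unexpected end of input")?;
    *input = rest; // advance past the consumed byte
    match *first {
        0 => Ok(false),
        1 => Ok(true),
        b => Err(format!("invalid bool byte: {}", b)),
    }
}
```

The same tag check applies to Option's discriminant byte.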

Capacity allocation for vector deserialization

Currently, because we don't know the size of the buffer inside the deserialize method, we can't detect when the declared length is way too large and error out (right now it would segfault due to a memory allocation error).

See test_invalid_length for example.
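One mitigation when the total input size is unknown is to cap the initial allocation and let the vector grow as elements are actually read (sketch; the helper name and the 1 MiB cap are arbitrary assumptions, not borsh's behavior):

```rust
// Allocate at most a bounded amount of memory up front, regardless of the
// length declared in the (possibly malicious) input.
fn vec_with_safe_capacity<T>(declared_len: usize) -> Vec<T> {
    const MAX_PREALLOC_BYTES: usize = 1 << 20; // 1 MiB cap, an assumption
    let elem_size = std::mem::size_of::<T>().max(1); // guard against ZSTs
    let cap = declared_len.min(MAX_PREALLOC_BYTES / elem_size);
    Vec::with_capacity(cap)
}
```

A truncated or hostile input then fails with a normal "unexpected end of input" error during element reads, instead of aborting on a giant allocation.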

Fuzz testing

We need to write the following fuzz tests for borsh:

A) Generate a random type. Create an object of that type filled with random data. Then serialize and deserialize it, and check that the structures before and after are the same;
B) Generate a random type. Create an object of that type filled with random data. Serialize it. Randomly flip a subset of bits in the serialized structure. Try deserializing it and assert that it does not panic, but instead either deserializes or returns an error.

The two difficult things to implement would be:

  • Generating a random type;
  • Creating an object of the type filled with random data.

As an option, I suggest we do both using procedural macros. We can have a macro random_type!(Name, X, Y, seed) that generates a token stream corresponding to a declaration of some type Name using https://doc.rust-lang.org/reference/procedural-macros.html#function-like-procedural-macros where X would be the max depth (e.g. if we have nested structures) and Y is the max width of each node (e.g. max number of fields in a struct or max number of variants in an enum).

Each type would also be decorated with #[derive(RandomInit)] which implements trait

trait RandomInit {
    fn random_init() -> Self;
}

for the type, just like we do with serializers. We then would implement RandomInit for basic types and collections, just like we do with serializers.

Then our test would be something like:

random_type!(T0, 1, 1, 42);
...
random_type!(T42, 10, 12, 42);

#[test]
fn test0() {
  for _ in 0..100 {
   let t0 = T0::random_init();
   let out_t0: T0 = try_from_slice(&t0.try_to_vec().unwrap()).unwrap();
   assert_eq!(t0, out_t0);
  }
}

Note: we should also look at the fuzzing tools that Sigma Prime wrote for borsh; we might not need to write them ourselves.
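A self-contained sketch of the RandomInit idea, here threading an explicit seed so runs are reproducible (the trait shape differs slightly from the proposal above, which passes the seed via the macro; the generator is a toy):

```rust
// Deterministic random initialization for test values.
trait RandomInit {
    fn random_init(seed: &mut u64) -> Self;
}

// One xorshift64 step: tiny, deterministic, dependency-free. Not suitable
// for anything but generating test data.
fn next(seed: &mut u64) -> u64 {
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    *seed
}

impl RandomInit for u32 {
    fn random_init(seed: &mut u64) -> Self {
        next(seed) as u32
    }
}

// Collections build on the element impls, mirroring how the serializers
// are layered.
impl<T: RandomInit> RandomInit for Vec<T> {
    fn random_init(seed: &mut u64) -> Self {
        let len = (next(seed) % 8) as usize;
        (0..len).map(|_| T::random_init(seed)).collect()
    }
}
```

With a fixed seed, the round-trip test generates the same objects on every run, which makes failures reproducible.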

Byte Arrays

use borsh::{BorshSerialize, BorshDeserialize};

#[derive(BorshSerialize, BorshDeserialize, PartialEq, Debug)]
struct B {
    x: [u8; 20],
    y: [u8; 100],
    z: String,
}

fn test_simple_struct() {
    let b = B {
        x: [0; 20],
        y: [0; 100],
        z: "liber primus".to_string(),
    };
    let encoded_b = b.try_to_vec().unwrap();
    let decoded_b = B::try_from_slice(&encoded_b).unwrap();
    assert_eq!(b, decoded_b);
}

fn main() {
    test_simple_struct();
}
~/BORSH/borsh-test/src$ cargo run
   Compiling borsh-test v0.1.0 (/Users/mrsmith/BORSH/borsh-test)
error[E0277]: the trait bound `[u8; 100]: borsh::BorshDeserialize` is not satisfied
 --> src/main.rs:4:26
  |
4 | #[derive(BorshSerialize, BorshDeserialize, PartialEq, Debug)]
  |                          ^^^^^^^^^^^^^^^^ the trait `borsh::BorshDeserialize` is not implemented for `[u8; 100]`
  |
  = help: the following implementations were found:
            <[T; 0] as borsh::BorshDeserialize>
            <[T; 1024] as borsh::BorshDeserialize>
            <[T; 10] as borsh::BorshDeserialize>
            <[T; 11] as borsh::BorshDeserialize>
          and 36 others
  = help: see issue #48214
  = note: this error originates in a derive macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0277]: the trait bound `[u8; 100]: borsh::BorshSerialize` is not satisfied
 --> src/main.rs:4:10
  |
4 | #[derive(BorshSerialize, BorshDeserialize, PartialEq, Debug)]
  |          ^^^^^^^^^^^^^^ the trait `borsh::BorshSerialize` is not implemented for `[u8; 100]`
  |
  = help: the following implementations were found:
            <[T; 0] as borsh::BorshSerialize>
            <[T; 1024] as borsh::BorshSerialize>
            <[T; 10] as borsh::BorshSerialize>
            <[T; 11] as borsh::BorshSerialize>
          and 37 others
  = help: see issue #48214
  = note: this error originates in a derive macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0277`.
error: could not compile `borsh-test`.

To learn more, run the command again with --verbose.

Arrays: serialization does not match schema

I think this line:

impl_arrays!(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 32 64 65);

Should be changed to be the same as this line:

impl_arrays!(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 64 65 128 256 512 1024 2048);

but with the addition of '0' as the first element, as per the schema line.

Derive failure for `&[T]`

If I have the following structs

#[derive(BorshSerialize)]
struct A {}
#[derive(BorshSerialize)]
struct B<'a> {
  a: &'a [A]
}

The derive for B will fail even though we have implementations for [T] and &T when T implements BorshSerialize. It fails with "size for value cannot be known at compile time". If I change [A] to Vec<A>, it works.

Write borsh spec suite

Write a full test suite and supplementary borsh spec documentation. All borsh implementations should pass this suite.

Serde compatibility

Hi, I'm investigating using the Near network for a project I'm working on. Looking through the examples on smart contracts and seeing Borsh, it looks like a really cool serialization format. I'm a bit curious if it plays nicely with Serde-based structs at all.

My use-cases are for using things like chrono::Datetime and url::Url which come with serde implementations. I suppose I could wrap these in newtypes and implement Borsh by hand, but I think it would be much easier if Borsh could work on top of serde (as well as having its own derive macros). This would make it easier to use the format with existing libraries and projects. I understand that Borsh layers some new features on top of its own implementation so obviously those would not be available in a serde-driven version.

I'm curious if this is a possibility for the project's near future. Thank you!

Enforce code formatting in CI

Ideally, there would be

  • clippy and rustfmt for Rust
  • eslint (with typescript plugin) for [tj]s

and that'd be checked in CI

Deserialization fails if the slice is longer than the serialized data

The following check assumes the length of the serialized slice equals the length of the serialized data: https://github.com/nearprotocol/borsh/blob/c5693fcb8af4636878fa13e8fc622953cf9b4e1e/borsh-rs/borsh/src/de/mod.rs#L15

There are cases where the serialized data may only occupy the first x bytes of a slice (fixed-size data packets, for example). In these cases, deserialization will fail, and it's impossible for the receiver to know what size to prune the slice to (how much of the slice contains serialized data). For comparison, bincode allows passing slices that are larger than the serialized data.

Can this restriction be lifted?
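One possible shape for a relaxed API is to return the number of bytes consumed alongside the value, so the caller can ignore trailing bytes (hypothetical sketch with a plain u32 payload; this is not borsh's actual API):

```rust
// Deserialize from the front of a slice, tolerating trailing bytes, and
// report how many bytes were consumed.
fn try_from_slice_prefix(input: &[u8]) -> Result<(u32, usize), String> {
    if input.len() < 4 {
        return Err("not enough bytes".to_string());
    }
    let value = u32::from_le_bytes([input[0], input[1], input[2], input[3]]);
    Ok((value, 4)) // value plus the number of bytes consumed
}
```

A fixed-size packet with padding then decodes cleanly, and callers who want the strict behavior can still check that consumed == input.len().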
