
Bytes

Work with densely packed sequences of bytes.

The goal of this package is to support network protocols such as ProtoBuf. Or to put it another way, the goal is to have packages like elm/http send fewer bytes over the wire.

The overall plan is described in A vision for data interchange in Elm. Please read it!

Example

This package lets you create encoders and decoders for working with sequences of bytes. Here is an example for converting between Point and Bytes values:

import Bytes exposing (Endianness(..))
import Bytes.Encode as Encode exposing (Encoder)
import Bytes.Decode as Decode exposing (Decoder)


-- POINT

type alias Point =
  { x : Float
  , y : Float
  , z : Float
  }

toPointEncoder : Point -> Encoder
toPointEncoder point =
  Encode.sequence
    [ Encode.float32 BE point.x
    , Encode.float32 BE point.y
    , Encode.float32 BE point.z
    ]

pointDecoder : Decoder Point
pointDecoder =
  Decode.map3 Point
    (Decode.float32 BE)
    (Decode.float32 BE)
    (Decode.float32 BE)

Rather than writing this by hand in client or server code, the hope is that folks implement things like ProtoBuf compilers for Elm.

Again, the overall plan is described in A vision for data interchange in Elm!

Scope

This API is not intended to work like Int8Array or Uint16Array in JavaScript. If you have a concrete scenario in which you want to interpret bytes as densely packed arrays of integers or floats, please describe it on https://discourse.elm-lang.org/ in a friendly and high-level way. What is the project about? What do densely packed arrays do for that project? Is it about perf? What kind of algorithms are you using? Etc.

If some scenarios require the mutation of entries in place, special care will be required in designing a nice API. All values in Elm are immutable, so the particular API that works well for us will probably depend a lot on the particulars of what folks are trying to do.

bytes's People

Contributors

danmarcab, drathier, eriktim, evancz, owanturist

bytes's Issues

Issue with Decode.bytes: references the previous DataView buffer (instead of copying it)

In certain situations, Decode.loop uses the state from the previous run.
Update: The issue seems to be with Decode.bytes (thanks @eriktim).


Code that shows when it doesn't work:
https://ellie-app.com/45kPVqsrmPta1 (see logs)
Chunks are equal when they shouldn't be.

Code that shows the decoder actually does work:
https://ellie-app.com/45mTT24dtvMa1 (see logs)
Chunks are different, as it should be.

The difference between these examples is that the first one defines an additional decoder that uses Decode.loop (through the buildChunksDecoder function).


Simplified example:

-- Decoders to split Bytes in chunks
decoder512 =
    buildChunksDecoder (512 // 8)

decoder32 =
    buildChunksDecoder (32 // 8)

-- Create chunks
chunks =
    "61626364656667686263646566676869636465666768696A6465666768696A6B65666768696A6B6C666768696A6B6C6D6768696A6B6C6D6E68696A6B6C6D6E6F696A6B6C6D6E6F706A6B6C6D6E6F70716B6C6D6E6F7071726C6D6E6F707172736D6E6F70717273746E6F70717273747580000000000000000000000000000380"
        |> Hex.toBytes
        |> Maybe.andThen (\bytes -> Decode.decode (decoder512 bytes) bytes)
        |> Maybe.map (List.map (\block -> Decode.decode (decoder32 block) block))
        |> Maybe.andThen Maybe.combine
        |> Maybe.map (List.map (List.map Hex.fromBytes))
         
-- Proof that it doesn't work:
-- THESE SHOULD NOT BE EQUAL
fst = Debug.log "First chunk" (List.getAt 0 chunks)
snd = Debug.log "Second chunk" (List.getAt 1 chunks)



{-| Chunky decoder builder.

    Divides `Bytes` into chunks of x bytes.

-}
buildChunksDecoder : Int -> Bytes -> Decoder (List Bytes)
buildChunksDecoder bytesInChunk bytes =
    Decode.loop
        { chunksLeft = ceiling (toFloat (Bytes.width bytes) / toFloat bytesInChunk)
        , chunks = []
        }
        (\state ->
            if state.chunksLeft <= 0 then
                Decode.succeed (Done state.chunks)

            else
                Decode.map
                    (\chunk ->
                        Loop
                            { chunksLeft = state.chunksLeft - 1
                            , chunks = List.append state.chunks [ chunk ]
                            }
                    )
                    (Decode.bytes bytesInChunk)
        )

It's a fairly complex issue, so let me know if I need to reduce it further.
Thanks! ✌️

Support for simple serialization to more generally useful type

I was working with this library in order to implement bidirectional communication over grpc-web using a new elm protocol buffer library that makes use of this.

In order to get the bytes that the elm-protocol-buffers library created to the javascript library I used a port. But I found that taking this type from Elm to the outside world was sort of difficult.

In the end I used loop and unsignedInt8 to convert it from Bytes.Bytes to List Int so I could send it through the port, at which point I had some thoughts:

  1. Given the state of Elm 0.19 with respect to kernel code, supporting more complicated protocols (gRPC) will almost certainly require handing the bytes to a port, so maybe Bytes should become one of the natively supported pass-through types listed here: https://guide.elm-lang.org/interop/flags.html
  2. Even if the above is done, it probably still needs to be easy to serialize this to JSON. Though I understand that JSON has no native bytes representation, so we'd have to pick our poison to solve that.
  3. In lieu of the above two options, maybe just add a built-in way to marshal between List Int and the Bytes type?
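As a sketch of the workaround in option 3 on the JavaScript side of a port (the port itself is omitted and all names here are hypothetical), the binary data can cross the boundary as a plain array of 0-255 integers, which Elm sees as a List Int:

```javascript
// Binary payload to hand to Elm, e.g. bytes produced by a protobuf library.
const bytes = new Uint8Array([0x08, 0x96, 0x01]);

// JS -> Elm: convert to a plain array, which arrives as a List Int.
const forElm = Array.from(bytes);

// Elm -> JS: rebuild the typed array from the List Int sent back out.
const fromElm = Uint8Array.from(forElm);
```

This round-trips losslessly, but pays the cost of boxing every byte as a number, which is exactly why native pass-through support (option 1) would be preferable.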

Appreciate any thoughts on this issue.

Decode and encode single bits

Hi @evancz

I am working on a PNG decoding/encoding library and in PNG pixels can have a depth of 1, 2, 4, 8 and 16 bits.
Would it be possible to encode and decode single bits?

I see the case where a scanline has a length in bits that cannot be decoded as bytes.

Comparing two `Bytes` values always results in `True`, regardless of the actual bytes

> import Bytes.Encode
> b0 = [] |> List.map Bytes.Encode.unsignedInt8 |> Bytes.Encode.sequence |> Bytes.Encode.encode
<0 bytes> : Bytes.Bytes
> b1 = [ 125, 211, 143, 67, 78, 89, 125, 24, 100, 73, 61, 190, 172, 133, 160, 82, 150, 234, 82, 197, 97, 146, 67, 85, 53, 203, 134, 236, 168, 180, 179, 239 ] |> List.map Bytes.Encode.unsignedInt8 |> Bytes.Encode.sequence |> Bytes.Encode.encode
<32 bytes> : Bytes.Bytes
> (b0 == b1)
True : Bool

I'm not sure if it was ever intended to be able to compare two Bytes values, but the result of such comparison is very surprising.
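A plausible (unconfirmed) explanation: Bytes is backed by a DataView, and a structural equality check that walks an object's own enumerable properties finds nothing at all to compare on a DataView, so the comparison succeeds vacuously. A small JavaScript sketch of that hypothesis:

```javascript
// Two DataViews of very different contents, mirroring b0 and b1 above.
const b0 = new DataView(new ArrayBuffer(0));
const b1 = new DataView(new ArrayBuffer(32));

// byteLength, buffer, etc. live on DataView.prototype, not on the instance,
// so the instance exposes no own enumerable properties:
const ownProps = Object.keys(b0);

// A naive structural comparison over own properties therefore has nothing
// to check and reports "equal" for any pair of DataViews:
const naiveEq = Object.keys(b0).every(k => b0[k] === b1[k]);
```

If this is indeed the mechanism, a fix would need equality to special-case DataView and compare byte contents explicitly.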

Decoding long strings can lead to large heap footprint

The issue occurs when …

  1. Using Decode.string on bytes for longer expected strings will result in JavaScript engines (V8 in this case) utilizing an internal string concatenation performance optimization.
  2. This was initially observed when utilizing the eriktim/elm-protocol-buffers package, which uses the elm/bytes decoding for strings.

The data we utilize can potentially contain many strings that are quite long. A heap snapshot in Chrome indicates at least 20 bytes of overhead for each individual character concatenated during string construction. In our case this leads to ~200 MB of heap consumed by concatenated strings.

Why does the issue occur?

V8 strings appear to internally have 2 possible representations (https://gist.github.com/mraleph/3397008):

  • flat strings are immutable arrays of characters
  • cons strings are pairs of strings, the result of a concatenation

If you concat a and b, you get a cons-string (a, b) that represents the result of the concatenation. If you later concat d to that, you get another cons-string ((a, b), d).

Indexing into such a "tree-like" string is not O(1), so to make it faster V8 flattens the string when you index into it: it copies all the characters into a flat string.

In this case, the implementation of _Bytes_read_string likely ends up producing the "cons string" type in the JS engine.

Could you show an example?

module Main exposing (main)

import Bytes exposing (..)
import Bytes.Decode as Decode
import Bytes.Encode as Encode

main : Program () () ()
main =
    let
        encodedBytes =
            "This is a longer string for testing special JS concatenated strings"
            |> Encode.string
            |> Encode.encode

        decodedBytesStr =
            encodedBytes
            |> Decode.decode (Decode.string (Bytes.width encodedBytes))
    in
    Platform.worker
        { init = \_ -> ( (), Cmd.none )
        , update = \_ _ -> ( (), Cmd.none )
        , subscriptions = \_ -> Sub.none
        }

SSCCE

Once the above code runs:

  1. Open Chrome devtools
  2. Memory tab
  3. Select "Heap snapshot" then press "Take snapshot"
  4. Filter snapshot by "(concatenated string)"
  5. Observe the many instances of the string "This is a longer string for ..."
    a. The shallow size reported for each individual character cons is 20 bytes
    b. The final retained size is 1348 bytes

How can we solve this?

According to this post (https://gist.github.com/mraleph/3397008), indexing into a cons string forces it to be flattened.

My approach would be to simply perform this once the string has been completely decoded:

var _Bytes_read_string = F3(function(len, bytes, offset)
{
	var string = '';
	var end = offset + len;
	for (; offset < end;)
	{
		~snip~
	}

	// Force JS engines to flatten the string if it's internally represented by a "cons string"
	string[0];

	return __Utils_Tuple2(offset, string);
});

Possibly for very large strings this index operation could be done inside the loop itself (perhaps every X iterations), but that adds performance tradeoffs of its own in evaluating the condition.

Note: The above index may not actually work, as some JS engines still seem to avoid flattening. Even the current approach of the npm flatstr package (https://www.npmjs.com/package/flatstr) doesn't seem to work:

string | 0

However, effectively doing a string reverse-reverse does seem to work around the cons string.

Final notes

I struggled a bit on where the solution for this issue should lie (either in elm/bytes, eriktim/elm-protocol-buffers or in our own code). I decided on opening this here, as I think a solution at the lowest level would be the most appropriate and resolve this unexpected memory consumption for other libraries/applications that may use the string decoding. I don't see a use-case for keeping an internal cons string representation once the decoding has completed.

Unclear documentation for Decode.decode

The current function documentation is:

decode : Decoder a -> Bytes -> Maybe a

The Decoder specifies exactly how this should happen. This process may fail if the sequence of bytes is corrupted or unexpected somehow. The examples above show a case where there are not enough bytes.

What does "corrupted sequence of bytes" mean here?
signedInt8 should work on any Bytes given to it. Is there a case where using it in decode yields Nothing?

If there is, I think a more explicit warning would help (and ideally some links about possible issues). If there isn't, the documentation should say that the only reason for decoding to fail is by passing the decoder a sequence of bytes that it is unable to parse.

Offset mismatch when reading bytes in a decoder

The issue occurs when …

  1. You use a decoder that uses Decode.bytes and the bytes you want don't have an offset of 0.
    (e.g. in our example, we discard the first byte)
  2. You use another decoder that uses Decode.bytes.

Note: With "using a decoder" I mean using the decoder with the decode function.

Why does the issue occur?

  • When you use the decode function, the offset is "reset" to 0.
  • The Bytes value's underlying ArrayBuffer is still the original buffer, not just the slice we want.
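The two points above can be reproduced with raw DataView objects in JavaScript (a sketch of the suspected mechanism, not the actual kernel code):

```javascript
// Two bytes, as in the Elm example: [1, 2].
const buffer = new Uint8Array([1, 2]).buffer;

// Suppose a decoder produced this view after skipping the first byte.
// It carries byteOffset = 1 into the SHARED buffer:
const skipped = new DataView(buffer, 1, 1);
const correct = skipped.getUint8(0); // 2: the byteOffset is honored

// A second decode pass that rebuilds a view from skipped.buffer starting
// at offset 0 sees the original first byte again, not the slice we wanted:
const wrong = new DataView(skipped.buffer, 0, 1).getUint8(0); // 1
```

In other words, unless the offset reset also accounts for `byteOffset` (or the buffer is copied), a nested decode silently rewinds to the start of the original buffer.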

Could you show an example?

let
  decoders =
    { skipOne       = Decode.andThen (\_ -> Decode.bytes 1) (Decode.bytes 1)
    , maybeOneUint8 = Decode.map (Decode.decode Decode.unsignedInt8) (Decode.bytes 1)
    }
in
[ Encode.unsignedInt8 1
, Encode.unsignedInt8 2
]
  |> Encode.sequence
  |> Encode.encode
  |> Decode.decode decoders.skipOne
  |> Maybe.andThen (Decode.decode decoders.maybeOneUint8)
  |> Debug.log "should be `Just (Just 2)`"

SSCCE

Which use-cases does this have?

I stumbled onto this issue when writing a SHA-2 implementation. I needed to divide my bytes into chunks of 64 bytes and then divide those chunks into smaller chunks of 4 bytes. I used multiple decode loops to do this.

How can we solve this?

Myself, and @eriktim, found two ways to solve this:

  1. Track the previous offset, which means: Use bytes.byteOffset as done in #10
  2. Make a new buffer (see below)
var _Bytes_read_bytes = F3(function(len, bytes, offset)
{
	return __Utils_Tuple2(
		offset + len,
		new DataView(bytes.buffer.slice(bytes.byteOffset), offset, len)
	);
});

Important difference between (1) and (2):

  1. Keeps a reference to the old buffer; no bytes are discarded when using decode.
  2. Makes a new buffer instance each time; no references are kept, so unused bytes are discarded when using decode.
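The memory behavior behind this difference can be sketched in plain JavaScript: ArrayBuffer.prototype.slice copies bytes into a fresh buffer, while constructing a view over the same buffer shares memory:

```javascript
const original = new Uint8Array([9, 8, 7]);

// Like fix (1): a view over the original buffer shares its memory.
const shared = new Uint8Array(original.buffer, 1, 2);

// Like fix (2): slice() copies the bytes into an independent ArrayBuffer.
const copied = new Uint8Array(original.buffer.slice(1));

// Mutating the original is visible through the shared view,
// but cannot leak into the copy:
original[1] = 0;
// shared[0] is now 0 (same memory); copied[0] is still 8 (own memory)
```

This is also why fix (2) lets the untouched portion of a large buffer be garbage-collected, at the cost of a copy per Decode.bytes call.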

Final notes

I'm not sure what the correct solution is here.
Let me know if I need to make a PR for the second case presented ☝️

Thanks!

Decoder: detect end of content

When parsing a protocol chunk by chunk, we need to stop successfully with everything we decoded so far, and keep the remaining bytes for later.
The current decoder API provides no way to do such a thing.

Two examples of what I need to do when parsing the NATS protocol in Elm Nats:

  • a decoder that reads a string up to a delimiter or the end of content. When parsing a message, if the end of line is found, the message is complete. If not, I want to get everything up to the end of the bytes and keep it for later.
  • a decoder that reads up to N bytes. This is for reading a binary payload: I know the payload size, but it may not be complete. If not, I need to keep it for later.

To work around this I need multiple decode phases and to cut the bytes in order to decode the tail, so I end up with a complicated recursive function instead of a single Decoder.

ReferenceError: Can't find variable: _Bytes_read_bytes

Looks like it may just be spelled _Bytes_read_Bytes in the kernel code.
Also, the implementation looks strange, the documentation says:

"Copy a given number of bytes into a new Bytes sequence."

But the kernel code looks like it's trying to read an initial length word, ignoring the length provided.

module Main exposing (main)

import Bytes exposing (Endianness(..))
import Bytes.Decode as BD
import Bytes.Encode as BE
import Html


main =
    Html.text (Debug.toString firstByte)


fourBytes =
    BE.encode (BE.unsignedInt32 BE 0xFF995511)


firstByte =
    BD.decode (BD.bytes 1) fourBytes

Display undecoded byte values in elm app debugger

Not sure if this is the appropriate place for it, but if you store a Bytes value on your Elm model without decoding it, the debugger (when you compile with --debug) displays it as a pair of empty curly braces. I had to check the length explicitly to verify the data was there.

This is very confusing! Maybe we can display it like a list of integers or like a hex dump or something?

I would gladly take a swing at this if someone can point me where to look in the code.

Eval order of map2 arguments when chained with andMap is the reverse of what I expected

This implementation I wrote of <*> or andMap decodes arguments in order right-to-left:

andMap : Decoder a -> Decoder (a -> b) -> Decoder b
andMap =
    map2 (|>)

It's the exact same implementation as Json.Decode.Extra.andMap, but with other imports.

This implementation behaves as I expect it to, decoding arguments left-to-right:

andMap : Decoder a -> Decoder (a -> b) -> Decoder b
andMap d d2 =
    andThen (\v -> map v d) d2

I used this for decoding custom types in applicative-style, e.g. succeed T3 |> andMap int |> andMap int |> andMap int.
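The reported order can be modeled with a tiny JavaScript sketch (hypothetical names; it only mimics the call order, under the assumption that map2 runs its first argument before its second, matching the offset threading in elm/bytes):

```javascript
// Record the order in which "decoders" run.
const order = [];
const tag = name => () => { order.push(name); return name; };

// Minimal decoder model: a decoder is a thunk producing a value.
// map2 runs its FIRST decoder argument, then its second (assumption).
const map2 = (f, dA, dB) => () => { const a = dA(); const b = dB(); return f(a, b); };
const succeed = v => () => v;

// andMap = map2 (|>): the new decoder runs BEFORE the accumulated pipeline.
const apply = (a, f) => f(a);
const andMap = (d, dFn) => map2(apply, d, dFn);

// succeed f |> andMap d1 |> andMap d2, written out:
const pipeline = andMap(tag('d2'), andMap(tag('d1'), succeed(x => y => [x, y])));
const decoded = pipeline();
// order is now ['d2', 'd1']: the last decoder in the chain ran first,
// even though the decoded values land in their written positions.
```

With real byte decoders this means `d2` would consume the earliest bytes, which is why the andThen-based andMap (which runs the accumulated pipeline first) behaves as expected.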

Elm compares different byte sequences as equal

SSCCE

module Main exposing (main)

import Browser
import Html exposing (Html, button, div, text)
import Html.Events exposing (onClick)

import Bytes exposing (Bytes)
import Bytes.Decode as Decode
import Bytes.Encode as Encode


lhs =
    Encode.encode
            (Encode.sequence [Encode.unsignedInt8 0xC0, Encode.unsignedInt8 0])
rhs =
    Encode.encode
            (Encode.sequence [Encode.unsignedInt8 0])


update : () -> () -> ()
update () () = ()


view : () -> Html never
view () =
    div []
        [ div [] [ text <| Debug.toString (lhs == rhs) ]
        ]


main : Program () () ()
main =
    Browser.sandbox
        { init = ()
        , view = view
        , update = update
        }

Should print

False

Actually prints

True

Ellie

Decoders with custom offset

Hi!

First of all: thanks a lot for bringing Elm into the world! This language kept me motivated to dig deeper and deeper into functional programming, whereas with other languages I felt overwhelmed and discouraged pretty quickly.

Short

Recently I've been writing a parser for midi files in elm, using elm/bytes. Overall it worked pretty nicely. But there was one thing I was missing: writing a custom decoder where an offset could be provided.

Issue

Parts of a midi file contain a list-like structure. Items are (potentially) compressed in a way that one byte needs to be read in order to know how to process that byte and the following ones. Depending on its most significant bit, this byte might contain some meta-information. If it doesn't, the current list item has the same meta-information as the latest item, so the meta-information is just dropped here. That implies the byte we just read should not be read as a standalone byte, but in various ways depending on the previous meta-information. This leads to a situation where some kind of lookahead would be useful. Something that goes like: OK, the byte has this structure, so keep it and read it as two nibbles, which define how to read the following bytes. Or: oh, the byte has this other structure, so just forget about it, use the most recent meta-information, and continue as normal, BUT start where the byte we just read started, because it is not the meta-info byte but already part of the data. So we are basically one byte off.

Solution (possibly)

I think in this case it would be nice to just set the offset back, so we effectively forget about the byte we just decoded. Because there is no way to reset the offset when using andThen, I carried this already-read byte around and had to provide it to the following decoders, where I had to prepend it conditionally. This made the code harder to read, understand and reuse.

In the source code of elm/bytes andThen and mapN use an offset internally. I guess exposing the data constructor Decoder (Bytes -> Int -> (Int, a)), instead of just the type constructor Decoder a would be all that is needed to be able to build custom map / andThen decoders for doing this kind of lookahead decoding.

Illustration

Maybe my explanation was a little confusing, so this code hopefully makes it easier to understand

Bytes.Decode.unsignedInt8
    |> Bytes.Decode.andThen
        (\currentPotentialStatusByte ->
            let
                isCompressed =
                    currentPotentialStatusByte < 128

                currentStatusByte =
                    if isCompressed then
                        previousStatusByte

                    else
                        currentPotentialStatusByte

                ( mEventName, channel ) =
                    statusByteToNibbles currentStatusByte

                readFirstByte =
                    if isCompressed then
                        Bytes.Decode.succeed currentPotentialStatusByte

                    else
                        Bytes.Decode.unsignedInt8
...
            readFirstByte |> Bytes.Decode.andThen preReadVariableLengthValueDecoder |> Bytes.Decode.andThen Bytes.Decode.string |> Bytes.Decode.map ((++) "System Exclusive Begin Event" >> NotYetSupportedEvent)

and then I need to carry around readFirstByte and map all the following decoders. But I think this would be nicer:

Bytes.Decode.Decoder
    (\bites offset ->
        let
            (Bytes.Decode.Decoder uint8Decode) =
                Bytes.Decode.unsignedInt8

            ( newOffset, currentPotentialStatusByte ) =
                uint8Decode bites offset

            isCompressed =
                currentPotentialStatusByte < 128

            currentStatusByte =
                if isCompressed then
                    previousStatusByte

                else
                    currentPotentialStatusByte

            ( mEventName, channel ) =
                statusByteToNibbles currentStatusByte

            nextOffset =
                if isCompressed then
                    offset

                else
                    newOffset
...
        withOffset nextOffset readVariableLengthValueDecoder |> Bytes.Decode.andThen Bytes.Decode.string |> Bytes.Decode.map ((++) "System Exclusive Begin Event" >> NotYetSupportedEvent)

Where withOffset would be something like

withOffset : Int -> Bytes.Decode.Decoder a -> Bytes.Decode.Decoder a
withOffset offset (Bytes.Decode.Decoder decode) =
    Bytes.Decode.Decoder <| \bites _ -> decode bites offset

So here I could just use readVariableLengthValueDecoder, which is also used in other places, and I would not need to create a preReadVariableLengthValueDecoder, a variant that does the same thing but either reads the first byte or doesn't, depending on the given argument.

edit: during the past days I watched a bunch of Elm talks (mostly held by Richard Feldman). I now understand much better why the Decoder type is opaque. Still, I think having a way to adjust the offset would be very nice, maybe by adding a utility function like withOffset.

Bitwise

Hello, is it planned that elm/core's Bitwise module will work with Bytes instead of Int? Or support both?

Invalid utf8 can be (wrongly) decoded into a string

Bytes.Decode.string will decode bytes that are not valid UTF-8 and produce a nonsense string. Instead, it should fail. Thanks @jhbrown94 for helping me verify this.

SSCCE

module Main exposing (main)

import Browser
import Html exposing (Html, button, div, text)
import Html.Events exposing (onClick)

import Bytes exposing (Bytes)
import Bytes.Decode as Decode
import Bytes.Encode as Encode


bytes : Bytes
bytes =
    Encode.encode
        (Encode.sequence [Encode.unsignedInt8 0xC0, Encode.unsignedInt8 0])

string : Maybe String
string =
    Decode.decode (Decode.string 2) bytes


update : () -> () -> ()
update () () = ()


view : () -> Html never
view () =
    div []
        [ div [] [ text <| Debug.toString string ]
        ]


main : Program () () ()
main =
    Browser.sandbox
        { init = ()
        , view = view
        , update = update
        }

Prints

Just "\0"

Should Print

Nothing

Ellie

Confirmation that b"\xC0\x00" is not unicode
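The invalidity of the b"\xC0\x00" sequence can be confirmed independently with the platform's own decoder, TextDecoder in strict mode: 0xC0 is an overlong-encoding lead byte (disallowed outright), and 0x00 is not a valid continuation byte anyway.

```javascript
const bad = new Uint8Array([0xC0, 0x00]);

let failed = false;
try {
  // fatal: true makes TextDecoder throw on malformed input
  // instead of substituting U+FFFD replacement characters.
  new TextDecoder('utf-8', { fatal: true }).decode(bad);
} catch (e) {
  failed = true; // TypeError: invalid byte sequence
}
```

A strict validation pass like this is essentially what Decode.string would need in order to return Nothing here instead of a nonsense string.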

Readme "someone things"

Rather than writing this by hand in client or sever code, the hope is that someone things like ProtoBuf compilers for Elm.

"Someone things like" sounds like there is something missing to me :)

And there is a typo in "sever" that should be "server".

Change example

Rather than writing this by hand in client or server code, the hope is that someone could create ProtoBuf compilers for Elm.

Decoder suppresses stack overflow

A decoder I wrote returned Nothing for large byte sequences, but the real cause was a stack overflow in a function I passed to Bytes.Decode.map.

Here's a simple example:

import Bytes.Decode as D
import Bytes.Encode as E

-- This causes a stack overflow for any input
stackOverflow x =
    case [x] of
        hd :: _ ->
            hd :: stackOverflow x
        [] ->
            []

-- This is a single byte (00000001)
byte = E.encode (E.unsignedInt8 1)

-- This decoder should raise an exception
overflowDecoder = D.map stackOverflow D.unsignedInt8
> stackOverflow 1
RangeError: Maximum call stack size exceeded

> D.decode overflowDecoder byte
Nothing
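A plausible model of why the overflow is swallowed (an assumption about the kernel, not confirmed from its source): decode runs the decoder inside a try/catch and maps any thrown error to Nothing, so a RangeError from a stack overflow is silently converted too.

```javascript
// Hypothetical sketch of a decode that swallows ALL exceptions:
const decodeLike = run => {
  try { return { just: run() }; }  // Just value
  catch (e) { return null; }       // Nothing, regardless of error type
};

// Deliberately overflow the stack, mirroring the Elm example above:
const stackOverflow = x => [x].concat(stackOverflow(x));

// The RangeError never surfaces; we just get "Nothing":
const result = decodeLike(() => stackOverflow(1));
```

Distinguishing genuine decode failures from programmer errors (e.g. re-throwing RangeError) would make bugs like this much easier to find.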
