elm/bytes
Work with bytes and implement network protocols
Home Page: https://package.elm-lang.org/packages/elm/bytes/latest/
License: BSD 3-Clause "New" or "Revised" License
Looks like it may just be spelled _Bytes_read_Bytes in the kernel code.
Also, the implementation looks strange. The documentation says:
"Copy a given number of bytes into a new Bytes sequence."
But the kernel code looks like it is trying to read an initial length word, ignoring the length provided.
module Main exposing (main)

import Bytes exposing (Endianness(..))
import Bytes.Decode as BD
import Bytes.Encode as BE
import Html

main =
    Html.text (Debug.toString firstByte)

fourBytes =
    BE.encode (BE.unsignedInt32 BE 0xFF995511)

firstByte =
    BD.decode (BD.bytes 1) fourBytes
> import Bytes.Encode
> b0 = [] |> List.map Bytes.Encode.unsignedInt8 |> Bytes.Encode.sequence |> Bytes.Encode.encode
<0 bytes> : Bytes.Bytes
> b1 = [ 125, 211, 143, 67, 78, 89, 125, 24, 100, 73, 61, 190, 172, 133, 160, 82, 150, 234, 82, 197, 97, 146, 67, 85, 53, 203, 134, 236, 168, 180, 179, 239 ] |> List.map Bytes.Encode.unsignedInt8 |> Bytes.Encode.sequence |> Bytes.Encode.encode
<32 bytes> : Bytes.Bytes
> (b0 == b1)
True : Bool
I'm not sure if it was ever intended to be possible to compare two Bytes
values, but the result of such a comparison is very surprising.
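A likely explanation (an assumption, based on how Elm's structural equality works on JS objects, not something the source confirms): a `Bytes` value is backed by a JS `DataView`, which keeps its contents in an `ArrayBuffer` rather than in own enumerable properties, so a property-walking equality check finds nothing to compare. A minimal sketch:

```javascript
// Sketch (assumption): a DataView exposes its contents only through
// methods, not own enumerable properties, so two views over completely
// different buffers can look identical to a structural property walk.
const b0 = new DataView(new ArrayBuffer(0));
const b1 = new DataView(Uint8Array.from([125, 211, 143, 67]).buffer);

// Neither view has any own enumerable property to compare:
console.log(Object.keys(b0)); // []
console.log(Object.keys(b1)); // []
```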
Not sure if this is the appropriate place for it, but if you store a Bytes value in your Elm model without decoding it, the debugger (when you compile with --debug) displays it as a pair of empty curly braces. I had to check the length explicitly to verify the data was there.
This is very confusing! Maybe we could display it as a list of integers, or as a hex dump, or something?
I would gladly take a swing at this if someone can point me to where to look in the code.
Decode.bytes misbehaves when the bytes you want don't have an offset of 0, for example when the result of Decode.bytes is itself decoded again.
Note: with "using a decoder" I mean using the decoder with the decode function. When using the decode function, the offset is "reset" to 0.

let
    decoders =
        { skipOne = Decode.andThen (\_ -> Decode.bytes 1) (Decode.bytes 1)
        , maybeOneUint8 = Decode.map (Decode.decode Decode.unsignedInt8) (Decode.bytes 1)
        }
in
[ Encode.unsignedInt8 1
, Encode.unsignedInt8 2
]
    |> Encode.sequence
    |> Encode.encode
    |> Decode.decode decoders.skipOne
    |> Maybe.andThen (Decode.decode decoders.maybeOneUint8)
    |> Debug.log "should be `Just (Just 2)`"
I stumbled onto this issue when writing a SHA-2 implementation. I needed to divide my bytes into chunks of 64 bytes and then divide those chunks into smaller chunks of 4 bytes. I used multiple decode loops to do this.
@eriktim and I found two ways to solve this:
(1) Use bytes.byteOffset, as done in #10.
(2) Slice the underlying buffer:

var _Bytes_read_bytes = F3(function(len, bytes, offset)
{
    return __Utils_Tuple2(
        offset + len,
        new DataView(bytes.buffer.slice(bytes.byteOffset), offset, len)
    );
});
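The reason `bytes.byteOffset` matters can be seen with plain DataViews (an illustrative sketch, not the actual kernel code):

```javascript
// A Bytes value may be a DataView into the middle of a larger
// ArrayBuffer. Any further slicing must account for the view's own
// byteOffset, otherwise reads start from the wrong place.
const backing = Uint8Array.from([0, 1, 2, 3, 4, 5, 6, 7]).buffer;
const bytes = new DataView(backing, 2, 4); // views bytes 2..5

// Wrong: ignores bytes.byteOffset and reads from the buffer start
const wrong = new DataView(bytes.buffer, 1, 2);
// Right: the offset is taken relative to the view itself
const right = new DataView(bytes.buffer, bytes.byteOffset + 1, 2);

console.log(wrong.getUint8(0)); // 1
console.log(right.getUint8(0)); // 3
```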
Important difference between (1) and (2): how each interacts with decode.
I'm not sure what the correct solution is here.
Let me know if I need to make a PR for the second case presented ☝️
Thanks!
I was working with this library in order to implement bidirectional communication over grpc-web using a new elm protocol buffer library that makes use of this.
In order to get the bytes that the elm-protocol-buffers library created to the javascript library I used a port. But I found that taking this type from Elm to the outside world was sort of difficult.
In the end I used loop and unsignedInt8 to change it from Bytes.Bytes to List Int so I could throw it through the port, at which point I had some thoughts:
Appreciate any thoughts on this issue.
Decode.string on bytes, for longer expected strings, will result in JavaScript engines (V8 in this case) utilizing an internal string concatenation performance optimization.
We use the eriktim/elm-protocol-buffers package, which uses the elm/bytes decoding for strings. The data we utilize can potentially have many strings that are quite long. A heap snapshot under Chrome indicates at least 20 bytes of overhead for each individual character that is concatenated during string construction. In our case this leads to ~200 MB consumed in the heap by concatenated strings.
V8 strings appear to internally have 2 possible representations (https://gist.github.com/mraleph/3397008):
- flat strings are immutable arrays of characters
- cons strings are pairs of strings, result of concatenation.
If you concat a and b you get a cons-string (a, b) that represents result of concatenation. If you later concat d to that you get another cons-string ((a, b), d).
Indexing into such a "tree-like" string is not O(1) so to make it faster V8 flattens the string when you index: copies all characters into a flat string.
In this case, it appears the implementation of _Bytes_read_string may end up using the "cons string" type in the JS engine.
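A minimal sketch of such a read loop (illustrative only; it assumes single-byte characters, whereas the real kernel also handles multi-byte UTF-8):

```javascript
// Builds the result with repeated '+=', which is exactly the pattern
// that lets V8 represent the intermediate result as a cons-string
// tree instead of a flat character array.
function readAscii(view, offset, len) {
  let string = '';
  const end = offset + len;
  for (; offset < end; offset++) {
    string += String.fromCharCode(view.getUint8(offset));
  }
  return string;
}

const view = new DataView(Uint8Array.from([72, 105, 33]).buffer);
console.log(readAscii(view, 0, 3)); // "Hi!"
```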
module Main exposing (main)

import Bytes exposing (..)
import Bytes.Decode as Decode
import Bytes.Encode as Encode

main : Program () () ()
main =
    let
        encodedBytes =
            "This is a longer string for testing special JS concatenated strings"
                |> Encode.string
                |> Encode.encode

        decodedBytesStr =
            encodedBytes
                |> Decode.decode (Decode.string (Bytes.width encodedBytes))
    in
    Platform.worker
        { init = \_ -> ( (), Cmd.none )
        , update = \_ _ -> ( (), Cmd.none )
        , subscriptions = \_ -> Sub.none
        }
Once the above code runs:
According to this post (https://gist.github.com/mraleph/3397008), indexing into a cons string forces it to be flattened.
My approach would be to simply perform this once the string has been completely decoded:
var _Bytes_read_string = F3(function(len, bytes, offset)
{
    var string = '';
    var end = offset + len;
    for (; offset < end;)
    {
        ~snip~
    }
    // Force JS engines to flatten the string if it's internally represented by a "cons string"
    string[0];
    return __Utils_Tuple2(offset, string);
});
Possibly for very large strings this index operation could be done in the loop itself (possibly on every X iteration), but there are additional tradeoffs on performance of evaluating the condition.
Note: The above index may not work, as it seems some JS engines still try to avoid the flattening. Even the current approach of the npm flatstr package (https://www.npmjs.com/package/flatstr) doesn't seem to work:
string | 0
However, effectively doing a string reverse-reverse does seem to work around the cons string.
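A sketch of that reverse-reverse workaround (with the caveat from the issue itself: whether any given engine actually flattens here is an internal detail that can change between versions, so this is best-effort only):

```javascript
// split('') materializes every character into an array, which discards
// any internal cons-string tree; the double reverse restores the
// original order before joining back into a (now flat) string.
function flattenViaReverse(s) {
  return s.split('').reverse().reverse().join('');
}

const built = 'abc' + 'def' + 'ghi'; // may be a cons string internally
console.log(flattenViaReverse(built) === built); // true
```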
I struggled a bit with where the solution for this issue should lie (in elm/bytes, in eriktim/elm-protocol-buffers, or in our own code). I decided on opening this here, as I think a solution at the lowest level would be the most appropriate and would resolve this unexpected memory consumption for other libraries/applications that may use the string decoding. I don't see a use case for keeping an internal cons string representation once the decoding has completed.
Hi @evancz
I am working on a PNG decoding/encoding library, and in PNG pixels can have a depth of 1, 2, 4, 8 or 16 bits.
Would it be possible to encode and decode single bits?
I see the case where a scanline has a length in bits that cannot be decoded as bytes.
A Bytes type is just a DataView, so the implementation should be pretty straightforward, just like File.decoder.
Do you think this approach is correct? Would you like me to open a PR with this change?
Use case: I have data in a Uint8Array in JS and I want to use it with an elm-protocol-buffers-like package, but I can't send it over a port because there's no way to decode it into Bytes.
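On the JS side, the usual workaround looks something like this (a sketch; `bytesIn` is a hypothetical port name, not part of any real API):

```javascript
// Ports cannot carry Bytes, so the Uint8Array is widened to a plain
// array of ints, which Elm receives as List Int and can re-encode
// with Bytes.Encode.unsignedInt8 / Bytes.Encode.sequence.
function toPortPayload(u8) {
  return Array.from(u8);
}

// Hypothetical usage with an Elm app instance:
// app.ports.bytesIn.send(toPortPayload(new Uint8Array([10, 20, 30])));
console.log(toPortPayload(new Uint8Array([10, 20, 30]))); // [ 10, 20, 30 ]
```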
module Main exposing (main)

import Browser
import Html exposing (Html, button, div, text)
import Html.Events exposing (onClick)
import Bytes exposing (Bytes)
import Bytes.Decode as Decode
import Bytes.Encode as Encode

lhs =
    Encode.encode
        (Encode.sequence [ Encode.unsignedInt8 0xC0, Encode.unsignedInt8 0 ])

rhs =
    Encode.encode
        (Encode.sequence [ Encode.unsignedInt8 0 ])

update : () -> () -> ()
update () () =
    ()

view : () -> Html never
view () =
    div []
        [ div [] [ text <| Debug.toString (lhs == rhs) ] ]

main : Program () () ()
main =
    Browser.sandbox
        { init = ()
        , view = view
        , update = update
        }
Expected output: False
Actual output: True
Rather than writing this by hand in client or sever code, the hope is that someone things like ProtoBuf compilers for Elm.
"Someone things like" sounds like there is something missing to me :)
And there is a typo in "sever" that should be "server".
Rather than writing this by hand in client or server code, the hope is that someone could create ProtoBuf compilers for Elm.
When parsing a protocol chunk by chunk, we need to stop successfully with everything we decoded so far, and keep the remaining bytes for later.
The current decoder API provides no way to do such a thing.
Two examples of what I need to do when parsing the NATS protocol in Elm Nats:
To work around this I need to have multiple decode phases, cut the bytes to decode the tail, and I get a complicated recursive function instead of a single Decoder.
The current function documentation is:
decode : Decoder a -> Bytes -> Maybe a
The Decoder specifies exactly how this should happen. This process may fail if the sequence of bytes is corrupted or unexpected somehow. The examples above show a case where there are not enough bytes.
What does "corrupted sequence of bytes" mean here?
signedInt8 should work on any Bytes given to it. Is there a case where using it in decode yields Nothing?
If there is, I think a more explicit warning would help (and ideally some links about possible issues). If there isn't, the documentation should say that the only reason for decoding to fail is by passing the decoder a sequence of bytes that it is unable to parse.
This implementation I wrote of <*> or andMap decodes arguments in order right-to-left:
andMap : Decoder a -> Decoder (a -> b) -> Decoder b
andMap =
    map2 (|>)
It's the exact same implementation as Json.Decode.Extra.andMap, but with other imports.
This implementation behaves as I expect it to, decoding arguments left-to-right:
andMap : Decoder a -> Decoder (a -> b) -> Decoder b
andMap d d2 =
    andThen (\v -> map v d) d2
I used this for decoding custom types in applicative style, e.g. succeed T3 |> andMap int |> andMap int |> andMap int.
Bytes.Decode.string will decode bytes that are not valid UTF-8 and produce a nonsense string. Instead it should fail. Thanks @jhbrown94 for helping me verify this.
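As a cross-check of the validity claim, the platform's own UTF-8 decoder rejects such input when run in strict mode (a sketch using the standard TextDecoder API, not elm/bytes itself):

```javascript
// With { fatal: true }, TextDecoder throws on invalid UTF-8 instead of
// silently substituting replacement characters.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(Uint8Array.from(bytes));
    return true;
  } catch (e) {
    return false;
  }
}

console.log(isValidUtf8([0xC0, 0x00])); // false: 0xC0 is never a valid UTF-8 byte
console.log(isValidUtf8([0x68, 0x69])); // true: plain ASCII "hi"
```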
module Main exposing (main)

import Browser
import Html exposing (Html, button, div, text)
import Html.Events exposing (onClick)
import Bytes exposing (Bytes)
import Bytes.Decode as Decode
import Bytes.Encode as Encode

bytes : Bytes
bytes =
    Encode.encode
        (Encode.sequence [ Encode.unsignedInt8 0xC0, Encode.unsignedInt8 0 ])

string : Maybe String
string =
    Decode.decode (Decode.string 2) bytes

update : () -> () -> ()
update () () =
    ()

view : () -> Html never
view () =
    div []
        [ div [] [ text <| Debug.toString string ] ]

main : Program () () ()
main =
    Browser.sandbox
        { init = ()
        , view = view
        , update = update
        }
Actual output: Just "\0"
Expected output: Nothing
b"\xC0\x00" is not unicode.

Hello, is it planned that core/Bitwise will work with Bytes instead of Int? Or support both?
In certain situations, Decode.loop uses the state from the previous run.
Update: the issue seems to be with Decode.bytes (thanks @eriktim).
Code that shows when it doesn't work:
https://ellie-app.com/45kPVqsrmPta1 (see logs)
Chunks are equal when they shouldn't be.
Code that shows the decoder actually does work:
https://ellie-app.com/45mTT24dtvMa1 (see logs)
Chunks are different, as they should be.
The difference between these examples is that the first one defines an additional decoder that uses Decode.loop (through the buildChunksDecoder function).
Simplified example:
-- Decoders to split Bytes in chunks
( decoder512, decoder32 ) =
    ( buildChunksDecoder (512 // 8)
    , buildChunksDecoder (32 // 8)
    )

-- Create chunks
chunks =
    "61626364656667686263646566676869636465666768696A6465666768696A6B65666768696A6B6C666768696A6B6C6D6768696A6B6C6D6E68696A6B6C6D6E6F696A6B6C6D6E6F706A6B6C6D6E6F70716B6C6D6E6F7071726C6D6E6F707172736D6E6F70717273746E6F70717273747580000000000000000000000000000380"
        |> Hex.toBytes
        |> Maybe.andThen (\bytes -> Decode.decode (decoder512 bytes) bytes)
        |> Maybe.map (List.map (\block -> Decode.decode (decoder32 block) block))
        |> Maybe.andThen Maybe.combine
        |> Maybe.map (List.map (List.map Hex.fromBytes))

-- Proof that it doesn't work:
-- THESE SHOULD NOT BE EQUAL
fst =
    Debug.log "First chunk" (List.getAt 0 chunks)

snd =
    Debug.log "Second chunk" (List.getAt 1 chunks)

{-| Chunky decoder builder.
Divides `Bytes` into chunks of x bytes.
-}
buildChunksDecoder : Int -> Bytes -> Decoder (List Bytes)
buildChunksDecoder bytesInChunk bytes =
    Decode.loop
        { chunksLeft = ceiling (toFloat (Bytes.width bytes) / toFloat bytesInChunk)
        , chunks = []
        }
        (\state ->
            if state.chunksLeft <= 0 then
                Decode.succeed (Done state.chunks)

            else
                Decode.map
                    (\chunk ->
                        Loop
                            { chunksLeft = state.chunksLeft - 1
                            , chunks = List.append state.chunks [ chunk ]
                            }
                    )
                    (Decode.bytes bytesInChunk)
        )
It's a fairly complex issue, so let me know if I need to reduce it further.
Thanks! ✌️
Hi!
First of all: thanks a lot for bringing Elm into the world! This language kept me motivated to dig more and more into functional programming, whereas with other languages I felt overwhelmed and discouraged pretty quickly.
Recently I've been writing a parser for MIDI files in Elm, using elm/bytes. Overall it worked pretty nicely. But there was one thing I was missing: writing a custom decoder where an offset could be provided.
Parts of a MIDI file contain a list-like structure. Items are (potentially) compressed in a way that one byte needs to be read in order to know how to process that same byte and the following bytes. Depending on the most significant bit, this byte might contain some meta-information. If it doesn't, the current list item has the same meta-information as the latest item, so the meta-information is just dropped here. This implies that the byte that was just read is not to be read as a single byte, but in various ways depending on the previous meta-information. This leads to a situation where some kind of lookahead would be useful. Something that goes like: OK, the byte is of this structure, so keep it and read it as two nibbles, which define how to read the following bytes. Or: oh, the byte is of this other structure, so just forget about it, use the most recent meta-information and continue like normal, BUT start where the byte we just read started, because it is not the meta-info byte but already part of the data. So we are basically one byte off.
I think in this case it would be nice to just set back the offset, so we effectively forget about the byte we just decoded. Because there is no way to reset the offset when using andThen, I carried this already-read byte around and needed to provide it to the following decoders, where I had to prepend it conditionally. This made the code harder to read, understand and reuse.
In the source code of elm/bytes, andThen and mapN use an offset internally. I guess exposing the data constructor Decoder (Bytes -> Int -> ( Int, a )), instead of just the type constructor Decoder a, would be all that is needed to be able to build custom map / andThen decoders for doing this kind of lookahead decoding.
Maybe my explanation was a little confusing, so this code hopefully makes it easier to understand:
Bytes.Decode.unsignedInt8
    |> Bytes.Decode.andThen
        (\currentPotentialStatusByte ->
            let
                isCompressed =
                    currentPotentialStatusByte < 128

                currentStatusByte =
                    if isCompressed then
                        previousStatusByte

                    else
                        currentPotentialStatusByte

                ( mEventName, channel ) =
                    statusByteToNibbles currentStatusByte

                readFirstByte =
                    if isCompressed then
                        Bytes.Decode.succeed currentPotentialStatusByte

                    else
                        Bytes.Decode.unsignedInt8

                ...
            in
            readFirstByte
                |> Bytes.Decode.andThen preReadVariableLengthValueDecoder
                |> Bytes.Decode.andThen Bytes.Decode.string
                |> Bytes.Decode.map ((++) "System Exclusive Begin Event" >> NotYetSupportedEvent)
        )
and then I need to carry around readFirstByte and map all the following decoders. But I think this would be nicer:
Bytes.Decode.Decoder
    (\bites offset ->
        let
            (Bytes.Decode.Decoder uint8Decode) =
                Bytes.Decode.unsignedInt8

            ( currentPotentialStatusByte, newOffset ) =
                uint8Decode bites offset

            isCompressed =
                currentPotentialStatusByte < 128

            currentStatusByte =
                if isCompressed then
                    previousStatusByte

                else
                    currentPotentialStatusByte

            ( mEventName, channel ) =
                statusByteToNibbles currentStatusByte

            nextOffset =
                if isCompressed then
                    offset

                else
                    newOffset

            ...
        in
        withOffset nextOffset readVariableLengthValueDecoder
            |> Bytes.Decode.andThen Bytes.Decode.string
            |> Bytes.Decode.map ((++) "System Exclusive Begin Event" >> NotYetSupportedEvent)
    )
Where withOffset would be something like:
withOffset : Int -> Bytes.Decode.Decoder a -> Bytes.Decode.Decoder a
withOffset offset (Bytes.Decode.Decoder decode) =
    Bytes.Decode.Decoder <| \bites _ -> decode bites offset
So here I could just use readVariableLengthValueDecoder, which is also used in other places, and I would not need to create a preReadVariableLengthValueDecoder, which does the same thing but either reads the first byte or doesn't, depending on the given argument.
edit: during the past days I watched a bunch of Elm talks (mostly held by Richard Feldman). I now understand much better why the type Decoder is opaque. Still, I think having a way to adjust the offset would be very nice, maybe through adding a utility function like withOffset.
A decoder I wrote returned Nothing for large byte sequences, but it was due to a stack overflow in a function I passed to Bytes.Decode.map.
Here's a simple example:
import Bytes.Decode as D
import Bytes.Encode as E
-- This causes a stack overflow for any input
stackOverflow x =
case [x] of
hd :: _ ->
hd :: stackOverflow x
[] ->
[]
-- This is a single byte (00000001)
byte = E.unsignedInt8 1
-- This decoder should raise an exception
overflowDecoder = D.map stackOverflow D.unsignedInt8
> stackOverflow 1
RangeError: Maximum call stack size exceeded
> D.decode overflowDecoder byte
Nothing
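A plausible explanation (an assumption based on the symptoms above, not confirmed from the kernel source): the decode runner wraps decoding in a try/catch and turns any exception, including a stack overflow, into Nothing. A toy JS sketch of that shape:

```javascript
// A toy decode runner: any exception thrown while running the decoder
// (out-of-range reads, but also crashes inside user-mapped functions)
// surfaces as null, the analogue of Nothing.
function decode(decoder, bytes) {
  try {
    return { just: decoder(bytes) };
  } catch (e) {
    return null;
  }
}

function stackOverflow(x) {
  return [x].concat(stackOverflow(x)); // recurses forever
}

console.log(decode(b => b[0] + 1, [41])); // { just: 42 }
console.log(decode(stackOverflow, [1]));  // null: the overflow was swallowed
```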