
fs2-data's People

Contributors

armanbilge, brbrown25, cquiroz, daenyth, lenguyenthanh, redraw-dawn, rossabaker, rpiaggio, satabin, scala-steward, ubaldop, ybasket, zetashift

fs2-data's Issues

CBOR doesn't round trip bignums

When encoding large integers to bytes and back, the CBOR (de)serialisation doesn't produce the same value as the input, because it handles bignum tags only when encoding, not when decoding:

import cats.effect.SyncIO
import fs2.*
import fs2.data.cbor.*
import fs2.data.cbor.high.*

val value = CborValue.Integer(BigInt("-739421513997118914047232662242593364"))

val out = Stream(value).covary[SyncIO].through(toBinary).through(values).compile.onlyOrError.unsafeRunSync()

println(value) // Integer(-739421513997118914047232662242593364)
println(out) // Tagged(3,ByteString(ByteVector(16 bytes, 0x008e684b9913062199945815948c0a53)))

Scastie: https://scastie.scala-lang.org/VuVAwL2wQiqrcWiSeyCIeg

Add JSON selector DSL

Currently there are three ways to build JSON selectors:

  1. build them using the case class constructors directly;
  2. use the selector parser;
  3. use the selector interpolator.

They all work pretty well, but depending on the context they might not be the best suited:

  1. the constructors are not really intuitive to write;
  2. the parser returns its result in an effect;
  3. the interpolator adds an extra dependency.

A fourth way would be a small DSL allowing one to write selectors in Scala more conveniently than using the constructors directly. For inspiration, one can look at the circe ACursor class.

This DSL can offer a convenient, pure way to build selectors without any extra dependency.
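
A rough sketch of what such a DSL could look like (every name below is hypothetical and only illustrates the proposed shape, in the spirit of circe's ACursor):

import fs2.data.json._

// builds a selector for the "name" field of each element of the array
// under "metadata"; root/field/iterate/compile are all hypothetical names
val sel: Selector =
  root.field("metadata").iterate.field("name").compile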

Rethink and rework the CSV pipes

The CSV module was the first one written in this project and suffers from a really cluttered and surprising API. For the 1.0 release, let's rework it completely to make it easier to use.

Support nested case classes for CsvRow*coder derivation

When deriving shapeless implicits for very wide case classes, the compiler sometimes fails with a stack overflow error. One technique to avoid this is to nest the case classes, which is isomorphic to the flattened structure.

Something like this:

case class Sub(
  @CsvName("first-name") firstName: String,
  @CsvName("the-age") age: Int
)
case class Whole(
  @CsvName("last-name") lastName: String,
  @CsvEmbed sub: Sub
)
// last-name, first-name, the-age
// flintstone, fred, 42

Context: https://gitter.im/fs2-data/general?at=5f159b91a28d973192e7d07a

Doobie supports a similar structure

Custom java.time formats

CellDecoder and CellEncoder have implicit instances for common java.time formats in CSVs, but there's no way to use custom formats yet. These could be exposed as non-implicit methods on the companion objects, but they should provide a safe and sane signature despite the exceptions involved in date parsing through java.time.

Inspired from a question on Gitter: https://matrix.to/#/!GUjuKyYmWZJiiqOfRW:gitter.im/$8WXkYg1Q5SQvNHAwAId14j3_wvoA5DvI9GqZxbNNBnI?via=gitter.im&via=matrix.org
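
A sketch of what such a factory could look like (the method name localDateDecoder and its placement are assumptions, not the library's actual API):

import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException}
import fs2.data.csv._

// hypothetical non-implicit factory, wrapping the java.time exception
// into a DecoderResult instead of letting it fly
def localDateDecoder(formatter: DateTimeFormatter): CellDecoder[LocalDate] =
  new CellDecoder[LocalDate] {
    def apply(cell: String): DecoderResult[LocalDate] =
      try Right(LocalDate.parse(cell, formatter))
      catch {
        case e: DateTimeParseException =>
          Left(new DecoderError(s"cannot parse date '$cell': ${e.getMessage}"))
      }
  }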

Csv decoding should fail when row width does not match header width

Consider an ill-formed file such as this (spaces around \t are for readability; assume they aren't there):

user-generated-text \t some-number
First line is ok \t 1
Unescaped\t raw tab in here \t 2

Note that the last line has an extra tab character because the source of the file did not properly escape user input.

Currently, a CsvRowDecoder coincidentally fails for this particular schema: the "shifted" columns end up putting "raw tab in here" into the some-number header column, which uses a CellDecoder[Int].

But if our CellDecoder had instead been expecting a string there, I believe it would pass without error.

When this condition occurs (row.length != headers.length), the entire stream should abort with an error rather than continuing to emit potentially malformed results. This check would also catch the case where an unescaped user-input \n in the middle of a field produces nonsense lines.
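
A minimal sketch of the proposed check, on a simplified row model (each row as a plain list of cells; the real pipe would use the library's row and error types):

import fs2._

// abort the stream as soon as a row's width diverges from the header's
def requireWidth[F[_]: RaiseThrowable](headers: List[String]): Pipe[F, List[String], List[String]] =
  _.flatMap { row =>
    if (row.length == headers.length) Stream.emit(row)
    else Stream.raiseError[F](
      new Exception(s"row has ${row.length} cells but the header has ${headers.length}"))
  }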

Add jq style queries

In a JSON stream it can be interesting to keep only the tokens of values matching a given filter. jq has a nice query language (e.g. .users[] | select(.age > 18)), from which a subset could be borrowed to implement these filters.

Record JSON selection history for better feedback

Currently, when a JSON selector fails, only the immediate context is reported. Inspired by the circe HCursor class, which records the history traversed from the document root, the selector pipe could attach this information when selecting tokens, so that the user can more easily understand where the problem occurred.

Add support for a "formatted rendering" XML sink

Rendering a stream of XML events to a formatted string (with indentation and newlines) can be useful for printing pieces of data or writing them into files.

The result of rendering a stream in a formatted manner should roughly look like this:

<a>
  <b>
    SomeText
  </b>
</a>

instead of

<a><b>SomeText</b></a>

Workaround:

Right now the workaround is to use a scala-xml sink and render the result. Obviously this implies an additional dependency.

Notes:

  • the tracking of the indentation level can probably be achieved via a simple scan on the fs2 stream (see the sketch below)
  • the complexity is that the XML text elements would have to be "interpreted"
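
A minimal sketch of the scan idea from the first note (event constructor names are from memory and may not match the library exactly; text handling is left out):

import fs2._
import fs2.data.xml._

// pair every event with its indentation depth; a real formatter would
// render this depth as whitespace and treat text nodes specially
def withDepth[F[_]]: Pipe[F, XmlEvent, (Int, XmlEvent)] =
  _.mapAccumulate(0) { (depth, e) =>
    e match {
      case _: XmlEvent.StartTag => (depth + 1, (depth, e))
      case _: XmlEvent.EndTag   => (depth - 1, (depth - 1, e))
      case _                    => (depth, (depth, e))
    }
  }.map(_._2)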

Publish for scala.js

I'm using fs2-data-xml for a library that I'd like to cross-publish for Scala.js.
It seems that it should be possible given the dependencies.
I may try it myself, but I don't know much about using mill.
Tests are problematic though, as they use files, and that would be harder on Scala.js.

Add more Json Selector Pipes

Currently, one can use JSON selectors only to keep some values or to transform the JSON value at the selected position into another one. Other interesting use cases would be:

  • deciding whether to keep or filter out a value at a given path based on some predicate
  • having an effectful transformation, for cases where it might fail

This can be achieved by adding the following pipes:

def transformF[F[_], Json](sel: Selector, f: Json => F[Json]): Pipe[F, Token, Token]

def transformOpt[F[_], Json](sel: Selector, f: Json => Option[Json]): Pipe[F, Token, Token]
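
Hypothetical usage of the proposed transformOpt, keeping the selected value only when it satisfies a user-supplied predicate (sel is assumed to be an already-built Selector; Json is whatever AST is in use):

import cats.effect.IO
import fs2._
import fs2.data.json._

// drop the selected value whenever the predicate rejects it
def filterSelected[Json](sel: Selector, isWanted: Json => Boolean): Pipe[IO, Token, Token] =
  transformOpt[IO, Json](sel, json => if (isWanted(json)) Some(json) else None)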

Make it possible to add a CSV column to a row

The CSV row update function does not add a column to the row if it wasn't already present. In some use cases one might want to add a column (the same way one can delete one).

It should be possible to have a behavior that replaces the value, or adds it if not present. The behavior is defined as:

  • when setting by header name, add the header to the end
  • it is not possible to have the same behavior for index-based access, as this opens the door to weird behaviors when setting a column by index on rows with headers

When performing this in a stream, it is up to the caller to ensure that the resulting stream is valid (i.e., all rows have the same columns in the result).
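
A sketch of the replace-or-add behavior on a simplified representation (parallel header/value lists, not the library's row type):

// replace the value under `name` if the header exists,
// otherwise append both the header and the value at the end
def setOrAdd(headers: List[String], values: List[String], name: String, value: String): (List[String], List[String]) =
  headers.indexOf(name) match {
    case -1 => (headers :+ name, values :+ value)
    case i  => (headers, values.updated(i, value))
  }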

Wrong line number when CSV contains empty lines

I'm using the fs2.data.csv.lowlevel.rows pipe to parse CSV files. I wanted to provide meaningful error information to my users and was hoping to use the line information from the RowF class. However, the provided line number may be incorrect if the CSV file contains empty lines. I understand that the rows pipe skips empty lines, but I didn't expect the line numbers of subsequent rows to no longer correspond to the line numbers in the file. I'm not sure whether this is the intended behaviour.

Example:

import fs2._
import fs2.data.csv._
import cats.implicits._

val input = """A,B,C
              |D,E,F
              |
              |G,H,I
              |
              |J,K,L
              |""".stripMargin

Stream.emit(input)
      .covary[Fallible]
      .through(lowlevel.rows[Fallible, String]())
      .map(r => s"""${r.line.orEmpty}: ${r.values.toList.mkString}""")
      .compile
      .toList  

// actual result  : Right(List(1: ABC, 2: DEF, 3: GHI, 4: JKL))
// my expectation : Right(List(1: ABC, 2: DEF, 4: GHI, 6: JKL))

fs2-data version: 1.6.1

Add cookbooks

Some patterns come up often when using the library; the website should have a section for them.
Among common patterns:

  • read from CSV sources to be merged and write back
  • JSON/XML formatting

It would also be nice to add a simple way to submit new ones to the website, maybe using PR templates.
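
For the first pattern, a sketch could look like this (using the lowlevel.rows pipe that already appears in other issues here; the write-back side is elided):

import fs2._
import fs2.data.csv._

// parse two CSV sources sharing a schema and concatenate their rows,
// dropping the second source's header row
def mergedRows[F[_]: RaiseThrowable](a: Stream[F, String], b: Stream[F, String]) =
  a.through(lowlevel.rows[F, String]()) ++
    b.through(lowlevel.rows[F, String]()).drop(1)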

csv.lowlevel.headers pipe does not preserve line numbers

Version: 1.3.0
Scala: 3.1.1-RC

Example:

import cats.effect._
import cats.syntax.all._
import fs2._
import fs2.io.file._
import fs2.data.csv

object TestApp extends IOApp.Simple {
  override def run: IO[Unit] =
    Files[IO]
      .readAll(Path("some_valid_csv_path"))
      .through(text.utf8.decode)
      .through(csv.lowlevel.rows())
      .evalTap { row =>
        IO(println(s"[1] ${row.line}"))
      }
      .through(csv.lowlevel.headers[IO, String])
      .evalTap { row =>
        IO(println(s"[2] ${row.line}"))
      }
      .compile
      .drain
}

Actual output:

[1] Some(1)
[1] Some(2)
[2] None
[1] Some(3)
[2] None
[1] Some(4)
[2] None
[1] Some(6)
[2] None
...

Expected output:

[1] Some(1)
[1] Some(2)
[2] Some(2)
[1] Some(3)
[2] Some(3)
[1] Some(4)
[2] Some(4)
[1] Some(6)
[2] Some(6)
...

Add Json Stream Wrappers

A use case for fs2-data might be to generate JSON data in a streaming manner rather than transform parsed data. In this case, it can be interesting to have pipes to wrap the produced data into some array/object structure in a safe way (i.e. while ensuring the produced sequence of tokens is valid).

This is based on a question asked on Gitter.
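
A minimal sketch of the array case, assuming Token.StartArray/Token.EndArray constructors and eliding any validity checking of the inner stream:

import fs2._
import fs2.data.json._

// wrap a stream of JSON values into a single top-level array
def wrapInArray[F[_]]: Pipe[F, Token, Token] =
  inner => Stream.emit(Token.StartArray) ++ inner ++ Stream.emit(Token.EndArray)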

tokens duplicates text if stream has chunks that end with escape character

I ran into an issue where reading in a JSON file of file paths would sometimes duplicate text when running it through tokens. The following code, when run with the attached sample text, reproduces the issue. With a chunkSize of 55, the second token is parsed as \pathpath\to\some\other\file.txt instead of \path\to\some\other\file.txt.

val chunkSize = 55
val stream: fs2.Stream[IO, Byte] = fs2.io.readInputStream(IO(getClass.getResourceAsStream("/Sample.txt")), chunkSize)
stream.map(_.toChar)
  .through(tokens[IO, Char])
  .map((token: Token) => {
    println(s"token: ${token}")
    token
  }).compile.toList.unsafeRunSync()

Sample.txt

I added some code to print the chunks to see where they start and stop, and found that when a chunk ends with an escape character, the text just before it gets duplicated in the next chunk.

val chunkSize = 55
val stream: fs2.Stream[IO, Byte] = fs2.io.readInputStream(IO(getClass.getResourceAsStream("/Sample.txt")), chunkSize)
stream.map(_.toChar).chunkN(chunkSize)
  .map((chunk: Chunk[Char]) => {
    println(s"chunk: ${chunk.toList.toString()}")
    chunk
  })
  .flatMap(c => fs2.Stream.chunk(c))
  .through(tokens[IO, Char])
  .map((token: Token) => {
    println(s"token: ${token}")
    token
  }).compile.toList.unsafeRunSync()

CellDecoder performance

I tried to use the fs2-data-csv library to parse some large files. Reading them into the RowF case class was quite fast, but I saw poor performance during the cell-decoding stage. It occurred for both the semi-auto and custom-written decoders. I've located a code path in the RowF.as[T] function where it recreates the byHeader value for each field that is extracted. I could not think of a fix for it without breaking API changes. I'm happy to provide a pull request if we can find a suitable solution.

A temporary workaround for me was to create a wrapper that caches the byHeader value and to provide a custom decoder. Here's an example:

import fs2.data.csv._

class CachedHeaderRow(val origin: RowF[Some, String])(implicit hasHeaders: HasHeaders[Some, String]) {

  // cache the header -> value mapping instead of rebuilding it per field
  private lazy val byHeader = origin.headers.get.toList.zip(origin.values.toList).toMap

  def as[T](header: String)(implicit decoder: CellDecoder[T]): DecoderResult[T] =
    byHeader.get(header) match {
      case Some(v) => decoder(v)
      case None    => Left(new DecoderError(s"unknown field $header"))
    }
}

This decreased the pipeline of reading, parsing, and decoding 1,000,000 rows of 30 cells each from ~100 seconds to ~10 seconds.

`fs2.data.xml.XmlException: character 'ʿ' cannot start a NCName`

Via http4s/http4s-scala-xml#25 (comment).

//> using scala "3.1.2"
//> using lib "org.gnieh::fs2-data-xml-scala::1.4.1"

import cats.effect.*
import fs2.*
import scala.xml.*

val xml = """<Ẵ줐샃뗧饜孫 悊頃ふ퉞="ꨍ邭䋒↏᲎ừ" 듸괎:ʿक턻뽜="촏"/>"""

object App extends IOApp.Simple {

  def run = for
    _ <- IO(XML.loadString(xml)) *> IO.println("scala-xml works")
    _ <- Stream.emit(xml).covary[IO].through(fs2.data.xml.events()).compile.drain *> IO.println("fs2-data works")
  yield ()

}
Output:

scala-xml works
fs2.data.xml.XmlException: character 'ʿ' cannot start a NCName
        at fs2.data.xml.internals.EventParser$.fail$1$$anonfun$1(EventParser.scala:40)
        at fs2.Pull$$anon$2.cont(Pull.scala:183)
        at fs2.Pull$BindBind.cont(Pull.scala:701)
        at fs2.Pull$ContP.apply(Pull.scala:649)
        at fs2.Pull$ContP.apply$(Pull.scala:648)
        at fs2.Pull$Bind.apply(Pull.scala:657)
        at fs2.Pull$Bind.apply(Pull.scala:657)
        at fs2.Pull$.go$1$$anonfun$1(Pull.scala:1207)
        at fs2.Pull$.interruptGuard$1$$anonfun$1(Pull.scala:933)
        at get @ fs2.internal.Scope.openScope(Scope.scala:281)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Pull$.goCloseScope$1$$anonfun$1$$anonfun$3(Pull.scala:1187)
        at update @ fs2.internal.Scope.releaseChildScope(Scope.scala:227)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at modify @ fs2.internal.Scope.close(Scope.scala:262)
        at flatMap @ fs2.Compiler$Target.flatMap(Compiler.scala:162)
        at flatMap @ fs2.Pull$.goCloseScope$1$$anonfun$1(Pull.scala:1188)
        at handleErrorWith @ fs2.Compiler$Target.handleErrorWith(Compiler.scala:160)
        at flatMap @ fs2.Pull$.goCloseScope$1(Pull.scala:1195)
        at get @ fs2.internal.Scope.openScope(Scope.scala:281)

Support magnolia-based instance derivation

Magnolia instance derivation is preferable to shapeless induction for a few reasons:

  • It's designed for this use case, so the API is more directly applicable than shapeless' very general one
  • It compiles a lot faster, which is beneficial to projects with a lot of data mappings
  • The debugging experience when instances fail to derive is friendly, informative, and actionable (https://magnolia.work/opensource/magnolia#debugging)

Adding this kind of derivation in a new submodule would be a great win for making the library more beginner-friendly and more useful to newcomers to fs2 and the Typelevel stack.

Support for non-UTF `encoding` in xml parser

Very excited about the enhanced XML support in 1.4.0 :) I've been experimenting with it in http4s/http4s-scala-xml#25 and running into trouble with non-UTF encodings. FTR, I'm no expert in these things :)

For example this request:

Content-Type: application/xml

<?xml version="1.0" encoding="iso-8859-1"?><hello name="Günther"/>

as used in this test:
https://github.com/http4s/http4s-scala-xml/blob/1ca64f2ab7ef500d384d2ec5f8caf88df600e6a6/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala#L198-L209

Furthermore, the RFC specifies:

Since the charset parameter is not provided in the Content-Type
header and there is no overriding BOM, conformant XML processors must
treat the "iso-8859-1" encoding as authoritative.  Conformant XML-
unaware MIME processors should make no assumptions about the
character encoding of the XML MIME entity.

https://datatracker.ietf.org/doc/html/rfc7303#section-8.3

I'm not sure if there is a way to support this without an XML parser that operates directly on bytes instead of chars/strings 😕 Any thoughts? Thanks!
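
When the encoding is known out of band, one workaround today is to decode the bytes before parsing; a sketch assuming fs2's text.decodeWithCharset pipe (available in recent fs2 versions):

import java.nio.charset.Charset
import cats.effect.IO
import fs2._

// decode ISO-8859-1 bytes to strings before handing them to the XML
// parser; this cannot honor the encoding declared inside the document
def parseLatin1(bytes: Stream[IO, Byte]) =
  bytes
    .through(text.decodeWithCharset[IO](Charset.forName("ISO-8859-1")))
    .through(fs2.data.xml.events[IO, String]())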

Add support for some regexes

In some scenarios, it is interesting to extract textual data based on a regular expression.
This could make for an interesting fs2-data module that handles the corner cases (e.g. when a match spans several elements and/or chunks).

Using a non-backtracking regular expression implementation (such as re2) can help address this problem in a streaming context.

This PhD thesis can be interesting in this context.

Add CBOR <-> JSON module

The RFC defines conversions back and forth between CBOR and JSON values. By leveraging the Builder/Tokenizer concept, we can add a module that allows parsing/serializing CBOR streams to/from JSON values.
