delimited's People

Contributors

danking, johnynek, scala-steward, thomas-stripe, tixxit, travisbrown, travisbrown-stripe

delimited's Issues

Look for BOM at the beginning of files

Excel produces UTF-8 files with BOMs in them, so we should probably just cover the gamut and always look for BOMs when reading files, unless explicitly given a character set.
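
A minimal sketch of what BOM sniffing could look like, using only java.io and java.nio. The `BomSniffer` object and its `sniff` method are hypothetical names, not part of delimited, and only the UTF-8 and UTF-16 marks are checked.

```scala
import java.io.{ByteArrayInputStream, InputStream, PushbackInputStream}
import java.nio.charset.{Charset, StandardCharsets}

object BomSniffer {
  // Byte order marks checked here; UTF-32 is left out to keep the sketch short.
  private val boms: List[(Array[Byte], Charset)] = List(
    Array(0xEF, 0xBB, 0xBF).map(_.toByte) -> StandardCharsets.UTF_8,
    Array(0xFE, 0xFF).map(_.toByte)       -> StandardCharsets.UTF_16BE,
    Array(0xFF, 0xFE).map(_.toByte)       -> StandardCharsets.UTF_16LE
  )

  /** Returns the charset implied by a leading BOM (if any) and a stream
    * positioned just after that BOM. */
  def sniff(in: InputStream): (Option[Charset], InputStream) = {
    val pb = new PushbackInputStream(in, 3)
    val head = new Array[Byte](3)
    // read may return fewer bytes than requested, so loop until 3 bytes or EOF
    var n = 0
    var done = false
    while (n < 3 && !done) {
      val r = pb.read(head, n, 3 - n)
      if (r < 0) done = true else n += r
    }
    val hit = boms.find { case (bom, _) =>
      n >= bom.length && head.take(bom.length).sameElements(bom)
    }
    val skip = hit.map(_._1.length).getOrElse(0)
    // Push back everything that was read beyond the BOM
    if (n > skip) pb.unread(head, skip, n - skip)
    (hit.map(_._2), pb)
  }
}
```

A caller that was given an explicit character set would skip `sniff` entirely; otherwise it could fall back to a default when the returned option is empty.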

Inferred DelimitedFormat verification step

Right now we make a good guess at the format based on frequency counts. However, we don't do any validation that our choice was correct. Mainly, we can do 2 types of verification right away:

  1. Check that there are no errors when parsing the sample - or very few, relative to other candidates
  2. Check that the rows all have ~the same number of cells in them

The idea would be that rather than inferring 1 candidate DelimitedFormat, we would return a Stream[DelimitedFormat] of potential formats, ranked based on some sort of score. We would then have a validation step that would sample them until it found a successful one.
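
As a rough sketch of the second check (rows having about the same number of cells), the following tries ranked candidates in order and returns the first one that validates. `Candidate` is a hypothetical stand-in for DelimitedFormat that only models the separator, the split ignores quoting, and the 0.9 threshold is an arbitrary assumption.

```scala
object FormatCheck {
  // Hypothetical stand-in for a candidate format: only the cell separator is modelled.
  final case class Candidate(separator: String, score: Double)

  /** Returns the most common cell count and the fraction of non-empty rows that have it. */
  def validate(sample: String, c: Candidate): (Int, Double) = {
    val rows = sample.split("\r?\n").filter(_.nonEmpty)
    if (rows.isEmpty) (0, 0.0)
    else {
      // Naive split that ignores quoting; limit -1 keeps trailing empty cells
      val sep = java.util.regex.Pattern.quote(c.separator)
      val widths = rows.map(_.split(sep, -1).length)
      val (width, count) =
        widths.groupBy(identity).map { case (w, ws) => (w, ws.length) }.maxBy(_._2)
      (width, count.toDouble / rows.length)
    }
  }

  /** Tries candidates from best score to worst and returns the first that validates. */
  def firstValid(sample: String,
                 candidates: Seq[Candidate],
                 threshold: Double = 0.9): Option[Candidate] =
    candidates.sortBy(-_.score).find { c =>
      val (width, fraction) = validate(sample, c)
      // A width of 1 means the separator never split anything, so reject it here
      width > 1 && fraction >= threshold
    }
}
```

Rejecting single-column results is a simplification for the sketch; a real validator would also need to weigh parse errors, as in the first check.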

Add Iteratee support (Cats)

Add a delimited-iteratee package that uses Travis Brown's iteratee library.

Ideally we would have an Enumeratee that can map chunks of character data to rows. I'm not super sure this is possible... but it would be nice to avoid a proliferation of Enumerators for different ways of getting rows out of sources of chunks of character data.

Iteratee support should include dumping CSVs to files, possibly with a BOM marker.

Support chunked input when guessing

Rather than requiring a String, we could actually take chunked input, such as what's accumulated in inferDelimitedFormat in the iteratee module. This would avoid a possibly large string concatenation. It shouldn't really make things slower or be particularly difficult to do.
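
To illustrate why chunked input should be cheap, here is a sketch of frequency counting that folds over chunks without ever concatenating them. `ChunkedCounts` is a hypothetical helper, not the library's inference code, and it only counts single characters, so a delimiter such as \r\n that straddles a chunk boundary would need extra state.

```scala
object ChunkedCounts {
  /** Counts occurrences of each candidate character across all chunks. */
  def count(chunks: Iterator[String], candidates: Set[Char]): Map[Char, Long] =
    chunks.foldLeft(candidates.map(_ -> 0L).toMap) { (acc, chunk) =>
      // Fold over the characters of this chunk, updating the running totals
      chunk.foldLeft(acc) { (m, ch) =>
        if (candidates(ch)) m.updated(ch, m(ch) + 1) else m
      }
    }
}
```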

Partially parse delimited files to LazyRow

The idea here is that instead of having the parser produce fully parsed rows, we may be able to parse things a bit quicker by simply parsing the entire row, without breaking it down into cells, then using a new LazyRow type to defer actually parsing the row fully. This way we may be able to speed up a single-threaded reader by only partially parsing the rows, then allowing a chunk of the work required to further parse the rows to be done concurrently.

The hope here is that we can keep the reader IO bound, if at all possible.
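
One way to picture this, as a sketch only: the reader finds row boundaries and wraps each raw row, and the cell splitting happens later, possibly on other threads. The `LazyRow` class below is a hypothetical shape for the proposed type; it assumes rows are already split on line boundaries and ignores quoting.

```scala
import scala.concurrent.{ExecutionContext, Future}

object LazyRows {
  /** Hypothetical LazyRow: keeps the raw row text and splits it on first access. */
  final class LazyRow(val raw: String, separator: Char) {
    lazy val cells: Vector[String] = {
      val b = Vector.newBuilder[String]
      var start = 0
      var i = 0
      while (i < raw.length) {
        if (raw.charAt(i) == separator) { b += raw.substring(start, i); start = i + 1 }
        i += 1
      }
      b += raw.substring(start) // last cell, possibly empty
      b.result()
    }
  }

  /** The single-threaded part: only wraps each raw row, no cell parsing. */
  def split(lines: Iterator[String], separator: Char): Iterator[LazyRow] =
    lines.map(new LazyRow(_, separator))

  /** The deferred part: forces the cells of each row on the given execution context. */
  def parseConcurrently(rows: List[LazyRow])(implicit ec: ExecutionContext): Future[List[Vector[String]]] =
    Future.traverse(rows)(r => Future(r.cells))
}
```

Because `cells` is a lazy val, rows that are filtered out before anyone reads their cells never pay the splitting cost.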

Robust validation data set for format inference

We should have a decent set of delimited documents sourced from the internet that have a variety of styles. This includes files produced by some "standard libs" in various programming languages, Excel, etc.

Add Scala.js support

This isn't currently an easy option because Scala.js (as of 0.6.16) doesn't implement java.io.PushbackReader, but in the future that might change, or someone might decide to provide an alternative Scala.js-specific implementation of GuessDelimitedFormat that doesn't use PushbackReader.

Add column-level schema inference

It would be nice to include some support for inferring column-level, detailed schemas for parsed CSVs too. This comes up often and usually gets hacked around. The goal would be to provide a fairly robust way of gathering statistics per column, including:

  • inferred type: text, category, ordinal, integer, continuous, boolean, etc
  • required vs optional
  • how missing values are marked (empty, NULL, N/A, NaN, etc)
  • ratio of unique values / rows

Ideal proof-of-concept would include a tool that given a CSV will produce a SQL file that includes the CREATE TABLE statement along with an INSERT statement for the values - at least supporting Postgres or maybe allowing pluggable backends if that is feasible (without rewriting 80% of the tool).
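
A small sketch of the per-column statistics listed above, for a single column of raw string values. Everything here (`ColumnStats`, `Summary`, the reduced set of types, and the set of missing-value markers) is an assumption made for illustration; category and ordinal detection and the SQL output are left out.

```scala
object ColumnStats {
  sealed trait ColType
  case object IntegerCol extends ColType
  case object ContinuousCol extends ColType
  case object BooleanCol extends ColType
  case object TextCol extends ColType

  final case class Summary(colType: ColType,
                           required: Boolean,
                           missingMarkers: Set[String],
                           uniqueRatio: Double)

  // Assumed missing-value markers, taken from the examples in the list above
  private val missing = Set("", "NULL", "N/A", "NaN")

  private val integer = "[+-]?\\d+"
  private val decimal = "[+-]?(\\d+\\.?\\d*|\\.\\d+)([eE][+-]?\\d+)?"

  def summarize(values: Seq[String]): Summary = {
    val (absent, present) = values.partition(missing.contains)
    // Pick the narrowest type that every present value satisfies
    val colType =
      if (present.isEmpty) TextCol
      else if (present.forall(v => v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false"))) BooleanCol
      else if (present.forall(_.matches(integer))) IntegerCol
      else if (present.forall(_.matches(decimal))) ContinuousCol
      else TextCol
    val ratio = if (values.isEmpty) 0.0 else values.distinct.size.toDouble / values.size
    Summary(colType, required = absent.isEmpty, missingMarkers = absent.toSet, uniqueRatio = ratio)
  }
}
```

A column with values "1", "2", "", "2" would come out as an optional integer column whose missing values are marked by the empty string.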

Support hard limits on row sizes

If allowRowDelimInQuotes is true, a malformed CSV can potentially OOM the JVM. That is pretty bad. We should allow people to provide a maximum acceptable row size. This can be used purely as a safety mechanism. In the case of inference, the buffer size used is sort of an implicit hard limit on row sizes, since if rows can be larger than the buffer, then we have no hope of inferring the row delimiter (at least).

Ideally this would just be part of the initial parser creation:

DelimitedParser(format, maxRowSize = 64 * 1024)
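
To make the failure mode concrete, here is a sketch of a row reader that enforces such a limit. It is not the library's parser: `BoundedRows`, `readRow` and `RowTooLargeException` are hypothetical, rows are assumed to end with \n, and quotes are assumed to be double quotes.

```scala
import java.io.{Reader, StringReader}

object BoundedRows {
  final class RowTooLargeException(limit: Int)
    extends RuntimeException(s"row exceeds $limit characters")

  /** Reads one raw row (without its terminator), or None at end of input.
    * A newline inside double quotes does not end the row, which is exactly
    * why an unclosed quote could otherwise buffer the rest of the file. */
  def readRow(in: Reader, maxRowSize: Int): Option[String] = {
    var c = in.read()
    if (c == -1) None
    else {
      val sb = new StringBuilder
      var inQuotes = false
      while (c != -1 && !(c == '\n' && !inQuotes)) {
        // Fail fast instead of growing the buffer without bound
        if (sb.length >= maxRowSize) throw new RowTooLargeException(maxRowSize)
        if (c == '"') inQuotes = !inQuotes
        sb.append(c.toChar)
        c = in.read()
      }
      Some(sb.toString)
    }
  }
}
```

With a limit in place, a stray opening quote produces an exception after at most `maxRowSize` buffered characters rather than an OutOfMemoryError.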

Binary compatibility problem in Scala 2.12.3

When using delimited 0.9.0 with Scala 2.12.3, I'm seeing:

@ val parser: DelimitedParser = DelimitedParser(DelimitedFormat.CSV)
java.lang.BootstrapMethodError: java.lang.NoSuchMethodError: scala.math.Ordering$.$anonfun$by$1$adapted(Lscala/Function1;Lscala/math/Ordering;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
  net.tixxit.delimited.RowDelim$.<init>(RowDelim.scala:28)
...

The root cause is suspected to be scala/bug#10489.

Support for escaping the quote escape character

Is it possible to add support for escaping the quote escape character? I have some MySQL dumps that use " as a quote character and \ as an escape character, but there are some values of the form "value\\", where the last quote is wrongly interpreted as an escaped quote. If it were possible to escape the escape character, \\ could be treated as an escaped \ and the value would be valid.

Thanks
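
The behaviour requested above can be sketched with a small stand-alone function. This is not delimited's parser; `EscapedQuotes.quotedCell` is a hypothetical helper in which the escape character always consumes the character after it, so an escaped escape no longer swallows the closing quote.

```scala
object EscapedQuotes {
  /** Parses one quoted cell whose opening quote is at index `start`.
    * Returns the unescaped value and the index just past the closing quote,
    * or None if there is no opening quote or the cell is never closed. */
  def quotedCell(input: String,
                 start: Int,
                 quote: Char = '"',
                 escape: Char = '\\'): Option[(String, Int)] =
    if (start >= input.length || input.charAt(start) != quote) None
    else {
      val sb = new StringBuilder
      var i = start + 1
      var end = -1
      while (end < 0 && i < input.length) {
        val c = input.charAt(i)
        if (c == escape && i + 1 < input.length) {
          // The escape consumes the next character, whatever it is,
          // so an escaped escape yields a single escape character
          sb.append(input.charAt(i + 1))
          i += 2
        } else if (c == quote) {
          end = i + 1
        } else {
          sb.append(c)
          i += 1
        }
      }
      if (end < 0) None else Some((sb.toString, end))
    }
}
```

Given the raw text "value\\" followed by a separator, this returns the value with a single trailing backslash and stops at the real closing quote.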

How to work with Rows

Hi, I was trying out the library and I'm a bit confused by the way Row type behaves. I see that it inherits from IndexedSeq, but it seems that any inherited method returns IndexedSeq instead of a Row:

@ import $ivy.`net.tixxit::delimited-core:0.9.0`, net.tixxit.delimited._

@ Row("a", "b", "c") ++ Row("d", "e", "f")
res1: IndexedSeq[String] = Vector("a", "b", "c", "d", "e", "f")

@ Row("a", "b", "c") :+ "d"
res2: IndexedSeq[String] = Vector("a", "b", "c", "d")

So, how is one supposed to work with Rows?

java.lang.StringIndexOutOfBoundsException in 0.6.0 but not in 0.5.5

Here's the file (TSV)
sample.txt.zip

Code producing the exception:

import java.io._

import net.tixxit.delimited._

object Main extends App {

  val fileName = "/Users/dw/Desktop/sample.txt"
  val bufferedReader = new BufferedReader(
    new FileReader(fileName)
  )
  try {
    // Parse the file as TSV and print each parsed row
    val iterator = DelimitedParser(DelimitedFormat.TSV).parseReader(bufferedReader)
    for {
      parsed <- iterator
    } {
      println(parsed)
    }
  } catch {
    case e: Throwable =>
      e.printStackTrace()
      throw e
  } finally {
    bufferedReader.close()
  }
}

stacktrace:

 java.lang.StringIndexOutOfBoundsException: String index out of range: 66100
[error]     at java.lang.String.charAt(String.java:646)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.getChar(DelimitedParserImpl.scala:96)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.loop$2(DelimitedParserImpl.scala:110)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.isFlag(DelimitedParserImpl.scala:118)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.isSeparator$1(DelimitedParserImpl.scala:140)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.isEndOfCell$1(DelimitedParserImpl.scala:146)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.loop$3(DelimitedParserImpl.scala:166)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.unquotedCell$1(DelimitedParserImpl.scala:179)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.cell$1(DelimitedParserImpl.scala:221)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.row$1(DelimitedParserImpl.scala:263)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.parse(DelimitedParserImpl.scala:287)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl.loop$1(DelimitedParserImpl.scala:46)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl.parseChunk(DelimitedParserImpl.scala:31)
[error]     at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:51)
[error]     at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:50)
[error]     at scala.collection.Iterator$$anon$15.next(Iterator.scala:499)
[error]     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
[error]     at scala.collection.Iterator$class.foreach(Iterator.scala:742)
[error]     at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
[error]     at september.cj.Main$.delayedEndpoint$september$cj$Main$1(Main.scala:18)
[error]     at september.cj.Main$delayedInit$body.apply(Main.scala:8)
[error]     at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
[error]     at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
[error]     at scala.App$$anonfun$main$1.apply(App.scala:76)
[error]     at scala.App$$anonfun$main$1.apply(App.scala:76)
[error]     at scala.collection.immutable.List.foreach(List.scala:381)
[error]     at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
[error]     at scala.App$class.main(App.scala:76)
[error]     at september.cj.Main$.main(Main.scala:8)
[error]     at september.cj.Main.main(Main.scala)
[error] Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 66100
[error]     at java.lang.String.charAt(String.java:646)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.getChar(DelimitedParserImpl.scala:96)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.loop$2(DelimitedParserImpl.scala:110)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$InputBuffer.isFlag(DelimitedParserImpl.scala:118)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.isSeparator$1(DelimitedParserImpl.scala:140)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.isEndOfCell$1(DelimitedParserImpl.scala:146)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.loop$3(DelimitedParserImpl.scala:166)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.unquotedCell$1(DelimitedParserImpl.scala:179)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.cell$1(DelimitedParserImpl.scala:221)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.row$1(DelimitedParserImpl.scala:263)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl$.parse(DelimitedParserImpl.scala:287)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl.loop$1(DelimitedParserImpl.scala:46)
[error]     at net.tixxit.delimited.parser.DelimitedParserImpl.parseChunk(DelimitedParserImpl.scala:31)
[error]     at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:51)
[error]     at net.tixxit.delimited.DelimitedParser$$anonfun$parseAll$1.apply(DelimitedParser.scala:50)
[error]     at scala.collection.Iterator$$anon$15.next(Iterator.scala:499)
[error]     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
[error]     at scala.collection.Iterator$class.foreach(Iterator.scala:742)
[error]     at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
[error]     at september.cj.Main$.delayedEndpoint$september$cj$Main$1(Main.scala:18)
[error]     at september.cj.Main$delayedInit$body.apply(Main.scala:8)
[error]     at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
[error]     at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
[error]     at scala.App$$anonfun$main$1.apply(App.scala:76)
[error]     at scala.App$$anonfun$main$1.apply(App.scala:76)
[error]     at scala.collection.immutable.List.foreach(List.scala:381)
[error]     at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
[error]     at scala.App$class.main(App.scala:76)
[error]     at september.cj.Main$.main(Main.scala:8)
[error]     at september.cj.Main.main(Main.scala)
java.lang.RuntimeException: Nonzero exit code returned from runner: 1
