univocity / univocity-parsers Goto Github PK

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

HTML 0.47% Java 99.53%

univocity-parsers's People

Contributors

Stargazers

Watchers

Forkers

nkabir adessaigne cquezadav lin-zhao deejmore xiaoleiy twh270 jackwu2013 birchbox kirhgoff navkast backoffbelief mitchjust shubhaat tailorware firatkarakusoglu dartok-sd adeelnm willwotechnologies sshelake reiberandras rafaelrviana fengshao0907 caporossi newsky nareshmiriyala rebaz88 helt stevesnod ashutoshmimani h-toru04 ibelievelove allenfancy felixongati harlixxy sakuratw sanjay528 pescarcena bwrega mbreslow piyushnagar jy4618272 mohammadshahidkhan richardy2012 mgaouar michael-ancestor gspandy sefersezer georgvolkert lkasler bigfreecoder 292388900 desperado1992 gautamraojangili liinnux cjy9492 salansun ttjeremy mosoft521 tu-ngo lingreed xingganfengxing danielvdende hj5 isigma rodolfocruzbsb ivywan alessio1991 shotishu ep1804 satendrakumar pkofficial cquliaoli kasecato tonyrso atianfan jsolmon mcrassin-movesol singhamxiao adiu19 johnraja emrul dmmop ganeshan beyond2016 ziscloud stwitte hadasschwinger prashant-uxd vrdate tomar-vrdate ephung01 awsp sloshy powerfulxxr zarkob vmaldosan ashokblend esimionato shresthaujjwal

univocity-parsers's Issues

Javadoc fails on JDK8

Build fails on JDK8 as javadoc tool no longer supports empty

Error processing input

I have a csv file with a data such as this:

H23135.502,H17108.503,X55362.504,R53941.505,R98189.506,R53936.507,H01677.508,Z49199.509,D90188.510,H09599.511,H82631.512,M22382.513,T47562.514,T56604.515,M59807.516,U30827.517,M23410.518,M60484.519,T95048.520,H07899.521,T88902.522,M96839.523,H11650.524,U20659.525,M95678.526,U31215.527,R20804.528

The parser gives me a " processing input error "
Error,com.univocity.parsers.common.TextParsingException: Error processing input: , line=0, char=5542. Content parsed: [null]
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:260)
com.univocity.parsers.common.AbstractParser.parse(AbstractParser.java:127)

It appears that the parseRecord() generates a java ArrayIndexOutOfBoundsException: 512 in the AbstractParser.java

I use a FileInputStream for my reader (in scala)
private def getReader(relativePath: String): java.io.Reader = {
try {
new InputStreamReader(new FileInputStream(relativePath))
} catch {
case e: Exception => {
throw new EngineException("Unable to read input", true)
}
}
}
Please let me know if you would need the csv file for debugging. Thanks!

CsvParser does not report certain serious errors.

CsvParser suppresses certain serious exceptions instead of throwing or returning them to the client. For example, when using a BeanListProcessor, when CsvParser encounters a bean field for which a corresponding column name does not exist in the CSV file, CsvParser creates a TextParsingException containing the following message, but it neither throws this exception nor returns it to the client:

java.lang.IllegalStateException: Could not find field with name 'Account Concurrency Code' in input. Names found: [Report Type, Filer Name, Filer Address, Filer City, Filer State/Province/Region, Filer Country, Filer GIIN, Filer TIN, Sponsored Entity Name, Sponsored Entity Address, Sponsored Entity City, Sponsored Entity State/Province/Region, Sponsored Entity Country, Sponsored Entity GIIN, Sponsored Entity TIN, Account Holder Name, Account Holder Address, Account Holder City, Account Holder State/Province/Region, Account Holder Country, Account Holder TIN, Account Holder Type, Owner Name, Owner Address, Owner City, Owner State/Province/Region, Owner Country, Owner TIN, Account Number, Account Currency Code, Account Balance, Interest, Gross proceeds/Redemptions, Dividends, Other, Pooled Reporting Type, Number of Accounts, Aggregate Payment Amount, Aggregate Account Balance, Pooled Currency Code]

The source of the problem appears to be in the catch block of method AbstractParser#parse(Reader). Though the call to method AbstractParser#handleException(Throwable) returns a TextParsingException which it subsequently passes to method AbstractParser#stopParsing(Throwable), none of these methods actually throw the exception (AbstractParser#stopParsing(Throwable) throws it, but only if it encounters another exception) and AbstractParser#parse(Reader) doesn't return the exception to its client. When such an exception happens, BeanListProcessor#getBeans() also does not fail, but instead returns an empty list. Consequently, CsvParser effectively suppresses serious exceptions without giving the client any clue as to why BeanListProcessor produces an empty result.

Using multiple field delimiters with BeanListProcessor

Our users insist on using any one of ,, ;, |, # and tabs as the delimiting character for text files. This creates a problem for us when using a Univocity CSV parser because the parser accepts only one delimiter at a time (CsvParserSettings.getFormat().setDelimiter(...)). Each file uses only one delimiter at a time (that is, the same file does not contain different delimiters per row or column).

Is there a way to instruct the parser to use any one out of multiple characters as the delimiter by examining the file?

Add Enum native conversion?

Can you please add a native generic enum conversion? I think it's one of the most basic data types and it should be included in the library.

I did try to create my own, but I couldn't make it generically since I can only pass Strings to the @convert annotation args, so I ended creating one conversion class per enum type, which is verbose. I could have passed the class name in the String, I would suppose.

Maybe you should allow an Object[] args for @convert annotation instead of just String[]?

Cannot parse content with comment and already new line separators

Consider the following configuration:

separator: \r\n
normalized line separator: \n

When you read the following file (with already normalized new line symbols)

# this is a comment line\n
A,B,C\n
1,2,3\n

Then you have an exception telling you that it cannot skip one line.

PS: I cannot fix this one, since I cannot run the tests on my machine from some reason.

Problem with CR line separator?

If I have a file with carriage returns as the line separator, like:

a,b,c,d 1,2,3,4 5,6,7,8

I end up with 1 row of 10 elements, rather than 3 rows of 4 (two get swallowed by the CR). For other test files, I get an EOFException. Here is a failing unit test:

  public void testCarriageReturn() throws Exception {
    byte[] bytes = "a,b,c,d\r1,2,3,4\r5,6,7,8".getBytes(StandardCharsets.UTF_8);
    InputStream is = new ByteArrayInputStream(bytes);
    CsvParserSettings settings = new CsvParserSettings();
    settings.getFormat().setLineSeparator("\r");
    //settings.getFormat().setNormalizedNewline('\r');
    CsvParser parser = new CsvParser(settings);
    List<String[]> rows = parser.parseAll(new InputStreamReader(is));
    Assert.assertEquals(rows.size(), 3);
    Assert.assertEquals(rows.get(0).length, 4);
  }

The problem is alleviated by setting the normalizedNewline to the same value as the lineSeparator. This feels a little strange, and certainly seems contrary to what the documentation says. After reading http://docs.univocity.com/parsers/1.0.2/com/univocity/parsers/common/Format.html, I would think setting just lineSeparator to CR would be sufficient for this case. When would one need to specify normalizedNewline?

Am I misunderstanding the purpose of normalizedNewline and lineSeparator, or is this a bug? Or perhaps it's both?

Thanks!

Let users of CsvWriter determine how to handle unquoted values that contain the quote character

When writing a value such as

    A, my "precious" value, B

the quotes are not escaped because the value won't start with a quote and it will therefore be read as-is. This is faster to read and fine for our parser and many others, however it might cause trouble if the consumer of this input is picky. Let's add a configuration to allow the output to be written as:

    A, "my ""precious"" value", B

Column validation mode.

Presently, when BeanListProcessor encounters a data conversion error, it throws an exception and stops processing. The parser could benefit from a validation mode where instead of stopping after it encounters an error, it continues processing and collections all the conversion (or validation) errors. This validation could involve performing a conversion, catching and recording the exception, and continuing to process the input while collecting all the errors in a row. Or, it might test input against regular expression patterns and record all the cells in a row whose values don't match these patterns. The result would be a list of pairs where each pair may contain a set of column errors or a valid bean (i.e. List<Either<Set<Error>, Bean>>) .

Avoid using classes from package java.beans.*;

Apparently Android does not support the classes from this package. A suggestion to use com.googlecode.openbeans has been given in this pull request.

We can probably write some code to not even use the classes there and make everybody happy.

Release of master source code

Hi
There are few fixes such as 'accessing Super class fields in bean conversion', 'Ignoring Parsed annotation if header not specified' are fixed in Master source but not yet released.
Is it possible for you to release next version and by when?

TsvParser skips lines which start with #

When reading a tsv file with the TsvParser it somehow cannot handle lines which start with the # character. We compared the results of the parser with the results of a simple line-wise reading with a FileReader in a BufferedReader. While the BufferedReader list of line contained all lines from the file, the list created by the parseAll method of the TsvParser did not contain an entry for those lines.

Add CsvFieldOrder annotation

The one thing I really liked about this library is the Bean annotation feature! With it, I was able to write a generic toCsv and fromCsv routines similar to how you could use Jackson without passing metadata (unlike SuperCsv). The only quirk I found was declaring the headers to determine which fields to write and in what order. I believe this can be enhanced by introducing a CsvFieldOrder annotation (which I did) at the class level of the Bean.

I did it like this (it's in Scala, but I think you get the idea):

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface CsvFieldOrder {
    String[] fields();
}

    def toCsv[T: Manifest](records: Seq[T], filename: String, headers: Option[Seq[String]] = None)
        (implicit tag: ClassTag[T]): Unit = {

        val cls = tag.runtimeClass.asInstanceOf[Class[T]]
        val processor = new BeanWriterProcessor[T](cls)

        val settings = new CsvWriterSettings
        settings.setRowWriterProcessor(processor)
        headers.map { h =>
            settings.setHeaders(h: _*)
        } getOrElse {
            val order = cls.getAnnotation(classOf[CsvFieldOrder])
            if (order == null) {
                throw new IllegalArgumentException(
                    "CsvFieldOrder annotation not found and headers is none"
                )
            } else {
                settings.setHeaders(order.fields():_*)
            }
        }

        val outputStream = new FileOutputStream(filename)
        val outputWriter = new OutputStreamWriter(outputStream)
        val w = new CsvWriter(outputWriter, settings)
        w.writeHeaders()
        w.processRecordsAndClose(records.asJava)

    }

Provide default parse/write routines for common use cases.

As suggested by mlvn23 I think it would be great if we have a generic toCsv and fromCsv routine that uses some sane default settings that one could use out of the box (or maybe specify some predefined settings like settings.Excel). It would be very handy, and from PR perspective, it's very appealing to serialize / de-serialize in one line of code:

val x = CsvUtils.fromCsv[TradeLogBean]("test.csv")
CsvUtils.toCsv[TradeLogBean](x, "test1.csv")

When BeanLisProcessor fails to convert a cell value, record name of invalid column in TextParsingException.

When the BeanListProcessor fails to convert the value in a row and column, it should record not only the row number, but also the name of the invalid column in TextParsingException. Presently, the column name is not available in TextParsingException.

Though the invalid column index is available in the parser context given to TextParsingException, this column index is zero when the parser throws the exception because by the time the parser begins conversion, it has read all columns in the row and resets the column index to zero. Consequently, the index of the invalid column is no longer available in the parser context when conversion takes place. The invalid column index is available in FieldConversionMapping#applyConversions(int,String), but the IllegalStateException that this methods creates to report the error has no field to hold the invalid column index.

Release version 1.2.0

Release uniVocity 1.2.0 with the latest improvements.

Missing OSGi bundle information in MANIFEST.MF

In order to be able to use univocity-parsers in an OSGi environment, it must provides bundle information in the META-INF/MANIFEST.MF file.

Numeric conversion classes must validate the ParsePosition when using multiple formats

It is also a pain in the ass to work with the decimal formats if we are trying to manually instantiate conversion classes. Review all conversion classes and add a better API for them to allow advanced customization.

*FormattedBigDecimalConversion * :

    FormattedBigDecimalConversion conversion = new FormattedBigDecimalConversion("0.00", "0,00");

    // WTF?
    AnnotationHelper.applyFormatSettings(conversion.getFormatterObjects()[0], new String[]{"decimalSeparator=."});

    // WTF?
    AnnotationHelper.applyFormatSettings(conversion.getFormatterObjects()[1], new String[]{"decimalSeparator=."});

When using such conversion to parse numbers in multiple formats such as in the following pipe-separated input::

    1.99|10.0|2.189\n1,99|10,0|2,189"

As the ParsePosition is not validated by the conversion class, the first conversion is always executed, producing:

    [1.99, 10.0, 2.189]
    [1, 10, 2]

Where the expected output would be:

    [1.99, 10.0, 2.189]
    [1.99, 10.0, 2.189]

Invalid ParsingContext#currentLine() for the last line

It happens when the file has no new line at the end. ParsingContext#currentLine() returns n-1 where n is actual number of lines.

When reusing a parser instance, headers are not extracted after the first run.


public static void main(String... args) {

        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.detectFormatAutomatically();

        parserSettings.setHeaderExtractionEnabled(true);

        CsvParser parser = new CsvParser(parserSettings);

        List<String[]> rows = parser.parseAll(new StringReader("Amount,Tax,Total\n1.99,10.0,2.189\n5,20.0,6"));
        for (Object[] row : rows) {
            System.out.println(Arrays.toString(row));
        }

        rows = parser.parseAll(new StringReader("Amount;Tax;Total\n1,99;10,0;2,189\n5;20,0;6"));
        for (Object[] row : rows) {
            System.out.println(Arrays.toString(row));
        }
    }

Produces:

    [1.99, 10.0, 2.189]
    [5, 20.0, 6]
    [Amount, Tax, Total]  // SHOULD NOT BE HERE
    [1,99, 10,0, 2,189]
    [5, 20,0, 6]

Process quote escape sequences on unquoted CSV values

Introduce a new configuration option to allow processing of escape sequences on unquoted CSV values.

Currently, if the parser finds a unquoted value, the characters will be read as-is, so if you have:

A""BC -> the value will be read as A""BC

By introducing support for escape processing, the result should be A"BC instead

Requesting to allow for autodetecting the line separators based on a file to be parsed

Hi,

We are switching from opencsv parser to univocity cvs + tsv parsers.
As of 1.3.2 version, there is no way to detect the line end character for files and thus automatically choose a line separator that would commonly work for all our cvs and tsv files independent of the OS. We would like to put a feature request to enable univocity parsers, given a file, to auto detect line terminator format and enable parsing.

Thanks!

Allow parsing quoted CSV fields without applying quote escape

Some users reported the need to parsing/writing quoted CSV values as-is. For example:

when parsing: if an escape sequence appears inside a quoted value, it should not be replaced by the quote character. The original content should be returned
when writing: if a value is quoted and it contains a quote character it should not be escaped - we assume the user is providing an escaped value here otherwise this will produce invalid CSV.

BeanListProcessor cannot read fields selectively

It looks like there's something missing on the read path of the BeanListProcessor. Using the same Bean:

    @BeanInfo
    @Headers(sequence = Array(
        "id", "timestamp", "symbol", "quantity", "isComplete", "datetime", "number"
    ))
    class BasicTypes() extends ToJsonString {

        @Parsed var id: Int = _
        @Parsed var quantity: Double = _
        @Parsed var timestamp: Long = _
        @Parsed var symbol: String = _
        @Parsed var isComplete: Boolean = _
        @Parsed var number: Numerals = _

        @Parsed
        @Convert(conversionClass = classOf[DateOptTimeConversion])
        var datetime: DateTime = _

        override def toString: String = toJsonString

    }

I have this partial csv:

        "quantity,symbol,id\r\n" +
        "23.4,IBM,1\r\n" +
        "9.55,WMT,9\r\n" +
        "79.7,the quick brown fox,100\r\n"

And this is my reader:

    def readCsv[T: Manifest](reader: Reader)
        (implicit tag: ClassTag[T]): Seq[T] = {

        val cls = tag.runtimeClass.asInstanceOf[Class[T]]
        val processor = new BeanListProcessor[T](cls)
        //val processor = new RowListProcessor()

        val settings = new CsvParserSettings
        settings.setRowProcessor(processor)
        settings.setHeaderExtractionEnabled(true)
        //settings.setHeaders("quantity")
        //settings.selectFields("quantity")

        val parser = new CsvParser(settings)
        parser.parse(reader)

        //processor.getBeans().asScala

        println(processor.getBeans())

        Seq.empty

    }

I'm getting this error:

[info]   com.univocity.parsers.common.DataProcessingException: Unexpected error processing input row [23.4, IBM, 1] using RowProcessor com.univocity.parsers.common.processor.BeanListProcessor.
[info] Internal state when error was thrown: line=2, charIndex=33, headers=[quantity, symbol, id], row=[23.4, IBM, 1]
[info]   at com.univocity.parsers.common.AbstractParser.rowProcessed(AbstractParser.java:479)
[info]   at com.univocity.parsers.common.AbstractParser.parse(AbstractParser.java:93)
[info]   at com.extrategic.util.csv.univocity.ScalaCsv$.readCsv(ScalaCsv.scala:31)
[info]   at com.extrategic.util.csv.univocity.ScalaCsv$.fromCsvStr(ScalaCsv.scala:66)
[info]   at com.extrategic.util.csv.univocity.ScalaCsvTest$$anonfun$5.apply$mcV$sp(ScalaCsvTest.scala:77)
[info]   at com.extrategic.util.csv.univocity.ScalaCsvTest$$anonfun$5.apply(ScalaCsvTest.scala:76)
[info]   at com.extrategic.util.csv.univocity.ScalaCsvTest$$anonfun$5.apply(ScalaCsvTest.scala:76)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   ...
[info]   Cause: java.lang.IllegalStateException: Unknown field names: [timestamp]. Available fields are: [quantity, symbol, id]
[info]   at com.univocity.parsers.common.fields.FieldNameSelector.getFieldIndexes(FieldNameSelector.java:54)
[info]   at com.univocity.parsers.common.fields.AbstractConversionMapping.prepareExecution(FieldConversionMapping.java:283)
[info]   at com.univocity.parsers.common.fields.FieldConversionMapping.prepareExecution(FieldConversionMapping.java:96)
[info]   at com.univocity.parsers.common.ConversionProcessor.initializeConversions(ConversionProcessor.java:100)
[info]   at com.univocity.parsers.common.ConversionProcessor.applyConversions(ConversionProcessor.java:127)
[info]   at com.univocity.parsers.common.processor.BeanConversionProcessor.createBean(BeanConversionProcessor.java:365)
[info]   at com.univocity.parsers.common.processor.BeanProcessor.createBean(BeanProcessor.java:36)
[info]   at com.univocity.parsers.common.processor.BeanProcessor.rowProcessed(BeanProcessor.java:51)
[info]   at com.univocity.parsers.common.AbstractParser.rowProcessed(AbstractParser.java:471)
[info]   at com.univocity.parsers.common.AbstractParser.parse(AbstractParser.java:93)

I understand that it's desirable to always get a fully formed bean most of the time, but maybe we can add an option that says "partial" is okay? I tried the selectFields / setHeaders, but it didn't help. Anyway, even if it did, I think it's still better if the parser could "auto-detect" the columns.

My use-case is for stock / commodity prices. Format could be:

[date,close]
[date,time,close]
[date,open,high,low,close,volume]
[date,open,high,low,close,volume, openInt]
[timestamp,open,high,low,close,volume]
[timestamp,open,high,low,close,volume,openInt]

I was hoping I could define one bean to cover all of them.

Thanks again in advance.

Selective header / fields conflict?

Thanks for this great library! It's very flexible so I continue to come back to it.

I'm currently using 1.6.0-SNAPSHOT, and I've encountered an issue when using the setHeaders / selectFields.

I have a test bean like this (in Scala):

    @BeanInfo
    @Headers(sequence = Array(
        "id", "timestamp", "symbol", "quantity", "isComplete", "datetime", "number"
    ))
    class BasicTypes() extends ToJsonString {

        @Parsed var id: Int = _
        @Parsed var quantity: Double = _
        @Parsed var timestamp: Long = _
        @Parsed var symbol: String = _
        @Parsed var isComplete: Boolean = _
        @Parsed var number: Numerals = _

        @Parsed
        @Convert(conversionClass = classOf[DateOptTimeConversion])
        var datetime: DateTime = _

        override def toString: String = toJsonString

    }

And my writer is like this:

    def writeCsv[T: Manifest](
        records: Seq[T], writer: Writer, headers: Option[Seq[String]] = None)
        (implicit tag: ClassTag[T]): Unit = {

        val cls = tag.runtimeClass.asInstanceOf[Class[T]]
        val processor = new BeanWriterProcessor[T](cls)

        val settings = new CsvWriterSettings
        settings.setRowWriterProcessor(processor)
        headers.foreach { h =>
            println("XXX=", h)
            //settings.setHeaders(h: _*)
            //settings.setHeaders("id")
            //settings.selectFields("timestamp")
        }

        val w = new CsvWriter(writer, settings)
        w.writeHeaders()
        w.processRecordsAndClose(records.asJava)

    }

If I don't override the setHeaders and selectFields, then it works.
If I selective pick fields in a different order (e.g., quantity, symbol, number), then I would get an error.

Calling settings.selectFields("id") works:

id,timestamp,symbol,quantity,isComplete,datetime,number
1,,,,,,
9,,,,,,
100,,,,,,

Calling settings.selectFields("timestamp") doesn't work:

[info]   com.univocity.parsers.common.DataProcessingException: Error processing data conversions
[info] Internal state when error was thrown: charIndex=0, row=[null], columnIndex=0
[info]   at com.univocity.parsers.common.ConversionProcessor.handleConversionError(ConversionProcessor.java:206)
[info]   at com.univocity.parsers.common.ConversionProcessor.reverseConversions(ConversionProcessor.java:192)
[info]   at com.univocity.parsers.common.processor.BeanConversionProcessor.reverseConversions(BeanConversionProcessor.java:373)
[info]   at com.univocity.parsers.common.processor.BeanWriterProcessor.write(BeanWriterProcessor.java:57)
[info]   at com.univocity.parsers.common.AbstractWriter.processRecord(AbstractWriter.java:390)
[info]   at com.univocity.parsers.common.AbstractWriter.processRecords(AbstractWriter.java:345)
[info]   at com.univocity.parsers.common.AbstractWriter.processRecordsAndClose(AbstractWriter.java:316)
[info]   at com.extrategic.util.csv.univocity.ScalaCsv$.writeCsv(ScalaCsv.scala:52)
[info]   at com.extrategic.util.csv.univocity.ScalaCsv$.toCsvStr(ScalaCsv.scala:68)
[info]   at com.extrategic.util.csv.univocity.ScalaCsvTest$$anonfun$3.apply$mcV$sp(ScalaCsvTest.scala:55)
[info]   ...
[info]   Cause: java.lang.ArrayIndexOutOfBoundsException: 1
[info]   at com.univocity.parsers.common.ConversionProcessor.reverseConversions(ConversionProcessor.java:188)
[info]   at com.univocity.parsers.common.processor.BeanConversionProcessor.reverseConversions(BeanConversionProcessor.java:373)
[info]   at com.univocity.parsers.common.processor.BeanWriterProcessor.write(BeanWriterProcessor.java:57)
[info]   at com.univocity.parsers.common.AbstractWriter.processRecord(AbstractWriter.java:390)
[info]   at com.univocity.parsers.common.AbstractWriter.processRecords(AbstractWriter.java:345)
[info]   at com.univocity.parsers.common.AbstractWriter.processRecordsAndClose(AbstractWriter.java:316)
[info]   at com.extrategic.util.csv.univocity.ScalaCsv$.writeCsv(ScalaCsv.scala:52)
[info]   at com.extrategic.util.csv.univocity.ScalaCsv$.toCsvStr(ScalaCsv.scala:68)
[info]   at com.extrategic.util.csv.univocity.ScalaCsvTest$$anonfun$3.apply$mcV$sp(ScalaCsvTest.scala:55)

Calling setHeaders("id") doesn't work:

[info]   java.lang.ArrayIndexOutOfBoundsException: -1
[info]   at com.univocity.parsers.common.processor.BeanConversionProcessor.mapFieldIndexes(BeanConversionProcessor.java:244)
[info]   at com.univocity.parsers.common.processor.BeanConversionProcessor.mapFieldsToValues(BeanConversionProcessor.java:326)
[info]   at com.univocity.parsers.common.processor.BeanConversionProcessor.reverseConversions(BeanConversionProcessor.java:361)
[info]   at com.univocity.parsers.common.processor.BeanWriterProcessor.write(BeanWriterProcessor.java:57)
[info]   at com.univocity.parsers.common.AbstractWriter.processRecord(AbstractWriter.java:390)
[info]   at com.univocity.parsers.common.AbstractWriter.processRecords(AbstractWriter.java:345)
[info]   at com.univocity.parsers.common.AbstractWriter.processRecordsAndClose(AbstractWriter.java:316)
[info]   at com.extrategic.util.csv.univocity.ScalaCsv$.writeCsv(ScalaCsv.scala:52)
[info]   at com.extrategic.util.csv.univocity.ScalaCsv$.toCsvStr(ScalaCsv.scala:68)
[info]   at com.extrategic.util.csv.univocity.ScalaCsvTest$$anonfun$3.apply$mcV$sp(ScalaCsvTest.scala:55)

Introduce support for inputs with different row formats

For example, fixed-width input with a master-detail format:

N#123123 1888858    58888548
111222       3000FOO                               10
333444       2000BAR                              60

Or a CSV with bank accounts of clients, formatted like this:

Client, 1, Foo
Account,  23234, HSBC, 123433-000, HSBCAUS
Account,  11234, HSBC, 222343-130, HSBCCAD
Client, 2, BAR
Account,  1234, CITI, 213343-130, CITICAD

The input should be parsed in a single pass.

Allow fields parsed by BeanListProcessor to be marked as optional

If a field in a Java class is annotated with @Parsed but does not have a corresponding column in the CSV file, BeanListProcessor fails with the exception:

com.univocity.parsers.common.TextParsingException: Error processing input: java.lang.IllegalStateException - Could not find field with name

We have an existing code base where the same Java class is used to read information for multiple similar entity types and this error is causing problems when the user provides CSV files that do not have columns corresponding to all the fields in the Java class.

It will be useful to mark a field as optional so that the parser does not raise an exception if the CSV file does not contain a column corresponding to the field. The easiest option would be to annotate the field as @Parsed(optional = true).

excludeIndexes can exclude too much

Consider the following csv:

column 1,column 2,column 3
first,second,third,fourth
1,2,3,4

If I call settings.excludeIndexes(1); (and settings.setColumnReorderingEnabled(false);), nothing beyond column 3 gets parsed. Peeking through the code, I see it turns the exclusion list into an inclusion list, assuming that there is a first row with column headings. My files, sadly, are not so nicely constructed.

(Nice library, btw. I've found it very easy to integrate into my own project; I'm just hoping to squeeze some more performance out.)

Allow configurable handling of normalization/denormalization of line separators in CSV

While the default handling of line separators works for the general case, in some situations it might be needed to never touch the line endings inside quoted values of CSV data, otherwise we are at risk of changing the original data and mess up the results. Parsing perfectly valid inputs such as the following:


1, Value 1, " something in one line \n something in another line "\r
2, Value 2, " something else in one line \r\n something in another line "\r

is perfectly valid. Line endings (for records) is '\r' and whatever is inside the quoted values should remain as they are.

To hande this, we need to introduce configuration options for both CSV parser and writer, allowing users to switch off the default normalization of line separators inside quoted values.

Add support for text alignment when writing fixed-width files

Allow users to determine the position of the text (right or left) when writing columns of a fixed-width output.

Default Boolean parsing should be case insensitive

It took me an hour to figure this out, so I thought it might be a good idea to make the default Boolean parser to be case insensitive.

At first, I tried to create my own case-insensitive MyBooleanConversion, but apparently the library would fail because it's trying to apply its own BooleanConversion to the output of MyBooleanConversion, which is no longer a String:

Caused by: java.lang.IllegalStateException: Error converting value 'true' using conversion com.univocity.parsers.conversions.BooleanConversion
        at com.univocity.parsers.common.fields.FieldConversionMapping.applyConversions(FieldConversionMapping.java:153)
        at com.univocity.parsers.common.processor.ConversionProcessor.applyConversions(ConversionProcessor.java:129)
        at com.univocity.parsers.common.processor.BeanConversionProcessor.createBean(BeanConversionProcessor.java:283)
        at com.univocity.parsers.common.processor.BeanProcessor.rowProcessed(BeanProcessor.java:51)
        at com.univocity.parsers.common.AbstractParser.parse(AbstractParser.java:89)
        at CsvUtils$.fromCsv(CsvUtils.scala:24)
        at UnivocityTest$.delayedEndpoint$UnivocityTest$1(UnivocityTest.scala:5)
        at UnivocityTest$delayedInit$body.apply(UnivocityTest.scala:3)
        at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
        at scala.App$class.main(App.scala:76)
        at UnivocityTest$.main(UnivocityTest.scala:3)
        at UnivocityTest.main(UnivocityTest.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
Caused by: java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.lang.String
        at com.univocity.parsers.conversions.ObjectConversion.execute(ObjectConversion.java:29)
        at com.univocity.parsers.common.fields.FieldConversionMapping.applyConversions(FieldConversionMapping.java:151)
        at com.univocity.parsers.common.processor.ConversionProcessor.applyConversions(ConversionProcessor.java:129)
        at com.univocity.parsers.common.processor.BeanConversionProcessor.createBean(BeanConversionProcessor.java:283)

I finally figured out that there's couple of options: use @lowercase or enumerate them using @booleanstring or BooleanConversion, etc. But I think it's more intuitive if the library can handle it without any annotations except for @parsed. IIRC, this is how Jackson behaves.

convertAll is not working

I have the following code to convert "T" to 1.0 and "F" to 0.0. However, convertAll method does not convert the values but convertFields does.

            CsvParserSettings parserSettings = new CsvParserSettings();
            parserSettings.setLineSeparatorDetectionEnabled(true);
            parserSettings.setHeaderExtractionEnabled(true);

            ObjectColumnProcessor objectColumnProcessor = new ObjectColumnProcessor();
            BooleanStringDoubleConversion conversion = new BooleanStringDoubleConversion();
            conversion.setTrueValues(new String[]{"T"});
            conversion.setFalseValues(new String[]{"F"});
//            objectColumnProcessor.convertFields(conversion).set("G56898");
            objectColumnProcessor.convertAll(conversion);

            parserSettings.setRowProcessor(objectColumnProcessor);

            CsvParser parser = new CsvParser(parserSettings);
            parser.parse(new FileReader("test.csv"));
            Map<String, List<Object>> columnValues = objectColumnProcessor.getColumnValuesAsMapOfNames();
            System.out.println(columnValues.keySet());
            for (Map.Entry<String, List<Object>> entry : columnValues.entrySet()) {
                String key = entry.getKey();
                List<Object> value = entry.getValue();
                System.out.println(key + "\t" + value);
            }

CsvWriter should escape the quote escape if it appears before a quote

Assuming:

Value to be written: A\"
Quote escape: \
Quote: "

With escape of quote escape character: \
- Expected output: [A\\\"]
Without escape of quote escape character:
- Try to write without quoting the value (provide another configuration option to allow this). If quotes are required then throw an error. We could write "A\"", but then the parsed value will be A" (lost the \).
If writing values as-is (#20)
- Expected output: "A\""

Concise documentation please

Hi!
Nice library you have here.
Been going through the intro articles and sample code, needed a quick solution to a data import problem. I bumped on this line : getReader("/examples/example.csv")
It's all over the examples and no code reference shown to explain exactly what it does.
You know there's only so much time to explore the myriad frameworks and libraries out there, so...I'm just thinking maybe a little clarification on the examples will make it look less confusing.

My 2 cents.
Thanks.

NullPointer processing records without headers

This code throws an exception because the RowWriterProcessor expects a list of headers.

        CsvWriterSettings writerSettings = new CsvWriterSettings();

        ObjectRowWriterProcessor writerProcessor = new ObjectRowWriterProcessor();
        writerProcessor.convertAll(Conversions.toBoolean("T", "F")); // will write "T" and "F" instead of "true" and "false"

        writerSettings.setRowWriterProcessor(writerProcessor);

        CsvWriter writer = new CsvWriter(writerSettings);
        String line1 = writer.processRecordToString(true, false, false, true);

This should work fine instead and produce the following output:

        T,F,F,T
        F,F,T,T

When parsing, allow creation of multiple types of Java beans

Based on (BeanConversionProcessor)[https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/processor/BeanConversionProcessor.java], allow the user to define multiple types of java beans and provide a selection criteria to determine which instance to create based on the parsed rows.

Extend AbstractWriter to support maps

It's not uncommon to have data to be written stored in maps. Add support for writing key-pair maps (each key -> one value) and for column lists (each key -> collection of values).

Allow mapping each key to be mapped to a header, as maps can have all sorts of keys and not only Strings.

Error parsing a TSV file with error in the length of the parsed input

This is the following error I get in 1.3.2. How can I change parser settings?
com.univocity.parsers.common.TextParsingException: Error processing input: Length of parsed input (4097) exceeds the maximum number of characters defined in your parser settings (4096). , line=1, char=4675. Content parsed: ["{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees in its crystal ball The predictions are part of an annual tradition for the Armonk New York based company which surveys its 3 000 researchers to find five ideas expected to take root in the next five years IBM the world s largest provider of computer services looks to Silicon Valley for input gleaning many ideas from its Almaden research center in San Jose California Holographic conversations projected from mobile phones lead this year s list The predictions also include air breathing batteries computer programs that can tell when and where traffic jams will take place environmental information generated by sensors in cars and phones and cities powered by the heat thrown off by computer servers These are all stretch goals and that s good said Paul Saffo managing director of foresight at the investment advisory firm Discern in San Francisco In an era when pessimism is the new black a little dose of technological optimism is not a bad thing For IBM it s not just idle speculation The company is one of the few big corporations investing in long range research projects and it counts on innovation to fuel growth Saffo said Not all of its predictions pan out though IBM was overly optimistic about the spread of speech technology for instance When the ideas do lead to products they can have broad implications for society as well as IBM s bottom line he said Research Spending They have continued to do research when all the other grand research organizations are gone said Saffo who is also a consulting associate professor at Stanford University IBM invested 5 8 billion in research and development last year 6 1 percent of revenue While that s down from about 10 percent in the early 1990s the company spends a bigger share on research than its computing rivals Hewlett Packard Co the top maker of personal computers spent 2 4 percent last year At Almaden scientists work on projects that don t always fit in with IBM s computer business The lab s research includes efforts to develop an electric car battery that runs 500 miles on one charge a filtration system for desalination and a program that shows changes in geographic data IBM rose 9 cents to 146 04 at 11 02 a m in New York Stock Exchange composite trading The stock had gained 11 percent this year before today Citizen Science The list is meant to give a window into the company s innovation engine said Josephine Cheng a vice president at IBM s Almaden lab All this demonstrates a real culture of innovation at IBM and willingness to devote itself to solving some of the world s biggest problems she said Many of the predictions are based on projects that IBM has in the works One of this year s ideas that sensors in cars wallets and personal devices will give scientists better data about the environment is an expansion of the company s citizen science initiative Earlier this year IBM teamed up with the California State Water Resources Control Board and the City of San Jose Environmental Services to help gather information about waterways Researchers from Almaden created an application that lets smartphone users snap photos of streams and creeks and report back on conditions The hope is that these casual observations will help local and state officials who don t have the resources to do the work themselves Traffic Predictors IBM also sees data helping shorten commutes in the next five years Computer programs will use algorithms and real time traffic infor]
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:260)

Escape \ followed by \n or \r to escape newlines in TSV

Some TSV inputs might come escaped with '\' and '\n' instead of '\' and 'n' to represent line separators.

Update the TSV parser to handle this case via configuration.

Question on using univocity parsers within spark

Hi,
This is just a quick question for spark RDDs? Currently we are able to process a csv file with opencsv in the following manner. Can we do the same thing with CsvParser / TsvParser?

var rdd: RDD[String] = sparkcontext.textFile(filepath)
.........
.........
rdd.mapPartitions(lines => {
val parser = new CSVParser(rowSeparator)
lines.map(line => {
val row = parser.parseLine(line).map(o => Option(o).map(_.trim).getOrElse("")
...........
row
})
})

Support auto detect of column seperator

Please support auto detect of columns separator.

The result will also return the detected separator type

Thanks!

CsvWriter doesn't escape quoted value

Problem

The CsvWriter doesn't quote a column which contains a quote character ".
A single quote as value should be imho coded as """"

Details

Please see the code snippet below. There are 3 columns on the input:

Col1	Col 2	Col 3
Quote	"	Value with quote "

The output line looks like:

Quote,",Value with quote "

Which is then parsed to:

Col1	Col 2
Quote	,Value with quote

Code snippet

        Object[] data = new Object[] {"Quote", "\"", "Value with quote\""};

        File file = File.createTempFile("Test", "csv");
        CsvWriter writer = new CsvWriter(new FileWriter(file), new CsvWriterSettings());
        writer.writeRow(data);
        writer.close();

        List<String> lines = Files.readAllLines(file.toPath(), Charset.defaultCharset());
        for(String line : lines) {
            System.out.println("Output line:" + line);
        }

        System.out.println("Parsed data:");
        CsvParser reader = new CsvParser(new CsvParserSettings());
        List<String[]> parsedData = reader.parseAll(new FileReader(file));
        for(String[] line : parsedData) {
            for (int i = 0; i < line.length; i++) {
                System.out.println("Col" + i + ":" + line[i]);
            }
        }

The snippet is quick&dirty but serves the purpose ;-)

Assign default conversions to input types when writing rows of Object

Update the ObjectRowWriterProcessor and let the user assign a sequence of Conversion to a Class<?>. When writing Objects with the processRecord methods, if there is no conversion sequence applied a the specific column, the conversion associated to a type will be applied.

We'd also need to allow the user to prevent this default conversion by type to be applied to a specific set of columns.

When parsing using `parseNext()`, the `RowProcessor` is not called

If you want to control when you want to read a line and thus use the parseNext() method, the RowProcessoris not called

CSV cannot handle Tabluator '\t' as delimiter when using quote

The CsvParser cannot parse this file correc (tabulator as delimiter):

id title price
"1" "hallo" "3,56"

Result is one single value:
1" "hallo" "3,56

We already wrote in fix for it in version 1.5.5 by changing line 134 and 144 in CsvParser.java and testing also if the char is not the delimiter:

if (ch != newLine && ch <= ' ' && ch != delimiter) {
whitespaceAppender.reset();
do {
//saves whitespaces after value
whitespaceAppender.append(ch);
ch = input.nextChar();
//found a new line, go to next record.
if (ch == newLine) {
return;
}
} while (ch <= ' ' && ch != delimiter);

Process Bean iteratively

How can I process CSV file as big as 5GB (I'm processing NPI data http://download.cms.gov/nppes/NPI_Files.html) with BeanProcessor. You have an example that loads all data into a memory. Is it possible to process it iteratively without loading the whole file into a memory first?

CSV Parsing Optional Values

I am using this tool for parsing an uploaded csv file. I am using the BeanListProcessor to get the Java beans. I am mapping the values in the csv file with the bean fields by indexes. If the user does not specify a value that is optional (which is/are in the end), I get a TextParsingException and ArrayIndexOutOfBoundsException. Is there a solution for this that I am not aware of for this?

Add writeToString methods to return formatted Strings

Allow users to generate formatted Strings (instead of writing to a java.io.Writer) by providing a range of special writeToString() methods.

Long values does not convert

If I change the data type to Integer, it would convert. It looks like you missed it in AnnotationHelper. I tried using the @convert annotation, but it didn't work because it's looking for a construction with a String... arguments, but LongConversion has none.

Add support for inherited fields

I have a class hierarchy such as the following:

abstract class Named {
  @Parsed(field = "Code")
  private String code;
  @Parsed(field = "Name")
  private String name;
}

class Department extends Named {}

class Employee extends Named {}

When I use BeanListProcessor with a CsvParser, the parser fails to read both the fields from the parent class (Named). I can see that this is because of the line Field[] declared = clazz.getDeclaredFields(); in the initialize method of the BeanConversionProcessor class. Due to this line, the CSV parsing code only considers fields declared immediately inside the class being parsed.

I am not sure if this is intended. I certainly feel this behaviour is incorrect as we cannot use class inheritance if inherited fields cannot be read by the parser.

Is there a way to read inherited fields? If not, can this be added?

univocity / univocity-parsers Goto Github PK

univocity-parsers's People

Contributors

Stargazers

Watchers

Forkers

univocity-parsers's Issues

Problem

Details

Code snippet

Recommend Projects

Recommend Topics

Recommend Org