
Comments (3)

osiegmar commented on September 16, 2024

I couldn't find the normative specification of the TPC-H data format. According to the dbgen tool, these are ASCII files containing records that, by default, are separated by a pipe character (|) and terminated by a line-feed character (\n). Several examples are shown in the answers directory.

While this format is not CSV, its similarity should be sufficient for FastCSV to easily read and write such files by configuring the field separator to | (CsvReader.builder().fieldSeparator('|') / CsvWriter.builder().fieldSeparator('|')). If, for any reason, you need to add a field separator at the end of each line when writing such files, simply add one more null field to the record. When reading such files, you could ignore the last field in each record.
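To illustrate the trailing-separator point, here is a minimal sketch in plain JDK code. It deliberately does not use FastCSV; the class and method names (TrailingSeparatorDemo, parseLine, formatLine) are made up for this example, and the naive split handles no quoting or escaping at all. It only shows why a line ending in | yields one extra empty field on reading, and how appending one empty field produces that trailing | on writing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of FastCSV or dbgen.
class TrailingSeparatorDemo {

    // Splits a pipe-separated line and drops the empty field
    // that a trailing '|' produces. No quoting support.
    static List<String> parseLine(String line) {
        // limit -1 keeps trailing empty strings, so "a|b|" -> ["a", "b", ""]
        List<String> fields = new ArrayList<>(Arrays.asList(line.split("\\|", -1)));
        int last = fields.size() - 1;
        if (last > 0 && fields.get(last).isEmpty()) {
            fields.remove(last); // ignore the field after the trailing '|'
        }
        return fields;
    }

    // Joins fields with '|' and appends a trailing '|',
    // which is the same as writing one extra empty field.
    static String formatLine(List<String> fields) {
        return String.join("|", fields) + "|";
    }

    public static void main(String[] args) {
        List<String> fields = parseLine("1|foo|bar|");
        System.out.println(fields);             // [1, foo, bar]
        System.out.println(formatLine(fields)); // 1|foo|bar|
    }
}
```

With FastCSV the same idea applies after setting fieldSeparator('|'): drop the last field of each record when reading, and add one empty field per record when writing.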

You may encounter problems when the data itself contains the field separator |, newline characters, or quotation marks. But the examples I have seen seem not to use these characters.

from fastcsv.

IIvm commented on September 16, 2024

Thanks for your feedback! But I think there are still some use cases that need a record delimiter; Snowflake and MySQL, for example, both support self-defined record delimiters.

ref: https://docs.snowflake.com/en/sql-reference/sql/create-file-format

RECORD_DELIMITER = 'character' | NONE
Use
Data loading, data unloading, and external tables

Definition
One or more singlebyte or multibyte characters that separate records in an input file (data loading) or unloaded file (data unloading). Accepts common escape sequences or the following singlebyte or multibyte characters:

Singlebyte characters
Octal values (prefixed by \\) or hex values (prefixed by 0x or \x). For example, for records delimited by the circumflex accent (^) character, specify the octal (\\136) or hex (0x5e) value.

Multibyte characters
Hex values (prefixed by \x). For example, for records delimited by the cent (¢) character, specify the hex (\xC2\xA2) value.

The delimiter for RECORD_DELIMITER or FIELD_DELIMITER cannot be a substring of the delimiter for the other file format option (e.g. FIELD_DELIMITER = 'aa' RECORD_DELIMITER = 'aabb').

Is there any way I can implement this feature with FastCSV without the self-defined record delimiter support?


osiegmar commented on September 16, 2024

Is there any way I can implement this feature with FastCSV without the self-defined record delimiter support?

To make use of custom line/record delimiters with FastCSV, you may create an implementation of java.io.Reader that replaces the record delimiter with the standard line-feed character. Then, pass this customized Reader to the CsvReader. Similarly, achieve the same for the CsvWriter by implementing a custom java.io.Writer that replaces the line-feed character with the record delimiter.
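A minimal sketch of the reading side, under a few assumptions: the record delimiter is a single char, it never occurs inside field data, and the data contains no line-feed characters of its own. The names RecordDelimiterDemo, DelimiterReplacingReader and normalize are hypothetical and not part of FastCSV; only java.io classes are used. A multi-character delimiter would need buffering across read calls, which this sketch does not attempt.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of FastCSV.
class RecordDelimiterDemo {

    // Wraps another Reader and maps one delimiter char to a line feed.
    static class DelimiterReplacingReader extends FilterReader {
        private final char recordDelimiter;

        DelimiterReplacingReader(Reader in, char recordDelimiter) {
            super(in);
            this.recordDelimiter = recordDelimiter;
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            return c == recordDelimiter ? '\n' : c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            // When n is -1 (end of stream) the loop body never runs.
            for (int i = off; i < off + n; i++) {
                if (buf[i] == recordDelimiter) {
                    buf[i] = '\n';
                }
            }
            return n;
        }
    }

    // Reads the whole input through the wrapper; for demonstration only.
    static String normalize(String input, char recordDelimiter) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new DelimiterReplacingReader(new StringReader(input), recordDelimiter)) {
            char[] buf = new char[4]; // small buffer to exercise several reads
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Records delimited by '^' come out delimited by line feeds.
        System.out.print(normalize("1|foo^2|bar^", '^'));
    }
}
```

In real use, the wrapped Reader would be passed to the CsvReader instead of being drained into a string. The writing side is symmetric: a java.io.FilterWriter subclass that replaces each line feed with the record delimiter before delegating.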

But I think there are still some use cases that need a record delimiter; Snowflake and MySQL, for example, both support self-defined record delimiters.

The mere presence of this feature in other implementations does not justify its inclusion in FastCSV. Could you share a concrete use case where this feature would be required in the context of CSV (which is not the case for TPC-H data)? Preferably something with a normative specification.

Currently, I don't see how this feature aligns with the goals of FastCSV.
