csv-validation-sandbox's Introduction

CSV Validation Sandbox

Sandbox to work out and demonstrate a workflow for validation of CSVs which may have multiline fields, to be contributed to fsspec-reference-maker and ultimately to dask.

This repo contains tests (which can be run with pytest tests/):

  • validate_rows_test.py, covering the same test cases as the original file, but formally in pytest
    • deprecated/original_validate_rows.py, the original file used to outline the validation edge cases and expected behaviour
  • pandas_nan_validation_test.py builds on the previous test but removes the awkward NaN behaviour, so that absent fields can be detected during validation.
  • lineterm_support_test.py, a pytest suite demonstrating that the lineterminator argument to csv.DictReader does nothing, while the equivalent argument to pandas works as expected.

For implementation purposes, the pandas_nan_validation_test.py module is the 'end result'. It contains a make_df and a validate_df function, chained together through a validate_str function that takes sample_colnames, the column names provided from a sample DataFrame. (N.B.: in fact only the number of these columns is necessary.)
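A rough sketch of how that chaining might look is below (the real module's signatures and bodies may well differ; everything other than the names make_df, validate_df, validate_str and sample_colnames is an illustrative assumption):

import io

import pandas as pd


def make_df(csv_str: str, n_cols: int) -> pd.DataFrame:
    """Parse a headerless CSV chunk without coercing absent fields (sketch only)."""
    converters = dict.fromkeys(range(n_cols), lambda value: value)  # identity: keep None as None
    return pd.read_csv(
        io.StringIO(csv_str), engine="python", na_filter=False, header=None, converters=converters
    )


def validate_df(df: pd.DataFrame) -> bool:
    """A chunk is invalid if any of its fields was absent (came through as None)."""
    return not df.isnull().values.any()


def validate_str(csv_str: str, sample_colnames: list[str]) -> bool:
    # only the number of sample columns is needed, as noted above
    return validate_df(make_df(csv_str, len(sample_colnames)))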

Some of the tests use an editable installation of fsspec-reference-maker:

  • fsspec-offsets-calc.py has cases showing the expected output for various block sizes relative to the row size of a CSV, with commented-out lines in the expected value (and code comments) showing the 'working', so to speak.

csv-validation-sandbox's People

Contributors

lmmx

Forkers

frostpuppet

csv-validation-sandbox's Issues

Python csv module does not support custom lineterminator

Pandas uses a C engine (rather than Python) to parse CSVs and supports a custom lineterminator, whereas (as shown by lineterm_support_test.py) Python's csv.reader (which csv.DictReader builds on) doesn't: changing the lineterminator does nothing, and parsing fails.

import csv
import io

from pytest import mark


@mark.parametrize(
    "sample_str,sep,linesep,expected",
    [("a b c~1 2 3~4 5 6~7 8 9", " ", "~", [])],
)
def test_no_lineterm_support(sample_str, sep, linesep, expected):
    """
    Show that the csv module can't read a CSV with a non-standard lineterminator (a tilde):
    enumerating the rows of the reader gives an empty list, i.e. nothing is read.
    """
    r = csv.DictReader(io.StringIO(sample_str), delimiter=sep, lineterminator=linesep)
    assert list(r) == expected
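For contrast, a minimal sketch (not from the repo's test suite) showing that pd.read_csv, via its default C engine, does honour a single-character lineterminator on the same sample string:

import io

import pandas as pd

# The same tilde-terminated sample parses into a 3x3 frame when the C engine
# is told that "~" terminates lines.
sample_str = "a b c~1 2 3~4 5 6~7 8 9"
df = pd.read_csv(io.StringIO(sample_str), sep=" ", lineterminator="~")
assert list(df.columns) == ["a", "b", "c"]
assert df.shape == (3, 3)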

It's documented:

Note
The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future.

This either:

  • prevents use of the csv module in a solution
  • limits the use of the solution to those CSVs which do not use a custom line separator.

The latter does not feel satisfactory.

I'd need to review my notes to recall why I wanted to use the csv module over pandas (I don't think pandas gave the "leftover lineterminator" that indicated an 'unclosed' field value).


Summary: must use pd.read_csv, despite it not indicating [by a lineterminator at the end of a value] whether a field is "open" in the way the csv module does.
New problem: how can pd.read_csv detect an "open" field value?

  • If the new problem can't be solved: an unsatisfying 'criterion' is needed for the solution (only CSVs whose lineterminator is \n or \r, i.e. no custom lineterminator)

Decide how to handle block size cases as offsets and how to key

While not all of them are expected in practice, the algorithm must handle the cases where partition sizes are not 1-to-1 with the evenly spaced offsets they 'came from':

  • block size is smaller than 1 row length
  • ... larger than multiple rows' size

The trivial 1-to-1 case is unlikely in the real world.

Again I'm finding myself making manual edits to files, when this should be structured as a test suite and I should determine the desired inputs and outputs for each of these cases.

My initial approach was to just copy the regular format of the HDF5 example, where the 'chunk key' was just the evenly spaced offset. I presumed this would make it easy to look up what a given offset 'before' became 'after' the "seek to next row-terminating newline" process.

In fact I now suspect that this is not viable, due to this non-1-to-1 aspect of the problem, which could only be accommodated by having empty or duplicate values for some of the evenly spaced keys.

As an example, consider 3 evenly spaced ("unadjusted") offsets: 0, 5, 10 (a sketch of the adjustment follows this list):

  • The first offset will always be 0, never adjusted, as files are always valid from their start.
  • The size of the first block may exceed the second unadjusted offset, e.g. if the first row is length 8 (the inclusive range [0,7]).
  • The second 'adjusted' offset would then be sought as the next newline from (i.e. after) position 8.
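A minimal sketch of one possible adjustment rule, under the simplifying assumption that every newline terminates a row (so multiline fields and the non-1-to-1 cases above are ignored):

# Hypothetical sketch: push each evenly spaced offset (other than 0) forward to
# just past the next newline, so that every partition starts at a row boundary.
def adjust_offsets(data: bytes, offsets: list[int]) -> list[int]:
    adjusted = [0]  # the first offset is always 0: files are always valid from their start
    for offset in offsets[1:]:
        newline_at = data.find(b"\n", offset)
        if newline_at == -1:
            break  # no newline past this offset: the previous partition covers the rest
        adjusted.append(newline_at + 1)
    return adjusted


# 3 unadjusted offsets, with a first row of length 8 (the inclusive range [0,7]):
print(adjust_offsets(b"aaaaaaa\nbb\ncc\n", [0, 5, 10]))  # [0, 8, 11]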

pandas parser coerces absent fields to `nan` rather than `None`

Summary/recap: can't use the csv module parser because it doesn't support a custom lineterminator (#1), and can't get an indication of how many columns are in the row because the pandas parser coerces absent field values to nan or the empty string (this issue).


Using the pandas parser means absent fields will be coerced to nan rather than None (as is done by the csv module parser). Note that nan values cannot be compared for equality (meaning the df.to_dict() can't be checked for equality like it can with None).

Technically, the nan values will be conditional on the pd.read_csv params relating to NaN, some of which produce NaN for reasons other than absent fields:

na_values scalar, str, list-like, or dict, optional

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

keep_default_na bool, default True

Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:

  • If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.

  • If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.

  • If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.

  • If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.

Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.

You don't want to raise a CSV validation error just because the CSV contains these strings: by default keep_default_na=True, so strings like "nan" and "NaN" could appear in a row and be treated identically to absent fields...
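A quick illustration of the hazard (not from the repo's tests):

import io

import pandas as pd

# With default settings, literal "NaN"/"nan" strings in a row are silently
# converted to missing values, indistinguishable from genuinely absent fields.
df = pd.read_csv(io.StringIO("a,b\nNaN,nan\n"))
assert df.isna().all().all()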

The most fruitful setting I found was to specify the parser as Python, which appears to use None (like the csv module) rather than NaN, at least some of the time. I suspect this behaviour can be made to happen all of the time by forcing all the columns to be string type.

>>> kwargs = {"engine": "python", "na_filter": False, "na_values": [], "keep_default_na": False}
>>> pd.read_csv(io.StringIO("a,b\nc\n1,2\n"), **kwargs)
   a    b
0  c  NaN
1  1  2.0
>>> pd.read_csv(io.StringIO("a,b\nc,\n1,2\n"), **kwargs)
   a  b
0  c
1  1  2
>>> pd.read_csv(io.StringIO("a,b\nc,\nd,e\n"), **kwargs)
   a  b
0  c
1  d  e
>>> pd.read_csv(io.StringIO("a,b\nc\nd,e\n"), **kwargs)
   a     b
0  c  None
1  d     e
  • The parsing could be deliberately simplified so as to avoid any conversions happening whatsoever, so everything would be a string
  • The converters can be specified either by column index [i.e. the position of the column in the list of columns] or by name, in a dict. So to convert everything to a string [overriding implicit dtype conversion] you simply pass converters as dict.fromkeys(range(n_columns), str)
>>> kwargs = {"engine": "python", "na_filter": False, "na_values": [], "keep_default_na": False}
>>> kwargs["converters"] = dict.fromkeys(range(2), str)
>>> pd.read_csv(io.StringIO("a,b\nc\n1,2\n"), **kwargs)
   a     b
0  c  None
1  1     2
>>> pd.read_csv(io.StringIO("a,b\nc,\n1,2\n"), **kwargs)
   a  b
0  c   
1  1  2
>>> pd.read_csv(io.StringIO("a,b\nc,\nd,e\n"), **kwargs)
   a  b
0  c   
1  d  e
>>> pd.read_csv(io.StringIO("a,b\nc\nd,e\n"), **kwargs)
   a     b
0  c  None
1  d     e

Now the behaviour is almost correct: on closer inspection the None values get stringified by the str converter, so you actually just want to pass each value through unchanged with a trivial function that returns its input (trivial_return). The 4 tests in pandas_nan_validation_test.py that demonstrate this now pass.
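Putting the pieces together, a hedged sketch of that setup (trivial_return and the kwargs are as described above; the exact code in pandas_nan_validation_test.py may differ):

import io

import pandas as pd


def trivial_return(value):
    """Identity converter: return the parsed value unchanged, so None stays None."""
    return value


kwargs = {
    "engine": "python",
    "na_filter": False,
    "na_values": [],
    "keep_default_na": False,
    "converters": dict.fromkeys(range(2), trivial_return),
}
df = pd.read_csv(io.StringIO("a,b\nc\nd,e\n"), **kwargs)
# The absent field now comes through as None rather than NaN (or the string "None"),
# so a short row can be detected during validation, per the tests described above:
assert df.loc[0, "b"] is None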

Originally posted by @lmmx in #1 (comment)

Clarify handling of pre-offset non-row-terminating lineseparator

As well as the row-terminating / non-row-terminating linesep distinction (let's call them RT and non-RT), I want to distinguish:

  • post-NRT: a non-RT after the offset (the 'target' of the _rowterminatorat(offset: int) procedure)
  • pre-NRT: a non-RT before the offset, within the sample 'counted back' from the post-NRT (default 10: sample_tail_rows=10)

The stated goal of the _rowterminatorat procedure is to count back sample_tail_rows=10 rows.

At present, all it does is count back sample_tail_rows=10 lineseparators.

It is obvious that counting back line separators is not the same as counting back rows. If the 10th lineseparator counted back is non-RT then a parser error will be thrown. This is the case I called "pre-NRT".

At present this is treated identically to a parser error thrown from the "post-NRT", which is supposed to lead to that post-NRT offset position being skipped in a further round of searching forward, past the offset, for a post-offset RT line separator.

I want to instead treat this case of a leftmost pre-NRT line separator as a reason to continue searching leftward (i.e. "count back further"), as it indicates that the row is incomplete. Only by completing the row can we establish whether the desired number of rows to count back (sample_tail_rows=10) has been reached.

I suspect that what may be required is to 'achieve' 10 rows as the 1st step:

  • at the minimum, or "in the best/simplest case", involving 10 pre-offset lineseps,
  • in the extreme, those 10 lineseps could all fall within a multiline field of a single row,

For anything other than the simplest case you'd need to keep advancing backwards to find more lineseps.

As the 2nd step you'd need to then begin (but not complete) an 11th row, and only then be satisfied that 10 rows had been achieved.
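A hedged sketch of that leftward search (seek_tail_start and tail_is_complete_rows are hypothetical stand-ins, not functions from the repo; the quote-counting check is a crude placeholder for the real row validation):

def tail_is_complete_rows(tail: str, quotechar: str = '"') -> bool:
    # crude stand-in for the real validation: an odd number of quote characters
    # means the candidate tail begins or ends inside an open (multiline) quoted field
    return tail.count(quotechar) % 2 == 0


def seek_tail_start(data: str, offset: int, linesep: str = "\n", sample_tail_rows: int = 10) -> int:
    """Count back lineseps from `offset`; if the candidate tail doesn't validate as
    complete rows (a pre-NRT leftmost linesep), count back further rather than
    skipping the offset forward."""
    n_back = sample_tail_rows
    while True:
        pos = offset
        for _ in range(n_back):
            pos = data.rfind(linesep, 0, pos)
            if pos == -1:
                return 0  # reached the start of the file: take everything before the offset
        start = pos + len(linesep)
        if tail_is_complete_rows(data[start:offset]):
            return start
        n_back += 1  # the leftmost linesep was non-RT, i.e. its row is incomplete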
