Summary/recap: we can't use the `csv` module parser because it doesn't support a custom `lineterminator` (#1), and we can't get an indication of how many columns are in a row because the pandas parser coerces absent field values to `nan` or the empty string (this issue).

Using the pandas parser means absent fields will be coerced to `nan` rather than `None` (as is done by the `csv` module parser). Note that `nan` values cannot be compared for equality, meaning the `df.to_dict()` output can't be checked for equality the way it can with `None`.
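To make that equality problem concrete, here's a minimal sketch (hypothetical file contents, not from the project's test suite):

```python
import csv
import io
import math

import pandas as pd

# pandas: the absent field in the short row "c" becomes NaN
df = pd.read_csv(io.StringIO("a,b\nc\n"))
value = df.to_dict()["b"][0]
assert math.isnan(value)
assert value != value  # NaN never compares equal, not even to itself

# csv module: the same absent field becomes None, which compares fine
rows = list(csv.DictReader(io.StringIO("a,b\nc\n")))
assert rows[0] == {"a": "c", "b": None}
```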
Technically, which values become `nan` is conditional on the `pd.read_csv` params relating to NaN, and some of those `nan`s come from recognised NA strings rather than absent fields:
> **na_values** : scalar, str, list-like, or dict, optional
>
> Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
>
> **keep_default_na** : bool, default True
>
> Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:
>
> - If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.
> - If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.
> - If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.
> - If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
>
> Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
You don't want CSV validation to raise an error just because the CSV contains these strings: by default `keep_default_na=True`, so strings like `"nan"` and `"NaN"` could be present in a row and end up identical to genuinely absent fields...
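To see that coercion concretely, a minimal sketch (hypothetical file contents) contrasting the default behaviour with `na_filter=False`:

```python
import io

import pandas as pd

csv_text = "a,b\nnan,NA\n"

# Default settings: the literal strings "nan" and "NA" are coerced to NaN,
# indistinguishable from genuinely absent fields
default_df = pd.read_csv(io.StringIO(csv_text))
assert default_df.isna().all().all()

# Disabling NA filtering keeps them as the strings they were in the file
strict_df = pd.read_csv(io.StringIO(csv_text), na_filter=False)
assert strict_df.to_dict() == {"a": {0: "nan"}, "b": {0: "NA"}}
```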
The most fruitful setting I found was to specify the parser engine as Python, which appears to use `None` (like the `csv` module) rather than `NaN`, at least some of the time. I suspect this behaviour can be made to happen all of the time by forcing all the columns to string type.
```python
>>> kwargs = {"engine": "python", "na_filter": False, "na_values": [], "keep_default_na": False}
>>> pd.read_csv(io.StringIO("a,b\nc\n1,2\n"), **kwargs)
   a    b
0  c  NaN
1  1  2.0
>>> pd.read_csv(io.StringIO("a,b\nc,\n1,2\n"), **kwargs)
   a  b
0  c
1  1  2
>>> pd.read_csv(io.StringIO("a,b\nc,\nd,e\n"), **kwargs)
   a  b
0  c
1  d  e
>>> pd.read_csv(io.StringIO("a,b\nc\nd,e\n"), **kwargs)
   a     b
0  c  None
1  d     e
```
- The parsing could be deliberately simplified so as to avoid any conversions happening whatsoever, so everything would be a string.
- The converters can be specified in a dict, keyed either by column index (i.e. the position of the column in the list of columns) or by name. So to convert everything to a string (overriding implicit dtype conversion) you simply pass `converters` as `dict.fromkeys(range(n_columns), str)`.
```python
>>> kwargs = {"engine": "python", "na_filter": False, "na_values": [], "keep_default_na": False}
>>> kwargs["converters"] = dict.fromkeys(range(2), str)
>>> pd.read_csv(io.StringIO("a,b\nc\n1,2\n"), **kwargs)
   a     b
0  c  None
1  1     2
>>> pd.read_csv(io.StringIO("a,b\nc,\n1,2\n"), **kwargs)
   a  b
0  c
1  1  2
>>> pd.read_csv(io.StringIO("a,b\nc,\nd,e\n"), **kwargs)
   a  b
0  c
1  d  e
>>> pd.read_csv(io.StringIO("a,b\nc\nd,e\n"), **kwargs)
   a     b
0  c  None
1  d     e
```
Now the behaviour is almost correct: on closer inspection, the `None` values get stringified by the `str` converter, so you actually just want to pass each value through unchanged with a trivial function that returns its input, `trivial_return`. The 4 tests in `pandas_nan_validation_test.py` that demonstrate this now pass.
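The body of `trivial_return` isn't shown above, but per the description it is just an identity function; a minimal sketch of the resulting setup, assuming 2 columns as in the examples:

```python
import io

import pandas as pd


def trivial_return(value):
    """Identity converter: pass each parsed field through unchanged."""
    return value


kwargs = {
    "engine": "python",
    "na_filter": False,
    "na_values": [],
    "keep_default_na": False,
    "converters": dict.fromkeys(range(2), trivial_return),
}
df = pd.read_csv(io.StringIO("a,b\nc\nd,e\n"), **kwargs)

# The absent field survives as a genuine None (not the string "None" or
# NaN), so the to_dict() output can be compared for equality
assert df.to_dict() == {"a": {0: "c", 1: "d"}, "b": {0: None, 1: "e"}}
```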
Originally posted by @lmmx in #1 (comment)