Details
Bug
Status: Resolved
Minor
Resolution: Fixed
None
Python 3.6.8
PyArrow 0.13.1.dev225+g184b8deb
NumPy 1.16.3
Pandas 0.24.2
Description
Relates to ARROW-5195 and https://github.com/apache/arrow/issues/4184
I was testing the new strings_can_be_null ConvertOption (built from git 184b8deb651c6f6308c0fa2a595f5a40f5da8ce8) with the CSV reader and noticed that, when the option is enabled, an empty string is not converted to NULL, even though '' is in the default null_values list (https://github.com/apache/arrow/blob/f7ef65e5fc367f1f5649dfcea0754e413fcca394/cpp/src/arrow/csv/options.cc#L28):
options.null_values = {"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null"};
Given that the strings_can_be_null option was added to expose the same NULL handling for strings as pandas.read_csv, I believe it should also convert empty strings to NULL.
In Pandas:
import io
import pandas as pd
content = b"a,b\n1,null\n2,\n3,test"
df = pd.read_csv(io.BytesIO(content))
print(df)
   a     b
0  1   NaN
1  2   NaN
2  3  test
In PyArrow:
import pyarrow.csv as pc
convert_options = pc.ConvertOptions(strings_can_be_null=True)
table = pc.read_csv(io.BytesIO(content), convert_options=convert_options)
print(table.to_pydict())
OrderedDict([('a', [1, 2, 3]), ('b', [None, '', 'test'])])
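For comparison, here is a minimal sketch of the behaviour I would expect, written with only the standard library csv module rather than Arrow's reader. NULL_VALUES is an abbreviated subset of the default list quoted above, and read_string_column is a hypothetical helper made up for illustration; neither is part of PyArrow.

```python
import csv
import io

# Abbreviated subset of the default null_values list quoted above (assumption:
# the full list would behave the same way for this input).
NULL_VALUES = {"", "#N/A", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null"}


def read_string_column(data, column, strings_can_be_null):
    """Hypothetical helper: read one CSV column as strings.

    When strings_can_be_null is True, any cell whose text is in NULL_VALUES
    (including the empty string) is returned as None.
    """
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    values = []
    for row in reader:
        cell = row[column]
        if strings_can_be_null and cell in NULL_VALUES:
            values.append(None)
        else:
            values.append(cell)
    return values


content = b"a,b\n1,null\n2,\n3,test"
print(read_string_column(content, "b", strings_can_be_null=True))
# prints [None, None, 'test']
```

With strings_can_be_null=False the same helper returns ['null', '', 'test'], so the empty string only becomes NULL when the option is switched on, which matches what pandas does by default.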