Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 0.14.1
- Environment: ubuntu xenial, python2.7
Description
Case: a dataset with 20k columns. The number of rows can be 0.
pyarrow.csv.read_csv('20k_cols.csv') works fine if no convert_options are provided; it takes about 150 ms.
Now I call read_csv() with a column_types mapping that marks 2000 of these columns as string:
pyarrow.csv.read_csv('20k_cols.csv', convert_options=pyarrow.csv.ConvertOptions(column_types={'K%d' % i: pyarrow.string() for i in range(2000)}))
(K1..K19999 are the column names in the attached dataset.)
My overall goal is to read everything as string and avoid any type inference.
This takes several minutes and consumes around 4 GB of memory, which does not look sane at all.
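For reference, a minimal self-contained sketch of the scenario. The attached 20k_cols.csv is not reproduced here, so the script generates a stand-in file with assumed column names K0..K19999 and a few rows of data; the last call illustrates the "read everything as string" mapping described above.

import time

import pyarrow as pa
import pyarrow.csv

NUM_COLS = 20000
column_names = ['K%d' % i for i in range(NUM_COLS)]

# Generate a stand-in wide CSV: a header line plus a few rows of small integers.
with open('20k_cols.csv', 'w') as f:
    f.write(','.join(column_names) + '\n')
    for _ in range(10):
        f.write(','.join('1' for _ in range(NUM_COLS)) + '\n')

# Baseline: full type inference, no convert_options.
start = time.time()
table = pa.csv.read_csv('20k_cols.csv')
print('inferred types: %.3fs' % (time.time() - start))

# Reported slow case: explicit string type for 2000 of the 20000 columns.
opts = pa.csv.ConvertOptions(
    column_types={'K%d' % i: pa.string() for i in range(2000)})
start = time.time()
table = pa.csv.read_csv('20k_cols.csv', convert_options=opts)
print('partial column_types: %.3fs' % (time.time() - start))

# Stated goal: read every column as string and skip inference entirely,
# by mapping all column names to pa.string().
all_strings = pa.csv.ConvertOptions(
    column_types={name: pa.string() for name in column_names})
start = time.time()
table = pa.csv.read_csv('20k_cols.csv', convert_options=all_strings)
print('all-string column_types: %.3fs' % (time.time() - start))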
Attachments
Issue Links