[ARROW-5747] [C++] Better column name and header support in CSV reader - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.13.0
Fix Version/s: 0.15.0
Component/s: C++
Labels:
- csv
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/22172

Description

While working on ~~ARROW-5500~~, I found several issues with the header_rows CSV parse option:

- If header_rows is 0, the reader errors.
- It's not possible to supply your own column names, as a TODO in the code notes. ~~ARROW-4912~~ allows renaming columns after reading, which may be enough as long as header_rows == 0 doesn't error, but then you can't naturally specify column types in the convert options, because that takes a map of column name to type.
- If header_rows > 1, every cell in the header rows gets turned into a column name, so if header_rows == 2, you get twice as many column names as columns. This doesn't error, but it leads to unexpected results.

IMO a better interface would be to have a skip_rows argument to let you ignore a large header, and a column_names argument that, if provided, gives the column names. If not provided, the first row after skip_rows is taken to be the column names. If it were also possible for column_names to take a false or null argument, then we could support the case of autogenerating names when none are provided and there's no header row. Alternatively, we could use a boolean header argument to govern whether the first (non-skipped) row should be interpreted as column names. (For reference, R's readr takes TRUE/FALSE/array of strings in one arg; the base read.csv uses separate args for header and col.names. Both have a skip argument.)
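To make the proposed semantics concrete, here is a minimal, self-contained C++17 sketch of how the column names could be resolved under such an interface. It is not Arrow's actual API: HeaderOptions, ResolvedHeader, ResolveHeader, the autogenerate_names flag, and the "f0", "f1", ... naming scheme are all hypothetical names chosen for illustration, and the input is assumed to be already split into rows of cells.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical options mirroring the proposal; not Arrow's actual API.
struct HeaderOptions {
  std::size_t skip_rows = 0;  // rows to ignore at the top of the input
  std::optional<std::vector<std::string>> column_names;  // explicit names, if any
  bool autogenerate_names = false;  // no header row: generate names instead
};

struct ResolvedHeader {
  std::vector<std::string> names;
  std::size_t first_data_row;  // index of the first row that holds data
};

ResolvedHeader ResolveHeader(const std::vector<Row>& rows,
                             const HeaderOptions& opts) {
  if (opts.skip_rows > rows.size()) {
    throw std::invalid_argument("skip_rows exceeds the number of rows");
  }
  // Explicit names win, and no row is consumed as a header.
  if (opts.column_names.has_value()) {
    return {*opts.column_names, opts.skip_rows};
  }
  if (opts.skip_rows == rows.size()) {
    throw std::invalid_argument("no row left to take column names from");
  }
  const Row& first = rows[opts.skip_rows];
  if (opts.autogenerate_names) {
    // Assumed naming scheme: f0, f1, ... sized by the first non-skipped row.
    std::vector<std::string> names;
    for (std::size_t i = 0; i < first.size(); ++i) {
      names.push_back("f" + std::to_string(i));
    }
    assert(names.size() == first.size());
    return {names, opts.skip_rows};  // the first row is data, not a header
  }
  // Default: the first non-skipped row supplies the column names.
  return {first, opts.skip_rows + 1};
}
```

In this sketch there is always exactly one name per column, so a tall header is handled by raising skip_rows rather than by merging several rows into names, and a name-to-type map in the convert options always has names to key on.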

I don't think there's value in trying to be clever about multirow headers and converting those to column names; if there's meaningful information in a tall header, let the user parse it themselves.

Issue Links

links to

GitHub Pull Request #4898

People

Assignee: Antoine Pitrou

Reporter: Neal Richardson

Votes: 0

Watchers: 3

Dates

Created: 26/Jun/19 21:02

Updated: 11/Jan/23 07:42

Resolved: 24/Jul/19 12:26

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 40m