While working on ARROW-5500, I found a number of issues around the CSV parse option header_rows:
- If header_rows is 0, the reader errors
- It's not possible to supply your own column names, as this TODO notes. ARROW-4912 allows renaming columns after reading, which may be enough as long as header_rows == 0 doesn't error. But then you can't naturally specify column types in the convert options, because that takes a map of column name to type.
- If header_rows > 1, every cell in the header rows gets turned into a column name, so with header_rows == 2 you get twice as many column names as columns. This doesn't error, but it leads to unexpected results.
IMO a better interface would be to have a skip_rows argument to let you ignore a large header, and a column_names argument that, if provided, gives the column names. If not provided, the first row after skip_rows is taken to be the column names. If it were also possible for column_names to take a false or null argument, then we could support the case of autogenerating names when none are provided and there's no header row. Alternatively, we could use a boolean header argument to govern whether the first (non-skipped) row should be interpreted as column names. (For reference, R's readr takes TRUE/FALSE/array of strings in one arg; the base read.csv uses separate args for header and col.names. Both have a skip argument.)
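To make the proposed semantics concrete, here is a minimal sketch in plain Python using the standard csv module. It is not Arrow code: the function name read_csv_sketch, the combination of column_names with a boolean header argument, and the autogenerated names f0, f1, ... are all assumptions made for illustration, and it reflects only one possible reading of the proposal.

```python
import csv
import io


def read_csv_sketch(text, skip_rows=0, column_names=None, header=True):
    """Hypothetical illustration of the proposed options; not an Arrow API."""
    # Drop the first skip_rows rows regardless of their content.
    rows = list(csv.reader(io.StringIO(text)))[skip_rows:]
    if column_names is not None:
        # Caller-supplied names: every remaining row is treated as data.
        names = list(column_names)
    elif header and rows:
        # No names supplied: the first non-skipped row becomes the header.
        names, rows = rows[0], rows[1:]
    else:
        # No header row and no names: autogenerate f0, f1, ... (assumed scheme).
        width = len(rows[0]) if rows else 0
        names = [f"f{i}" for i in range(width)]
    return names, rows


# A one-line preamble is skipped, then the next row supplies the names.
names, data = read_csv_sketch("some preamble\na,b\n1,2\n", skip_rows=1)
```

Because the names are known before any data is converted in all three branches, a map of column name to type could be applied in each case, which is the gap noted above with renaming after the read.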
I don't think there's value in trying to be clever about multirow headers and converting those to column names; if there's meaningful information in a tall header, let the user parse it themselves.