Apache Arrow / ARROW-5747

[C++] Better column name and header support in CSV reader

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.15.0
    • Component/s: C++

      Description

      While working on ARROW-5500, I found a number of issues around the CSV parse options header_rows:

      • If header_rows is 0, the reader errors
      • It's not possible to supply your own column names, as this TODO notes. ARROW-4912 allows renaming columns after reading, which may be enough as long as header_rows == 0 doesn't error. But then you can't naturally specify column types in the convert options, because that takes a map of column name to type.
      • If header_rows is > 1, every cell gets turned into a column name, so if header_rows == 2, you get twice the number of column names as columns. This doesn't error, but it leads to unexpected results.

      IMO a better interface would be to have a skip_rows argument to let you ignore a large header, and a column_names argument that, if provided, gives the column names. If not provided, the first row after skip_rows is taken to be the column names. If it were also possible for column_names to take a false or null argument, then we could support the case of autogenerating names when none are provided and there's no header row. Alternatively, we could use a boolean header argument to govern whether the first (non-skipped) row should be interpreted as column names. (For reference, R's readr takes TRUE/FALSE/array of strings in one arg; the base read.csv uses separate args for header and col.names. Both have a skip argument.)
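
      To make the proposed semantics concrete, here is a minimal sketch in plain Python using only the standard library. It is not Arrow's implementation or API: the function name read_csv_sketch, the boolean header argument (one of the two alternatives floated above), and the autogenerated "f0", "f1", ... names are assumptions made for illustration. Only skip_rows and column_names come from the proposal itself.

```python
import csv
import io


def read_csv_sketch(text, skip_rows=0, column_names=None, header=True):
    """Hypothetical reader illustrating the proposed options.

    skip_rows    -- number of leading rows to ignore entirely
    column_names -- explicit names; if given, every remaining row is data
    header       -- if no names are given, whether the first non-skipped
                    row holds the column names (assumed boolean variant)
    """
    rows = list(csv.reader(io.StringIO(text)))[skip_rows:]

    if column_names is not None:
        # Caller-supplied names: no row is consumed as a header.
        names = list(column_names)
    elif header:
        # Default: first row after skip_rows provides the names.
        if not rows:
            raise ValueError("no header row available after skipping")
        names, rows = rows[0], rows[1:]
    else:
        # No header and no names: autogenerate (naming scheme is assumed).
        width = len(rows[0]) if rows else 0
        names = [f"f{i}" for i in range(width)]

    for row in rows:
        if len(row) != len(names):
            raise ValueError(f"expected {len(names)} columns, got {len(row)}")

    # Return columns as a dict of name -> list of string cells.
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


text = "# exported by tool\n# version 2\na,b\n1,2\n3,4\n"

# Skip the two-line preamble; the next row becomes the column names.
by_header = read_csv_sketch(text, skip_rows=2)

# Skip the preamble and the header row, and supply names explicitly.
by_names = read_csv_sketch(text, skip_rows=3, column_names=["x", "y"])

# No header row at all: names are autogenerated.
autogen = read_csv_sketch("1,2\n3,4\n", header=False)
```

      With names known up front in every case, a map of column name to type (as the convert options expect) can be written without first reading the file.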

      I don't think there's value in trying to be clever about multirow headers and converting those to column names; if there's meaningful information in a tall header, let the user parse it themselves.

              People

              • Assignee:
                pitrou Antoine Pitrou
                Reporter:
                npr Neal Richardson
              • Votes:
                0
                Watchers:
                2

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 40m