Apache Arrow / ARROW-5747

[C++] Better column name and header support in CSV reader

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.15.0
    • Component/s: C++

    Description

      While working on ARROW-5500, I found a number of issues with the header_rows setting in the CSV parse options:

      • If header_rows is 0, the reader errors.
      • It's not possible to supply your own column names, as this TODO notes. ARROW-4912 allows renaming columns after reading, which may be enough as long as header_rows == 0 doesn't error. But then you can't naturally specify column types in the convert options, because those take a map of column name to type.
      • If header_rows is > 1, every cell in the header rows gets turned into a column name, so with header_rows == 2 you get twice as many column names as columns. This doesn't error, but it leads to unexpected results.
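
To make the third point concrete, here is a minimal sketch of the flattening behaviour described above. It is not Arrow's code: FlattenHeaderCells and the Row alias are hypothetical names made up for illustration, and the function only mimics "every cell of the header rows becomes a column name".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical helper, NOT Arrow code: mimics the behaviour described above,
// where every cell of the first `header_rows` rows becomes a column name.
std::vector<std::string> FlattenHeaderCells(const std::vector<Row>& rows,
                                            std::size_t header_rows) {
  assert(header_rows <= rows.size());
  std::vector<std::string> names;
  for (std::size_t i = 0; i < header_rows; ++i) {
    // Append all cells of header row i to the flat list of names.
    names.insert(names.end(), rows[i].begin(), rows[i].end());
  }
  return names;
}
```

For a two-column file whose first two rows are both treated as headers, this yields four names for two columns, which is the mismatch reported in the issue.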

      IMO a better interface would have a skip_rows argument to let you ignore a large header, and a column_names argument that, if provided, gives the column names. If it is not provided, the first row after skip_rows is taken to be the column names. If column_names could also take a false or null value, we could support autogenerating names when none are provided and there's no header row. Alternatively, a boolean header argument could govern whether the first (non-skipped) row should be interpreted as column names. (For reference, R's readr takes TRUE/FALSE/array of strings in one arg; base R's read.csv uses separate args for header and col.names. Both have a skip argument.)
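
The proposed semantics can be sketched roughly as follows. This is only an illustration under stated assumptions, not the reader's actual implementation: HeaderOptions, ResolvedHeader, ResolveHeader, the autogenerate_column_names flag (standing in for the "false or null" case) and the generated names f0, f1, ... are all hypothetical choices made here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical options mirroring the proposal; all names are illustrative.
struct HeaderOptions {
  std::size_t skip_rows = 0;               // rows to ignore at the top of the input
  std::vector<std::string> column_names;   // if non-empty, used as the column names
  bool autogenerate_column_names = false;  // no header row: generate f0, f1, ...
};

struct ResolvedHeader {
  std::vector<std::string> column_names;
  std::size_t first_data_row;  // index of the first row that holds data
};

ResolvedHeader ResolveHeader(const std::vector<Row>& rows,
                             const HeaderOptions& opts) {
  ResolvedHeader out;
  if (!opts.column_names.empty()) {
    // Explicit names: the first non-skipped row is already data.
    out.column_names = opts.column_names;
    out.first_data_row = opts.skip_rows;
    return out;
  }
  // Both remaining cases need to look at the first non-skipped row.
  assert(opts.skip_rows < rows.size() && "no rows left after skip_rows");
  const Row& first = rows[opts.skip_rows];
  if (opts.autogenerate_column_names) {
    // No header row: invent one name per column; the row itself stays data.
    for (std::size_t i = 0; i < first.size(); ++i) {
      out.column_names.push_back("f" + std::to_string(i));
    }
    out.first_data_row = opts.skip_rows;
    return out;
  }
  // Default: the first non-skipped row is the header, data starts after it.
  out.column_names = first;
  out.first_data_row = opts.skip_rows + 1;
  return out;
}
```

Note that a multi-row header never produces extra names in this sketch: anything above the single header row has to be dropped through skip_rows, which matches the suggestion to let users parse tall headers themselves.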

      I don't think there's value in trying to be clever about multirow headers and converting those to column names; if there's meaningful information in a tall header, let the user parse it themselves.

            People

              Assignee: Antoine Pitrou (apitrou)
              Reporter: Neal Richardson (npr)
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 40m