Apache Arrow / ARROW-12620

[C++] Dataset writing can only include projected columns if input columns are also included

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0.0
    • Component/s: C++

    Description

      I discovered this while working on https://github.com/apache/arrow/pull/10191. When writing a dataset, you can project new columns, but they are only written correctly if the input columns they are derived from are also included in the output. Otherwise the projected column is written as all nulls. Here's an R-based example:

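      library(arrow)
      library(dplyr)

      # Scratch directory for the examples (assumed; not defined in the original report)
      ds_dir <- tempfile()

      # Simple function to write and re-open the new dataset
      write_then_open <- function(ds, path, ...) {
        write_dataset(ds, path, ...)
        open_dataset(path)
      }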
      
      tab <- Table$create(a = 1:5)
      
      tab %>% 
        write_then_open(ds_dir) %>%
        collect()
      
      # # A tibble: 5 x 1
      #       a
      #   <int>
      # 1     1
      # 2     2
      # 3     3
      # 4     4
      # 5     5
      
      # If you rename a column, it's all nulls
      tab %>%
        select(b = a) %>%
        write_then_open(ds_dir) %>%
        collect()
      
      # # A tibble: 5 x 1
      #       b
      #   <int>
      # 1    NA
      # 2    NA
      # 3    NA
      # 4    NA
      # 5    NA
      
      # If you derive a new column and keep the original, it works
      tab %>%
        mutate(b = a) %>%
        write_then_open(ds_dir) %>%
        collect()
      
      # # A tibble: 5 x 2
      #       a     b
      #   <int> <int>
      # 1     1     1
      # 2     2     2
      # 3     3     3
      # 4     4     4
      # 5     5     5
      
      # transmute() only keeps the added columns, so it also illustrates the failure
      tab %>%
        transmute(b = a) %>%
        write_then_open(ds_dir) %>%
        collect()
      
      # # A tibble: 5 x 1
      #       b
      #   <int>
      # 1    NA
      # 2    NA
      # 3    NA
      # 4    NA
      # 5    NA
      


            People

              Assignee: lidavidm David Li
              Reporter: npr Neal Richardson
