Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
Description
I discovered this while working on https://github.com/apache/arrow/pull/10191. You can project new columns when writing a dataset, but they are written correctly only if the columns they are derived from are also included in the output; otherwise the new column comes back as all nulls. Here's an R-based example:
library(arrow)
library(dplyr)

# Scratch directory for the example (assumed; ds_dir is not defined in the original report)
ds_dir <- tempfile()

# Simple function to write and re-open the new dataset
write_then_open <- function(ds, path, ...) {
  write_dataset(ds, path, ...)
  open_dataset(path)
}

tab <- Table$create(a = 1:5)

tab %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       a
#   <int>
# 1     1
# 2     2
# 3     3
# 4     4
# 5     5

# If you rename a column, it's all nulls
tab %>%
  select(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       b
#   <int>
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA

# If you derive a new column and keep the original, it works
tab %>%
  mutate(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 2
#       a     b
#   <int> <int>
# 1     1     1
# 2     2     2
# 3     3     3
# 4     4     4
# 5     5     5

# transmute() only keeps the added columns, so it also illustrates the failure
tab %>%
  transmute(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       b
#   <int>
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA