Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
None
Description
This is a regression after 4.0.1 so is not a live-bug in a release version of arrow
When a partition key happens to be ordered, on large (>=1e7 rows), the partitions are not being written faithfully.
If the partition isn't ordered or the dataset is smaller than 1e7 the partitions appear to be correct (though we should check that the values in other rows do still match when we test this).
library(arrow) dir <- "./1M_records" n_row <- 1e6 df <- data.frame(foo = runif(n_row)) df$let <- sort(sample(letters, n_row, replace = TRUE)) write_dataset(df, dir, partitioning = "let") # this should be 26, corresponding to the number of letters (and is) length(list.files(dir)) #> [1] 26 dir <- "./10M_records_not_sorted" n_row <- 1e7 df <- data.frame(foo = runif(n_row)) df$let <- sample(letters, n_row, replace = TRUE) write_dataset(df, dir, partitioning = "let") # this should be 26, corresponding to the number of letters (and is!) length(list.files(dir)) #> [1] 26 dir <- "./10M_records" n_row <- 1e7 df <- data.frame(foo = runif(n_row)) df$let <- sort(sample(letters, n_row, replace = TRUE)) write_dataset(df, dir, partitioning = "let") # this should be 26, corresponding to the number of letters (but is not) length(list.files(dir)) #> [1] 3 # the letters that were retained: list.files(dir) #> [1] "let=a" "let=b" "let=c" # Oddly(?) all of the rows are here, they have just been reshuffled into one of the letters retained nrow(open_dataset(dir)) #> [1] 10000000
Original report for context:
A bit of context: the data for this example contains all the world exports in 1995, it contain 212 countries, but when saving it as parquet, only 66 countries are actually recorded. The verification I included was to check if the USA (one of the best in the reporter quality index) was present in the data.
library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp library(dplyr) #> #> Attaching package: 'dplyr' #> The following objects are masked from 'package:stats': #> #> filter, lag #> The following objects are masked from 'package:base': #> #> intersect, setdiff, setequal, union url <- "https://ams3.digitaloceanspaces.com/uncomtrade/baci_hs92_1995.rds" rds <- "baci_hs92_1995.rds" if (!file.exists(rds)) try(download.file(url, rds)) d <- readRDS("baci_hs92_1995.rds") rds_has_usa <- any(grepl("usa", unique(d$reporter_iso))) rds_has_usa #> [1] TRUE dir <- "parquet/baci_hs92" d %>% group_by(year, reporter_iso) %>% write_dataset(dir, hive_style = F) parquet_has_usa <- any(grepl("usa", list.files(paste0(dir, "/1995")))) parquet_has_usa #> [1] FALSE
Created on 2021-06-24 by the reprex package (https://reprex.tidyverse.org) (v2.0.0)