Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-13169

[R] [C++] sorted partition keys can cause issues

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • None
    • 5.0.0
    • C++, R

    Description

      This is a regression after 4.0.1 so is not a live-bug in a release version of arrow

      When a partition key happens to be ordered, on large (>=1e7 rows), the partitions are not being written faithfully.

      If the partition isn't ordered or the dataset is smaller than 1e7 the partitions appear to be correct (though we should check that the values in other rows do still match when we test this).

      library(arrow)
      
      dir <- "./1M_records"
      n_row <- 1e6
      df <- data.frame(foo = runif(n_row))
      df$let <- sort(sample(letters, n_row, replace = TRUE))
      write_dataset(df, dir, partitioning = "let")
      
      # this should be 26, corresponding to the number of letters (and is)
      length(list.files(dir))
      #> [1] 26
      
      
      
      dir <- "./10M_records_not_sorted"
      n_row <- 1e7
      df <- data.frame(foo = runif(n_row))
      df$let <- sample(letters, n_row, replace = TRUE)
      write_dataset(df, dir, partitioning = "let")
      
      # this should be 26, corresponding to the number of letters (and is!)
      length(list.files(dir))
      #> [1] 26
      
      
      dir <- "./10M_records"
      n_row <- 1e7
      df <- data.frame(foo = runif(n_row))
      df$let <- sort(sample(letters, n_row, replace = TRUE))
      write_dataset(df, dir, partitioning = "let")
      
      # this should be 26, corresponding to the number of letters (but is not)
      length(list.files(dir))
      #> [1] 3
      
      # the letters that were retained:
      list.files(dir)
      #> [1] "let=a" "let=b" "let=c"
      
      # Oddly(?) all of the rows are here, they have just been reshuffled into one of the letters retained
      nrow(open_dataset(dir))
      #> [1] 10000000
      

      Original report for context:

      A bit of context: the data for this example contains all the world exports in 1995, it contain 212 countries, but when saving it as parquet, only 66 countries are actually recorded. The verification I included was to check if the USA (one of the best in the reporter quality index) was present in the data.

      library(arrow)
      #> 
      #> Attaching package: 'arrow'
      #> The following object is masked from 'package:utils':
      #> 
      #>     timestamp
      library(dplyr)
      #> 
      #> Attaching package: 'dplyr'
      #> The following objects are masked from 'package:stats':
      #> 
      #>     filter, lag
      #> The following objects are masked from 'package:base':
      #> 
      #>     intersect, setdiff, setequal, union
      
      url <- "https://ams3.digitaloceanspaces.com/uncomtrade/baci_hs92_1995.rds"
      rds <- "baci_hs92_1995.rds"
      
      if (!file.exists(rds)) try(download.file(url, rds))
      
      d <- readRDS("baci_hs92_1995.rds")
      
      rds_has_usa <- any(grepl("usa", unique(d$reporter_iso)))
      rds_has_usa
      #> [1] TRUE
      
      dir <- "parquet/baci_hs92"
      
      d %>% 
        group_by(year, reporter_iso) %>% 
        write_dataset(dir, hive_style = F)
      
      parquet_has_usa <- any(grepl("usa", list.files(paste0(dir, "/1995"))))
      parquet_has_usa
      #> [1] FALSE
      

      Created on 2021-06-24 by the reprex package (https://reprex.tidyverse.org) (v2.0.0)

      Attachments

        1. screenshot-1.png
          33 kB
          Mauricio 'Pachá' Vargas Sepúlveda

        Activity

          People

            michalno Michal Nowakiewicz
            pachamaltese Mauricio 'Pachá' Vargas Sepúlveda
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 1.5h
                1.5h