Apache Arrow / ARROW-15151

write_dataset() never increments {i} in partitions part-{i}

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0.1
    • Fix Version/s: None
    • Component/s: R
    • Labels: None
    • Environment: Ubuntu 21.04

    Description

      Introducing partitioning in write_dataset() creates sub-folders just fine, but the lowest-level subfolder only ever contains a part-0.parquet. I don't see how to get write_dataset() to generate multiple part files in a single directory, such as part-0.parquet, part-1.parquet, etc. For example, the documentation for open_dataset() implies we should get three `Z`-level parts:

      # You can also partition by the values in multiple columns
      # (here: "cyl" and "gear").
      # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
      two_levels_tree <- tempfile()
      write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
      list.files(two_levels_tree, recursive = TRUE)
      
      # In the two previous examples we would have:
      # X = {4,6,8}, the number of cylinders.
      # Y = {3,4,5}, the number of forward gears.
      # Z = {0,1,2}, the number of saved parts, starting from 0. 

      But I get the expected directory structure with only a single part-0.parquet file in each lowest-level directory.

      Context: I frequently need to partition large files that lack any natural grouping variable; I merely want a bunch of small parts of equal size. It would be great if there were an automatic way of doing this. Currently I can hack around it by creating a partition column with the integers 1...n, where n is my desired number of partitions, and partitioning on that column. I'd then like to write these to a flat structure with part-0.parquet, part-1.parquet, etc., not a nested folder structure, if possible.
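
      For reference, the integer-column hack described above can be sketched roughly as follows. The helper add_chunk_id() and the column name "chunk" are hypothetical names made up for this illustration; they are not part of arrow. Note that this still produces one sub-folder per chunk rather than the flat layout I am after.

```r
library(arrow)

# Hypothetical helper (not part of arrow): assign each row to one of
# n_parts roughly equal-sized chunks, numbered 1..n_parts.
add_chunk_id <- function(df, n_parts) {
  df$chunk <- as.integer(ceiling(seq_len(nrow(df)) * n_parts / nrow(df)))
  df
}

chunk_dir <- tempfile()
chunked <- add_chunk_id(mtcars, n_parts = 4)

# Partition on the synthetic column; each chunk lands in its own
# chunk=<k> sub-folder.
write_dataset(chunked, chunk_dir, partitioning = "chunk")
list.files(chunk_dir, recursive = TRUE)
```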

      (Or better yet, it would be amazing if write_dataset() just let us set a maximum partition file size and could automate the sharding into parts while preserving the existing behavior for actually semantically meaningful groups.  Maybe that is already the intent but I cannot see how to activate it!)
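
      A sketch of what I have in mind, assuming a newer arrow release in which write_dataset() accepts max_rows_per_file and max_rows_per_group (as far as I can tell neither is available in 6.0.1, so this is untested on the affected version). With no partitioning argument, the parts would ideally be written flat into a single directory:

```r
library(arrow)

flat_dir <- tempfile()

# Assumption: max_rows_per_file / max_rows_per_group exist in the
# installed arrow version. Cap each file at 10 rows; the row-group cap
# is set to the same value so it does not exceed the per-file cap.
write_dataset(
  mtcars,
  flat_dir,
  max_rows_per_file = 10L,
  max_rows_per_group = 10L
)

# If the cap is honoured, the 32 rows of mtcars should be split across
# several part-{i}.parquet files in one flat directory.
part_files <- list.files(flat_dir)
part_files
```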

            People

              Assignee: Unassigned
              Reporter: Carl Boettiger (cboettig)
