[ARROW-15151] write_dataset() never increments {i} in partitions part-{i} - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 6.0.1
Fix Version/s: None
Component/s: R
Labels: None
Environment: Ubuntu 21.04
External issue URL: https://github.com/apache/arrow/issues/30657

Description

Introducing partitioning in write_dataset() creates the sub-folders just fine, but each lowest-level sub-folder only ever contains a single part-0.parquet. I don't see how to get write_dataset() to generate multiple part files in a single directory, such as part-0.parquet, part-1.parquet, etc. For example, the documentation for open_dataset() implies we should get three `Z`-level parts:

# You can also partition by the values in multiple columns
# (here: "cyl" and "gear").
# This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
two_levels_tree <- tempfile()
write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
list.files(two_levels_tree, recursive = TRUE)

# In the two previous examples we would have:
# X = {4,6,8}, the number of cylinders.
# Y = {3,4,5}, the number of forward gears.
# Z = {0,1,2}, the number of saved parts, starting from 0.

But I only get the expected directory structure with a single part-0.parquet file in each lowest-level directory.

Context: I frequently need to partition large files that lack any natural grouping variable; I merely want a bunch of small parts of equal size. It would be great if there were an automatic way of doing this. Currently I can hack it by creating a partition column with integers 1...n, where n is my desired number of partitions, and partitioning on that. I'd then like to write these to a flat structure with part-0.parquet, part-1.parquet, etc., not a nested folder structure, if possible.
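For reference, the workaround described above might look roughly like the following. This is only a sketch: the column name `part` and the value of `n_parts` are made up for illustration, and mtcars stands in for a large table.

```r
library(arrow)

# Hypothetical number of partitions; pick whatever n you need.
n_parts <- 4

df <- mtcars
# Assign each row an integer 1..n_parts (round-robin), giving parts of
# near-equal size without relying on any natural grouping variable.
df$part <- rep_len(seq_len(n_parts), nrow(df))

out <- tempfile()
write_dataset(df, out, partitioning = "part")

# Each part=<k> sub-folder holds its own part file, i.e. the result is
# still nested rather than flat.
files <- list.files(out, recursive = TRUE)
print(files)
```

Note that this still produces one sub-folder per integer value, which is exactly the nesting I would like to avoid.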

(Or better yet, it would be amazing if write_dataset() just let us set a maximum partition file size and automated the sharding into parts, while preserving the existing behavior for semantically meaningful groups. Maybe that is already the intent, but I cannot see how to activate it!)
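A sketch of what this could look like with the dataset writing options bound under the linked ARROW-13703. It assumes an arrow R release that includes those bindings and that they expose `max_rows_per_file` and `max_rows_per_group` on write_dataset(); the limit is a row count rather than a file size in bytes, and the value 10 is arbitrary.

```r
library(arrow)

out <- tempfile()

# Assumption: write_dataset() accepts max_rows_per_file and
# max_rows_per_group (options bound via ARROW-13703).
# The row-group limit is set explicitly so that it does not exceed
# the per-file limit.
write_dataset(
  mtcars,
  out,
  max_rows_per_file = 10,
  max_rows_per_group = 10
)

# No partitioning column is used, so the part files land in one flat
# directory; with 32 rows and at most 10 rows per file, several files
# are needed.
files <- list.files(out, recursive = TRUE)
print(files)
```

Combining these options with `partitioning` should, under the same assumptions, split each semantically meaningful group into several part files inside its own sub-folder.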

Issue Links

is fixed by

ARROW-13703 [Python][R] Add bindings for new dataset writing options

Resolved

People

Assignee: Unassigned

Reporter: Carl Boettiger

Votes: 0

Watchers: 4

Dates

Created: 19/Dec/21 18:35

Updated: 11/Jan/23 08:44

Resolved: 02/Jul/22 14:14