[ARROW-8118] [R] dim method for FileSystemDataset - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.17.0
Component/s: R
Labels:
- features
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/24327

Description

I have been using this function often enough that I wonder a) whether it would be useful in the package and b) whether you think it is worth working on. The basic problem is that if you have a hierarchical file structure that works with open_dataset, it is useful to know how much data you are dealing with. My idea is that 'FileSystemDataset' would have dim, nrow and ncol methods. Here is how I've been using it:

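library(arrow)

# Open a directory of Parquet files partitioned by "prov" and "month".
x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))

# Return c(rows, cols): read each file as a table and sum the row counts;
# the column count comes from the dataset schema.
dim_arrow <- function(x) {
  rows <- sum(purrr::map_dbl(x$files, ~ParquetFileReader$create(.x)$ReadTable()$num_rows))
  cols <- x$schema$num_fields

  c(rows, cols)
}

dim_arrow(x)
#> [1] 426929 7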

Ideally this would also work on arrow_dplyr_query objects, but I haven't quite figured out how those filter on the partitioning variables.
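
The dim_arrow helper above reads every file into memory just to count its rows. A possibly cheaper variant is sketched below, assuming a version of the arrow R package in which ParquetFileReader exposes a num_rows field taken from the Parquet footer, and in which write_dataset is available to build a small throwaway dataset. The name dim_arrow_metadata and the mtcars example data are illustrative only and are not part of the original report.

```r
library(arrow)

# Hypothetical helper (not from the issue): count rows from Parquet file
# metadata instead of reading each file into a Table.
dim_arrow_metadata <- function(ds) {
  rows <- sum(vapply(
    ds$files,
    # Assumes ParquetFileReader has a `num_rows` field in the installed version.
    function(f) as.numeric(ParquetFileReader$create(f)$num_rows),
    numeric(1)
  ))
  # Column count comes from the dataset schema, which includes partition columns.
  c(rows, ds$schema$num_fields)
}

# Illustrative data: write mtcars as a Parquet dataset partitioned by "cyl".
tmp <- tempfile()
write_dataset(mtcars, tmp, partitioning = "cyl")
ds <- open_dataset(tmp)

dim_arrow_metadata(ds)
```

Since this issue is marked as fixed in 0.17.0, calling dim(ds) directly on the dataset is presumably the simpler route in that release and later; the sketch only illustrates the idea of counting rows without loading the data.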

Issue Links

links to

GitHub Pull Request #6635

People

Assignee: Sam Albers
Reporter: Sam Albers
Votes: 0
Watchers: 5

Dates

Created: 13/Mar/20 22:20
Updated: 11/Jan/23 07:58
Resolved: 19/Mar/20 19:09

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 6h 50m