Apache Arrow / ARROW-8118

[R] dim method for FileSystemDataset

Details

    Description

      I have been using this function often enough that I wonder a) whether it would be useful in the package and b) whether you think it is worth working on. The basic problem is that if you have a hierarchical file structure that you read with open_dataset, it is useful to know how much data you are dealing with. My idea is that 'FileSystemDataset' would have dim, nrow and ncol methods. Here is how I've been using it:

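      library(arrow)

      # Open a directory of Parquet files partitioned by province and month
      x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))

      # Return c(rows, cols) for a FileSystemDataset
      dim_arrow <- function(x) {
        # Row count: read each file and sum the per-file row counts
        rows <- sum(purrr::map_dbl(x$files, ~ ParquetFileReader$create(.x)$ReadTable()$num_rows))
        # Column count: number of fields in the dataset schema
        cols <- x$schema$num_fields

        c(rows, cols)
      }
      dim_arrow(x)
      #> [1] 426929 7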
      

       

      Ideally this would work on arrow_dplyr_query objects as well, but I haven't quite figured out how those filter on the partitioning variables.
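      To make the proposal concrete, here is a minimal sketch of what a dim method could look like if it simply wrapped the approach above. The method name dim.FileSystemDataset, the temporary path, and the use of mtcars as sample data are assumptions for illustration, not something the package is confirmed to provide in this form. vapply stands in for purrr::map_dbl to avoid the extra dependency. Because base nrow() and ncol() are defined in terms of dim(), a single dim method should be enough to cover all three.

```r
library(arrow)

# Hypothetical S3 method, written for illustration only.
# It reuses the calls from dim_arrow() above.
dim.FileSystemDataset <- function(x) {
  # Reads every file in full to count rows, which can be slow on large data
  rows <- sum(vapply(
    x$files,
    function(f) as.numeric(ParquetFileReader$create(f)$ReadTable()$num_rows),
    numeric(1)
  ))
  # Number of fields in the dataset schema, partition columns included
  cols <- x$schema$num_fields
  c(rows, cols)
}

# Small throwaway dataset so the sketch runs on its own
# (assumes an arrow version that provides write_dataset())
path <- tempfile("dim-sketch")
write_dataset(mtcars, path, partitioning = "cyl")
ds <- open_dataset(path)

dim(ds)
nrow(ds)  # base::nrow() is dim(x)[1L], so it picks up the method
ncol(ds)  # base::ncol() is dim(x)[2L]
```

      This still reads each file completely, so it only shows the shape of the interface. A real implementation would presumably want to get row counts from file metadata instead, and it does not address filtered arrow_dplyr_query objects.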

            People

              boshek Sam Albers

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 6h 50m