Apache Arrow / ARROW-12693

[R] add unique() methods for ArrowTabular, datasets

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.0.0
    • Component/s: R

    Description

      I am trying to see whether I can use `unique` on a Dataset object. With a much bigger dataset in mind, I am trying to get away from this expensive pattern:

      Dataset %>%
        pull(col) %>%
        unique()

      However, when I try the options below, they do not work quite how I'd expect. I'm actually not able to get any of these functions working (e.g. `arrow_mean`), so maybe I am misunderstanding how they are meant to work.

      library(arrow, warn.conflicts = FALSE)
      library(dplyr, warn.conflicts = FALSE)
      dir.create("iris")
      iris %>%
        group_by(Species) %>%
        write_dataset("iris")
      ds <- open_dataset("iris")
      ds %>%
        mutate(unique = arrow_unique(Species)) %>%
        collect()
      #> Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression unique("setosa")
      ds %>%
        mutate(unique = arrow_unique(Petal.Width)) %>%
        collect()
      #> Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {Sepal.Length=Sepal.Length, Sepal.Width=Sepal.Width, Petal.Length=Petal.Length, Petal.Width=Petal.Width, Species="setosa", unique=unique(Petal.Width)}
      
      call_function("unique", ds, "Species")
      #> Error: Argument 1 is of class FileSystemDataset but it must be one of "Array", "ChunkedArray", "RecordBatch", "Table", or "Scalar"
      call_function("unique", ds, "Petal.Width")
      #> Error: Argument 1 is of class FileSystemDataset but it must be one of "Array", "ChunkedArray", "RecordBatch", "Table", or "Scalar"
      
      call_function("mean", ds, "Petal.Width")
      #> Error: Argument 1 is of class FileSystemDataset but it must be one of "Array", "ChunkedArray", "RecordBatch", "Table", or "Scalar"
      
      sessioninfo::session_info()
      #> - Session info ---------------------------------------------------------------
      #> setting value 
      #> version R version 4.0.5 (2021-03-31)
      #> os Windows 10 x64 
      #> system x86_64, mingw32 
      #> ui RTerm 
      #> language (EN) 
      #> collate English_Canada.1252 
      #> ctype English_Canada.1252 
      #> tz America/Los_Angeles 
      #> date 2021-05-07 
      #> 
      #> - Packages -------------------------------------------------------------------
      #> package * version date lib source 
      #> arrow * 4.0.0 2021-04-27 [1] CRAN (R 4.0.5)
      #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
      #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3)
      #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.0.2)
      #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.0.2)
      #> cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.5)
      #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3)
      #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
      #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
      #> dplyr * 1.0.5 2021-03-05 [1] CRAN (R 4.0.5)
      #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.5)
      #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
      #> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.3)
      #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
      #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
      #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
      #> highr 0.9 2021-04-16 [1] CRAN (R 4.0.4)
      #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
      #> knitr 1.33 2021-04-24 [1] CRAN (R 4.0.5)
      #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
      #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
      #> pillar 1.6.0 2021-04-13 [1] CRAN (R 4.0.5)
      #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
      #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
      #> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.0.5)
      #> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.0.2)
      #> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.0.2)
      #> R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.0.2)
      #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
      #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.5)
      #> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
      #> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.4)
      #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
      #> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
      #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
      #> styler 1.4.1 2021-03-30 [1] CRAN (R 4.0.4)
      #> tibble 3.1.1 2021-04-18 [1] CRAN (R 4.1.0)
      #> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.0.5)
      #> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.5)
      #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.5)
      #> withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.4)
      #> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.4)
      #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
      #> 
      #> [1] C:/Users/salbers/R/win-library/4.0
      #> [2] C:/Program Files/R/R-4.0.5/library
      
      

      I am opening this a) because others may have run into the same issue and b) just in case this is actually a bug. Feel free to close immediately if this isn't the way these are supposed to work.
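
      The `call_function()` errors above say that the first argument must be an Array, ChunkedArray, RecordBatch, Table, or Scalar. Taking that at face value, one possible workaround is sketched below: read only the column of interest into an Arrow Table, then pass that column (a ChunkedArray) to the `unique` compute function. This is a sketch under assumptions, not a confirmed fix for this issue: the temporary directory is introduced here only to keep the example self-contained, and `collect(as_data_frame = FALSE)` is assumed to return a Table in the installed arrow version.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Write and reopen the same partitioned dataset as in the reprex.
# Assumption: a temporary directory is used instead of "iris".
path <- tempfile("iris")
dir.create(path)
iris %>%
  group_by(Species) %>%
  write_dataset(path)
ds <- open_dataset(path)

# Read only the needed column, keeping it as an Arrow Table
# rather than converting it to an R data frame.
tab <- ds %>%
  select(Petal.Width) %>%
  collect(as_data_frame = FALSE)

# A Table column is a ChunkedArray, one of the classes call_function() accepts.
uniq <- call_function("unique", tab$Petal.Width)

# Convert the (small) deduplicated result to an R vector.
unique_widths <- as.vector(uniq)
```

      The selected column is still read in full, so this only avoids converting every value to R before deduplicating; it does not push `unique` down into the dataset scan.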

            People

              Assignee: Sam Albers (boshek)
              Reporter: Sam Albers (boshek)

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3.5h