Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-16808

[C++] count_distinct aggregates incorrectly across row groups

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Blocker
    • Resolution: Duplicate
    • None
    • 9.0.0
    • None
    • None

    Description

      When reading from parquet files with multiple row groups, count_distinct (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:

      library(dplyr, warn.conflicts = FALSE)
      
      path <- tempfile(fileext = '.parquet')
      arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)
      
      ds <- arrow::open_dataset(path)
      
      ds %>% count(sex) %>% collect()
      #> # A tibble: 5 × 2
      #>   sex                n
      #>   <chr>          <int>
      #> 1 male              60
      #> 2 none               6
      #> 3 female            16
      #> 4 hermaphroditic     1
      #> 5 <NA>               4
      
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    19
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    17
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    17
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    16
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    16
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    17
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    17
      
      # correct
      ds %>% collect() %>% summarise(n = n_distinct(sex))
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1     5
      

      If the file is stored as a single row group, results are correct. When grouped, results are correct.

      I can reproduce this in Python as well using the same file and pyarrow.compute.count_distinct:

      import pyarrow as pa
      import pyarrow.parquet as pq
      
      pa.__version__
      #> 8.0.0
      
      starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')
      
      print(pa.compute.count_distinct(starwars.column('sex')).as_py())
      #> 15
      print(pa.compute.unique(starwars.column('sex')))
      #> [
      #>   "male",
      #>   "none",
      #>   "female",
      #>   "hermaphroditic",
      #>    null
      #> ]
      

      This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files.

      Attachments

        Activity

          People

            Unassigned Unassigned
            alistaire Edward Visel
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: