Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-16807

[C++] count_distinct aggregates incorrectly across row groups

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • None
    • 9.0.0
    • C++

    Description

      When reading from parquet files with multiple row groups, count_distinct (wrapped by n_distinct in R) returns inaccurate and inconsistent results:

      library(dplyr, warn.conflicts = FALSE)
      
      path <- tempfile(fileext = '.parquet')
      arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)
      
      ds <- arrow::open_dataset(path)
      
      ds %>% count(sex) %>% collect()
      #> # A tibble: 5 × 2
      #>   sex                n
      #>   <chr>          <int>
      #> 1 male              60
      #> 2 none               6
      #> 3 female            16
      #> 4 hermaphroditic     1
      #> 5 <NA>               4
      
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    19
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    17
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    17
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    16
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    16
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    17
      ds %>% summarise(n = n_distinct(sex)) %>% collect()
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1    17
      
      # correct
      ds %>% collect() %>% summarise(n = n_distinct(sex))
      #> # A tibble: 1 × 1
      #>       n
      #>   <int>
      #> 1     5
      

      If the file is stored as a single row group, results are correct. When grouped, results are correct.

      I can reproduce this in Python as well using the same file and pyarrow.compute.count_distinct:

      import pyarrow as pa
      import pyarrow.parquet as pq
      
      pa.__version__
      #> 8.0.0
      
      starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')
      
      pa.compute.count_distinct(starwars.column('sex')).as_py()
      #> 15
      pa.compute.unique(starwars.column('sex'))
      #> [
      #>   "male",
      #>   "none",
      #>   "female",
      #>   "hermaphroditic",
      #>    null
      #> ]
      

      This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files.

      Attachments

        Issue Links

          Activity

            People

              octalene Aldrin Montana
              alistaire Edward Visel
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 4h 10m
                  4h 10m