[ARROW-16808] [C++] count_distinct aggregates incorrectly across row groups - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: 9.0.0
Component/s: None
Labels: None
Environment:

> arrow::arrow_info()
Arrow package version: 8.0.0.9000

Capabilities:

dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc FALSE

Memory:

Allocator jemalloc
Current 37.25 Kb
Max 925.42 Kb

Runtime:

SIMD Level none
Detected SIMD Level none

Build:

C++ Library Version 9.0.0-SNAPSHOT
C++ Compiler AppleClang
C++ Compiler Version 13.1.6.13160021
Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9

External issue URL:
https://github.com/apache/arrow/issues/32139
Language:
- C++

Description

When reading from Parquet files with multiple row groups, count_distinct (wrapped by `n_distinct` in R) returns results that are both wrong and inconsistent from one call to the next:

library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5

Results are correct if the file is stored as a single row group. Results are also correct when the aggregation is grouped.

I can reproduce this in Python as well using the same file and pyarrow.compute.count_distinct:

import pyarrow as pa
import pyarrow.compute  # ensures pa.compute is available
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

print(pa.compute.count_distinct(starwars.column('sex')).as_py())
#> 15
print(pa.compute.unique(starwars.column('sex')))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>    null
#> ]

This is likely the same problem as in this StackOverflow question, which reads from ORC files: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array

People

Assignee: Unassigned

Reporter: Edward Visel

Votes: 0

Watchers: 3

Dates

Created: 10/Jun/22 17:00

Updated: 11/Jan/23 11:46

Resolved: 10/Jun/22 17:04