Details
-
Bug
-
Status: Closed
-
Blocker
-
Resolution: Duplicate
-
None
-
None
-
None
-
> arrow::arrow_info()
Arrow package version: 8.0.0.9000
Capabilities:
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc FALSE
Memory:
Allocator jemalloc
Current 37.25 Kb
Max 925.42 Kb
Runtime:
SIMD Level none
Detected SIMD Level none
Build:
C++ Library Version 9.0.0-SNAPSHOT
C++ Compiler AppleClang
C++ Compiler Version 13.1.6.13160021
Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9> arrow::arrow_info() Arrow package version: 8.0.0.9000 Capabilities: dataset TRUE substrait FALSE parquet TRUE json TRUE s3 TRUE utf8proc TRUE re2 TRUE snappy TRUE gzip TRUE brotli TRUE zstd TRUE lz4 TRUE lz4_frame TRUE lzo FALSE bz2 TRUE jemalloc TRUE mimalloc FALSE Memory: Allocator jemalloc Current 37.25 Kb Max 925.42 Kb Runtime: SIMD Level none Detected SIMD Level none Build: C++ Library Version 9.0.0-SNAPSHOT C++ Compiler AppleClang C++ Compiler Version 13.1.6.13160021 Git ID d9d78946607f36e25e9d812a5cc956bd00ab2bc9
Description
When reading from parquet files with multiple row groups, count_distinct (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:
library(dplyr, warn.conflicts = FALSE) path <- tempfile(fileext = '.parquet') arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L) ds <- arrow::open_dataset(path) ds %>% count(sex) %>% collect() #> # A tibble: 5 × 2 #> sex n #> <chr> <int> #> 1 male 60 #> 2 none 6 #> 3 female 16 #> 4 hermaphroditic 1 #> 5 <NA> 4 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 19 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 16 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 17 ds %>% summarise(n = n_distinct(sex)) %>% collect() #> # A tibble: 1 × 1 #> n #> <int> #> 1 17 # correct ds %>% collect() %>% summarise(n = n_distinct(sex)) #> # A tibble: 1 × 1 #> n #> <int> #> 1 5
If the file is stored as a single row group, results are correct. When grouped, results are correct.
I can reproduce this in Python as well using the same file and pyarrow.compute.count_distinct:
import pyarrow as pa import pyarrow.parquet as pq pa.__version__ #> 8.0.0 starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet') print(pa.compute.count_distinct(starwars.column('sex')).as_py()) #> 15 print(pa.compute.unique(starwars.column('sex'))) #> [ #> "male", #> "none", #> "female", #> "hermaphroditic", #> null #> ]
This seems likely to be the same problem in this StackOverflow question: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array which is working from orc files.