Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 10.0.0
Description
I have a large Parquet dataset (900 million rows, 40 columns), subdivided into one folder per year. I was trying to calculate how many unique combinations of id1+id2+id3+id4 there are in the dataset.
Note that the "collected" result is supposed to be only one row and one cell, containing the count. I've confirmed this by subsetting the dataset ("%>% head(10^6)") before computing the count, and it works. That is why the error below is so weird.
```
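library(arrow)
library(dplyr)
library(tictoc)

fa <- 'myparteq folder' #huge
va <- open_dataset(fa)
tic()
# count the distinct (id1, id2, id3, id4) combinations; the result should be a single number
d <- va %>% head(10^6) %>% count(id1,id2,id3,id4) %>% count %>% collect
toc()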
Error in `collect()`:
! Invalid: negative malloc size
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/rlang_error>
Error in `collect()`:
! Invalid: negative malloc size
---
Backtrace:
1. ... %>% collect
3. arrow:::collect.arrow_dplyr_query(.)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
Error in `collect()`:
! Invalid: negative malloc size
---
Backtrace:
     x
  1. +-... %>% collect
  2. +-dplyr::collect(.)
  3. \-arrow:::collect.arrow_dplyr_query(.)
  4.   \-base::tryCatch(...)
  5.     \-base (local) tryCatchList(expr, classes, parentenv, handlers)
  6.       \-base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
  7.         \-value[[3L]](cond)
  8.           \-arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
  9.             \-rlang::abort(msg, call = call)
```
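For anyone trying to reproduce this, here is a scaled-down sketch of the same pipeline on synthetic data. Everything in it is an assumption made for illustration: the toy values, the `toy_parquet` folder name, and the `year` partition column (the real column name is not given above). It only shows the expected shape of the result (one row, one cell) and an alternative `distinct()` formulation. It is not expected to trigger the error at this size, and whether the alternative avoids the error on the full dataset is untested.

```r
library(arrow)
library(dplyr)

# Synthetic stand-in for the real data (hypothetical values, tiny on purpose)
set.seed(1)
n <- 1000
df <- data.frame(
  year = sample(2019:2021, n, replace = TRUE),
  id1  = sample(1:5, n, replace = TRUE),
  id2  = sample(1:5, n, replace = TRUE),
  id3  = sample(1:3, n, replace = TRUE),
  id4  = sample(1:2, n, replace = TRUE)
)

# Write one folder per year, mirroring the layout described in the report
path <- file.path(tempdir(), "toy_parquet")
unlink(path, recursive = TRUE)
write_dataset(df, path, format = "parquet", partitioning = "year")

va <- open_dataset(path)

# Same pipeline as in the report, run on the whole (toy) dataset
res_count <- va %>%
  count(id1, id2, id3, id4) %>%
  count() %>%
  collect()

# Alternative formulation: keep the distinct key combinations, then count rows
res_distinct <- va %>%
  distinct(id1, id2, id3, id4) %>%
  summarise(n = n()) %>%
  collect()

# Reference value computed in base R on the in-memory data
expected <- nrow(unique(df[, c("id1", "id2", "id3", "id4")]))
```

Both `res_count` and `res_distinct` should be one-row, one-column tibbles whose value equals `expected`.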
I am running this on a Windows server with 512 GB of RAM.
sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_10.0.0 data.table_1.14.4 forcats_0.5.2 dplyr_1.0.10 purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
[9] ggplot2_3.3.6 tidyverse_1.3.2 gt_0.7.0 xtable_1.8-4 ggthemes_4.2.4 collapse_1.8.6 pryr_0.1.5 janitor_2.1.0
[17] tictoc_1.1 lubridate_1.8.0 stringr_1.4.1 readxl_1.4.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 assertthat_0.2.1 digest_0.6.30 utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 backports_1.4.1
[8] reprex_2.0.2 httr_1.4.4 pillar_1.8.1 rlang_1.0.6 googlesheets4_1.0.1 rstudioapi_0.14 googledrive_2.0.0
[15] bit_4.0.4 munsell_0.5.0 broom_1.0.1 compiler_4.2.1 modelr_0.1.9 pkgconfig_2.0.3 htmltools_0.5.3
[22] tidyselect_1.2.0 codetools_0.2-18 fansi_1.0.3 crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1 withr_2.5.0
[29] grid_4.2.1 jsonlite_1.8.3 gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3 magrittr_2.0.3 scales_1.2.1
[36] cli_3.4.1 stringi_1.7.8 fs_1.5.2 snakecase_0.11.0 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.3
[43] vctrs_0.5.0 tools_4.2.1 bit64_4.0.5 glue_1.6.2 hms_1.1.2 parallel_4.2.1 fastmap_1.1.0
[50] colorspace_2.0-3 gargle_1.2.1 rvest_1.0.3 haven_2.5.1
arrow_info()
Arrow package version: 10.0.0
Capabilities:
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
gcs TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc FALSE
mimalloc TRUE
Arrow options():
arrow.use_threads FALSE
Memory:
Allocator mimalloc
Current 74.82 Gb
Max 97.75 Gb
Runtime:
SIMD Level avx2
Detected SIMD Level avx2
Build:
C++ Library Version 10.0.0
C++ Compiler GNU
C++ Compiler Version 10.3.0
Git ID aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0