[ARROW-18372] [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.0.0
Fix Version/s: None
Component/s: R
Labels: None

External issue URL:
https://github.com/apache/arrow/issues/33539

Description

I have a large parquet dataset (900 million rows, 40 columns), subdivided into folders for each year. I was trying to calculate how many unique combinations of id1+id2+id3+id4 there are in the dataset.

Note that the collected result is supposed to be only one row and one cell, containing the count. I've confirmed this by subsetting the dataset ("%>% head(10^6)") before computing the count, and it works. That is why the error below is so weird.

```

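library(arrow)
library(dplyr)
library(tictoc)

fa <- 'myparteq folder' # path to the huge partitioned parquet dataset

va <- open_dataset(fa)

tic()
# Per the description, the head(10^6) subset shown here works;
# the error below presumably comes from the same pipeline without it.
d <- va %>% head(10^6) %>% count(id1, id2, id3, id4) %>% count() %>% collect()

toc()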

Error in `collect()`:
! Invalid: negative malloc size
Run `rlang::last_error()` to see where the error occurred.

> rlang::last_error()
<error/rlang_error>
Error in `collect()`:
! Invalid: negative malloc size
---
Backtrace:
1. ... %>% collect
3. arrow:::collect.arrow_dplyr_query(.)
Run `rlang::last_trace()` to see the full context.

> rlang::last_trace()
<error/rlang_error>
Error in `collect()`:
! Invalid: negative malloc size
---
Backtrace:
    x
 1. +-... %>% collect
 2. +-dplyr::collect(.)
 3. \-arrow:::collect.arrow_dplyr_query(.)
 4.   \-base::tryCatch(...)
 5.     \-base (local) tryCatchList(expr, classes, parentenv, handlers)
 6.       \-base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7.         \-value[[3L]](cond)
 8.           \-arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
 9.             \-rlang::abort(msg, call = call)

```
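To illustrate the shape of the result the pipeline is expected to return, here is a minimal sketch on a small in-memory data frame. The toy data and the name `toy` are assumptions made up for illustration and are not part of the real dataset; the sketch uses plain dplyr only, so it does not exercise Arrow's query engine and does not reproduce the error.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset (assumption)
toy <- data.frame(
  id1 = c(1, 1, 2, 2, 2),
  id2 = c("a", "a", "b", "b", "c"),
  id3 = c(10, 10, 20, 20, 20),
  id4 = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# First count(): one row per distinct (id1, id2, id3, id4) combination.
# Second count(): the number of those rows, i.e. a 1 x 1 result.
d <- toy %>% count(id1, id2, id3, id4) %>% count()

print(dim(d))
```

Whatever the size of the input, the final result should be a single row with a single cell, which is why a failure at the `collect()` step is unexpected.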

I am running this on a Windows server with 512 GB of RAM.

sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_10.0.0 data.table_1.14.4 forcats_0.5.2 dplyr_1.0.10 purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
[9] ggplot2_3.3.6 tidyverse_1.3.2 gt_0.7.0 xtable_1.8-4 ggthemes_4.2.4 collapse_1.8.6 pryr_0.1.5 janitor_2.1.0
[17] tictoc_1.1 lubridate_1.8.0 stringr_1.4.1 readxl_1.4.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 assertthat_0.2.1 digest_0.6.30 utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 backports_1.4.1
[8] reprex_2.0.2 httr_1.4.4 pillar_1.8.1 rlang_1.0.6 googlesheets4_1.0.1 rstudioapi_0.14 googledrive_2.0.0
[15] bit_4.0.4 munsell_0.5.0 broom_1.0.1 compiler_4.2.1 modelr_0.1.9 pkgconfig_2.0.3 htmltools_0.5.3
[22] tidyselect_1.2.0 codetools_0.2-18 fansi_1.0.3 crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1 withr_2.5.0
[29] grid_4.2.1 jsonlite_1.8.3 gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3 magrittr_2.0.3 scales_1.2.1
[36] cli_3.4.1 stringi_1.7.8 fs_1.5.2 snakecase_0.11.0 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.3
[43] vctrs_0.5.0 tools_4.2.1 bit64_4.0.5 glue_1.6.2 hms_1.1.2 parallel_4.2.1 fastmap_1.1.0
[50] colorspace_2.0-3 gargle_1.2.1 rvest_1.0.3 haven_2.5.1

arrow_info()
Arrow package version: 10.0.0

Capabilities:

dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
gcs TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc FALSE
mimalloc TRUE

Arrow options():

arrow.use_threads FALSE

Memory:

Allocator mimalloc
Current 74.82 Gb
Max 97.75 Gb

Runtime:

SIMD Level avx2
Detected SIMD Level avx2

Build:

C++ Library Version 10.0.0
C++ Compiler GNU
C++ Compiler Version 10.3.0
Git ID aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0

People

Assignee: Unassigned

Reporter: Lucas Mation

Votes: 0

Watchers: 3

Dates

Created: 21/Nov/22 12:00

Updated: 11/Jan/23 11:59