  1. Apache Arrow
  2. ARROW-18372

[R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.0.0
    • Fix Version/s: None
    • Component/s: R
    • Labels: None

    Description

      I have a large Parquet dataset (900 million rows, 40 columns), subdivided into one folder per year. I was trying to calculate how many unique combinations of id1+id2+id3+id4 there are in the dataset.

       

      Note that the "collected" result is supposed to be a single row with a single cell containing the count. I have confirmed this by subsetting the dataset ("%>% head(10^6)") before computing the count, and that works. That is why the error below is so strange.

      ```

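      library(arrow)
      library(dplyr)
      library(tictoc)

      fa <- 'myparteq folder' # huge
      va <- open_dataset(fa)

      tic()
      # Per the description above, the query works with head(10^6);
      # the error below appears when the full dataset is used.
      d <- va %>% head(10^6) %>% count(id1, id2, id3, id4) %>% count() %>% collect()
      toc()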

       

      Error in `collect()`:
      ! Invalid: negative malloc size
      Run `rlang::last_error()` to see where the error occurred.

       

      > rlang::last_error()
      <error/rlang_error>
      Error in `collect()`:
      ! Invalid: negative malloc size

      Backtrace:
       1. ... %>% collect
       3. arrow:::collect.arrow_dplyr_query(.)
      Run `rlang::last_trace()` to see the full context.

       

      > rlang::last_trace()
      <error/rlang_error>
      Error in `collect()`:
      ! Invalid: negative malloc size

      Backtrace:
          x
       1. +-... %>% collect
       2. +-dplyr::collect(.)
       3. -arrow:::collect.arrow_dplyr_query(.)
       4.   -base::tryCatch(...)
       5.     -base (local) tryCatchList(expr, classes, parentenv, handlers)
       6.       -base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
       7.         -value[[3L]](cond)
       8.           -arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
       9.             -rlang::abort(msg, call = call)

       

      ```

      I am running this on a Windows server with 512 GB of RAM.

       sessionInfo()
      R version 4.2.1 (2022-06-23 ucrt)
      Platform: x86_64-w64-mingw32/x64 (64-bit)
      Running under: Windows Server 2012 R2 x64 (build 9600)

      Matrix products: default

      locale:
      [1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
      [5] LC_TIME=Portuguese_Brazil.1252    

      attached base packages:
      [1] stats     graphics  grDevices utils     datasets  methods   base     

      other attached packages:
       [1] arrow_10.0.0      data.table_1.14.4 forcats_0.5.2     dplyr_1.0.10      purrr_0.3.5  readr_2.1.3       tidyr_1.2.1       tibble_3.1.8     
       [9] ggplot2_3.3.6     tidyverse_1.3.2   gt_0.7.0          xtable_1.8-4      ggthemes_4.2.4    collapse_1.8.6    pryr_0.1.5        janitor_2.1.0    
      [17] tictoc_1.1        lubridate_1.8.0   stringr_1.4.1     readxl_1.4.1     

      loaded via a namespace (and not attached):
       [1] Rcpp_1.0.9          assertthat_0.2.1    digest_0.6.30       utf8_1.2.2          R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
       [8] reprex_2.0.2        httr_1.4.4          pillar_1.8.1        rlang_1.0.6         googlesheets4_1.0.1 rstudioapi_0.14     googledrive_2.0.0  
      [15] bit_4.0.4           munsell_0.5.0       broom_1.0.1         compiler_4.2.1      modelr_0.1.9        pkgconfig_2.0.3     htmltools_0.5.3    
      [22] tidyselect_1.2.0    codetools_0.2-18    fansi_1.0.3         crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1        withr_2.5.0        
      [29] grid_4.2.1          jsonlite_1.8.3      gtable_0.3.1        lifecycle_1.0.3     DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
      [36] cli_3.4.1           stringi_1.7.8       fs_1.5.2            snakecase_0.11.0    xml2_1.3.3          ellipsis_0.3.2      generics_0.1.3     
      [43] vctrs_0.5.0         tools_4.2.1         bit64_4.0.5         glue_1.6.2          hms_1.1.2           parallel_4.2.1      fastmap_1.1.0      
      [50] colorspace_2.0-3    gargle_1.2.1        rvest_1.0.3         haven_2.5.1    

       

       arrow_info()
      Arrow package version: 10.0.0

      Capabilities:
                     
      dataset    TRUE
      substrait FALSE
      parquet    TRUE
      json       TRUE
      s3         TRUE
      gcs        TRUE
      utf8proc   TRUE
      re2        TRUE
      snappy     TRUE
      gzip       TRUE
      brotli     TRUE
      zstd       TRUE
      lz4        TRUE
      lz4_frame  TRUE
      lzo       FALSE
      bz2        TRUE
      jemalloc  FALSE
      mimalloc   TRUE

      Arrow options():
                             
      arrow.use_threads FALSE

      Memory:
                        
      Allocator mimalloc
      Current   74.82 Gb
      Max       97.75 Gb

      Runtime:
                              
      SIMD Level          avx2
      Detected SIMD Level avx2

      Build:
                                                                   
      C++ Library Version                                    10.0.0
      C++ Compiler                                              GNU
      C++ Compiler Version                                   10.3.0
      Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0

       

          People

            Assignee: Unassigned
            Reporter: Lucas Mation (lucasmation)