Apache Arrow / ARROW-15201

[R] Problem counting number of records of a parquet dataset created using Spark

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0.1
    • Fix Version/s: None
    • Component/s: R
    • Labels: None

    Description

      When I open a dataset of Parquet files created by Spark, I cannot get a count of the number of records: the process hangs with 100% CPU usage.

      If I use DuckDB (via to_duckdb()) to perform the count, the operation completes as expected.

      The example below reproduces the problem:

      library(tidyverse) # v 1.3.1
      library(arrow) # v 6.0.1
      library(duckdb) # v 0.3.1-1
      library(sparklyr) # v 1.7.3
      
      # Using Spark: 3.0.0, but the same occurs when using Spark 2.4
      sc <- spark_connect(master = "local")
      
      # Create a simple data frame and save it to parquet using Spark
      test_df <- tibble(a = 1:10e6)
      test_spark_tbl <- copy_to(sc, test_df)
      spark_write_parquet(test_spark_tbl, path="test")
      
      test_arrow_ds <- open_dataset(sources = "test")
      
      # This works as expected
      system.time(
        test_arrow_ds %>% 
          to_duckdb() %>% 
          count() 
      )
      #  user  system elapsed 
      #  0.039   0.040   0.065 
      
      
      # The following will hang the process with 100% CPU usage 
      test_arrow_ds %>% 
        count() %>% 
        collect()
      

       
      The session information:

      R version 4.1.2 (2021-11-01)
      Platform: x86_64-apple-darwin17.0 (64-bit)
      Running under: macOS Monterey 12.1
      
      Matrix products: default
      LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
      
      locale:
      [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
      
      attached base packages:
      [1] stats     graphics  grDevices utils     datasets  methods   base     
      
      other attached packages:
       [1] sparklyr_1.7.3  duckdb_0.3.1-1  DBI_1.1.2       arrow_6.0.1    
       [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
       [9] readr_2.1.1     tidyr_1.1.4     tibble_3.1.6    ggplot2_3.3.5  
      [13] tidyverse_1.3.1
      
      loaded via a namespace (and not attached):
       [1] Rcpp_1.0.7        lubridate_1.8.0   forge_0.2.0       rprojroot_2.0.2  
       [5] assertthat_0.2.1  digest_0.6.29     utf8_1.2.2        R6_2.5.1         
       [9] cellranger_1.1.0  backports_1.4.1   reprex_2.0.1      evaluate_0.14    
      [13] httr_1.4.2        pillar_1.6.4      rlang_0.4.12      readxl_1.3.1     
      [17] rstudioapi_0.13   blob_1.2.2        rmarkdown_2.11    htmlwidgets_1.5.4
      [21] r2d3_0.2.5        bit_4.0.4         munsell_0.5.0     broom_0.7.10     
      [25] compiler_4.1.2    modelr_0.1.8      xfun_0.29         pkgconfig_2.0.3  
      [29] base64enc_0.1-3   htmltools_0.5.2   tidyselect_1.1.1  fansi_0.5.0      
      [33] crayon_1.4.2      tzdb_0.2.0        dbplyr_2.1.1      withr_2.4.3      
      [37] grid_4.1.2        jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.1  
      [41] magrittr_2.0.1    scales_1.1.1      cli_3.1.0         stringi_1.7.6    
      [45] fs_1.5.2          xml2_1.3.3        ellipsis_0.3.2    generics_0.1.1   
      [49] vctrs_0.3.8       tools_4.1.2       bit64_4.0.5       glue_1.6.0       
      [53] hms_1.1.1         fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-2 
      [57] rvest_1.0.2       knitr_1.37        haven_2.4.3      
      

      I can also reproduce this on a Linux machine.

          People

            Assignee: Unassigned
            Reporter: Nelson Areal (nareal)
            Votes: 0
            Watchers: 4
