Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 6.0.1
Description
When I open a dataset of Parquet files created by Spark, I cannot get a count of the number of records: the process hangs at 100% CPU usage.
If I use DuckDB (to_duckdb) to perform the count, the operation completes as expected.
The example below reproduces the problem:
library(tidyverse) # v 1.3.1
library(arrow)     # v 6.0.1
library(duckdb)    # v 0.3.1-1
library(sparklyr)  # v 1.7.3

# Using Spark: 3.0.0, but the same occurs when using Spark 2.4
sc <- spark_connect(master = "local")

# Create a simple data frame and save it to parquet using Spark
test_df <- tibble(a = 1:10e6)
test_spark_tbl <- copy_to(sc, test_df)
spark_write_parquet(test_spark_tbl, path="test")

test_arrow_ds <- open_dataset(sources = "test")

# This works as expected
system.time(
  test_arrow_ds %>%
    to_duckdb() %>%
    count()
)
#   user  system elapsed
#  0.039   0.040   0.065

# The following will hang the process with 100% CPU usage
test_arrow_ds %>%
  count() %>%
  collect()
The session information:
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] sparklyr_1.7.3  duckdb_0.3.1-1  DBI_1.1.2       arrow_6.0.1
 [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4
 [9] readr_2.1.1     tidyr_1.1.4     tibble_3.1.6    ggplot2_3.3.5
[13] tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7        lubridate_1.8.0   forge_0.2.0       rprojroot_2.0.2
 [5] assertthat_0.2.1  digest_0.6.29     utf8_1.2.2        R6_2.5.1
 [9] cellranger_1.1.0  backports_1.4.1   reprex_2.0.1      evaluate_0.14
[13] httr_1.4.2        pillar_1.6.4      rlang_0.4.12      readxl_1.3.1
[17] rstudioapi_0.13   blob_1.2.2        rmarkdown_2.11    htmlwidgets_1.5.4
[21] r2d3_0.2.5        bit_4.0.4         munsell_0.5.0     broom_0.7.10
[25] compiler_4.1.2    modelr_0.1.8      xfun_0.29         pkgconfig_2.0.3
[29] base64enc_0.1-3   htmltools_0.5.2   tidyselect_1.1.1  fansi_0.5.0
[33] crayon_1.4.2      tzdb_0.2.0        dbplyr_2.1.1      withr_2.4.3
[37] grid_4.1.2        jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.1
[41] magrittr_2.0.1    scales_1.1.1      cli_3.1.0         stringi_1.7.6
[45] fs_1.5.2          xml2_1.3.3        ellipsis_0.3.2    generics_0.1.1
[49] vctrs_0.3.8       tools_4.1.2       bit64_4.0.5       glue_1.6.0
[53] hms_1.1.1         fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-2
[57] rvest_1.0.2       knitr_1.37        haven_2.4.3
I can also reproduce this on a Linux machine.