  1. Apache Arrow
  2. ARROW-13293

[R] open_dataset followed by collect hangs (while compute works)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0.1
    • Fix Version/s: None
    • Component/s: R
    • Labels: None
    • Environment: Windows 10 (see also session info included in reprex)

      Description

      I tried to make a reproducible example using the iris dataset, but everything works as expected with that dataset. The issue may therefore be specific to the dataset I am using, which contains over 100 columns. The example below illustrates the issue.

      The parquet data used in the example can be downloaded from this link

      The issue I see is the following:

      • Calling open_dataset() %>% filter() %>% collect() hangs on my machine. I would expect a tibble of 1,646 rows x 116 columns (plus the year partition column) to be returned very quickly.
      • The two alternative calls seem to work as expected: read_parquet() on the specific parquet file within the dataset on which I filter, and the same dataset query ending in compute() instead of collect().
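
The comparison described above can be sketched on a small dataset. This is only an illustration under stated assumptions: iris, written to a temporary directory as a hive-partitioned parquet dataset, stands in for the real data, and the names `ds_path`, `iris_chr` and `query` are hypothetical. As noted, the hang does not reproduce on iris, so this shows the shape of the three calls rather than the failure itself.

```r
library(dplyr)
library(arrow)

# Assumption: iris is a stand-in for the real (much wider) dataset.
# Species is converted to character so it is a plain string partition key.
iris_chr <- transform(iris, Species = as.character(Species))

# Write a hive-style partitioned dataset (Species=<value>/ directories).
ds_path <- file.path(tempdir(), "iris_hive")
write_dataset(iris_chr, ds_path, format = "parquet", partitioning = "Species")

# Lazy query against the dataset; nothing is read yet.
query <- open_dataset(ds_path) %>%
  filter(Species == "setosa")

# compute() evaluates the query and keeps the result as an Arrow Table.
tab <- compute(query)

# Converting the Table to an R data frame as a separate, explicit step.
df_from_table <- as.data.frame(tab)

# collect() evaluates the query and returns the result as an R data frame.
df_collected <- collect(query)

dim(df_from_table)
dim(df_collected)
```

Running `compute()` and `as.data.frame()` as two separate steps on the real data might help narrow down where the time is spent. If the conversion step alone also hangs, that would point at the Arrow-to-R conversion rather than the dataset scan, but that is an assumption and not something this report confirms.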

       

      ``` r
      library(dplyr)
      #>
      #> Attaching package: 'dplyr'
      #> The following objects are masked from 'package:stats':
      #>
      #> filter, lag
      #> The following objects are masked from 'package:base':
      #>
      #> intersect, setdiff, setequal, union
      library(arrow)
      #>
      #> Attaching package: 'arrow'
      #> The following object is masked from 'package:utils':
      #>
      #> timestamp

      read_parquet("data/lucas_harmonised/1_table/parquet_hive/year=2018/part-4.parquet") %>%
        filter(nuts1 == "BE2")
      #> # A tibble: 1,646 x 116
      #> id point_id nuts0 nuts1 nuts2 nuts3 th_lat th_long office_pi ex_ante
      #> <int> <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
      #> 1 199451 39803106 BE BE2 BE22 BE221 51.0 5.14 1 0
      #> 2 220669 39623116 BE BE2 BE21 BE213 51.0 4.88 1 0
      #> 3 215557 39483154 BE BE2 BE21 BE211 51.4 4.64 1 0
      #> 4 223579 40303122 BE BE2 BE22 BE222 51.1 5.84 1 0
      #> 5 331079 39783134 BE BE2 BE21 BE213 51.2 5.09 0 0
      #> 6 225417 39403150 BE BE2 BE21 BE211 51.3 4.53 1 0
      #> 7 3340 38863118 BE BE2 BE23 BE234 51.0 3.79 1 0
      #> 8 137361 38143132 BE BE2 BE25 BE258 51.1 2.75 1 0
      #> 9 221861 38343148 BE BE2 BE25 BE255 51.2 3.02 1 0
      #> 10 787 39523148 BE BE2 BE21 BE211 51.3 4.70 1 0
      #> # ... with 1,636 more rows, and 106 more variables: survey_date <chr>,
      #> # car_latitude <dbl>, car_ew <chr>, car_longitude <dbl>, gps_proj <chr>,
      #> # gps_prec <int>, gps_altitude <int>, gps_lat <dbl>, gps_ew <chr>,
      #> # gps_long <dbl>, obs_dist <dbl>, obs_direct <chr>, obs_type <chr>,
      #> # obs_radius <chr>, letter_group <chr>, lc1 <chr>, lc1_label <chr>,
      #> # lc1_spec <chr>, lc1_spec_label <chr>, lc1_perc <chr>, lc2 <chr>,
      #> # lc2_label <chr>, lc2_spec <chr>, lc2_spec_label <chr>, lc2_perc <chr>,
      #> # lu1 <chr>, lu1_label <chr>, lu1_type <chr>, lu1_type_label <chr>,
      #> # lu1_perc <chr>, lu2 <chr>, lu2_label <chr>, lu2_type <chr>,
      #> # lu2_type_label <chr>, lu2_perc <chr>, parcel_area_ha <chr>,
      #> # tree_height_maturity <chr>, tree_height_survey <chr>, feature_width <chr>,
      #> # lm_stone_walls <chr>, crop_residues <chr>, lm_grass_margins <chr>,
      #> # grazing <chr>, special_status <chr>, lc_lu_special_remark <chr>,
      #> # cprn_cando <chr>, cprn_lc <chr>, cprn_lc_label <chr>, cprn_lc1n <int>,
      #> # cprnc_lc1e <int>, cprnc_lc1s <int>, cprnc_lc1w <int>,
      #> # cprn_lc1n_brdth <int>, cprn_lc1e_brdth <int>, cprn_lc1s_brdth <int>,
      #> # cprn_lc1w_brdth <int>, cprn_lc1n_next <chr>, cprn_lc1s_next <chr>,
      #> # cprn_lc1e_next <chr>, cprn_lc1w_next <chr>, cprn_urban <chr>,
      #> # cprn_impervious_perc <int>, inspire_plcc1 <int>, inspire_plcc2 <int>,
      #> # inspire_plcc3 <int>, inspire_plcc4 <int>, inspire_plcc5 <int>,
      #> # inspire_plcc6 <int>, inspire_plcc7 <int>, inspire_plcc8 <int>,
      #> # eunis_complex <chr>, grassland_sample <chr>, grass_cando <chr>, wm <chr>,
      #> # wm_source <chr>, wm_type <chr>, wm_delivery <chr>, erosion_cando <chr>,
      #> # soil_stones_perc <chr>, bio_sample <chr>, soil_bio_taken <chr>,
      #> # bulk0_10_sample <chr>, soil_blk_0_10_taken <chr>, bulk10_20_sample <chr>,
      #> # soil_blk_10_20_taken <chr>, bulk20_30_sample <chr>,
      #> # soil_blk_20_30_taken <chr>, standard_sample <chr>, soil_std_taken <chr>,
      #> # organic_sample <chr>, soil_org_depth_cando <chr>, soil_taken <chr>,
      #> # soil_crop <chr>, photo_point <chr>, photo_north <chr>, photo_south <chr>,
      #> # photo_east <chr>, photo_west <chr>, transect <chr>, revisit <int>, ...

      open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
        filter(nuts1 == "BE2", year == 2018) %>%
        compute()
      #> Table
      #> 1646 rows x 117 columns
      #> $id <int64>
      #> $point_id <int64>
      #> $nuts0 <string>
      #> $nuts1 <string>
      #> $nuts2 <string>
      #> $nuts3 <string>
      #> $th_lat <double>
      #> $th_long <double>
      #> $office_pi <string>
      #> $ex_ante <string>
      #> $survey_date <string>
      #> $car_latitude <double>
      #> $car_ew <string>
      #> $car_longitude <double>
      #> $gps_proj <string>
      #> $gps_prec <int64>
      #> $gps_altitude <int64>
      #> $gps_lat <double>
      #> $gps_ew <string>
      #> $gps_long <double>
      #> $obs_dist <double>
      #> $obs_direct <string>
      #> $obs_type <string>
      #> $obs_radius <string>
      #> $letter_group <string>
      #> $lc1 <string>
      #> $lc1_label <string>
      #> $lc1_spec <string>
      #> $lc1_spec_label <string>
      #> $lc1_perc <string>
      #> $lc2 <string>
      #> $lc2_label <string>
      #> $lc2_spec <string>
      #> $lc2_spec_label <string>
      #> $lc2_perc <string>
      #> $lu1 <string>
      #> $lu1_label <string>
      #> $lu1_type <string>
      #> $lu1_type_label <string>
      #> $lu1_perc <string>
      #> $lu2 <string>
      #> $lu2_label <string>
      #> $lu2_type <string>
      #> $lu2_type_label <string>
      #> $lu2_perc <string>
      #> $parcel_area_ha <string>
      #> $tree_height_maturity <string>
      #> $tree_height_survey <string>
      #> $feature_width <string>
      #> $lm_stone_walls <string>
      #> $crop_residues <string>
      #> $lm_grass_margins <string>
      #> $grazing <string>
      #> $special_status <string>
      #> $lc_lu_special_remark <string>
      #> $cprn_cando <string>
      #> $cprn_lc <string>
      #> $cprn_lc_label <string>
      #> $cprn_lc1n <int64>
      #> $cprnc_lc1e <int64>
      #> $cprnc_lc1s <int64>
      #> $cprnc_lc1w <int64>
      #> $cprn_lc1n_brdth <int64>
      #> $cprn_lc1e_brdth <int64>
      #> $cprn_lc1s_brdth <int64>
      #> $cprn_lc1w_brdth <int64>
      #> $cprn_lc1n_next <string>
      #> $cprn_lc1s_next <string>
      #> $cprn_lc1e_next <string>
      #> $cprn_lc1w_next <string>
      #> $cprn_urban <string>
      #> $cprn_impervious_perc <int64>
      #> $inspire_plcc1 <int64>
      #> $inspire_plcc2 <int64>
      #> $inspire_plcc3 <int64>
      #> $inspire_plcc4 <int64>
      #> $inspire_plcc5 <int64>
      #> $inspire_plcc6 <int64>
      #> $inspire_plcc7 <int64>
      #> $inspire_plcc8 <int64>
      #> $eunis_complex <string>
      #> $grassland_sample <string>
      #> $grass_cando <string>
      #> $wm <string>
      #> $wm_source <string>
      #> $wm_type <string>
      #> $wm_delivery <string>
      #> $erosion_cando <string>
      #> $soil_stones_perc <string>
      #> $bio_sample <string>
      #> $soil_bio_taken <string>
      #> $bulk0_10_sample <string>
      #> $soil_blk_0_10_taken <string>
      #> $bulk10_20_sample <string>
      #> $soil_blk_10_20_taken <string>
      #> $bulk20_30_sample <string>
      #> $soil_blk_20_30_taken <string>
      #> $standard_sample <string>
      #> $soil_std_taken <string>
      #> $organic_sample <string>
      #> $soil_org_depth_cando <string>
      #> $soil_taken <string>
      #> $soil_crop <string>
      #> $photo_point <string>
      #> $photo_north <string>
      #> $photo_south <string>
      #> $photo_east <string>
      #> $photo_west <string>
      #> $transect <string>
      #> $revisit <int64>
      #> $th_gps_dist <double>
      #> $file_path_gisco_north <string>
      #> $file_path_gisco_south <string>
      #> $file_path_gisco_east <string>
      #> $file_path_gisco_west <string>
      #> $file_path_gisco_point <string>
      #> $year <int32>

      # open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
      #   filter(nuts1 == "BE2", year == 2018) %>%
      #   collect()
      # not run: this will hang
        ```

      <sup>Created on 2021-07-09 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>

      <details style="margin-bottom:10px;">
      <summary>
      Session info
      </summary>

      ``` r
      sessioninfo::session_info()
      #> - Session info ---------------------------------------------------------------
      #> setting value
      #> version R version 4.1.0 (2021-05-18)
      #> os Windows 10 x64
      #> system x86_64, mingw32
      #> ui RTerm
      #> language (EN)
      #> collate Dutch_Belgium.1252
      #> ctype Dutch_Belgium.1252
      #> tz Europe/Paris
      #> date 2021-07-09
      #>
      #> - Packages -------------------------------------------------------------------
      #> package * version date lib source
      #> arrow * 4.0.1 2021-05-28 [1] CRAN (R 4.1.0)
      #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
      #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
      #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
      #> cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.5)
      #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
      #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
      #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
      #> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.0.5)
      #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
      #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
      #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.0.5)
      #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
      #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
      #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
      #> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
      #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
      #> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
      #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
      #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
      #> pillar 1.6.1 2021-05-16 [1] CRAN (R 4.1.0)
      #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
      #> ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
      #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
      #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.1.0)
      #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.1.0)
      #> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
      #> rmarkdown 2.9 2021-06-15 [1] CRAN (R 4.0.5)
      #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
      #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
      #> stringi 1.6.2 2021-05-17 [1] CRAN (R 4.0.5)
      #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
      #> tibble 3.1.2 2021-05-16 [1] CRAN (R 4.1.0)
      #> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
      #> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.1.0)
      #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
      #> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
      #> xfun 0.24 2021-06-15 [1] CRAN (R 4.0.5)
      #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
      #>
      #> [1] C:/R/library
      #> [2] C:/R/R-4.1.0/library
      ```

      </details>

              People

              • Assignee: Unassigned
              • Reporter: Hans Van Calster (hansvc)