Details
- Bug
- Status: Resolved
- Minor
- Resolution: Fixed
- 4.0.1
- None
- Windows 10 (see also session info included in reprex)
Description
I tried to build a reproducible example with the iris dataset, but everything works as expected there, so the issue may be specific to the dataset I am using (which contains over 100 columns). The example below illustrates the issue.
The parquet data used in the example can be downloaded from this link
The issue I see is the following:
- Calling open_dataset() %>% filter() %>% collect() hangs on my machine, whereas I would expect a 1,646 x 116 tibble to be returned very quickly.
- Two alternative calls seem to work as expected: one uses read_parquet() on the specific parquet file within the Dataset that the filter selects, and the other uses compute() instead of collect().
``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
read_parquet("data/lucas_harmonised/1_table/parquet_hive/year=2018/part-4.parquet") %>%
  filter(nuts1 == "BE2")
#> # A tibble: 1,646 x 116
#> id point_id nuts0 nuts1 nuts2 nuts3 th_lat th_long office_pi ex_ante
#> <int> <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 199451 39803106 BE BE2 BE22 BE221 51.0 5.14 1 0
#> 2 220669 39623116 BE BE2 BE21 BE213 51.0 4.88 1 0
#> 3 215557 39483154 BE BE2 BE21 BE211 51.4 4.64 1 0
#> 4 223579 40303122 BE BE2 BE22 BE222 51.1 5.84 1 0
#> 5 331079 39783134 BE BE2 BE21 BE213 51.2 5.09 0 0
#> 6 225417 39403150 BE BE2 BE21 BE211 51.3 4.53 1 0
#> 7 3340 38863118 BE BE2 BE23 BE234 51.0 3.79 1 0
#> 8 137361 38143132 BE BE2 BE25 BE258 51.1 2.75 1 0
#> 9 221861 38343148 BE BE2 BE25 BE255 51.2 3.02 1 0
#> 10 787 39523148 BE BE2 BE21 BE211 51.3 4.70 1 0
#> # ... with 1,636 more rows, and 106 more variables: survey_date <chr>,
#> # car_latitude <dbl>, car_ew <chr>, car_longitude <dbl>, gps_proj <chr>,
#> # gps_prec <int>, gps_altitude <int>, gps_lat <dbl>, gps_ew <chr>,
#> # gps_long <dbl>, obs_dist <dbl>, obs_direct <chr>, obs_type <chr>,
#> # obs_radius <chr>, letter_group <chr>, lc1 <chr>, lc1_label <chr>,
#> # lc1_spec <chr>, lc1_spec_label <chr>, lc1_perc <chr>, lc2 <chr>,
#> # lc2_label <chr>, lc2_spec <chr>, lc2_spec_label <chr>, lc2_perc <chr>,
#> # lu1 <chr>, lu1_label <chr>, lu1_type <chr>, lu1_type_label <chr>,
#> # lu1_perc <chr>, lu2 <chr>, lu2_label <chr>, lu2_type <chr>,
#> # lu2_type_label <chr>, lu2_perc <chr>, parcel_area_ha <chr>,
#> # tree_height_maturity <chr>, tree_height_survey <chr>, feature_width <chr>,
#> # lm_stone_walls <chr>, crop_residues <chr>, lm_grass_margins <chr>,
#> # grazing <chr>, special_status <chr>, lc_lu_special_remark <chr>,
#> # cprn_cando <chr>, cprn_lc <chr>, cprn_lc_label <chr>, cprn_lc1n <int>,
#> # cprnc_lc1e <int>, cprnc_lc1s <int>, cprnc_lc1w <int>,
#> # cprn_lc1n_brdth <int>, cprn_lc1e_brdth <int>, cprn_lc1s_brdth <int>,
#> # cprn_lc1w_brdth <int>, cprn_lc1n_next <chr>, cprn_lc1s_next <chr>,
#> # cprn_lc1e_next <chr>, cprn_lc1w_next <chr>, cprn_urban <chr>,
#> # cprn_impervious_perc <int>, inspire_plcc1 <int>, inspire_plcc2 <int>,
#> # inspire_plcc3 <int>, inspire_plcc4 <int>, inspire_plcc5 <int>,
#> # inspire_plcc6 <int>, inspire_plcc7 <int>, inspire_plcc8 <int>,
#> # eunis_complex <chr>, grassland_sample <chr>, grass_cando <chr>, wm <chr>,
#> # wm_source <chr>, wm_type <chr>, wm_delivery <chr>, erosion_cando <chr>,
#> # soil_stones_perc <chr>, bio_sample <chr>, soil_bio_taken <chr>,
#> # bulk0_10_sample <chr>, soil_blk_0_10_taken <chr>, bulk10_20_sample <chr>,
#> # soil_blk_10_20_taken <chr>, bulk20_30_sample <chr>,
#> # soil_blk_20_30_taken <chr>, standard_sample <chr>, soil_std_taken <chr>,
#> # organic_sample <chr>, soil_org_depth_cando <chr>, soil_taken <chr>,
#> # soil_crop <chr>, photo_point <chr>, photo_north <chr>, photo_south <chr>,
#> # photo_east <chr>, photo_west <chr>, transect <chr>, revisit <int>, ...
open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
  filter(nuts1 == "BE2", year == 2018) %>%
  compute()
#> Table
#> 1646 rows x 117 columns
#> $id <int64>
#> $point_id <int64>
#> $nuts0 <string>
#> $nuts1 <string>
#> $nuts2 <string>
#> $nuts3 <string>
#> $th_lat <double>
#> $th_long <double>
#> $office_pi <string>
#> $ex_ante <string>
#> $survey_date <string>
#> $car_latitude <double>
#> $car_ew <string>
#> $car_longitude <double>
#> $gps_proj <string>
#> $gps_prec <int64>
#> $gps_altitude <int64>
#> $gps_lat <double>
#> $gps_ew <string>
#> $gps_long <double>
#> $obs_dist <double>
#> $obs_direct <string>
#> $obs_type <string>
#> $obs_radius <string>
#> $letter_group <string>
#> $lc1 <string>
#> $lc1_label <string>
#> $lc1_spec <string>
#> $lc1_spec_label <string>
#> $lc1_perc <string>
#> $lc2 <string>
#> $lc2_label <string>
#> $lc2_spec <string>
#> $lc2_spec_label <string>
#> $lc2_perc <string>
#> $lu1 <string>
#> $lu1_label <string>
#> $lu1_type <string>
#> $lu1_type_label <string>
#> $lu1_perc <string>
#> $lu2 <string>
#> $lu2_label <string>
#> $lu2_type <string>
#> $lu2_type_label <string>
#> $lu2_perc <string>
#> $parcel_area_ha <string>
#> $tree_height_maturity <string>
#> $tree_height_survey <string>
#> $feature_width <string>
#> $lm_stone_walls <string>
#> $crop_residues <string>
#> $lm_grass_margins <string>
#> $grazing <string>
#> $special_status <string>
#> $lc_lu_special_remark <string>
#> $cprn_cando <string>
#> $cprn_lc <string>
#> $cprn_lc_label <string>
#> $cprn_lc1n <int64>
#> $cprnc_lc1e <int64>
#> $cprnc_lc1s <int64>
#> $cprnc_lc1w <int64>
#> $cprn_lc1n_brdth <int64>
#> $cprn_lc1e_brdth <int64>
#> $cprn_lc1s_brdth <int64>
#> $cprn_lc1w_brdth <int64>
#> $cprn_lc1n_next <string>
#> $cprn_lc1s_next <string>
#> $cprn_lc1e_next <string>
#> $cprn_lc1w_next <string>
#> $cprn_urban <string>
#> $cprn_impervious_perc <int64>
#> $inspire_plcc1 <int64>
#> $inspire_plcc2 <int64>
#> $inspire_plcc3 <int64>
#> $inspire_plcc4 <int64>
#> $inspire_plcc5 <int64>
#> $inspire_plcc6 <int64>
#> $inspire_plcc7 <int64>
#> $inspire_plcc8 <int64>
#> $eunis_complex <string>
#> $grassland_sample <string>
#> $grass_cando <string>
#> $wm <string>
#> $wm_source <string>
#> $wm_type <string>
#> $wm_delivery <string>
#> $erosion_cando <string>
#> $soil_stones_perc <string>
#> $bio_sample <string>
#> $soil_bio_taken <string>
#> $bulk0_10_sample <string>
#> $soil_blk_0_10_taken <string>
#> $bulk10_20_sample <string>
#> $soil_blk_10_20_taken <string>
#> $bulk20_30_sample <string>
#> $soil_blk_20_30_taken <string>
#> $standard_sample <string>
#> $soil_std_taken <string>
#> $organic_sample <string>
#> $soil_org_depth_cando <string>
#> $soil_taken <string>
#> $soil_crop <string>
#> $photo_point <string>
#> $photo_north <string>
#> $photo_south <string>
#> $photo_east <string>
#> $photo_west <string>
#> $transect <string>
#> $revisit <int64>
#> $th_gps_dist <double>
#> $file_path_gisco_north <string>
#> $file_path_gisco_south <string>
#> $file_path_gisco_east <string>
#> $file_path_gisco_west <string>
#> $file_path_gisco_point <string>
#> $year <int32>
# open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
#   filter(nuts1 == "BE2", year == 2018) %>%
#   collect()
# not run: this will hang
```
<sup>Created on 2021-07-09 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
<details style="margin-bottom:10px;">
<summary>
Session info
</summary>
``` r
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.1.0 (2021-05-18)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate Dutch_Belgium.1252
#> ctype Dutch_Belgium.1252
#> tz Europe/Paris
#> date 2021-07-09
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> arrow * 4.0.1 2021-05-28 [1] CRAN (R 4.1.0)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
#> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
#> cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.5)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
#> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.0.5)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
#> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.0.5)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
#> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
#> pillar 1.6.1 2021-05-16 [1] CRAN (R 4.1.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
#> ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.1.0)
#> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.1.0)
#> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
#> rmarkdown 2.9 2021-06-15 [1] CRAN (R 4.0.5)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
#> stringi 1.6.2 2021-05-17 [1] CRAN (R 4.0.5)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
#> tibble 3.1.2 2021-05-16 [1] CRAN (R 4.1.0)
#> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
#> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.1.0)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
#> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
#> xfun 0.24 2021-06-15 [1] CRAN (R 4.0.5)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
#>
#> [1] C:/R/library
#> [2] C:/R/R-4.1.0/library
```
</details>
Issue Links
- duplicates
  - ARROW-11579 [R] read_feather hanging on Windows (Resolved)
- relates to
  - ARROW-8379 [R] Investigate/fix thread safety issues (esp. Windows) (Resolved)
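Because both linked issues concern threading on Windows, one thing that may be worth trying on an affected version is turning off Arrow's multithreading before calling collect(). The sketch below is an assumption, not a confirmed fix for this report: it sets the arrow package option `arrow.use_threads` and runs the same open_dataset() %>% filter() %>% collect() pattern against a small stand-in dataset written from `mtcars` to a temporary directory. The path `ds_path` and the stand-in data are illustrative only and are not the LUCAS data used above.

```r
library(arrow)
library(dplyr)

# Assumption: disabling multithreading may avoid the hang described above.
# This is not verified against the original dataset.
options(arrow.use_threads = FALSE)

# Hypothetical stand-in for the hive-partitioned dataset in the report:
# write mtcars as parquet files partitioned by `cyl` (cyl=4/, cyl=6/, cyl=8/).
ds_path <- file.path(tempdir(), "mtcars_hive")
write_dataset(mtcars, ds_path, format = "parquet", partitioning = "cyl")

# Same call pattern as in the report: open the dataset, filter on the
# partition key, and pull the result into R as a data frame.
result <- open_dataset(ds_path) %>%
  filter(cyl == 4) %>%
  collect()

nrow(result)
```

If collect() returns with threading disabled but hangs with the default setting, that would point towards the same threading problem tracked in the linked issues.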