Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
5.0.0
-
sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_5.0.0.2 dplyr_1.0.5 magrittr_2.0.1 targets_0.6.0
loaded via a namespace (and not attached):
[1] httr_1.4.2 rnaturalearth_0.1.0 sass_0.4.0 tidyr_1.1.3
[5] jsonlite_1.7.2 bit64_4.0.5 bslib_0.2.5.1 assertthat_0.2.1
[9] askpass_1.1 sp_1.4-5 blob_1.2.1 renv_0.13.2
[13] yaml_2.2.1 globals_0.14.0 pillar_1.5.1 RSQLite_2.2.7
[17] lattice_0.20-41 glue_1.4.2 digest_0.6.27 htmltools_0.5.1.1
[21] pkgconfig_2.0.3 RPostgres_1.3.2 listenv_0.8.0 config_0.3.1
[25] purrr_0.3.4 processx_3.5.1 openssl_1.4.3 tibble_3.1.0
[29] proxy_0.4-25 aws.s3_0.3.21 colourvalues_0.3.7 generics_0.1.0
[33] ellipsis_0.3.1 cachem_1.0.5 withr_2.4.1 furrr_0.2.3
[37] cli_2.4.0 crayon_1.4.1 memoise_2.0.0 evaluate_0.14
[41] ps_1.6.0 fs_1.5.0 future_1.21.0 fansi_0.4.2
[45] parallelly_1.25.0 xml2_1.3.2 class_7.3-18 rsconnect_0.8.18
[49] tools_4.0.5 data.table_1.14.0 hms_1.0.0 lifecycle_1.0.0
[53] stringr_1.4.0 callr_3.6.0 jquerylib_0.1.4 compiler_4.0.5
[57] e1071_1.7-6 rlang_0.4.10 classInt_0.4-3 units_0.7-1
[61] grid_4.0.5 rstudioapi_0.13 visNetwork_2.0.9 htmlwidgets_1.5.3
[65] aws.signature_0.6.0 crosstalk_1.1.1 igraph_1.2.6 base64enc_0.1-3
[69] rmarkdown_2.7 codetools_0.2-18 DBI_1.1.1 curl_4.3
[73] R6_2.5.0 lubridate_1.7.10 knitr_1.31 fastmap_1.1.0
[77] rgeos_0.5-5 bit_4.0.4 utf8_1.2.1 tarchetypes_0.2.1
[81] readr_1.4.0 KernSmooth_2.23-18 stringi_1.5.3 parallel_4.0.5
[85] Rcpp_1.0.6 vctrs_0.3.7 sf_0.9-8 leaflet_2.0.4.1
[89] dbplyr_2.1.1 tidyselect_1.1.0 xfun_0.22
Description
Using open_dataset() on a CSV without a header row, followed by collect(), results in either a tibble of `NA`s or an error, depending on whether the first row of data contains duplicate values. This affects reading a single file as well as a directory of files.
Here we use the `diamonds` data, where the first row of data does not have any repeated values.
library(arrow)
library(magrittr)

data(diamonds, package='ggplot2')
readr::write_csv(head(diamonds), file='diamonds_with_header.csv', col_names=TRUE)
readr::write_csv(head(diamonds), file='diamonds_without_header.csv', col_names=FALSE)

diamond_schema <- schema(
    carat=float32()
    , cut=string()
    , color=string()
    , clarity=string()
    , depth=float32()
    , table=float32()
    , price=float32()
    , x=float32()
    , y=float32()
    , z=float32()
)

diamonds_with_headers <- open_dataset('diamonds_with_header.csv', schema=diamond_schema, format='csv')
diamonds_without_headers <- open_dataset('diamonds_without_header.csv', schema=diamond_schema, format='csv')

# this works
diamonds_with_headers %>% collect()
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.230 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.210 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.230 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.20  4.23  2.63
5 0.310 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.240 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

# this gives a tibble with all NA values, though of the correct types
diamonds_without_headers %>% collect()
# A tibble: 5 x 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <chr> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
2    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
3    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
4    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
5    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
Now we use a simple dataset where two of the columns in the first row have the same value, 0.0.
randomDF <- tibble::tibble(
    A=c(0.0, 2.3, 5.1)
    , B=c('a', 'b', 'a')
    , C=c(0.0, 3.1, 4.5)
)

readr::write_csv(randomDF, file='random_with_header.csv', col_names=TRUE)
readr::write_csv(randomDF, file='random_without_header.csv', col_names=FALSE)

random_schema <- schema(
    A=float32()
    , B=string()
    , C=float32()
)

random_with_headers <- open_dataset('random_with_header.csv', schema=random_schema, format='csv')
random_without_headers <- open_dataset('random_without_header.csv', schema=random_schema, format='csv')

# gives a tibble with the proper values
random_with_headers %>% collect()
# A tibble: 3 x 3
      A B         C
  <dbl> <chr> <dbl>
1  0    a      0
2  2.30 b      3.10
3  5.10 a      4.5

# results in an error
random_without_headers %>% collect()
Error: Invalid: Could not open CSV input source 'without_header.csv': Invalid: CSV file contained multiple columns named 0
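The error message and the 5-row result above are both consistent with the first data row being consumed as a header row, although the report does not confirm that this is the cause. A minimal base R sketch of that reading follows. It assumes the headerless file contains the lines written below (with 0.0 serialized as 0, as the error message suggests); `csv_path`, `first_row`, and `has_duplicates` are illustrative names only.

```r
# Recreate the assumed contents of the headerless file using base R only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("0,a,0", "2.3,b,3.1", "5.1,a,4.5"), csv_path)

# If the first line were parsed as a header, these fields would become
# the column names.
first_row <- strsplit(readLines(csv_path, n = 1), ",", fixed = TRUE)[[1]]
print(first_row)

# TRUE means a header built from this line would contain duplicate names,
# which matches the "multiple columns named 0" error.
has_duplicates <- anyDuplicated(first_row) > 0
print(has_duplicates)
```

Under the same reading, the `diamonds` case has no duplicates in its first row, so the file opens, but none of the resulting column names match the schema, which would explain the all-`NA` columns.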
Interestingly, read_csv_arrow() has the opposite problem. Providing the schema works for CSVs without headers, but not for CSVs with headers, despite the help file saying that providing a schema satisfies both col_names and col_types.
diamonds_read_with_header <- read_csv_arrow('diamonds_with_header.csv', schema=diamond_schema)
Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'carat'

diamonds_read_without_header <- read_csv_arrow('diamonds_without_header.csv', schema=diamond_schema)
# reads normally

random_read_with_header <- read_csv_arrow('random_with_header.csv', schema=random_schema)
Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'A'

random_read_without_header <- read_csv_arrow('random_without_header.csv', schema=random_schema)
# reads normally
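One possible workaround for the read_csv_arrow() case is to skip the header line explicitly, so that the schema's names and types are applied to data rows only. The sketch below assumes that the `skip` argument of read_csv_arrow() behaves this way in the installed arrow version; this was not verified against 5.0.0, and the temporary file is a stand-in for 'random_with_header.csv'.

```r
library(arrow)

# Stand-in for 'random_with_header.csv' (contents assumed for illustration).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("A,B,C", "0,a,0", "2.3,b,3.1", "5.1,a,4.5"), csv_path)

random_schema <- schema(A = float32(), B = string(), C = float32())

# skip = 1 drops the header line, so 'A' is never parsed as a float.
random_read <- read_csv_arrow(csv_path, schema = random_schema, skip = 1)
print(random_read)
```

This only addresses read_csv_arrow(); it does not change how open_dataset() treats headerless files.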