[ARROW-14063] [R] open_dataset() does not work on CSVs without header rows - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 5.0.0
Fix Version/s: 6.0.0
Component/s: R
Labels:
- bug
- pull-request-available
Environment:

sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_5.0.0.2 dplyr_1.0.5 magrittr_2.0.1 targets_0.6.0

loaded via a namespace (and not attached):
[1] httr_1.4.2 rnaturalearth_0.1.0 sass_0.4.0 tidyr_1.1.3
[5] jsonlite_1.7.2 bit64_4.0.5 bslib_0.2.5.1 assertthat_0.2.1
[9] askpass_1.1 sp_1.4-5 blob_1.2.1 renv_0.13.2
[13] yaml_2.2.1 globals_0.14.0 pillar_1.5.1 RSQLite_2.2.7
[17] lattice_0.20-41 glue_1.4.2 digest_0.6.27 htmltools_0.5.1.1
[21] pkgconfig_2.0.3 RPostgres_1.3.2 listenv_0.8.0 config_0.3.1
[25] purrr_0.3.4 processx_3.5.1 openssl_1.4.3 tibble_3.1.0
[29] proxy_0.4-25 aws.s3_0.3.21 colourvalues_0.3.7 generics_0.1.0
[33] ellipsis_0.3.1 cachem_1.0.5 withr_2.4.1 furrr_0.2.3
[37] cli_2.4.0 crayon_1.4.1 memoise_2.0.0 evaluate_0.14
[41] ps_1.6.0 fs_1.5.0 future_1.21.0 fansi_0.4.2
[45] parallelly_1.25.0 xml2_1.3.2 class_7.3-18 rsconnect_0.8.18
[49] tools_4.0.5 data.table_1.14.0 hms_1.0.0 lifecycle_1.0.0
[53] stringr_1.4.0 callr_3.6.0 jquerylib_0.1.4 compiler_4.0.5
[57] e1071_1.7-6 rlang_0.4.10 classInt_0.4-3 units_0.7-1
[61] grid_4.0.5 rstudioapi_0.13 visNetwork_2.0.9 htmlwidgets_1.5.3
[65] aws.signature_0.6.0 crosstalk_1.1.1 igraph_1.2.6 base64enc_0.1-3
[69] rmarkdown_2.7 codetools_0.2-18 DBI_1.1.1 curl_4.3
[73] R6_2.5.0 lubridate_1.7.10 knitr_1.31 fastmap_1.1.0
[77] rgeos_0.5-5 bit_4.0.4 utf8_1.2.1 tarchetypes_0.2.1
[81] readr_1.4.0 KernSmooth_2.23-18 stringi_1.5.3 parallel_4.0.5
[85] Rcpp_1.0.6 vctrs_0.3.7 sf_0.9-8 leaflet_2.0.4.1
[89] dbplyr_2.1.1 tidyselect_1.1.0 xfun_0.22

Flags:

Important
External issue URL:
https://github.com/apache/arrow/issues/29659

Description

Using open_dataset() on a CSV without a header row, followed by collect(), results in either a tibble of `NA`s or an error, depending on whether the first row of data contains duplicate values. This affects reading a single file as well as a directory of files.

Here we use the `diamonds` data, where the first row of data contains no repeated values.

library(arrow)
library(magrittr)

data(diamonds, package='ggplot2')

readr::write_csv(head(diamonds), file='diamonds_with_header.csv', col_names=TRUE)
readr::write_csv(head(diamonds), file='diamonds_without_header.csv', col_names=FALSE)

diamond_schema <- schema(
    carat=float32()
    , cut=string()
    , color=string()
    , clarity=string()
    , depth=float32()
    , table=float32()
    , price=float32()
    , x=float32()
    , y=float32()
    , z=float32()
)

diamonds_with_headers <- open_dataset('diamonds_with_header.csv', schema=diamond_schema, format='csv')
diamonds_without_headers <- open_dataset('diamonds_without_header.csv', schema=diamond_schema, format='csv')

# this works
diamonds_with_headers %>% collect()
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.230 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.210 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.230 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.20  4.23  2.63
5 0.310 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.240 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

# this gives a tibble with all NA values, though of the correct types
diamonds_without_headers %>% collect()
# A tibble: 5 x 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <chr> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
2    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
3    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
4    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
5    NA NA    NA    NA         NA    NA    NA    NA    NA    NA

Now we use a simple dataset where two of the columns in the first row have the same value, 0.0.

randomDF <- tibble::tibble(
    A=c(0.0, 2.3, 5.1)
    , B=c('a', 'b', 'a')
    , C=c(0.0, 3.1, 4.5)
)

readr::write_csv(randomDF, file='random_with_header.csv', col_names=TRUE)
readr::write_csv(randomDF, file='random_without_header.csv', col_names=FALSE)

random_schema <- schema(
    A=float32()
    , B=string()
    , C=float32()
)

random_with_headers <- open_dataset('random_with_header.csv', schema=random_schema, format='csv')
random_without_headers <- open_dataset('random_without_header.csv', schema=random_schema, format='csv')

# gives a tibble with the proper values
random_with_headers %>% collect()
# A tibble: 3 x 3
      A B         C
  <dbl> <chr> <dbl>
1  0    a      0   
2  2.30 b      3.10
3  5.10 a      4.5 

# results in an error
random_without_headers %>% collect()
Error: Invalid: Could not open CSV input source 'without_header.csv': Invalid: CSV file contained multiple columns named 0
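
Both failure modes above are consistent with the first data row being consumed as the header: one row goes missing from the `NA` tibble, and the error complains about multiple columns named 0. The following is a minimal sketch, in base R only, for checking ahead of time which failure mode a header-less file would hit. `first_row_has_duplicates()` is a hypothetical helper, not part of arrow, and its naive split on the separator assumes no quoted fields that contain commas.

```r
# Hypothetical helper (not part of arrow): reports whether the first line of a
# CSV contains repeated fields. If that line were treated as a header, repeated
# fields would become duplicate column names.
first_row_has_duplicates <- function(path, sep = ",") {
  first_line <- readLines(path, n = 1)
  # Naive split: assumes no quoted fields containing the separator
  fields <- strsplit(first_line, sep, fixed = TRUE)[[1]]
  anyDuplicated(fields) > 0
}

# Same values as randomDF above, written without a header row
random_df <- data.frame(
  A = c(0.0, 2.3, 5.1),
  B = c("a", "b", "a"),
  C = c(0.0, 3.1, 4.5)
)
csv_path <- tempfile(fileext = ".csv")
write.table(random_df, csv_path, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# The first line has two fields with the value 0, so the helper flags it
first_row_has_duplicates(csv_path)
```

A file flagged this way would be expected to hit the "multiple columns named" error, while an unflagged one would be expected to produce the all-`NA` tibble.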

Interestingly, read_csv_arrow() has the opposite problem: providing the schema works for CSVs without headers but not for CSVs with them, even though the help file says that providing a schema satisfies both col_names and col_types.

diamonds_read_with_header <- read_csv_arrow('diamonds_with_header.csv', schema=diamond_schema)
Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'carat'

diamonds_read_without_header <- read_csv_arrow('diamonds_without_header.csv', schema=diamond_schema)
# reads normally


random_read_with_header <- read_csv_arrow('random_with_header.csv', schema=random_schema)
Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'A'

random_read_without_header <- read_csv_arrow('random_without_header.csv', schema=random_schema)
# reads normally
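
A possible workaround for the read_csv_arrow() case with a header row is to skip that row explicitly, so that only data rows are parsed against the schema. This is a sketch under the assumption that the `skip` argument of read_csv_arrow() is honoured when `schema` is supplied in the installed arrow version. The temporary file and the `header_path` name are illustrative only.

```r
library(arrow)

# Small CSV with a header row, mirroring random_with_header.csv above
header_path <- tempfile(fileext = ".csv")
writeLines(c("A,B,C", "0,a,0", "2.3,b,3.1", "5.1,a,4.5"), header_path)

random_schema <- schema(
  A = float32(),
  B = string(),
  C = float32()
)

# skip = 1 drops the header line; column names and types come from the schema
random_read_with_header <- read_csv_arrow(
  header_path,
  schema = random_schema,
  skip = 1
)
```

Without `skip = 1`, the header line is parsed as data, which is what produces the "invalid value 'A'" conversion error shown above.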

Issue Links

links to

GitHub Pull Request #11346

People

Assignee: Nicola Crane

Reporter: Jared Lander

Votes: 0

Watchers: 5

Dates

Created: 21/Sep/21 21:54

Updated: 11/Jan/23 08:37

Resolved: 13/Oct/21 22:04

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

2.5h