Apache Arrow / ARROW-14063

[R] open_dataset() does not work on CSVs without header rows

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0.0
    • Fix Version/s: 6.0.0
    • Component/s: R
    • Labels: Important

    Description

      Using open_dataset() on a CSV file without a header row, followed by collect(), returns either a tibble of `NA`s or an error, depending on whether the first row of data contains duplicate values. This affects reading a single file as well as a directory of files.

      Here we use the `diamonds` data, where the first row of data has no repeated values.

      library(arrow)
      library(magrittr)
      
      data(diamonds, package='ggplot2')
      
      readr::write_csv(head(diamonds), file='diamonds_with_header.csv', col_names=TRUE)
      readr::write_csv(head(diamonds), file='diamonds_without_header.csv', col_names=FALSE)
      
      diamond_schema <- schema(
          carat=float32()
          , cut=string()
          , color=string()
          , clarity=string()
          , depth=float32()
          , table=float32()
          , price=float32()
          , x=float32()
          , y=float32()
          , z=float32()
      )
      
      diamonds_with_headers <- open_dataset('diamonds_with_header.csv', schema=diamond_schema, format='csv')
      diamonds_without_headers <- open_dataset('diamonds_without_header.csv', schema=diamond_schema, format='csv')
      
      # this works
      diamonds_with_headers %>% collect()
      # A tibble: 6 x 10
        carat cut       color clarity depth table price     x     y     z
        <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
      1 0.230 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
      2 0.210 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
      3 0.230 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
      4 0.290 Premium   I     VS2      62.4    58   334  4.20  4.23  2.63
      5 0.310 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
      6 0.240 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
      
      # this gives a tibble with all NA values, though of the correct types
      diamonds_without_headers %>% collect()
      # A tibble: 5 x 10
        carat cut   color clarity depth table price     x     y     z
        <dbl> <chr> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
      1    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
      2    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
      3    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
      4    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
      5    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
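
      The result above has five rows rather than six, which suggests that the first data row was consumed as a header and that none of the resulting column names matched the schema. The sketch below illustrates that header behaviour with read_csv_arrow() rather than open_dataset(), on a small hypothetical file written just for the example; it is a diagnostic aid, not a statement about how open_dataset() is implemented.

```r
library(arrow)

# Hypothetical three-row CSV with no header row (values borrowed from diamonds)
path <- tempfile(fileext = ".csv")
writeLines(c("0.23,Ideal,E", "0.21,Premium,E", "0.29,Good,I"), path)

# Default col_names = TRUE: the first data row is used as the column names
as_header <- read_csv_arrow(path)
names(as_header)  # names come from the first data row, not from any schema
nrow(as_header)   # one fewer than the number of lines in the file

# Supplying names explicitly keeps every line as data
with_names <- read_csv_arrow(path, col_names = c("carat", "cut", "color"))
nrow(with_names)
```

      If the names derived from the first row do not appear in the schema, every schema column would have nothing to read from, which is consistent with the all-`NA` tibble.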
      

      Next we use a small dataset in which two columns of the first row share the same value, 0.0.

      randomDF <- tibble::tibble(
          A=c(0.0, 2.3, 5.1)
          , B=c('a', 'b', 'a')
          , C=c(0.0, 3.1, 4.5)
      )
      
      readr::write_csv(randomDF, file='random_with_header.csv', col_names=TRUE)
      readr::write_csv(randomDF, file='random_without_header.csv', col_names=FALSE)
      
      random_schema <- schema(
          A=float32()
          , B=string()
          , C=float32()
      )
      
      random_with_headers <- open_dataset('random_with_header.csv', schema=random_schema, format='csv')
      random_without_headers <- open_dataset('random_without_header.csv', schema=random_schema, format='csv')
      
      # gives a tibble with the proper values
      random_with_headers %>% collect()
      # A tibble: 3 x 3
            A B         C
        <dbl> <chr> <dbl>
      1  0    a      0   
      2  2.30 b      3.10
      3  5.10 a      4.5 
      
      # results in an error
      random_without_headers %>% collect()
      Error: Invalid: Could not open CSV input source 'without_header.csv': Invalid: CSV file contained multiple columns named 0
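
      The error mentions multiple columns named 0, which matches the first data row (0, a, 0) being read as a header. Which of the two failure modes a file triggers can be predicted by checking its first line for repeated fields. The helper below is hypothetical and uses base R only; it splits naively on commas, so it assumes no quoted fields that contain commas.

```r
# Hypothetical helper: TRUE if the first line of a CSV has repeated fields.
# Naive comma split; assumes no quoted fields containing commas.
first_row_has_duplicates <- function(path) {
  first_line <- readLines(path, n = 1)
  fields <- strsplit(first_line, ",", fixed = TRUE)[[1]]
  anyDuplicated(fields) > 0
}

# First row repeats "0", as in the random example
dup_path <- tempfile(fileext = ".csv")
writeLines(c("0,a,0", "2.3,b,3.1", "5.1,a,4.5"), dup_path)

# First row has distinct values, as in the diamonds example
uniq_path <- tempfile(fileext = ".csv")
writeLines(c("0.23,Ideal,E", "0.21,Premium,E"), uniq_path)

first_row_has_duplicates(dup_path)   # TRUE
first_row_has_duplicates(uniq_path)  # FALSE
```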
      

      Interestingly, read_csv_arrow() has the opposite problem: providing the schema works for CSVs without a header row but fails for CSVs with one, even though the help file says that providing a schema satisfies both col_names and col_types.

      diamonds_read_with_header <- read_csv_arrow('diamonds_with_header.csv', schema=diamond_schema)
      Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'carat'
      
      diamonds_read_without_header <- read_csv_arrow('diamonds_without_header.csv', schema=diamond_schema)
      # reads normally
      
      
      random_read_with_header <- read_csv_arrow('random_with_header.csv', schema=random_schema)
      Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'A'
      
      random_read_without_header <- read_csv_arrow('random_without_header.csv', schema=random_schema)
      # reads normally
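
      The conversion errors above are what one would expect if the schema supplies the column names and the header line is then parsed as data. A possible workaround, sketched here on a hypothetical temporary file, is to pass skip = 1 together with the schema so that the header line is dropped before parsing. This assumes the installed arrow version accepts skip alongside schema in read_csv_arrow().

```r
library(arrow)

# Hypothetical CSV with a header row, mirroring the random example
path <- tempfile(fileext = ".csv")
writeLines(c("A,B,C", "0,a,0", "2.3,b,3.1", "5.1,a,4.5"), path)

random_schema <- schema(A = float32(), B = string(), C = float32())

# skip = 1 drops the header line, so 'A' is never parsed as a float
with_header <- read_csv_arrow(path, schema = random_schema, skip = 1)
with_header
```

      Because the columns are float32, values such as 2.3 come back to R as doubles that are only approximately equal to the literals in the file.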

            People

              Assignee: Nicola Crane (thisisnic)
              Reporter: Jared Lander (jaredlander)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2.5h