Apache Arrow / ARROW-15312

[R][C++] filtering a Parquet dataset with is.na() misses some rows


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.0.1
    • Fix Version/s: 8.0.0
    • Component/s: R
    • Environment: R 4.1.2 on Windows
      arrow 6.0.1
      dplyr 1.0.7

    Description

      Hi!

      I just found an issue when querying an Arrow dataset with dplyr and filtering on is.na(...).

      It seems to be linked to columns that contain only one distinct non-NA value plus some NAs.

      Can you also reproduce the following?

       

        library(arrow)
        library(dplyr)
        
        ds_path = "test-arrow-na"
        df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))
        
        df %>% arrow::write_dataset(ds_path)
        
        # OK: Collect then filter: returns row 3, as expected
        arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))
      
        # BUG: Filter then collect (on y) returns a tibble with no rows; row 3 is expected
        arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()
        
        # OK: Filter then collect (on z) returns row 3, as expected
        arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect() 
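
      To make the suspected pattern concrete, here is a minimal base R sketch that flags columns holding exactly one distinct non-NA value plus at least one NA. It only restates the hypothesis above on the same example data; it does not inspect anything inside arrow, and the names fits_pattern and suspect are purely illustrative.

```r
# Same example data as in the reproduction, built with base R only.
df <- data.frame(
  x = 1:3,
  y = c(0L, 0L, NA_integer_),
  z = c(0L, 1L, NA_integer_)
)

# Hypothetical helper: TRUE when a column has at least one NA
# and exactly one distinct non-NA value.
fits_pattern <- function(col) {
  anyNA(col) && length(unique(col[!is.na(col)])) == 1L
}

# Apply the check to every column and list the ones that match.
suspect <- vapply(df, fits_pattern, logical(1))
names(df)[suspect]
# [1] "y"
```

      Only y matches, which is also the only column where filtering before collect() drops the NA row in the reproduction.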

       

      Thanks

      Pierre

            People

              Assignee: Jonathan Keane (jonkeane)
              Reporter: Pierre Gramme
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h