Apache Arrow / ARROW-8216

[R][C++][Dataset] Filtering returns all-missing rows where the filtering column is missing

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.16.0
    • Fix Version/s: 0.17.0
    • Component/s: R
    • Environment: R 3.6.3, Windows 10

    Description

       

      I have just noticed some slightly odd behaviour with the filter method for Dataset. 

       

      library(arrow)
      library(dplyr)
      packageVersion("arrow")
      #> [1] '0.16.0.20200323'
      ## Make sample parquet
      starwars$hair_color[starwars$hair_color == "brown"] <- ""
      dir <- tempdir()
      fpath <- file.path(dir, "data.parquet")
      write_parquet(starwars, fpath)
      ## df in memory
      df_mem <- starwars %>%
       filter(hair_color == "")
      ## reading from the parquet
      df_parquet <- read_parquet(fpath) %>%
       filter(hair_color == "")
      ## using open_dataset
      df_dataset <- open_dataset(dir) %>%
       filter(hair_color == "") %>%
       collect()
      identical(df_mem, df_parquet)
      #> [1] TRUE
      identical(df_mem, df_dataset)
      #> [1] FALSE
      

      I'm pretty sure all these should return the same data.frame. Am I missing something?
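
For reference, here is a minimal sketch of the semantics the in-memory `filter()` call relies on. It uses a hypothetical three-row `toy` data frame (not the `starwars` data from the report) and only dplyr, so it says nothing about how the Dataset filter is implemented. It shows that comparing `NA` with `""` yields `NA`, that `dplyr::filter()` drops such rows, and that base R logical subsetting instead produces an all-`NA` row, which resembles the symptom named in the issue title.

```r
library(dplyr)

# Hypothetical toy data: one empty string, one missing value, one ordinary value.
toy <- data.frame(
  name = c("a", "b", "c"),
  hair_color = c("", NA, "blond"),
  stringsAsFactors = FALSE
)

# Comparing NA with "" gives NA, not FALSE.
cmp <- toy$hair_color == ""

# dplyr::filter() keeps only rows where the condition is TRUE,
# so the row with a missing hair_color is dropped.
kept <- toy %>% filter(hair_color == "")

# Base R logical subsetting keeps a placeholder row for each NA in the index;
# every column of that row is NA.
base_subset <- toy[cmp, ]

print(cmp)
print(kept)
print(base_subset)
```

If the same pattern applies to the report above, counting rows of `df_dataset` where `name` is `NA` would be one way to confirm that the extra rows are all-missing; that check is an assumption and was not run against the reported version.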

       

            People

              Assignee: Ben Kietzman (bkietz)
              Reporter: Sam Albers (boshek)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3.5h