Apache Arrow / ARROW-9606

[C++][Dataset] in expressions don't work with >1 partition levels

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.1, 2.0.0
    • Component/s: C++, R

    Description

      Filtering a dataset with nested partitions using %in% returns no rows, for both Hive and non-Hive partitioning. The == operator and other comparison operators do work, and the problem also goes away when only one partition level is declared in the schema.

      This is not caused by the dplyr wrappers: the lower-level functions have the same problem.

      library(arrow)
      #> 
      #> Attaching package: 'arrow'
      #> The following object is masked from 'package:utils':
      #> 
      #>     timestamp
      library(dplyr)
      #> 
      #> Attaching package: 'dplyr'
      #> The following objects are masked from 'package:stats':
      #> 
      #>     filter, lag
      #> The following objects are masked from 'package:base':
      #> 
      #>     intersect, setdiff, setequal, union
      
      ## Write files
      pqdir <- file.path(tempdir(), paste(sample(letters, 6), collapse = ""))
      
      for (foo in 0:1) {
        for (faa in 0:1) {
          fdir <- file.path(pqdir, letters[foo + 1], letters[faa + 1])
          dir.create(fdir, recursive = TRUE)
          rng <- (foo * 5 + faa + 1):(foo * 5 + faa + 5)
          write_parquet(data.frame(col = letters[rng]),
                        file.path(fdir, "file.parquet"))
        }
      }
      
      ## What doesn't work: using %in% with both partitions defined
      ds <- open_dataset(pqdir,
                         partitioning = schema(foo = string(), faa = string()))
      
      collect(filter(ds, foo %in% "a"))
      #> # A tibble: 0 x 3
      #> # ... with 3 variables: col <chr>, foo <chr>, faa <chr>
      
      ## == does work
      collect(filter(ds, foo == "a"))
      #> # A tibble: 10 x 3
      #>    col   foo   faa  
      #>    <chr> <chr> <chr>
      #>  1 a     a     a    
      #>  2 b     a     a    
      #>  3 c     a     a    
      #>  4 d     a     a    
      #>  5 e     a     a    
      #>  6 b     a     b    
      #>  7 c     a     b    
      #>  8 d     a     b    
      #>  9 e     a     b    
      #> 10 f     a     b
      
      ## Declaring only one partition does work
      ds <- open_dataset(pqdir, partitioning = schema(foo = string()))
      collect(filter(ds, foo %in% "a"))
      #> # A tibble: 10 x 2
      #>    col   foo  
      #>    <chr> <chr>
      #>  1 a     a    
      #>  2 b     a    
      #>  3 c     a    
      #>  4 d     a    
      #>  5 e     a    
      #>  6 b     a    
      #>  7 c     a    
      #>  8 d     a    
      #>  9 e     a    
      #> 10 f     a
      
      ## The lower-level API has the same problem
      ds <- open_dataset(pqdir,
                         partitioning = schema(foo = string(), faa = string()))
      
      flt <- Expression$in_(Expression$field_ref("foo"), Array$create("a"))
      
      sc <- Scanner$create(ds, filter = flt)
      sc$ToTable()
      #> Table
      #> 0 rows x 3 columns
      #> $col <string>
      #> $foo <string>
      #> $faa <string>
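
Since == works where %in% does not, one possible stopgap on the affected version is to expand the membership test into a chain of equality tests joined with |. The sketch below only illustrates that equivalence in base R on a plain character vector. The helper name in_via_equality is hypothetical (it is not part of arrow), and whether the expanded expression avoids the problem on a nested-partition dataset is an assumption that has not been verified here.

```r
# Hypothetical helper (not part of arrow): TRUE where x equals any of `values`,
# built only from `==` and `|`. Assumes `values` is non-empty.
in_via_equality <- function(x, values) {
  Reduce(`|`, lapply(values, function(v) x == v))
}

foo <- c("a", "b", "a", "c")
wanted <- c("a", "c")

mask <- in_via_equality(foo, wanted)

# Agrees with the built-in membership test on this NA-free input
stopifnot(identical(mask, foo %in% wanted))
```

Applied to the dataset above, the corresponding filter would presumably be written as filter(ds, foo == "a" | foo == "b"). Note that == propagates NA whereas %in% returns FALSE for missing values, so the two forms differ when the column contains NA.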
      

            People

              Assignee: Ben Kietzman (bkietz)
              Reporter: Maarten Demeyer (mpjdem)
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m