Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-9345

[C++][Dataset] Expression with dictionary type should work with operand of value type

    XMLWordPrintableJSON

Details

    Description

      Related to ARROW-8647, see comment at https://github.com/apache/arrow/pull/7536#issuecomment-653124260

      When using dictionary type for the partition fields, this now creates partition expressions that also use a dictionary type. Which means that doing something like dataset.to_table(filter=ds.field("part") == "A") to filter on the partition field with a plain string expression doesn't work, limiting the usability of this option (and even with the new Python scalar stuff, it would not be easy to construct the correct expression):

      In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)  
      
      In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", partitioning=part)
      
      In [11]: fragment = list(dataset.get_fragments())[0]   
      
      In [12]: fragment.partition_expression  
      Out[12]: 
      <pyarrow.dataset.Expression (part == [
        "A",
        "B"
      ][0]:dictionary<values=string, indices=int32, ordered=0>)>
      
      In [13]: dataset.to_table(filter=ds.field("part") == "A") 
      ...
      ArrowNotImplementedError: cast from string
      

      It might be an option to keep the partition_expression use the dictionary value type instead of dictionary type? Or alternatively, as fsaintjacques proposed, ensure that any comparison involving the dict type should also work with the "effective" logical type (the value type of the dict).

      Attachments

        Issue Links

          Activity

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 20m
                  1h 20m