Apache Arrow / ARROW-12420

[C++/Dataset] Reading null columns as dictionary no longer possible

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.0
    • Fix Version/s: 4.0.0
    • Component/s: C++

    Description

      Reading a dataset with a dictionary column, where some of the files contain no data for that column (so the column is typed as null in those files), broke with https://github.com/apache/arrow/pull/9532. This worked in the 3.0 release, so I consider it a regression.

      This can be reproduced using the following Python snippet:

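      import pyarrow as pa
      import pyarrow.parquet as pq
      import pyarrow.dataset as ds

      # Write a file whose only column is all-null, so it is stored with the null type.
      table = pa.table({"a": [None, None]})
      pq.write_table(table, "test.parquet")

      # Ask the dataset to expose that column as a dictionary-encoded string.
      schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))])
      fsds = ds.FileSystemDataset.from_paths(
          paths=["test.parquet"],
          schema=schema,
          format=ds.ParquetFileFormat(),
          filesystem=pa.fs.LocalFileSystem(),
      )
      # Raises ArrowNotImplementedError on master (null -> dictionary cast).
      fsds.to_table()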
      

      The exception on master is currently:

      ---------------------------------------------------------------------------
      ArrowNotImplementedError                  Traceback (most recent call last)
      <ipython-input-14-5f0bc602f16b> in <module>
            6     filesystem=pa.fs.LocalFileSystem(),
            7 )
      ----> 8 fsds.to_table()
      
      ~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
          456         table : Table instance
          457         """
      --> 458         return self._scanner(**kwargs).to_table()
          459 
          460     def head(self, int num_rows, **kwargs):
      
      ~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
         2887             result = self.scanner.ToTable()
         2888 
      -> 2889         return pyarrow_wrap_table(GetResultValue(result))
         2890 
         2891     def take(self, object indices):
      
      ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
          139 cdef api int pyarrow_internal_check_status(const CStatus& status) \
          140         nogil except -1:
      --> 141     return check_status(status)
          142 
          143 
      
      ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
          116             raise ArrowKeyError(message)
          117         elif status.IsNotImplemented():
      --> 118             raise ArrowNotImplementedError(message)
          119         elif status.IsTypeError():
          120             raise ArrowTypeError(message)
      
      ArrowNotImplementedError: Unsupported cast from null to dictionary<values=string, indices=int32, ordered=0> (no available cast function for target type)
      

            People

              Assignee: Krisztian Szucs (kszucs)
              Reporter: Uwe Korn (uwe)
              Votes: 0
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 20m