  1. Apache Arrow
  2. ARROW-14522

[C++] Validation of ExtensionType with null storage type failing (Can't read empty-but-for-nulls data from Parquet if it has an ExtensionType)

Details

    Description

      Here's a corner case: suppose that I have data of type null that is also nullable (it carries a validity bitmap), so the whole array consists of nothing but nulls. In real life, this might only happen inside a nested data structure, at some level where an untyped data source (e.g. nested Python lists) had no entries, so a type could not be determined. We expect to be able to write this data to Parquet and read it back, and we can, as long as it doesn't have an ExtensionType.

      Here's an example that works, without ExtensionType:

      >>> import json
      >>> import numpy as np
      >>> import pyarrow as pa
      >>> import pyarrow.parquet
      >>> 
      >>> validbits = np.packbits(np.ones(14, dtype=np.uint8), bitorder="little")
      >>> empty_but_for_nulls = pa.Array.from_buffers(
      ...     pa.null(), 14, [pa.py_buffer(validbits)], null_count=14
      ... )
      >>> empty_but_for_nulls
      <pyarrow.lib.NullArray object at 0x7fb1560bbd00>
      14 nulls
      >>> 
      >>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp.parquet")
      >>> pa.parquet.read_table("tmp.parquet")
      pyarrow.Table
      : null
      ----
      : [14 nulls]
      

      And here's a continuation of that example, which doesn't work because the type pa.null() is replaced by AnnotatedType(pa.null(), {"cool": "beans"}):

      >>> class AnnotatedType(pa.ExtensionType):
      ...     def __init__(self, storage_type, annotation):
      ...         self.annotation = annotation
      ...         super().__init__(storage_type, "my:app")
      ...     def __arrow_ext_serialize__(self):
      ...         return json.dumps(self.annotation).encode()
      ...     @classmethod
      ...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
      ...         annotation = json.loads(serialized.decode())
      ...         return cls(storage_type, annotation)
      ... 
      >>> pa.register_extension_type(AnnotatedType(pa.null(), None))
      >>> 
      >>> empty_but_for_nulls = pa.Array.from_buffers(
      ...     AnnotatedType(pa.null(), {"cool": "beans"}),
      ...     14,
      ...     [pa.py_buffer(validbits)],
      ...     null_count=14,
      ... )
      >>> empty_but_for_nulls
      <pyarrow.lib.ExtensionArray object at 0x7fb14b5e1ca0>
      14 nulls
      >>> 
      >>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp2.parquet")
      >>> pa.parquet.read_table("tmp2.parquet")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1941, in read_table
          return dataset.read(columns=columns, use_threads=use_threads,
        File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1776, in read
          table = self._dataset.to_table(
        File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
        File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
        File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
        File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
      pyarrow.lib.ArrowInvalid: Array of type extension<my:app<AnnotatedType>> has 14 nulls but no null bitmap
      

      If "nullable type null" were outside the set of types that should be writable to Parquet, then I would expect it to fail for the non-ExtensionType case as well, or to fail on writing rather than on reading. Neither happens, so I'm quite sure this is a bug.
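
      For readers unfamiliar with the buffer passed to from_buffers in both examples, here is a minimal pure-Python sketch of a validity bitmap. It assumes the convention documented in the Arrow columnar format (least-significant-bit ordering, a set bit means "valid", a cleared bit means "null"). The helper names pack_validity and count_nulls are hypothetical and are not part of pyarrow; this only illustrates the bit layout and is not the validation code that raises the error above.

```python
def pack_validity(valid_flags):
    """Pack booleans into an LSB-first bitmap, one bit per slot (hypothetical helper)."""
    bitmap = bytearray((len(valid_flags) + 7) // 8)  # round up to whole bytes
    for i, is_valid in enumerate(valid_flags):
        if is_valid:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def count_nulls(bitmap, length):
    """Count slots whose validity bit is 0 among the first `length` bits."""
    return sum(1 for i in range(length) if not (bitmap[i // 8] >> (i % 8)) & 1)


# 14 set bits: the same bytes that np.packbits(np.ones(14), bitorder="little") yields.
all_set = pack_validity([True] * 14)
# 14 cleared bits: every slot marked null under the assumed convention.
all_clear = pack_validity([False] * 14)

print(list(all_set), count_nulls(all_set, 14))      # [255, 63] 0
print(list(all_clear), count_nulls(all_clear, 14))  # [0, 0] 14
```

      Under that convention, the all-ones validbits buffer in the examples marks every slot as valid while null_count=14 declares every slot null; the session output shows 14 nulls regardless. Whether that mismatch plays any role in the failure is not established here.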

            People

              jorisvandenbossche Joris Van den Bossche
              jpivarski Jim Pivarski
              Votes: 0
              Watchers: 4

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 40m