Apache Arrow / ARROW-16348
[Python] ParquetWriter use_compliant_nested_type=True does not preserve ExtensionArray when reading back

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Environment: pyarrow 7.0.0 installed via pip.

    Description

      I've been happily making ExtensionArrays, but recently noticed that they aren't preserved by round-trips through Parquet files when use_compliant_nested_type=True.

      Consider this writer.py:

      import json
      import numpy as np
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Extension type that carries a JSON-serializable annotation alongside any storage type.
      class AnnotatedType(pa.ExtensionType):
          def __init__(self, storage_type, annotation):
              self.annotation = annotation
              super().__init__(storage_type, "my:app")

          def __arrow_ext_serialize__(self):
              return json.dumps(self.annotation).encode()

          @classmethod
          def __arrow_ext_deserialize__(cls, storage_type, serialized):
              annotation = json.loads(serialized.decode())
              return cls(storage_type, annotation)

          @property
          def num_buffers(self):
              return self.storage_type.num_buffers

          @property
          def num_fields(self):
              return self.storage_type.num_fields

      # Register the extension so instances can be reconstructed by name.
      pa.register_extension_type(AnnotatedType(pa.null(), None))

      # Wrap list<double> storage in the extension type; offsets [0, 3, 3, 5]
      # produce the three lists [1.1, 2.2, 3.3], [], [4.4, 5.5].
      array = pa.Array.from_buffers(
          AnnotatedType(pa.list_(pa.float64()), {"cool": "beans"}),
          3,
          [None, pa.py_buffer(np.array([0, 3, 3, 5], np.int32))],
          children=[pa.array([1.1, 2.2, 3.3, 4.4, 5.5])],
      )

      table = pa.table({"": array})
      print(table)
      pq.write_table(table, "tmp.parquet", use_compliant_nested_type=True)

      And this reader.py:

      import json
      import numpy as np
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Same extension type as in writer.py, registered again so that the
      # ARROW:extension:name metadata can be matched when reading.
      class AnnotatedType(pa.ExtensionType):
          def __init__(self, storage_type, annotation):
              self.annotation = annotation
              super().__init__(storage_type, "my:app")

          def __arrow_ext_serialize__(self):
              return json.dumps(self.annotation).encode()

          @classmethod
          def __arrow_ext_deserialize__(cls, storage_type, serialized):
              annotation = json.loads(serialized.decode())
              return cls(storage_type, annotation)

          @property
          def num_buffers(self):
              return self.storage_type.num_buffers

          @property
          def num_fields(self):
              return self.storage_type.num_fields

      pa.register_extension_type(AnnotatedType(pa.null(), None))

      table = pq.read_table("tmp.parquet")
      print(table)

      (The AnnotatedType is the same; I wrote it twice for explicitness.)
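
      For quicker local iteration, the two scripts can also be collapsed into one round trip; this is just a sketch of mine (the roundtrip helper and the temporary directory are not part of the scripts above, and it assumes AnnotatedType and table are defined and registered as in writer.py):

      import os
      import tempfile
      import pyarrow.parquet as pq

      # Assumes `table` is the one-column table built in writer.py, with
      # AnnotatedType defined and registered as above.
      def roundtrip(table, use_compliant_nested_type):
          with tempfile.TemporaryDirectory() as tmpdir:
              path = os.path.join(tmpdir, "tmp.parquet")
              pq.write_table(table, path,
                             use_compliant_nested_type=use_compliant_nested_type)
              return pq.read_table(path)

      print(roundtrip(table, use_compliant_nested_type=False).schema)  # extension type kept
      print(roundtrip(table, use_compliant_nested_type=True).schema)   # plain list<element: double>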

      When the writer.py has use_compliant_nested_type=False, the output is

      % python writer.py 
      pyarrow.Table
      : extension<my:app<AnnotatedType>>
      ----
      : [[[1.1,2.2,3.3],[],[4.4,5.5]]]
      % python reader.py 
      pyarrow.Table
      : extension<my:app<AnnotatedType>>
      ----
      : [[[1.1,2.2,3.3],[],[4.4,5.5]]]

      In other words, the AnnotatedType is preserved. When use_compliant_nested_type=True, however,

      % rm tmp.parquet
      rm: remove regular file 'tmp.parquet'? y
      % python writer.py 
      pyarrow.Table
      : extension<my:app<AnnotatedType>>
      ----
      : [[[1.1,2.2,3.3],[],[4.4,5.5]]]
      % python reader.py 
      pyarrow.Table
      : list<element: double>
        child 0, element: double
      ----
      : [[[1.1,2.2,3.3],[],[4.4,5.5]]]
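
      The degradation is easy to check programmatically as well; a minimal sketch (assuming tmp.parquet was written with use_compliant_nested_type=True and that AnnotatedType is registered as in reader.py):

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Assumes AnnotatedType is registered as in reader.py above.
      table = pq.read_table("tmp.parquet")
      col_type = table.column(0).type

      # Prints True (an AnnotatedType) for a file written with
      # use_compliant_nested_type=False, but False for one written with True,
      # because the column comes back as a plain ListType.
      print(isinstance(col_type, pa.ExtensionType))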

      The issue doesn't seem to be in the writing, but in the reading: regardless of whether use_compliant_nested_type is True or False, I can see the extension metadata in the Parquet → Arrow converted schema. With use_compliant_nested_type=False:

      >>> import pyarrow.parquet as pq
      >>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
      : list<item: double>
        child 0, item: double
        -- field metadata --
        ARROW:extension:metadata: '{"cool": "beans"}'
        ARROW:extension:name: 'my:app'

      versus, with use_compliant_nested_type=True:

      >>> import pyarrow.parquet as pq
      >>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
      : list<element: double>
        child 0, element: double
        -- field metadata --
        ARROW:extension:metadata: '{"cool": "beans"}'
        ARROW:extension:name: 'my:app'

      Note that the first (from the file written with use_compliant_nested_type=False) has "item: double" and the second (written with use_compliant_nested_type=True) has "element: double".
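
      As a stopgap on the reading side, the extension type can be re-attached by hand from that field metadata; this is a workaround sketch of my own (not pyarrow behavior), assuming AnnotatedType is registered as above and the file was written with use_compliant_nested_type=True:

      import json
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Assumes AnnotatedType is defined and registered as in reader.py above.
      plain = pq.read_table("tmp.parquet")  # comes back as plain list<element: double>
      meta = pq.ParquetFile("tmp.parquet").schema.to_arrow_schema().field(0).metadata or {}

      if meta.get(b"ARROW:extension:name") == b"my:app":
          # Rebuild the extension type around the storage type as it was actually
          # read back, then wrap the plain column in it.
          annotation = json.loads(meta[b"ARROW:extension:metadata"].decode())
          ext_type = AnnotatedType(plain.column(0).type, annotation)
          storage = pa.concat_arrays(plain.column(0).chunks)
          wrapped = pa.ExtensionArray.from_storage(ext_type, storage)
          plain = plain.set_column(0, pa.field("", ext_type), wrapped)

      print(plain)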

      (I'm also rather surprised that use_compliant_nested_type=False is an option. Wouldn't you want the Parquet files to always be written with compliant lists? I noticed this when I was having trouble getting the data into BigQuery.)
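
      For reference, the flag only changes how list columns are spelled in the physical Parquet schema, which can be inspected directly; a sketch (the compliant three-level form names the inner field "element" per the Parquet LogicalTypes spec, while pyarrow's legacy form uses "item"):

      import pyarrow.parquet as pq

      # Print the physical Parquet schema rather than the converted Arrow schema.
      # A file written with use_compliant_nested_type=True shows the inner list
      # field named "element"; one written with False shows "item".
      print(pq.ParquetFile("tmp.parquet").schema)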

          People

            Assignee: Unassigned
            Reporter: Jim Pivarski (jpivarski)