Apache Arrow / ARROW-4088

[Python] Table.from_batches() fails when passed a schema with metadata

Details

    Description

      This seems to be a regression. In 0.10 I used the following function to set column-level and table-level metadata on an existing Table:
       

      import json

      import pyarrow as pa


      def set_metadata(tbl, col_meta=None, tbl_meta=None):
          # Use None defaults instead of shared mutable default dicts
          col_meta = col_meta or {}
          tbl_meta = tbl_meta or {}

          # Create updated column fields with new metadata
          if col_meta or tbl_meta:
              fields = []
              for col in tbl.itercolumns():
                  if col.name in col_meta:
                      # Get updated column metadata (copy; metadata is None when unset)
                      metadata = dict(col.field.metadata or {})
                      for k, v in col_meta[col.name].items():
                          metadata[k] = json.dumps(v).encode('utf-8')
                      # Update field with updated metadata
                      fields.append(col.field.add_metadata(metadata))
                  else:
                      fields.append(col.field)
      
              # Get updated table metadata (copy; metadata is None when unset)
              tbl_metadata = dict(tbl.schema.metadata or {})
              for k, v in tbl_meta.items():
                  tbl_metadata[k] = json.dumps(v).encode('utf-8')
      
              # Create new schema with updated metadata
              schema = pa.schema(fields, metadata=tbl_metadata)
      
              # With updated schema build new table (shouldn't copy data?)
              tbl = pa.Table.from_batches(tbl.to_batches(), schema=schema)
      
          return tbl
      

      However, in 0.11 this fails with the following error:

      ArrowInvalid: Schema at index 0 was different: 
      x: int64
      vs
      x: int64
      ...
      

      It works, however, if I replace from_batches() with from_arrays(), like this:

      tbl = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
      

      It seems that from_batches() compares each existing batch's schema with the new schema and fails on any difference, even one confined to metadata. That would also explain why the two schemas printed in the error message look identical.

      A short test would be this:

      import pandas as pd
      import pyarrow as pa
      
      df = pd.DataFrame({'x': [0,1,2]})
      tbl = pa.Table.from_pandas(df, preserve_index=False)
      
      field = tbl.schema[0].add_metadata({'test': 'data'})
      schema = pa.schema([field])
      # tbl2 = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
      tbl2 = pa.Table.from_batches(tbl.to_batches(), schema)
      tbl2.schema[0].metadata
      

            People

              kszucs Krisztian Szucs
              buhrmann Thomas Buhrmann
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m