Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 0.11.0
Description
This seems to be a regression. In 0.10 I used the following function to set column-level and table-level metadata on an existing Table:
```python
import json

import pyarrow as pa


def set_metadata(tbl, col_meta={}, tbl_meta={}):
    # Create updated column fields with new metadata
    if col_meta or tbl_meta:
        fields = []
        for col in tbl.itercolumns():
            if col.name in col_meta:
                # Get updated column metadata
                metadata = col.field.metadata or {}
                for k, v in col_meta[col.name].items():
                    metadata[k] = json.dumps(v).encode('utf-8')
                # Update field with updated metadata
                fields.append(col.field.add_metadata(metadata))
            else:
                fields.append(col.field)

        # Get updated table metadata (may be None on a fresh table)
        tbl_metadata = tbl.schema.metadata or {}
        for k, v in tbl_meta.items():
            tbl_metadata[k] = json.dumps(v).encode('utf-8')

        # Create new schema with updated metadata
        schema = pa.schema(fields, metadata=tbl_metadata)

        # With updated schema build new table (shouldn't copy data?)
        tbl = pa.Table.from_batches(tbl.to_batches(), schema=schema)

    return tbl
```
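Since the function above serializes every metadata value with `json.dumps` and encodes it to UTF-8 bytes, reading the metadata back needs the inverse step. A minimal sketch of a decode helper (the name `decode_metadata` and the plain-dict input are assumptions for illustration, not part of the original report):

```python
import json


def decode_metadata(raw_metadata):
    # Invert the encoding used in set_metadata(): each value was produced
    # by json.dumps(v).encode('utf-8'), so decode the bytes back to text
    # and parse the JSON into a Python object. Keys coming out of Arrow
    # metadata are bytes, so decode those as well.
    decoded = {}
    for k, v in (raw_metadata or {}).items():
        key = k.decode('utf-8') if isinstance(k, bytes) else k
        decoded[key] = json.loads(v.decode('utf-8'))
    return decoded
```

For example, `decode_metadata({b'unit': b'"meters"'})` recovers `{'unit': 'meters'}`.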
However, in 0.11 this fails with error:
ArrowInvalid: Schema at index 0 was different: x: int64 vs x: int64 ...
However, it works if I replace from_batches() with from_arrays(), like this:
tbl = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
It appears that from_batches() compares each batch's schema against the supplied schema and fails on any difference, even when the schemas differ only in metadata.
A short test would be this:
```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': [0, 1, 2]})
tbl = pa.Table.from_pandas(df, preserve_index=False)

field = tbl.schema[0].add_metadata({'test': 'data'})
schema = pa.schema([field])
# tbl2 = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
tbl2 = pa.Table.from_batches(tbl.to_batches(), schema)
tbl2.schema[0].metadata
```