[ARROW-4088] [Python] Table.from_batches() fails when passed a schema with metadata - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.11.0
Fix Version/s: 0.12.0
Component/s: C++, Python
Labels:
- pull-request-available
- regression

External issue URL:
https://github.com/apache/arrow/issues/20681

Description

This seems to be a regression. In 0.10 I used the following function to set column-level and table-level metadata on an existing Table:

import json

import pyarrow as pa


def set_metadata(tbl, col_meta=None, tbl_meta=None):
    # Use None defaults instead of shared mutable dict defaults
    col_meta = col_meta or {}
    tbl_meta = tbl_meta or {}

    if col_meta or tbl_meta:
        # Create updated column fields with new metadata
        fields = []
        for col in tbl.itercolumns():
            if col.name in col_meta:
                # Get updated column metadata (may be None if unset)
                metadata = dict(col.field.metadata or {})
                for k, v in col_meta[col.name].items():
                    metadata[k] = json.dumps(v).encode('utf-8')
                # Update field with updated metadata
                fields.append(col.field.add_metadata(metadata))
            else:
                fields.append(col.field)

        # Get updated table metadata (may be None if unset)
        tbl_metadata = dict(tbl.schema.metadata or {})
        for k, v in tbl_meta.items():
            tbl_metadata[k] = json.dumps(v).encode('utf-8')

        # Create new schema with updated metadata
        schema = pa.schema(fields, metadata=tbl_metadata)

        # With updated schema build new table (shouldn't copy data?)
        tbl = pa.Table.from_batches(tbl.to_batches(), schema=schema)

    return tbl

However, in 0.11 this fails with the following error:

ArrowInvalid: Schema at index 0 was different: 
x: int64
vs
x: int64
...

It works, however, if I replace from_batches() with from_arrays(), like this:

tbl = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)

It seems that from_batches() compares each existing batch's schema with the new schema and fails when it encounters any difference, even one in metadata only.

A short test that reproduces the problem:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': [0,1,2]})
tbl = pa.Table.from_pandas(df, preserve_index=False)

field = tbl.schema[0].add_metadata({'test': 'data'})
schema = pa.schema([field])
# tbl2 = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
tbl2 = pa.Table.from_batches(tbl.to_batches(), schema)
tbl2.schema[0].metadata

Issue Links

links to

GitHub Pull Request #3256

People

Assignee: Krisztian Szucs

Reporter: Thomas Buhrmann

Votes: 0

Watchers: 5

Dates

Created: 20/Dec/18 15:20

Updated: 11/Jan/23 07:31

Resolved: 27/Dec/18 16:27

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

50m