[ARROW-3766] [Python] pa.Table.from_pandas doesn't use schema ordering - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.12.0
Component/s: Python
Labels:
- parquet
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/15868

Description

PyArrow is sensitive to the order of the columns when loading partitioned files.
With pa.Table.from_pandas(dataframe, schema=my_schema) we can apply a schema to a DataFrame. I noticed that the returned pa.Table object uses the column ordering of the pandas DataFrame rather than that of the schema. Furthermore, a column can be present in the schema but missing from the DataFrame (and hence from the resulting pa.Table).

This behaviour requires a lot of fiddling with the pandas DataFrame up front if we want to write compatible partitioned files. Hence I argue that for pa.Table.from_pandas, and any other comparable function, the schema should be the principal source for the Table structure, not the columns and their ordering in the pandas DataFrame. If I specify a schema, I simply expect that the resulting Table actually has this schema.

Here is a small example. If you remove the reordering of df2, everything works fine:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

# dtype=object lets NumPy accept this ragged, mixed-type sequence
arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan], dtype=object)
strings = np.array([np.nan, np.nan, 'a', 'b'])

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name='DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

df1 = df[df.partition_column==0]
df2 = df[df.partition_column==1][['strings', 'partition_column', 'arrays']]


table1 = pa.Table.from_pandas(df1, schema=my_schema)
table2 = pa.Table.from_pandas(df2, schema=my_schema)

pq.write_table(table1, os.path.join(PATH_PYARROW_MANUAL, '1.pa'))
pq.write_table(table2, os.path.join(PATH_PYARROW_MANUAL, '2.pa'))

pd.read_parquet(PATH_PYARROW_MANUAL)

Issue Links

links to

GitHub Pull Request #2979

People

Assignee: Krisztian Szucs

Reporter: Christian Thiel

Votes: 0

Watchers: 4

Dates

Created: 12/Nov/18 08:26

Updated: 11/Jan/23 07:29

Resolved: 21/Nov/18 16:18

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 40m