[ARROW-14488] [Python] Incorrect inferred schema from pandas dataframe with length 0. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 5.0.0
Fix Version/s: None
Component/s: Python
Labels: None
Environment: OS: Windows 10, CentOS 7
External issue URL: https://github.com/apache/arrow/issues/30046
Language: Python

Description

We use pandas (with the pyarrow engine) to write Parquet files, which are then consumed by other applications, such as Java apps using org.apache.parquet.hadoop.ParquetFileReader. We found that some empty dataframes end up with an incorrect schema for string columns when read by those applications. After some investigation, we narrowed the issue down to schema inference in pyarrow:

In [1]: import pandas as pd
In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
In [3]: import pyarrow as pa
In [4]: pa.Schema.from_pandas(df)
Out[4]:
a: string
b: int64
c: double
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
In [5]: pa.Schema.from_pandas(df.head(0))
Out[5]:
a: null
b: int64
c: double
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
In [6]: pa.__version__
Out[6]: '5.0.0'

As you can see, column 'a', which should be of string type, is inferred as null type for the empty dataframe, and it is then converted to int32 when written to Parquet files.

Is this expected behavior? If so, is there a workaround for this issue? Could anyone take a look, please? Thanks!

People

Assignee: Unassigned

Reporter: Yuan Zhou

Votes: 0

Watchers: 5

Dates

Created: 27/Oct/21 08:38

Updated: 11/Jan/23 11:40
