Apache Arrow / ARROW-14488

[Python] Incorrect inferred schema from pandas dataframe with length 0.

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 5.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None
    • Environment: OS: Windows 10, CentOS 7

    Description

      We use pandas (with the pyarrow engine) to write Parquet files, which are then consumed by other applications, such as Java apps using org.apache.parquet.hadoop.ParquetFileReader. We found that, for some empty dataframes, those applications see an incorrect schema for string columns. After some investigation, we narrowed the issue down to pyarrow's schema inference:

      In [1]: import pandas as pd
      In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
      In [3]: import pyarrow as pa
      In [4]: pa.Schema.from_pandas(df)
       Out[4]:
       a: string
       b: int64
       c: double
       -- schema metadata --
       pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
      In [5]: pa.Schema.from_pandas(df.head(0))
       Out[5]:
       a: null
       b: int64
       c: double
       -- schema metadata --
       pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
      In [6]: pa.__version__
       Out[6]: '5.0.0'
      

       As you can see, column 'a', which should be of string type, is inferred as null type when the dataframe is empty, and it is then written as int32 in the Parquet file.

      Is this expected behavior, or is there a workaround for this issue? Could anyone take a look, please? Thanks!

          People

            Assignee: Unassigned
            Reporter: Yuan Zhou (zijie0)
