Apache Arrow / ARROW-14767

[C++][Parquet] Preserve the bit width of the integer dictionary indices on roundtrip to Parquet?


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 5.0.0
    • Fix Version/s: None
    • Component/s: C++, Parquet, Python
    • Labels: None
    • Environment: NAME="CentOS Linux"
      VERSION="7 (Core)"

    Description

      When converting from a pandas dataframe to a table, categorical variables are by default given an index type int8 (presumably because there are fewer than 128 categories) in the schema. When this is written to a parquet file, the schema changes such that the index type is int32 instead. This causes an inconsistency between the schemas of tables derived from pandas and those read from disk.

      A minimal recreation of the issue is as follows:

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      from pyarrow import fs
      
      df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
      dtypes = {
          "A": np.dtype("int8"),
          "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
      }
      df = df.astype(dtypes)
      
      # Column "B" is converted to a dictionary type with int8 indices.
      tbl = pa.Table.from_pandas(df)
      where = "tmp.parquet"
      filesystem = fs.LocalFileSystem()
      
      # Close the output stream before the file is read back.
      with filesystem.open_output_stream(where, compression=None) as sink:
          pq.write_table(tbl, sink, version="2.0")
      
      schema = tbl.schema
      
      # After the roundtrip, the dictionary indices of "B" are int32.
      read_schema = pq.ParquetFile(
          filesystem.open_input_file(where),
      ).schema_arrow

      By printing schema and read_schema, you can see the inconsistency.

      I have workarounds in place for this, but am raising the issue anyway so that you can resolve it properly.


          People

            Assignee: Unassigned
            Reporter: GPMaven Gavin