Apache Arrow / ARROW-11763

[C++] Dict index type ALWAYS gets coerced to int32 when saving to parquet

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: C++
    • Labels: None

    Description

      When a pyarrow dictionary-typed column is written to Parquet and read back, any non-int32 index type is coerced to int32 without warning:

      import pyarrow as pa
      from pyarrow import parquet as pq
      
      schema = pa.schema({
          'foo': pa.dictionary(pa.int8(), pa.string(), ordered=False),
      })
      
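      def make_trivial_dict_array(dict_type, value, size):
          # Build a dictionary array of `size` entries that all point at `value`.
          indices = pa.nulls(size, dict_type.index_type).fill_null(0)
          return pa.DictionaryArray.from_arrays(indices, [value])
      
      table = pa.Table.from_pydict({
          'foo': make_trivial_dict_array(schema.field('foo').type, 'bar', 1),
      })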
      
      pq.write_table(table, 'test_dict_int8.parquet', version='2.0', data_page_version='2.0')
      
      print(f"dict index type before saving to parquet: {table.schema.field('foo').type.index_type}")
      
      del table
      
      table = pq.read_table('test_dict_int8.parquet')
      print(f"dict index type after saving to parquet: {table.schema.field('foo').type.index_type}")
      

      Output:

      dict index type before saving to parquet: int8
      dict index type after saving to parquet: int32
      

      While the coercion is surprising for smaller index types, silently coercing an int64 index to int32 seems like asking for trouble, since an int64 index can hold values that do not fit in int32.

          People

            Assignee: Unassigned
            Reporter: ARF
            Votes: 0
            Watchers: 4
