  1. Apache Arrow
  2. ARROW-15725

[Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0, 7.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

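      If a dataset is partitioned and an Int64 column contains nulls, the column may not round-trip through the legacy datasets implementation: non-null values can come back changed. In the example below, 7753285016841556620 is read back as 7753285016841556992.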

      Simple reproduction:

       

      import tempfile

      import pyarrow as pa
      import pyarrow.dataset as ds
      import pyarrow.parquet as pq

      # 'x' is an int64 column with a null; 'y' is the partition column
      table = pa.table({
          'x': pa.array([None, 7753285016841556620]),
          'y': pa.array(['a', 'b'])
      })

      # Write a partitioned dataset (the legacy writer path, per the issue title)
      ds_dir = tempfile.mkdtemp()
      pq.write_to_dataset(table, ds_dir, partition_cols=['y'])

      # Read it back and compare; the assertion fails
      table_after = ds.dataset(ds_dir).to_table()
      print(table['x'])
      print(table_after['x'])
      assert table['x'] == table_after['x']

      Output:

      [
        [
          null,
          7753285016841556620
        ]
      ]
      [
        [
          null
        ],
        [
          7753285016841556992
        ]
      ]
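
      The value read back equals what a float64 round trip of the original integer produces. The sketch below uses only plain Python to show this. That the legacy writer converts the nullable integer column to float64 along the way (for example through pandas) is an assumption consistent with the output, not something this report confirms.

```python
# Illustrative sketch: a float64 round trip reproduces the corrupted value.
original = 7753285016841556620

# float64 has a 53-bit significand, so integers this large are rounded
# to the nearest representable value (multiples of 1024 in this range).
round_tripped = int(float(original))

print(round_tripped)             # 7753285016841556992
print(round_tripped - original)  # 372

# Integers up to 2**53 survive the same round trip unchanged.
small = 2**53
assert int(float(small)) == small
```

      If that assumption holds, only values with magnitude above 2**53 would be visibly altered, which would explain why the problem is easy to miss with small test values.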
      

            People

              Assignee: Unassigned
              Reporter: Will Jones (wjones127)