Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-11400

[Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0

    XMLWordPrintableJSON

Details

    Description

      From https://github.com/dask/dask/pull/7066#issuecomment-767156623

      Simplified reproducer:

      import pyarrow.parquet as pq
      import pyarrow.dataset as ds
      
      table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
      pq.write_to_dataset(table, "test_partitioned_parquet", partition_cols=["part"])
      
      # with partitioning_kwargs = {} there is no error
      partitioning_kwargs = {"max_partition_dictionary_size": -1}
      dataset = ds.dataset(
          "test_partitioned_parquet/", format="parquet", 
          partitioning=ds.HivePartitioning.discover( **partitioning_kwargs)
      )
      
      frag = list(dataset.get_fragments())[0]
      

      Querying this fragment works fine, but after serialization/deserialization with pickle, it gives errors (and with the original data example I actually got a segfault as well):

      In [16]: import pickle
      
      In [17]: frag2 = pickle.loads(pickle.dumps(frag))
      
      In [19]: frag2.partition_expression
      ...
      UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: invalid continuation byte
      
      In [20]: frag2.to_table(schema=schema, columns=columns)
      Out[20]: 
      pyarrow.Table
      col: int64
      part: dictionary<values=string, indices=int32, ordered=0>
      
      In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
      ...
      ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()
      
      ArrowException: Unknown error: Wrapping ɻ� failed
      

      It seems the issue was specifically with a partition expression with dictionary type.
      Also when using an integer columns as the partition column, you get wrong values (but silently in this case):

      In [42]: frag.partition_expression
      Out[42]: 
      <pyarrow.dataset.Expression (part == [
        1,
        2
      ][0]:dictionary<values=int32, indices=int32, ordered=0>)>
      
      In [43]: frag2.partition_expression
      Out[43]: 
      <pyarrow.dataset.Expression (part == [
        170145232,
        32754
      ][0]:dictionary<values=int32, indices=int32, ordered=0>)>
      

      Now, it seems this is fixed in master. But since I don't remember it was fixed intentionally (bkietz?), it would be good to add some tests for it.

      Attachments

        Issue Links

          Activity

            People

              jorisvandenbossche Joris Van den Bossche
              jorisvandenbossche Joris Van den Bossche
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 10m
                  1h 10m