Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-17349

[C++] Add casting support for map type

    XMLWordPrintableJSON

Details

    Description

      Different parquet implementations use different field names for internal fields of ListType and MapType, which can sometimes cause silly conflicts. For example, we use item as the field name for list, but Spark uses element. Fortunately, we can automatically cast between List and Map Types with different field names. Unfortunately, it only works at the top level. We should get it to work at arbitrary levels of nesting.

      This was discovered in delta-rs: https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285

      Here's a reproduction in Python:

      import pyarrow as pa
      import pyarrow.parquet as pq
      import pyarrow.dataset as ds
      
      def roundtrip_scanner(in_arr, out_type):
          table = pa.table({"arr": in_arr})
          pq.write_table(table, "test.parquet")
          schema = pa.schema({"arr": out_type})
          ds.dataset("test.parquet", schema=schema).to_table()
      
      # MapType
      ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
      ty = pa.map_(pa.int32(), pa.int32())
      arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
      roundtrip_scanner(arr_named, ty)
      
      # ListType
      ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
      ty = pa.list_(pa.int32())
      arr_named = pa.array([[1, 2, 4]], type=ty_named)
      roundtrip_scanner(arr_named, ty)
      
      # Combination MapType and ListType
      ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", pa.int32(), nullable=True)), nullable=False))
      ty = pa.map_(pa.string(), pa.list_(pa.int32()))
      arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
      roundtrip_scanner(arr_named, ty)
      # Traceback (most recent call last):
      #   File "<stdin>", line 1, in <module>
      #   File "<stdin>", line 5, in roundtrip_scanner
      #   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
      #   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
      #   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
      #   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
      # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, list<item: int32>> from map<string, list<x: int32> ('arr')>
      

      Attachments

        Activity

          People

            wjones127 Will Jones
            wjones127 Will Jones
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 3h
                3h